Link Search Menu Expand Document
View this file on Github Download as Schema.org

MEETUPS Corpus preparation: data cleaning

MEETUPS Corpus preparation: Cleaning data collected from Wikipedia web pages of people in the music scene in Europe

MEETUPS data cleaning is a tool developed using Python and Jupyter Notebook. This software prepares the biographies (collected as text files) in https://github.com/polifonia-project/meetups_corpus_collection for the next step in the extraction of historical meetups process.

  • Use the Wikipedia authors’ webpages collected in https://github.com/polifonia-project/meetups_corpus_collection
  • Clean text blank lines, sections with no historical meetups data
  • Organise the text in sentences as the main unit to extract meetups information

Information on installation and setup

  • Run Jupyter Notebook 01_CleaningText.ipynb

Details of the data

Code location:

|_ 01_CleaningText.ipynb

Raw corpus location
Data output:
|_ text_dataset/            

Clean text location
Data input:
|_ cleanText/

Index data location
Data output:
|_ indexedParagraphs/
|_ indexedSentences/

|_ README_data_cleaning.md

DOI:

DOI