@book{albamoralest_daga_2023, title={polifonia-project/meetups_pilot: v0.2}, url={https://zenodo.org/record/7875353}, DOI={10.5281/ZENODO.7875353}, abstractNote={To be published in the next ecosystem release}, publisher={Zenodo}, author={Albamoralest and Daga, Enrico}, year={2023}, month={Apr} }
MEETUPS Corpus preparation: Cleaning data collected from Wikipedia web pages of people in the music scene in Europe
MEETUPS data cleaning is a tool developed with Python and Jupyter Notebook. It prepares the biographies (collected as text files) in https://github.com/polifonia-project/meetups_corpus_collection for the next step of the historical meetups extraction process.
Use the Wikipedia authors’ webpages collected in https://github.com/polifonia-project/meetups_corpus_collection
Clean the text: remove blank lines and sections with no historical meetups data
Organise the text into sentences, the main unit for extracting meetups information
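The steps above can be sketched in Python as follows. This is a minimal illustration, not the code in 01_CleaningText.ipynb: the function names, the list of skipped section titles, the `== Title ==` heading style and the regex-based sentence splitter are all assumptions made for the example.

```python
import re

# Hypothetical list of section titles assumed to hold no meetups data.
SKIPPED_SECTIONS = {"references", "external links", "see also", "notes"}


def clean_biography(raw_text, skipped=SKIPPED_SECTIONS):
    """Return the non-blank paragraphs of a biography, minus skipped sections.

    Assumes section headings look like '== Title ==' (an assumption about
    the raw text files, not a documented property of the corpus).
    """
    paragraphs = []
    skipping = False
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        heading = re.fullmatch(r"=+\s*(.+?)\s*=+", stripped)
        if heading:
            # A heading switches skipping on or off; headings are not kept.
            skipping = heading.group(1).lower() in skipped
            continue
        if not skipping:
            paragraphs.append(stripped)
    return paragraphs


def split_sentences(paragraphs):
    """Naively split paragraphs into sentences at ., ! or ? before a capital."""
    sentences = []
    for paragraph in paragraphs:
        parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
        sentences.extend(part for part in parts if part)
    return sentences


raw = (
    "Mozart met Haydn in Vienna. They played quartets together.\n"
    "\n"
    "== References ==\n"
    "Some citation.\n"
    "== Later life ==\n"
    "He moved to Prague.\n"
)
paragraphs = clean_biography(raw)
sentences = split_sentences(paragraphs)
for number, sentence in enumerate(sentences):
    print(number, sentence)
```

The regex splitter mishandles abbreviations such as "St. Petersburg"; a dedicated sentence tokenizer would be a more robust choice for real biographies.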
Information on installation and setup
Run the Jupyter Notebook 01_CleaningText.ipynb
Details of the data
Code location:
|_ 01_CleaningText.ipynb
Raw corpus location
Data output:
|_ text_dataset/
Clean text location
Data input:
|_ cleanText/
Index data location
Data output:
|_ indexedParagraphs/
|_ indexedSentences/
|_ README_data_cleaning.md