MEETUPS Corpus collection
Collecting Wikipedia pages of people in the music scene in Europe
MEETUPS Corpus collection is a Python tool, developed with the PyCharm IDE, that collects the Wikipedia pages (as .txt files) of music authors in Europe. See the Meetups Pilot for details on how it is used and implemented.
- Uses the “wikipedia” library to download only the text of each Wikipedia page
- Processes the list of files in chunks of 100
- Can be stopped and restarted at any time, because it keeps track of the last downloaded item
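The chunked, resumable download described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function names, the checkpoint file, and the `fetch` parameter are hypothetical. Only `wikipedia.page(pageid=...)` and its `content` attribute come from the “wikipedia” library.

```python
from pathlib import Path
from typing import Callable, Iterator, List

CHUNK_SIZE = 100  # the list is processed in chunks of 100 items


def fetch_page_text(page_id: str) -> str:
    """Download the plain text of one Wikipedia page by its page ID."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it

    return wikipedia.page(pageid=page_id).content


def chunks(items: List[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_all(
    page_ids: List[str],
    out_dir: Path,
    checkpoint: Path,  # hypothetical file that stores the last downloaded ID
    fetch: Callable[[str], str] = fetch_page_text,
) -> int:
    """Download every page not yet fetched; return how many were written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    checkpoint.parent.mkdir(parents=True, exist_ok=True)

    # Resume right after the last ID recorded by a previous run, if any.
    start = 0
    if checkpoint.exists():
        last = checkpoint.read_text(encoding="utf-8").strip()
        if last in page_ids:
            start = page_ids.index(last) + 1

    downloaded = 0
    for chunk in chunks(page_ids[start:]):
        for page_id in chunk:
            text = fetch(page_id)
            (out_dir / f"{page_id}.txt").write_text(text, encoding="utf-8")
            # Record progress after each file so a stop loses no completed work.
            checkpoint.write_text(page_id, encoding="utf-8")
            downloaded += 1
    return downloaded
```

Passing `fetch` as a parameter keeps the resume logic testable without network access; a real run would rely on the default, which calls the “wikipedia” library.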
Information on installation and setup
- Prerequisites:
- A CSV file listing the authors’ Wikipedia page IDs, stored in the sparqlQueryResults/ directory
- Python 3.9
- Install the wikipedia library:
- pip install wikipedia
- To execute:
- Download the project and execute the init.py file
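As a rough illustration of how the prerequisite CSV file might be read, here is a minimal sketch. The column name `wikiPageID` and the function name are assumptions; the actual header depends on the variable names used in the SPARQL query.

```python
import csv
from pathlib import Path
from typing import List, Union


def read_page_ids(csv_path: Union[str, Path], column: str = "wikiPageID") -> List[str]:
    """Return the unique, non-empty page IDs found in `column`, in file order.

    `column` is a hypothetical header name; adjust it to match the CSV.
    """
    ids: List[str] = []
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            page_id = (row.get(column) or "").strip()
            # Skip blank cells and IDs that were already listed.
            if page_id and page_id not in seen:
                seen.add(page_id)
                ids.append(page_id)
    return ids
```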
Details of dataset
SPARQL queries retrieve the authors’ names and dbo:wikiPageID values from the DBpedia SPARQL endpoint, https://dbpedia.org/sparql
Query filters:
Categories: <http://dbpedia.org/resource/Category:Music_people>
<http://dbpedia.org/resource/Category:People
Location:
sparqlQueryResults/query.sparql
Query results:
sparqlQueryResults/Q<1>_sparql.csv
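For orientation, a query of this kind could be built and addressed to the endpoint as sketched below. The query text is an assumption for illustration only; the project's actual query is the one stored in sparqlQueryResults/query.sparql. The function names, the `dct:subject` pattern, and the `limit`/`offset` parameters are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_uri: str, limit: int = 1000, offset: int = 0) -> str:
    """Build an illustrative query for people in a category and their page IDs.

    Assumption: people are linked to the category directly through
    dct:subject; the real query may filter differently.
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?person ?wikiPageID WHERE {\n"
        f"  ?person dct:subject <{category_uri}> ;\n"
        "          dbo:wikiPageID ?wikiPageID .\n"
        "}\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


def build_request_url(query: str, result_format: str = "text/csv") -> str:
    """Return a GET URL for the endpoint that asks for CSV results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": result_format})


query = build_query("http://dbpedia.org/resource/Category:Music_people", limit=10)
url = build_request_url(query)
```

Opening `url` with any HTTP client (for example `urllib.request.urlopen`) would return CSV rows comparable to those stored in the query results files, assuming the endpoint is reachable.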
Dataset:
Location:
dataset/
Format:
Text files .txt
Name convention:
<Author_wikiPageID>.txt
Total biographies collected:
33,309 authors’ Wikipedia pages
Summary of total biographies collected:
sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1,002
Select random biographies -> sampleBiographies.py
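Random selection of biographies from the dataset directory could look roughly like the sketch below. It is not the content of sampleBiographies.py; the function name and the `seed` parameter are assumptions added for illustration.

```python
import random
from pathlib import Path
from typing import List, Optional, Union


def sample_biographies(
    dataset_dir: Union[str, Path],
    sample_size: int,
    seed: Optional[int] = None,  # hypothetical: fix it to make the sample repeatable
) -> List[Path]:
    """Pick `sample_size` distinct .txt biographies at random."""
    # Sort first so that a given seed always sees the files in the same order.
    files = sorted(Path(dataset_dir).glob("*.txt"))
    if sample_size > len(files):
        raise ValueError(
            f"requested {sample_size} files but only {len(files)} are available"
        )
    rng = random.Random(seed)
    return rng.sample(files, sample_size)
```

Using a dedicated `random.Random` instance leaves the global random state untouched, and a fixed seed allows the same sample to be drawn again later.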
Acknowledgements
This work was supported by the EU’s Horizon 2020 research and innovation programme within the Polifonia project (grant agreement N. 101004746).