
MEETUPS Corpus collection


Collecting Wikipedia pages of people in the music scene in Europe

MEETUPS Corpus collection is a Python tool, developed with the PyCharm IDE. It collects the Wikipedia pages (in txt format) of music authors in Europe. Refer to the Meetups Pilot for use and implementation.

  • Uses the “wikipedia” library to download only the text of each Wikipedia page
  • Processes the list of files in chunks of 100 items
  • The process can be stopped and restarted at any time because it keeps track of the last downloaded item
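The chunked, restartable behaviour listed above could be sketched roughly as follows. This is an illustration, not the project's code: the function names (chunked, fetch_page_text, download_all) are hypothetical, and resuming by skipping files that already exist is an assumption about how the last downloaded item might be tracked. The only library call used is wikipedia.page(pageid=...).content.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_page_text(page_id: str) -> str:
    """Download the plain text of one page with the `wikipedia` library."""
    import wikipedia  # imported lazily so the other helpers work without it
    return wikipedia.page(pageid=page_id).content


def download_all(
    page_ids: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], str] = fetch_page_text,
    chunk_size: int = 100,
) -> int:
    """Save each page as <page_id>.txt and return how many files were written.

    Assumption: a page whose file already exists was downloaded in an
    earlier run, so it is skipped. This is what makes a restart cheap.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for number, chunk in enumerate(chunked(list(page_ids), chunk_size), start=1):
        for page_id in chunk:
            target = out / f"{page_id}.txt"
            if target.exists():
                continue  # already on disk from a previous run
            try:
                text = fetch(page_id)
            except Exception as err:  # e.g. missing page or network error
                print(f"skipping {page_id}: {err}")
                continue
            target.write_text(text, encoding="utf-8")
            written += 1
        print(f"chunk {number} done ({written} new files so far)")
    return written
```

Passing a different fetch function makes it possible to try the loop without network access, for example with a stub that returns fixed text.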

Information on installation and setup

  • Prerequisites:
    • A CSV file with the list of authors’ Wikipedia page IDs, stored in the sparqlQueryResults/ directory
    • Python 3.9
  • Install the wikipedia library:
    • pip install wikipedia
  • To execute:
    • Download the project and execute the init.py file
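As an illustration of the CSV prerequisite, the page IDs could be read with the standard csv module along these lines. The helper name load_page_ids and the column name wikiPageID are assumptions made for this sketch; check the header of the actual query results file before relying on them.

```python
import csv
from typing import List


def load_page_ids(csv_path: str, column: str = "wikiPageID") -> List[str]:
    """Return the unique, non-empty values of `column`, in file order.

    Assumption: the CSV has a header row and the page IDs live in a
    column called `wikiPageID` (hypothetical name).
    """
    ids: List[str] = []
    seen = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value and value not in seen:
                seen.add(value)
                ids.append(value)
    return ids
```

Duplicates are dropped because the same author may appear in more than one result row.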

Details of dataset

SPARQL queries retrieve authors’ names and dbo:wikiPageID values from the DBpedia SPARQL endpoint: https://dbpedia.org/sparql

Query filters:

Categories: <http://dbpedia.org/resource/Category:Music_people>
            <http://dbpedia.org/resource/Category:People
Location:
            sparqlQueryResults/query.sparql
Query results:
            sparqlQueryResults/Q<1>_sparql.csv
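
For orientation, a query of this general shape could be built and sent from Python as sketched below. This is not the content of query.sparql: the use of dct:subject and rdfs:label, the English-label filter, the LIMIT, and the helper names are all assumptions for illustration, and the format=text/csv parameter is a convention of the Virtuoso server behind the DBpedia endpoint rather than part of SPARQL itself.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_uri: str, limit: int = 100) -> str:
    """Build a SELECT query for resources filed directly under a category.

    Illustrative only: the project's real filters are in query.sparql.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?author ?name ?pageId WHERE {{
  ?author dct:subject <{category_uri}> ;
          rdfs:label ?name ;
          dbo:wikiPageID ?pageId .
  FILTER (lang(?name) = "en")
}}
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET request that asks for CSV output."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "text/csv"})


def run_query(query: str) -> str:
    """Send the query and return the raw CSV text (needs network access)."""
    with urlopen(build_request_url(query), timeout=60) as response:
        return response.read().decode("utf-8")
```

A call such as run_query(build_query("http://dbpedia.org/resource/Category:Music_people")) would return CSV text that could then be saved under sparqlQueryResults/.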

Dataset:

Location:
            dataset/
Format:
            Text files .txt
Name convention:
            <Author_wikiPageID>.txt
Total biographies collected: 
            33,309 authors’ Wikipedia pages
Summary of total biographies collected:
            sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1,002 biographies

To select random biographies, use sampleBiographies.py
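
The idea of drawing a random sample from the dataset can be sketched as below. This is not the code of sampleBiographies.py: the function name sample_biographies, its parameters, and the optional seed are assumptions for illustration.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_biographies(dataset_dir: str, k: int, seed: Optional[int] = None) -> List[Path]:
    """Pick `k` distinct .txt files from `dataset_dir` at random.

    Passing a `seed` makes the selection reproducible.
    """
    files = sorted(Path(dataset_dir).glob("*.txt"))  # sorted for a stable order
    if k > len(files):
        raise ValueError(f"asked for {k} files but only {len(files)} are available")
    return random.Random(seed).sample(files, k)
```

For example, sample_biographies("dataset/", 10, seed=42) would return the same ten paths on every run, as long as the directory contents do not change.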

Acknowledgements

This work was supported by the EU’s Horizon Europe research and innovation programme within the Polifonia project (grant agreement N. 101004746).