
MEETUPS Corpus collection

Collecting Wikipedia pages of people in the music scene in Europe

MEETUPS Corpus collection is a Python tool (developed with the PyCharm IDE) that collects the Wikipedia pages of music authors in Europe and saves them as plain-text (.txt) files. Refer to the Meetups Pilot for use and implementation.

  • Uses the “wikipedia” library to download only the text of each Wikipedia page
  • Processes the list of files in chunks of 100
  • The process can be stopped and restarted at any time, because it keeps track of the last downloaded item
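
The chunked, resumable download described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the `wikiPageID` column name, and the use of existing output files as the resume marker (instead of tracking the last downloaded item) are all assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Callable, Iterator, List, Sequence

CHUNK_SIZE = 100  # the tool processes the list in chunks of 100


def read_page_ids(csv_path: Path, column: str = "wikiPageID") -> List[str]:
    """Read page IDs from a CSV file. The column name is hypothetical."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return [row[column] for row in csv.DictReader(handle) if row.get(column)]


def chunks(items: Sequence[str], size: int = CHUNK_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_page_text(page_id: str) -> str:
    """Download the plain text of one page with the `wikipedia` library."""
    import wikipedia  # third-party: pip install wikipedia

    return wikipedia.page(pageid=page_id).content


def download_all(
    page_ids: Sequence[str],
    out_dir: Path,
    fetch: Callable[[str], str] = fetch_page_text,
) -> int:
    """Save each page as <page_id>.txt and return the number of new files.

    Assumption: a page whose output file already exists is skipped, so the
    run can be interrupted and restarted without downloading it again.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = 0
    for chunk in chunks(page_ids):
        for page_id in chunk:
            target = out_dir / f"{page_id}.txt"
            if target.exists():
                continue  # already downloaded in an earlier run
            try:
                text = fetch(page_id)
            except Exception as error:  # e.g. missing page or network failure
                print(f"Skipping {page_id}: {error}")
                continue
            target.write_text(text, encoding="utf-8")
            downloaded += 1
    return downloaded
```

Under these assumptions, calling `download_all(read_page_ids(csv_path), Path("dataset"))` a second time only fetches the pages that are still missing. Passing `fetch` as a parameter keeps the network call replaceable, which makes the loop easy to try out offline.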

Information on installation and setup

  • Prerequisites:
    • A CSV file listing the authors’ Wikipedia page IDs, stored in the sparqlQueryResults/ directory
    • Python 3.9
  • Install the wikipedia library:
    • pip install wikipedia
  • To execute:
    • Download the project and run the init.py file

Details of dataset

The authors’ names and dbo:wikiPageID values were retrieved with SPARQL queries against the DBpedia SPARQL endpoint, https://dbpedia.org/sparql

Query filters:

Categories: <http://dbpedia.org/resource/Category:Music_people>
            <http://dbpedia.org/resource/Category:People
Location:
            sparqlQueryResults/query.sparql
Query results:
            sparqlQueryResults/Q<1>_sparql.csv
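
For orientation, a query of this kind could be assembled as in the sketch below. It does not reproduce the project's query.sparql: the query shape (`dct:subject` combined with `skos:broader*`), the `LIMIT`, the function names, and the `format=text/csv` request parameter are assumptions for illustration, and the filter for Europe is left out entirely.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
MUSIC_PEOPLE = "http://dbpedia.org/resource/Category:Music_people"


def build_query(category_uri: str, limit: int = 100) -> str:
    """Build a SPARQL query for resources filed under a category.

    Hypothetical query shape: it selects each resource together with its
    dbo:wikiPageID and follows skos:broader to include sub-categories.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?author ?wikiPageID WHERE {{
  ?author dct:subject ?category ;
          dbo:wikiPageID ?wikiPageID .
  ?category skos:broader* <{category_uri}> .
}}
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Return a GET URL for the endpoint that asks for CSV results."""
    params = urlencode({"query": query, "format": "text/csv"})
    return f"{DBPEDIA_ENDPOINT}?{params}"


if __name__ == "__main__":
    print(build_request_url(build_query(MUSIC_PEOPLE, limit=10)))
```

The sketch only builds the request and sends nothing. Transitive category traversal can be slow on a public endpoint, so a small `LIMIT` is a sensible starting point when experimenting.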

Dataset:

Location:
            dataset/
Format:
            Text files .txt
Name convention:
            <Author_wikiPageID>.txt
Total biographies collected:
            33,309 authors’ Wikipedia pages
Summary of total biographies collected:
            sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1,002

To select random biographies, run sampleBiographies.py
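
Drawing a random sample from the dataset directory can be sketched as below. This is not the content of sampleBiographies.py: the function name, the optional seed, and the error handling are assumptions chosen for the example.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_biographies(
    dataset_dir: Path, sample_size: int, seed: Optional[int] = None
) -> List[Path]:
    """Return `sample_size` distinct .txt files chosen at random.

    Files are sorted first so that a fixed seed gives a reproducible sample
    regardless of the order in which the file system lists them.
    """
    files = sorted(dataset_dir.glob("*.txt"))
    if sample_size > len(files):
        raise ValueError(
            f"Requested {sample_size} files but only {len(files)} are available"
        )
    return random.Random(seed).sample(files, sample_size)


if __name__ == "__main__":
    dataset = Path("dataset")
    if dataset.is_dir():
        for path in sample_biographies(dataset, sample_size=5, seed=42):
            print(path.name)
```

Using a dedicated `random.Random(seed)` instance keeps the sample repeatable without changing the global random state.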

Acknowledgements

This work was supported by the EU’s Horizon 2020 research and innovation programme within the Polifonia project (grant agreement N. 101004746).