Component id	meetups-corpus-collection
Type	Software
Name	MEETUPS Corpus collection
Description	This is a tool to download the Wikipedia pages of people in the music scene in Europe
Work package	WP4
Pilot	MEETUPS
Project	polifonia-project
Resource	https://github.com/polifonia-project/meetups_corpus_collection/
Release date	20/07/2022
Release number	v1.0
Licence	Apache-2.0
Contributors	https://github.com/albamoralest
Related components	Informed by: MEETUPS Corpus (Corpus)

MEETUPS Corpus collection

Collecting Wikipedia pages of people in the music scene in Europe

MEETUPS Corpus collection is a Python tool (developed with the PyCharm IDE) that collects the Wikipedia pages of music authors in Europe and stores them as plain-text (.txt) files. Refer to the Meetups Pilot for use and implementation.

Uses the “wikipedia” library to download only the text of each Wikipedia page
Processes the list of files in chunks of 100
Can be stopped and restarted at any time, because it keeps track of the last downloaded item
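
The chunked, resumable behaviour listed above could look roughly like the following. This is a minimal sketch, not the code in init.py: the function names (`chunked`, `fetch_page_text`, `download_corpus`) are hypothetical, and resumption is approximated by skipping pages whose .txt file already exists rather than by tracking the last downloaded item. The only library call assumed is `wikipedia.page(pageid=...)` with its `content` attribute.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_page_text(page_id: str) -> str:
    """Return the plain text of a Wikipedia page, looked up by page ID."""
    import wikipedia  # imported lazily so the rest of the sketch runs without it
    return wikipedia.page(pageid=page_id).content


def download_corpus(
    page_ids: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], str] = fetch_page_text,
    chunk_size: int = 100,
) -> int:
    """Write <page_id>.txt for every ID not yet on disk; return the number written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for chunk in chunked(list(page_ids), chunk_size):
        for page_id in chunk:
            target = out / f"{page_id}.txt"
            if target.exists():
                # Assumption: an existing file means this page was already
                # downloaded in an earlier run, so it is skipped.
                continue
            target.write_text(fetch(page_id), encoding="utf-8")
            written += 1
    return written
```

Because files that already exist are skipped, running `download_corpus` again after an interruption only fetches the missing pages. Passing a different `fetch` function makes the loop testable without network access.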

Information on installation and setup

Pre-requirements:
- A CSV file listing the authors’ Wikipedia page IDs, stored in the sparqlQueryResults/ directory
- Python 3.9
Install the wikipedia library:
- pip install wikipedia
To execute:
- Download the project and execute the init.py file
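
As an illustration of the CSV prerequisite, the sketch below reads page IDs from such a file. The function name `load_page_ids` and the default column name `wikiPageID` are assumptions; the real header depends on the variable names used in the SPARQL query, so pass the column name that matches your file.

```python
import csv
from pathlib import Path
from typing import List


def load_page_ids(csv_path: str, column: str = "wikiPageID") -> List[str]:
    """Return the non-empty values of `column` from a CSV file with a header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {csv_path}")
        ids = []
        for row in reader:
            value = (row.get(column) or "").strip()
            if value:  # skip blank cells
                ids.append(value)
        return ids
```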

Details of dataset

The authors’ names and dbo:wikiPageID values are retrieved with SPARQL queries against the DBpedia SPARQL endpoint, https://dbpedia.org/sparql

Query filters:

Categories: <http://dbpedia.org/resource/Category:Music_people>
            <http://dbpedia.org/resource/Category:People
Location:
            sparqlQueryResults/query.sparql
Query results:
            sparqlQueryResults/Q<1>_sparql.csv
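
To illustrate the kind of request involved, the sketch below builds a query for one category and the matching request URL. It is not the query stored in sparqlQueryResults/query.sparql: the query shape (`dct:subject` plus `dbo:wikiPageID`), the variable names, the paging with LIMIT and OFFSET, and the `format=text/csv` parameter are assumptions made for this example.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"


def build_query(category_uri: str, limit: int = 10000, offset: int = 0) -> str:
    """Build a SELECT query for resources in one category that have a wikiPageID."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT DISTINCT ?person ?wikiPageID WHERE {\n"
        f"  ?person dct:subject <{category_uri}> ;\n"
        "          dbo:wikiPageID ?wikiPageID .\n"
        "}\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for CSV results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "text/csv"})


if __name__ == "__main__":
    query = build_query("http://dbpedia.org/resource/Category:Music_people", limit=5)
    print(build_request_url(query))
```

The resulting URL can be opened with `urllib.request.urlopen` and the response saved as a CSV file in sparqlQueryResults/.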

Dataset:

Location:
            dataset/
Format:
            Text files .txt
Name convention:
            <Author_wikiPageID>.txt
Total biographies collected:
            33,309 authors’ Wikipedia pages
Summary of the biographies collected:
            sparqlQueryResults/TOTAL_download_biography.csv
Meetups pilot sample: 1.002

Select random biographies -> sampleBiographies.py
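
Random selection from the dataset directory can be sketched as follows. This is an illustrative example, not the contents of sampleBiographies.py: the function name `sample_biographies`, its parameters, and the optional seed are assumptions.

```python
import random
from pathlib import Path
from typing import List, Optional


def sample_biographies(dataset_dir: str, k: int, seed: Optional[int] = None) -> List[Path]:
    """Pick up to `k` distinct .txt files from `dataset_dir` at random."""
    # Sorting first makes the result reproducible for a given seed.
    files = sorted(Path(dataset_dir).glob("*.txt"))
    rng = random.Random(seed)
    return rng.sample(files, min(k, len(files)))
```

Fixing `seed` returns the same sample on every run, which helps when the sample has to be shared or regenerated.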

Acknowledgements

This work was supported by the EU’s Horizon Europe research and innovation programme within the Polifonia project (grant agreement N. 101004746).