Component id	Polifonia-Corpus
Type	Corpus
Name	Polifonia Corpus
Description	Data, metadata, statistics, annotations and interrogation APIs of the Polifonia Textual Corpus.
Work package	WP4
Release date	28/06/2022
Release number	v0.1.3
Licence	CC-BY_v4
Contributors	https://github.com/roccotrip https://github.com/arianna-graciotti https://github.com/EleonoraMarzi
Related components	Persona: Valeriana (Persona) Carolina (Persona) Reuses: Polifonia Lexicon - The Polifonia Multilingual WordNet (Lexicon)
Bibliography	Deliverable document: D4.2 Interrogation and annotation of plurilingual corpora for discourse analysis


Polifonia Textual Corpus

This repository contains the scripts to access, parse, annotate and interrogate the data and metadata of the Polifonia Textual Corpus.

The high-level structure of the repository is the following:

Polifonia-Corpus
│   README.md
│   wikipedia_corpus_parser.py
│   wikipedia_corpus_reader.py
│
├───annotations
│   │   README.md
│   │
│   └───db
│       │   Wikipedia_EN.db
│       │   Periodicals_EN.db
│       │   Books_EN.db
│       │   ........
│       │   "Module"_"Lang".db
│
├───interrogation
│   │   README.md
│   │   interrogate.py
│   │
│   └───data
│       │   lex_ent_map.pkl
│       │   pages.pkl
│
└───utils
    │   db_utils.py

The root folder contains the scripts to access and parse the Polifonia Corpus data and metadata that are linked in this README.md file.

The annotations folder contains a README.md file that explains how the corpus was annotated. Its “db” subfolder stores the databases with the corpus annotations that are used to interrogate the corpus. The databases are downloaded automatically the first time each module is queried. The download links are listed in the “urls.csv” file.
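The download-on-first-query behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's own code: the function names `load_urls` and `ensure_db`, and the `module`, `lang` and `url` column names assumed for “urls.csv”, are hypothetical. Only the `"Module"_"Lang".db` naming pattern is taken from the structure shown above.

```python
import csv
from pathlib import Path
from urllib.request import urlretrieve


def load_urls(csv_path):
    """Read {(module, lang): url} from a CSV file.

    Assumption: the file has 'module', 'lang' and 'url' columns.
    The real layout of urls.csv may differ.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {
            (row["module"], row["lang"]): row["url"]
            for row in csv.DictReader(handle)
        }


def ensure_db(module, lang, urls, db_dir, fetch=urlretrieve):
    """Return the local path of a module database, downloading it if missing.

    The file name follows the "Module"_"Lang".db pattern, for example
    Wikipedia_EN.db. `fetch` is injectable so the logic can be tried offline.
    """
    db_dir = Path(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    target = db_dir / f"{module}_{lang}.db"
    if not target.exists():
        # First query for this module and language: fetch the database once.
        fetch(urls[(module, lang)], str(target))
    return target
```

Under these assumptions, calling `ensure_db("Wikipedia", "EN", load_urls("urls.csv"), "annotations/db")` twice would trigger a download only on the first call and return the same path both times.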

The interrogation folder contains a README.md file that explains how to interrogate the corpus. It contains a “data” subfolder with the files used to link mentions, named entities and Wikipedia page titles.
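Judging by their .pkl extension, the two files in the “data” subfolder appear to be Python pickles. Their internal structure is not documented here, so the sketch below only shows one way to load such a file and report its type and size before relying on it. The helper names `load_pickle` and `describe` are hypothetical. Unpickling can execute arbitrary code, so this should only be done with files from a trusted source.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled object from `path` (only use trusted files)."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return the object's type name and, if it has one, its length."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For instance, `describe(load_pickle("interrogation/data/pages.pkl"))` would show whether the file holds a dictionary, a list or something else, and how many entries it has.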

The corpus

The corpus is divided into four modules:

the Wikipedia module
the Books module
the Periodicals module
the Polifonia Pilots module

Each module (except the Pilots module) contains documents in six languages: Dutch (NL), English (EN), French (FR), German (DE), Italian (IT) and Spanish (ES).

The Wikipedia module

It was created by selecting, from the BabelNet domains, all the music-related Wikipedia pages.

Metadata

The metadata of the module can be downloaded from:

lang	url
DE
EN
ES
FR
IT
NL

Data

The data of the module can be downloaded from:

lang	url
DE
EN
ES
FR
IT
NL

Statistics

Some statistics of the module are provided below:

lang	#documents	#sentences	#tokens	#types	#links	#entities
DE	53.986	1.459.265	44.523.547	9.732.779	12.561.177	2.197.438
EN	250.413	7.362.272	198.257.649	1.191.901	54.059.979	25.786.043
ES	57.891	1.247.583	36.229.557	537.465	7.171.759	2.996.185
FR	65.970	2.901.295	82.979.944	653.489	19.208.818	6.212.997
IT	77.986	1.548.981	47.497.487	491.500	14.519.636	2.649.949
NL	36.609	1.246.881	23.539.528	479.962	4.716.170	2.453.332
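The counts in these tables use a dot as the thousands separator, so they need a small conversion before any arithmetic. The sketch below copies the #sentences and #tokens columns from the Wikipedia table above and derives the average number of tokens per sentence for each language. The names `parse_count`, `WIKIPEDIA_STATS` and `tokens_per_sentence` are illustrative only and are not part of the repository.

```python
def parse_count(text):
    """Parse a count written with '.' as thousands separator, e.g. '1.459.265'."""
    return int(text.replace(".", ""))


# (#sentences, #tokens) per language, copied from the Wikipedia statistics table.
WIKIPEDIA_STATS = {
    "DE": ("1.459.265", "44.523.547"),
    "EN": ("7.362.272", "198.257.649"),
    "ES": ("1.247.583", "36.229.557"),
    "FR": ("2.901.295", "82.979.944"),
    "IT": ("1.548.981", "47.497.487"),
    "NL": ("1.246.881", "23.539.528"),
}


def tokens_per_sentence(stats):
    """Return {lang: average tokens per sentence} for a stats mapping."""
    return {
        lang: parse_count(tokens) / parse_count(sentences)
        for lang, (sentences, tokens) in stats.items()
    }


if __name__ == "__main__":
    for lang, ratio in tokens_per_sentence(WIKIPEDIA_STATS).items():
        print(f"{lang}\t{ratio:.2f}")
```

The same `parse_count` helper applies to the statistics tables of the other modules, since they follow the same number format.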

The Books module

It was created using the Polifonia Textual Corpus Population module (PTCPM), which gives access to different digital libraries (such as BNF and BNE) and makes it possible to select music-related documents from them. The PTCPM can also perform optical character recognition (OCR) on images and PDF files.

Metadata

The metadata of the module can be downloaded from:

lang	url
DE
EN
ES
FR
IT
NL

Data

The data of the module cannot be downloaded due to copyright restrictions. However, the corpus can be reconstructed using the metadata provided in the previous section. Furthermore, the processed and annotated data can be accessed by interrogating the corpus (how to interrogate the corpus is explained in the README.md file inside the interrogation folder of this repository).

Statistics

Some statistics of the module are provided below:

lang	#documents	#sentences	#types	#tokens
DE	237	38.633	121.530	489.225
EN	360	49.595	185.280	940.232
ES	41.093	731.606	1.852.430	20.180.197
FR	265	633.173	1.305.283	14.354.611
IT	12.200	202.730	405.099	2.571.090
NL	83	116.593	539.102	1.779.824

The Periodicals module

It was created with the help of musicologists, who provided the titles of several influential music periodicals.

Metadata

The metadata of the module can be downloaded from:

lang	url
DE
EN
ES
FR
IT
NL

Data

Statistics

Some statistics of the module are provided below:

lang	#documents	#sentences	#types	#tokens
DE	705	121.113	544.376	2.405.289
EN	2.868	4.400.015	7.342.527	76.180.398
ES	455	87.025	677.041	3.170.480
FR	349	329.166	696.427	5.111.915
IT	1.251	433.465	992.902	7.879.459
NL	125	187.350	716.506	3.880.499

The Polifonia Pilots module

It was created by collecting the textual material selected by five Polifonia Pilots:

BELLS
CHILD
MEETUPS
MUSICBO
ORGANS

Metadata

The metadata of the module can be downloaded from:

Pilot	url
BELLS
CHILD
MEETUPS
MUSICBO
ORGANS

Data

The data of the Pilots module of the Polifonia Textual Corpus collected for the Bells, MusicBo and Organs pilots cannot be published in full because they are subject to heterogeneous licence restrictions. The published metadata for each of these pilots (see the table above) allow the whole corpora to be reproduced. The texts collected for the Child and Meetups pilots are royalty-free, so the links to retrieve them from their GitHub repositories are reported here:

Pilot	url
CHILD	https://github.com/polifonia-project/documentary-evidence-benchmark/tree/main/corpus
MEETUPS	https://github.com/polifonia-project/meetups_pilot/tree/main/cleanText

For the pilots whose texts cannot be published, the corpus can be reconstructed using the metadata provided in the previous section. Furthermore, the processed and annotated data can be accessed by interrogating the corpus (how to interrogate the corpus is explained in the README.md file inside the interrogation folder of this repository).

Statistics

Some statistics of the module are provided below:

pilot	#documents	#sentences	#types	#tokens
BELLS	59	18.481	128.061	434.439
CHILD	30	157.815	361.550	3.442.840
MEETUPS	19.476	822.861	1.631.371	21.536.135
MUSICBO	46	51.781	289.247	1.412.456
ORGANS	1.660	25.647	45.298	368.439