Polifonia Textual Corpus
This repository contains the script to access, parse, annotate and interrogate the data and metadata of the Polifonia Textual Corpus.
The high level structure of the repository is the following:
Polifonia-Corpus
│ README.md
│ wikipedia_corpus_parser.py
| wikipedia_corpus_reader.py
│
└───annotations
│ │ README.md
│ │
│ └───db
│ │ Wikipedia_EN.db
│ │ Periodicals_EN.db
│ │ Books_EN.db
| | ........
| | "Module"_"Lang".db
│
└───interrogation
| │ README.md
| │ interrogate.py
| |
| |___data
| | lex_ent_map.pkl
| | pages.pkl
|
|___utils
| db_utils.py
The root folder contains the script to access and parse the Polifonia Corpus data and metadata that are linked in this README.md file.
The annotations folder contains a README.md file in which it is explained how the corpus was annotated. A “db” subfolder of the “annotations” folder is set up to store the databases with the annotations of the corpus that will be used for the interrogations of the corpus. The databases will be downloaded automatically the first time each module will be queried. The links for the download are listed in the “urls.csv” file.
The interrogation folder contains a README.md file that explain how to interrogate the corpus. It contains a “data” subfolder used to link mentions, named entities and Wikipedia page titles.
The corpus
The corpus is dived into four modules:
- the Wikipedia module
- the Books module
- the Periodicals module
- the Polifonia Pilots module
Each module (except the Pilot module) contains documents in six languages: Dutch (NL), English (EN), French (FR), German (DE),Italian (IT) and Spanish (ES).
The Wikipedia module
It was created selecting from BabelNet domains all the Wikipedia musical pages.
Metadata
The metadata of the module can be downloaded from:
Data
The data of the module can be downloaded from:
Statistics
Some statistics of the module are provided below:
lang | #documents | #sentences | #tokens | #types | #links | entities |
---|---|---|---|---|---|---|
DE | 53.986 | 1.459.265 | 44.523.547 | 9.732.779 | 12.561.177 | 2.197.438 |
EN | 250.413 | 7.362.272 | 198.257.649 | 1.191.901 | 54.059.979 | 25.786.043 |
ES | 57.891 | 1.247.583 | 36.229.557 | 537.465 | 7.171.759 | 2.996.185 |
FR | 65.970 | 2.901.295 | 82.979.944 | 653.489 | 19.208.818 | 6.212.997 |
IT | 77.986 | 1.548.981 | 47.497.487 | 491.500 | 14.519.636 | 2.649.949 |
NL | 36.609 | 1.246.881 | 23.539.528 | 479.962 | 4.716.170 | 2.453.332 |
The Books module
It was created using the Polifonia Textual Corpus Population module that allows to access different digital libraries (such as BNF and BNE) and to select from them documents related to music. The PTCPM allows also to perform optical character recognition (OCR) on images and PDF files.
Metadata
The metadata of the module can be downloaded from:
Data
The data of the module cannot be downloaded due to copyright issue. However, it is possible to reconstruct the corpus using the metadata provided in the previous section. Furthermore, the data processed and annotated can be accessed interrogating the corpus (how to interrogate the corpus is explained in a README.md file inside the interrogation folder of this repository).
Statistics
Some statistics of the module are provided below:
lang | #documents | #sentences | #types | #tokens |
---|---|---|---|---|
DE | 237 | 38.633 | 121.530 | 489.225 |
EN | 360 | 49.595 | 185.280 | 940.232 |
ES | 41.093 | 731.606 | 1.852.430 | 20.180.197 |
FR | 265 | 633.173 | 1.305.283 | 14.354.611 |
IT | 12200 | 202.730 | 405.099 | 2.571.090 |
NL | 83 | 116.593 | 539.102 | 1.779.824 |
The Periodicals module
It was created with the help of musicologists that provided the titles of different influencial music periodicals.
Metadata
The metadata of the module can be downloaded from:
Data
The data of the module cannot be downloaded due to copyright issue. However, it is possible to reconstruct the corpus using the metadata provided in the previous section. Furthermore, the data processed and annotated can be accessed interrogating the corpus (how to interrogate the corpus is explained in a README.md file inside the interrogation folder of this repository).
Statistics
Some statistics of the module are provided below:
lang | #documents | #sentences | #types | #tokens |
---|---|---|---|---|
DE | 705 | 121.113 | 544.376 | 2.405.289 |
EN | 2.868 | 4.400.015 | 7.342.527 | 76.180.398 |
ES | 455 | 87.025 | 677.041 | 3.170.480 |
FR | 349 | 329.166 | 696.427 | 5.111.915 |
IT | 1.251 | 433.465 | 992.902 | 7.879.459 |
NL | 125 | 187.350 | 716.506 | 3.880.499 |
The Polifonia Pilots module
It was created collecting the textual material selected by five Polifonia Pilots:
- BELLS
- CHILD
- MEETUPS
- MUSICBO
- ORGANS
Metadata
The metadata of the module can be downloaded from:
Data
The data of the Pilots Module of the Polifonia textual Corpus collected for Bells, MusicBo and Organs pilots cannot be published in their integral form because they are subject to heterogeneous license restrictions. The respective set of published metadata (see table above) allows for the reproduction of the whole corpora. Texts collected for Child and Meetups Pilots are royalty-free, therefore we report links to retrieve them from their corresponding GitHub repositories:
Pilot | url |
---|---|
CHILD | https://github.com/polifonia-project/documentary-evidence-benchmark/tree/main/corpus |
MEETUPS | https://github.com/polifonia-project/meetups_pilot/tree/main/cleanText |
However, it is possible to reconstruct the corpus using the metadata provided in the previous section. Furthermore, the data processed and annotated can be accessed interrogating the corpus (how to interrogate the corpus is explained in a README.md file inside the interrogation folder of this repository).
Statistics
Some statistics of the module are provided below:
pilot | #documents | #sentences | #types | #tokens |
---|---|---|---|---|
BELLS | 59 | 18.481 | 128.061 | 434.439 |
CHILD | 30 | 157.815 | 361.550 | 3.442.840 |
MEETUPS | 19.476 | 822.861 | 1.631.371 | 21.536.135 |
MUSICBO | 46 | 51.781 | 289.247 | 1.412.456 |
ORGANS | 1.660 | 25.647 | 45.298 | 368.439 |