musicbo-knowledge-graph
MusicBO Knowledge Graph stores information about the role of music in the city of Bologna from a historical and social perspective. It aims to satisfy the requirements of the MusicBO pilot use case, namely conveying knowledge about music performances in Bologna and about encounters between musicians, composers, critics and historians who passed through the city.
MusicBO Knowledge Graph is available via the MusicBO SPARQL endpoint.
MusicBO Knowledge Graph is automatically extracted from natural language texts by applying a custom text-to-Knowledge Graph (text2KG) process to the MusicBO corpus documents. The MusicBO corpus is part of the Polifonia Corpus.
The process relies on two modules: the Polifonia Knowledge Extractor pipeline and the AMR2Fred tool. The first uses AMR (Abstract Meaning Representation) to parse sentences into semantic graphs. The second transforms AMR graphs into RDF/OWL KGs based on FRED's logical form, exploiting the similarities between AMR graphs and FRED’s output representation: both are graph-based and event-centric. The Polifonia Knowledge Extractor pipeline provides the input to the AMR2Fred tool. The two modules are orchestrated by the Machine Reading suite, which queries both components through the Text-to-AMR-to-FRED API and generates RDF named graphs from natural language text.
The Text2KG process for the automatic creation of the MusicBO KG can be broken down into its main steps as follows:
- [Input.] For the scope of this Deliverable, we applied our Text2KG process to the English and Italian documents of the MusicBO corpus. We took as input 47 documents in English and 51 documents in Italian.
- [Pre-processing.] The MusicBO corpus documents that we chose as input were originally in PDF, image or .docx format. We therefore had to extract the plain text from them, using ad hoc Optical Character Recognition (OCR) technologies from textual-corpus-population when necessary. We then performed co-reference resolution: for English documents, we implemented a co-reference resolution pipeline based on spaCy’s neuralcoref. We have not yet implemented any co-reference resolution procedure for the Italian documents, as we are still evaluating the performance of state-of-the-art co-reference resolution tools for Italian. We also performed minimal rule-based post-OCR correction and sentence splitting on the extracted plain texts.
- [Text2AMR Parsing.] The sentences resulting from the pre-processing step are submitted to state-of-the-art neural text-to-AMR parsers. AMR has gained significant attention in recent years as a meaning representation formalism, given its ability to abstract away from syntactic variability and its potential to act as an interlingua in scenarios that involve multilingual textual sources. For sentences in English, we used SPRING. For sentences in Italian, we used USeA.
- [Filtering.] This step is a preliminary attempt at evaluating the AMR graphs. Because we work with non-standard texts (historical documents), the output of state-of-the-art AMR parsers may be inaccurate. Human validation is time-consuming, and there are no standard benchmarks for the semantic parsing of historical and OCRed text. For this reason, we adopted a back-translation approach: the generated AMR graphs are converted back into sentences (AMR2Text), and a similarity score is computed between each original sentence and its generated counterpart. For English, we used SPRING for AMR2Text generation and BLEURT as the similarity score. For Italian, we used m-AMR2Text for AMR2Text generation and computed the cosine similarity between the embeddings of the original and generated sentences, obtained with LASER, an off-the-shelf multilingual sentence embedding toolkit. We hypothesise that generated sentences with high BLEURT or cosine similarity scores correspond to high-quality graphs. We therefore discarded all graphs in our English AMR graph bank whose AMR2Text-generated sentences have a negative BLEURT score, and all graphs in our Italian AMR graph bank whose AMR2Text-generated sentences have a cosine similarity below 0.90. According to our sample-based qualitative error analysis, negative BLEURT scores and cosine similarities below 0.90 corresponded to low-quality generated sentences and, consequently, to low-quality AMR graphs. These quality issues were associated with input sentences affected, for example, by severe OCR errors.
- [AMR2Fred translation.] Finally, we transformed the graphs retained by the filtering step into OWL/RDF Knowledge Graphs that follow FRED knowledge representation patterns. This transformation is done by querying the AMR2Fred tool via the Machine Reading suite. The output consists of named graphs serialised in the N-Quads syntax. Named graphs extend the standard RDF triple model with a “context” element which, among other things, allows each triple to be associated with information about its provenance. In our case, the context element of a MusicBO Knowledge Graph triple indicates which sentence’s graph the triple belongs to. At this step, we also enrich the resulting FRED-like RDF/OWL KGs using the Framester semantic hub. Thanks to Framester, information that is implicit in the text can be made explicit by integrating knowledge from different knowledge bases such as FrameNet, WordNet, VerbNet, BabelNet, DBpedia, YAGO, DOLCE-Zero and other resources. In particular, we enrich the FRED-like RDF/OWL KGs with Word Sense Disambiguation (WSD) information. The WSD process currently implemented applies to those elements of the FRED-like RDF/OWL KGs that correspond to AMR graph nodes not linked to any lexical resource or knowledge base, namely all AMR graph nodes that are not treated as PropBank predicates or named entities. The process submits the sentence associated with a FRED-like RDF/OWL KG to EWISER, a WSD system, and attaches the resulting WSD information to the AMR2Fred nodes whose labels in the AMR graph match lemmas of the processed sentence (provided the node is among those that need WSD enrichment). We use WordNet as the lexical resource that supplies the word senses.
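The thresholding rule of the filtering step can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names `cosine_similarity` and `keep_graph` and the language codes `"en"` and `"it"` are assumptions, and the scores are expected to be computed elsewhere (BLEURT for English, cosine similarity over LASER embeddings for Italian).

```python
import math

# Thresholds stated in the text: graphs are discarded when the BLEURT score
# is negative (English) or the cosine similarity is below 0.90 (Italian).
BLEURT_MIN = 0.0
COSINE_MIN = 0.90


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate vectors
    return dot / (norm_u * norm_v)


def keep_graph(language, score):
    """Return True if an AMR graph passes the back-translation filter.

    `language` is "en" (score is a BLEURT value) or "it" (score is the
    cosine similarity between original and generated sentence embeddings).
    Both codes are hypothetical labels used only in this sketch.
    """
    if language == "en":
        return score >= BLEURT_MIN
    if language == "it":
        return score >= COSINE_MIN
    raise ValueError(f"unsupported language: {language}")
```

For example, `keep_graph("en", -0.2)` rejects a graph, while `keep_graph("it", 0.93)` retains it.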
We provide in folder “input_csv” of this repository the input CSV that contains the pre-processed and filtered English sentences of the MusicBO corpus (steps 1-4 of the process described above). The CSV is ready to be sent as input to the Machine Reading suite to create the named graphs, as per step 5 of the Text2KG process. The CSV has six columns:
- corpus_id, which is an identifier for a corpus;
- document_id, which is an identifier of a document within a corpus;
- sentence_id, which is an identifier of a sentence within a document;
- content, which is the text of the sentence to process;
- document_uri, which is a link to a Web page from which the document of a corpus can be retrieved;
- corpus_uri, which is the DOI of a corpus.
Here’s an excerpt of the CSV file:
corpus_id | document_id | sentence_id | content | document_uri | corpus_uri |
---|---|---|---|---|---|
MusicBO | 1 | 1009 | The more I deviated from the path which my wife regarded as the only profitable one, due partly to the change of my views (which I grew ever less willing to communicate to her), and partly to the modification in my attitude towards the stage, the more my wife retreated from that position of close fellowship with me which my wife had enjoyed in former years, and which my wife thought herself justified in connecting in some way with my successes. | https://freeditorial.com/en/books/my-life-volume-1 | https://doi.org/10.5281/zenodo.6672165 |
MusicBO | 4 | 363 | And “off and on” we should be sure to undertake something to give vent to our energies in the outer world. | https://www.gutenberg.org/cache/epub/4234/pg4234.txt | https://doi.org/10.5281/zenodo.6672165 |
MusicBO | 35 | 28 | To this Artusi replied in Considerationi musicali, printed in Seconda parte dell’Artusi (1603), mockingly dedicated to Bottrigari. | https://doi.org/10.1093/gmo/9781561592630.article.01383 | https://doi.org/10.5281/zenodo.6672165 |
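As a rough sketch, the six-column layout can be checked and read with Python's standard `csv` module. The comma delimiter, the presence of a header row and the helper name `read_sentences` are assumptions, since the excerpt above is rendered as a table and the actual file name is not given here.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "corpus_id", "document_id", "sentence_id",
    "content", "document_uri", "corpus_uri",
]


def read_sentences(handle):
    """Yield one dict per sentence row after checking the six-column header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        yield row


# Inline sample mirroring the second row of the excerpt; a real run would
# open the file from the input_csv folder instead.
sample = io.StringIO(
    "corpus_id,document_id,sentence_id,content,document_uri,corpus_uri\n"
    'MusicBO,4,363,"And “off and on” we should be sure to undertake something '
    'to give vent to our energies in the outer world.",'
    "https://www.gutenberg.org/cache/epub/4234/pg4234.txt,"
    "https://doi.org/10.5281/zenodo.6672165\n"
)
rows = list(read_sentences(sample))
```

Each row is a dictionary keyed by column name, so `rows[0]["content"]` holds the sentence that would be sent to the Machine Reading suite.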
We provide in folder “data” of this repository the KGs obtained through the Text2KG process described above. Statistics for the latest release of the KGs are given in the table below (figures use a dot as the thousands separator):
Languages | #sent-AMR graph pairs (Text2AMR) | #filtered sent-AMR graph pairs (Automatic metrics evaluation) | #named graphs (AMR2RDF) | #triples |
---|---|---|---|---|
EN | 51.814 | 5.965 | 5.798 | 410.132 |
ITA | 10.563 | 1.815 | 1.759 | 118.162 |
TOTAL | 15.747 | 7.780 | 7.557 | 531.222 |
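Because each named graph corresponds to one sentence, the triples of a release can be grouped by their context element. The sketch below assumes the files are N-Quads, as described for step 5, and uses a deliberately simplified regular expression rather than a full N-Quads parser; the IRIs in the example are hypothetical and do not come from the MusicBO data.

```python
import re
from collections import defaultdict

# Simplified N-Quads pattern: subject, predicate, object, graph label, final dot.
# It ignores comments and escape sequences, so it covers only simple statements.
QUAD = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s+(<[^>]*>|_:\S+)\s*\.\s*$")


def group_by_graph(lines):
    """Map each graph label (context) to the list of triples it contains."""
    graphs = defaultdict(list)
    for line in lines:
        match = QUAD.match(line.strip())
        if match:
            s, p, o, g = match.groups()
            graphs[g].append((s, p, o))
    return dict(graphs)


# Hypothetical statements, for illustration only.
example = [
    '<http://example.org/e1> <http://example.org/agent> <http://example.org/Artusi> <http://example.org/graph/sentence_28> .',
    '<http://example.org/e1> <http://www.w3.org/2000/01/rdf-schema#label> "reply" <http://example.org/graph/sentence_28> .',
    '<http://example.org/e2> <http://example.org/agent> <http://example.org/Wagner> <http://example.org/graph/sentence_1009> .',
]
graphs = group_by_graph(example)
```

Here `graphs` has one entry per sentence graph, which is the granularity counted in the "#named graphs" column. For real data, a complete N-Quads parser from an RDF library would be a safer choice than this pattern.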
MusicBO Knowledge Graph is described in a dedicated MELODY data story.