Component id	meetups-data-cleaning
Type	Software
Name	MEETUPS preparation - data cleaning
Description	This tool is part of the corpus preparation process and it is used to clean data collected from Wikipedia.
Work package	WP4
Pilot	MEETUPS
Project	polifonia-project
Resource	https://github.com/polifonia-project/meetups_pilot/blob/main/01_CleaningText.ipynb
Release date	20/07/2022
Release number	v0.1
Release link	https://github.com/polifonia-project/meetups_pilot/releases/tag/v0.2
Doi	10.5281/zenodo.7875353
Changelog	https://github.com/polifonia-project/meetups_pilot/releases/tag/v0.2
Licence	Apache-2.0
Copyright	Copyright (c) 2023 MEETUPS @ The Open University
Contributors	Alba Morales Tirado, Enrico Daga
Related components	Ortenz (Persona), David (Persona), Sophie (Persona)


MEETUPS Corpus preparation: data cleaning

MEETUPS Corpus preparation: Cleaning data collected from Wikipedia web pages of people in the music scene in Europe

MEETUPS data cleaning is a tool developed in Python as a Jupyter Notebook. It prepares the biographies (collected as text files) in https://github.com/polifonia-project/meetups_corpus_collection for the next step of the historical meetups extraction process.

Use the Wikipedia authors’ webpages collected in https://github.com/polifonia-project/meetups_corpus_collection
Clean the text by removing blank lines and sections with no historical meetups data
Organise the text into sentences, the main unit from which meetups information is extracted
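
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `== Heading ==` section markers, the `SKIP_SECTIONS` list, the function names and the regex-based sentence splitter are all assumptions made for the example, and the notebook may rely on a dedicated NLP library instead.

```python
import re

# Hypothetical section titles assumed to hold no meetup information;
# the list used by the notebook may differ.
SKIP_SECTIONS = {"references", "external links", "see also", "bibliography"}


def clean_biography(raw_text, skip_sections=SKIP_SECTIONS):
    """Return the paragraphs of a biography without blank lines,
    without heading lines, and without the content of skipped sections.

    Assumes headings look like '== Title ==' (an assumption about the
    raw text files, not something the README states).
    """
    paragraphs = []
    skipping = False
    for line in raw_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        heading = re.fullmatch(r"=+\s*(.+?)\s*=+", stripped)
        if heading:
            # A new section starts: decide whether its content is kept.
            skipping = heading.group(1).lower() in skip_sections
            continue  # heading lines themselves are not running text
        if not skipping:
            paragraphs.append(stripped)
    return paragraphs


def split_sentences(paragraphs):
    """Naive sentence splitter: break after '.', '!' or '?' when followed
    by whitespace and an uppercase letter. Abbreviations such as 'Mr.'
    will be split incorrectly."""
    sentences = []
    for paragraph in paragraphs:
        sentences.extend(re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph))
    return sentences


# Made-up sample text, used only to show the call pattern.
sample = (
    "== Early life ==\n"
    "\n"
    "The composer met the violinist in Vienna in 1820. "
    "They performed together in Prague.\n"
    "\n"
    "== References ==\n"
    "\n"
    "Some citation text.\n"
)
sample_sentences = split_sentences(clean_biography(sample))
```

With the made-up sample, `sample_sentences` holds the two sentences of the first section, and the content of the skipped section is gone.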

Information on installation and setup

Run Jupyter Notebook 01_CleaningText.ipynb
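
The notebook is normally opened and run interactively in Jupyter. For unattended runs, one possible approach is to execute it through `jupyter nbconvert`, sketched below. It assumes that Jupyter and nbconvert are installed and that the command is launched from the folder containing the notebook. The helper name `run_notebook` and its `dry_run` parameter are hypothetical.

```python
import subprocess


def run_notebook(notebook="01_CleaningText.ipynb", dry_run=False):
    """Execute a notebook in place with nbconvert and return the command used.

    With dry_run=True the command is only built, not executed.
    """
    command = [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",   # run all cells
        "--inplace",   # write the results back into the same file
        notebook,
    ]
    if not dry_run:
        # check=True raises CalledProcessError if the notebook fails.
        subprocess.run(command, check=True)
    return command
```

Calling `run_notebook(dry_run=True)` returns the command list without running anything, which is handy for checking the call before a long run.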

Details of the data

Code location:

|_ 01_CleaningText.ipynb

Raw corpus location
Data input:
|_ text_dataset/

Clean text location
Data output:
|_ cleanText/

Index data location
Data output:
|_ indexedParagraphs/
|_ indexedSentences/

|_ README_data_cleaning.md
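
As an illustration of how the folders listed above might be used, the sketch below writes a list of text units (paragraphs or sentences) to one of the index folders. The README does not specify the file format, so the `.tsv` extension, the `index<TAB>text` line layout and the helper name `write_indexed` are assumptions.

```python
from pathlib import Path

# Folder names taken from the layout above.
RAW_DIR = Path("text_dataset")
CLEAN_DIR = Path("cleanText")
PARAGRAPH_INDEX_DIR = Path("indexedParagraphs")
SENTENCE_INDEX_DIR = Path("indexedSentences")


def write_indexed(units, out_dir, stem):
    """Write one 'index<TAB>text' line per unit to out_dir/<stem>.tsv.

    The output format is hypothetical; returns the path of the file written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{stem}.tsv"
    with out_path.open("w", encoding="utf-8") as handle:
        for index, unit in enumerate(units):
            handle.write(f"{index}\t{unit}\n")
    return out_path
```

For example, `write_indexed(sentences, SENTENCE_INDEX_DIR, "some_biography")` would create `indexedSentences/some_biography.tsv`, where `sentences` is any list of strings and the file stem is a placeholder.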

DOI: 10.5281/zenodo.7875353