MEETUPS
MEETUPS identification of temporal knowledge is a tool developed using Python and Jupyter Notebook. This software uses NLTK Toolkit and heuristic rules to identify and annotate time expressions from input text. The tool allows the extraction of one (when a historical meetup happened) of the four elements that define a historical meetup.
This implementation is a rule-based Time Expression recognition tagger based on research by Zhong et al. and SynTime software (https://github.com/zhongxiaoshi/syntime). Their work was originally tested using three datasets: TimeBank, WikiWars and Tweets. The authors implement a three-layer system that recognises time expressions using syntactic token types and general heuristic rules.
First layer - token identification:
- Annotate tokes with POS tags, we use NLTK. In SynTime they used CoreNLP.
- Annotate tokes according time tokens proposed by Zhong et al. Three types of tokens: TIME, MODIFIER, NUMERAL. Each type have more specific types: MODIFIER = [“PREFIX”,”SUFFIX”,”LINKAGE”,”COMMA”,”PARENTHESIS”,”INARTICLE”] NUMERAL = [“BASIC”,”DIGIT”,”ORDINAL”] TIME = [“DECADE”, “YEAR”, “SEASON”, “MONTH”, “WEEK”, “DATE”, “TIME”, “DAY_TIME”, “TIMELINE”, “HOLIDAY”, “PERIOD”, “DURATION”, “TIME_UNIT”,”TIME_ZONE”, “ERA”,”MID”,”TIME_ZONE”,”DAY”,”HALFDAY”] *In our implementation we add the type PARENTHESIS and improve regular expressions
Second layer - time segment identification:
- Search the surroundings of each time token identified previously for modifiers and numerals
- Gather the time token with its modifiers and numerals and form a time segment
- The search is under heuristic rules Search tokens on the left If PREFIX or NUMERAL or IN_ARTICLE continue searching Search tokens on the right If SUFIX or NUMERAL continue searching For right and left search, if token is COMMA or LINKAGE then stop
Third layer - time expression extraction If time segments overlap, then apply heuristic rules and merge segments
Time expressions classification: We add a step and classify time expressions according to literature - Time range: generally, one or two bounds, e.g., from XX to XX, from XX, to XX, until XX. - Time point: exact date and or time description 23/03/1294 - Time reference: usually incomplete dates (19 April), 2 weeks, later this year, relative to the document (the author’s date of birth? Sentence context? For later)
Finally the tool stores the results as a CSV file in extractedTimeExpressions/
Information on installation and setup
- Jupyter Notebook: 03_Identify_TimeE.ipynb
Details of the data
Code location:
|_ 03_Identify_TimeE.ipynb
Regular expressions:
|_ timeRegex.txt
Data location
Data input:
|_ indexedSentences/
Time expressions annotations
Data ouput:
|_ extractedTimeExpressions/
|_ README_people_places_identification.md
DOI:
TODO