
A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Open Access | Nov 2024

Figures & Tables

Figure 1. Example of a closed-domain EE using a predefined event schema.

Figure 2. Flowchart of the manual corpus annotation procedure.

Figure 3. Structure of an annotated corpus in the BioNLP Standoff format.

Figure 4. Structure of an annotated corpus in the BRAT Standoff format.

Figure 5. Structure of an annotated corpus in the CoNLL-U format.

Figure 6. Structure of an annotated corpus in OneIE's JSON format.

Figure 7. AlvisAE text annotation editor.

Figure 8. BRAT annotation tool.

Figure 9. TextAE online text annotation editor.

Figure 10. Steps for selecting documents to build an event extraction corpus.

Figure 11. Example of event annotation using the BRAT annotation tool.

Figure 12. Distribution of the corpora by language and domain.

Figure 13. Top five largest annotated corpora for event extraction tasks.

Figure 14. Comparison of tokens, sentences, and event mentions in existing annotated corpora.

Figure 15. Count of event mentions in each corpus.

Figure 16. Conceptual representation of the universal text annotation converter.

Summary of challenges and recommendations

Challenge: Lack of high-quality annotated data
Recommendations:
  • To enable rapid development of an annotated corpus, employ a hybrid approach as demonstrated by Li et al. (2022).
  • Annotate part of the texts manually, then train ML algorithms on the manual annotations to annotate the remaining data.
  • This strategy is faster than manually annotating all data; however, it is critical to measure the accuracy of the automatic annotations.

Challenge: Incompatibility of annotated corpus formats
Recommendations:
  • Develop a standardized, universally accepted format for annotating text corpora.
  • Such a format should store all the information required for common EE tasks.
  • Develop a universal text annotation converter for translating annotations between different formats (Figure 16).

Challenge: Subjectivity and text ambiguity
Recommendations:
  • Develop a complete annotation guideline and adhere to it strictly throughout the annotation process.
  • Use tools such as Git version control to manage versions of the annotation files.
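The hybrid strategy recommended above can be sketched as follows. This is a minimal illustration only: the trigger-word "model", the example sentences, and the held-out split are all hypothetical stand-ins for a real ML model and real annotated data.

```python
# Minimal sketch of hybrid corpus annotation: annotate a seed set manually,
# train on it, auto-annotate the rest, and measure accuracy on a held-out
# manually annotated slice before trusting the automatic labels.

# Hypothetical data: sentences paired with a manual label (1 = contains an
# event mention, 0 = no event). In practice these come from human annotators.
manual = [
    ("the company acquired a startup", 1),
    ("shares fell after the merger", 1),
    ("the report was published online", 0),
    ("prices stayed flat all week", 0),
]
unlabeled = ["a rival firm acquired new assets", "the weather was calm"]

# "Training": collect trigger words that appear only in event-positive sentences.
pos_words = {w for s, y in manual if y == 1 for w in s.split()}
neg_words = {w for s, y in manual if y == 0 for w in s.split()}
triggers = pos_words - neg_words

def auto_label(sentence):
    """Predict 1 if any learned trigger word appears in the sentence."""
    return int(any(w in triggers for w in sentence.split()))

# Auto-annotate the remaining data with the trained model.
auto_annotated = [(s, auto_label(s)) for s in unlabeled]

# The critical step from the table: estimate annotation accuracy against a
# held-out manually labeled set before accepting the automatic labels.
held_out = [("the startup acquired a patent", 1), ("the sky was clear", 0)]
accuracy = sum(auto_label(s) == y for s, y in held_out) / len(held_out)
print(auto_annotated, accuracy)
```

A real pipeline would replace the trigger dictionary with a trained sequence labeler, but the workflow (manual seed, automatic expansion, accuracy check) is the same.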

Summary of recent studies on LLMs for corpus annotations

Csanády et al. (2024)
Results: BERT shows 91.2% to 96.5% test accuracy on the IMDb datasets over random baselines.
Advantages:
  • The proposed method can handle large-scale text annotation tasks.
  • Provides a cost-effective alternative for annotating large amounts of text.
Limitations:
  • Annotation using LLMs slightly compromises annotation accuracy.
  • LLMs alone cannot provide high-quality corpus annotations; the annotated corpus is not suitable for EE tasks.

Akkurt et al. (2024)
Results:
  • The proposed approach improved results by 2%.
  • All models show improved performance, with GPT-4 + UD Turkish BOUN v2.11 achieving the best performance at 76.9%.
Advantages:
  • The model has been tested with data from the UD English and Turkish Treebanks.
  • The authors use public data and verify that the methodology complies with ethical standards.
Limitations:
  • The annotation outcome varies (is inconsistent) depending on the user's prompt.
  • The method targets entity annotation; its output is not suitable for EE tasks.

Frei and Kramer (2023)
Results on various baseline models:
  • gbert-large (P: 70.7%, R: 97.9%, F1: 82.1%)
  • GottBERT-base (P: 80.0%, R: 89.9%, F1: 84.7%)
  • German-MedBERT (P: 72.7%, R: 81.8%, F1: 77.0%)
Advantages:
  • Addresses the limited availability of corpora for non-English medical texts.
  • The proposed method shows reliable performance.
Limitations:
  • The proposed method is computationally expensive.
  • The annotated corpus cannot be considered gold-standard and requires further validation.
  • The method targets entity annotation; its output is not suitable for EE tasks.

Li et al. (2023)
Results: Up to 21% performance improvement over random baselines.
Advantages:
  • The annotation process is carried out jointly by humans and LLMs.
  • Provides a cost-effective alternative for annotating large amounts of text.
Limitations:
  • The study does not assess whether LLM-generated annotations outperform a human-annotated corpus.
  • The method targets entity annotation; its output is not suitable for EE tasks.
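The human-LLM collaboration pattern these studies describe can be sketched roughly as below. Note that `query_llm` is a hypothetical stand-in for a real LLM API call, and the sampling-and-voting routing scheme is illustrative, not any specific study's method.

```python
from collections import Counter

def query_llm(text, seed):
    """Hypothetical stand-in for an LLM call that labels a text span.
    A real system would call a model API; here we fake the behaviour:
    stable answers on easy inputs, unstable answers on hard ones."""
    fake = {"acquired": "Transaction", "merger": "Transaction"}
    for word, label in fake.items():
        if word in text:
            return label
    return "None" if seed % 2 == 0 else "Other"  # unstable on hard cases

def annotate(text, n_samples=5, min_agreement=0.8):
    """Sample the LLM several times; accept the majority label only when
    agreement is high, otherwise route the item to a human annotator."""
    votes = Counter(query_llm(text, seed) for seed in range(n_samples))
    label, count = votes.most_common(1)[0]
    if count / n_samples >= min_agreement:
        return label, "auto"
    return label, "human_review"

print(annotate("the merger was announced"))   # stable answer -> accepted automatically
print(annotate("something happened today"))   # unstable answer -> routed to a human
```

The design point is the one the limitations column raises: because LLM output varies with the prompt and sampling, a production pipeline needs an agreement check and a human fallback rather than blind acceptance.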

Annotated corpora for the event extraction task

| ID | Short Name | Full Name | Domain Area | Language | Corpus Size (# docs) | Annotation Method | Public Access | Charges | Format | Benchmark Corpus |
|----|-----------|-----------|-------------|----------|----------------------|-------------------|---------------|---------|--------|------------------|
| C01 | MUSIED | Multi-Source Informal Event Detection | General | Chinese | 11,381 | Manual | ✓ | Free of charge | JSON | × |
| C02 | MAVEN | MAssive eVENt detection dataset | General | English | 4,480 | Manual | ✓ | Free of charge | JSON | ✓ |
| C03 | ACE 2005 | ACE 2005 Multilingual Training Corpus | General | English, Chinese | 599 (En), 633 (Ch) | Manual | × | Licensed (paid) | XML | ✓ |
| C04 | CFEE | Chinese Financial Event Extraction | Finance | Chinese | 2,976 | Automatic | ✓ | Free of charge | JSON | ✓ |
| C05 | ChFinAnn | ChFinAnn | Finance | Chinese | 32,040 | Manual | ✓ | Free of charge | JSON | ✓ |
| C06 | FEED | Chinese Financial Event Extraction Dataset | Finance | Chinese | 31,748 | Automatic & manual | ✓ | Free of charge | JSON | × |
| C07 | EPI | Epigenetics and Post-Translational Modifications 2011 | Biomedical | English | 1,200 | Manual | × | Free of charge | BioNLP Standoff | ✓ |
| C08 | ID | Infectious Diseases 2011 | Biomedical | English | 30 | Manual | ✓ | Free of charge | BioNLP Standoff | ✓ |
| C09 | GE 11 | Genia Event Extraction 2011 | Biomedical | English | 1,210 | Manual | ✓ | Free of charge | BioNLP Standoff | ✓ |
| C10 | PC | Pathway Curation 2013 | Biomedical | English | 525 | Manual | ✓ | Free of charge | BioNLP Standoff | ✓ |
| C11 | CG | Cancer Genetics 2013 | Biomedical | English | 600 | Manual | ✓ | Free of charge | BioNLP Standoff | ✓ |
| C12 | BB3 | Bacteria Biotope 2016 | Biomedical | English | 215 | Manual | × | Free of charge | BioNLP Standoff | ✓ |
| C13 | MLEE | Multi-Level Event Extraction | Biomedical | English | 262 | Manual | ✓ | Free of charge | BRAT Standoff, CoNLL-U | ✓ |
| C14 | LEVEN | Large-Scale Chinese Legal Event Detection Dataset | Legal | Chinese | 8,116 | Automatic & manual | ✓ | Free of charge | JSON | ✓ |

Corpus statistics

| ID | Corpus Name | Data Sources | Tokens Count | Sentences Count | Event Mentions | Negative Events | Event Types |
|----|-------------|--------------|--------------|-----------------|----------------|-----------------|-------------|
| C01 | MUSIED | 11,381 docs | 7.105 M | 315,743 | 35,313 | N/A | 21 |
| C02 | MAVEN | 4,480 docs | 1.276 M | 49,873 | 118,732 | 497,261 | 168 |
| C03 | ACE 2005 | 599 docs (En), 633 docs (Ch) | 303k (En), 321k (Ch) | 15,789 (En), 7,269 (Ch) | 5,349 (En), 3,333 (Ch) | N/A | 5 |
| C04 | CFEE | 2,976 docs | N/A | N/A | 3,044 | 32,936 | 4 |
| C05 | ChFinAnn | 32,040 docs | 29,220,480 | 640,800 | > 48,000 | N/A | 5 |
| C06 | FEED | 31,748 docs | 28,954,176 | 603,212 | 46,960 | N/A | 5 |
| C07 | EPI | 1,200 abstracts | 253,628 | N/A | 3,714 | 369 | 8 |
| C08 | ID | 30 full-text articles | 153,153 | 5,118 | 5,150 | 214 | 10 |
| C09 | GE 11 | 1,210 abstracts | 267,229 | N/A | 13,603 | N/A | 9 |
| C10 | PC | 525 docs | 108,356 | N/A | 12,125 | 571 | 21 |
| C11 | CG | 600 abstracts | 129,878 | N/A | 17,248 | 1,326 | 40 |
| C12 | BB3 | 146 abstracts (ee), 161 abstracts (ee+ner) | 35,380 (ee), 39,118 (ee+ner) | N/A | 890 (ee), 864 (ee+ner) | N/A | 2 |
| C13 | MLEE | 262 docs | 56,588 | 2,608 | 6,677 | N/A | 29 |
| C14 | LEVEN | 8,116 docs | 2.241 M | 63,616 | 150,977 | N/A | 108 |
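Comparisons like the one in Figure 14 can be derived directly from these statistics. The small sketch below computes event-mention density (mentions per sentence) for the corpora that report both counts; the numbers are copied from the table above.

```python
# Event-mention density for corpora reporting both sentence and mention
# counts; (sentences, event mentions) values taken from the statistics table.
stats = {
    "MUSIED": (315_743, 35_313),
    "MAVEN": (49_873, 118_732),
    "ID": (5_118, 5_150),
    "MLEE": (2_608, 6_677),
    "LEVEN": (63_616, 150_977),
}
density = {name: round(events / sents, 2) for name, (sents, events) in stats.items()}
for name, d in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {d} event mentions per sentence")
```

This makes the contrast visible at a glance: curated event-detection corpora such as MAVEN and MLEE average more than two mentions per sentence, while informal-text corpora like MUSIED are far sparser.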

Comparison summary of the common annotation formats

| Annotation Format | Summary | Output Files | Implementation Method | Annotation Structure |
|-------------------|---------|--------------|-----------------------|----------------------|
| BioNLP Standoff | Widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges. | .txt, .a1, .a2 | Manual annotation using a text corpus annotation tool | Tab-delimited data |
| BRAT Standoff | Almost identical to the BioNLP format, with the annotations combined into a single annotation file (.ann). | .txt, .ann | Manual annotation using a text corpus annotation tool | Tab-delimited data |
| CoNLL-U | Sentence-level annotations are presented in three types of lines: comment, word, and blank lines. | .txt, .conll | Python's spacy_conll package | Tab-delimited data |
| OneIE's JSON format | Provides comprehensive annotation storage for each sentence in a JSON object structure. | .json | OneIE's preprocessing script or manual data transformation | JSON structure |
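A universal converter like the one proposed in Figure 16 would, at its core, parse one of these formats and re-serialize another. The sketch below parses BRAT Standoff entity (T) and event (E) lines and emits a generic JSON structure; the .ann content and the target JSON layout are illustrative examples, not OneIE's exact schema.

```python
import json

# Illustrative BRAT Standoff (.ann) content: T-lines are text-bound
# annotations (entities and event triggers) with a type, character offsets,
# and the covered text; E-lines are events linking a trigger to its
# arguments by annotation ID.
ann = """\
T1\tProtein 0 4\tTP53
T2\tGene_expression 5 15\texpression
E1\tGene_expression:T2 Theme:T1
"""

def brat_to_json(ann_text):
    """Convert BRAT Standoff annotation lines to a generic JSON structure."""
    spans, events = {}, []
    for line in ann_text.splitlines():
        if line.startswith("T"):
            tid, meta, text = line.split("\t")
            label, start, end = meta.split()
            spans[tid] = {"type": label, "start": int(start),
                          "end": int(end), "text": text}
        elif line.startswith("E"):
            eid, args = line.split("\t")
            trigger_ref, *arg_refs = args.split()
            etype, trigger_id = trigger_ref.split(":")
            events.append({
                "id": eid,
                "event_type": etype,
                "trigger": spans[trigger_id],
                "arguments": [
                    {"role": role, **spans[ref]}
                    for role, ref in (a.split(":") for a in arg_refs)
                ],
            })
    return {"entities": spans, "events": events}

print(json.dumps(brat_to_json(ann), indent=2))
```

Because both standoff formats keep offsets relative to the untouched source .txt file, round-tripping between them and a JSON representation is lossless for this information; a full converter would additionally handle relation, attribute, and normalization lines.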

Corpus annotation tools

| ID | Tool Name | Platform Compatibility | Output Format | Charges & License | Latest Stable Release |
|----|-----------|------------------------|---------------|-------------------|-----------------------|
| T01 | AlvisAE | Web-based (RESTful web app) | JSON | Free (open source); no license provided | 2016 |
| T02 | BRAT Rapid Annotation Tool | Web-based (Python package) | BRAT Standoff | Free; MIT License | v1.3 "Crunchy Frog" (Nov 8, 2012) |
| T03 | TextAE | Online/web-based (Python package) | JSON | Free (open source); MIT License | v4.5.4 (Mar 1, 2017) |
DOI: https://doi.org/10.2478/jdis-2024-0029 | Journal eISSN: 2543-683X | Journal ISSN: 2096-157X
Language: English
Page range: 196 - 238
Submitted on: Apr 27, 2024
Accepted on: Sep 3, 2024
Published on: Nov 19, 2024
Published by: Chinese Academy of Sciences, National Science Library
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2024 Mohd Hafizul Afifi Abdullah, Norshakirah Aziz, Said Jadid Abdulkadir, Kashif Hussain, Hitham Alhussian, Noureen Talpur, published by Chinese Academy of Sciences, National Science Library
This work is licensed under the Creative Commons Attribution 4.0 License.