Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Figure 8.

Figure 9.

Figure 10.

Figure 11.

Figure 12.

Figure 13.

Figure 14.

Figure 15.

Figure 16.

Summary of challenges and recommendations
| Challenges | Recommendations |
|---|---|
| Lack of high-quality annotated data | |
| Incompatibility of annotated corpus formats | |
| Subjectivity and text ambiguity | |
Summary of recent studies on LLMs for corpus annotations
| Study | Results | Advantages | Limitations |
|---|---|---|---|
| Csanády et al. (2024) | BERT shows 91.2% to 96.5% test accuracies on the IMDb datasets using the model on random baselines. | | |
| Akkurt et al. (2024) | | | |
| Frei and Kramer (2023) | Result on various baseline models: | | |
| Li et al. (2023) | The result shows up to 21% performance improvement over random baselines. | | |
Annotated corpus for the event extraction task
| ID | Corpus Short Name | Corpus Full Name | Domain Area | Language | Corpus Size (# docs) | Annotation Method | Public Access | Charges | Format | Benchmark Corpus |
|---|---|---|---|---|---|---|---|---|---|---|
| C01 | MUSIED | Multi-Source Informal Event Detection | General | Chinese | 11,381 | Manual | √ | Free of charge | JSON | × |
| C02 | MAVEN | MAssive eVENt detection dataset | General | English | 4,480 | Manual | √ | Free of charge | JSON | √ |
| C03 | ACE 2005 | ACE 2005 Multilingual Training Corpus | General | English, Chinese | 599 (En), 633 (Ch) | Manual | × | Licensed (Paid) | XML | √ |
| C04 | CFEE | Chinese Financial Event Extraction | Finance | Chinese | 2,976 | Automatic | √ | Free of charge | JSON | √ |
| C05 | ChFinAnn | ChFinAnn | Finance | Chinese | 32,040 | Manual | √ | Free of charge | JSON | √ |
| C06 | FEED | Chinese Financial Event Extraction Dataset | Finance | Chinese | 31,748 | Automatic & manual | √ | Free of charge | JSON | × |
| C07 | EPI | Epigenetics and Post-Translational Modifications 2011 | Biomedical | English | 1,200 | Manual | × | Free of charge | BioNLP Standoff | √ |
| C08 | ID | Infectious Diseases 2011 | Biomedical | English | 30 | Manual | √ | Free of charge | BioNLP Standoff | √ |
| C09 | GE 11 | Genia Event Extraction 2011 | Biomedical | English | 1,210 | Manual | √ | Free of charge | BioNLP Standoff | √ |
| C10 | PC | Pathway Curation 2013 | Biomedical | English | 525 | Manual | √ | Free of charge | BioNLP Standoff | √ |
| C11 | CG | Cancer Genetics 2013 (CG) | Biomedical | English | 600 | Manual | √ | Free of charge | BioNLP Standoff | √ |
| C12 | BB3 | Bacteria Biotope 2016 | Biomedical | English | 215 | Manual | × | Free of charge | BioNLP Standoff | √ |
| C13 | MLEE | Multi-Level Event Extraction | Biomedical | English | 262 | Manual | √ | Free of charge | BRAT Standoff, CoNLL-U | √ |
| C14 | LEVEN | Large-Scale Chinese Legal Event Detection Dataset | Legal | Chinese | 8,116 | Automatic & manual | √ | Free of charge | JSON | √ |
Corpus statistics
| ID | Corpus Name | Data Sources | Tokens Count | Sentences Count | Event Mentions | Negative Events | Event Types |
|---|---|---|---|---|---|---|---|
| C01 | MUSIED | 11,381 docs | 7.105 M | 315,743 | 35,313 | N/A | 21 |
| C02 | MAVEN | 4,480 docs | 1.276 M | 49,873 | 118,732 | 497,261 | 168 |
| C03 | ACE 2005 | 599 docs (En), 633 docs (Ch) | 303k (En), 321k (Ch) | 15,789 (En), 7,269 (Ch) | 5,349 (En), 3,333 (Ch) | N/A | 5 |
| C04 | CFEE | 2,976 docs | N/A | N/A | 3,044 | 32,936 | 4 |
| C05 | ChFinAnn | 32,040 docs | 29,220,480† | 640,800† | > 48,000 | N/A | 5 |
| C06 | FEED | 31,748 docs | 28,954,176† | 603,212† | 46,960 | N/A | 5 |
| C07 | EPI | 1,200 abstracts | 253,628 | N/A | 3,714 | 369 | 8 |
| C08 | ID | 30 full-text articles | 153,153 | 5,118 | 5,150 | 214 | 10 |
| C09 | GE 11 | 1,210 abstracts | 267,229 | N/A | 13,603 | N/A | 9 |
| C10 | PC | 525 docs | 108,356 | N/A | 12,125 | 571 | 21 |
| C11 | CG | 600 abstracts | 129,878 | N/A | 17,248 | 1,326 | 40 |
| C12 | BB3 | 146 abstracts (ee), 161 abstracts (ee+ner) | 35,380 (ee), 39,118 (ee+ner) | N/A | 890 (ee), 864 (ee+ner) | N/A | 2 |
| C13 | MLEE | 262 docs | 56,588 | 2,608 | 6,677 | N/A | 29 |
| C14 | LEVEN | 8,116 docs | 2.241 M | 63,616 | 150,977 | N/A | 108 |
Comparison summary of the common annotation formats
| Annotation Format | Summary | Output Files | Implementation method | Annotation structure |
|---|---|---|---|---|
| BioNLP Standoff | The annotation format is widely used in the BioNLP Shared Task and BioNLP Open Shared Task challenges. | .txt, .a1, .a2 | Manual annotation using a text corpus annotation tool | Tab-delimited data |
| BRAT Standoff | The annotation format is almost identical to the BioNLP format, but the annotations are combined into a single annotation file (.ann). | .txt, .ann | Manual annotation using a text corpus annotation tool | Tab-delimited data |
| CoNLL-U | Sentence-level annotations are presented in three types of lines: comment lines, word lines, and blank lines. | .txt, .conll | Python’s spacy_conll package | Tab-delimited data |
| OneIE’s JSON format | Stores the complete set of annotations for each sentence as a JSON object. | .json | OneIE’s package preprocessing script or manual data transformation | JSON structure |
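
To make the tab-delimited standoff structure in the table above concrete, the following is a minimal sketch of reading text-bound (`T`) and event (`E`) lines from a BRAT-style `.ann` file. The names `TextBound`, `Event`, and `parse_standoff` are hypothetical helpers introduced here for illustration, and the sample lines are invented rather than taken from any of the listed corpora. The sketch assumes contiguous spans only and skips every other line kind (relations, attributes, notes).

```python
from dataclasses import dataclass, field


@dataclass
class TextBound:
    """A text-bound annotation (entity or event trigger), i.e. a `T` line."""
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Event:
    """An event annotation, i.e. an `E` line: a typed trigger plus role:target arguments."""
    id: str
    type: str
    trigger: str
    arguments: list = field(default_factory=list)  # list of (role, target_id) tuples


def parse_standoff(ann_text):
    """Parse `T` and `E` lines from standoff text; all other lines are ignored."""
    text_bounds, events = {}, {}
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T") and len(fields) >= 3:
            # Assumption: a single contiguous span written as "TYPE START END".
            ann_type, start, end = fields[1].split(" ")
            text_bounds[ann_id] = TextBound(ann_id, ann_type, int(start), int(end), fields[2])
        elif ann_id.startswith("E") and len(fields) >= 2:
            # First item is "EVENT_TYPE:TRIGGER_ID", the rest are "ROLE:TARGET_ID".
            head, *args = fields[1].split()
            event_type, trigger = head.split(":")
            events[ann_id] = Event(ann_id, event_type, trigger,
                                   [tuple(arg.split(":")) for arg in args])
    return text_bounds, events


# Invented annotations for the sentence "TP53 is phosphorylated."
SAMPLE = (
    "T1\tProtein 0 4\tTP53\n"
    "T2\tPhosphorylation 8 22\tphosphorylated\n"
    "E1\tPhosphorylation:T2 Theme:T1\n"
)

text_bounds, events = parse_standoff(SAMPLE)
for event in events.values():
    trigger = text_bounds[event.trigger]
    print(event.id, event.type, repr(trigger.text), event.arguments)
```

Because the BioNLP variant uses the same line syntax split across `.a1` and `.a2` files, the same function could plausibly be applied to the concatenated contents of both files, although that has not been checked against every corpus in the tables.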
Corpus annotation tools
| ID | Tool Name | Platform Compatibility | Output Format | Charges & License Information | Latest Stable Release |
|---|---|---|---|---|---|
| T01 | AlvisAE | Web-based (RESTful web app) | JSON | Free (Open Source); no license provided | 2016 |
| T02 | BRAT Rapid Annotation Tool | Web-based (Python package) | BRAT Standoff | Free; MIT License | v1.3 Crunchy Frog (Nov 8, 2012) |
| T03 | TextAE | Online/Web-based (Python package) | JSON | Free (Open Source); MIT License | v4.5.4 (Mar 1, 2017) |