A comprehensive review of existing corpora and methods for creating annotated corpora for event extraction tasks

Open Access | Nov 2024

References

  1. Abdullah, M. H. A., Aziz, N., Abdulkadir, S. J., Akhir, E. A. P., & Talpur, N. (2022). Event detection and information extraction strategies from text: A preliminary study using GENIA corpus. In International Conference on Emerging Technologies and Intelligent Systems (pp. 118-127). Cham: Springer International Publishing.
  2. Abdullah, M. H. A., Aziz, N., Abdulkadir, S. J., Alhussian, H. S. A., & Talpur, N. (2023). Systematic literature review of information extraction from textual data: Recent methods, applications, trends, and challenges. IEEE Access, 11, 10535-10562. https://doi.org/10.1109/ACCESS.2023.3240898
  3. Adnan, K., & Akbar, R. (2019a). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1), 91. https://doi.org/10.1186/s40537-019-0254-8
  4. Adnan, K., & Akbar, R. (2019b). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1), 91. https://doi.org/10.1186/s40537-019-0254-8
  5. Adnan, K., Akbar, R., Khor, S. W., & Ali, A. B. A. (2019). Role and challenges of unstructured big data in healthcare. In N. Sharma, A. Chakrabarti, & V. E. Balas (Eds.), Data management, analytics and innovation: Proceedings of ICDMAI 2019 (Vol. 1, pp. 301-323). Springer.
  6. Akkurt, F., Gungor, O., Marşan, B., Gungor, T., Ozturk Basaran, B., Özgür, A., & Uskudarli, S. (2024). Evaluating the quality of a corpus annotation scheme using pretrained language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 6504-6514). Torino, Italia.
  7. Akmal, M., & Romadhony, A. (2020). Corpus development for Indonesian product named entity recognition using semi-supervised approach. In 2020 international conference on data science and its applications (ICoDSA) (pp. 1-5). IEEE.
  8. Alkaissi, H., & McFarlane, S. I. (2023). Artificial hallucinations in ChatGPT: Implications in scientific writing. Cureus, 15(2), e35179. https://doi.org/10.7759/cureus.35179
  9. Bossy, R., Jourde, J., Manine, A.-P., Veber, P., Alphonse, E., van de Guchte, M., Bessières, P., & Nédellec, C. (2012). BioNLP Shared Task - The Bacteria Track. BMC Bioinformatics, 13(Suppl 11), S3. https://doi.org/10.1186/1471-2105-13-S11-S3
  10. Buchholz, S., & Marsi, E. (2006). CoNLL-X Shared Task on multilingual dependency parsing. In Proceedings of the tenth conference on computational natural language learning (CoNLL-X) (pp. 149-164).
  11. Cohen, K. B., Lanfranchi, A., Choi, M. J., Bada, M., Baumgartner, W. A., Jr., Panteleyeva, N., Verspoor, K., Palmer, M., & Hunter, L. E. (2017). Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics, 18(1), 372. https://doi.org/10.1186/s12859-017-1775-9
  12. Csanády, B., Muzsai, L., Vedres, P., Nádasdy, Z., & Lukács, A. (2024). LlamBERT: Large-scale low-cost data annotation in NLP. arXiv. https://doi.org/10.48550/arXiv.2403.15938
  13. Deléger, L., Bossy, R., Chaix, E., Ba, M., Ferré, A., Bessières, P., & Nédellec, C. (2016). Overview of the Bacteria Biotope Task at BioNLP Shared Task 2016. In Proceedings of the 4th BioNLP Shared Task Workshop (pp. 12-22). Berlin, Germany.
  14. Frei, J., & Kramer, F. (2023). Annotated dataset creation through large language models for non-English medical NLP. Journal of Biomedical Informatics, 145, 104478. https://doi.org/10.1016/j.jbi.2023.104478
  15. Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digital Medicine, 6, Article 75. https://doi.org/10.1038/s41746-023-00774-5
  16. Gao, J., Zhao, H., Yu, C., & Xu, R. (2023). Exploring the feasibility of ChatGPT for event extraction. arXiv. https://doi.org/10.48550/arXiv.2303.03836 Retrieved March 01, 2023, from https://ui.adsabs.harvard.edu/abs/2023arXiv230303836G
  17. Grynbaum, M. M., & Mac, R. (2023). The Times sues OpenAI and Microsoft over A.I. use of copyrighted work. The New York Times. Retrieved 15 April 2024 from https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
  18. Hadi, M. U., Al-Tashi, Q., Qureshi, R., Shah, A., Muneer, A., Irfan, M., Zafar, A., Shaikh, M., Akhtar, N., Wu, J., & Mirjalili, S. (2023). Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. https://doi.org/10.36227/techrxiv.23589741.v4
  19. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), Article 248. https://doi.org/10.1145/3571730
  20. Jurafsky, D., Chai, J., Schluter, N., & Tetreault, J. (2020). Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online. https://aclanthology.org/2020.acl-main.0.pdf
  21. Kim, J.-D., Ohta, T., & Tsujii, J. (2008). Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9(1), 10. https://doi.org/10.1186/1471-2105-9-10
  22. Kim, J.-D., Wang, Y., Takagi, T., & Yonezawa, A. (2011). Overview of GENIA event task in BioNLP shared task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop (pp. 7-15). Portland, Oregon, USA.
  23. Kim, J.-D., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus - semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl 1), i180-i182. https://doi.org/10.1093/bioinformatics/btg1023
  24. Lever, J., Altman, R., & Kim, J.-D. (2020). Extending TextAE for annotation of non-contiguous entities. Genomics & Informatics, 18(2), e15. https://doi.org/10.5808/GI.2020.18.2.e15
  25. Li, G., Wang, P., Xie, J., Cui, R., & Deng, Z. (2022). FEED: A Chinese financial event extraction dataset constructed by distant supervision. In Proceedings of the 10th International Joint Conference on Knowledge Graphs. Virtual Event, Thailand. https://doi.org/10.1145/3502223.3502229
  26. Li, M., Shi, T., Ziems, C., Kan, M.-Y., Chen, N. F., Liu, Z., & Yang, D. (2023). Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. arXiv. https://doi.org/10.48550/arXiv.2310.15638
  27. Li, Z. (2023). The dark side of ChatGPT: legal and ethical challenges from stochastic parrots and hallucination. arXiv. https://doi.org/10.48550/arXiv.2304.14347
  28. Lin, Y. (2020). Multilingual multitask joint neural information extraction (Doctoral dissertation, University of Illinois at Urbana-Champaign). https://hdl.handle.net/2142/109521
  29. Lin, Y., Ji, H., Huang, F., & Wu, L. (2020). A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7999-8009).
  30. Linguistic Data Consortium (2005). ACE (Automatic Content Extraction) English annotation guidelines for events. https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf
  31. Liu, X., Luo, Z., & Huang, H. (2018). Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium.
  32. McIntosh, T. R., Liu, T., Susnjak, T., Watters, P., Ng, A., & Halgamuge, M. N. (2024). A culturally sensitive test to evaluate nuanced GPT hallucination. IEEE Transactions on Artificial Intelligence, 5(6), 2739-2751. https://doi.org/10.1109/TAI.2023.3332837
  33. Metz, C., & Robertson, K. (2024). OpenAI Seeks to Dismiss Parts of The New York Times’s Lawsuit. The New York Times. Retrieved 15 April 2024 from https://www.nytimes.com/2024/02/27/technology/openai-new-york-times-lawsuit.html
  34. Mirzakhmedova, N., Gohsen, M., Chang, C. H., & Stein, B. (2024). Are large language models reliable argument quality annotators? In Conference on Advances in Robust Argumentation Machines (pp. 129-146). Cham: Springer Nature Switzerland.
  35. Nawaz, R., Thompson, P., McNaught, J., & Ananiadou, S. (2010). Meta-Knowledge Annotation of Bio-Events. In LREC (Vol. 17, pp. 2498-2507).
  36. Nédellec, C., Bossy, R., Chaix, E., & Deléger, L. (2018). Text-mining and ontologies: New approaches to knowledge discovery of microbial diversity. arXiv. https://doi.org/10.48550/arXiv.1805.04107
  37. Neves, M., & Leser, U. (2012). A survey on annotation tools for the biomedical literature. Briefings in Bioinformatics, 15(2), 327-340. https://doi.org/10.1093/bib/bbs084
  38. Neves, M., & Ševa, J. (2019). An extensive review of tools for manual annotation of documents. Briefings in Bioinformatics, 22(1), 146-163. https://doi.org/10.1093/bib/bbz130
  39. O’Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. In Proceedings of the XXVI Congreso de AESLA (Vol. 3, p. 5). Almería, Spain.
  40. Ohta, T., Kim, J.-D., & Tsujii, J. (2007). Guidelines for event annotation. Department of Information Science, Graduate School of Science, University of Tokyo.
  41. Ohta, T., Pyysalo, S., Rak, R., Rowley, A., Chun, H.-W., Jung, S.-J., Choi, S.-P., Ananiadou, S., & Tsujii, J. (2013). Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop. Sofia, Bulgaria.
  42. Ohta, T., Pyysalo, S., & Tsujii, J. (2011). Overview of the epigenetics and post-translational modifications (EPI) task of BioNLP shared task 2011. In Proceedings of BioNLP Shared Task 2011 Workshop (pp. 16-25).
  43. Papazian, F., Bossy, R., & Nédellec, C. (2012). AlvisAE: A collaborative web text annotation editor for knowledge acquisition. In Proceedings of the Sixth Linguistic Annotation Workshop (pp. 149-152).
  44. Pestian, J., Brew, C., Matykiewicz, P., Hovermale, D. J., Johnson, N., Cohen, K. B., & Duch, W. (2007). A shared task involving multi-label classification of clinical free text. In Biological, translational, and clinical language processing (pp. 97-104).
  45. Pyysalo, S., Ohta, T., & Ananiadou, S. (2013). Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop. Sofia, Bulgaria.
  46. Pyysalo, S., Ohta, T., Miwa, M., Cho, H.-C., Tsujii, J., & Ananiadou, S. (2012). Event extraction across multiple levels of biological organization. Bioinformatics, 28(18), i575-i581. https://doi.org/10.1093/bioinformatics/bts407
  47. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., Sobral, B., Tsujii, J., & Ananiadou, S. (2011a). Annotation guidelines for infectious diseases event corpus. Technical report, Tsujii Laboratory, University of Tokyo.
  48. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., Sobral, B., Tsujii, J., & Ananiadou, S. (2011b). Overview of the Infectious Diseases (ID) task of BioNLP Shared Task 2011. In J. Tsujii, J.-D. Kim, & S. Pyysalo (Eds.), Proceedings of BioNLP Shared Task 2011 Workshop. Portland, Oregon, USA.
  49. Pyysalo, S., Ohta, T., Rak, R., Sullivan, D., Mao, C., Wang, C., Sobral, B., Tsujii, J., & Ananiadou, S. (2012). Overview of the ID, EPI and REL tasks of BioNLP shared task 2011. BMC Bioinformatics, 13(Suppl 11), S2. https://doi.org/10.1186/1471-2105-13-S11-S2
  50. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., & Tsujii, J. (2012). BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (pp. 102-107).
  51. Stenetorp, P., Topić, G., Pyysalo, S., Ohta, T., Kim, J.-D., & Tsujii, J. (2011). BioNLP Shared Task 2011: Supporting resources. In Proceedings of BioNLP Shared Task 2011 Workshop. Portland, Oregon, USA.
  52. Talpur, N., Abdulkadir, S. J., Alhussian, H., Hasan, M. H., Aziz, N., & Bamhdi, A. (2022). A comprehensive review of deep neuro-fuzzy system architectures and their optimization methods. Neural Computing and Applications, 34(3), 1837-1875. https://doi.org/10.1007/s00521-021-06807-9
  53. Talpur, N., Abdulkadir, S. J., Akhir, E. A. P. A., Hasan, M. H., Alhussian, H., & Abdullah, M. H. A. (2023). A novel bitwise arithmetic optimization algorithm for the rule base optimization of deep neuro-fuzzy system. Journal of King Saud University-Computer and Information Sciences, 35(2), 821-842. https://doi.org/10.1016/j.jksuci.2023.01.020
  54. Tan, Z., Beigi, A., Wang, S., Guo, R., Bhattacharjee, A., Jiang, B., Karami, M., Li, J., Cheng, L., & Liu, H. (2024). Large language models for data annotation: A survey. arXiv. https://doi.org/10.48550/arXiv.2402.13446
  55. Törnberg, P. (2024). Best practices for text annotation with large language models. arXiv. https://doi.org/10.48550/arXiv.2402.05129
  56. Vauth, M., Hatzel, H. O., Gius, E., & Biemann, C. (2021). Automated event annotation in literary texts. In Proceedings of the Conference on Computational Humanities Research, CHR2021 (pp. 333-345). Amsterdam, The Netherlands.
  57. Walker, C., Strassel, S., Medero, J., & Maeda, K. (2006). ACE 2005 multilingual training corpus (LDC2006T06) [Data set]. Linguistic Data Consortium. https://doi.org/10.35111/mwxc-vh88
  58. Wang, X., Wang, Z., Han, X., Jiang, W., Han, R., Liu, Z., Li, J., Li, P., Lin, Y., & Zhou, J. (2020). MAVEN: A massive general domain event detection dataset. arXiv. https://doi.org/10.48550/arXiv.2004.13590
  59. Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Arunkumar, A., Ashok, A., Dhanasekaran, A. S., Naik, A., Stap, D., Pathak, E., Karamanolakis, G., Lai, H. G., Purohit, I., Mondal, I., Anderson, J., Kuznia, K., Doshi, K., Patel, M., … Khashabi, D. (2022). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates. https://arxiv.org/abs/2204.07705
  60. Wu, H., Lei, Q., Zhang, X., & Luo, Z. (2020). Creating a large-scale financial news corpus for relation extraction. In 2020 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD) (pp. 259-263). IEEE. https://doi.org/10.1109/ICAIBD49809.2020.913744
  61. Xi, X., Lv, J., Liu, S., Ye, W., Yang, F., & Wan, G. (2022). MUSIED: A benchmark for event detection from multi-source heterogeneous informal texts. arXiv. https://doi.org/10.48550/arXiv.2211.13896
  62. Xu, R., Liu, T., Li, L., & Chang, B. (2021). Document-level event extraction via heterogeneous graph-based interaction model with a tracker. In C. Zong, F. Xia, W. Li, & R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online.
  63. Yang, H., Chen, Y., Liu, K., Xiao, Y., & Zhao, J. (2018). DCFEE: A document-level Chinese financial event extraction system based on automatically labeled training data. In Proceedings of ACL 2018, System Demonstrations (pp. 50-55).
  64. Yao, F., Xiao, C., Wang, X., Liu, Z., Hou, L., Tu, C., Li, J., Liu, Y., Shen, W., & Sun, M. (2022). LEVEN: A large-scale Chinese legal event detection dataset. arXiv. https://doi.org/10.48550/arXiv.2203.08556
  65. Zaman, G., Mahdin, H., Hussain, K., & Rahman, A. (2020). Information extraction from semi-and unstructured data sources: A systematic literature review. ICIC Express Letters, 14(6), 593-603.
  66. Zheng, S., Cao, W., Xu, W., & Bian, J. (2019). Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 337-346). Hong Kong, China. https://doi.org/10.18653/v1/D19-1032
DOI: https://doi.org/10.2478/jdis-2024-0029 | Journal eISSN: 2543-683X | Journal ISSN: 2096-157X
Language: English
Page range: 196 - 238
Submitted on: Apr 27, 2024
Accepted on: Sep 3, 2024
Published on: Nov 19, 2024
Published by: Chinese Academy of Sciences, National Science Library
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2024 Mohd Hafizul Afifi Abdullah, Norshakirah Aziz, Said Jadid Abdulkadir, Kashif Hussain, Hitham Alhussian, Noureen Talpur, published by Chinese Academy of Sciences, National Science Library
This work is licensed under the Creative Commons Attribution 4.0 License.