EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents
Bureau for Rapid Annotation Tool: Collaboration can do More over Variety-oriented Annotations
Zheng [email protected]
Institute of Scientific and Technical Information of China
Haidian District, Beijing, P. R. China
Shuo Xu∗
[email protected] University of Technology
Chaoyang District, Beijing, P. R. China
ABSTRACT
A high-quality manually annotated corpus is crucial for many text mining and information extraction tasks. Several workbenches have been developed in the literature to facilitate collaborative annotation. However, given the growing volumes of un-annotated documents, these variety-oriented annotation workbenches have many shortcomings in terms of teamwork, quality control and time effort. To address them, we develop a novel workbench so that collaboration can do more over variety-oriented annotation. Our workbench is named Bureau for Rapid Annotation Tool (Brat for short). Its main functionalities include an enhanced semantic constraint system, Vim-like shortcut keys, an annotation filter and a graph-visualizing annotation browser. To date, over 500,000 mentions have been annotated with our Brat workbench.
1 INTRODUCTION
A high-quality manually annotated corpus is crucial for many text mining and information extraction tasks [1, 3, 4, 6, 7, 14, 15]. Several workbenches have been developed in the literature to facilitate collaborative annotation [8, 11, 13, 16]. However, given the growing volumes of un-annotated documents, these variety-oriented workbenches still have many shortcomings in terms of teamwork, quality control and time effort. Take the sentence "Depending on the model, a Tesla costs somewhere between 1 and 3.33 BTC" [2] as an example. A practical issue is whether or not to assign the "Price" type to the mention "between 1 and 3.33 BTC". The answer depends on a consensus that acknowledges Bitcoin as actual money [5, 17].
Reaching this consensus is extremely time-consuming and relies heavily on the two types of annotation collaboration in Table 1: grounded collaboration and trusted collaboration. By grounded collaboration, we mean that the resulting annotations are restricted by sound pre-arrangements. For example, U-Compare only supports named entity annotations in the UIMA-type system [10], which avoids many conflicts. An alternative [13] takes the form of semantic constraints: a given relationship should only take parameters of specific entity types. As for trusted collaboration, YEDDA [16] adopted common gestures from BRAT [13] and embedded many functionalities, including teamwork, multi-annotator analysis and pairwise annotator comparison. On the basis of the various annotations, the inter-project agreement can then be calculated. Another strategy of trusted collaboration, the user-independent workspace, was utilized in TeamTat [8].
This paper combines these two types of annotation collaboration to structure the various mentions annotated by each annotator, and develops a workbench named Bureau for Rapid Annotation Tool (Brat for short). Its main functionalities include an enhanced semantic constraint system, Vim-like shortcut keys, an annotation filter and a graph-visualizing annotation browser. To date, over 500,000 mentions have been annotated with our Brat workbench.

Table 1: Two types of annotation collaborations used in previous workbenches.

Grounded                               Trusted
UIMA-type system [10]                  Teamwork [8, 16]
Annotation semantic constraint [13]    Personal workspace TeamTat [16]
                                       Multi-annotator analysis [8]
                                       Pairwise annotators comparison [8]

∗Corresponding author
2 FUNCTIONALITIES
2.1 Enhanced Semantic Constraint System
It is well known that not all parameters are valid for a specific relationship. To limit invalid annotation results in an annotation project, its manager can customize the schema at any time. Once the schema is modified, all affected annotated mentions are adjusted correspondingly. A readable name is usually assigned to each type of entity and relation. In addition, a list of rules is attached to each type of relation to express the constraints between its parameters. In this way, the manager's understanding of entities and relations can be delivered to all annotators.
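The rules described above can be pictured with a minimal sketch. It is not the workbench's actual schema format, which the paper does not specify. The class names `Schema` and `RelationRule`, the function `revalidate`, the relation `costs`, and the entity types `Product` and `Company` are all hypothetical; only the `Price` type comes from the example in the introduction. The sketch assumes binary relations whose two parameters are each restricted to a set of entity types, and it flags relations that no longer satisfy the schema after a change.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RelationRule:
    """Allowed entity types for each parameter of one relation type."""
    name: str              # readable relation name
    arg1_types: frozenset  # entity types accepted as first parameter
    arg2_types: frozenset  # entity types accepted as second parameter


@dataclass
class Schema:
    """Hypothetical project schema: entity types plus relation rules."""
    entity_types: set = field(default_factory=set)
    rules: dict = field(default_factory=dict)  # relation name -> RelationRule

    def add_rule(self, name, arg1_types, arg2_types):
        # Refuse rules that mention entity types the schema does not define.
        unknown = (set(arg1_types) | set(arg2_types)) - self.entity_types
        if unknown:
            raise ValueError(f"unknown entity types: {sorted(unknown)}")
        self.rules[name] = RelationRule(
            name, frozenset(arg1_types), frozenset(arg2_types)
        )

    def is_valid(self, relation, arg1_type, arg2_type):
        rule = self.rules.get(relation)
        return (
            rule is not None
            and arg1_type in rule.arg1_types
            and arg2_type in rule.arg2_types
        )


def revalidate(schema, relations):
    """Split (relation, arg1_type, arg2_type) triples into kept and flagged."""
    kept, flagged = [], []
    for rel in relations:
        (kept if schema.is_valid(*rel) else flagged).append(rel)
    return kept, flagged


schema = Schema(entity_types={"Product", "Price", "Company"})
schema.add_rule("costs", {"Product"}, {"Price"})

annotated = [("costs", "Product", "Price"), ("costs", "Company", "Price")]
kept, flagged = revalidate(schema, annotated)
```

In this sketch a schema change is followed by a call to `revalidate`, so that annotations violating the new rules surface for review. Whether the real workbench adjusts, flags or removes such annotations is not stated in the paper.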
2.2 Vim-like Shortcut Keys
According to our observation, the conventional annotation operations (marking, selecting and confirming [13]) become time-consuming when a proper candidate must be chosen from more than five entity types or relationships. To speed up the annotation procedure, our workbench embeds many Vim-like shortcut keys [12]. With them, one can annotate an entity smoothly in the following steps (cf. Figure 2): 1) move the cursor and select a span of text with the keys shown in Figure 1, 2) accept one command from the recommended candidates with TAB and ENTER, 3) type the leading characters of an entity type and confirm it. Similar operations can be followed for relation mention annotation. It is worth noting that the key feature of this functionality is code auto-completion, which is based on the enhanced semantic constraint system and polymorphic type inference [9].
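The auto-completion step can be illustrated with a small sketch under stated assumptions. It only shows prefix matching over candidates that a semantic constraint table admits, and it does not attempt the polymorphic type inference [9] mentioned above. The function names `complete` and `relation_candidates`, the entity types other than `Price`, and the relation names are hypothetical and do not come from the workbench.

```python
def complete(prefix, candidates):
    """Return candidates that start with the typed prefix (case-insensitive)."""
    p = prefix.lower()
    return sorted(c for c in candidates if c.lower().startswith(p))


def relation_candidates(rules, arg1_type, arg2_type):
    """Relation names whose rule admits the two selected entity types."""
    return [
        name
        for name, (arg1_types, arg2_types) in rules.items()
        if arg1_type in arg1_types and arg2_type in arg2_types
    ]


# Hypothetical schema: entity types and, per relation, allowed parameter types.
ENTITY_TYPES = ["Price", "Product", "Person", "Company"]
RULES = {
    "costs": ({"Product"}, {"Price"}),
    "made_by": ({"Product"}, {"Company"}),
    "employs": ({"Company"}, {"Person"}),
}

# Typing "pr" narrows the entity types to those with that prefix.
entity_matches = complete("pr", ENTITY_TYPES)

# For a (Product, Price) pair, only relations allowed by the rules are offered.
relation_matches = complete("", relation_candidates(RULES, "Product", "Price"))
```

Filtering by the constraint table before matching the prefix keeps the candidate list short, which is what makes confirming a choice with a couple of keystrokes practical.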
Figure 1: Vim-like Shortcut Keys Mapping

Figure 2: The novel annotation procedure powered by Vim-like Shortcut Keys

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2.3 Configurable Annotation Filter
It is not known in advance how many mentions should be annotated in a single document, especially a very long one. To correct wrong annotations in time and reduce conflicts across multiple documents, a feasible solution is to display only the mentions of the types of interest in the current workspace. We therefore provide a configurable annotation filter that works by toggling entity types and relationships on or off.
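A filter of this kind can be sketched in a few lines. This is an illustration under assumptions, not the workbench's implementation: the `Mention` record, the `AnnotationFilter` class and the `Product` type are hypothetical, and every type is assumed to be visible until it is toggled off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    doc_id: str
    text: str
    type: str


class AnnotationFilter:
    """Keeps a set of visible types; toggling a type flips its visibility."""

    def __init__(self, all_types):
        self.visible = set(all_types)  # everything is shown by default

    def toggle(self, type_name):
        # Symmetric difference removes the type if present, adds it otherwise.
        self.visible ^= {type_name}

    def apply(self, mentions):
        return [m for m in mentions if m.type in self.visible]


mentions = [
    Mention("doc1", "between 1 and 3.33 BTC", "Price"),
    Mention("doc1", "Tesla", "Product"),
]

view = AnnotationFilter({"Price", "Product"})
view.toggle("Product")          # hide Product mentions
shown = view.apply(mentions)    # only the Price mention remains
```

Calling `toggle("Product")` a second time would bring the hidden mentions back, matching the toggle and un-toggle behaviour described above.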
2.4 Graph-visualizing Browser
In real-world scenarios, it is not trivial to reach an agreement when multiple annotators are involved and an entity or relation is mentioned in multiple documents at once. To inspect the underlying disagreements, our workbench can load and index all texts, mentions and their types, and then visualize them in a graph browser, as illustrated in Figure 3.
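One way to picture the indexing behind such a browser is the sketch below. It is an assumption-laden illustration rather than the workbench's design: the tuple layout, the function names, the annotator identifiers and the types `Component` and `Medium` are hypothetical, and mentions are matched by lower-cased surface form only. The index maps each mention to the types it received and to where each type was assigned, which yields both the graph edges to draw and the mentions on which annotators disagree.

```python
from collections import defaultdict


def build_index(annotations):
    """Map mention text -> type -> set of (document, annotator) pairs."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, annotator, text, type_name in annotations:
        index[text.lower()][type_name].add((doc_id, annotator))
    return index


def disagreements(index):
    """Mentions that were assigned more than one type."""
    return {
        text: dict(types) for text, types in index.items() if len(types) > 1
    }


def edges(index):
    """Graph edges (mention, type, weight); weight counts the occurrences."""
    return sorted(
        (text, type_name, len(where))
        for text, types in index.items()
        for type_name, where in types.items()
    )


# Hypothetical annotations: (document, annotator, mention text, type).
annotations = [
    ("doc1", "ann_a", "Disk", "Component"),
    ("doc2", "ann_b", "disk", "Component"),
    ("doc2", "ann_a", "disk", "Medium"),
    ("doc3", "ann_a", "spindle", "Component"),
]

index = build_index(annotations)
conflicts = disagreements(index)
graph_edges = edges(index)
```

Here `conflicts` contains only the mention "disk", because it was typed in two different ways, and `graph_edges` lists the mention-to-type links that a graph view could render with edge weights.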
3 CONCLUSION
Many projects have used our Brat workbench to annotate entities and/or relations of interest, and to inspect potential conflicts over variety-oriented annotations. To date, over 500,000 mentions have been annotated with our Brat workbench. In the near future, the Vim-like shortcut keys will be strengthened further, and machine learning methods will be incorporated to accelerate conflict inspection.
Figure 3: The visualization of the entity "Disk" and its relevant information in the TFH-2020 corpus [4] via our browser
ACKNOWLEDGMENTS
This work is supported partially by the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA16040504), National Key Research & Development Program of China (Grant No. 2019YFA0707202), and National Natural Science Foundation of China (Grant No. 71704169 and 72074014). We also thank Professor Yiming Jing and Rui Zheng for their assistance on how to understand the collaboration in the field of psychology.
REFERENCES
[1] Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. Lecture Notes in Computer Science 4825 LNCS (2007), 722–735.
[2] Jeff Benson. 2021. Here’s How Much a Fully Loaded Tesla Model S Will Cost You in Bitcoin. https://decrypt.co/57071/heres-how-much-a-fully-loaded-tesla-model-s-will-cost-you-in-bitcoin [Online; accessed 16-March-2021].
[3] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 1247–1250.
[4] Liang Chen, Shuo Xu, Lijun Zhu, Jing Zhang, Xiao-ping Lei, and Guancan Yang. 2020. A Deep Learning based Method for Extracting Semantic Information from Patent Documents. Scientometrics 125, 1 (2020), 289–312.
[5] Vanessa Dirwai. 2021. Should Christians Trade Bitcoin And Other Cryptocurrencies? https://preciousearnings.medium.com/should-christians-trade-bitcoin-and-other-cryptocurrencies-878441e702c5 [Online; accessed 22-August-2021].
[6] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: a Web-scale Approach to Probabilistic Knowledge Fusion. In Proceedings of the 20th ACM SIGKDD International Conference. 601–610.
[7] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. 2008. Open Information Extraction from the Web. Commun. ACM 51, 12 (2008), 68.
[8] Rezarta Islamaj, Dongseop Kwon, Sun Kim, and Zhiyong Lu. 2020. TeamTat: A Collaborative Text Annotation Tool. CoRR abs/2004.11894 (2020).
[9] Steven L. Jenkins and Gary T. Leavens. 1995. Polymorphic Type Inference in Scheme. Computer Science Technical Reports 75 (1995).
[10] Yoshinobu Kano, William A. Baumgartner Jr., Luke McCrohon, Sophia Ananiadou, K. Bretonnel Cohen, Lawrence Hunter, and Jun’ichi Tsujii. 2009. U-Compare: Share and Compare Text Mining Tools with UIMA. Bioinform. 25, 15 (2009), 1997–1998.
[11] Mariana L. Neves and Ulf Leser. 2014. A Survey on Annotation Tools for the Biomedical Literature. Briefings Bioinform 15, 2 (2014), 327–340.
[12] Kim Schulz. 2007. Hacking Vim: A Cookbook to Get the Most Out of The Latest Vim Editor. Packt Publishing Ltd.
[13] Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun’ichi Tsujii. 2012. BRAT: A Web-based Tool for NLP-Assisted Text Annotation. In Conference of the 13th European Chapter of the Association for Computational Linguistics, Walter Daelemans, Mirella Lapata, and Lluís Màrquez (Eds.). 102–107.
[14] Zheng Wang, Shuo Xu, and Lijun Zhu. 2018. Semantic Relation Extraction Aware of N-Gram Features from Unstructured Biomedical Text. Journal of Biomedical Informatics 86 (2018), 59–70. https://doi.org/10.1016/j.jbi.2018.08.011
[15] Shuo Xu, Xin An, Lijun Zhu, Yunliang Zhang, and Haodong Zhang. 2015. A CRF-based System for Recognizing Chemical Entity Mentions (CEMs) in Biomedical Literature. Journal of Cheminformatics 7, Suppl 1 (2015), S11. https://doi.org/10.1186/1758-2946-7-S1-S11
[16] Jie Yang, Yue Zhang, Linwei Li, and Xingxuan Li. 2018. YEDDA: A Lightweight Collaborative Text Span Annotation Tool. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Fei Liu and Thamar Solorio (Eds.). 31–36.
[17] David Yermack. 2015. Chapter 2 - Is Bitcoin a Real Currency? An Economic Appraisal. In Handbook of Digital Currency, David Lee Kuo Chuen (Ed.). Academic Press, San Diego, 31–43.