arXiv:1908.11302v1 [cs.CL] 29 Aug 2019


HARE: a Flexible Highlighting Annotator for Ranking and Exploration

Denis Newman-Griffis†,‡ and Eric Fosler-Lussier†
†Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH
‡Rehabilitation Medicine Dept., Clinical Center, National Institutes of Health, Bethesda, MD
{newman-griffis.1, fosler-lussier.1}@osu.edu

Abstract

Exploration and analysis of potential data sources is a significant challenge in the application of NLP techniques to novel information domains. We describe HARE, a system for highlighting relevant information in document collections to support ranking and triage, which provides tools for post-processing and qualitative analysis for model development and tuning. We apply HARE to the use case of narrative descriptions of mobility information in clinical data, and demonstrate its utility in comparing candidate embedding features. We provide a web-based interface for annotation visualization and document ranking, with a modular backend to support interoperability with existing annotation tools.

1 Introduction

As natural language processing techniques become useful for an increasing number of new information domains, it is not always clear how best to identify information of interest, or to evaluate the output of automatic annotation tools. This can be especially challenging when target data take the form of long strings or narratives with complex structure, e.g., in financial data (Fisher et al., 2016) or clinical data (Rosenbloom et al., 2011).

We introduce HARE, a Highlighting Annotator for Ranking and Exploration. HARE includes two main components: a workflow for supervised training of automated token-wise relevancy taggers, and a web-based interface for visualizing and analyzing automated tagging output. It is intended to serve two main purposes: (1) triage of documents when analyzing new corpora for the presence of relevant information, and (2) interactive analysis, post-processing, and comparison of output from different annotation systems.

In this paper, we demonstrate an application of HARE to information about individuals' mobility status, an important aspect of functioning concerned with changing body position or location. This is a relatively new type of health-related narrative information with largely uncharacterized linguistic structure, and high relevance to overall health outcomes and work disability programs. In experiments on a corpus of 400 clinical records, we show that with minimal tuning, our tagger is able to produce a high-quality ranking of documents based on their relevance to mobility, and to capture document segments likely to contain mobility information with high fidelity. We further demonstrate the use of the post-processing and qualitative analytic components of our system to compare the impact of different feature sets and to tune processing settings to improve relevance tagging quality.

2 Related work

Corpus annotation tools are plentiful in NLP research: brat (Stenetorp et al., 2012) and Knowtator (Ogren, 2006) being two heavily used examples among many. However, the primary purpose of these tools is to streamline manual annotation by experts, and to support review and revision of manual annotations. Some tools, including brat, support automated pre-annotation, but analysis of these annotations and corpus exploration is not commonly included. Other tools, such as SciKnowMine,1 use automated techniques for triage, but for routing to experts for curation rather than ranking and model analysis. Document ranking and search engines such as Apache Lucene,2 by contrast, can be more fully featured than early-stage analysis of new datasets requires, and do not directly offer tools for annotation and post-processing.

Early efforts towards extracting mobility information have illustrated that it is often syntactically and semantically complex, and difficult to extract reliably (Newman-Griffis and Zirikly, 2018; Newman-Griffis et al., 2019).

1 https://www.isi.edu/projects/sciknowmine/overview

2 https://lucene.apache.org/


                                SpaCy   WordPiece
Num documents                     400
Avg tokens per doc                537         655
Avg mobility tokens per doc        97         112
Avg mobility segments per doc     9.2

Table 1: Statistics for dataset of mobility information, using SpaCy and WordPiece tokenization.

Some characterization of mobility-related terms has been performed as part of larger work on functioning (Skube et al., 2018), but a lack of standardized terminologies limits the utility of vocabulary-driven clinical NLP tools such as CLAMP (Soysal et al., 2018) or cTAKES (Savova et al., 2010). Thus, it forms a useful test case for HARE.

3 System Description

Our system has three stages for analyzing document sets, illustrated in Figure 1. First, data annotated by experts for token relevance can be used to train relevance tagging models, and trained models can be applied to produce relevance scores on new documents (Section 3.1). Second, we provide configurable post-processing tools for cleaning and smoothing relevance scores (Section 3.2). Finally, our system includes interfaces for reviewing detailed relevance output, ranking documents by their relevance to the target criterion, and analyzing qualitative outcomes of relevance scoring output (Sections 3.3-3.5); all of these interfaces allow interactive re-configuration of post-processing settings and switching between output relevance scores from different models for comparison.

For our experiments on mobility information, we use an extended version of the dataset described by Thieu et al. (2017), which consists of 400 English-language Physical Therapy initial assessment and reassessment notes from the Rehabilitation Medicine Department of the NIH Clinical Center. These text documents have been annotated at the token level for descriptions and assessments of patient mobility status. Further information on this dataset is given in Table 1. We use ten-fold cross validation for our experiments, splitting into folds at the document level.

Figure 1: HARE workflow for working with a set of documents; outlined boxes indicate automated components, and gray boxes signify user interfaces.

3.1 Relevance tagging workflow

All hyperparameters discussed in this section were tuned on held-out development data in cross-validation experiments. We report the best settings here, and provide a full comparison of hyperparameter settings in Appendix A.

3.1.1 Preprocessing

Different domains exhibit different patterns in token and sentence structure that affect preprocessing. In clinical text, tokenization is not a consensus issue, and a variety of different tokenizers are used regularly (Savova et al., 2010; Soysal et al., 2018). As mobility information is relatively unexplored, we relied on general-purpose tokenization with spaCy (Honnibal and Montani, 2017) as our default tokenizer, and WordPiece (Wu et al., 2016) for experiments using BERT. We did not apply sentence segmentation, as clinical toolkits often produced short segments that interrupted mobility information in our experiments.
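As a minimal illustration of these two tokenization paths, the sketch below uses spaCy's blank English tokenizer and a Hugging Face WordPiece tokenizer; the clinicalBERT checkpoint name and the example sentence are illustrative, not taken from our pipeline.

```python
# Sketch of the two tokenization paths: general-purpose spaCy tokenization,
# and WordPiece tokenization for BERT-based features.
import spacy
from transformers import AutoTokenizer

nlp = spacy.blank("en")  # tokenizer only; no sentence segmentation is applied
wordpiece = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")  # illustrative checkpoint

def spacy_tokens(text):
    return [tok.text for tok in nlp(text)]

def wordpiece_tokens(text):
    return wordpiece.tokenize(text)

line = "Pt ambulated 150 ft with rolling walker, min assist."  # hypothetical example
print(spacy_tokens(line))
print(wordpiece_tokens(line))
```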

3.1.2 Feature extraction

Our system supports feature extraction for individual tokens in input documents using both static and contextualized word embeddings.

Static embeddings Using static (i.e., non-contextualized) embeddings, we calculate input features for each token as the mean embedding of the token and 10 words on each side (truncated at sentence/line breaks). We used FastText (Bojanowski et al., 2017) embeddings trained on a 10-year collection of physical and occupational therapy records from the NIH Clinical Center.
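A minimal sketch of this windowed-mean feature construction is shown below, assuming `vectors` is a dict-like mapping from tokens to NumPy arrays (e.g., loaded FastText vectors); all names are illustrative.

```python
import numpy as np

def window_mean_features(tokens, vectors, dim, window=10):
    """Mean of the token's embedding and up to `window` neighbors on each side."""
    feats = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)  # truncate at line bounds
        ctx = [vectors[t] for t in tokens[lo:hi] if t in vectors]
        feats.append(np.mean(ctx, axis=0) if ctx else np.zeros(dim))
    return np.stack(feats)
```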

ELMo (Peters et al., 2018) ELMo features are calculated for each token by taking the hidden states of the two bLSTM layers and the token layer, multiplying each vector by learned weights, and summing to produce a final embedding. Combination weights are trained jointly with the token annotation model. We used a 1024-dimensional ELMo model pretrained on PubMed data3 for our mobility experiments.

3 https://allennlp.org/elmo


Figure 2: Precision, recall, and F-2 when varying the binarization threshold from 0 to 1, using ELMo embeddings. The threshold corresponding to the best F-2 is marked with a dotted vertical line.

BERT (Devlin et al., 2019) For BERT features, we take the hidden states of the final k layers of the model; as with ELMo embeddings, these outputs are then multiplied by a learned weight vector, and the weighted layers are summed to create the final embedding vectors.4 We used the 768-dimensional clinicalBERT (Alsentzer et al., 2019) model5 in our experiments, extracting features from the last 3 layers.
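The learned layer-mixing step shared by the ELMo and BERT feature extractors can be sketched as below; the softmax normalization of the layer weights is an assumption, as we only state that the weights are learned jointly with the annotation model.

```python
import tensorflow as tf

class LayerMix(tf.keras.layers.Layer):
    """Combine the hidden states of k layers with learned scalar weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.raw_weights = self.add_weight(
            name="layer_weights", shape=(num_layers,),
            initializer="zeros", trainable=True)

    def call(self, layer_states):                  # layer_states: [batch, num_layers, dim]
        weights = tf.nn.softmax(self.raw_weights)  # assumed normalization of learned weights
        return tf.einsum("l,bld->bd", weights, layer_states)
```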

3.1.3 Automated token-level annotation

We model the annotation process of assigning a relevance score for each token using a feed-forward deep neural network that takes embedding features as input and produces a binomial softmax distribution as output. For mobility information, we used a DNN with three 300-dimensional hidden layers, ReLU activation, and 60% dropout.
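A minimal Keras sketch of this annotator is given below, with three 300-dimensional ReLU layers, 60% input dropout, and a two-class softmax; the optimizer and the exact dropout placement are assumptions not specified in the text.

```python
import tensorflow as tf

def build_annotator(input_dim):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dropout(0.6),                    # 60% input dropout (Appendix A)
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(300, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),  # relevant / irrelevant
    ])
    model.compile(optimizer="adam",                      # assumed optimizer
                  loss="sparse_categorical_crossentropy")
    return model
```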

As shown in Table 1, our mobility dataset is considerably imbalanced between relevant and irrelevant tokens. To adjust for this imbalance, for each epoch of training we used all of the relevant tokens in the training documents and sampled irrelevant tokens at a 75% ratio to produce a more balanced training set; negative points were re-sampled at each epoch. As token predictions are conditionally independent of one another given the embedding features, we did not maintain any sequence in the samples drawn. Relevant samples were weighted at a ratio of 2:1 during training.
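The per-epoch sampling scheme can be sketched as follows, assuming a feature matrix `X` and binary labels `y`; the function name is illustrative.

```python
import numpy as np

def sample_epoch(X, y, neg_ratio=0.75, rng=np.random):
    """Keep all relevant tokens; re-sample irrelevant tokens each epoch at neg_ratio."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(neg_ratio * len(pos)))
    neg = rng.choice(neg, size=n_neg, replace=False)
    idx = rng.permutation(np.concatenate([pos, neg]))
    weights = np.where(y[idx] == 1, 2.0, 1.0)   # 2:1 weighting of relevant samples
    return X[idx], y[idx], weights
```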

After each epoch, we evaluate the model on all tokens in a held-out 10% of the documents, and calculate F-2 score (preferring recall over precision) using 0.5 as the binarization threshold of model output. We use an early stopping threshold of 1e-05 on this F-2 score, with a patience of 5 epochs and a maximum of 50 epochs of training.
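The validation metric can be computed as sketched below (a recall-weighted F-beta with beta = 2, binarized at 0.5); using scikit-learn here is an assumption about tooling, not a statement about our implementation.

```python
import numpy as np
from sklearn.metrics import fbeta_score

def f2_at_threshold(y_true, scores, threshold=0.5):
    preds = (np.asarray(scores) >= threshold).astype(int)
    return fbeta_score(y_true, preds, beta=2)   # beta=2 weights recall over precision
```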

4 Note that as BERT is constrained to use WordPiece tokenization, it may use slightly longer token sequences than the other methods.

5 https://github.com/EmilyAlsentzer/clinicalBERT

Figure 3: Collapsing adjacent segments illustration: (a) no collapsing; (b) collapse one blank.

3.2 Post-processing methods

Given a set of token-level relevance annotations, HARE provides three post-processing techniques for analyzing and improving annotation results.

Decision thresholding The threshold for binarizing token relevance scores is configurable between 0 and 1, to support more or less conservative interpretation of model output; this is akin to exploring the precision/recall curve. Figure 2 shows precision, recall, and F-2 for different thresholding values from our mobility experiments, using scores from ELMo embeddings.

Collapsing adjacent segments We consider any contiguous sequence of tokens with scores at or above the binarization threshold to be a relevant segment. As shown in Figure 3, multiple segments may be interrupted by irrelevant tokens such as punctuation, or by noisy relevance scores falling below the binarization threshold. As multiple adjacent segments may inflate a document's overall relevance, our system includes a setting to collapse any adjacent segments that are separated by k or fewer tokens into a single segment.
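A sketch of this collapsing step is shown below, representing each segment as a (start, end) token-index pair with an exclusive end; names are illustrative.

```python
def collapse_segments(segments, k):
    """Merge adjacent segments separated by k or fewer tokens."""
    if not segments:
        return []
    merged = [segments[0]]
    for start, end in segments[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end <= k:          # gap of k or fewer tokens
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return merged

# e.g., collapse_segments([(2, 5), (6, 9), (15, 20)], k=1) -> [(2, 9), (15, 20)]
```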

Viterbi smoothing By modeling token-level decisions as conditionally independent of one another given the input features, we avoid assumptions of strict segment bounds, but introduce some noisy output, as shown in Figure 4. To reduce some of this noise, we include an optional smoothing component based on the Viterbi algorithm.

Figure 4: Illustration of Viterbi smoothing: (a) without smoothing; (b) with smoothing.


Figure 5: Annotation viewer interface.

We model the "relevant"/"irrelevant" state sequence discriminatively, using annotation model outputs as state probabilities for each timestep, and calculate the binary transition probability matrix by counting transitions in the training data. We use these estimates to decode the most likely relevance state sequence R for a tokenized line T in an input document, along with the corresponding path probability matrix W, where W_{j,i} denotes the likelihood of being in state j at time i given r_{i-1} and t_i. In order to produce continuous scores for each token, we then backtrace through R and assign score s_i to token t_i as the conditional probability that r_i is "relevant", given r_{i-1}. Let Q_{j,i} be the likelihood of transitioning from state R_{i-1} to j, conditioned on T_i, as:

    Q_{j,i} = \frac{W_{j,i}}{W_{R_{i-1},\,i-1}}    (1)

The final conditional probability s_i is calculated by normalizing over possible states at time i:

    s_i = \frac{Q_{1,i}}{Q_{0,i} + Q_{1,i}}    (2)

These smoothed scores can then be binarized using the configurable decision threshold.
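A sketch of this smoothing step is given below, assuming `emit[i, j]` holds the annotator's probability of state j (0 = irrelevant, 1 = relevant) at token i and `trans[a, b]` the count-estimated transition probabilities; it follows Eqs. (1)-(2), and working in probability space rather than log space is a simplification suitable only for short lines.

```python
import numpy as np

def viterbi_smooth(emit, trans):
    """Return smoothed per-token relevance scores s_i for one tokenized line."""
    n = emit.shape[0]
    W = np.zeros((2, n))                   # path probability matrix
    back = np.zeros((2, n), dtype=int)
    W[:, 0] = emit[0]
    for i in range(1, n):
        for j in range(2):
            cand = W[:, i - 1] * trans[:, j] * emit[i, j]
            back[j, i] = cand.argmax()
            W[j, i] = cand.max()
    R = np.zeros(n, dtype=int)             # backtrace the most likely state sequence
    R[-1] = W[:, -1].argmax()
    for i in range(n - 1, 0, -1):
        R[i - 1] = back[R[i], i]
    scores = np.empty(n)
    scores[0] = emit[0, 1]
    for i in range(1, n):
        Q = W[:, i] / W[R[i - 1], i - 1]   # Eq. (1)
        scores[i] = Q[1] / (Q[0] + Q[1])   # Eq. (2)
    return scores
```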

3.3 Annotation viewer

Annotations on any individual document can be viewed using a web-based interface, shown in Figure 5. All tokens with scores at or above the decision threshold are highlighted in yellow, with each contiguous segment shown in a single highlight. Configuration settings for post-processing methods are provided, and update the displayed annotations when changed. On click, each token will display the score assigned to it by the annotation model after post-processing. If the document being viewed is labeled with gold annotations, these are shown in bold red text. Additionally, document-level summary statistics and evaluation measures, with current post-processing, are displayed next to the annotations.

Figure 6: Ranking interface.

3.4 Document set ranking

3.4.1 Ranking methods

Relevance scoring methods are highly task-dependent, and may reflect different priorities such as information density or diversity of information returned. In this system, we provide three general-purpose relevance scorers, each of which operates after any post-processing.

Segments+Tokens Documents are scored by multiplying their number of relevant segments by a large constant and adding the number of relevant tokens, which breaks ties between documents with the same segment count. As relevant information may be sparse, no normalization by document length is used.

SumScores Documents are scored by summing the continuous relevance scores assigned to all of their tokens. As with the Segments+Tokens scorer, no adjustment is made for document length.

Density Document scores are the ratio of binarized relevant tokens to the total number of tokens.
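The three scorers can be sketched as follows, assuming `segments` is the post-processed list of relevant segments, `scores` the per-token relevance scores for a document, and `threshold` the binarization threshold; the constant used for tie-breaking is illustrative.

```python
def segments_tokens_score(segments, scores, threshold, big=10**6):
    relevant_tokens = sum(s >= threshold for s in scores)
    return len(segments) * big + relevant_tokens   # token count only breaks ties

def sum_scores(scores):
    return sum(scores)

def density_score(scores, threshold):
    return sum(s >= threshold for s in scores) / max(len(scores), 1)
```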

The same scorer can be used to rank gold annotations and model annotations, or different scorers can be chosen. Ranking quality is evaluated using Spearman's ρ, which ranges from -1 (exact opposite ranking) to +1 (same ranking), with 0 indicating no correlation between rankings. We use Segments+Tokens as default; a comparison of ranking methods is in Appendix B.

3.4.2 Ranking interface

Our system also includes a web-based ranking interface, which displays the scores and corresponding ranking assigned to a set of annotated documents, as shown in Figure 6. For ease of visual distinction, we include colorization of rows based on configurable score thresholds.


Embeddings   Smoothing   Annotation             Ranking
                         Pr     Rec    F-2      ρ
Static       No          59.0   94.7   84.4     0.862
             Yes         60.5   93.7   84.3     0.899
ELMo         No          60.2   94.1   84.4     0.771
             Yes         66.5   91.4   84.8     0.886
BERT         No          55.3   93.8   82.2     0.689
             Yes         62.3   90.8   84.3     0.844

Table 2: Annotation and ranking evaluation results on mobility documents, using three embedding sources. Results are given with and without Viterbi smoothing, using binarization threshold = 0.5 and no collapsing of adjacent segments. Pr = precision, Rec = recall, ρ = Spearman's ρ. Pr/Rec/F-2 are macro-averaged over folds; ρ is computed over all test predictions.

Ranking methods used for model scores and gold annotations (when present) can be adjusted independently, and our post-processing methods (Section 3.2) can also be adjusted to affect ranking.

3.5 Qualitative analysis tools

We provide a set of three tools for performing qualitative analysis of annotation outcomes. The first measures lexicalization of each unique token in the dataset with respect to relevance score, by averaging the assigned relevance score (with or without smoothing) for each instance of each token. Tokens with a frequency below a configurable minimum threshold are excluded.
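This lexicalization measure can be sketched as below, assuming `token_score_pairs` iterates over (token, assigned relevance score) pairs across the annotation set; names and the default frequency cutoff are illustrative.

```python
from collections import defaultdict

def lexicalization(token_score_pairs, min_freq=5):
    """Average assigned relevance score per unique token, above a frequency cutoff."""
    sums, counts = defaultdict(float), defaultdict(int)
    for token, score in token_score_pairs:
        sums[token] += score
        counts[token] += 1
    return {t: sums[t] / counts[t] for t in sums if counts[t] >= min_freq}
```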

Our other tools analyze the aggregate relevance score patterns in an annotation set. For labeled data, as shown in Figure 2, we provide a visualization of precision, recall, and F-2 when varying the binarization threshold, including identifying the optimal threshold with respect to F-2. We also include a label-agnostic analysis of patterns in output relevance scores, illustrated in Figure 7, as a way to evaluate the confidence of the annotator. Both of these tools are provided at the level of an annotation set and of individual documents.

3.6 Implementation details

Our automated annotation, post-processing, and document ranking algorithms are implemented in Python, using the NumPy and TensorFlow libraries. Our demonstration interface is implemented using the Flask library, with all backend logic handled separately in order to support modularity of the user interface.

4 Results on mobility

Table 2 shows the token-level annotation and document ranking results for our experiments on mobility information.

Figure 7: Distribution of token relevance scores on mobility data: (a) word2vec, (b) ELMo, and (c) BERT.

Static and contextualized embedding models performed equivalently well on token-level annotations; BERT embeddings actually underperformed static embeddings and ELMo on both precision and recall. Interestingly, static embeddings yielded the best ranking performance of ρ = 0.862, compared to 0.771 with ELMo and 0.689 with BERT. Viterbi smoothing makes a minimal difference in token-level tagging, but increases ranking performance considerably, particularly for the contextualized models. It also produces a qualitative improvement by trimming out extraneous tokens at the start of several segments, as reflected by the improvements in precision.

The distribution of token scores from each model (Figure 7) shows that all three embedding models yielded a roughly bimodal distribution, with most scores in the ranges [0, 0.2] or [0.7, 1.0].

5 Discussion

Though our system is designed to address different needs from other NLP annotation tools, components such as annotation viewing are also addressed in other established systems. Our implementation decouples backend analysis from the front-end interface; in future work, we plan to add support for integrating our annotation and ranking systems into existing platforms such as brat. Our tool can also easily be extended to both multi-class and multilabel applications; for a detailed discussion, see Appendix C.

In terms of document ranking methods, it may be preferable to rank documents jointly instead of independently, in order to account for challenges such as duplication of information (common in clinical data; Taggart et al. (2015)) or subtopics. However, these decisions are highly task-specific, and are an important focus for designing ranking utility within specific domains.


6 Conclusions

We introduced HARE, a supervised system for highlighting relevant information and interactive exploration of model outcomes. We demonstrated its utility in experiments with clinical records annotated for narrative descriptions of mobility status. We also provided qualitative analytic tools for understanding the outcomes of different annotation models. In future work, we plan to extend these analytic tools to provide rationales for individual token-level decisions. Additionally, given the clear importance of contextual information in token-level annotations, the static transition probabilities used in our Viterbi smoothing technique are likely to degrade its effect on the output. Adding support for dynamic, contextualized estimation of transition probabilities will provide more fine-grained modeling of relevance, as well as more powerful options for post-processing.

Our system is available online at https://github.com/OSU-slatelab/HARE/. This research was supported by the Intramural Research Program of the National Institutes of Health and the US Social Security Administration.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Clinical NLP Workshop, pages 72–78. ACL.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. TACL, 5:135–146.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186. ACL.

Ingrid E Fisher, Margaret R Garnsey, and Mark E Hughes. 2016. Natural Language Processing in Accounting, Auditing and Finance: A Synthesis of the Literature with a Roadmap for Future Research. Intelligent Systems in Accounting, Finance and Management, 23(3):157–214.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746, pages 1–8.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, pages 1–12.

Denis Newman-Griffis and Ayah Zirikly. 2018. Embedding Transfer for Low-Resource Medical Named Entity Recognition: A Case Study on Patient Mobility. In BioNLP, pages 1–11. ACL.

Denis Newman-Griffis, Ayah Zirikly, Guy Divita, and Bart Desmet. 2019. Classifying the reported ability in clinical mobility descriptions. In BioNLP.

Philip V Ogren. 2006. Knowtator: A Protege plug-in for annotated corpus construction. In NAACL-HLT, pages 273–275, New York City, USA. ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543. ACL.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL-HLT, pages 2227–2237, New Orleans, Louisiana. ACL.

S Trent Rosenbloom, Joshua C Denny, Hua Xu, Nancy Lorenzi, William W Stead, and Kevin B Johnson. 2011. Data from clinical notes: a perspective on the tension between structure and flexible documentation. JAMIA, 18(2):181–186.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. JAMIA, 17(5):507–513.

Steven J Skube, Elizabeth A Lindemann, Elliot G Arsoniadis, Mari Akre, Elizabeth C Wick, and Genevieve B Melton. 2018. Characterizing Functional Health Status of Surgical Patients in Clinical Notes. In AMIA Joint Summits, pages 379–388. AMIA.

Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, and Hua Xu. 2018. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. JAMIA, 25(3):331–336.

Pontus Stenetorp, Sampo Pyysalo, Goran Topic, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: a Web-based Tool for NLP-Assisted Text Annotation. In EACL, pages 102–107. ACL.

Jane Taggart, Siaw-Teng Liaw, and Hairong Yu. 2015. Structured data quality reports to improve EHR data quality. Int J Med Info, 84(12):1094–1098.


Thanh Thieu, Jonathan Camacho, Pei-Shu Ho, et al. 2017. Inductive identification of functional status information and establishing a gold standard corpus: A case study on the Mobility domain. In BIBM, pages 2300–2302. IEEE.

Yonghui Wu, Mike Schuster, Zhifeng Chen, et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.


A Hyperparameters

This section describes each of the settings evaluated for the various hyperparameters used in our experiments on mobility information. We first experimented with the different pretrained embeddings used for each of our embedding model options; results are shown in Figure 8.

Static embedding model (Figure 8a) We evaluated three commonly used benchmark embedding sets: word2vec skipgram (Mikolov et al., 2013) using GoogleNews,6 FastText skipgram with subword information on WikiNews,7 and GloVe (Pennington et al., 2014) on 840 billion tokens of Common Crawl.8 Additionally, we experimented with two in-domain embedding sets, trained on 10 years of Physical Therapy and Occupational Therapy records from the NIH Clinical Center (referred to as "PT/OT"), using word2vec skipgram and FastText skipgram. word2vec GoogleNews embeddings produced the best dev F-2.

ELMo model (Figure 8b) We experimented with three pretrained ELMo models:9 the "Original" model trained on the 1 Billion Word Benchmark, the "Original (5.5B)" model trained with the same settings on Wikipedia and machine translation data, and a model trained on PubMed abstracts. The Original (5.5B) model produced the best dev F-2.

BERT model (Figure 8c) We experimented with three pretrained BERT models: BERT-Base,10 BioBERT (Lee et al., 2019) (v1.1) trained on 1 million PubMed abstracts,11 and clinicalBERT (Alsentzer et al., 2019) trained on MIMIC data.12 We used uncased versions of BERT-Base and clinicalBERT, as casing is not a reliable signal in clinical data; BioBERT is only available in a cased version. clinicalBERT produced the best dev F-2.

Once the best embedding models for each method were identified, we experimented with network and training hyperparameters, with results shown in Figure 9.

6 http://google.com/archive/p/word2vec/
7 https://fasttext.cc/docs/en/
8 http://nlp.stanford.edu/projects/glove/
9 All downloaded from https://allennlp.org/elmo
10 https://github.com/google-research/bert
11 https://github.com/naver/biobert-pretrained
12 https://github.com/EmilyAlsentzer/clinicalBERT

Irrelevant:relevant sampling ratio (Figure 9a) We experimented with the ratio of irrelevant to relevant samples drawn for each training epoch in {0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3}. A ratio of 0.75 gave the best dev F-2.

Positive fraction (Figure 9b) We varied the fraction of the total dataset's positive samples drawn for each training epoch from 10% to 100% at intervals of 10%. The best dev F-2 was produced by using all positive samples in each epoch.

Dropout rate (Figure 9c) We experimented with an input dropout rate from 0% to 90%, at intervals of 10%; the best results were produced with a 60% dropout rate.

Weighting scheme (Figure 9d) Given the imbalance of relevant to irrelevant samples in our dataset, we experimented with weighting relevant samples by a factor of 1 (equal weight), 2, 3, 4, and 5. A weighting of 2:1 produced the best dev F-2.

Hidden layer configuration (Figure 9e) We experimented with the configuration of our DNN model, using hidden layer sizes ∈ {10, 100, 300} and numbers of layers from 1 to 3. The best dev F-2 results were achieved using 3 hidden layers of size 300.

B Ranking methods

A comparison of the ranking methods used for model and gold scores is provided in Figure 10. We found that for our experiments, Segments+Tokens and SumScores correlated fairly well with one another, but Density, due to its normalization for document length, works best when used to rank both model and gold scores. SumScores provided the best overall ranking correlation; however, we use Segments+Tokens as the default setting for our system for its clear interpretation.

C Extending to multi-class/multilabel applications

Our experiments focused on binary relevance with respect to mobility information. However, our system can be extended fairly straightforwardly to both multi-label (i.e., multiple relevance criteria) and multi-class (e.g., NER) settings.

For multi-label settings, such as looking for evidence of limitations in either mobility or interpersonal interactions, the only requirement is having data that are annotated for each relevance criterion.


These can be the same data with multiple annotations, or different datasets; in either case, binary relevance annotators can be trained independently for each specific relevance criterion. Our post-processing components, such as Viterbi smoothing, can then be applied independently to each set of relevance annotations as desired. The primary extension required would be to the visualization interface, to support display of multiple (potentially overlapping) annotations. Alternatively, our modular handling of relevance annotations could be redirected to another visualization interface with existing support for multiple annotations, such as brat.

Extending to multi-class settings would require fairly minimal updates to both the interface and our relevance annotation model. Our model is trained using two-class (relevant and irrelevant) cross-entropy; this could easily be extended to n-ary cross-entropy for any desired number of classes, and trained with token-level data annotated with the appropriate classes. In terms of visualization and analysis, the two modifications required would be adding differentiating displays for the different annotated classes (e.g., different colors), and updating the displayed evaluation statistics to micro/macro evaluations over the multiple classes. Qualitative analysis features such as relevance score distribution and lexicalization are already dependent only on the scores assigned to the "relevant" class, and could be presented for each class independently.


Figure 8: Embedding model selection results, by F-2 on the cross-validation development set: (a) static word embedding models; (b) ELMo embedding models; (c) BERT embedding models. Default settings for other hyperparameters were: relevant:irrelevant ratio of 1:1, sampling 50% of positive samples per epoch, dropout of 0.5, equal class weights, and a DNN configuration of one 100-dimensional hidden layer.

Figure 9: Hyperparameter tuning results, measuring F-2 on the development set in cross-validation experiments: (a) irrelevant:relevant sampling ratio in training; (b) fraction of relevant samples per epoch; (c) dropout; (d) weighting scheme (relevant:irrelevant); (e) configurations of the DNN token annotator, where NxM indicates N layers of dimensionality M, and M1/M2 indicates that the first hidden layer is of dimensionality M1 and the second of M2. For each embedding method, the best model (shown in Figure 8) was used; all other hyperparameter defaults were as described in Figure 8. The best-performing setting for each hyperparameter, determined by the mean dev F-2 across all three embedding methods, is indicated with a vertical dashed line.


Figure 10: Comparison of ranking methods used for model scores and gold scores: (a) static, no smoothing; (b) static, with smoothing; (c) ELMo, no smoothing; (d) ELMo, with smoothing; (e) BERT, no smoothing; (f) BERT, with smoothing. Scores given are Spearman's rank correlation coefficient (ρ). System outputs using all three embedding methods are compared, both with and without Viterbi smoothing.