
WBI at CLEF eHealth 2018 Task 1: Language-independent ICD-10 coding using multi-lingual embeddings and recurrent neural networks

Jurica Ševa, Mario Sänger, and Ulf Leser

Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, Germany

{seva,saengema,leser}@informatik.hu-berlin.de

Abstract. This paper describes the participation of the WBI team in the CLEF eHealth 2018 shared task 1 (“Multilingual Information Extraction - ICD-10 coding”). Our contribution focuses on the setup and evaluation of a baseline language-independent neural architecture for ICD-10 classification as well as a simple, heuristic multi-language word embedding space. The approach builds on two recurrent neural network models to extract and classify causes of death from French, Italian and Hungarian death certificates. First, we employ an LSTM-based sequence-to-sequence model to obtain a death cause from each death certificate line. We then utilize a bidirectional LSTM model with attention mechanism to assign the respective ICD-10 code to the extracted death cause description. Both models take multi-language word embeddings as inputs. During evaluation our best model achieves an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. The results are encouraging for future work as well as for the extension and improvement of the proposed baseline system.

Keywords: ICD-10 coding · Biomedical information extraction · Multi-lingual sequence-to-sequence model · Representation learning · Recurrent neural network · Attention mechanism · Multi-language embeddings

1 Introduction

Automatic extraction, classification and analysis of biological and medical concepts from unstructured texts, such as scientific publications or electronic health documents, is a highly important task to support many applications in research, daily clinical routine and policy-making. Computer-assisted approaches can improve decision making and support clinical processes, for example by giving a more comprehensive overview of a research area or providing detailed information about the aetiopathology of a patient or disease patterns. In the past years major advances have been made in the area of natural language processing (NLP). However, improvements in the field of biomedical text mining lag behind other domains, mainly due to privacy issues and concerns regarding the processed data (e.g. electronic health records).

The CLEF eHealth lab^1 aims to improve this situation through the organization of various shared tasks that exploit electronically available medical content [34]. In particular, Task 1^2 of the lab is concerned with the extraction and classification of death causes from death certificates originating from different languages [27]. Participants were asked to classify the death causes mentioned in the certificates according to the International Classification of Diseases, version 10 (ICD-10)^3. In previous years the task was concerned with French and English death certificates. In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian. The development of language-independent, multilingual approaches was encouraged.

^1 https://sites.google.com/site/clefehealth/
^2 https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding
^3 http://www.who.int/classifications/icd/en/

Inspired by the recent success of recurrent neural network (RNN) models [6,20,10] in general and the convincing performance of the work of Miftahutdinov and Tutubalina [21] in the last edition of the lab, we opted to develop a deep learning model for this year's competition. Our work introduces a prototypical, language-independent approach for ICD-10 classification using multi-language word embeddings and long short-term memory models (LSTMs). We divide the proposed pipeline into two tasks. First, we perform named entity recognition (NER), i.e. extract the death cause description from a certificate line, with an encoder-decoder model. Given the death cause, named entity normalization (NEN), i.e. assigning an ICD-10 code to the extracted death cause, is performed by a separate LSTM. Our approach builds upon a heuristic multi-language embedding space and therefore needs only one single model for all three data sets. With this work we want to evaluate what performance can be achieved with such a simple shared embedding space.

2 Related Work

This section highlights previous work related to our approach. We give a brief introduction to the methodological foundations of our work, RNNs and word embeddings. The section concludes with a summary of ICD-10 classification approaches used in previous eHealth lab competitions.

2.1 Recurrent neural networks (RNN)

RNNs are a widely used technique for sequence learning problems such as machine translation [1,6], image captioning [2], NER [20,40], dependency parsing [10] and part-of-speech tagging [39]. RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit in one time step depends on the state of the unit in the previous time step. These feedback connections enable the network to memorize information from recent time steps and add the ability to capture long-term dependencies.

However, training of RNNs can be difficult due to the vanishing gradient problem [16,3]. The most widespread modifications of RNNs to overcome this problem are LSTMs [17] and gated recurrent units (GRUs) [6]. Both modifications use gated memories which control and regulate the information flow between two recurrent units. A common LSTM unit consists of a cell and three gates: an input gate, an output gate and a forget gate. In general, LSTMs are chained together by connecting the outputs of the previous unit to the inputs of the next one.

Bidirectional networks are a further extension of the general RNN architecture, which make the past and future context available at every time step. A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction. The final representation is typically the concatenation or a linear combination of both states.

2.2 Word Embeddings

Distributional semantic models (DSMs) have been researched for decades in NLP [37]. Based on huge amounts of unlabeled text, DSMs aim to represent words using a real-valued vector (also called embedding) which captures syntactic and semantic similarities between the words. Starting with the publication of the work of Collobert et al. [7] in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, has become one of the hot topics in NLP and a plethora of approaches have been proposed [4,23,28,30].

The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task [29,30,32]. The most popular embedding model is the Word2Vec model introduced by Mikolov et al. [22,23]. They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram, that are trained to reconstruct the context given a center word and vice versa. In contrast, Pennington et al. [28] use the ratio between the co-occurrence probabilities of two words with a third one to learn a vector representation. In [30] multi-layer, bi-directional LSTM models are utilized to learn word embeddings that also capture the different contexts in which a word occurs.

Several recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words. For example, Pinter et al. [32] try to reconstruct a pre-trained word embedding by learning a bi-directional LSTM model on character level. Similarly, Bojanowski et al. [4] adapt the SkipGram model by taking character n-grams into account. Their fastText model assigns a vector representation to each character n-gram and represents a word by summing over the representations of all its n-grams.
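To make the n-gram composition concrete, the following toy sketch builds a word vector by summing the vectors of its character n-grams. It only illustrates the idea; the actual fastText implementation additionally hashes n-grams into a fixed number of buckets and learns the vectors during training, whereas the vectors here are random placeholders.

    import numpy as np

    def char_ngrams(word, n_min=3, n_max=6):
        """Character n-grams of a word, padded with boundary markers as in fastText."""
        padded = "<" + word + ">"
        return [padded[i:i + n]
                for n in range(n_min, n_max + 1)
                for i in range(len(padded) - n + 1)]

    # Toy n-gram vector table with random placeholder vectors.
    rng = np.random.default_rng(0)
    dim = 300
    ngram_vectors = {}

    def word_vector(word):
        """Represent a word as the sum of the vectors of its character n-grams."""
        vecs = []
        for ng in char_ngrams(word):
            if ng not in ngram_vectors:
                ngram_vectors[ng] = rng.normal(size=dim)
            vecs.append(ngram_vectors[ng])
        return np.sum(vecs, axis=0)

    # Even a previously unseen word receives a vector, because it shares n-grams with known words.
    print(word_vector("infectieuse").shape)   # (300,)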

In addition to embeddings that capture word similarities in one language, multi- and cross-lingual approaches have also been investigated. Proposed methods either learn a linear mapping between monolingual representations [12,41] or utilize word- [13,38], sentence- [31] or document-aligned [36] corpora to build a shared embedding space.
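As a minimal illustration of the linear-mapping idea (not any specific method from the cited works), the sketch below computes an orthogonal mapping between two monolingual spaces from a seed dictionary via Procrustes analysis; the vectors are random placeholders.

    import numpy as np

    # Toy seed dictionary: row i of X (source space) translates to row i of Y (target space).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 300))   # source-language vectors of the seed pairs
    Y = rng.normal(size=(1000, 300))   # target-language vectors of the seed pairs

    # Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W being orthogonal.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    # Any source-language vector can now be mapped into the target space via v @ W.
    print(np.linalg.norm(X @ W - Y))   # residual alignment error on the seed pairs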

2.3 ICD-10 Classification

The ICD-10 coding task has already been carried out in the 2016 [26] and 2017 [25] editions of the eHealth lab. Participating teams used a plethora of different approaches to tackle the classification problem. The methods can essentially be divided into two categories: knowledge-based [5,18,24] and machine learning (ML) approaches [8,11,15,21]. The former rely on lexical sources, medical terminologies and other dictionaries to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework. For example, Di Nunzio et al. [9] calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score. In contrast, Ho-Dac et al. [14] treat the problem as an information retrieval task and utilize the Apache Solr search engine^4 to classify the individual lines.

The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) [15], Labeled Latent Dirichlet Analysis (LDA) [8] and Support Vector Machines (SVMs) [11] with diverse hand-crafted features. Most similar to our approach is the work of Miftahutdinov and Tutubalina [21], which achieved the best results for English certificates in last year's competition. They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation. Additionally, a vector which captures the textual similarity between the certificate line and the death causes of the individual ICD-10 codes is used to integrate prior knowledge into the model. The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step. In contrast to their work, our approach introduces a model for multi-language ICD-10 classification. Moreover, we divide the task into two distinct steps: death cause extraction and ICD-10 classification.

^4 http://lucene.apache.org/solr/

3 Methods

Our approach models the extraction and classification of death causes as a two-step process. First, we employ a neural, multi-language sequence-to-sequence model to obtain a death cause description for a given death certificate line. We then use a second classification model to assign the respective ICD-10 code to the obtained death cause. The remainder of this section gives a detailed explanation of the architecture of the two models.

3.1 Death Cause Extraction Model

The first step in our pipeline is the extraction of the death cause from a given certificate line. We use the training certificate lines (with their corresponding ICD-10 codes) and the ICD-10 dictionaries as the basis for our model. The dictionaries provide us with death causes for each ICD-10 code. The goal of the model is to reassemble the dictionary death cause text from the certificate line.

For this we adopt the encoder-decoder architecture proposed in [35]. Figure 1 illustrates the architecture of the model. As encoder we utilize a unidirectional LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right. Each token is represented using pre-trained fastText^5 word embeddings [4]. We utilize fastText embedding models for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles^6. Independently of a word's original language, we represent it by looking up the word in all three embedding models and concatenating the obtained vectors. Through this we get a simple multi-language representation of the word. This heuristic composition constitutes a naive solution to building a multi-language embedding space; however, we opted to evaluate this approach as a simple baseline for future work. The encoder's final state represents the semantic representation of the certificate line and serves as the initial input for the decoding process.

^5 https://github.com/facebookresearch/fastText/
^6 https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md
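One possible way to build this concatenated lookup is sketched below using gensim's fastText loader, assuming the three pre-trained .bin models are available locally (file names are placeholders); the actual preprocessing used for the experiments may differ in detail.

    import numpy as np
    from gensim.models.fasttext import load_facebook_vectors

    # File names are placeholders for the official pre-trained fastText .bin models.
    embeddings = {
        "fr": load_facebook_vectors("cc.fr.300.bin"),
        "it": load_facebook_vectors("cc.it.300.bin"),
        "hu": load_facebook_vectors("cc.hu.300.bin"),
    }

    def multilingual_vector(token):
        """Look the token up in all three fastText models and concatenate the results.
        fastText returns a subword-based vector even for out-of-vocabulary tokens, so
        every token of every language receives a 3 x 300 = 900-dimensional vector."""
        return np.concatenate([embeddings[lang][token] for lang in ("fr", "it", "hu")])

    print(multilingual_vector("colite").shape)   # (900,)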

Fig. 1. Illustration of the encoder-decoder model for death cause extraction. The encoder processes a death certificate line token-wise from left to right. The final state of the encoder forms a semantic representation of the line and serves as the initial input for the decoding process. The decoder is trained to predict the death cause text from the provided ICD-10 dictionaries word by word (using special tags \s and \e for the start resp. end of a sequence). All input tokens are represented using the concatenation of the fastText embeddings of all three languages.

For the decoder we utilize another LSTM model. The initial input of the decoder is the final state of the encoder model. Moreover, each token of the dictionary death cause text (padded with special start and end tags) serves as (sequential) input. Again, we use the fastText embeddings of all three languages to represent the input tokens. The decoder predicts one-hot-encoded words of the death cause. During test time we use the encoder to obtain a semantic representation of the certificate line and decode the death cause description word by word, starting with the special start tag. The decoding process finishes when the decoder outputs the end tag.
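A minimal Keras sketch of such an encoder-decoder setup is given below. The vocabulary size, the random placeholder embedding matrix and teacher forcing with one-hot targets are illustrative assumptions, not the exact configuration of our submitted system.

    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, emb_dim, hidden = 20000, 900, 256   # illustrative sizes
    embedding_matrix = np.random.rand(vocab_size, emb_dim).astype("float32")  # placeholder for concatenated fastText vectors

    # Encoder: reads the certificate line and keeps only its final LSTM state.
    enc_in = layers.Input(shape=(None,), name="certificate_line")
    enc_emb = layers.Embedding(vocab_size, emb_dim, weights=[embedding_matrix],
                               mask_zero=True, trainable=False)(enc_in)
    _, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_emb)

    # Decoder: predicts the dictionary death cause word by word (teacher forcing),
    # initialised with the encoder state; padding mask omitted here for brevity.
    dec_in = layers.Input(shape=(None,), name="death_cause_prefix")
    dec_emb = layers.Embedding(vocab_size, emb_dim, weights=[embedding_matrix],
                               trainable=False)(dec_in)
    dec_out = layers.LSTM(hidden, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
    probs = layers.Dense(vocab_size, activation="softmax")(dec_out)

    model = models.Model([enc_in, dec_in], probs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")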

3.2 ICD-10 Classification Model

The second step in our pipeline is to assign an ICD-10 code to the generated death cause description. For this we employ a bidirectional LSTM model which is able to capture the past and future context for each token of a death cause description. Just as in our encoder-decoder model, we encode each token using the concatenation of the fastText embeddings of the word from all three languages. To enable our model to attend to different parts of the death cause, we add an extra attention layer [33] to the model. Through the attention mechanism our model learns a fixed-sized embedding of the death cause description by computing an adaptive weighted average of the state sequence of the LSTM model. This allows the model to better integrate information over time. Figure 2 presents the architecture of our ICD-10 classification model. We train the model using the provided ICD-10 dictionaries from all three languages.

Fig. 2. Illustration of the ICD-10 classification model. The model utilizes a bi-directional LSTM layer, which processes the death cause from left to right and vice versa. The attention layer summarizes the whole description by computing an adaptive weighted average over the LSTM states. The resulting death cause embedding is fed through a softmax layer to get the final classification. Equivalent to our encoder-decoder model, all input tokens are represented using the concatenation of the fastText embeddings of all three languages.
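The following Keras sketch outlines a classifier of the kind described above, with a simple feed-forward attention layer in the spirit of [33]; the layer sizes, the number of ICD-10 labels and the omission of input masking are simplifications for illustration, not the exact submitted configuration.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    vocab_size, emb_dim, hidden, num_codes = 20000, 900, 256, 9591   # illustrative sizes
    embedding_matrix = np.random.rand(vocab_size, emb_dim).astype("float32")  # placeholder

    tokens = layers.Input(shape=(None,), name="death_cause_tokens")
    x = layers.Embedding(vocab_size, emb_dim, weights=[embedding_matrix], trainable=False)(tokens)
    h = layers.Bidirectional(layers.LSTM(hidden, return_sequences=True))(x)   # (batch, time, 2*hidden)

    # Feed-forward attention: score each time step, normalise the scores with a softmax
    # over the time axis, and build the weighted average of the LSTM states.
    scores = layers.Dense(1)(h)                        # (batch, time, 1)
    weights = layers.Softmax(axis=1)(scores)           # attention weights over time
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, weights])

    probs = layers.Dense(num_codes, activation="softmax")(context)
    model = models.Model(tokens, probs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])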


4 Experiments and Results

In this section we present the experiments and the results obtained for the two developed models, both individually and combined in a pipeline setting.

4.1 Training Data and Experiment Setup

The CLEF eHealth 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian. Each of the languages is supported by training certificate lines as well as a dictionary with death cause descriptions resp. diagnoses for the different ICD-10 codes. The provided training data sets were imbalanced concerning the different languages: the Italian corpus consists of 49,823, the French corpus of 77,348^7 and the Hungarian corpus of 323,175 certificate lines. We split each data set into a training and a hold-out evaluation set. The complete training data set was then created by combining the certificate lines of all three languages into one data set. Besides the provided certificate data we used no additional knowledge resources or annotated texts.

^7 For French we only took the provided data set from 2014.

Due to time constraints during development, no cross-validation was performed to optimize the (hyper-)parameters and the individual layers of our models. We either keep the default values of the hyper-parameters or set them to reasonable values according to existing work. During model training we shuffle the training instances and use a varying set of instances to validate each epoch.

The pre-trained fastText word embeddings were trained using the following parameter settings: CBOW with position-weights, embedding dimension size 300, character n-grams of length 5, a window of size 5 and 10 negative samples. Unfortunately, they are trained on corpora not related to the biomedical domain and therefore do not represent the best possible textual basis for an embedding space for biomedical information extraction. The final embedding space used by our models is created by concatenating the individual embedding vectors for all three languages. Thus the input of our models is an embedding vector of size 900. All models were implemented with the Keras^8 library.

^8 https://keras.io/

4.2 Death Cause Extraction Model

To identify possible candidates for a death cause description, we focus on the use of an encoder-decoder model. The encoder model uses an embedding layer with input masking on zero values and an LSTM layer with 256 units. The encoder's output is used as the initial state of the decoder model.

Based on the input description from the dictionary and a special start token, the decoder generates a death cause word by word. This decoding process continues until a special end token is generated. The entire model is optimized using the Adam optimization algorithm [19] and a batch size of 700. Model training was performed either for 100 epochs or until an early stopping criterion was met (no change in validation loss for two epochs).
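This training configuration can be expressed in Keras roughly as follows; the snippet uses a tiny stand-in model and random data only to keep it self-contained, the actual models being the ones described above.

    import numpy as np
    from tensorflow.keras import layers, models
    from tensorflow.keras.callbacks import EarlyStopping

    # Tiny stand-in model and random data; only the training settings matter here.
    model = models.Sequential([layers.Dense(32, activation="relu", input_shape=(10,)),
                               layers.Dense(2, activation="softmax")])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

    x = np.random.rand(2800, 10).astype("float32")
    y = np.eye(2, dtype="float32")[np.random.randint(0, 2, size=2800)]

    # At most 100 epochs with a batch size of 700; training stops early when the
    # validation loss has not improved for two consecutive epochs.
    early_stopping = EarlyStopping(monitor="val_loss", patience=2)
    model.fit(x, y, batch_size=700, epochs=100, validation_split=0.25,
              shuffle=True, callbacks=[early_stopping])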

As the provided data sets are imbalanced regarding the task's languages, we devised two different evaluation settings: (1) DCEM-Balanced, where each language was supported by 49,823 randomly drawn instances (the size of the smallest corpus), and (2) DCEM-Full, where all available data is used. Table 1 shows the results obtained on the training and validation sets. The figures indicate that the distribution of training instances per language has a huge influence on the performance of the model. The model trained on the full training data achieves an accuracy of 0.678 on the validation set. In contrast, using the balanced data set the model reaches an accuracy of 0.899 (+32.5%).

Table 1. Experiment results of our death cause extraction sequence-to-sequence model concerning the balanced (equal number of training instances per language) and the full data set setting.

Setting         Trained epochs   Train accuracy   Train loss   Validation accuracy   Validation loss
DCEM-Balanced   18               0.958            0.205        0.899                 0.634
DCEM-Full       9                0.709            0.098        0.678                 0.330

4.3 ICD-10 Classification Model

The classification model is responsible for assigning an ICD-10 code to the death cause description obtained in the first step. Our model uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a hidden size of 256. Thereafter an attention layer builds an adaptive weighted average over all LSTM states. The respective ICD-10 code is determined by a dense layer with a softmax activation function. We use the Adam optimizer to perform model training. The model was validated on 25% of the data. As for the extraction model, no cross-validation or hyper-parameter optimization was performed.

Once again, we devised two approaches, mainly because of the lack of adequate training data in terms of coverage for individual ICD-10 codes. Therefore, we defined two training data settings: (1) a minimal setting (ICD-10 Minimal), where only ICD-10 codes with two or more supporting training instances are used. This leaves us with 6,857 unique ICD-10 codes and discards 2,238 unique codes with a support of one, which, of course, minimizes the number of ICD-10 codes in the label space. Therefore, (2) an extended data set (ICD-10 Extended) was defined. Here, the original ICD-10 code mappings found in the supplied dictionaries are extended with the training instances from the individual certificate data of the three languages. This generates 9,591 unique ICD-10 codes. Finally, for the remaining ICD-10 codes that have only one supporting description, we duplicate those data points.


The goal of this approach is to extend our possible label space to all available ICD-10 codes. The results obtained with the two approaches on the validation set are shown in Table 2. Using the minimal data set the model achieves an accuracy of 0.937. In contrast, using the extended data set the model reaches an accuracy of 0.954, which represents an improvement of 1.8%.
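The two settings can be derived from the (description, code) pairs roughly as sketched below; the pairs shown are made-up stand-ins for the parsed dictionaries and certificate lines.

    from collections import Counter

    # Toy stand-ins for the parsed (death cause description, ICD-10 code) pairs.
    dictionary_pairs = [("colite infectieuse", "A09"), ("colite ischemique", "K559"),
                        ("infarctus du myocarde", "I21"), ("infarto miocardico", "I21"),
                        ("szivinfarktus", "I21"), ("sepsi", "A419")]
    certificate_pairs = [("colite ischemique sigmoide", "K559"),
                         ("sepsi severa", "A419")]

    # Setting 1 (ICD-10 Minimal): keep only codes with at least two supporting dictionary entries.
    support = Counter(code for _, code in dictionary_pairs)
    minimal = [(text, code) for text, code in dictionary_pairs if support[code] >= 2]

    # Setting 2 (ICD-10 Extended): add the certificate lines as additional training instances
    # and duplicate instances of codes that are still supported by only one description.
    extended = dictionary_pairs + certificate_pairs
    support_ext = Counter(code for _, code in extended)
    extended += [(text, code) for text, code in extended if support_ext[code] == 1]

    print(len(minimal), len(extended))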

Table 2. Experiment results for our ICD-10 classification model regarding the different data settings. The Minimal setting uses only ICD-10 codes with two or more training instances in the supplied dictionary. In contrast, Extended additionally takes the diagnosis texts from the certificate data and duplicates ICD-10 training instances with only one diagnosis text in the dictionary and certificate lines. * Used in the final pipeline.

Setting            Trained epochs   Train accuracy   Train loss   Validation accuracy   Validation loss
ICD-10 Minimal     69               0.925            0.190        0.937                 0.169
ICD-10 Extended*   41               0.950            0.156        0.954                 0.141

4.4 Complete Pipeline

The two models were combined to create the final pipeline. We tested both death cause extraction models (based on the balanced and the unbalanced data set) in the final pipeline, as their performance differs greatly. On the contrary, both ICD-10 classification models perform similarly, so we only used the extended ICD-10 classification model, with word-level tokens^9, in the final pipeline. To evaluate the pipeline we built a training and a hold-out validation set during development. The results obtained on the validation set are presented in Table 3. The scores are calculated using a prevalence-weighted macro-average across the output classes, i.e. we calculate precision, recall and F-score for each ICD-10 code and build the average by weighting the scores by the number of occurrences of the code in the gold standard.

^9 Although models supporting character-level tokens were developed and evaluated, their performance fared poorly compared to the word-level tokens.
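This averaging corresponds, for example, to scikit-learn's "weighted" option; a small sketch with made-up gold and predicted codes:

    from sklearn.metrics import precision_recall_fscore_support

    # Toy gold-standard and predicted ICD-10 codes for a few certificate lines.
    gold = ["K559", "I21", "I21", "A419", "K559"]
    pred = ["K559", "I21", "A419", "A419", "I21"]

    # "weighted" = macro-average where each class is weighted by its number of
    # occurrences in the gold standard (prevalence-weighted macro-average).
    precision, recall, f_score, _ = precision_recall_fscore_support(
        gold, pred, average="weighted", zero_division=0)
    print(precision, recall, f_score)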

Although the individual models are promising, as shown in Tables 1 and 2, the performance decreases considerably in the pipeline setting. The pipeline model based on the balanced data set reaches an F-score of 0.61, whereas the full model achieves a slightly higher value of 0.63. Both model configurations have a higher precision than recall (0.73/0.61 resp. 0.74/0.62).

Table 3. Evaluation results of the final pipeline on the validation set of the training data. Reported figures represent the prevalence-weighted macro-average across the output classes. Final-Balanced = DCEM-Balanced + ICD-10 Extended. Final-Full = DCEM-Full + ICD-10 Extended.

Model            Precision   Recall   F-score
Final-Balanced   0.73        0.61     0.61
Final-Full       0.74        0.62     0.63

This can be attributed to several factors. First of all, a pipeline architecture always suffers from error propagation, i.e. errors in a previous step will influence the performance of the following layers and generally lower the performance of the overall system. Investigating the obtained results, we found that the imbalanced distribution of ICD-10 codes represents one of the main problems. This severely impacts the encoder-decoder architecture used here, as the token generation is biased towards the available data points. Therefore, the models very often misclassify certificate lines associated with ICD-10 codes that have only a small number of supporting training instances.

Results obtained on the test data set, resulting from the two submitted official runs, are shown in Table 4. Similar to the evaluation results during development, the model based on the full data set performs slightly better than the model trained on the balanced data set. The full model reaches an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. All of our approaches perform below the mean and median of all participants.

Surprisingly, there is a substantial difference between the results obtained for the individual languages. This confirms our assumptions about the (un-)suitability of the proposed multi-lingual embedding space for this task. The results also suggest that the size of the training corpora does not determine the final results: the best results were obtained on the Italian data set, which has the smallest corpus, the worst results on the mid-sized French corpus, while the biggest corpus, Hungarian, is in second place.

We identified several possible reasons for the obtained results. These also represent possible points for future work. One of the main disadvantages of our approach is the quality of the used word embeddings as well as the properties of the proposed language-independent embedding space. The usage of out-of-domain word embeddings, which are not targeted at the biomedical domain, is likely a suboptimal solution to this problem. We tried to alleviate this by finding suitable external corpora to train domain-dependent word embeddings for each of the supported languages; however, we were unable to find any significant amount of in-domain documents (e.g. a PubMed search for abstracts in French, Hungarian or Italian found 7,843, 786 and 1,659 articles, respectively). Furthermore, we used a simple, heuristic solution by just concatenating the embeddings of all three languages to build a shared vector space.

Besides the issues with the used word embeddings, the inability to obtain full ICD-10 dictionaries for the selected languages has also negatively influenced the results. As a final limitation of our approach, the lack of multi-label classification support has also been identified (i.e. the pipeline cannot recognize more than one death cause in a single input text).


Table 4. Test results of the final pipeline. Final-Balanced = DCEM-Balanced + ICD-10 Extended. Final-Full = DCEM-Full + ICD-10 Extended.

Language    Model            Precision   Recall   F-score
French      Final-Balanced   0.494       0.246    0.329
            Final-Full       0.512       0.253    0.339
            Baseline         0.341       0.200    0.253
            Average          0.723       0.410    0.507
            Median           0.798       0.475    0.579
Hungarian   Final-Balanced   0.518       0.384    0.441
            Final-Full       0.522       0.388    0.445
            Baseline         0.243       0.174    0.202
            Average          0.827       0.783    0.803
            Median           0.922       0.897    0.910
Italian     Final-Balanced   0.857       0.685    0.761
            Final-Full       0.862       0.689    0.766
            Baseline         0.165       0.172    0.169
            Average          0.844       0.760    0.799
            Median           0.900       0.824    0.863

5 Conclusion and Future Work

In this paper we tackled the problem of extracting death causes in a multilingual environment. The proposed solution focuses on the setup and evaluation of an initial language-independent model which relies on a heuristic, shared word embedding space for all three languages. The proposed pipeline is divided into two steps: first, tokens describing a possible death cause are generated using a sequence-to-sequence model; afterwards, the generated token sequence is normalized to an ICD-10 code using a distinct LSTM-based classification model with an attention mechanism. During evaluation our best model achieves an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. The obtained results are encouraging for further investigation; however, they cannot yet compete with the solutions of the other participants.

We detected several issues with the proposed pipeline, which serve as prospective future work. First of all, the representation of the input words can be improved in several ways. The word embeddings we used are not optimized for the biomedical domain but are trained on general text. Existing work has shown that in-domain embeddings improve the quality of the achieved results. Although this was our initial approach, the difficulty of finding adequate in-domain corpora for the selected languages proved hard to overcome. Moreover, the multi-language embedding space is currently heuristically defined as the concatenation of the three word embedding models for individual tokens. Creating a unified embedding space would result in a truly language-independent token representation. The improvement of the input layer will be the main focus of our future work.


The ICD-10 classification step also suffers from a lack of adequate training data. Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for all languages and therefore cannot guarantee the completeness of the ICD-10 label space. Another disadvantage of the current pipeline is the missing support for multi-label classification.

References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)
2. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In: Advances in Neural Information Processing Systems 28, pp. 1171–1179. Curran Associates, Inc. (2015)
3. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166 (1994)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5(1), 135–146 (2017)
5. Cabot, C., Soualmia, L.F., Dahamna, B., Darmoni, S.J.: SIBM at CLEF eHealth Evaluation Lab 2016: Extracting Concepts in French Medical Texts with ECMT and CIMIND. In: CLEF 2016 Online Working Notes. CEUR-WS (2016)
6. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar (October 2014)
7. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011)
8. Dermouche, M., Looten, V., Flicoteaux, R., Chevret, S., Velcin, J., Taright, N.: ECSTRA-INSERM@CLEF eHealth2016-task 2: ICD10 Code Extraction from Death Certificates. In: CLEF 2016 Online Working Notes (2016)
9. Di Nunzio, G.M., Beghini, F., Vezzani, F., Henrot, G.: A Lexicon Based Approach to Classification of ICD10 Codes. IMS Unipd at CLEF eHealth Task. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
10. Dyer, C., Ballesteros, M., Ling, W., Matthews, A., Smith, N.A.: Transition-Based Dependency Parsing with Stack Long Short-Term Memory. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). vol. 1, pp. 334–343 (2015)
11. Ebersbach, M., Herms, R., Eibl, M.: Fusion Methods for ICD10 Code Classification of Death Certificates in Multilingual Corpora. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
12. Faruqui, M., Dyer, C.: Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. pp. 462–471 (2014)
13. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). vol. 1, pp. 1234–1244 (2015)

14. Ho-Dac, L.M., Fabre, C., Birski, A., Boudraa, I., Bourriot, A., Cassier, M., Delvenne, L., Garcia-Gonzalez, C., Kang, E.B., Piccinini, E.: LITL at CLEF eHealth2017: automatic classification of death reports. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
15. Ho-Dac, L.M., Tanguy, L., Grauby, C., Mby, A.H., Malosse, J., Riviere, L., Veltz-Mauclair, A.: LITL at CLEF eHealth2016: recognizing entities in French biomedical documents. In: CLEF 2016 Online Working Notes. CEUR-WS (2016)
16. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
18. Jonnagaddala, J., Hu, F.: Automatic coding of death certificates to ICD-10 terminology. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
19. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2014)
20. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural Architectures for Named Entity Recognition. In: Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 260–270 (2016)
21. Miftahutdinov, Z., Tutubalina, E.: KFU at CLEF eHealth 2017 Task 1: ICD-10 coding of English death certificates with recurrent neural networks. In: CLEF 2017 Online Working Notes. CEUR-WS (2017)
22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
24. van Mulligen, E.M., Afzal, Z., Akhondi, S.A., Vo, D., Kors, J.A.: Erasmus MC at CLEF eHealth 2016: Concept Recognition and Coding in French Texts. In: CLEF 2016 Online Working Notes. CEUR-WS (2016)
25. Neveol, A., Anderson, R.N., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Robert, A., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction task overview: ICD10 coding of death certificates in English and French. In: CLEF 2017 Evaluation Labs and Workshop: Online Working Notes, CEUR-WS. p. 17 (2017)
26. Neveol, A., Cohen, K.B., Grouin, C., Hamon, T., Lavergne, T., Kelly, L., Goeuriot, L., Rey, G., Robert, A., Tannier, X., Zweigenbaum, P.: Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proceedings 1609, 28–42 (September 2016)
27. Neveol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: CLEF 2018 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS (September 2018)
28. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)


29. Peters, M., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). vol. 1, pp. 1756–1765 (2017)
30. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (2018)
31. Pham, H., Luong, T., Manning, C.: Learning distributed representations for multilingual text sequences. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. pp. 88–94 (2015)
32. Pinter, Y., Guthrie, R., Eisenstein, J.: Mimicking Word Embeddings using Subword RNNs. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 102–112 (2017)
33. Raffel, C., Ellis, D.P.: Feed-forward networks with attention can solve some long-term memory problems. In: Workshop Extended Abstracts of the 4th International Conference on Learning Representations (2016)
34. Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Neveol, A., Ramadier, L., Robert, A., Zuccon, G., Palotti, J.: Overview of the CLEF eHealth Evaluation Lab 2018. In: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer (2018)
35. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104–3112 (2014)
36. Søgaard, A., Agic, Z., Alonso, H.M., Plank, B., Bohnet, B., Johannsen, A.: Inverted indexing for cross-lingual NLP. In: The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015) (2015)
37. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141–188 (2010)
38. Vyas, Y., Carpuat, M.: Sparse bilingual word representations for cross-lingual lexical entailment. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1187–1197 (2016)
39. Wang, P., Qian, Y., Soong, F.K., He, L., Zhao, H.: Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168 (2015)
40. Wei, Q., Chen, T., Xu, R., He, Y., Gui, L.: Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks. Database: The Journal of Biological Databases and Curation 2016 (2016)
41. Xing, C., Wang, D., Liu, C., Lin, Y.: Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1006–1011 (2015)