biomedical document clustering and visualization based on ...the som map, the distribution of the...

8
Biomedical Document Clustering and Visualization based on the Concepts of Diseases Setu Shah Purdue School of Engineering and Technology, IUPUI 799 W. Michigan Street Indianapolis, Indiana 46202 [email protected] Xiao Luo Purdue School of Engineering and Technology, IUPUI 799 W. Michigan Street Indianapolis, Indiana 46202 [email protected] ABSTRACT Document clustering is a text mining technique used to provide beer document search and browsing in digital libraries or online corpora. A lot of research has been done on biomedical document clustering that is based on using existing ontology. But, associations and co-occurrences of the medical concepts are not well represented by using ontology. In this research, a vector representation of con- cepts of diseases and similarity measurement between concepts are proposed. ey identify the closest concepts of diseases in the con- text of a corpus. Each document is represented by using the vector space model. A weight scheme is proposed to consider both local content and associations between concepts. A Self-Organizing Map is used as document clustering algorithm. e vector projection and visualization features of SOM enable visualization and analysis of the clusters distributions and relationships on the two dimensional space. e experimental results show that the proposed document clustering framework generates meaningful clusters and facilitate visualization of the clusters based on the concepts of diseases. CCS CONCEPTS Information systems Clustering and classication; Docu- ment topic models; KEYWORDS Document Clustering, Clustering Visualization, Concept Similarity Measure ACM Reference format: Setu Shah and Xiao Luo. 2016. Biomedical Document Clustering and Visu- alization based on the Concepts of Diseases. In Proceedings of ACM KDD conference, Data-Drive Discovery Workshop, Halifax, NS, CANADA, August 2017 (KDD’17), 8 pages. DOI: 10.1145/nnnnnnn.nnnnnnn 1 INTRODUCTION Active research in the medical and biomedical domain has generated pervasive documents and articles. MEDLINE, the largest biomedical text database, has more than 16 million articles. It is estimated that Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. Request permissions from [email protected]. KDD’17, Halifax, NS, CANADA © 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 DOI: 10.1145/nnnnnnn.nnnnnnn more than 10,000 articles are added to MEDLINE weekly [19]. ere are continuously needs for development of techniques to discover, search, access and share knowledge from these documents and articles. Text clustering techniques enable us to group similar text documents in an unsupervised manner. Most of the research related to the biomedical document cluster- ing focuses on either reforming the representation of biomedical documents or improving the clustering algorithms. Biomedical document clustering is dierent from the general text document clustering task, because in the laer, semantic similarities between words or phrases are not usually considered. One medical concept of disease might be represented in dierent forms, and some medi- cal concepts of diseases might be highly correlated. For example, ‘Type 2 Diabetes’ is the same concept of disease as ‘Diabetes Mel- litus Type 2’. ‘Hypertension’ might co-occur oen with ‘Stroke’. In order to capture the semantic similarities between words or phrases, previous research on document representation reform- ing [12][18][20] oen use existing ontology such as MeSH or WordNet to identify the semantic relationships. However, ontology doesn’t reect the co-occurrences of medical concepts. is paper focuses on biomedical document clustering based on the concepts of diseases. e proposed similarity measure between the concepts of diseases is based on the Word2vec model [13]. is similarity measure identies the closest concepts based on co-occurrences of the concepts. e proposed concept weighting scheme is the linear combination of the TF-IDF value which reects the content similarity between documents and the similarity score based on the proposed similarity measurements that reect the semantic similarity between documents. e unsupervised learning algorithm Self-Organizing Map (SOM) [10] has been used as the clustering technique. SOM has properties of both vector quantization and vector projection. e neurons of an SOM can be presented on a two dimensional space. By projecting the input data instances to their best matching units (BMUs) on the SOM map, the distribution of the inputs can be visualized on the two dimensional space with the U-matrix and hit histogram of the SOM. e relationships of the clusters based on the concepts of diseases can be visualized. is clustering visualization is a benecial feature for biomedical literature search and browsing based on concepts of diseases. e rest of the paper is organized as followings. In section 2, re- lated work is described. Section 3 demonstrates how the concepts of diseases are extracted by using UMLS MetaMap. Section 4 and 5 de- tail the measurement of concepts similarity and weighting scheme for each concept in the document representation. Experimental arXiv:1810.09597v1 [cs.CL] 22 Oct 2018

Upload: others

Post on 15-Feb-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

Biomedical Document Clustering and Visualization based on theConcepts of Diseases

Setu ShahPurdue School of Engineering and Technology, IUPUI

799 W. Michigan StreetIndianapolis, Indiana 46202

[email protected]

Xiao LuoPurdue School of Engineering and Technology, IUPUI

799 W. Michigan StreetIndianapolis, Indiana 46202

[email protected]

ABSTRACTDocument clustering is a text mining technique used to providebe�er document search and browsing in digital libraries or onlinecorpora. A lot of research has been done on biomedical documentclustering that is based on using existing ontology. But, associationsand co-occurrences of the medical concepts are not well representedby using ontology. In this research, a vector representation of con-cepts of diseases and similarity measurement between concepts areproposed. �ey identify the closest concepts of diseases in the con-text of a corpus. Each document is represented by using the vectorspace model. A weight scheme is proposed to consider both localcontent and associations between concepts. A Self-Organizing Mapis used as document clustering algorithm. �e vector projection andvisualization features of SOM enable visualization and analysis ofthe clusters distributions and relationships on the two dimensionalspace. �e experimental results show that the proposed documentclustering framework generates meaningful clusters and facilitatevisualization of the clusters based on the concepts of diseases.

CCS CONCEPTS•Information systems→ Clustering and classi�cation; Docu-ment topic models;

KEYWORDSDocument Clustering, Clustering Visualization, Concept SimilarityMeasureACM Reference format:Setu Shah and Xiao Luo. 2016. Biomedical Document Clustering and Visu-alization based on the Concepts of Diseases. In Proceedings of ACM KDDconference, Data-Drive Discovery Workshop, Halifax, NS, CANADA, August2017 (KDD’17), 8 pages.DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTIONActive research in themedical and biomedical domain has generatedpervasive documents and articles. MEDLINE, the largest biomedicaltext database, has more than 16 million articles. It is estimated that

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor pro�t or commercial advantage and that copies bear this notice and the full citationon the �rst page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permi�ed. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior speci�c permission and/or afee. Request permissions from [email protected]’17, Halifax, NS, CANADA© 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM. . .$15.00DOI: 10.1145/nnnnnnn.nnnnnnn

more than 10,000 articles are added to MEDLINE weekly [19]. �ereare continuously needs for development of techniques to discover,search, access and share knowledge from these documents andarticles. Text clustering techniques enable us to group similar textdocuments in an unsupervised manner.

Most of the research related to the biomedical document cluster-ing focuses on either reforming the representation of biomedicaldocuments or improving the clustering algorithms. Biomedicaldocument clustering is di�erent from the general text documentclustering task, because in the la�er, semantic similarities betweenwords or phrases are not usually considered. One medical conceptof disease might be represented in di�erent forms, and some medi-cal concepts of diseases might be highly correlated. For example,‘Type 2 Diabetes’ is the same concept of disease as ‘Diabetes Mel-litus Type 2’. ‘Hypertension’ might co-occur o�en with ‘Stroke’.In order to capture the semantic similarities between words orphrases, previous research on document representation reform-ing [12] [18] [20] o�en use existing ontology such as MeSH orWordNet to identify the semantic relationships. However, ontologydoesn’t re�ect the co-occurrences of medical concepts. �is paperfocuses on biomedical document clustering based on the conceptsof diseases. �e proposed similarity measure between the conceptsof diseases is based on the Word2vec model [13]. �is similaritymeasure identi�es the closest concepts based on co-occurrencesof the concepts. �e proposed concept weighting scheme is thelinear combination of the TF-IDF value which re�ects the contentsimilarity between documents and the similarity score based onthe proposed similarity measurements that re�ect the semanticsimilarity between documents.

�e unsupervised learning algorithm Self-OrganizingMap (SOM)[10] has been used as the clustering technique. SOM has propertiesof both vector quantization and vector projection. �e neurons ofan SOM can be presented on a two dimensional space. By projectingthe input data instances to their best matching units (BMUs) onthe SOM map, the distribution of the inputs can be visualized onthe two dimensional space with the U-matrix and hit histogram ofthe SOM. �e relationships of the clusters based on the conceptsof diseases can be visualized. �is clustering visualization is abene�cial feature for biomedical literature search and browsingbased on concepts of diseases.

�e rest of the paper is organized as followings. In section 2, re-lated work is described. Section 3 demonstrates how the concepts ofdiseases are extracted by using UMLS MetaMap. Section 4 and 5 de-tail the measurement of concepts similarity and weighting schemefor each concept in the document representation. Experimental

arX

iv:1

810.

0959

7v1

[cs

.CL

] 2

2 O

ct 2

018

Page 2: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

KDD’17, August 2017, Halifax, NS, CANADA Setu Shah and Xiao Luo

se�ings and results are given in section 6. Section 7 concludes hisresearch and discusses potential future work.

2 RELATEDWORKA lot of research has been done in biomedical document clusteringin past decades. Some of it focused on document presentation re-forming based on medical ontology or on using di�erent weightingscheme other than TF-IDF, while some others focused on investi-gating various clustering algorithms.

Zhang et al. [20] reviewed three di�erent ontology based termsimilarity measurements: path based [17], information contentbased [15], and feature based [9] and then proposed their ownsimilarity measurement and term re-weighting scheme. K-meansalgorithm is used for document clustering. Based on the resultscomparison, some of them are slightly worse than the word basedscheme. �e authors mentioned that it might because of the lim-itation of the domain ontology, term extraction and sense disam-biguation. Visualization of the relationships between the clusterswere not included in this research.

Yoo et al. [19] used a graphical representation method to repre-sent a set documents based on the MeSH ontology, and proposedthe document clustering and summarization with this graphicalrepresentation. �e document clustering and summarization modelgained comparable results on clustering and also provided somevisualization on the documents cluster model based on the relation-ships of the terms. However, this visualization relies largely on theMeSH ontology instead of the document relationships themselves.

Logeswari et al. [12] proposed a concept weighting schemebased on the MeSH ontology and tri-gram extraction to extractconcepts from the text corpus. �e semantic relationship betweentri-grams are weighted through a heuristic weight assignment offour prede�ned semantic relationships. �e K-means clusteringalgorithm results show that concept based representationwas be�erthan word based representation. Visualization of the clusteringresults was not investigated.

Gu et al. [8] proposed a concept similarity measurement byusing a linear combination of multiple similarity measurementsbased on MeSH ontology and local content which include TF-IDFweighting and co-e�cient calculation between related article sets. Asemi-supervised clustering algorithm was employed at the stage ofdocument clustering. �eir focus was not clustering visualization.

Some research has been done about the visualization processto support biomedical literature search. Gorg et al. [7] developeda visual analytics system, named Bio-Jigsaw by using the MeSHontology. �is research demonstrated how visual analytics can beused to analyze a search query on a gene related to breast cancer.Neither document representation nor document clustering werediscussed.

To the best of authors’ knowledge, this research is the �rst topresent concepts of diseases using vectors based on the Word2vecmodel instead of using an ontology. �e proposed similarity mea-surement and the concept weighting scheme are �rst applied tothe biomedical document clustering. �e SOM based clustering isemployed to visualize the distribution of document clusters basedon the concepts of diseases.

A retrospective evaluation of Haemophilus influenzae type b meningitis 

observed over a 2‐year period  documented 86 cases

                  Disease or Syndrome

Functional Concept

TemporalConcept

TemporalConcept

Health Care Activity

Quantitative Concept

Health Care Activity

Research Activity

Figure 1: Semantic Mapping of using UMLS MetaMap

3 CONCEPTS OF DISEASES EXTRACTIONIn this work, the focus is on clustering biomedical documents basedon the concepts of diseases that are addressed by or mentionedin the documents. To extract the concepts of diseases from thedocuments, Uni�ed Medical Language System (UMLS) MetaMapis used. UMLS MetaMap [2] is a natural language processing toolthat makes use of various sources such as UMLS Metathesaurus[1] and SNOMED CT [4] to map the phrases or terms in the text todi�erent semantic types.

Figure 1 provides an example of mapping phrases to di�erentsemantic types using UMLS MetaMap. In this example, eight termsor phrases in the sentence have been mapped to six semantic types.�e phrase ‘Haemophilus in�uenzae type b meningitis’ in the sen-tence has been identi�ed as semantic type ‘disease or syndrome’and mapped to phrase ‘Type B Hemophilus in�uenzae Meningitis’based on the lexicon that UMLS MetaMap uses. In this research, ifa term or phrase has been mapped to semantic types ‘Disease orSyndrome’ or ‘Neoplastic Process’, the corresponding phrase in thelexicon produced by MetaMap is extracted.

4 CONCEPTS SIMILARITY MEASUREIn the biomedical literature, same concepts of a disease can be pre-sented by di�erent terms or combinations of words. For example,‘cancer of breast’ and ‘ breast cancer’ are two phrases that presentthe concept of the same disease. However, they are treated as twodi�erent concepts if typical vector space model and TF-IDF weight-ing scheme are used for document presentation, and the semanticsimilarity between them is not measured. In this research, a se-mantic similarity measure between di�erent concepts of diseasesis proposed. Given a total of L concepts of diseases extracted fromthe raw text corpus, the similarities between any two concepts arestored in the similarity matrix S as presented in Equation 1. Eachentry si, j in the matrix S represents the similarity between conceptCi and Cj .

SL,L =

©­­­­«s1,1 s1,2 · · · s1,Ls2,1 s2,2 · · · s2,L...

.... . .

...

sL,1 sL,2 · · · sL,L

ª®®®®¬(1)

To calculate the similarity between two concepts, �rst, eachword is represented by a vector (as proposed in Equation 2). �isvector representation is learned by training the Word2Vec model.

Page 3: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

Biomedical Document Clustering and Visualization based on the Concepts of Diseases KDD’17, August 2017, Halifax, NS, CANADA

�eWord2Vec training algorithm was developed by a team of re-searchers at Google led by TomasMikolov [13]. It is a computationally-e�cient algorithm to generate vectors of real numbers to presentwords in a given raw text corpus. �ese vector representationsare learned using three-layer neural networks using either a con-tinuous bag-of-words approach or a skip-gram architecture. �evectors preserve the distances between words in the vector space sothat the words that share common contexts in the raw text corpusare located in close proximity to one another. �e dimension of thevector created depends on the number of neurons in the hiddenlayer of the neural network when training a Word2Vec model.

Word = (wv1,wv2, . . . ,wvm ) (2)

m: the dimension of the vector.In this research, a trained Word2Vec model [14] created from

a subset of PubMed literature database and a subset of PubMedCentral (PMC) Open Access database is employed. �ese two textcorpus contains a large number of biomedical documents. �etrained model creates 200 dimensional vectors to present the wordsextracted in the two text corpus. �e skip-gram architecture with awindow size of 5 is adopted for the learning process [14].

Although some of concepts of diseases contain only one word,many of them span multiple words. In this work, if a concept ofdisease spans multiple words, a concept vector is generated byaggregating the vectors of all the words in the concept, as shownin Equation 3. For example, for the disease ‘diabetes mellitus’, thevector for ‘diabetes’ and the vector for ‘mellitus’ are aggregated byadding them together.

C =M∑i=1

Wordi (3)

M : the total number of words in a concept C .�e similarity score between the concepts are calculated using

the cosine distance between the vectors as shown in Equation 4.

Si, j =Ci · Cj√∑m

i=1C2i ·

√∑mi=1C

2j

(4)

By presenting concepts in vector and using this similarity mea-sure, it is observed that the more the diseases are associated, thehigher the similarity scores between them are. Table 1 providessome examples of concepts of diseases and the their top 3 closestconcepts based on the similarity scores. ‘Hypertension’ is o�enassociated with ‘hyperlipidaemia’ in the literature, so the similaritybetween them is higher than that between ‘Hypertension’ and otherconcepts of diseases.

5 DOCUMENT REPRESENTATION ANDWEIGHTING SCHEME

In this research, the typical vector space model is used to present abiomedical document, each entry of the vector corresponding to aconcept of disease which is identi�ed through the UMLS MetaMap.�e proposed weight (WeiдhtCi,d ) that is given to each concept(Ci,d ) is calculated as equation 5:

Table 1: Examples of concepts and the top 3 closest conceptsbased on the similarity scores

Concept Closest Concepts Score ofSimilarity

essential hypertension 0.813hypertension hyperlipidaemia 0.692

dyslipidemia 0.659endothelial dysfunction 0.739dysfunction renal dysfunction 0.660

cortical dysfunction 0.639carpal tunnel bilateral carpal tunnel syndrome 0.970syndrome cts carpal tunnel syndrome 0.957

carpal tunnel 0.941diabetes mellitus 0.918

diabetes diabetes mellitus type ii 0.868dm diabetes mellitus 0.845

cardiovascular cardiac diseases 0.8181disease metabolic diseases 0.8179

heart diseases 0.787

WeiдhtCi,d =t fCi ,d × log

|D |dfCi

+∑Mj=1 Si, j if t fCi ,d > 0∑N

j=1N−(j−1)

N Si, j if t fCi ,d = 0(5)

d fCi : the number of documents in which concept Ci occurs atleast once

t fCi,d : frequency of concept Ci in document d|D |: total number of documents in the corpusSi, j : the similarity between Ci and concept Cj that both occur

in document d . Cj is the jth frequent concept in the document d .M : the total number of concepts in document d .N : top N closest concepts of Ci . In this research, N = 3.If a concept occurs in a document, the weighting scheme uses the

TF-IDF value to underline the occurrence of the concept in the localcontent. �e

∑Mj=1 Si, j calculates the sum of similarity scores be-

tween the occurred conceptCi,d and other concepts (Cj,d , j = 1, Û,M)that also occurs within the document. If a concept does not occurin the document, the weight is calculated by a weighted sum of thetop 3 closest concepts (Cj,d , j = 1, Û,3) that appear in the documentbased on the similarities scores. By using this weighting scheme,the representation measures the occurrences of di�erent represen-tations of the same or similar concepts. For example, ‘diabetes’occurs in one document, but ‘diabetes mellitus’ occurs in anotherdocument. By using the traditional TF-IDF weighting scheme, theirvalues would be 0 for documents in which the concept does notappear. However, by using the proposed weighting scheme, theyare weighted based on the similarity between the concept and itsclosest concepts. �us, for the document that does not contain theconcept ‘diabetes mellitus’, instead of using 0, the similarity scorebetween ‘diabetes mellitus’ and other concepts that appear in thedocument is used.

Page 4: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

KDD’17, August 2017, Halifax, NS, CANADA Setu Shah and Xiao Luo

6 CLUSTERING ALGORITHMSelf-Organizing Map (SOM) is used for document clustering visual-ization [11]. SOM implements the topologically ordered display ofthe data to facilitate understanding the structures of the input dataset. It is also readily explainable and easy to visualize. �e visual-ization of the multidimensional data is one of the main applicationareas of SOM [10]. �ese features make SOM an appropriate choiceas a clustering algorithm for this paper.

A basic SOM consists ofM neurons located on a low dimensionalgrid (usually 1 or 2 dimensional) [10]. �e algorithm responsiblefor the formation of the SOM involves three basic steps a�er ini-tialization: sampling, similarity matching, and updating. �esethree steps are repeated until formation of the feature map hascompleted. Each neuron i has a d-dimensional prototype weightvectorWi =Wi1,Wi1, ...,Wid . Given X is a d-dimensional sampledata(input vector), the algorithm is summarized as follows:

• Initialization:Choose randomvalues to initialize all the neuronweight

vectorsWi (0), i = 1, 2, ...,M, whereM is the total numberof neurons in the map.

• Sampling:Draw a sample data X from the input space with a

uniform probability.• Similarity Matching:

Find the best matching unit (BMU) or winner neuronof X , denoted here by b which is the closest neuron (mapunit) to X in the criterion of minimum Euclidean distance,at time step n (nth training iteration).

b = argmini| |X −Wi (n)| |, i = 1, 2, ...,M (6)

• Updating:Adjust the weight vectors of all neurons by using the

update formula 7, so that the best matching unit (BMU)and its topological neighbors are moved closer to the inputvector X in the input space.

Wi (n + 1) =Wi (n) + η(n)hb,i (n)(X −Wi (n)) (7)

Where η(n) denotes the learning rate and hb,i (n) is thesuitable neighborhood kernel function centered on thewinner neuron.

�e distance kernel function can be, for example, Gauss-ian:

hb,i (n) = e− | |rb−ri | |

2

2σ 2(n) (8)Where rb and ri denote the positions of neuron b and

i on the SOM grid and σ (n) is the width of the kernel orneighborhood radius at step n. σ (n) decreases monoton-ically along the steps as well. �e initial value of neigh-borhood radius σ (0) should be fairly wide to avoid theordering direction of neurons to change discontinuously.σ (0) can be properly set to be equal to or greater than halfthe diameter of the map. Formula 9 gives the initial valueof the neighborhood radius for a map of size a by b.

σ (0) =√a2 + b2

2 (9)

• Continuation:Continue with sampling until no noticeable changes in

the feature map are observed or the pre-de�ned maximumnumber of iterations is reached.

�e most commonly used visualization techniques of SOM are theU-Matrix and Hit histogram.

1 2

Figure 2: U-matrix of a trained SOM

�e U-matrix [10] holds all distances between neurons and theirimmediate neighbor neurons. Figure 2 shows the U-matrix of atrained map on a input data set that has two clusters. �e lighterthe color in the hexagon connecting any two neurons, the smalleris the distance between them. From the U-matrix, two large lightregions can be visualized. One is towards the le�, while the otheris to the right. �ese regions present the two clusters obtained ontraining the input data set. �e U-matrix gives a direct visualizationof the number of clusters and their distribution.

1

2

Figure 3: Hit histogram of a trained SOM

�e hit histogram of the input data set on the trained map pro-vides a visualization that details the distribution of input data acrossthe clusters. Each input data instance in the data set can be pro-jected to the closest neuron on a trained SOM map. �e closestneuron is called the best matching unit (BMU) of the input data

Page 5: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

Biomedical Document Clustering and Visualization based on the Concepts of Diseases KDD’17, August 2017, Halifax, NS, CANADA

instance. �e hit histogram is constructed by counting the numberof hits each neuron receives from the input data set. Figure 3 showsthe hit histogram of an input data set on the trained SOM map.Each hexagon represents one neuron on the map. �e size of themarker indicates the number of hits the neuron receives. �us, alarger marker is representative of a larger number of hits on thatneuron. Based on the hit histogram, it is visualized that most ofthe input data hits neurons in the le� and right regions. �ese tworegions correspond to the two clusters on the U-matrix shown inFigure 2.

7 EXPERIMENT SETTING AND RESULTANALYSIS

To evaluate the proposed biomedical document clustering frame-work that is based on the concepts of diseases, two subsets of largebiomedical document collections have been used: PubMed CentralOpen Access and Ohsumed collection. �e details of these twodocument collections and the corresponding clustering results andvisualization are detailed in the following subsections.

7.1 DatasetsPubMed Central Open Access data set has been used by many re-search projects to examine tasks of biomedical literature clusteringand classi�cation [20] [21]. It is also a part of the training corpusfor the Word2Vec model used in this research. Ohsumed documentcollection is a data set that has been used by many researchers [6][16] for text mining. Although the Ohsumed collection includesdocuments that are not up to date, it is used to evaluate the robust-ness of the proposed document clustering framework on a dataset where concepts might be presented di�erently than the onesincluded in the training corpus for Word2Vec.

7.1.1 PubMedCentralOpenAccess (PMC-OA). �ePubMedCentral Open Access [3] is a subset of over 1 million articles fromthe total collection of articles in PMC. For this research, a set of600 articles were randomly selected from the ‘A-B’ subset whichincludes articles from journals whose names start with le�er ‘A’ or‘B’. �e number of selected articles from each journal is shown inTable 2.

To be consistent with the data set - Ohsumed Collection, onlycontent in ‘Title’ and ‘Abstract’ sections from these documentsare used. 658 unique concepts of diseases are identi�ed by UMLSMetaMap. Figure 4 shows the distribution of these concepts basedon the number of words in each concept.

7.1.2 Ohsumed Collection. �eOhsumed collection [5] usedhere includes the abstracts of 20,000 articles. �ese articles arerelated to cardiovascular diseases and are further categorized into23 cardiovascular disease categories. For this research, a subset of600 documents is randomly selected. �ese documents cover allthe 23 categories. Table 3 shows the number of documents selectedfrom each category.

A�er concepts of diseases are identi�ed by using the UMLSMetaMap, 67 documents have no disease-related concept identi�ed.�ese documents are not included in the experiments. In total,1449 concepts of diseases are identi�ed and extracted from the533 documents. Figure 4 shows the distribution of these concepts

Table 2: PMC-OA Data Set

Name of journal # of documentsAmerican Journal of Hypertension 13Augmentative and alternative communication 2Ancient Science of Life 3Bioinformatics and biology insights 45Allergy and asthma proceedings 28BoneKEy reports 4Anesthesia, essays and researches 135Biological trace element research 31Bone Marrow Research 1Brain and language 1American journal of physiology. 11Endocrinology and metabolismAphasiology 3Annals of rehabilitation medicine 323

Table 3: Ohsumed Collection

Category Label # of documentsBacterial Infections and Mycoses C01 22Virus Diseases C02 23Parasitic Diseases C03 29Neoplasms C04 26Musculoskeletal Diseases C05 25Digestive System Diseases C06 23Stomatognathic Diseases C07 23Respiratory Tract Diseases C08 24Otorhinolaryngologic Diseases C09 29Nervous System Diseases C10 23Eye Diseases C11 28Urologic and Male Genital Diseases C12 26Female Genital Diseases and C13 27Pregnancy ComplicationsCardiovascular Diseases C14 27Hemic and Lymphatic Diseases C15 28Neonatal Diseases and Abnormalities C16 25Skin and Connective Tissue Diseases C17 28Nutritional and Metabolic Diseases C18 27Endocrine Diseases C19 30Immunologic Diseases C20 26Disorders of Environmental Origin C21 26Animal Diseases C22 25Pathological Conditions, Signs C23 29and Symptoms

based on the number of words in the concepts in comparison withthe concepts extracted from PMC-OA. Although the total numberof concepts of diseases extracted from the Ohsumed collection ishigher than that are extracted from the PMC-OA. �e distribu-tion based on the number of words in the concepts is very similar.Ohsumed collection has a slightly higher radio of concepts of oneword whereas PMC-OA collection has a higher radio of concepts

Page 6: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

KDD’17, August 2017, Halifax, NS, CANADA Setu Shah and Xiao Luo

Figure 4: Distribution of the concepts based on the numberof words

spanning two words. �e percentages of concepts with 3 or morewords are almost the same.

7.2 Clustering, Visualization and DiscussionSOM has been used for document clustering a�er concepts extrac-tion and document representation using the proposed weightingscheme. �e size of the map is 10 by 10 which contains 100 neurons.�e training iterations are set to be 50,000.

Figure 5 shows the clustering results of document collectionPMC-OA. Among 658 concepts that are extracted by using theUMLS MetaMap, 180 concepts occur in more than 1 document.Compare to the Ohsumed collection, a larger number of conceptshave document frequency more than 1. �at means the TF-IDFvalue in the weighting scheme has more impact on PMC-OA datasetthan it does on the Ohsumed collection. �e U-matrix of the trainedSOM map shows more clear boundaries than that of the Ohsumedcollection.

From the U-matrix and hit histogram, 8 clusters can be clearlyidenti�ed based on the darker colored neurons surrounding them.Cluster 5 includes a large cluster with documents consisting ofconcepts of diseases like ‘obesity’, ‘diabetes’, ‘hypertension’, ‘hy-perglycemia’, and so on. �is cluster also includes other diseasessuch as ‘coronary artery disease’, since these diseases are highlyrelated. �e ‘coronary artery disease’ might be an outcome of‘hypertension’, ‘hyperglycemia’ or their combination. Cluster 6contains documents discussing infections related to diseases suchas ‘Malaria’, and other respiratory infection diseases such as ‘tuber-culosis’ and ‘bronchitis’ are also included in this cluster. Cluster 8is a smaller cluster which mainly includes chest infection relateddocuments. Cluster 7 is another smaller cluster about concepts ofinfections such as avian in�uenza. Cluster 6, 7 and 8 are close toeach other since they are all about infections, but each is focusedon a smaller areas of infections. Cluster 4 contains three neuronswhich present three di�erent types of concepts. �e le� region isdominated by documents which discuss ‘dysphagia’ and similarconcepts such as ‘laryngospasm’. �e concepts in the right half ofthe cluster include spinal disorders like ‘stenosis’, ‘scoliosis’ and‘spinal instability’. �e top of the cluster is dominated by pain re-lated diseases and syndromes. Cluster 3 contains documents that

are related to ‘paralysis’ and damage of nerves. Many of them dis-cuss paralysis of the face, hands (‘carpel tunnel syndrome’), spine(‘spinal cord atrophy’), brain (‘celebral palsy’), legs (‘spastic foot’),and so on. All these concepts are related and close to each other,and thus the cluster is well formed. Cluster 2 is a cluster of docu-ments with concepts about di�erent cardiovascular diseases suchas ‘hypertension’, ‘myocardial infarction’, ‘coronary artery disease’,‘coronary heart disease’, and ‘ischemic strokes’. �is cluster alsoconsists of a few documents that talk about brain strokes arisingfrom lesions in the brain, or lead to speech disorders. More thanhalf of the documents in this cluster discuss strokes and closelyrelated cardiac concepts. One interesting �nding is that cluster 1contains all the documents in which the only one concept is identi-�ed by UMLSMetaMap is ‘stroke’. However, further analysis showsthat these documents have nothing to do with ‘stroke’ as a disease.�is shows the UMLS MetaMap cannot always accurately map allconcepts to the semantic types.

Figure 6 shows the clustering results of Ohsumed collectionbased on the concepts of diseases that are extracted. Based on thedocument frequencies of the concepts of Ohsumed collection, thereare 1108 concepts out of the total 1449 that occurs only in onedocument and there are total 1414 concepts occur in less than �vedocuments. �at means the weights of these concepts rely heavilyon the similarity measurement between the concepts.

Based on the original data set description, all documents arerelated to cardiovascular diseases. �is lead to the shorter distancesbetween neurons which is re�ected by the color of the U-matrix.By analyzing the U-matrix and hit histogram of the trained map,8 clusters are identi�ed. A majority of the documents in cluster7 are about infections and infectious diseases, with half of themfrom the categories of bacterial infections and mucoses (C01), virusdiseases (C02) and parasitic diseases (C03). �e rest of the docu-ments from this cluster discuss other infections from categories likerespiratory tract diseases (C08) and digestive system diseases (C06).�ere are also a few documents from immunologic diseases (C19)in this cluster. Notably, all of the documents talk about infectionsof di�erent types. Cluster 1 has documents that discuss diseasesabout the nervous system. Whereas, the documents in cluster 2discuss neoplasms which include di�erent types of cancers of thebrain, prostate, neck and so on. �e documents in this cluster arefrom all the categories except virus diseases (C02) and diseases ofenvironmental origin (C21). Cluster 3 includes documents aboutdiseases related to hormone secretion and distribution. �is clusteralso includes diseases of the bones and blood, since these conceptsare closely related. Cluster 4 contains documents with diseasesabout the ear, nose, throat, head and surrounding areas of the face.Cluster 5 is the smallest cluster and the documents concentrate ondi�erent types of tuberculosis and sexually transmi�ed diseaseslike AIDS, HPV, etc. Documents about ‘cryptococcosis’, which iso�en seen in patients with HIV whose immunity has been low-ered, also fall in this cluster. Cluster 6 consists of documents withconcepts relating to diabetes. Documents containing concepts like‘nephropathy’, ‘impaired glucose tolerance’,‘non-insulin dependentdiabetes’ are in the le� half of the cluster. Whereas, the right halfof the cluster is dominated by documents with concepts such as‘Crohn’s disease’, ‘renal ulceration’, and ‘kidney stone’. Cluster 8has documents about diseases related to di�erent heart conditions

Page 7: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

Biomedical Document Clustering and Visualization based on the Concepts of Diseases KDD’17, August 2017, Halifax, NS, CANADA

Cluster 5: MetabolicHypertension, Obesity, Diabetes...

Cluster 7: RespiratoryDry Mouth, Asthma, Sore Throat...

Cluster 8: ChestPneumoperitoneum, Pneumothorax...

Cluster 6: InfectionsArrested, Malaria...

Cluster 4: MultipleDysphagia, Myofascial Pain Syndrome, Low Back Pain, 

Spinal Stenosis...

Cluster 3: ParalysisCerebral Palsy, Carpel Tunnel Syndrome, Neuropathy...

Cluster 2: CardiovascularStroke, Dysphagia, Myocardial Infarction...

Cluster 1: StrokeStroke...

12

3

4

5

67

8

1

2

3

4

5

67

8

Figure 5: Clustering results of PubMed Central Open Access

Cluster 7: InfectionsMalaria, Hepatitis, Sexually Transmitted Diseases, 

Pyomyositis...

Cluster 1: Nervous SystemSpinal Cord Ischemia, Paraplegia, Sphenoid Sinusitis...

Cluster 2: CancerTumour, carcinoma, sarcoma, thymoma...

Cluster 8: CardiovascularHypertension, Hyperglycemia, Stroke, CHD...

Cluster 3: Endocrine, Digestive, OrthopedicLiver Cirrhosis, Pancreatitis, Arthritis, Myalgia...

Cluster 5: ImmunologicHIV, Tuberculosis, Cryptoccosis...

Cluster 4: Otorhinolaryngologic, GeneticHeadache, Migraine, Glaucoma, Anaemia...

Cluster 6: Metabolic, Gastro‐IntestinalDiabetes, Hyperglycemia, Crohn’s Disease, Duodenal Ulcer...

12

3

4

5

6

7

8

1

2

34

5

6

7

8

Figure 6: Clustering results of Ohsumed Collection

Page 8: Biomedical Document Clustering and Visualization based on ...the SOM map, the distribution of the inputs can be visualized on ... surement and the concept weighting scheme are •rst

KDD’17, August 2017, Halifax, NS, CANADA Setu Shah and Xiao Luo

and obstruction in the �ow of blood. Since the theme of documentsin Ohsumed collection is cardiovascular concepts, this cluster hasdocuments from all of the categories except parasitic diseases (C03),neoplasms (C04) and digestive system diseases (C06).

It is worth noting that only concepts of diseases are extractedfrom both data sets and used for document clustering. While theoriginal category labels of the Ohsumed collection might not beassigned based on the concepts of diseases, thus, these labels areused to evaluate the clustering performance.

Overall, the proposed document clustering and visualizationframework works well on both data sets. Although Ohsumed col-lection has much more concepts diseases, and the majority of themhave very low document frequency. On the U-matrix, the colors ofthe neurons surrounding the clusters demonstrate how separatedthese clusters are. �e darker the color is, the more separated theyare. �at means clusters are more unrelated. �e clusters on theU-matrix of the PMC-OA appear to be more separated than thoseof the Ohsumed collection. One reason could be that all documentsare related to cardiovascular diseases, so the clusters locate moreclosely on the U-matrix.

8 CONCLUSION AND FUTUREWORKIn this paper, a biomedical document clustering framework basedon concepts of diseases is proposed. �e concepts of diseases areidenti�ed by using UMLS MetaMap. Instead of using an existingontology to generate concept representation, the concepts are rep-resented by using vectors based on a combination of TF-IDF andWord2Vec models. �e proposed similarity measure is based onthe vector representations of the concepts and shows that closelyassociated concepts of diseases have higher similarity scores thanothers. A representation of documents that considers the localcontent and semantic similarity between the concepts within thedocuments is used. A weighting scheme using TF-IDF combinedwith similarity score between the concepts is proposed. Insteadof focusing on clustering performance evaluation, clustering vi-sualization is explored in this research. Self-Organizing Map is aclustering algorithm that provides a visualization aid to understandthe clusters and distribution of the clusters, and is thus used inthis research. �e results show that the clustering occurs alongconcepts of similar nature, of similar area and organs of the body,and concepts which are synonymous to one another. Nearby clus-ters are related in most cases, as well. �is kind of visualizationwill help researchers explore related articles based on concepts ofdiseases.

Potential future work includes visualizing clusters of larger cor-pora by using a hierarchical clustering architecture, evaluating thisvisualization aid for the task of biomedical document search andextending this framework to biomedical document clustering basedon concepts of symptoms and treatments.

REFERENCES[1] [n. d.]. Fact Sheet - UMLS Metathesaurus. h�ps://www.nlm.nih.gov/pubs/

factsheets/umlsmeta.html.[2] [n. d.]. MetaMap - A Tool For Recognizing UMLS Concepts in Text. h�ps://metamap.

nlm.nih.gov/.[3] [n. d.]. Open Access Subset. h�ps://www.ncbi.nlm.nih.gov/pmc/tools/open�list/.[4] [n. d.]. SNOMED CT. h�ps://www.nlm.nih.gov/healthit/snomedct/.[5] [n. d.]. Text Categorization Corpora. h�p://disi.unitn.it/moschi�i/corpora.htm.

[6] Stephan Bloehdorn, Philipp Cimiano, and Andreas Hotho. 2006. Learning ontolo-gies to improve text clustering and classi�cation. In From data and informationanalysis to knowledge engineering. Springer, 334–341.

[7] Carsten Gorg, Hannah Tipney, Karin Verspoor, William A Baumgartner Jr, K Bre-tonnel Cohen, John Stasko, and Lawrence E Hunter. 2010. Visualization andlanguage processing for supporting analysis across the biomedical literature.In International Conference on Knowledge-Based and Intelligent Information andEngineering Systems. Springer, 420–429.

[8] Jun Gu,Wei Feng, Jia Zeng, HiroshiMamitsuka, and Shanfeng Zhu. 2013. E�cientsemisupervised MEDLINE document clustering with MeSH-semantic and global-content constraints. IEEE transactions on cybernetics 43, 4 (2013), 1265–1276.

[9] Rasmus Knappe, Henrik Bulskov, and Troels Andreasen. 2007. Perspectives onontology-based querying. International Journal of Intelligent Systems 22, 7 (2007),739–761.

[10] Teuvo Kohonen. 1998. �e self-organizing map. Neurocomputing 21, 1 (1998),1–6.

[11] Teuvo Kohonen, Samuel Kaski, Krista Lagus, Jarkko Salojarvi, Jukka Honkela,Vesa Paatero, and An�i Saarela. 2000. Self organization of a massive documentcollection. IEEE transactions on neural networks 11, 3 (2000), 574–585.

[12] S Logeswari and K Premalatha. 2013. Biomedical document clustering usingontology based concept weight. In Computer Communication and Informatics(ICCCI), 2013 International Conference on. IEEE, 1–4.

[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Je� Dean. 2013.Distributed representations of words and phrases and their compositionality. InAdvances in neural information processing systems. 3111–3119.

[14] SPFGH Moen and Tapio Salakoski2 Sophia Ananiadou. 2013. Distributionalsemantics resources for biomedical text processing. (2013).

[15] Philip Resnik et al. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language.J. Artif. Intell. Res.(JAIR) 11 (1999), 95–130.

[16] Mondelle Simeon and Robert Hilderman. 2008. Categorical proportional di�er-ence: A feature selection method for text categorization. In Proceedings of the 7thAustralasian Data Mining Conference-Volume 87. Australian Computer Society,Inc., 201–208.

[17] Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection.In Proceedings of the 32nd annual meeting on Association for ComputationalLinguistics. Association for Computational Linguistics, 133–138.

[18] Illhoi Yoo and Xiaohua Hu. 2006. A comprehensive comparison study of docu-ment clustering for a biomedical digital library MEDLINE. In Proceedings of the6th ACM/IEEE-CS joint conference on Digital libraries. ACM, 220–229.

[19] X. Yoo, I. Hu and I.-Y. Song. 2006. A coherent graph-based semantic clusteringand summarization approach for biomedical literature and a new summarizationevaluation method. In First International Workshop on Text Mining in Bioinfor-matics Proceedings. 84–89. h�ps://doi.org/10.1186/1471-2105-8-s9-s4

[20] Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, and Xiaohua Zhou. 2007.A comparative study of ontology based term similarity measures on PubMeddocument clustering. Advances in Databases: Concepts, Systems and Applications(2007), 115–126.

[21] Shanfeng Zhu, Jia Zeng, and Hiroshi Mamitsuka. 2009. Enhancing MEDLINEdocument clustering by incorporating MeSH semantic similarity. Bioinformatics25, 15 (2009), 1944–1951.