Ref. code: 25595822040902FWM
CONCEPT NAME SIMILARITY MEASURE ON
SNOMED CT
BY
HTET HTET HTUN
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
(ENGINEERING AND TECHNOLOGY)
SIRINDHORN INTERNATIONAL INSTITUTE OF TECHNOLOGY
THAMMASAT UNIVERSITY
ACADEMIC YEAR 2016
Abstract
CONCEPT NAME SIMILARITY MEASURE ON SNOMED CT
by
HTET HTET HTUN
B.C.Sc.: Bachelor of Computer Science, M.C.Sc.: Master of Computer Science, School of Information Computer and Communication Technology (ICT), University of Computer Studies (UCSM), Mandalay, Myanmar, 2016
Measuring semantic similarity between concepts by exploiting medical ontologies is an essential task for extracting medical information and for knowledge discovery. One important application is a health decision support system that recommends similar or alternative treatments for disease concepts from medical ontologies according to their similarity degrees. In the past, existing similarity measures estimated similarity from the taxonomic path length between the two evaluated concepts and their distance within the ontology hierarchy. However, taxonomic-based similarity measures are not suitable for all concepts in an ontology, because an ontology includes “primitive concepts” that carry a limited amount of information and whose definitions do not sufficiently distinguish them from the definitions of other concepts. Therefore, measuring similarity from taxonomic paths cannot give the desired similarity degrees for all ontology concepts. For this reason, we propose a new concept name similarity measure based on the semantic and syntactic similarities of the concept label. The proposed measure is mainly intended for primitive concept similarity. To examine the accuracy of the proposed method, we calculate correlation and error measurements against human expert results. Moreover, we compare the results of the proposed method with those of existing taxonomic-based similarity measures that obtained the highest correlation values among most existing measures in the literature. The experiments show that our method achieves the highest correlation with human experts and outperforms previous similarity measures. Additionally, the experimental results show that the proposed method is suitable for all types of ontology concepts, both defined concepts and primitive concepts.
Keywords: Concept Name Similarity Measure, Text Similarity, Natural Language
Processing, SNOMED CT, Semantic Similarity
Acknowledgements
I would like to express my sincere gratitude to my advisor, Dr. Virach Sornlertlamvanich, for his valuable advice, support, encouragement, kindness and patience throughout my study.
My thanks also go to the committee members, Dr. Marut Buranarach and Dr. Nguyen Duy Hung, for their valuable comments, support and guidance.
I also want to thank all faculty members, seniors and my friends for their encouragement, discussions and assistance during my studies.
I would like to thank my parents for their kindness, love, valuable support, understanding and strength throughout my life.
Finally, I gratefully acknowledge Sirindhorn International Institute of Technology (SIIT), Thammasat University (TU), for giving me the chance to earn my second Master's degree.
Table of Contents
Chapter Title
Signature Page
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Text Similarity
1.1.1 WordNet
1.2 Biomedical Knowledge Sources and Ontologies
1.2.1 UMLS (Unified Medical Language System)
1.2.2 SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms)
1.2.3 MeSH (Medical Subject Headings)
2 Literature Review
2.1 Text Similarity Measures
2.1.1 Unordered-based Text Similarity Measures
(i) Jaccard Similarity Coefficient
(ii) Cosine Similarity Coefficient
(iii) Szymkiewicz-Simpson Coefficient
(iv) Tversky Coefficient
(v) Difflib Similarity
2.1.2 Ordered-based Text Similarity Measures
(i) Levenshtein Distance
2.2 Ontology-based Semantic Similarity Measures
2.2.1 Taxonomic-based Similarity Measures
(i) Leacock and Chodorow
(ii) Wu and Palmer
(iii) Choi and Kim
(iv) Al-Mubaid and Nguyen
(v) A New Path-based Similarity Measure
2.2.2 Description Logic ELH Semantic Similarity Measure (ELSIM)
3 Concept Name Similarity Measure on SNOMED CT
3.1 Concept Name Similarity Measure on SNOMED CT
3.1.1 Semantic Similarity (Linguistic Headword Structure)
3.1.2 Syntactic Similarity (Context-free Grammar)
3.1.3 Proposed Similarity Measure
4 Experimental Results and Discussion
4.1 Preliminary Experiment
4.2 Main Experiment on SNOMED CT
4.2.1 Experiments between Primitive Concepts
4.2.2 Experiments between Defined and Primitive Concepts
4.2.3 Experiments between Defined Concepts
4.3 Discussion
4.3.1 Limitations
5 Conclusions and Recommendations
References
Appendices
List of Tables
Tables
3.1 Incorrect similarity degree between primitive concepts using existing two similarity measures
3.2 Different weights of concept P1
3.3 Different weights of concept P2
4.1 Results of similarity degrees for all categories of SNOMED CT based on text similarity measures
4.2 Results of 30 pairs of concepts between primitive concepts estimated by path-based, ELSIM, our proposed method and human expert
4.3 Results of 30 pairs of concepts between primitive and defined concepts estimated by path-based, ELSIM, our proposed method and human expert
4.4 Results of 30 pairs of concepts between defined concepts estimated by path-based, ELSIM, our proposed method and human expert
4.5 Correlation values between similarity measures and human expert for each case
4.6 Error values between similarity measures and human expert for each case
4.7 Different similarity results between concepts using our proposed measure with human expert results
4.8 Similarity degrees between concepts using our proposed measure with human expert results
List of Figures
Figures
2.1 Text Similarity Measures
3.1 Overview system of concept name similarity measure on SNOMED CT
3.2 Notion of proposed similarity measure
Chapter 1 Introduction
1.1 Text Similarity
Measuring the similarity between word pairs has been extensively studied in many areas such as natural language processing [1], machine translation, text classification and summarization, query reformulation, knowledge acquisition and information retrieval [2]. In addition, an increasing number of tasks require computing the similarity between two strings. Generally, the notion of similarity refers to lexical similarity, which is based on the total overlap between vocabularies and common words. Similarity is further divided into “semantic similarity”, based on meaning or semantic content, and “syntactic similarity”, based on string format or syntactic representation. Consequently, various approaches have recently been proposed for measuring concept similarity using various knowledge sources (ontologies, domain corpora, thesauri, etc.) [3], because knowledge sources provide a structured, unambiguous representation and a formal conceptualization of knowledge. A general-purpose thesaurus of the English language has also been successfully applied to assessing word similarity.
1.1.1 WordNet
WordNet is a semantic lexical database for the English language developed at Princeton University. In WordNet [4], English nouns, adjectives, verbs and adverbs are arranged into synonym sets (synsets) with short definitions (glosses). The synsets, or concepts, are connected with other synsets in the taxonomy by various types of relationship. The usual relationships are hyponym/hypernym (the is-a relation) and meronym/holonym (the part-of relation).
However, measuring the similarity of biomedical terms with WordNet performs poorly because of its limited coverage of the specialized domain. Many biomedical ontologies provide the concept IDs, terms, synonyms and definitions used in clinical documentation and reporting in the biomedical field.
1.2 Biomedical Knowledge Sources and Ontologies
1.2.1 UMLS (Unified Medical Language System)
The UMLS was established by the United States National Library of Medicine and includes a set of files that describe health and biomedical vocabularies. UMLS comprises three knowledge sources (the Metathesaurus, the Semantic Network and the SPECIALIST Lexicon) and software tools for accessing them [5]. SNOMED CT is one of the source vocabularies of the Metathesaurus, and the Semantic Network provides a set of broad categories for a consistent categorization of all concepts. The SPECIALIST Lexicon includes terms with linguistic information that cover the biomedical and healthcare domain. UMLS is freely accessible for research purposes, but a license is required.
1.2.2 SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms)
SNOMED CT is a standard biomedical terminology [6] supported by the International Health Terminology Standards Development Organization (IHTSDO), which validates the contents every 6 months. SNOMED CT covers all areas of clinical information, which are organized into 18 top-level categories: body structure, context-dependent, environment, event, finding, observable entity, organism, physical force, physical object, procedure, product, qualifier value, social concept, special concept, specimen, staging scale, substance and disease.
In SNOMED CT, concepts are organized in a hierarchy with various levels of specificity. There are 65 different relationship types, and concepts are connected by two main relations: the “is-a” relation and the “part of” relation [7]. Each concept is uniquely identified by a concept ID (e.g. id=19036004), annotated with a short textual description (e.g. “rheumatic heart valve stenosis”) and equipped with a definition. The SNOMED CT ontology has two kinds of concepts, defined concepts and primitive concepts, and its DL version released in January 2005 contains 364,461 concept names.
1.2.3 MeSH (Medical Subject Headings)
MeSH is a hierarchical structure of biomedical concepts created by the United States National Library of Medicine (NLM). It was introduced in 1960, with the NLM’s own index catalogue [8]. MeSH terms are arranged in an “is-a” hierarchy, with more general terms (e.g. “chemicals and drugs”) higher in the hierarchy than more specific terms (e.g. “aspirin”). It includes 15 taxonomies with more than 22,000 terms (version 2004), and each concept can occur in more than one hierarchy. Each term is described by different features; the main ones are the MeSH Heading (MH), the Scope Note and the Entry Terms. Each concept is identified by its MeSH code, which shows the precise location of the term in the MeSH hierarchy.
Chapter 2 Literature Review
Over the years, many text similarity approaches have been proposed for various applications. Lexical similarity, or surface-matching similarity, measures are the basic building blocks of text similarity. More recently, several similarity approaches have used different knowledge sources as their background ontologies. Among them, some approaches have been adapted to medical research by incorporating clinical information from biomedical ontologies such as SNOMED CT or MeSH. In this chapter, we review previous work on both basic text similarity measures and semantic similarity measures based on medical ontologies.
2.1 Text Similarity Measures
Basic text similarity approaches can be categorized into two types: unordered-based methods and ordered-based methods.
[Figure 2.1 diagram: Text Similarity divides into Unordered-based (Jaccard, Cosine, Simpson, Tversky, Difflib) and Ordered-based (Levenshtein).]
Figure 2.1 Text Similarity Approaches
2.1.1 Unordered-based Text Similarity Measures
These similarity approaches do not consider the order of the words when comparing two strings. They compute the similarity from the total overlap of words between the strings. In the equations below, tset(A) denotes the set of words of string A.
(i) Jaccard Similarity Coefficient
The Jaccard similarity [9] is defined as the ratio between the sizes of the intersection and the union of the two sets, as shown in Equation 2.1.
\[ tsim_{Jaccard}(A,B) = \frac{|tset(A) \cap tset(B)|}{|tset(A) \cup tset(B)|} \tag{2.1} \]
(ii) Cosine Similarity Coefficient
This similarity measure [10] computes the cosine of the angle between two vectors of an inner product space. It is defined using the dot product of the word vectors and their magnitudes ||.||, as in Equation 2.2.
\[ tsim_{Cosine}(A,B) = \frac{tset(A) \cdot tset(B)}{\|tset(A)\| \, \|tset(B)\|} \tag{2.2} \]
(iii) Szymkiewicz-Simpson Coefficient
This method finds the overlap between two strings as the ratio of the cardinality of the intersection to the smaller of the cardinalities of the two sets [11]. If one set is a subset of the other, the overlap coefficient is equal to one.
\[ tsim_{Simpson}(A,B) = \frac{|tset(A) \cap tset(B)|}{\min(|tset(A)|, |tset(B)|)} \tag{2.3} \]
(iv) Tversky Coefficient
It is an asymmetric similarity measure on sets [12]. The numerator represents the commonality between the two sets and the denominator represents the referent for comparison, as in Equation 2.4. In the Tversky index, $\alpha = \beta = 1$ gives the Jaccard similarity and $\alpha = \beta = 0.5$ gives the Dice similarity.
\[ tsim_{Tversky}(A,B) = \frac{|tset(A) \cap tset(B)|}{|tset(A) \cap tset(B)| + \alpha\,|tset(A) - tset(B)| + \beta\,|tset(B) - tset(A)|} \tag{2.4} \]
(v) Difflib Similarity
Difflib similarity is defined as twice the number of matching words (M) divided by the total number of words (T) in both sets [13]. In Difflib, a multiset, denoted by tmset(.), is used to find the similarity. The number of matching words M is the cardinality of the intersection of the multisets, and T is defined as follows.
\[ M = |tmset(A) \cap tmset(B)| \]
\[ T = |tmset(A)| + |tmset(B)| \]
\[ tsim_{Difflib}(A,B) = 2 \times M / T \tag{2.5} \]
2.1.2 Ordered-based Text Similarity Measures
These measures compute the similarity by taking into account not only the common words but also the order in which they occur in the strings. They give lower similarity values between two strings than unordered-based measures, because they also consider the ordering of the words.
(i) Levenshtein Distance
It is the edit distance, that is, the smallest number of operations (insertions, deletions and substitutions) required to convert the source string (s) into the target string (t) [14]. It is computed with a matrix that measures the difference between the two sequences, using the following algorithm. The distance is the value in the lower right-hand corner of the matrix.
Algorithm LevenshteinDistance (s,t)
Input: Two lists of words s,t
Output: distance d
Initialization:
    len1 ← length(s)
    len2 ← length(t)
    D[ ][ ] ← array of size (len1 + 1) × (len2 + 1)
    D[i, 0] ← i for i = 0, ..., len1
    D[0, j] ← j for j = 0, ..., len2
Processing:
    for i ← 1 to len1 do
        for j ← 1 to len2 do
            if s[i] equals t[j] then cost ← 0
            else cost ← 1
            D[i, j] ← minimum of:
                i) D[i − 1, j] + 1
                ii) D[i, j − 1] + 1
                iii) D[i − 1, j − 1] + cost
Termination:
    d ← D[len1, len2] is the distance
To obtain a similarity from the Levenshtein distance, the distance must be normalized into $d_{norm}$, which lies in the range from 0 to 1, as in Equation 2.6, where lendiff is the difference between the lengths of the two lists.
\[ d_{norm}(s,t) = \frac{d - lendiff}{\min(|s|, |t|)} \tag{2.6} \]
After obtaining $d_{norm}$, there are two ways to calculate the similarity:
\[ tsim_{Leven1}(A,B) = 1 - d_{norm}(tlist(A), tlist(B)) \tag{2.7} \]
\[ tsim_{Leven2}(A,B) = \left( \frac{1}{1 + d_{norm}(tlist(A), tlist(B))} \right) \times 2 - 1 \tag{2.8} \]
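The Levenshtein algorithm and Equations 2.6 to 2.8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the function names are hypothetical, and the treatment of an empty word list (where Equation 2.6 would divide by zero) is an assumption.

```python
def levenshtein_distance(s, t):
    """Word-level edit distance between two lists of words."""
    len1, len2 = len(s), len(t)
    # D[i][j] = distance between the first i words of s and the first j words of t
    D = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(len1 + 1):
        D[i][0] = i
    for j in range(len2 + 1):
        D[0][j] = j
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[len1][len2]


def normalized_distance(s, t):
    """Equation 2.6: (d - lendiff) / min(|s|, |t|)."""
    shortest = min(len(s), len(t))
    if shortest == 0:
        # Assumption: two empty lists are identical, otherwise maximally distant.
        return 0.0 if len(s) == len(t) else 1.0
    lendiff = abs(len(s) - len(t))
    return (levenshtein_distance(s, t) - lendiff) / shortest


def tsim_leven1(a, b):
    """Equation 2.7."""
    return 1.0 - normalized_distance(a.split(), b.split())


def tsim_leven2(a, b):
    """Equation 2.8."""
    return 2.0 / (1.0 + normalized_distance(a.split(), b.split())) - 1.0


p1 = "right main coronary artery thrombosis"
p2 = "superior mesenteric vein thrombosis"
print(levenshtein_distance(p1.split(), p2.split()))
print(tsim_leven1(p1, p2), tsim_leven2(p1, p2))
```

For this pair only the final word matches, so the word-level distance is 4, the length difference is 1, and the normalized distance is 3/4.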
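For comparison, the unordered measures of Section 2.1.1 (Equations 2.1 to 2.5) can be sketched in the same style. The function names are hypothetical, lower-casing and whitespace tokenization are assumptions, and returning 0.0 for empty inputs is a choice made here to avoid division by zero. Note that Python's own difflib.SequenceMatcher.ratio() also returns 2·M/T but counts matches in sequence order; the function below follows the multiset definition of Equation 2.5 instead.

```python
import math
from collections import Counter


def tset(text):
    """Set of words of a string (tset in Equations 2.1 to 2.4)."""
    return set(text.lower().split())


def jaccard(a, b):
    A, B = tset(a), tset(b)
    return len(A & B) / len(A | B) if A | B else 0.0


def cosine(a, b):
    # Word sets are treated as binary vectors, so the dot product
    # equals the size of the intersection.
    A, B = tset(a), tset(b)
    return len(A & B) / math.sqrt(len(A) * len(B)) if A and B else 0.0


def simpson(a, b):
    A, B = tset(a), tset(b)
    return len(A & B) / min(len(A), len(B)) if A and B else 0.0


def tversky(a, b, alpha=1.0, beta=1.0):
    A, B = tset(a), tset(b)
    common = len(A & B)
    denom = common + alpha * len(A - B) + beta * len(B - A)
    return common / denom if denom else 0.0


def difflib_sim(a, b):
    # Multisets (tmset) keep repeated words; & on Counters takes minimum counts.
    A, B = Counter(a.lower().split()), Counter(b.lower().split())
    M = sum((A & B).values())
    T = sum(A.values()) + sum(B.values())
    return 2 * M / T if T else 0.0


p1 = "right main coronary artery thrombosis"
p2 = "superior mesenteric vein thrombosis"
for name, f in [("Jaccard", jaccard), ("Cosine", cosine),
                ("Simpson", simpson), ("Difflib", difflib_sim)]:
    print(name, round(f(p1, p2), 3))
print("Tversky (alpha = beta = 0.5):", round(tversky(p1, p2, 0.5, 0.5), 3))
```

With alpha = beta = 1 the Tversky function reduces to the Jaccard function, and with alpha = beta = 0.5 it reduces to the Dice coefficient, which matches the remark after Equation 2.4.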
2.2 Ontology-based Semantic Similarity Measures
Recently, knowledge sources and ontologies have been widely used in similarity research because they provide a structured and unambiguous representation of concepts interconnected by semantic pointers. Generally, the basic idea for computing concept similarity relies on the taxonomic structure, such as the minimum path length between the evaluated concepts. In this section, we discuss several taxonomic-based similarity measures. Moreover, some ontologies are written in the Description Logic ELH, so we also review the ELH semantic similarity measure for the SNOMED CT ontology.
2.2.1 Taxonomic-based Similarity Measures
Ontologies are directed graphs in which concepts are connected mainly by taxonomic (is-a) links and by other semantic links. Therefore, the basic way to find concept similarity is a taxonomic-based measure. In a taxonomy, the common way to determine the distance between two concepts c1 and c2 is to calculate the length of the shortest path connecting them [15].
\[ dis_{PL}(c_1,c_2) = \text{minimum number of taxonomical edges connecting } c_1 \text{ and } c_2 \tag{2.9} \]
(i) Leacock and Chodorow
This measure uses the minimum path length between two concepts, denoted by Np and counted from c1 to c2 including the two concepts themselves, and the maximum depth D of the ontology [16], [17].
\[ sim_{L\&C}(c_1,c_2) = -\log(N_p / 2D) \tag{2.10} \]
(ii) Wu and Palmer
It is a path-based measure that uses the depth of the two terms in the taxonomy, where N1 and N2 are the numbers of “is-a” relations from concepts c1 and c2 to their least common subsumer (LCS), and N3 is the depth from the LCS to the root of the ontology [16], [18].
\[ sim_{W\&P}(c_1,c_2) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3} \tag{2.11} \]
(iii) Choi and Kim
This approach is also a taxonomic-based measure [19]. It is based on the difference in depth levels of the two concepts c1 and c2 and on the length of the minimum path between them, as shown in Equation 2.12.
\[ sim_{CK}(c_1,c_2) = \frac{MAX\_PATH - path(c_1,c_2)}{MAX\_PATH} \times \frac{MAX\_LEVEL - diff\_level(c_1,c_2)}{MAX\_LEVEL} \tag{2.12} \]
(iv) Al-Mubaid and Nguyen
This approach accounts for the depth of the concept nodes and the path length between them [20]. The method takes the level of their least common subsumer (lcs) and the length of the minimum path between them, written l(c1, c2) in Equation 2.13.
\[ sim(c_1,c_2) = \log_2\big( [\,l(c_1,c_2) - 1\,] \times [\,D - depth(lcs(c_1,c_2))\,] + 2 \big) \tag{2.13} \]
(v) A New Path-based Similarity Measure
This measure calculates the similarity from the taxonomic paths connecting the two concepts. It considers all of the ancestors on all the taxonomic paths between the concepts [21]. It is based on the idea that pairs of concepts at an upper level of the hierarchy share few ancestors, so their similarity degree should be lower than that of pairs of concepts at a lower level, which share more ancestors. It calculates the similarity between concepts c1 and c2 from the ratio of the amount of non-shared knowledge to the total amount of shared and non-shared knowledge, and takes the negative logarithm of this ratio, as shown in Equation 2.14.
\[ sim(c_1,c_2) = -\log_2 \frac{|T(c_1) \cup T(c_2)| - |T(c_1) \cap T(c_2)|}{|T(c_1) \cup T(c_2)|} \tag{2.14} \]
Given the full taxonomy $H^{C}$ of the concepts (C) of an ontology, $T(c_i) = \{c_j \in C \mid c_j \text{ is superconcept of } c_i\} \cup \{c_i\}$ is the union of the ancestors of the concept $c_i$ and $c_i$ itself.
2.2.2 Description Logic ELH Semantic Similarity Measure (ELSIM)
In Description Logics (DLs), concept descriptions are defined with a set of constructors, a set of concept names CN and a set of role names RN. The set of concept descriptions for the specific DL ELH is denoted by Con(ELH) [22] and is defined as follows:
\[ C, D \rightarrow A \mid \top \mid C \sqcap D \mid \exists r.C \]
in which $\top$ denotes the top concept, $C, D \in Con(ELH)$, A is a concept name (CN) and r is a role name (RN). In DL, concept names appearing on the left-hand side of a definition are called “defined concept names” ($CN^{def}$). Other concept names are called “primitive concept names” ($CN^{pri}$). Therefore, $CN = CN^{pri} \cup CN^{def}$.
The ELSIM measure determines the similarity from the structural characterization of two concepts by constructing their description trees. It first constructs a description tree for each concept, from Top to the evaluated concept, using Algorithm 1 as follows.
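Algorithm 1 ELH description tree
Input: $\mathcal{P}_C$ and $\mathcal{E}_C$
Output: The description tree T
function build-tree($\mathcal{P}_C$, $\mathcal{E}_C$)
    1. Create a new tree T
    2. Create a new vertex $v \in V$
    3. $L(v) \leftarrow \mathcal{P}_C$
    4. for each $\exists r.C' \in \mathcal{E}_C$ do
    5.     build-child-node($v$, $r$, $\mathcal{P}_{C'}$, $\mathcal{E}_{C'}$)
    6. return T
function build-child-node($v$, $r$, $\mathcal{P}_C$, $\mathcal{E}_C$)
    1. Create a new vertex $w \in V$
    2. $L(w) \leftarrow \mathcal{P}_C$
    3. Add a new edge $(v, w)$ to E
    4. $\rho(v, w) \leftarrow \{r\}$
    5. for each $\exists s.C' \in \mathcal{E}_C$ do
    6.     build-child-node($w$, $s$, $\mathcal{P}_{C'}$, $\mathcal{E}_{C'}$)
After constructing the description trees, the measure computes the similarity with the tree homomorphism degree function, as follows.
Definition (Homomorphism degree)
Let $\mathcal{T}_C$ and $\mathcal{T}_D$ be the ELH description trees that correspond to two ELH concept names C and D, respectively. The homomorphism degree function is inductively defined as follows:
\[ hd(\mathcal{T}_D, \mathcal{T}_C) := \mu \cdot p\text{-}hd(\mathcal{P}_D, \mathcal{P}_C) + (1 - \mu) \cdot e\text{-}set\text{-}hd(\mathcal{E}_D, \mathcal{E}_C) \]
where $0 \le \mu \le 1$;
\[ p\text{-}hd(\mathcal{P}_C, \mathcal{P}_D) := \begin{cases} 1 & \text{if } \mathcal{P}_C = \emptyset \\ \dfrac{|\mathcal{P}_C \cap \mathcal{P}_D|}{|\mathcal{P}_C|} & \text{otherwise,} \end{cases} \]
where |.| represents the set cardinality;
\[ e\text{-}set\text{-}hd(\mathcal{E}_C, \mathcal{E}_D) := \begin{cases} 1 & \text{if } \mathcal{E}_C = \emptyset \\ 0 & \text{if } \mathcal{E}_C \neq \emptyset \text{ and } \mathcal{E}_D = \emptyset \\ \dfrac{\sum_{\epsilon_i \in \mathcal{E}_C} \max\{ e\text{-}hd(\epsilon_i, \epsilon_j) : \epsilon_j \in \mathcal{E}_D \}}{|\mathcal{E}_C|} & \text{otherwise,} \end{cases} \]
where $\epsilon_i, \epsilon_j$ are existential restrictions and
\[ e\text{-}hd(\exists r.X, \exists s.Y) := \gamma \, (\nu + (1 - \nu) \cdot hd(\mathcal{T}_X, \mathcal{T}_Y)) \]
where $\gamma = \dfrac{|\mathcal{R}_r \cap \mathcal{R}_s|}{|\mathcal{R}_r|}$ and $0 \le \nu \le 1$.
The ELH similarity degree between C and D is determined as follows:
\[ sim(C,D) = \frac{hd(\mathcal{T}_C, \mathcal{T}_D) + hd(\mathcal{T}_D, \mathcal{T}_C)}{2} \]
The implementation of this measure is available from this website (http://ict.siit.tu.ac.th). This measure is built on a specific language, the Description Logic ELH, so it applies only to ontologies written in that language.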
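The homomorphism degree can be sketched in Python on small hand-built description trees. This is only an illustration of the definitions above, not the published ELSIM implementation. The Tree class, the function names, the default values of mu and nu, and the toy trees are all hypothetical. The sketch also assumes there is no role hierarchy, so gamma is 1 when the two role names are equal and 0 otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tree:
    """Description tree node: primitive names plus (role, subtree) edges."""
    prims: frozenset
    edges: tuple = ()


def p_hd(pc, pd):
    """Share of the primitives of the first tree found in the second."""
    return 1.0 if not pc else len(pc & pd) / len(pc)


def e_hd(edge_c, edge_d, mu, nu):
    (r, x), (s, y) = edge_c, edge_d
    gamma = 1.0 if r == s else 0.0  # assumption: no role hierarchy
    return gamma * (nu + (1 - nu) * hd(x, y, mu, nu))


def e_set_hd(ec, ed, mu, nu):
    if not ec:
        return 1.0
    if not ed:
        return 0.0
    # Each edge of the first tree is matched with its best counterpart.
    return sum(max(e_hd(ei, ej, mu, nu) for ej in ed) for ei in ec) / len(ec)


def hd(tc, td, mu=0.5, nu=0.4):
    """Directional homomorphism degree from tree tc to tree td."""
    return (mu * p_hd(tc.prims, td.prims)
            + (1 - mu) * e_set_hd(tc.edges, td.edges, mu, nu))


def elsim(tc, td, mu=0.5, nu=0.4):
    """Symmetric similarity: average of both directional degrees."""
    return (hd(tc, td, mu, nu) + hd(td, tc, mu, nu)) / 2


# Toy trees, built by hand for illustration only.
brain = Tree(frozenset({"brain structure"}))
hypoxia = Tree(frozenset({"hypoxia"}))
hypoxia_of_brain = Tree(frozenset({"hypoxia"}), (("finding site", brain),))
print(elsim(hypoxia_of_brain, hypoxia))
```

In this toy case the two trees share their only primitive name, but one tree has an edge that the other lacks, so one direction scores 0.5 and the other 1.0, giving an average of 0.75.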
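Returning to the taxonomic-based measures of Section 2.2.1, the path length (Equation 2.9), the Wu and Palmer measure (Equation 2.11) and the ancestor-set measure (Equation 2.14) can be sketched on a toy taxonomy. The taxonomy below is hypothetical and is not taken from SNOMED CT. The sketch assumes each concept has a single parent, whereas real ontologies allow several, and it counts N1, N2 and N3 in is-a edges.

```python
import math

# Hypothetical toy taxonomy: child -> parent ("is-a").
PARENT = {
    "disease": "root",
    "heart disease": "disease",
    "lung disease": "disease",
    "valve stenosis": "heart disease",
    "myocarditis": "heart disease",
    "tuberculosis": "lung disease",
}


def path_to_root(c):
    """Concept c followed by all of its ancestors up to the root."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1, c2):
    """Least common subsumer: first ancestor of c1 that also subsumes c2."""
    ancestors2 = set(path_to_root(c2))
    return next(a for a in path_to_root(c1) if a in ancestors2)


def path_length(c1, c2):
    """Equation 2.9: number of is-a edges on the shortest path."""
    common = lcs(c1, c2)
    return path_to_root(c1).index(common) + path_to_root(c2).index(common)


def wu_palmer(c1, c2):
    """Equation 2.11."""
    common = lcs(c1, c2)
    n1 = path_to_root(c1).index(common)
    n2 = path_to_root(c2).index(common)
    n3 = len(path_to_root(common)) - 1
    # Assumption: the root compared with itself is fully similar.
    return 2 * n3 / (n1 + n2 + 2 * n3) if (n1 + n2 + n3) else 1.0


def ancestor_sim(c1, c2):
    """Equation 2.14 on the ancestor sets T(c1) and T(c2)."""
    t1, t2 = set(path_to_root(c1)), set(path_to_root(c2))
    ratio = (len(t1 | t2) - len(t1 & t2)) / len(t1 | t2)
    # Identical concepts share everything, so the logarithm diverges.
    return math.inf if ratio == 0 else -math.log2(ratio)


print(path_length("valve stenosis", "myocarditis"))
print(wu_palmer("valve stenosis", "myocarditis"))
print(ancestor_sim("valve stenosis", "myocarditis"),
      ancestor_sim("valve stenosis", "tuberculosis"))
```

The two heart concepts share three of five ancestors, while the heart and lung concepts share two of six, so the ancestor-set measure ranks the first pair as more similar.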
Chapter 3 Concept Name Similarity Measure on SNOMED CT
All of the existing similarity measures reviewed in the previous chapter compute similarity from the structural characterization of the ontology. However, an ontology contains different types of concepts: defined concepts and primitive concepts.
Defined Concepts
They are fully defined in the ontology: they have at least one relationship to another concept, and their definitions are sufficient to distinguish them from other concepts.
For example,
“Hypoxia of brain”
Is a = hypoxia
Finding site = brain structure
Sufficiently Defined
“Hypoxia of brain” has an “is-a” relation with “hypoxia” and also has an “attribute-value” relationship of type “finding site” with another concept, “brain structure”. Therefore, this concept has specific and complete information that sufficiently distinguishes it from other concepts.
Primitive Concepts
They are only partially defined in the ontology: their definitions do not sufficiently distinguish them from other concepts, and additional information is needed to define them.
For example,
“Tumor of dermis”
Is a = navigational concept
Primitive
“Tumor of dermis” has an “is-a” relation with “navigational concept”, but it does not carry complete information about itself. Ontology builders therefore call such concepts “primitive concepts”, and they always redefine these concepts with more complete and specific information from actual medical treatment records. This raises the question of whether existing taxonomic-based similarity measures give correct similarity degrees for all types of ontology concepts. To answer it, we tested some pairs of primitive concepts from the SNOMED CT ontology with two existing measures: (1) the path-based measure of Section 2.2.1 (v), which obtained the highest correlation with human expert results among most existing taxonomic-based similarity measures, and (2) ELSIM of Section 2.2.2, the Description Logic ELH semantic similarity measure. We then compared the results with those of human experts, as shown in Table 3.1.
Table 3.1 Incorrect similarity degree between primitive concepts using existing two similarity measures
| Primitive Concept P1 | Primitive Concept P2 | Path-based | ELSIM | Human result |
| Infiltrative lung tuberculosis | Nodular lung tuberculosis | 0.2 | 0.0 | 0.7 |
| Maternal autoimmune hemolytic anemia | Autoimmune hemolytic anemia | 0.2 | 0.0 | 0.8 |
| Phakic corneal edema | Corneal epithelial edema | 0.2 | 0.0 | 0.5 |
According to Table 3.1, existing measures cannot give the desired similarity degrees for primitive concepts. Consequently, we propose a concept name similarity measure intended mainly for primitive concept similarity on SNOMED CT. Figure 3.1 shows an overview of the system that finds the similarity degrees using our proposed measure on the SNOMED CT ontology.
3.1 Concept Name Similarity Measure on SNOMED CT
In SNOMED CT, each ontology concept is uniquely identified by a concept ID (e.g. id=10365005), annotated with a short textual description (e.g. “right main coronary artery thrombosis”) and equipped with a definition in description logic. Moreover, ontology concept names are taken from actual patient health records, so they are very informative and can convey the complete meaning of the concept.
[Figure 3.1 diagram elements: SNOMED CT; experiments between three cases (primitive concepts; primitive concepts and defined concepts; defined concepts); proposed similarity measure; similarity results.]
Figure 3.1 Overview system of concept name similarity measure on SNOMED CT
[Figure 3.2 diagram elements: concept name; semantic similarity (based on headword); syntactic similarity (context-free grammar).]
Figure 3.2 Notion of proposed similarity measure
3.1.1 Semantic Similarity (Linguistic Headword Structure)
All concept names are expressed as noun phrases, in which the “headword” holds the core meaning of the phrase and cannot be omitted. Therefore, we give the headword the highest weight when comparing the similarity of two concept names. In English, the structure of noun phrases can be described by the following cases.
1. Determiner + Pre-modifier + noun (headword)
2. noun (headword) + Post-modifier/complement
3. noun + noun
All of the SNOMED CT concept names follow the first case. Therefore, the rightmost noun is the headword of the concept name. We ran experiments that gave different weights to each component of a concept name according to the analysis of noun phrase structure. From these experiments, we concluded that a suitable weight for the headword is 0.6, leaving 0.4 for the remaining components. To illustrate the calculation, consider the following two concepts,
P1 = “right main coronary artery thrombosis” and
P2 = “superior mesenteric vein thrombosis”.
For concept P1,
• Weight for headword “thrombosis” is 0.6
• Weight for remaining components is 0.4 (0.1 for each remaining component)
Firstly, we give equal weights to each remaining component. Following the idea that components nearer to the headword have a higher semantic influence on it [23], nearer components should receive higher weights than other components. We therefore take the position of each component into account and assign its weight according to its distance from the headword: the equal weight of each component is divided by its distance value. For the component nearest to the headword, we subtract the sum of the weights of all other remaining components from 0.4, so that the weights of the concept name sum to 1. The resulting weights are shown in Tables 3.2 and 3.3.
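The weighting scheme described above can be sketched as a short Python function. The function name is hypothetical, the headword is assumed to be the last word of the name, and the handling of a one-word name (all weight on the headword) is an assumption, since the text does not cover that case.

```python
def component_weights(name, head_weight=0.6):
    """Return [(word, weight), ...] for a concept name whose last word is the headword."""
    *modifiers, head = name.split()
    if not modifiers:
        # Assumption: a one-word name gives all weight to its headword.
        return [(head, 1.0)]
    rest = 1.0 - head_weight             # 0.4 shared by the remaining components
    equal_share = rest / len(modifiers)  # equal weight before distance scaling
    weights = []
    for position, _ in enumerate(modifiers):
        distance = len(modifiers) - position  # 1 = next to the headword
        weights.append(equal_share / distance)
    # The nearest component receives whatever is left of the 0.4.
    weights[-1] = rest - sum(weights[:-1])
    return list(zip(modifiers, weights)) + [(head, head_weight)]


for word, weight in component_weights("right main coronary artery thrombosis"):
    print(f"{word:12s} {weight:.3f}")
```

Because the thesis rounds intermediate values to three decimals, the unrounded weights computed here can differ from the tabulated ones in the last digit.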
Table 3.2 Different weights of concept P1
| Component | right | main | coronary | artery | thrombosis |
| Equal weight | 0.1 | 0.1 | 0.1 | 0.1 | 0.6 |
| Distance-based weight | 0.1/4 = 0.025 | 0.1/3 = 0.033 | 0.1/2 = 0.05 | 0.4 − (0.025 + 0.033 + 0.05) = 0.292 | 0.6 |
Table 3.3 Different weights of concept P2
| Component | superior | mesenteric | vein | thrombosis |
| Equal weight | 0.133 | 0.133 | 0.133 | 0.6 |
| Distance-based weight | 0.133/3 = 0.044 | 0.133/2 = 0.067 | 0.4 − (0.044 + 0.067) = 0.289 | 0.6 |
We apply the Jaccard similarity for the headword similarity, denoted by $sim_{Headword}$. The cardinalities are replaced by sums of component weights: the numerator is the weight of the shared headword “thrombosis”, and the denominator is the total weight of all distinct components of both names.
\[ sim_{Headword}(P_1,P_2) = \frac{|tset(P_1) \cap tset(P_2)|}{|tset(P_1) \cup tset(P_2)|} = \frac{0.6}{0.025 + 0.033 + 0.05 + 0.292 + 0.6 + 0.044 + 0.067 + 0.289} = 0.43 \]
There are two points that we need to consider for this semantic similarity.
1. Some words are lexically the same but have different meanings, for example “kidney parenchyma” and “kidney beans”.
• “Kidney parenchyma” is the tissue of the human kidney, and “kidney beans” are a kind of bean.
• This case cannot occur here because we compute the similarity within the same category (in the disease category, all the concepts are about health, such as illness or sickness).
2. Some words are lexically different but have the same meaning, for example “illness” and “sickness”.
• To handle this case, we use the WordNet ontology to calculate the synset similarity $S_{synset}$, because two concepts are similar if their synsets are lexically similar [24], as in Equation 3.1.

$$S_{synset}(P_1, P_2) = \frac{|A \cap B|}{|A \cup B|} \qquad (3.1)$$

A is the synset of concept $P_1$ and B is the synset of concept $P_2$.
• We apply the synset similarity calculation only to the two headwords. If the degree of similarity of the two synsets is greater than 0, then the two words are considered to be the same. Otherwise, they are different.

$$Sim(P_1, P_2) = \begin{cases} 1, & \text{if } S_{synset}(P_1, P_2) > 0 \\ 0, & \text{if } S_{synset}(P_1, P_2) = 0 \end{cases}$$
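As a minimal sketch of the headword similarity with the synset check, the following Python fragment reproduces the worked value 0.43. Several things here are assumptions made for illustration: `TOY_SYNSETS` is a hypothetical stand-in for a WordNet lookup, the function names are invented for this sketch, and shared words are combined with the usual min/max form of the weighted Jaccard index (the worked example only fixes the case where the headword alone is shared).

```python
# Hypothetical synonym sets standing in for WordNet synsets.
TOY_SYNSETS = {
    "illness": {"illness", "sickness", "malady"},
    "sickness": {"illness", "sickness", "malady"},
}


def synset_similarity(word1, word2):
    """Jaccard similarity of the two synsets (Equation 3.1)."""
    a = TOY_SYNSETS.get(word1, {word1})
    b = TOY_SYNSETS.get(word2, {word2})
    return len(a & b) / len(a | b)


def same_headword(word1, word2):
    """Return 1 if the synsets overlap at all, otherwise 0."""
    return 1 if synset_similarity(word1, word2) > 0 else 0


def headword_similarity(p1, head1, p2, head2):
    """Weighted Jaccard over word -> weight maps of two concept names."""
    q2 = dict(p2)
    if head1 != head2 and same_headword(head1, head2):
        q2[head1] = q2.pop(head2)   # treat synonymous headwords as one term
    words = set(p1) | set(q2)
    inter = sum(min(p1.get(w, 0.0), q2.get(w, 0.0)) for w in words)
    union = sum(max(p1.get(w, 0.0), q2.get(w, 0.0)) for w in words)
    return inter / union


# Distributed weights taken from Tables 3.2 and 3.3.
P1 = {"right": 0.025, "main": 0.033, "coronary": 0.05,
      "artery": 0.292, "thrombosis": 0.6}
P2 = {"superior": 0.044, "mesenteric": 0.067, "vein": 0.289,
      "thrombosis": 0.6}

print(round(headword_similarity(P1, "thrombosis", P2, "thrombosis"), 2))  # 0.43
print(same_headword("illness", "sickness"))                               # 1
```

A real implementation would replace `TOY_SYNSETS` with queries against WordNet; the rest of the calculation would stay the same.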
3.1.2 Syntactic Similarity (Context-free Grammar)
According to English noun phrase construction, we can also assess the similarity from the syntactic structure. To obtain the syntactic structure of noun phrases, we apply a context-free grammar (CFG) [25]. The grammar is G = (T, N, S, R), where
• T is the set of terminals
• N is the set of non-terminals (NP in this case)
• S is the start symbol
• R is the set of rules or productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals
We create noun phrase rules that cover all types of concept names in SNOMED CT, as listed below.
1. NP → N
2. NP → N NP
3. NP → Adj NP
4. NP → Det NP
5. NP → Adv NP
After applying the CFG rules, the parsing orders of P1 and P2 from the previous section are as follows.
• Parsing order of P1: 3-3-3-2-1
• Parsing order of P2: 3-3-2-1
Syntactic similarity is estimated from the CFG rules applied during parsing. In the similarity calculation, the numerator is the number of rules in the intersection of the two parsing orders and the denominator is the maximum number of rules in either parsing order.

$$sim_{CFG}(P_1, P_2) = \frac{4}{5} = 0.8$$
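The parsing orders and the value 4/5 can be reproduced with a short sketch. The part-of-speech tags are assigned by hand here (a tagger is outside the scope of the example), and the intersection is taken as a multiset intersection of rule numbers; that reading is an assumption, chosen because it yields 4 shared rules for the two example concepts. The names `RULE_FOR_TAG`, `parsing_order` and `sim_cfg` are illustrative only.

```python
from collections import Counter

# Rule applied for each non-final word, keyed by its part-of-speech tag.
RULE_FOR_TAG = {"N": 2, "Adj": 3, "Det": 4, "Adv": 5}


def parsing_order(tags):
    """Map a tagged noun phrase to the sequence of rule numbers used."""
    if not tags or tags[-1] != "N":
        raise ValueError("the phrase must end in a noun (rule 1: NP -> N)")
    return [RULE_FOR_TAG[tag] for tag in tags[:-1]] + [1]


def sim_cfg(order1, order2):
    """Shared rules (counted with multiplicity) over the longer rule sequence."""
    shared = Counter(order1) & Counter(order2)
    return sum(shared.values()) / max(len(order1), len(order2))


# right/Adj main/Adj coronary/Adj artery/N thrombosis/N
order_p1 = parsing_order(["Adj", "Adj", "Adj", "N", "N"])
# superior/Adj mesenteric/Adj vein/N thrombosis/N
order_p2 = parsing_order(["Adj", "Adj", "N", "N"])

print(order_p1)                     # [3, 3, 3, 2, 1]
print(order_p2)                     # [3, 3, 2, 1]
print(sim_cfg(order_p1, order_p2))  # 0.8
```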
3.1.3 Proposed Similarity Measure
After obtaining the similarity values from the two dimensions, semantic and syntactic structure, we compute the final similarity value by assigning them different weights. If two concepts have exactly the same syntactic structure but different headword terms, they have different meanings. The headword structure, which weights each component according to its position relative to the headword, therefore determines the similarity more effectively than the syntactic structure. For this reason, we set the weights to 0.7 for the headword structure and 0.3 for the syntactic structure.
$$W_{sim}(P_1, P_2) = a \times sim_{Headword}(P_1, P_2) + b \times sim_{CFG}(P_1, P_2) = 0.7 \times 0.43 + 0.3 \times 0.8 = 0.54$$
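The final combination is a plain weighted sum, sketched below with the two values from the worked example. The function name `weighted_similarity` is illustrative; the default weights a = 0.7 and b = 0.3 are the ones chosen above.

```python
def weighted_similarity(sim_headword, sim_cfg, a=0.7, b=0.3):
    """Combine headword and syntactic similarity into the final score."""
    return a * sim_headword + b * sim_cfg


# Worked example: sim_Headword = 0.43, sim_CFG = 0.8.
print(round(weighted_similarity(0.43, 0.8), 2))   # 0.54

# Identical syntactic structure but no shared terms at all.
print(round(weighted_similarity(0.0, 1.0), 2))    # 0.3
```

The second call illustrates the design choice: two names with the same syntactic structure but nothing else in common cannot score above b = 0.3.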
Chapter 4 Experimental Results and Discussion
This chapter has two parts:
(1) a preliminary experiment with the text similarity approaches of Section 2.1, to obtain an overview of the similarity degrees within each category of SNOMED CT, and
(2) the main experiment using the proposed method.
4.1 Preliminary Experiment
For this experiment, we use the DL version of SNOMED CT released in January 2005, which contains 364,461 concept names [26]. There are 18 top-level categories in the ontology. We pick 50 concepts from each category and generate 20,825 concept pairs by considering only the distinct pairs (i.e., excluding P1 = P2), and calculate the similarity degrees using six different text similarity measures. For the Levenshtein distance, we apply two different kinds of similarity, Leven1 and Leven2. For the 20,825 pairs, Table 4.1 shows the average and maximum value for each category.
According to Table 4.1, about 73% of the 20,825 pairs are totally dissimilar (i.e., have a similarity of zero) under the five unordered measures, based on the average over the concepts. The ordered measure, Levenshtein distance, gives 443 more zero-valued pairs than the unordered measures because it also considers the ordering of words when comparing similarity. In terms of performance, each method requires 1.54 seconds on average. Computing all distinct pairs of the 364,461 concepts in SNOMED CT would take about 38 days. From the preliminary experiments, we notice that all of the concept names are noun phrases and that each contains a most important noun, called the “headword”, which holds the core meaning of the noun phrase [23]. Therefore, we conduct the main experiment using our proposed measure, as described in the next section.
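The pair counts can be checked with a short calculation. The sketch assumes that pairs are unordered and are formed only within a category; under that assumption 50 concepts give 1,225 pairs, and 20,825 corresponds to 17 such groups. The helper name `distinct_pairs` is illustrative.

```python
from math import comb


def distinct_pairs(n):
    """Number of unordered pairs of distinct items among n items."""
    return comb(n, 2)


pairs_per_category = distinct_pairs(50)
print(pairs_per_category)              # 1225
print(20825 // pairs_per_category)     # 17 groups of 50 concepts
print(distinct_pairs(364461))          # 66415728030 pairs over the whole ontology
```

The last figure, roughly 6.6 × 10^10 pairs, shows why an exhaustive pairwise comparison over all concept names is impractical.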
Table 4.1 Results of similarity degrees for all categories of SNOMED CT based on text similarity measures

Similarity measures (avg/max)

| Category | Jaccard | Cosine | Tversky | Simpson | Difflib | Leven 1 | Leven 2 | Average |
|---|---|---|---|---|---|---|---|---|
| Body Structure | 0.10/0.71 | 0.16/0.83 | 0.16/0.83 | 0.19/0.83 | 0.16/0.83 | 0.15/0.83 | 0.1/0.71 | 0.15/1.0 |
| Context-dependent | 0.03/0.67 | 0.05/0.82 | 0.05/0.86 | 0.06/1.0 | 0.05/0.8 | 0.06/1.0 | 0.04/1.0 | 0.05/1.0 |
| Environment | 0.05/0.83 | 0.07/0.91 | 0.07/0.91 | 0.08/1.0 | 0.07/0.91 | 0.08/1.0 | 0.06/1.0 | 0.07/1.0 |
| Event | 0.15/0.93 | 0.22/0.97 | 0.21/0.97 | 0.3/1.0 | 0.21/0.94 | 0.3/1.0 | 0.23/1.0 | 0.23/1.0 |
| Finding | 0.03/0.75 | 0.04/0.87 | 0.04/0.92 | 0.05/1.0 | 0.04/0.86 | 0.04/1.0 | 0.03/1.0 | 0.04/1.0 |
| Observable Entity | 0.06/0.83 | 0.1/0.91 | 0.09/0.91 | 0.12/1.0 | 0.09/0.91 | 0.11/1.0 | 0.07/1.0 | 0.09/1.0 |
| Organism | 0.01/0.5 | 0.01/0.71 | 0.01/0.67 | 0.01/1.0 | 0.01/0.67 | 0.01/1.0 | 0.01/1.0 | 0.01/1.0 |
| Physical Force | 0.12/0.8 | 0.18/0.89 | 0.18/0.89 | 0.2/1.0 | 0.18/0.89 | 0.19/1.0 | 0.14/1.0 | 0.17/1.0 |
| Physical Object | 0.25/0.8 | 0.39/0.89 | 0.38/0.89 | 0.46/1.0 | 0.38/0.89 | 0.46/1.0 | 0.31/1.0 | 0.37/1.0 |
| Procedure | 0.13/0.75 | 0.19/0.87 | 0.18/0.86 | 0.2/1.0 | 0.18/0.86 | 0.2/1.0 | 0.14/1.0 | 0.17/1.0 |
| Product | 0.01/0.67 | 0.02/0.82 | 0.02/0.8 | 0.02/1.0 | 0.02/0.8 | 0.02/1.0 | 0.02/1.0 | 0.02/1.0 |
| Qualifier Value | 0.04/0.75 | 0.05/0.87 | 0.05/0.86 | 0.05/1.0 | 0.05/0.86 | 0.05/1.0 | 0.04/1.0 | 0.04/1.0 |
| Social Concept | 0.03/0.8 | 0.04/0.89 | 0.04/0.89 | 0.05/1.0 | 0.04/0.89 | 0.05/1.0 | 0.04/1.0 | 0.04/1.0 |
| Special Concept | 0.03/0.8 | 0.04/0.89 | 0.04/0.89 | 0.04/1.0 | 0.04/0.89 | 0.04/1.0 | 0.03/1.0 | 0.04/1.0 |
| Specimen | 0.29/0.8 | 0.42/0.89 | 0.41/0.89 | 0.47/1.0 | 0.41/0.89 | 0.44/1.0 | 0.32/1.0 | 0.39/1.0 |
| Staging Scale | 0.16/0.8 | 0.24/0.89 | 0.23/0.89 | 0.27/1.0 | 0.23/0.8 | 0.21/1.0 | 0.16/1.0 | 0.21/1.0 |
| Substance | 0.002/0.6 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.003/0.8 | 0.002/0.6 | 0.003/1.0 |
| Concept Average | 0.09/0.93 | 0.13/0.97 | 0.13/0.97 | 0.16/1.0 | 0.13/0.94 | 0.15/1.0 | 0.11/1.0 | 0.13/1.0 |
| Roles | 0.02/0.5 | 0.03/0.71 | 0.03/0.67 | 0.04/1.0 | 0.03/0.67 | 0.03/1.0 | 0.02/1.0 | 0.03/1.0 |
4.2 Main Experiment on SNOMED CT
In SNOMED CT, there are two kinds of concepts: defined concepts and primitive concepts. Therefore, we conduct three types of experiment, between (1) primitive concepts, (2) primitive and defined concepts, and (3) defined concepts. From the SNOMED CT disorder category, we pick 30 pairs of concepts for each type, giving 90 pairs of concepts in total.
A common way to demonstrate the performance of a proposed method is to compare it with existing measures. We chose the path-based measure (Section 2.2.1.5), which obtained the highest correlation value among most of the existing similarity measures, and the description logic ELH semantic similarity measure (Section 2.2.2), because SNOMED CT is written in description logic. To compare the results of the proposed method with these two existing measures, we implemented both of them.
To examine the validity of all measures, we asked five medical doctors to rate the similarity of the 90 pairs of concepts. The doctors reached a consensus on the degree of similarity of each pair, and we calculate the correlation values between the results of each measure and those of the medical doctors.
4.2.1 Experiments between Primitive Concepts
The first experiment is between primitive concepts, for which our proposed method is mainly intended. As primitive concepts do not have full relationships or definitions in the ontology hierarchy, we expect our concept name similarity measure, which takes a natural language processing view, to suit them better than the existing taxonomic-based measures. The results for primitive concepts estimated by the path-based measure, ELSIM, our proposed measure and the human experts are shown in Table 4.2.
Table 4.2 Results of 30 pairs of concepts between primitive concepts estimated by path-based, ELSIM, our proposed method and human expert

| Primitive Concept P1 | Primitive Concept P2 | Path-based | ELSIM | Proposed measure | Human expert |
|---|---|---|---|---|---|
| Hormonal tumor | Malignant mast cell tumor | 0.2 | 0.0 | 0.5 | 0.6 |
| Maternal autoimmune hemolytic anemia | Autoimmune hemolytic anemia | 0.2 | 0.0 | 0.8 | 0.8 |
| Hypertensive leg ulcer | Solitary anal ulcer | 0.3 | 0.7 | 0.5 | 0.4 |
| Bovine viral diarrhea | Bovine coronoviral diarrhea | 0.6 | 0.6 | 0.7 | 0.7 |
| Acute uterine inflammatory disease | Mycoplasmal pelvic inflammatory disease | 0.4 | 0.2 | 0.9 | 0.9 |
| Primary cutaneous blastomycosis | Primary pulmonary blastomycosis | 0.7 | 0.9 | 0.7 | 0.6 |
| Iodine-deficiency-related multinodular endemic goiter | Non-toxic multi nodular goiter | 0.8 | 0.7 | 0.8 | 0.8 |
| Congenital pharyngeal polyp | Uterine cornual polyp | 0.4 | 0.6 | 0.5 | 0.5 |
| Phakic corneal edema | Corneal epithelial edema | 0.2 | 0.0 | 0.5 | 0.5 |
| Knee pyogenic arthritis | Gonococcal arthritis dermatitis syndrome | 0.9 | 0.8 | 0.4 | 0.4 |
| Hereditary canine spinal muscular atrophy | Spinal cord concussion | 0.5 | 0.7 | 0.3 | 0.5 |
| Mite-borne hemorrhagic fever | Meningococcal cerebrospinal fever | 0.4 | 0.5 | 0.6 | 0.5 |
| Congenital cleft larynx | Congenital spastic foot | 0.6 | 0.8 | 0.3 | 0.3 |
| Congenital acetabular dysplasia | Short rib dysplasia | 0.5 | 0.9 | 0.5 | 0.5 |
| Intestinal polyposis syndrome | Ovarian vein syndrome | 0.6 | 0.8 | 0.6 | 0.5 |
| Extrapulmonary subpleural pulmonary sequestration | Pulmonary alveolar proteinosis | 0.7 | 0.6 | 0.4 | 0.4 |
| Atypical chest pain | Psychogenic back pain | 0.3 | 0.1 | 0.5 | 0.5 |
| Puerperal pelvic cellulitis | Chronic female pelvic cellulitis | 0.9 | 0.7 | 0.8 | 0.7 |
| Spinal cord hypoplasia | Spinal cord rupture | 0.5 | 0.7 | 0.6 | 0.6 |
| Infiltrative lung tuberculosis | Nodular lung tuberculosis | 0.2 | 0.0 | 0.9 | 0.7 |
| Early gastric cancer | Primary vulval cancer | 0.4 | 0.8 | 0.4 | 0.4 |
| Congenital mesocolic hernia | Gangrenous epigastric hernia | 0.2 | 0.0 | 0.5 | 0.4 |
| Congenital nonspherocytic hemolytic anemia | Congenital macular corneal dystrophy | 0.2 | 0.0 | 0.3 | 0.2 |
| Congenital cerebellar cortical atrophy | Congenital renal atrophy | 0.6 | 0.9 | 0.7 | 0.2 |
| Puerperal pyrexia | Heat pyrexia | 0.3 | 0.0 | 0.5 | 0.6 |
| Methylmalonyl-CoA mutase deficiency | Muscle phosphoglycerate mutase deficiency | 0.2 | 0.0 | 0.7 | 0.5 |
| Recurrent mouth ulcers | Multiple gastric ulcers | 0.5 | 0.8 | 0.4 | 0.4 |
| Infantile breast hypertrophy | Sebaceous gland hypertrophy | 0.6 | 0.7 | 0.6 | 0.4 |
| Congenital pyloric hypertrophy | Synovial hypertrophy | 0.3 | 0.0 | 0.5 | 0.3 |
| Inflammatory testicular mass | Inflammatory epidermal nevus | 0.5 | 0.7 | 0.3 | 0.3 |

4.2.2 Experiments between Defined and Primitive Concepts
As primitive concepts are not fully defined in the ontology, the similarity between primitive and defined concepts is also an interesting point in SNOMED CT. Existing taxonomic-based approaches cannot give the similarity degrees estimated by human experts because primitive concepts have only partially defined information, whereas defined concepts have fully defined information. As a result, the similarity degree between two concepts can be very low or very high because of the incomplete information for the primitive concepts. For this reason, taxonomic-based approaches are also not acceptable for the similarity between defined and primitive concepts.
Table 4.3 Results of 30 pairs of concepts between primitive and defined concepts estimated by path-based, ELSIM, our proposed method and human expert

| Primitive Concept P1 | Defined Concept P2 | Path-based | ELSIM | Proposed method | Human expert |
|---|---|---|---|---|---|
| Mosquito-borne hemorrhagic fever | Glandular fever pharyngitis | 0.4 | 0.7 | 0.5 | 0.5 |
| Right main coronary artery thrombosis | Coronary artery rupture | 0.9 | 0.9 | 0.5 | 0.4 |
| Right main coronary artery thrombosis | Superior mesenteric vein thrombosis | 0.7 | 0.9 | 0.5 | 0.5 |
| Infectious mononucleosis hepatitis | Chronic alcoholic hepatitis | 0.2 | 0.0 | 0.5 | 0.5 |
| Cerebral venous sinus thrombosis | Phlebitis cavernous sinus | 1.0 | 0.9 | 0.6 | 0.6 |
| Third degree perineal laceration | Complex periorbital laceration | 0.3 | 0.7 | 0.5 | 0.5 |
| Congenital subaortic stenosis | Rheumatic aortic stenosis | 0.9 | 0.7 | 0.6 | 0.7 |
| Congenital acetabular dysplasia | Aortic valve dysplasia | 0.5 | 0.6 | 0.5 | 0.3 |
| Intestinal polyposis syndrome | Fetal cytomegalovirus syndrome | 0.4 | 0.4 | 0.6 | 0.3 |
| Anterior choroidal artery syndrome | Juvenile polyposis syndrome | 0.4 | 0.7 | 0.5 | 0.3 |
| Puerperal pelvic cellulitis | Streptococcal cellulitis | 0.3 | 0.5 | 0.5 | 0.3 |
| Benign hypertensive renal disease | Pulmonary hypertensive venous disease | 0.7 | 0.8 | 0.6 | 0.4 |
| Corneal epithelial edema | Idiopathic corneal edema | 0.1 | 0.0 | 0.8 | 0.6 |
| Chronic sarcoid myopathy | Hereditary hollow viscus myopathy | 0.3 | 0.6 | 0.5 | 0.5 |
| Primary cutaneous blastomycosis | Chronic pulmonary blastomycosis | 0.7 | 0.9 | 0.6 | 0.6 |
| Gingival pregnancy tumor | Granular cell tumor | 0.4 | 0.5 | 0.6 | 0.4 |
| Borderline epithelial tumor | Melanotic malignant nerve sheath tumor | 0.4 | 0.6 | 0.4 | 0.4 |
| Congenital sternomastoid tumor | Malignant mast cell tumor | 0.4 | 0.5 | 0.4 | 0.4 |
| Congenital pharyngeal polyp | Rhinosporidial mucosal polyp | 0.4 | 0.4 | 0.6 | 0.5 |
| Mercurial diuretic poisoning | Lobelia species poisoning | 0.4 | 0.4 | 0.4 | 0.5 |
| Branch macular artery occlusion | Acute mesenteric arterial occlusion | 0.5 | 0.9 | 0.5 | 0.6 |
| Intrarenal hematoma | Stomach hematoma | 0.5 | 0.9 | 0.5 | 0.6 |
| Spinal cord hypoplasia | Spinal cord dysplasia | 0.9 | 0.9 | 0.6 | 0.6 |
| Coronary artery thrombosis | Vertebral artery thrombosis | 0.2 | 0.0 | 0.9 | 0.6 |
| Duodenal papillary stenosis | Congenital bronchial stenosis | 0.5 | 0.6 | 0.6 | 0.4 |
| Arteriovenous fistula stenosis | Subclavian vein stenosis | 0.4 | 0.2 | 0.6 | 0.5 |
| Mechanical hemolytic anemia | Hereditary sideroblastic anemia | 0.7 | 0.7 | 0.6 | 0.5 |
| Malignant catarrhal fever | Malignant lipomatous tumor | 0.7 | 0.2 | 0.3 | 0.3 |
| Bolivian hemorrhagic fever | Dengue hemorrhagic fever | 0.6 | 0.9 | 0.8 | 0.6 |
| Benign brain tumor | Benign neuroendocrine tumor | 0.4 | 0.0 | 0.5 | 0.5 |
4.2.3 Experiments between Defined Concepts
Defined concepts are fully defined in the ontology, but there is no guarantee that the information of every defined concept is complete with respect to actual medical treatment records. Therefore, this type of experiment is also of interest, to check whether existing taxonomic-based approaches give the desired similarity degrees for defined concepts. Table 4.4 shows the results between defined concepts computed by the path-based measure, ELSIM, our proposed method and the human experts.
Table 4.4 Results of 30 pairs of concepts between defined concepts estimated by path-based, ELSIM, our proposed method and human expert

| Defined Concept P1 | Defined Concept P2 | Path-based | ELSIM | Proposed method | Human expert |
|---|---|---|---|---|---|
| Rheumatic heart valve stenosis | Coronary artery stenosis | 0.6 | 0.8 | 0.5 | 0.6 |
| Nasal septal hematoma | Vocal cord hematoma | 0.3 | 0.9 | 0.5 | 0.5 |
| Simple periorbital laceration | Brain stem laceration | 0.5 | 0.9 | 0.4 | 0.5 |
| Peritonsillar cellulitis | Dentoalveolar cellulitis | 0.5 | 0.9 | 0.6 | 0.5 |
| Parainfluenza virus laryngotracheitis | Acute viral laryngotracheitis | 1.0 | 0.9 | 0.4 | 0.6 |
| Bone marrow hyperplasia | Retromolar gingival hyperplasia | 0.8 | 0.8 | 0.5 | 0.4 |
| Chronic proctocolitis | Chronic viral hepatitis | 0.5 | 0.8 | 0.4 | 0.3 |
| Obstructive biliary cirrhosis | Syphilitic portal cirrhosis | 0.6 | 0.8 | 0.5 | 0.6 |
| Peripheral T-cell lymphoma | Primary cerebral lymphoma | 0.5 | 0.8 | 0.5 | 0.6 |
| Mast cell leukemia | Prolymphocytic leukemia | 0.9 | 0.9 | 0.4 | 0.5 |
| Tricuspid valve regurgitation | Rheumatic mitral regurgitation | 0.9 | 0.9 | 0.5 | 0.6 |
| Gangrenous paraesophageal hernia | Congenital bladder hernia & | 0.5 | 0.8 | 0.6 | 0.5 |
| Congenital mandibular hyperplasia | Atypical endometrial hyperplasia | 0.3 | 0.6 | 0.5 | 0.5 |
| Tuberculous adenitis | Acute mesenteric adenitis | 0.7 | 0.6 | 0.5 | 0.4 |
| Congenital skeletal dysplasia | Aortic valve dysplasia | 0.4 | 0.7 | 0.5 | 0.4 |
| Histiocytic sarcoma | Alveolar soft part sarcoma | 1.0 | 0.9 | 0.5 | 0.6 |
| Drug-induced ulceration | Amebic perianal ulceration | 0.9 | 0.6 | 0.5 | 0.5 |
| Cervical radiculitis | Cervical lymphadenitis | 0.4 | 0.8 | 0.7 | 0.6 |
| Basilar artery embolism | Obstetric pulmonary embolism | 0.6 | 0.7 | 0.5 | 0.6 |
| Acute apical abscess | Chronic apical abscess | 0.5 | 1.0 | 0.9 | 0.7 |
| Acute glossitis | Chronic glossitis | 0.4 | 0.4 | 0.5 | 0.7 |
| Acute bronchitis | Acute purulent meningitis | 0.3 | 0.6 | 0.4 | 0.4 |
| Acute lower gastrointestinal hemorrhage | Stromal corneal hemorrhage | 0.4 | 0.8 | 0.4 | 0.4 |
| Epidural hemorrhage | Tracheostomy hemorrhage | 0.6 | 0.7 | 0.5 | 0.4 |
| Thallium sulfate toxicity | Ammonium sulfamate toxicity | 1.0 | 0.6 | 0.6 | 0.5 |
| Simple periorbital laceration | Complex periorbital laceration | 0.8 | 1.0 | 0.8 | 0.6 |
| Biceps femoris tendinitis | Profunda femoris artery thrombosis | 0.5 | 0.8 | 0.2 | 0.4 |
| Hyperplastic thrush | Hyperplastic gingivitis | 0.3 | 0.5 | 0.7 | 0.5 |
| Acer rubrum poisoning | Penicillium rubrum toxicosis | 0.5 | 0.6 | 0.6 | 0.6 |
| Acute vesicular dermatitis | Herpesviral vesicular dermatitis | 0.4 | 0.5 | 0.9 | 0.6 |
After obtaining the similarity results of all approaches for the three cases, we calculate the correlation and error values against the human expert results, as shown in Tables 4.5 and 4.6.
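As a sketch of this calculation, the fragment below assumes Pearson correlation and mean squared error; neither choice is stated explicitly in this chapter. The two lists are the Proposed measure and Human expert columns of Table 4.2, in row order. Under these assumptions the results round to 0.74 and 0.02, which agree with the correlation and error quoted for the primitive concept case in the discussion.

```python
from math import sqrt

# Proposed measure and human expert columns of Table 4.2 (30 primitive pairs).
PROPOSED = [0.5, 0.8, 0.5, 0.7, 0.9, 0.7, 0.8, 0.5, 0.5, 0.4,
            0.3, 0.6, 0.3, 0.5, 0.6, 0.4, 0.5, 0.8, 0.6, 0.9,
            0.4, 0.5, 0.3, 0.7, 0.5, 0.7, 0.4, 0.6, 0.5, 0.3]
HUMAN = [0.6, 0.8, 0.4, 0.7, 0.9, 0.6, 0.8, 0.5, 0.5, 0.4,
         0.5, 0.5, 0.3, 0.5, 0.5, 0.4, 0.5, 0.7, 0.6, 0.7,
         0.4, 0.4, 0.2, 0.2, 0.6, 0.5, 0.4, 0.4, 0.3, 0.3]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_squared_error(xs, ys):
    """Average squared difference between a measure and the expert scores."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


print(round(pearson(PROPOSED, HUMAN), 2))             # 0.74
print(round(mean_squared_error(PROPOSED, HUMAN), 2))  # 0.02
```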
Table 4.5 Correlation values between similarity measures and human expert for each case

| Method | Method type | Primitive concepts | Primitive and defined concepts | Defined concepts |
|---|---|---|---|---|
| Path-based | ontology-based | 0.04 | 0.2 | 0.2 |
| ELSIM | ontology-based | -0.19 | 0.18 | 0.03 |
| Proposed measure | concept name | 0.74 | 0.5 | 0.51 |

Table 4.6 Error values between similarity measures and human expert for each case

| Method | Method type | Primitive concepts | Primitive and defined concepts | Defined concepts |
|---|---|---|---|---|
| Path-based | ontology-based | 0.1 | 0.1 | 0.1 |
| ELSIM | ontology-based | 0.2 | 0.1 | 0.1 |
| Proposed measure | concept name | 0.02 | 0.02 | 0.02 |

4.5 Discussion
The correlation values in Table 4.5 show that existing taxonomic-based similarity approaches cannot give the desired similarity results for the ontology concepts. For the similarity between primitive concepts, the first existing approach (path-based) obtains a very low correlation (0.04), so its results are almost unrelated to the human expert results. The second existing approach (ELSIM) obtains a negative correlation (-0.19), which means that its results tend to move in the opposite direction: when its result is high, the human result tends to be low, and vice versa. Therefore, similarity measures based on taxonomical paths are not acceptable for primitive concept similarity. Our proposed method obtains the highest correlation value (0.74) and the lowest error value (0.02), so it outperforms the existing approaches for primitive concept similarity. Moreover, our proposed method obtains the highest correlations in the other two cases (between primitive and defined concepts, and between defined concepts). Consequently, we learned two important points: first, even defined concepts in the ontology need more complete information from actual medical treatment records; second, although our proposed measure is mainly intended for primitive concepts, it gives the similarity degrees closest to the human expert results in all three cases.
4.5.1 Limitations
According to the analysis of the results, our proposed measure has some limitations. Although the headword holds the core meaning of each concept, the headword in some concept names has only a general meaning (e.g., syndrome, atrophy). This is the only case in which relying on the importance of the headword cannot effectively distinguish the similarity degree, as shown in Table 4.7.

Table 4.7 Different similarity results between concepts using our proposed measure with human expert results

| Concept P1 | Concept P2 | Proposed method | Human expert |
|---|---|---|---|
| Intestinal polyposis syndrome | Fetal cytomegalovirus syndrome | 0.6 | 0.3 |
| Congenital cerebellar cortical atrophy | Congenital renal atrophy | 0.7 | 0.2 |

Even when the headword has a common meaning (e.g., disease, fever) in some concepts, our proposed method can still obtain similarity degrees close to the human expert results, because our approach assigns the second highest weight to the component nearest to the headword. Therefore, the similarity results of our proposed method do not differ greatly from the human results even if the headword has a general meaning, as shown in Table 4.8.

Table 4.8 Similarity degrees between concepts using our proposed measure with human expert results

| Concept P1 | Concept P2 | Proposed method | Human expert |
|---|---|---|---|
| Acute uterine inflammatory disease | Mycoplasmal pelvic inflammatory disease | 0.9 | 0.9 |
| Bolivian hemorrhagic fever | Dengue hemorrhagic fever | 0.8 | 0.6 |
| Acute apical abscess | Chronic apical abscess | 0.9 | 0.7 |
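The limitation illustrated by Table 4.7 can be made concrete with a small calculation. The sketch assumes that the two concept names share only their headword, that the headword weight is 0.6 as in Section 3.1.1, and that the final weights are 0.7 and 0.3; the function name `shared_headword_only_score` is illustrative. In that situation the weighted intersection is 0.6 and the weighted union is 1 + 1 - 0.6 = 1.4, whatever the modifiers are.

```python
def shared_headword_only_score(sim_cfg, head_weight=0.6, a=0.7, b=0.3):
    """Final score when the headword is the only component the names share."""
    # Each name's weights sum to 1, and the shared headword is counted once.
    sim_headword = head_weight / (2.0 - head_weight)
    return a * sim_headword + b * sim_cfg


# Same syntactic structure, e.g. "Intestinal polyposis syndrome" and
# "Fetal cytomegalovirus syndrome" if both are parsed as 3-2-1.
print(round(shared_headword_only_score(1.0), 2))   # 0.6

# No syntactic rule in common at all.
print(round(shared_headword_only_score(0.0), 2))   # 0.3
```

Under these assumptions, any two names that share only a headword score between 0.3 and 0.6, even when the headword is as general as "syndrome". This is consistent with the 0.6 reported for the first pair in Table 4.7, for which the human experts gave 0.3.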
Chapter 5 Conclusions and Recommendations
Measuring semantic similarity between ontology concepts is an important research area, with applications such as the structuring of textual resources. In the biomedical domain, determining the similarity degrees between ontology disease concepts in order to recommend similar or alternative treatments is very important for health decision support systems. The basic way to determine the similarity of ontology concepts depends on the taxonomical paths, but there are different types of ontology concepts, and most of the concepts need to be redefined with their complete information. Therefore, existing taxonomic-based similarity measures could not give similarity degrees that agree with those of the human experts.
In this thesis, we proposed a new concept name similarity measure based on ontology concept labels, which captures syntactic and semantic information for the similarity measurement. We conducted three experiments to find the similarity between (1) primitive concepts, (2) primitive concepts and defined concepts, and (3) defined concepts, and then calculated the correlation values as well as the error values to demonstrate the utility of our proposed measure. Furthermore, we re-examined the existing ontology-based similarity measures on the SNOMED CT medical ontology and pointed out the limitations and weaknesses of these measures based on the three experiments.
In conclusion, the experiments show that our proposed measure surpasses the existing taxonomic-based measures for all types of ontology concepts. In future work, we will apply our proposed measure to other medical ontologies, such as MeSH, to assess its advantages.
References
1. Resnik, P. (1999). Semantic Similarity in a Taxonomy. An Information-based
Measure and its Application to Problems of Ambiguity in Natural Language.
Journal of Artificial Intelligence Research, 11, 95-130.
2. Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E. G.M., & Milios, E. (2006)
Information Retrieval by Semantic Similarity. International Journal of Semantic
Web Informatics Systems, 55-73.
3. Abdelrahman, A. M. B., & Kayed, A. (2015). A Survey on Semantic Similarity
Measures between Concepts in Health Domain. American Journal of
Computational Mathematics, 5, 204-214.
4. WordNet database, http://wordnet.princeton.edu.
5. UMLS Terminology Services, https://uts.nlm.nih.gov/home.html.
6. SNOMED CT: Systematized Nomenclature of Medicine - Clinical Terminology.
http://www.snomed.org/snomedct/index.html.
7. Zhang, M., Patrick, J., Truran, D., and Innes, K. Deriving a SNOMED CT Data Model.
8. Medical Subject Headings (MeSH), National Library of Medicine,
http://www.nlm.nih.gov/mesh.
9. Niwattanakul, S., Singthongchai, J., Naenudorn, E., & Wanapu, S. (2013). Using
of Jaccard Coefficient for Keywords Similarity. In Proceedings of the
International MultiConference of Engineers and Computer Scientists (IMECS), 1,
Hong Kong.
10. Sree, K.P.N.V.S., & Murthy, J.V.R. (2012). Clustering Based on Cosine Similarity
Measure. In International Journal of Engineering Science and Advanced
Technology (IJESAT), 2(3), 508-512.
11. Choi, J., Oh, T., & Kweon, I.S. (2016). Human Attention Estimation for Natural
Images: An Automatic Gaze Refinement Approach. Korea Advanced Institute of
Science and Technology (KAIST), Jan.
12. Jimenez, S., Becerra, C., & Gelbukh, A. (2013). Softcardinality-core: Improving
Text Overlap with Distributional Measures for Semantic Textual Similarity.
Second Joint Conference on Lexical and Computational Semantics, 1, 194-201,
Atlanta, Georgia.
13. Wolk, K. & Marasek, K. (2014). A Sentence Meaning Based Alignment Method
for Parallel Text Corpora Preparation. In Proceedings of New Perspectives in
Information Systems and Technologies, 1, 229-237, Springer, Switzerland.
14. McCallum, A. (2006). String Edit Distance (and intro to dynamic programming)
Computational Linguistics, Spring.
15. Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and
Application of a Metric on Semantic Nets. IEEE Transactions Systems, Man and
Cybernetics. 19(1), 17-30.
16. Zare, M., Pahl, C., Nilashi, M., Salim, N., & Ibrahim, O. (2015). A Review of
Semantic Similarity Measures in Biomedical Domain Using SNOMED CT. Soft
Computing and Decision Support Systems, 2(6), 1-13.
17. Pedersen, T., Pakhomov, S. V. S., Patwardhan, S., & Chute, C. G. (2006)
Measures of Semantic Similarity and relatedness in the Biomedical Domain.
Journal of Biomedical Informatics, 40, 288-299.
18. Garla, V. N., & Brandt, C. (2012). Semantic Similarity in the Biomedical
Domain: An Evaluation across Knowledge Sources. Journal of BMC
Bioinformatics, October.
19. Choi, I., & Kim, M. (2003). Topic Distillation using Hierarchy Concept Tree.
Proceedings of the 26th annual international ACM SIGIR Conference on Research
and Development in Information Retrieval. 371-371, Toronto, Canada.
20. Mubaid, H. A., & Nguyen, H. (2006). A Cluster-based Approach for Semantic
Similarity in the Biomedical Domain. Proceedings of the 28th IEEE EMBS
Annual International Conference. New York City, USA.
21. Batet, M., Sanchez, D., & Valls, A. (2011). An Ontology-based Measure to
Compute Semantic Similarity in Biomedicine. Journal of Biomedical Informatics,
44, 118-125.
22. Tongphu, S., & Suntisrivaraporn, B. (2015). Algorithms for Measuring Similarity
Between ELH Concept Descriptions: A Case Study on SNOMED CT, Journal of
Computing and Informatics, 20.
23. Liberman, M., & Sproat, R. (1992). The Stress and Structure of Modified Noun
Phrases in English, Stanford University.
24. Petrakis, E. G. M., Varelas, G., Hliaoutakis, A., & Raftopoulou, P. (2006) X-
Similarity: Computing Semantic Similarity between Concepts from Different
Ontologies. Journal of Digital Information Management, 4(4).
25. Ko, S., Han, Y., & Salomaa, K. (2016). Approximate Matching between a
Context-free Grammar and a Finite-state Automaton. Information and
Computation, 278-289.
26. IHTSDO. SNOMED Licensing, International Health Terminology Standards
Development Organization, http://www.ihtsdo.org/licensing.
Appendix
List of Publications
A.1 International Conference
1. Htun, H. H., Sornlertlamvanich, V., & Suntisrivaraporn, B. (2016). Towards
Automatic Generation of “Preference Profile” for Primitive Concept Similarity
Measures on SNOMED CT. In the Eleventh International Conference on
Knowledge, Information and Creativity Support Systems, Yogyakarta, Indonesia,
194-199.
2. Htun, H. H., & Sornlertlamvanich, V. (2017). Text Similarity Approach for
SNOMED CT Primitive Concept Similarity Measure. In the Eighth International
Conference on Information and Communication Technology for Embedded
Systems (ICICTES), Thailand.
3. Htun, H. H., & Sornlertlamvanich, V. (2017). SNOMED CT Primitive Concept
Similarity Measure by Concept Name Text Similarity Approach. In the 27th
International Conference on Information Modeling and Knowledge Bases (EJC),
Krabi, Thailand.