
Page 1: Proceedings Template - WORD - Engineering …eil.stanford.edu/publications/jack_cheng/dgo2008.doc · Web viewIn practice, multiple terminology classifications or data model structures

Relating Taxonomies with Regulations

Chin Pang Cheng, Jiayi Pan
Stanford University
Dept. of Civil & Environmental Eng.
Stanford, CA 94305-4020
(cpcheng, [email protected])

Gloria T. Lau, Kincho H. Law
Stanford University
Dept. of Civil & Environmental Eng.
Stanford, CA 94305-4020
(glau, [email protected])

Albert Jones
Enterprise Systems Group
NIST
Gaithersburg, MD 20899-0001
([email protected])

ABSTRACT

Increasingly, taxonomies are being developed for a wide variety of industrial domains and specific applications within those domains. These industry- or application-specific taxonomies attempt to represent the vocabularies commonly used by the practitioners. These formal representations have the potential to automate information retrieval, facilitate interoperability and improve decision making. Decisions made must comply with existing government regulations and codes of practice, which are not always known to the industry practitioners. Although regulations and codes are now in digital form and are often available online, it remains difficult to search for relevant regulatory information that is applicable to particular decisions. As industry practitioners, unlike legal practitioners, are familiar with one or more industry-specific taxonomies but not necessarily regulatory organization systems, it would be desirable to relate regulations with existing industry-specific taxonomies.

The mapping from a single taxonomy to a single regulation is a trivial keyword matching task. In this paper, we examine techniques to map a single taxonomy to multiple regulations, as well as to map multiple taxonomies to a single regulation. Those techniques include cosine similarity, Jaccard coefficient and market-basket analysis. These techniques provide a metric that measures the similarity between concepts from different taxonomies. Preliminary evaluations of the three metrics are performed using examples from the building industry. These examples illustrate the potential regulatory benefits from the mapping between various taxonomies and regulations.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – retrieval models, J.1 [Administrative Data Processing]: law.

Keywords

Heterogeneous Ontologies, Taxonomy Interoperability, Relatedness Analysis, Regulation Retrieval.

1. INTRODUCTION

Government regulations are an important asset of society. They extend the laws governing the country with specific guidance for corporate and public actions. Ideally, regulations should be readily retrievable by interested individuals. To aid understanding of the law, much prior research focused on the abstraction and retrieval of case law [1, 3, 5, 30, 38, 40], analysis of regulations [19, 20], and compliance guidance for regulations [15, 16]. Relatively little research, however, has been devoted to methodologies and tools that allow practitioners to intelligently browse and retrieve relevant regulations utilizing familiar terms and vocabularies (for example, according to industry-specific taxonomies).

Increasingly, taxonomies are being developed to capture and represent those terms and vocabularies formally for a wide variety of industrial domains. These taxonomies can facilitate information interoperation and regulation retrieval. Interoperability is important because it allows practitioners to access, relate, and combine information from multiple, heterogeneous sources and therefore increases the value of information. Recent studies by the National Institute of Standards and Technology (NIST) have reported that inefficient interoperability and integration led to significant costs to the construction as well as the automotive industries [7, 14].

Ontologies have been proposed as a way to remove these inefficiencies. One recent forecast estimates that “By 2010, ontologies ….will be the basis for 80 percent of application integration projects” [32]. Ontologies serve as a means for information sharing and capture the semantics of domain-specific information in a formal and computer interpretable form. They have the potential to automate much of the integration process, thereby reducing cost and time significantly.

Building a single ontology for an entire domain, however, has proven to be neither efficient nor practical. Rather, small communities that need to exchange information frequently tend to build their own distinct ontologies [36]. In practice, multiple terminology classifications or data model structures exist. For instance, the architectural, engineering and construction (AEC) community has built several ontologies that describe the semantics of building models. Even though these ontologies are all targeted towards the same user group, the structures, vocabularies and coverage differ depending on the application.

Government regulations, on the other hand, are often organized and classified by the needs of the agency that enforces them, rather than the needs of the communities that use them [6]. Consequently, there is a clear need for, and benefit in, bridging these two distinct needs. One way to build such a bridge is to develop methods and tools that enable practitioners to browse and retrieve government regulations using their own terms and vocabularies, for example, via existing industry taxonomies. For instance, when users browse through regulations for compliance requirements, adhering to an existing taxonomy that they are familiar with minimizes the need to learn new classifications and vocabularies. Their mental models are better represented by existing taxonomies than by an agency's classification of regulations.


In this paper, we present a systematic approach to map regulations to industry-specific taxonomies. We begin with linking one taxonomy to one regulation, which is a trivial keyword extraction task. Extending one taxonomy to multiple regulations requires clustering of relevant sections from different regulations. To do this, we reuse the relatedness analysis core from [19] to compute relevancy between those sections. We then discuss the need for, and the challenges of, mapping multiple taxonomies to a single regulation. Three different methodologies are investigated to cluster relevant concepts from different taxonomies in order to compute relevancy between those concepts. Cosine similarity and Jaccard coefficient, two vector-based similarity measures commonly used in the field of information retrieval, are adopted to compare semantic similarity between ontologies. The market-basket model, a popular technique in data mining, is modified as a relatedness analysis measure for ontology mapping. Our preliminary evaluations of the three metrics are discussed. We conclude with our proposal to address methods for mapping multiple taxonomies to multiple regulations.

2. ILLUSTRATIVE ONTOLOGY STANDARDS AND REGULATORY CORPUS

We work with taxonomies and regulatory corpora from both the building industry and the environmental protection industry [15, 16, 19, 20]. To illustrate their organization and structure, we briefly present the ontology standards and classification systems that are commonly used in the building industry. For the AEC industry, there are a few ontologies that describe the semantics of building models, such as the CIMsteel Integration Standards (CIS/2) for the steel building and fabrication industry [8], the Industry Foundation Classes (IFC) initiated by the CAD vendors for design description of building components [12], and the OmniClass construction classification system (OmniClass) for the construction specification, materials and product components [34].

Figures 1 and 2 show excerpted examples of the OmniClass and IfcXML standards. Typical of ontology standards, both are organized hierarchically with implicit “is-a” type relationships defined accordingly. OmniClass consists of 15 tables, each of which represents a different facet of construction information. Each term is associated with a unique ID. For example, the term “Sound and Signal Devices” is associated with the ID “23-85 10 11 11”. For the IfcXML, the Industry Foundation Class objects are expressed in an XML structure that defines the hierarchical relationship between elements and entities. To extract the object terms for mapping purposes, the two standards are preprocessed to eliminate the miscellaneous information, such as the IDs in the OmniClass and the element names, group names and type names in the IfcXML, as well as the duplicated terms.

Regulations are voluminous and cover a broad range of scopes and topics. Increasingly, regulatory documents are available online and organized in XML structure. The International Building Code (IBC) [13], which represents the code of practice in the building industry, is employed as one of the regulatory document corpora. Figure 3 shows a provision in IBC and its representation in XML structure. One notable feature of regulations is that they are typically organized into sections and sub-sections, each of which contains content with a specific topic or scope. The tree hierarchy of regulations provides useful information that can be explored, for example, to locate similar sections and to build an e-government system [19, 20].

Figure 1: Excerpt from OmniClass Construction Classification System

Figure 2: Organization of IfcXML


Figure 3: An IBC Provision and XML Structure

3. ONE TAXONOMY TO ONE REGULATION

Mapping one taxonomy to one regulation is a simple keyword latching task. There are many commercial tools available to latch keywords from documents into a taxonomy. Industry taxonomies are hierarchical classification systems that are generally fewer than 10 levels deep. Node labels in the taxonomy tree are treated as concept keywords, and they are mapped to sections in the regulation where they appear. As regulations tend to be voluminous, we use a section or subsection as a unit of interest. Figure 4 shows the International Building Code (IBC) [13] latched with the OmniClass. Users can then traverse the taxonomy and browse relevant sections of the regulation.
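The keyword latching step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the commercial tooling the text refers to: the function name `latch_concepts`, the section identifiers and the section texts are all hypothetical, and matching is assumed to be a case-insensitive whole-phrase match.

```python
import re
from collections import defaultdict


def latch_concepts(concepts, sections):
    """Map each taxonomy concept to the IDs of the sections that mention it.

    concepts: iterable of taxonomy node labels used as concept keywords.
    sections: dict of section ID -> section text (one unit of interest each).
    """
    index = defaultdict(list)
    for concept in concepts:
        # Assumption: case-insensitive whole-phrase match, no stemming.
        pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
        for section_id, text in sections.items():
            if pattern.search(text):
                index[concept].append(section_id)
    return dict(index)


# Hypothetical section texts, invented for illustration only.
sections = {
    "S1": "Steel decking shall be fastened to the supporting members.",
    "S2": "A concrete slab on steel decking shall be reinforced.",
    "S3": "Alarm devices shall be audible throughout the building.",
}
latched = latch_concepts(["steel decking", "slab", "chlorine"], sections)
print(latched)
```

Concepts that never appear in the regulation (here "chlorine") are simply absent from the result, so a browsing interface built on this index would show them with no matching sections.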

Extending the mapping from one taxonomy to multiple regulations unfortunately leads to the classic problem of information overload. For instance, suppose we want to search the Web to find state regulations governing chlorine levels in drinking water. If we search the drinking-water regulations in Alabama and Arizona for the concept “chlorine”, we would find over 30 sections in each. The actual relevancy of these 60 sections to chlorine levels is not known. The problem is that Web content ignores the actual structure of the documents. Consequently, search engines cannot take that structure into account when computing relevancy. The result is that users quickly become frustrated with information overload; intelligent retrieval and presentation of web results has become the holy grail of search [4]. Fortunately, regulatory documents are much more organized and structured than web content, and we propose to solve the problem of information overload by clustering relevant sections from different regulations and pivoting on one regulation that the user is most familiar with. We discuss our approach in the following section.

4. ONE TAXONOMY TO MULTIPLE REGULATIONS

Simultaneous traversal of multiple regulation trees using one taxonomy is a challenging but real problem. It is not uncommon for industry practitioners to be familiar with one particular regulation but not the others. For example, an engineer from Montgomery might be familiar with Alabama state code, but not Arizona state code. Nonetheless, if a water distribution system is to be designed that provides water to Phoenix from lakes near Montgomery, the engineer would need an understanding of both [9]. In this scenario, finding the relevant Arizona regulations on chlorine levels might pose a serious problem. We believe that it is beneficial to map the taxonomy to Alabama code first, and then branch out to recommend related sections from the Arizona code. In general, focusing on one regulation as the basis for finding relevant sections from other regulations significantly reduces information overload.

Figure 5 shows a simple user interface that illustrates a scenario of finding related provisions between regulations from the two states. After browsing down the taxonomy tree to the concept “chlorine”, users are shown a list of matched sections from the Alabama regulation. As discussed in Section 3, matching sections to a taxonomy concept is simply keyword latching. Selecting Section 335.7.6.15 of the AL code shows that there are 15 recommended sections from the Arizona regulation. Users can stay focused on the regulation of their choice, and at the same time acquire relevant sections from other regulations as needed. There are two major challenges to developing such a system: a suitable user interface and a methodology for determining relevant regulations. In this paper, we focus on methodologies for making recommendations based on relevancies between sections from different regulations.

To identify related provisions from different regulations, we reuse the relatedness analysis core from [19, 20], which compares sections from different regulations based on shared features using a cosine similarity measure (see Section 5.1) [18, 31]. The goal is to identify the most strongly related provisions using not only a traditional term match but also a combination of feature matches, and not only content comparison but also structural analysis. Regulations are first compared based on conceptual information as well as domain knowledge through a combination of feature matching. Regulations also possess specific structures, such as a tree hierarchy of provisions and the referential structure. These structures represent useful information for locating related provisions and are, therefore, used in the analysis as well. For a detailed discussion of the evaluation of results from the relatedness analysis of provisions, see [19].

<LEVEL level-depth="8" style-id="0-0-0-304" style-name="Section3" style-name-escaped="Section3" toc-section="true"><RECORD id="0-0-0-5529" number="5529" version="3"><HEADING>[F] 907.2.11.3 Emergency voice/alarm communication system.</HEADING><PARA><DESTINATION id="0-0-0-3521" name="IBC2006907.2.11.3"/><CHARFORMAT bold="1" hidden="0" italic="0" strike-out="0" underline="0">[F] 907.2.11.3 Emergency voice/alarm communication system. </CHARFORMAT></PARA></RECORD><LEVEL level-depth="0" style-id="0-0-0-0" style-name="Normal Level" style-name-escaped="Normal-Level" toc-section="false"><RECORD id="0-0-0-5530" number="5530" version="3"><PARA style-id="0-0-0-15" style-name="Body3" style-name-escaped="Body3">An emergency voice/alarm communication system, which is also allowed to serve as a public address system, shall be installed in accordance with NFPA 72, and shall be audible throughout the entire special amusement building.</PARA></RECORD></LEVEL></LEVEL>

Figure 4: Regulation Latched with Taxonomy Concepts

Figure 5: Chlorine mapped to Section 335.7.6.15 in AL code, which has 15 related sections in AZ code

5. MULTIPLE TAXONOMIES TO ONE REGULATION

Apart from mapping one taxonomy to many regulations, we also attempt to map many taxonomies to one regulation. As suggested in the Introduction section, multiple taxonomies have been developed for different applications within the same industry domain. Most industry practitioners are familiar with at least one of the taxonomies, but they frequently need to deal with others for various applications [2, 23]. Therefore, mapping regulations to a single taxonomy limits the usability of the system. However, traversing regulations using multiple taxonomy trees poses a non-trivial problem. There has been much research on ontology merging [33, 39]. These efforts produced a merged ontology that can be used for data interoperability but not as a front-end representation format. Since users would need to learn the newly merged ontology in order to browse regulations, this would defeat the original intent of using existing taxonomies to help locate regulatory provisions. Using the same argument as in Section 4, we believe that focusing on one taxonomy that users are familiar with is the right starting point to traverse regulations. Once users reach a taxonomy node of interest, related concepts from other taxonomies can be suggested and users can switch their focal point from one taxonomy to another.

Figure 6 illustrates the proposed approach using the OmniClass [34] and the IFC [12] taxonomies, and the International Building Code (IBC) regulations [13]. The OmniClass is altered from its original representation, shown in Figure 6, to display a widget upon mouse-over that includes an ordered list of matching IBC sections and recommended relevant IFC concepts. In this scenario, the user is more familiar with the OmniClass hierarchy, and thus starts browsing the IBC using this taxonomy. The user uses the term “steel decking” from OmniClass to find an ordered list of matching IBC sections and relevant IFC concepts. Upon locating a list of IBC sections that are related to “steel decking”, sorted in order of relevance, the user also sees a list of related IFC concepts including “slab”. Mousing-over the IFC concept “slab” brings the focal point to the IFC hierarchy, where the user is presented with the same analysis – namely the IFC elements around this concept “slab”, a ranked list of matching IBC sections, and a ranked list of relevant OmniClass concepts.

As opposed to locating related sections from multiple regulations, the task here is to identify similar or related concepts from multiple taxonomies. Ontology mapping has been an active research area since the semantic web movement [28, 29]. It is difficult to develop mappings between two arbitrary ontologies. In our case, however, the problem is slightly more manageable since our ontologies are very industry specific and are targeted towards the same group of users. Similar to the techniques presented in Section 4, the relevance among concepts from different ontologies is computed using a vector comparison approach. A document corpus is used to relate concepts by computing their co-occurrence frequencies. This training corpus must be carefully selected as it represents the relevancy among concepts from different taxonomies. Conveniently, we have a corpus of regulatory documents that have been meticulously drafted and reviewed for accuracy. Unlike web content, regulations do not have random co-occurrences of phrases in the same provision. This dramatically increases the likelihood of finding real matches.

Figure 6: Traversing the IBC using OmniClass Taxonomy with Relevant Concepts from the IFC Taxonomy

Consider a pool of m concepts and a corpus of n regulation sections. A frequency vector $\vec{c}_i$ is an n-by-1 vector storing the occurrence frequencies of concept i among the n documents. That is, the k-th element of $\vec{c}_i$ equals the number of times concept i is matched in section k. In subsequent sections, we will discuss three metrics to compute the similarity score among concepts. In our example shown in Figure 6, to relate “steel decking” from the OmniClass to “slab” from the IFC, we compute their similarity score based on the defined metrics. As shown in the figure, their cosine similarity score is 0.895, which ranks second among all IFC concepts that are relevant to “steel decking”.
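As an illustration of how such frequency vectors might be assembled, the following Python sketch counts how often each concept is matched in each section. The function name `frequency_vectors`, the matching rule (case-insensitive whole-phrase match) and the sample section texts are assumptions made for this example, not details given in the paper.

```python
import re


def frequency_vectors(concepts, sections):
    """Return a dict mapping each concept to its frequency vector.

    The k-th entry of a vector is the number of times the concept is
    matched in the k-th section of the corpus.
    """
    vectors = {}
    for concept in concepts:
        # Assumption: whole-phrase, case-insensitive matching.
        pattern = re.compile(r"\b" + re.escape(concept) + r"\b", re.IGNORECASE)
        vectors[concept] = [len(pattern.findall(text)) for text in sections]
    return vectors


# Hypothetical corpus of n = 3 sections, invented for illustration.
sections = [
    "Steel decking shall support the slab. The slab shall be reinforced.",
    "Steel decking shall be galvanized.",
    "Alarm devices shall be audible.",
]
vectors = frequency_vectors(["steel decking", "slab"], sections)
print(vectors)
```

Each vector has one entry per section, so concepts from different taxonomies become directly comparable once they are counted against the same corpus.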

5.1 Cosine Similarity

Cosine similarity is a non-Euclidean distance measure between two vectors. It is a common approach to compare documents in the field of text mining [18, 31]. Given two frequency vectors $\vec{c}_i$ and $\vec{c}_j$, the similarity score between concepts i and j is represented using the dot product:

$$Sim(i, j) = \frac{\vec{c}_i \cdot \vec{c}_j}{\|\vec{c}_i\| \, \|\vec{c}_j\|}$$

Because the frequency counts are non-negative, the resulting score is in the range of [0, 1], with 1 as the highest relatedness between concepts i and j.
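A minimal Python sketch of this measure is given below, using the standard definition of cosine similarity (dot product divided by the product of the vector lengths). The two frequency vectors are hypothetical, and returning 0 for a concept that never occurs is an assumption of this sketch rather than something the paper specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a concept with no occurrences is treated as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical frequency vectors over a corpus of three sections.
steel_decking = [1, 1, 0]
slab = [2, 0, 0]
score = cosine_similarity(steel_decking, slab)
print(round(score, 3))
```

Unlike the set-based measures discussed next, this score depends on how many times a concept is matched in a section, not only on whether it is matched.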

5.2 Jaccard Similarity Coefficient

Jaccard similarity coefficient [31, 37] is a statistical measure of the extent of overlap between two vectors. It is defined as the size of the intersection divided by the size of the union of the vector dimension sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Jaccard similarity coefficient is a popular measure of term-term similarity due to its simplicity and retrieval effectiveness [17]. Two concepts are considered similar if there is a high probability for both concepts to appear in the same sections. To illustrate the application to our problem, let $N_{11}$ be the number of sections both concept i and j are matched to, $N_{10}$ be the number of sections concept i is matched to but not concept j, $N_{01}$ be the number of sections concept j is matched to but not concept i, and $N_{00}$ be the number of sections that both concept i and j are not matched to. The similarity between both concepts is then computed as

$$Sim(i, j) = \frac{N_{11}}{N_{11} + N_{10} + N_{01}}$$

Since the size of the intersection cannot be larger than the size of the union, the resulting similarity score is between 0 and 1.
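The Jaccard computation can be sketched in Python by reducing each frequency vector to the set of sections in which the concept is matched at least once. The function name and the sample vectors are hypothetical, and returning 0 when neither concept occurs anywhere is an assumption of this sketch.

```python
def jaccard_similarity(u, v):
    """Jaccard coefficient of the section sets behind two frequency vectors."""
    sections_u = {k for k, count in enumerate(u) if count > 0}
    sections_v = {k for k, count in enumerate(v) if count > 0}
    union = sections_u | sections_v
    if not union:
        # Assumption: two concepts that never occur are treated as unrelated.
        return 0.0
    return len(sections_u & sections_v) / len(union)


# Hypothetical vectors: N11 = 2, N10 = 1, N01 = 0, N00 = 1.
concept_i = [1, 1, 0, 3]
concept_j = [2, 0, 0, 1]
score = jaccard_similarity(concept_i, concept_j)
print(round(score, 3))
```

Note that sections matched by neither concept (N00) play no role in the score, which is why the measure only needs set intersections and unions.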

5.3 Market-Basket Model

The market-basket model is a probabilistic data-mining technique to find item-item correlation [11]. The task is to find the items that frequent the same baskets. The support of each itemset I is defined as the number of baskets containing all items in I. Sets of items that appear in s or more baskets, where s is the support threshold, are the frequent itemsets.

Market-basket analysis is primarily used to uncover association rules between items and itemsets. The confidence of an association rule $\{i_1, i_2, \ldots, i_k\} \rightarrow j$ is defined as the conditional probability of j given itemset $\{i_1, i_2, \ldots, i_k\}$. The interest of an association rule is defined as the absolute value of the difference between the confidence of the rule and the probability of item j. To compute the similarities among concepts, our goal is to find concepts i and j where either association rule $i \rightarrow j$ or $j \rightarrow i$ is high-interest.

Consider a corpus of n documents. Let $N_{11}$ be the number of sections both concept i and j are matched to, $N_{10}$ be the number of sections concept i is matched to but not concept j, and $N_{01}$ be the number of sections concept j is matched to but not concept i. The probability of concept j is computed as

$$P(j) = \frac{N_{11} + N_{01}}{n}$$

and the confidence of the association rule $i \rightarrow j$ is

$$Conf(i \rightarrow j) = \frac{N_{11}}{N_{11} + N_{10}}$$

The forward similarity of the concepts i and j, which is the interest of the association rule $i \rightarrow j$ without the absolute value, is expressed as

$$Sim(i, j) = \frac{N_{11}}{N_{11} + N_{10}} - \frac{N_{11} + N_{01}}{n}$$

The value ranges from -1 to 1. The value of -1 means that concept j appears in every section while concept i does not co-occur in any of these sections. The value of 1 is unattainable because $(N_{11} + N_{01})$ cannot be zero while the confidence equals one. Conceptually, it represents the boundary case where the occurrence of concept j is not significant in the corpus, but it appears in every section in which concept i appears.
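The forward similarity can be sketched in Python as the confidence of the rule i → j, taken as N11 / (N11 + N10), minus the probability of concept j, taken as (N11 + N01) / n. These two expressions are our reading of the definitions in the text. The function name and sample vectors are hypothetical, and raising an error when concept i never occurs is a choice made for this sketch.

```python
def forward_similarity(u, v):
    """Interest of the rule i -> j (no absolute value) from two frequency vectors.

    u is the frequency vector of concept i, v that of concept j.
    """
    n = len(u)
    n11 = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    n10 = sum(1 for a, b in zip(u, v) if a > 0 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b > 0)
    if n11 + n10 == 0:
        # Assumption: the confidence is undefined when concept i never occurs.
        raise ValueError("concept i is not matched to any section")
    confidence = n11 / (n11 + n10)
    probability_j = (n11 + n01) / n
    return confidence - probability_j


# Hypothetical vectors over n = 5 sections.
concept_i = [1, 0, 0, 0, 0]
concept_j = [1, 1, 1, 0, 0]
forward = forward_similarity(concept_i, concept_j)   # rule i -> j
backward = forward_similarity(concept_j, concept_i)  # rule j -> i
print(round(forward, 3), round(backward, 3))
```

The two directions give different scores for the same pair of concepts, which illustrates why the forward and backward scores of this measure are not symmetrical.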

Table 1 summarizes the similarities and differences among the three metrics, namely the cosine similarity, Jaccard similarity and the market basket model. All are statistical analysis measures that leverage the co-occurrence frequencies of concepts in the regulatory corpus.

5.4 Use of Regulation Hierarchy Structural Information

Many related concepts can be uncovered by treating each section in a regulation as an independent document. In this model, a concept-document matrix is generated to compute concept co-occurrence in documents, which are in fact regulatory sections. This model is generally sufficient to reveal most related concepts, but some related concepts rarely co-occur in the same sections. For example, if two concepts are linked by an Is-A relationship, like door furniture and door hardware, they may be used in the same regulation interchangeably but in different sections.


Table 1: Similarities and differences among the three statistical analysis measures

| | Cosine Similarity | Jaccard Similarity | Market-basket Model |
|---|---|---|---|
| Non-Euclidean | Yes | Yes | Yes |
| Vector-based | Yes | Yes | No |
| Underlying methodology | Vector space model | Set theory (intersections and unions) | Probability theory and association rule |
| Symmetrical forward and backward scores | Yes | Yes | No |
| Range of scores | [0, 1] | [0, 1] | [-1, 1) |
| Usage | Often used as the baseline metric | Computationally effective | To discover potential item-item correlation |

Is-A-related concepts are also hard to find if each section is treated as if it were an independent document. The relationship between Is-A-related concepts, such as “building materials” and “concrete” as shown in Figure 7, is sometimes implicit in the structure of sections. For example, the descriptions of “building materials” and those of “concrete” may not appear in the same section. Instead, the sections describing “concrete” are usually subsections of the sections describing “building materials.” If we consider subsections and their parent section in the computation of the similarity score between “building materials” and “concrete,” the implicit relationship between the two concepts might become more obvious. Therefore, the hierarchical structure of sections needs to be considered to extract non-trivial related concepts.

Regulations contain well-organized hierarchical structures with sections and sub-sections of specific topic or scope. There are organizational and referential structures explicitly defined in regulations. Section 4 briefly discussed the usage of the regulation hierarchical information to locate related sections from different regulation trees. The results [20] show that regulatory structure sometimes helps to reduce prediction error of related provisions [19]. Here, regulation hierarchy will be considered to uncover semantic relationships between concepts from different taxonomies in the same manner.

Well-structured regulations could be represented as a hierarchical tree, where each section corresponds to a discrete node. As illustrated in Figure 8, each section has a parent section, a set of sibling sections and a set of child sections. In general, for a section with a particular topic, the parent section describes a broader topic, the sibling sections describe parallel topics, and the child sections describe more specific topics. In our computation, we will consider the co-occurrence of concepts in a broader scope, namely the parent, sibling and child sections.

The frequency matrix C is modified to take the parent section, sibling sections and child sections into consideration. To include the parent section, the weighted numbers of occurrence for all the concepts in the parent section are added to the numbers of occurrence in the self section. Similarly, the sibling sections and child sections are then included with a discounted weight. In our formulation below, we will denote Par(k), Sib(k) and Child(k) as the parent section, set of sibling sections and set of child sections of Section k. The k-th element of frequency vector $\vec{c}_i$, i.e., the number of times concept i is matched to Section k, is updated as

$$c_{i,k} \leftarrow c_{i,k} + w_p \, c_{i,Par(k)} + w_s \sum_{s \in Sib(k)} c_{i,s} + w_c \sum_{d \in Child(k)} c_{i,d}$$

where $w_p$, $w_s$ and $w_c$ are the weights of the parent, sibling and child sections respectively.
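The update can be illustrated with a short Python sketch that adds weighted parent, sibling and child counts to each section's own count for a single concept. Everything here is an assumption made for illustration: the function name, the weight values, the counts, and the choice to compute every update from the original (not yet updated) counts. The section tree mirrors the layout of Figure 8.

```python
def include_neighbors(freq, parent, w_p, w_s, w_c):
    """Add weighted parent, sibling and child counts to each section's count.

    freq:   dict of section ID -> occurrence count of one concept.
    parent: dict of section ID -> parent section ID (None for the root).
    """
    children = {}
    for section, par in parent.items():
        children.setdefault(par, []).append(section)

    updated = {}
    for k, count in freq.items():
        par = parent.get(k)
        siblings = [s for s in children.get(par, []) if s != k] if par is not None else []
        kids = children.get(k, [])
        parent_part = w_p * freq.get(par, 0) if par is not None else 0
        sibling_part = w_s * sum(freq.get(s, 0) for s in siblings)
        child_part = w_c * sum(freq.get(c, 0) for c in kids)
        # All terms use the original counts, so updates do not cascade.
        updated[k] = count + parent_part + sibling_part + child_part
    return updated


# Section tree shaped like Figure 8; counts and weights are hypothetical.
parent = {"7": None, "7.1": "7", "7.2": "7", "7.3": "7", "7.4": "7",
          "7.3.1": "7.3", "7.3.2": "7.3"}
freq = {"7": 2, "7.1": 0, "7.2": 1, "7.3": 0, "7.4": 0, "7.3.1": 3, "7.3.2": 0}
updated = include_neighbors(freq, parent, w_p=0.5, w_s=0.25, w_c=0.5)
print(updated["7.3"])
```

In this example the concept is never matched in Sec 7.3 itself, yet its updated count becomes positive because the parent, one sibling and one child section mention it.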

5.5 Evaluations of the Measures

Twenty concepts are randomly selected from each of the OmniClass and IFC hierarchies, and pairwise similarity scores are computed using the three relatedness analysis measures described above. The results of concept matching performed by domain experts are treated as the true matches. Root mean square error (RMSE), precision, recall and F-measure are used as performance metrics to evaluate and compare the three measures and the use of regulation structural information. A baseline ontology matcher is compared to the three measures using precision and recall as the evaluation metrics.

Three domain experts are asked to identify the related concept pairs among a total of 400 possible pairs. Related concept pairs are assigned a true value of one; all other pairs are assigned a true value of zero. As for the predicted values, two concepts are predicted as similar or related if the computed similarity score is larger than a certain threshold score. Based on each of the three manual and independent mappings, we computed values of RMSE, precision, recall and F-measure for the three measures, the baseline ontology matcher, and different regulation structural information inclusions. The averages of the results from the three true answers are then taken as the final results. Details are given in the following sections.

Figure 7: Example of related but rarely co-occurring concepts [tree: “Building materials” has children “Concrete” and “Steel”; “Concrete” has child “Properties of Concrete”]

Figure 8: Tree hierarchy of sections in regulations [tree: Ch 7 is the parent of Sec 7.1, Sec 7.2, Sec 7.3 and Sec 7.4; for Sec 7.3 (self), the siblings are Sec 7.1, Sec 7.2 and Sec 7.4, and the children are Sec 7.3.1 and Sec 7.3.2]

5.5.1 Root Mean Square Errors (RMSE) among the Three Measures

Root mean square error (RMSE) is a metric to compute the difference between the predicted values and the true values so as to evaluate the accuracy of the prediction. Comparing an ontology of m concept terms with an ontology of n concept terms involves m by n concept-concept pairs. Therefore the RMSE is calculated as

$$RMSE = \sqrt{\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( predicted_{i,j} - true_{i,j} \right)^2}$$
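As a hedged illustration of the RMSE computation, the Python sketch below thresholds a matrix of similarity scores into 0/1 predictions and compares them with expert labels. The function name, the 2-by-2 matrices and the threshold are hypothetical, and treating a score strictly above the threshold as a predicted match follows our reading of the evaluation setup described in Section 5.5.

```python
import math


def rmse(true_matrix, score_matrix, threshold):
    """RMSE between 0/1 expert labels and thresholded similarity scores."""
    m = len(true_matrix)
    n = len(true_matrix[0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            # A pair is predicted as related when its score exceeds the threshold.
            predicted = 1 if score_matrix[i][j] > threshold else 0
            total += (predicted - true_matrix[i][j]) ** 2
    return math.sqrt(total / (m * n))


# Hypothetical labels and scores for 2 x 2 concept pairs.
true_matrix = [[1, 0], [0, 1]]
score_matrix = [[0.9, 0.6], [0.1, 0.2]]
error = rmse(true_matrix, score_matrix, threshold=0.5)
print(round(error, 3))
```

Sweeping `threshold` over a range of values and recording the error at each step would produce a curve of RMSE against threshold similarity score for one measure.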

Figure 9 shows the results of the three measures compared using RMSE for threshold similarity scores ranging from 0.15 to 0.9. No regulation hierarchy structural information is considered. As illustrated in Figure 9, we conclude that the market-basket model results in the lowest RMSE for most threshold similarity scores. This means that the market-basket model outperforms the other two measures in locating related concept pairs from different ontologies, using provisions from regulations as independent documents in the co-occurrence computation. Cosine similarity appears to be average among the three measures.

5.5.2 Precision, Recall and F-measure among the Three Measures

We use precision, recall and F-measure values to compare the three similarity analysis measures and the use of regulation hierarchy structural information. While RMSE takes both correctness and incorrectness of prediction into consideration, precision and recall emphasize correctness only. Precision and recall evaluate the accuracy of predictions and the coverage of accurate pairs. Precision measures the fraction of predicted matches that are correct, i.e., the number of true positives over the number of pairs predicted as matched. Recall measures the fraction of correct matches that are predicted, i.e., the number of true positives over the number of pairs that are actually matched. They are computed as

$$\text{precision} = \frac{\text{number of true positives}}{\text{number of pairs predicted as matched}}, \qquad \text{recall} = \frac{\text{number of true positives}}{\text{number of pairs actually matched}}$$

There is always a tradeoff between precision and recall. F-measure is therefore used to combine both metrics. It is a weighted harmonic mean of precision and recall, i.e., the reciprocal of the weighted arithmetic mean of the reciprocals of precision and recall. With equal weights on precision and recall, it is computed as

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
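As a rough illustration of these metrics, the following Python sketch counts true positives, false positives and false negatives over binary true and predicted matrices, and then averages the results over several expert mappings. It rests on assumptions not stated in the text: the balanced (equal-weight) form of the F-measure is used, a metric is reported as zero when its denominator is zero, and the names `precision_recall_f` and `average_over_experts` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def precision_recall_f(true: Matrix, predicted: Matrix) -> Tuple[float, float, float]:
    """Return (precision, recall, F) for binary true/predicted pair matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(true, predicted):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1  # predicted as matched and actually matched
            elif p == 1 and t == 0:
                fp += 1  # predicted as matched but not actually matched
            elif p == 0 and t == 1:
                fn += 1  # actually matched but not predicted
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumption: balanced F-measure (equal weights on precision and recall).
    if precision + recall:
        f = 2 * precision * recall / (precision + recall)
    else:
        f = 0.0
    return precision, recall, f


def average_over_experts(expert_truths: List[Matrix], predicted: Matrix) -> Tuple[float, ...]:
    """Average (precision, recall, F) over several independent expert mappings."""
    results = [precision_recall_f(truth, predicted) for truth in expert_truths]
    return tuple(sum(r[i] for r in results) / len(results) for i in range(3))


# Made-up example (hypothetical values, not data from the paper).
predicted_pairs = [[1, 1], [0, 1]]
expert_a = [[1, 0], [1, 1]]
expert_b = [[1, 1], [0, 1]]
print(precision_recall_f(expert_a, predicted_pairs))
print(average_over_experts([expert_a, expert_b], predicted_pairs))
```

Averaging per-expert results, rather than merging the experts' answers first, follows the description that the results from the three true answers are averaged to give the final results.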

Figure 10 shows the results for the three relatedness analysis measures using F-measure, a combination of precision and recall. The market-basket model shows the highest F-measure values in all cases, consistent with the RMSE results. In fact, the market-basket model achieves the highest recall rate with relatively high precision in all cases. Jaccard similarity is not preferred due to its low F-measure values, which result from its very low recall rates. Cosine similarity falls between the other two measures, again consistent with the RMSE results.

Figure 9: Evaluation results of the three measures using RMSE (horizontal axis: threshold similarity score; vertical axis: RMSE; series: cosine similarity, Jaccard similarity, market-basket)

Figure 10: Evaluation results of the three measures using F-measure (horizontal axis: threshold similarity score; vertical axis: F-measure; series: cosine similarity, Jaccard similarity, market-basket)

Figure 11: Evaluation results of market-basket model using F-measure (horizontal axis: threshold similarity score; vertical axis: F-measure; series: no hierarchical information; parent only; children only; parent and children; parent, siblings and children)


As the market-basket model outperforms cosine and Jaccard similarities on both RMSE and the F-measure, we evaluate the impact of regulation hierarchy using the market-basket model as the similarity measure of choice. As shown in Figure 11, the effect of including regulatory structure in the analysis is inconclusive. In general, it increases the recall rate and reduces precision, as more regulatory nodes are considered to locate related concepts. The inclusion of the parent section produced a slightly higher F-measure at most threshold scores, likely because the parent relationship is one to one, which minimizes the impact on precision. Other relationships, such as sibling and child, are not one to one; the number of such relationships could therefore heavily tax precision with only a minor increase in recall.
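The difference between the one-to-one parent relationship and the one-to-many sibling and child relationships can be seen in a small sketch. The section tree below mirrors the example of Figure 8; the `PARENT` dictionary, the function names and the flag names are hypothetical, and how the collected sections enter the co-occurrence computation is not shown here.

```python
from typing import Dict, List, Optional

# Hypothetical section tree mirroring Figure 8, stored as child -> parent links.
PARENT: Dict[str, Optional[str]] = {
    "Ch 7": None,
    "Sec 7.1": "Ch 7",
    "Sec 7.2": "Ch 7",
    "Sec 7.3": "Ch 7",
    "Sec 7.4": "Ch 7",
    "Sec 7.3.1": "Sec 7.3",
    "Sec 7.3.2": "Sec 7.3",
}


def children_of(section: str) -> List[str]:
    """Return the direct child sections of a section, in insertion order."""
    return [s for s, p in PARENT.items() if p == section]


def neighborhood(
    section: str,
    parent: bool = False,
    siblings: bool = False,
    children: bool = False,
) -> List[str]:
    """Collect the section itself plus the selected hierarchical neighbors."""
    result = [section]
    p = PARENT[section]
    if parent and p is not None:
        result.append(p)  # at most one parent per section
    if siblings and p is not None:
        result.extend(s for s in children_of(p) if s != section)
    if children:
        result.extend(children_of(section))
    return result


# The five configurations compared in Figure 11, applied to Sec 7.3.
print(neighborhood("Sec 7.3"))
print(neighborhood("Sec 7.3", parent=True))
print(neighborhood("Sec 7.3", children=True))
print(neighborhood("Sec 7.3", parent=True, children=True))
print(neighborhood("Sec 7.3", parent=True, siblings=True, children=True))
```

Adding the parent contributes at most one extra section, whereas adding siblings and children can contribute many, which is consistent with the observation that those relationships cost more precision.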

5.5.3 Comparison of the Domain-based Model with the Lexicon-based Model

In addition to comparing the three measures with one another, we also compared our domain-based approach with a traditional lexicon-based approach. Ontology mapping is an active research topic, and common mapping methods discover the semantic similarity between ontology elements using rule-based [22, 28], lexicon-based [24, 35] and structure-based methods [25, 27]. Our approach is comparable to a lexicon-based approach, in which a dictionary and a thesaurus are used to enumerate synonyms, homonyms, etc. In our analysis, we instead use a domain-specific corpus, i.e., a domain-appropriate regulation, to uncover such semantic relationships.

A thesaurus is necessary to compare our model to a lexicon-based one. A common thesaurus is WordNet [26], a well-known lexical resource for the English language. Synonyms in WordNet are interlinked by means of conceptual-semantic and lexical relations. It is one of the most widely adopted synonym sources for ontology matching techniques, including CUPID [24], Learned Ontology Model (LOM) [21], and Version Matching Approach (VMA) [41]. Table 2 shows the results of comparing the domain-based ontology mapping method with a lexicon-based element matcher using WordNet.

Table 2 shows that our domain-based approach outperforms the lexicon-based matcher in terms of precision and recall. Some examples of matches that are found by our domain-based matcher but not by the lexicon-based matcher are: (sound and signal devices, IfcSwitchingDeviceType), (door hardware, IfcBuildingElementComponent), (steel decking, IfcSlab), and (sound and signal devices, IfcAlarmType). The reliability of lexicon-based matchers is not guaranteed because their use of stemmers to reduce derived words to their root form, e.g., from piling to pile, is not always appropriate for the domain [10]. In addition, many concepts have different meanings when used in different domains, so that their synonyms and definitions could be different.

We should note that WordNet is a generic linguistic thesaurus rather than an industry-specific taxonomy. As a result, it contains little, and imprecise, information about the terminology used by OmniClass and IfcXML. The result shows that domain-related corpora, such as regulations and technical specifications, are useful in discovering the semantic relationships across multiple ontologies.

Table 2: Precision (P) and recall (R) comparisons of domain-based ontology mapping to lexicon-based ontology mapping

Score      Approach                Cosine similarity   Jaccard similarity   Market-basket model
threshold                          P       R           P       R            P       R
0.2        Lexicon-based matcher   0.50    0.03        0.00    0.00         0.00    0.00
0.2        Domain-based matcher    0.79    0.53        0.91    0.17         0.70    0.71
0.3        Lexicon-based matcher   0.50    0.03        1.00    0.03         0.50    0.03
0.3        Domain-based matcher    0.83    0.41        0.90    0.15         0.75    0.71
0.4        Lexicon-based matcher   1.00    0.03        1.00    0.03         0.50    0.03
0.4        Domain-based matcher    0.91    0.36        1.00    0.12         0.80    0.59
0.5        Lexicon-based matcher   1.00    0.03        1.00    0.03         1.00    0.03
0.5        Domain-based matcher    0.90    0.31        1.00    0.11         0.81    0.51
0.6        Lexicon-based matcher   1.00    0.03        1.00    0.03         1.00    0.03
0.6        Domain-based matcher    0.92    0.20        1.00    0.07         0.81    0.49

6. CONCLUSIONS & FUTURE TASKS

Regulatory documents are written by government agencies that organize the material to suit their own needs. From the industry practitioners’ standpoint, the original hierarchy might not be the easiest retrieval model for regulations. In this paper, we proposed a system to map concepts from industry-specific taxonomies to similar concepts in those regulations to increase their usability by industry practitioners. A running example from the AEC industry is shown to illustrate the need, the usage and the benefit of the mapping system.

1-1, 1-n and n-1 mappings between taxonomies and regulations are demonstrated. We plan to implement an n-n concept-section mapping in the future by combining the techniques of concept comparisons and section comparisons. In section comparisons, the hierarchical information of regulations is used to enhance the analysis; we also plan to incorporate the hierarchical information of taxonomies into concept comparisons. In concept comparisons, three similarity metrics are tested, whereas only cosine similarity is implemented for regulatory comparisons, which are due for more testing. Among the three metrics, we have shown in this paper that the market-basket model performs the best in terms of RMSE and F-measure, which is a combination of precision and recall. The comparison between the lexicon-based approach and our domain-based approach shows that our approach results in higher precision and recall. A few examples are given to show matches that are found by our domain-based approach but not by the lexicon-based approach.


In the future, we plan to engage potential users to help perform formal evaluations of the similarity metrics and the usability of the system. To improve usability, a better user interface is much needed, and we plan to investigate whether to implement or adopt such a visualization tool. An ideal user interface should facilitate access to the mapping of multiple taxonomies and the browsing of regulations by industry practitioners, rulemakers and domain experts.

7. ACKNOWLEDGMENTS

The authors would like to thank the International Code Council for providing the XML version of the International Building Code (2006). The authors would also like to acknowledge the support of the National Science Foundation, Grant No. CMS-0601167, the Center for Integrated Facility Engineering (CIFE) at Stanford University and the Enterprise Systems Group at the National Institute of Standards and Technology (NIST). Any opinions and findings are those of the authors, and do not necessarily reflect the views of NSF, CIFE or NIST.

8. REFERENCES