Post on 15-Dec-2015
Semantic Web Research at University of Texas at Dallas (Schema Matching + Storage & Retrieval of RDF graph)
Faculty: Latifur Khan, Bhavani Thuraisingham
Semantic Matching in the GIS Domain
Jeffrey Partyka (Ph.D. Student)
Faculty: Latifur Khan, Bhavani Thuraisingham
Funded by: [sponsor not preserved in the transcript]
Schema Matching
• Measuring the semantic similarity between two tables by mapping the properties of their instances to one another:
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas
EBD similarity
Representing types using N-grams*
• Use commonly occurring N-grams in the compared columns to determine similarity (N = 2)
Table A (compared column CA = StrName):
StrName           FENAME         Status
LOCUST-GROVE DR   LOCUST GROVE   BUILT
LOUISE LN         LOUISE         BUILT

Table B (compared column CB = Street):
Street           Laddress   Raddress
TRAIL RANGE DR   1600       1798
CR45/MANET CT    2500       2598

N-gram types from A.StrName = {LO, OC, CU, ST, …}
N-gram types from B.Street = {TR, RA, R4, 5/, …}
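The N-gram typing above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ngram_types` is hypothetical, and keeping spaces and punctuation inside the 2-grams is an assumption the slides do not confirm.

```python
def ngram_types(values, n=2):
    """Return the set of character n-grams that occur in a column's values."""
    types = set()
    for value in values:
        # Slide a window of width n over each value (spaces/punctuation kept: assumption).
        for i in range(len(value) - n + 1):
            types.add(value[i:i + n])
    return types


str_name = ["LOCUST-GROVE DR", "LOUISE LN"]    # A.StrName from the slide
street = ["TRAIL RANGE DR", "CR45/MANET CT"]   # B.Street from the slide

types_a = ngram_types(str_name)
types_b = ngram_types(street)
shared = types_a & types_b                     # N-gram types common to both columns
```

Here `types_a` contains LO, OC, CU and ST, and `types_b` contains TR, RA, R4 and 5/, matching the sets listed on the slide.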
1. Jeffrey Partyka, Neda Alipanah, Nilesh Singhania, Latifur Khan, Bhavani Thuraisingham, “Content Based Ontology Matching for GIS Datasets”, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), Page: 407-410, Irvine, California, USA, November 2008.
*Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani Thuraisingham & Shashi Shekhar, “Content Based Ontology Matching for GIS Datasets”, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), Page: 407-410, Irvine, California, USA, November 2008.
How do we measure N-gram similarity between columns?
• Entropy-Based Distribution (EBD)
• EBD is a measurement of type similarity between 2 columns:
$$\mathrm{EBD} = \frac{H(C \mid T)}{H(C)}, \qquad C = C_1 \cup C_2$$
• EBD takes values in the range [0, 1]. A greater EBD corresponds to more similar type distributions between the compared columns.
Entropy and Conditional Entropy
• Entropy: a measure of the uncertainty associated with a random variable.
• Conditional entropy: measures the remaining entropy of a random variable Y given the value of a second random variable X.
Visualizing Entropy and Conditional Entropy
$H(C) = -\sum_i p_i \log p_i$, for all $x \in C_1 \cup C_2$
$H(C \mid T) = H(C, T) - H(T)$, for all $x \in C_1 \cup C_2$ and $t \in T$
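As a hedged sketch of how EBD = H(C|T) / H(C) could be computed, the code below treats every 2-gram occurrence as one observation of the pair (column label, N-gram type). The slides do not say how the probabilities are estimated, so this counting scheme, the function names, and the value returned when H(C) = 0 are all assumptions.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngrams(value, n=2):
    """List the character n-grams of a single string."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def ebd(column1, column2, n=2):
    """EBD = H(C|T) / H(C), with H(C|T) = H(C,T) - H(T)."""
    # Assumption: one (column label, n-gram type) observation per n-gram occurrence.
    pairs = [(1, t) for v in column1 for t in ngrams(v, n)]
    pairs += [(2, t) for v in column2 for t in ngrams(v, n)]
    h_c = entropy(Counter(c for c, _ in pairs))
    h_t = entropy(Counter(t for _, t in pairs))
    h_ct = entropy(Counter(pairs))
    if h_c == 0:
        return 0.0  # assumption: undefined ratio (one column empty) scored as 0
    return (h_ct - h_t) / h_c


same = ebd(["Alma Dr."], ["Alma Dr."])   # identical type distributions
disjoint = ebd(["ab"], ["cd"])           # no shared N-gram types
```

Under these assumptions, two identical columns score 1 (knowing the N-gram type says nothing about the column) and two columns with no shared N-grams score 0.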
Faults of this Method
• Semantically similar columns are not guaranteed to have a high similarity score
Table A (A ∈ O1):
City          Country
Dallas        USA
Houston       USA
Kingston      Jamaica
Halifax       Canada
Mexico City   Mexico

Table B (B ∈ O2):
ctyName        country
Shanghai       China
Beijing        China
Tokyo          Japan
New Delhi      India
Kuala Lumpur   Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}
2-grams extracted from B: {Sh, ha, an, ng, gh, ai, Be, ei, ij, …}
Introducing Google Distance
* Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar, “Ontology Alignment Using Multiple Contexts”, International Semantic Web Conference (ISWC) (Posters & Demos), Karlsruhe, Germany, October, 2008.
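The slides name Google Distance without giving its formula. The sketch below uses the standard Normalized Google Distance definition (Cilibrasi and Vitányi), which may differ in detail from what the authors implemented. The hit counts are passed in as plain numbers, the function name `ngd` is hypothetical, and no search API is called.

```python
import math


def ngd(fx, fy, fxy, n_pages):
    """Normalized Google Distance from page counts.

    fx, fy  : number of pages containing term x / term y
    fxy     : number of pages containing both terms
    n_pages : total number of indexed pages (assumed larger than fx and fy)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # terms never co-occur: treat as maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n_pages) - min(log_fx, log_fy))


# Made-up counts, for illustration only.
always_together = ngd(1000, 1000, 1000, 10**6)   # terms always co-occur
rarely_together = ngd(1000, 1000, 10, 10**6)     # terms rarely co-occur
```

A smaller value means the two terms co-occur more often and are therefore treated as more closely related.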
Step 1: Extract distinct keywords from the compared columns ($C_1 \in O_1$, $C_2 \in O_2$)
Keywords extracted from columns = {Johnson, Rd., School, 15th, …}
Step 2: Group the distinct keywords from $C_1 \cup C_2$ together into semantic clusters
“Rd.”, “Dr.”, “St.”, “Pwy”, …
“Johnson”, “School”, “Dr.”, …
Step 3: Calculate similarity
$\mathrm{Similarity} = H(C \mid T) / H(C)$
[Figure legend: Column 1, Column 2]
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst

Road          County
Custer Pwy    Collin
15th St.      Collin
Parker Rd.    Collin
K-medoid + NGD instance similarity
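Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `toy_distance` is a hypothetical stand-in for NGD (a real run would need search hit counts), the medoids are initialized naively from the first k keywords, and the clusters play the role of the types T in H(C|T) / H(C).

```python
import math
from collections import Counter

SUFFIXES = {"Rd.", "Dr.", "St.", "Pwy"}  # used only by the toy distance below


def toy_distance(a, b):
    """Hypothetical stand-in for NGD: road suffixes are close to each other."""
    if a == b:
        return 0.0
    return 0.1 if (a in SUFFIXES) == (b in SUFFIXES) else 0.9


def entropy(counts):
    """Shannon entropy (base 2) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def k_medoids(items, dist, k, max_iter=100):
    """Plain alternating k-medoids; assumes distinct items and dist(a, a) == 0."""
    medoids = list(items[:k])  # naive deterministic initialization (assumption)
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for item in items:  # assign every item to its nearest medoid
            nearest = min(medoids, key=lambda m: dist(item, m))
            clusters[nearest].append(item)
        new_medoids = [  # new medoid = member with the smallest total distance
            min(members, key=lambda c: sum(dist(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return list(clusters.values())


def cluster_similarity(col1, col2, dist, k):
    words1 = [w for v in col1 for w in v.split()]
    words2 = [w for v in col2 for w in v.split()]
    keywords = list(dict.fromkeys(words1 + words2))   # Step 1: distinct keywords
    clusters = k_medoids(keywords, dist, k)           # Step 2: semantic clusters
    cluster_of = {w: i for i, ms in enumerate(clusters) for w in ms}
    pairs = [(1, cluster_of[w]) for w in words1] + [(2, cluster_of[w]) for w in words2]
    h_c = entropy(Counter(c for c, _ in pairs))
    h_t = entropy(Counter(t for _, t in pairs))
    h_ct = entropy(Counter(pairs))
    return (h_ct - h_t) / h_c if h_c else 0.0         # Step 3: H(C|T) / H(C)


road_name = ["Johnson Rd.", "School Dr.", "Zeppelin St."]
road = ["Custer Pwy", "15th St.", "Parker Rd."]
similarity = cluster_similarity(road_name, road, toy_distance, k=2)
```

With the toy distance, the keywords split into a cluster of street names and a cluster of road suffixes. Both columns draw equally from each cluster, so the similarity comes out at its maximum.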
Problems with K-medoid + NGD*
Two different geographic entities in the same location (e.g., Dallas, TX and Dallas County) may have a very low computed NGD value and thus be mistaken for similar:
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas
*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Semantic Schema Matching Without Shared Instances,” to appear in Third IEEE International Conference on Semantic Computing, Berkeley, CA, USA - September 14-16, 2009.
Using geographic type information*
We use a gazetteer to determine the geographic type of an instance:
[Figure labels: O1, O2, Geotypes]
*Jeffrey Partyka, Latifur Khan, Bhavani Thuraisingham, “Geographically-Typed Semantic Schema Matching,” submitted to ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2009), Seattle, Washington, USA, November 2009.
Disambiguating Geographic Types For A Given Instance
We can use metadata and other information to reduce the number of type possibilities for a given instance:
Column “City”: Plano, Richardson, Dallas, …
Possible geographic types for “Dallas”: City, County
Selected type: Dallas → City
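A minimal sketch of this narrowing step, assuming the metadata in question is the column name. The `GAZETTEER` dictionary is a hypothetical toy stand-in for a real gazetteer, and `candidate_types` is an illustrative name, not a function from the authors' system.

```python
# Hypothetical toy gazetteer: instance name -> possible geographic types.
GAZETTEER = {
    "Plano": {"City"},
    "Richardson": {"City"},
    "Dallas": {"City", "County"},
    "Collin": {"County"},
}


def candidate_types(instance, column_name, gazetteer):
    """Return the geographic types of an instance, narrowed by column metadata."""
    candidates = gazetteer.get(instance, set())
    # Assumption: a type whose name matches the column name is preferred.
    hinted = {t for t in candidates if t.lower() == column_name.lower()}
    return hinted or candidates  # fall back to all candidates if no hint applies


dallas_in_city_column = candidate_types("Dallas", "City", GAZETTEER)
dallas_in_county_column = candidate_types("Dallas", "County", GAZETTEER)
```

The same instance name resolves to different types depending on the column it appears in, which is the behavior the slide illustrates for “Dallas”.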
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas
Geographic Types + NGD
With geographic types available, the geographic co-occurrence mistakes of NGD can now be corrected:
Disambiguation Using Lat/Long Values
• Each input consists of a name and coordinates (lat/long values).
• Our knowledge base consists of records for a number of different geospatial features, such as streets, lakes, and schools, for the entire US.
• Each entry in the knowledge base contains coordinates and other spatial information, such as the length and area of the landmark.
Disambiguation Using Lat/Long Values (contd.)
[Figure: Geo-Database]
Disambiguation Using Lat/Long Values (contd.)
• We first look for entries with a similar name in the knowledge base.
• Next, for each feature type in the knowledge base, we choose the entry located closest to the input.
• If two features are both close to the input, we disambiguate the feature type on the basis of geospatial properties such as area and perimeter.
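The three steps above can be sketched as follows. Several pieces are assumptions: name similarity is a case-insensitive substring test, "close" means within a hypothetical `tie_km` threshold, and ties are broken by preferring the larger area (the slides do not say how area and perimeter are actually used). The knowledge-base records are made-up values for illustration, not real data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def disambiguate(name, lat, lon, knowledge_base, tie_km=1.0):
    """Return the most plausible feature type for a named, geolocated input."""
    # Step 1: entries with a similar name (assumed: case-insensitive substring).
    matches = [e for e in knowledge_base if name.lower() in e["name"].lower()]
    # Step 2: for each feature type, keep the entry closest to the input.
    nearest = {}
    for e in matches:
        d = haversine_km(lat, lon, e["lat"], e["lon"])
        if e["type"] not in nearest or d < nearest[e["type"]][0]:
            nearest[e["type"]] = (d, e)
    if not nearest:
        return None
    # Step 3 (assumed rule): among near-ties, prefer the feature with the larger area.
    best = min(d for d, _ in nearest.values())
    close = [e for d, e in nearest.values() if d - best <= tie_km]
    return max(close, key=lambda e: e["area_km2"])["type"]


# Made-up records, for illustration only.
KB = [
    {"name": "Dallas", "type": "City", "lat": 32.78, "lon": -96.80, "area_km2": 900.0},
    {"name": "Dallas", "type": "City", "lat": 44.92, "lon": -123.32, "area_km2": 12.0},
    {"name": "Dallas County", "type": "County", "lat": 32.90, "lon": -96.80, "area_km2": 2300.0},
]

feature_type = disambiguate("Dallas", 32.78, -96.80, KB)
```

Widening `tie_km` makes more candidates count as close to the input, so the area-based tie-break decides more often.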
Attribute Weighting
• The default weighting scheme treats all 1-1 matches between properties/attributes as equally important:
roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson
Preston Rd.     Addison
Dallas Pkwy     Dallas

Road                County
Custer Pwy          Cooke
15th St.            Collin
Parker Rd.          Collin
Alma Dr.            Collin
Campbell Rd.        Denton
Harry Hines Blvd.   Dallas

Weight per 1-1 column match: 50%, 50%
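The default scheme amounts to a weighted average in which every 1-1 match carries the same weight. The sketch below is illustrative only: the function name `weighted_similarity` and the per-match scores are made up, and combining matches by weighted average is an assumption about how the table-level score is formed.

```python
def weighted_similarity(match_scores, weights=None):
    """Combine per-attribute match scores into one table-level similarity.

    match_scores : dict mapping a 1-1 match, e.g. ("roadName", "Road"), to its score
    weights      : optional dict with one weight per match; equal weights by default
    """
    if not match_scores:
        raise ValueError("at least one attribute match is required")
    if weights is None:
        # Default scheme from the slide: all 1-1 matches are equally important.
        weights = {m: 1.0 / len(match_scores) for m in match_scores}
    total = sum(weights[m] for m in match_scores)
    return sum(weights[m] * s for m, s in match_scores.items()) / total


# Made-up scores for the two column matches in the example tables.
scores = {("roadName", "Road"): 0.8, ("City", "County"): 0.4}
default_score = weighted_similarity(scores)  # 50% / 50%
custom_score = weighted_similarity(
    scores, {("roadName", "Road"): 0.75, ("City", "County"): 0.25}
)
```

With equal weights the two matches contribute equally. Shifting weight toward the road-name match raises the combined score because that match has the higher individual score.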
Results of Geographic Matching Over 2 Separate Road Network Data Sources