Data Warehousing/Mining 1
Data Warehousing/Mining
Comp 150 DW Semistructured Data
Instructor: Dan Hebert
Semistructured Data
• Everything that has no rigid schema
  – Schema is contained within the data (self-describing), OR
  – No separate schema, OR
  – Schema exists but places only loose constraints on data
• Emerged as an important topic for a variety of reasons
  – Many data sources, like the WWW, which we would like to treat as databases but cannot for lack of a schema
  – Desirable to have an extremely flexible format for data exchange between disparate databases
  – May want to view structured data as semistructured data for the purpose of browsing
Motivation
• Some data really is unstructured/semistructured
  – World Wide Web
  – Data exchange formats
  – Some exotic database management systems, e.g., ACeDB, popular with biologists
• Data integration
• Browsing
Motivation - World Wide Web
• Why do we want to treat the Web as a database?
  – To maintain integrity
  – To query based on structure (as opposed to content)
  – To introduce some “organization”
• But the Web has no structure. The best we can say is that it is an enormous graph.
Motivation - Data Formats
• Much (probably most) of the world’s data is in data formats
• These are formats defined for the interchange and archiving of data
• Data formats vary in generality. ASN.1 and XDR are quite general
• Scientific data formats tend to be “fixed schemas”
• The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation
Motivation - Data Integration
• Goal is to integrate all types of information, including unstructured information
  – Irregular, missing information, structure not fully known, dynamic schema evolution, etc.
• Traditional data models and languages not well suited
  – Cannot accommodate heterogeneous data sets (different types and structures), etc.
  – Difficult to build software that will easily convert between two disparate models
• OEM (Object Exchange Model)
  – Semistructured data model from TSIMMIS project at Stanford
  – Internal data structure for exchange of data between DBMSs
  – Used by other systems: e.g., Windows 95 registry, Lotus Notes
Motivation - Browsing
• To query a database one needs to understand the schema.
• However, schemas have opaque terminology, and the user may want to start by querying the data with little or no knowledge of the schema.
  – Where in the database is the string “Casablanca” to be found?
  – Are there integers in the database greater than 2^16?
  – What objects in the database have an attribute name that starts with “act”?
• While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
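The three browsing questions can be answered without any schema by walking the data itself. The following is only a minimal sketch, assuming the database is held as nested Python dicts and lists; the helper names, the sample entries, and the `budget` figure are hypothetical, and the integer bound is assumed to be 2^16.

```python
from typing import Any, Iterator, List, Tuple

Path = Tuple[str, ...]


def walk(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, value) for every node of a nested dict/list structure."""
    yield path, node
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (str(index),))


def find_string(db: Any, text: str) -> List[Path]:
    """Where in the database is the given string to be found?"""
    return [p for p, v in walk(db) if v == text]


def integers_above(db: Any, bound: int) -> List[Tuple[Path, int]]:
    """Which integers in the database exceed the bound?"""
    return [(p, v) for p, v in walk(db)
            if isinstance(v, int) and not isinstance(v, bool) and v > bound]


def attributes_starting_with(db: Any, prefix: str) -> List[str]:
    """Which attribute names start with the given prefix?"""
    return sorted({p[-1] for p, _ in walk(db) if p and p[-1].startswith(prefix)})


# Hypothetical sample data; the budget value is a made-up number.
db = {"Entry": [
    {"Movie": {"Title": "Casablanca", "actors": ["Bogart", "Bergman"], "budget": 950000}},
    {"Movie": {"Title": "Play It Again, Sam", "actor": "Allen", "year": 1972}},
]}

print(find_string(db, "Casablanca"))
print(integers_above(db, 2 ** 16))
print(attributes_starting_with(db, "act"))
```

Each helper scans the whole structure, which is the point of the slide: the questions are about the data, not about a schema known in advance.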
The Model
• Represent data as some kind of graph-like or tree-like model
  – Cycles are allowed, but we usually refer to them as trees
  – Several different approaches with minor differences (easy to convert): data on labels or edges, nodes carry information or not
• Straightforward to encode relational and object-oriented databases
  – Issue: object identity
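As an illustration of how a relation might be encoded as a labeled tree, here is a minimal sketch. The layout (relation name, then one `tuple` node per row, then one edge per attribute) is one assumed encoding among several possible ones, and `encode_relation` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Sequence


def encode_relation(name: str,
                    columns: Sequence[str],
                    rows: Sequence[Sequence[Optional[Any]]]) -> Dict[str, List[Dict[str, Any]]]:
    """Encode a relation as a tree: name -> tuple nodes -> attribute edges.

    Missing (None) attributes are simply left out of the tree.
    """
    return {name: [{"tuple": {c: v for c, v in zip(columns, row) if v is not None}}
                   for row in rows]}


movies = encode_relation(
    "Movie", ["title", "director", "year"],
    [("Casablanca", "Curtiz", 1942), ("Untitled", None, 1950)],  # second row is made up
)

# Two rows with equal values become two distinct nodes in the tree.
dup = encode_relation("R", ["a"], [(1,), (1,)])
print(dup["R"][0] == dup["R"][1], dup["R"][0] is dup["R"][1])
```

The last line hints at the object identity issue: the relational model treats equal tuples as the same tuple, while the tree holds two separate nodes that merely compare equal.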
Querying Semistructured Data
• There are (at least) three approaches to this problem
  – Add arbitrary features to SQL or to your favorite query language
  – Find some principled approach to programs that are based on the type of the data
  – Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure
The “Extend SQL” Approach
• In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
• It is the approach taken in the design of UnQL and also of Lorel
• Looks very similar to OQL (path expressions)
Example
select Entry.Movie.Title
from DB
where Entry.Movie.Director...
Syntax Issues
• Need (path) variables to tie paths and edges together
• Paths of arbitrary length
  – “Find all strings in db”
  – “Find whether “Allen” acted in “Casablanca””
  – Need regular expressions to constrain paths
• Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
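One way to picture regular expressions that constrain paths is to spell each root-to-leaf path as a dotted string of labels and match it with an ordinary regex. This is a sketch under that assumption only; it is not the syntax of UnQL or Lorel, and the helper names and sample data are hypothetical.

```python
import re
from typing import Any, Iterator, List, Tuple


def label_paths(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted label path, atomic value) for every leaf of the structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from label_paths(child, f"{prefix}.{label}" if prefix else label)
    elif isinstance(node, list):
        for child in node:  # list members share the path of their parent
            yield from label_paths(child, prefix)
    else:
        yield prefix, node


def select(db: Any, path_regex: str) -> List[Any]:
    """Return the values of all leaves whose full label path matches the regex."""
    pattern = re.compile(path_regex)
    return [value for path, value in label_paths(db) if pattern.fullmatch(path)]


# Hypothetical sample data.
db = {"Entry": [
    {"Movie": {"Title": "Casablanca", "Cast": {"Actor": ["Bogart", "Bergman"]}}},
    {"TVShow": {"Title": "Cheers", "Actor": "Danson"}},
]}

all_strings = select(db, r".*")                    # "find all strings in db"
titles = select(db, r"Entry\.Movie\.Title")        # fixed-length path
actors = select(db, r"Entry(\.\w+)*\.Actor")       # path of arbitrary length
print(all_strings, titles, actors)
```

The third query reaches `Actor` at two different depths, which a fixed path expression such as `Entry.Movie.Actor` could not do.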
Underlying Computational Strategy
• Model the graph as a relational database and use a relational query language.
  – Database is one large relation (node-id, label, node-id)
  – Used by Stanford group in LORE/LOREL
• Complications
  – Labels are from a heterogeneous set of types, so more than one relation is needed
  – Additional relations if info is to be stored in nodes
  – Various navigation issues
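The edge-relation idea can be tried out with SQLite from the Python standard library. This is a minimal sketch, not the LORE implementation: the table names `edge` and `string_value`, the node numbering, and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One large relation (node-id, label, node-id) holding every edge of the graph.
conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
# An additional relation for information stored in nodes (strings only here).
conn.execute("CREATE TABLE string_value (node INTEGER PRIMARY KEY, value TEXT)")

conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Director", 4),
])
conn.executemany("INSERT INTO string_value VALUES (?, ?)", [
    (3, "Casablanca"), (4, "Curtiz"),
])


def movie_titles(connection: sqlite3.Connection) -> list:
    """Follow the path Entry.Movie.Title from root node 0, one self-join per step."""
    rows = connection.execute("""
        SELECT v.value
        FROM edge e1
        JOIN edge e2 ON e2.src = e1.dst
        JOIN edge e3 ON e3.src = e2.dst
        JOIN string_value v ON v.node = e3.dst
        WHERE e1.src = 0
          AND e1.label = 'Entry' AND e2.label = 'Movie' AND e3.label = 'Title'
    """).fetchall()
    return [row[0] for row in rows]


print(movie_titles(conn))
```

Every step of the path costs one self-join, so paths of arbitrary length cannot be written as a fixed join; that is one of the navigation issues the slide alludes to.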
Semistructured Data - Case Study: Object Exchange Model
OEM Features
• Common model for heterogeneous information exchange, self-describing
• Each object: <OID, Label, Type, Value>
  – OID: unique identifier or NULL
  – Label: character string descriptor
  – Type: atomic data type or set
  – Value: atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL
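The four components of an OEM object map naturally onto a small record type. The sketch below is an assumption-laden illustration: the class name `OEMObject`, the type names "string" and "set", and the use of a Python list to stand in for a set of object references are all choices made here, not part of the OEM definition.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Atomic = Union[str, int, float]


@dataclass
class OEMObject:
    oid: Optional[str]                        # unique identifier, or None for NULL
    label: str                                # character string descriptor
    type: str                                 # atomic type name, or "set"
    value: Union[Atomic, List["OEMObject"]]   # atomic value or object references

    def children(self, label: str) -> List["OEMObject"]:
        """Return the sub-objects with the given label (empty for atomic objects)."""
        if self.type != "set":
            return []
        return [child for child in self.value if child.label == label]


# A book object with a title and a structured author.
n = OEMObject("n", "name", "string", "Jeff Ullman")
p = OEMObject("p", "picture", "string", "/gifs/ullman.gif")
a = OEMObject("a", "author", "set", [n, p])
t = OEMObject("t", "title", "string", "Database and ...")
b1 = OEMObject("b1", "book", "set", [t, a])

print(b1.children("author")[0].children("name")[0].value)
```

Because every object carries its own label and type, a consumer can interpret `b1` without consulting a separate schema, which is what "self-describing" means here.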
Representing Semistructured Data Using OEM
<collection, {b1, a1, ...}>
  b1: <book, {t, a}>
    t: <title, “Database and ...”>
    a: <author, {n, p}>
      n: <name, “Jeff Ullman”>
      p: <picture, “/gifs/ullman.gif”>
  a1: <article, {v, w, x}>
    v: <author, “Gio Wiederhold”>
    w: <title, “Mediators in the …”>
    x: <journal, “IEEE Computer”>
  ...
[Figure callouts: Label, Set Value, Atomic Value, Memory Addresses]
An OEM Query Language: OEM-QL
• Logic-based language for OEM
  – Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
    P :-
      P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
    <answer_title T> :-
      <book {<author “Jeff Ullman”> <title T>}>
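To make "match object patterns, generate variable bindings" concrete, here is a small pattern matcher in Python. It is only a sketch of the idea, not OEM-QL itself: objects are simplified to (label, value) pairs, the book's author is flattened to an atomic string, the label `article` is used for the sample article object, and the names `Var`, `match`, `answer_titles`, and `articles_in` are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple


class Var:
    """A query variable such as T in <title T>."""
    def __init__(self, name: str) -> None:
        self.name = name


def match(pattern: Tuple[str, Any], obj: Tuple[str, Any],
          env: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every extension of env under which obj matches pattern."""
    plabel, pvalue = pattern
    label, value = obj
    if plabel != label:
        return
    if isinstance(pvalue, Var):
        # Bind the variable, or check consistency with an earlier binding.
        if env.get(pvalue.name, value) == value:
            yield {**env, pvalue.name: value}
    elif isinstance(pvalue, list):
        if isinstance(value, list):
            yield from match_all(pvalue, value, env)
    elif pvalue == value:
        yield env


def match_all(subpatterns: List[Any], members: List[Any],
              env: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Each subpattern must match some member of the set value."""
    if not subpatterns:
        yield env
        return
    for member in members:
        for extended in match(subpatterns[0], member, env):
            yield from match_all(subpatterns[1:], members, extended)


collection = [
    ("book", [("title", "Database and ..."), ("author", "Jeff Ullman")]),
    ("article", [("author", "Gio Wiederhold"), ("journal", "IEEE Computer")]),
]


def answer_titles(objects: List[Any], author: str) -> List[Any]:
    pattern = ("book", [("author", author), ("title", Var("T"))])
    return [env["T"] for obj in objects for env in match(pattern, obj, {})]


def articles_in(objects: List[Any], journal: str) -> List[Any]:
    pattern = ("article", [("journal", journal)])
    return [obj for obj in objects if any(True for _ in match(pattern, obj, {}))]


print(answer_titles(collection, "Jeff Ullman"))
print(articles_in(collection, "IEEE Computer"))
```

The pattern on the right of `:-` plays the role of the tuple passed to `match`; the head of the rule corresponds to what the two query functions build from the resulting bindings.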
Semistructured Data - Case Study: WWW Extraction
Problem
• Lots of valuable information on the Web
  – irregular structure
  – highly dynamic
• Embedded in HTML
• Limited query facilities
Data Extraction Tool
• Flexible, easy to use
• Accommodate virtually any HTML source
• Interface with existing system, e.g., data warehouse, user interface for querying
[Figure: architecture diagram with components World Wide Web, Extractor, Specification, Data Warehouse, WH Integrator, Query]
Approach
• Extract Web data into OEM format
  – Query using OEM-QL
• Python-based, configurable parser
• Declarative description of HTML source
  – location of data on page
  – how to package data into OEM
• “Regular expression”-like syntax
• Human intelligence rather than A.I.
Extractor Specification
Consists of commands of the form:
[ "variable(s)", "source", "pattern" ]
HTML Source File
<HTML><HEAD>...
<TABLE>
  <TR>
    <TH><I> header 1 </I></TH>
    <TH><I> header 2 </I></TH>
    <TH><I> header 3 </I></TH>
  </TR>
  <TR>
    <TD> text 1 </TD>
    <TD><A HREF=http://www.stuff/> text 2 </A></TD>
    <TD> text 3 </TD>
  </TR>
  . . .
</TABLE>
...</BODY></HTML>
Specification File
[
  ["root", "get('http://www.example.test/')", "#"],
  ["__tempvar1", "root", "*<table>#</table>*"],
  ["__tempvar2", "split(__tempvar1,'</tr>')", "#"],
  ["rows", "__tempvar2[1:-1]", "#"],
  ["header1,header2_url,header2,header3", "rows", "*<td>#</td>*<a*href=#>#</a>*<td>#</td>*"]
]
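The effect of the five commands can be traced by hand with ordinary Python string and regex operations. The sketch below is not the extractor itself: it replaces `get(...)` with an inline copy of the sample page, assumes that tag matching is case-insensitive, and translates each `#` and `*` into a regex by hand.

```python
import re
from typing import Dict, List

# Stand-in for get(URL): a cut-down copy of the sample HTML source.
root = """<HTML><BODY>
<TABLE>
 <TR> <TH><I> header 1 </I></TH> <TH><I> header 2 </I></TH> <TH><I> header 3 </I></TH> </TR>
 <TR> <TD> text 1 </TD> <TD><A HREF=http://www.stuff/> text 2 </A></TD> <TD> text 3 </TD> </TR>
</TABLE>
</BODY></HTML>"""


def extract_rows(html: str) -> List[Dict[str, str]]:
    flags = re.IGNORECASE | re.DOTALL
    # ["__tempvar1", "root", "*<table>#</table>*"]: keep what lies inside the table.
    tempvar1 = re.search(r"<table>(.*?)</table>", html, flags).group(1)
    # ["__tempvar2", "split(__tempvar1,'</tr>')", "#"]: one element per table row.
    tempvar2 = re.split(r"</tr>", tempvar1, flags=re.IGNORECASE)
    # ["rows", "__tempvar2[1:-1]", "#"]: drop the header row and the trailing remainder.
    rows = tempvar2[1:-1]
    # Final command: '#' becomes a capturing group, '*' a skipped stretch.
    row_re = re.compile(
        r"<td>(.*?)</td>.*?<a.*?href=(.*?)>(.*?)</a>.*?<td>(.*?)</td>", flags)
    labels = ("header1", "header2_url", "header2", "header3")
    result = []
    for row in rows:
        found = row_re.search(row)
        if found:
            result.append({k: v.strip() for k, v in zip(labels, found.groups())})
    return result


print(extract_rows(root))
```

Each dictionary in the output corresponds to one `rows` object, with one entry per variable named in the last command.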
Result OEM Object
<root complex {
  <rows complex {
    <header1 string “text 1”>
    <header2_url string “http://www.stuff”>
    <header2 string “text 2”>
    <header3 string “text 3”>
  }>
  <rows complex {
    ...
  }>
  ...
}>
Basic Syntax: Variable
• variable(l:p:t)
  – optional parameters for specification of corresponding OEM object
    l: label name
    t: type
    p: parent object
• _variable
  – temporary data structure, does not appear as OEM object
Basic Syntax: Source
• split(variable,token)
  – creates a list with multiple elements using token as the element separator
• get(URL)
  – obtain contents of HTML file at address URL
Basic Syntax: Patterns
• token1 # token2
  – match and store current input (between tokens)
• token1 * token2
  – match, don’t store current input (between tokens)
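The two wildcards can be read as a thin layer over regular expressions: `#` becomes a capturing group and `*` a non-capturing skip, with everything else taken literally. The following sketch assumes exactly that reading, plus case-insensitive, non-greedy matching over the whole input; the real extractor may behave differently, and `compile_pattern` and `apply_pattern` are hypothetical names.

```python
import re
from typing import List, Optional


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate an extractor pattern into a regular expression.

    '#' stores the input between the surrounding tokens, '*' skips it,
    and every other piece of the pattern is matched literally.
    """
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")
        elif piece == "*":
            parts.append(".*?")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def apply_pattern(pattern: str, text: str) -> Optional[List[str]]:
    """Return the stored pieces, or None if the pattern does not match."""
    found = compile_pattern(pattern).fullmatch(text)
    return [group.strip() for group in found.groups()] if found else None


row = "<TR> <TD> text 1 </TD> <TD><A HREF=http://www.stuff/> text 2 </A></TD> <TD> text 3 </TD> "
print(apply_pattern("*<td>#</td>*<a*href=#>#</a>*<td>#</td>*", row))
print(apply_pattern("*<table>#</table>*", "x<TABLE>abc</TABLE>y"))
```

A pattern consisting of a single `#` stores its entire input, which matches how `#` is used as a pass-through in commands whose source already does the work.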
Syntactic Sugar
Functions for extracting commonly used HTML constructs
– extract_table(variable),pattern
– split_table_row(variable)
– split_table_column(variable)
– extract_list(variable),pattern
– split_list(variables)
Advanced Features
• Customization of output
  – structure, label names, data type, ...
• Extraction across multiple HTML pages
• Graceful recovery from parse errors
  – resume parsing using next input from source
• Multiple patterns in single command
  – follow different parse tree depending on structure in source
Sample Extraction Scenario
. . .
Extracted OEM Data
root complex {
  temperature complex {
    city_temp complex {
      country string “Austria”
      city_url url http://www…
      city string “Vienna”
      weather_today string “snow”
      high_today string “-2”
      low_today string “-7”
      weather_tom string “snow”
      high_tomorrow string “-2”
      low_tomorrow string “-7”
    }
    city_temp complex {
      country string “Belgium”
      city_url url http://www…
      city string “Brussels”
      …
    }
    …
  }
}
OEM-QL query:
<city C {<high H> <low L>}> :-
  <temperature {<city_temp {<country “Germany”> <city C> <high_today H> <low_today L>}>}>
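The query selects the city records of one country and repackages three of their fields under new labels. A rough Python equivalent is sketched below, assuming the extracted `city_temp` objects have been loaded as dictionaries. Only the fields visible on the slide are included; the record for Germany is not shown there, so the usage example queries Austria instead, and `highs_and_lows` is a hypothetical helper.

```python
from typing import Dict, List

# city_temp objects, restricted to the fields visible on the slide.
city_temps: List[Dict[str, str]] = [
    {"country": "Austria", "city": "Vienna", "weather_today": "snow",
     "high_today": "-2", "low_today": "-7"},
    {"country": "Belgium", "city": "Brussels"},  # remaining fields are elided on the slide
]


def highs_and_lows(records: List[Dict[str, str]], country: str) -> List[Dict[str, str]]:
    """Mimic the rule: bind C, H, L in matching records and build new objects."""
    needed = ("city", "high_today", "low_today")
    answers = []
    for record in records:
        if record.get("country") != country:
            continue
        if not all(key in record for key in needed):
            continue  # a pattern with an unmatched sub-object yields no binding
        answers.append({"city": record["city"],
                        "high": record["high_today"],
                        "low": record["low_today"]})
    return answers


print(highs_and_lows(city_temps, "Austria"))
```

Records that lack one of the requested sub-objects simply produce no answer, which is how irregular data is tolerated without raising an error.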
Evaluation
• Better than
  – writing programs
  – YACC, PERL, etc.
  – A.I.
• Can do better
  – GUI tool to simplify the generation of extractor specification
  – Machine learning or data mining techniques to automatically infer structure...