

Page 1: Data Warehousing/Mining Comp 150 DW  Semistructured Data

Data Warehousing/Mining 1

Data Warehousing/Mining

Comp 150 DW Semistructured Data

Instructor: Dan Hebert

Page 2: Semistructured Data

• Everything that has no rigid schema
  – Schema is contained within the data (self-describing), OR
  – No separate schema, OR
  – Schema exists but places only loose constraints on data
• Emerged as an important topic for a variety of reasons
  – Many data sources, like the WWW, which we would like to treat as databases but cannot for lack of a schema
  – Desirable to have an extremely flexible format for data exchange between disparate databases
  – May want to view structured data as semistructured data for the purpose of browsing

Page 3: Motivation

• Some data really is unstructured/semistructured
  – World Wide Web
  – Data exchange formats
  – Some exotic database management systems, e.g., ACeDB, popular with biologists
• Data integration
• Browsing

Page 4: Motivation - World Wide Web

• Why do we want to treat the Web as a database?
  – To maintain integrity
  – To query based on structure (as opposed to content)
  – To introduce some “organization”
• But the Web has no structure. The best we can say is that it is an enormous graph.

Page 5: Motivation - Data Formats

• Much (probably most) of the world’s data is in data formats
• These are formats defined for the interchange and archiving of data
• Data formats vary in generality. ASN.1 and XDR are quite general
• Scientific data formats tend to be “fixed schemas”
• The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation

Page 6: Motivation - Data Integration

• Goal is to integrate all types of information, including unstructured information
  – Irregular, missing information, structure not fully known, dynamic schema evolution, etc.
• Traditional data models and languages not well suited
  – Cannot accommodate heterogeneous data sets (different types and structures), etc.
  – Difficult to build software that will easily convert between two disparate models
• OEM (Object Exchange Model)
  – Semistructured data model from TSIMMIS project at Stanford
  – Internal data structure for exchange of data between DBMSs
  – Used by other systems: e.g., Windows 95 registry, Lotus Notes

Page 7: Motivation - Browsing

• To query a database one needs to understand the schema.
• However, schemas have opaque terminology and the user may want to start by querying the data with little or no knowledge of the schema.
  – Where in the database is the string “Casablanca” to be found?
  – Are there integers in the database greater than 2^16?
  – What objects in the database have an attribute name that starts with “act”?
• While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
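The three browsing questions above can be sketched over schema-free data in Python. This is only an illustration, assuming the database is held as nested dicts and lists; the helper names (walk, find_string, integers_above, labels_with_prefix) and the sample records are hypothetical and do not come from the slides.

```python
from typing import Any, Iterator, Tuple

Path = Tuple[str, ...]


def walk(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, value) for every node of a nested dict/list structure."""
    yield path, node
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (f"[{index}]",))


# Made-up sample data: two entries with different, irregular structure.
db = {
    "entry": [
        {"movie": {"title": "Casablanca", "year": 1942,
                   "actors": ["Bogart", "Bergman"]}},
        {"tv_show": {"title": "Play It Again", "budget": 70000, "acts": 3}},
    ]
}


def find_string(data: Any, target: str) -> list:
    """Where in the database is the given string to be found?"""
    return [p for p, v in walk(data) if isinstance(v, str) and v == target]


def integers_above(data: Any, bound: int) -> list:
    """Are there integers in the database greater than the bound?"""
    return [(p, v) for p, v in walk(data)
            if isinstance(v, int) and not isinstance(v, bool) and v > bound]


def labels_with_prefix(data: Any, prefix: str) -> list:
    """Which attribute names start with the given prefix?"""
    return sorted({p[-1] for p, _ in walk(data)
                   if p and p[-1].startswith(prefix)})


print(find_string(db, "Casablanca"))
print(integers_above(db, 2 ** 16))
print(labels_with_prefix(db, "act"))
```

None of the three functions consults a schema: each one inspects labels and values as it meets them, which is the kind of query the slide argues relational languages handle poorly.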

Page 8: The Model

• Represent data as some kind of graph-like or tree-like model
  – Cycles are allowed, but we usually refer to them as trees
  – Several different approaches with minor differences (easy to convert)
    Data on labels or edges, nodes carry information or not
• Straightforward to encode relational and object-oriented databases
  – Issue: object identity
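As a rough illustration of how a relation can be encoded in such a model, the sketch below turns a table into an edge-labeled tree and back. The representation (a node is a pair of a label and either an atomic value or a list of child nodes) and the function names are assumptions made for this example, not the encoding used in the course.

```python
def encode_relation(name, columns, rows):
    """Encode a table as a tree: relation -> tuple -> (column, value) leaves."""
    tuples = [("tuple", [(col, val) for col, val in zip(columns, row)])
              for row in rows]
    return (name, tuples)


def decode_relation(tree):
    """Recover the relation name and its rows from the tree encoding."""
    name, tuples = tree
    return name, [tuple(val for _, val in fields) for _, fields in tuples]


columns = ["title", "year"]
rows = [("Casablanca", 1942), ("Casablanca", 1942)]  # two identical rows
tree = encode_relation("movies", columns, rows)

print(tree)
print(decode_relation(tree))

# The two tuple subtrees are equal by value, so nothing in the tree tells
# them apart. Distinct objects with equal contents would need an explicit
# identifier, which is the "object identity" issue noted on the slide.
print(tree[1][0] == tree[1][1])
```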

Page 9: Querying Semistructured Data

• There are (at least) three approaches to this problem
  – Add arbitrary features to SQL or to your favorite query language
  – Find some principled approach to programs that are based on the type of the data
  – Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of datalog on that structure

Page 10: The “Extend SQL” Approach

• In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
• It is the approach taken in the design of UnQL and also of Lorel
• Looks very similar to OQL (path expressions)

Page 11: Example

    select Entry.Movie.Title
    from DB
    where Entry.Movie.Director...

Page 12: Syntax Issues

• Need (path) variables to tie paths and edges together
• Paths of arbitrary length
  – “Find all strings in db”
  – “Find whether “Allen” acted in “Casablanca””
  – Need regular expressions to constrain paths
• Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
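A minimal sketch of regular path expressions, assuming the data is a nested dict/list structure and that a path is written as labels joined by dots. The select and acted_in helpers, the label names and the sample entries are invented for illustration; they are not UnQL or Lorel syntax.

```python
import re


def paths(node, prefix=()):
    """Yield (label_path, atomic_value) for every leaf of the structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from paths(child, prefix + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from paths(child, prefix)
    else:
        yield prefix, node


def select(data, path_regex):
    """Return the leaf values whose dotted label path matches the regex."""
    pattern = re.compile(path_regex)
    return [v for p, v in paths(data) if pattern.fullmatch(".".join(p))]


# Made-up sample data with irregular entries.
db = {"Entry": [
    {"Movie": {"Title": "Casablanca",
               "Cast": {"Actor": ["Bogart", "Bergman"]}}},
    {"Movie": {"Title": "Annie Hall", "Director": "Allen",
               "Cast": {"Actor": ["Allen", "Keaton"]}}},
]}


def acted_in(data, actor, title):
    """Tie two paths to the same entry, as a path variable would."""
    for entry in data["Entry"]:
        if (title in select(entry, r"Movie\.Title")
                and actor in select(entry, r"Movie\.(\w+\.)*Actor")):
            return True
    return False


print(select(db, r"Entry\.Movie\.Title"))   # fixed-length path
print(select(db, r".*"))                    # "find all strings in db"
print(acted_in(db, "Allen", "Casablanca"))
print(acted_in(db, "Bogart", "Casablanca"))
```

The loop over entries in acted_in plays the role of a path variable: both conditions are evaluated under the same Entry node rather than anywhere in the database.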

Page 13: Underlying Computational Strategy

• Model graph as a relational database and use relational query language.
  – Database is one large relation (node-id, label, node-id)
  – Used by Stanford group in LORE/LOREL
• Complications
  – Labels are from heterogeneous set of types, need more than one relation
  – Additional relations if info to be stored in nodes
  – Various navigation issues
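The edge-relation idea can be tried out with SQLite from Python. This is a sketch under assumptions: the table names (edge, string_value), the node ids and the sample rows are made up, and it is not a description of how LORE stores data. A fixed-length path becomes a chain of joins, while a path of arbitrary length needs a recursive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One large relation for the graph: (source node, edge label, target node).
conn.execute("CREATE TABLE edge (src INTEGER, label TEXT, dst INTEGER)")
# A second relation holding string values stored at leaf nodes.
conn.execute("CREATE TABLE string_value (node INTEGER, value TEXT)")

conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Director", 4),
    (0, "Entry", 5), (5, "Movie", 6), (6, "Title", 7),
])
conn.executemany("INSERT INTO string_value VALUES (?, ?)", [
    (3, "Casablanca"), (4, "Curtiz"), (7, "Annie Hall"),
])

# Fixed-length path Entry.Movie.Title from root node 0: one join per edge.
titles = [row[0] for row in conn.execute("""
    SELECT v.value
    FROM edge e1
    JOIN edge e2 ON e2.src = e1.dst
    JOIN edge e3 ON e3.src = e2.dst
    JOIN string_value v ON v.node = e3.dst
    WHERE e1.src = 0 AND e1.label = 'Entry'
      AND e2.label = 'Movie' AND e3.label = 'Title'
    ORDER BY v.value
""")]

# Arbitrary-length path ("find all strings in db"): recursive reachability.
# UNION (not UNION ALL) removes duplicates, so cycles do not loop forever.
all_strings = [row[0] for row in conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 0
        UNION
        SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT v.value
    FROM string_value v JOIN reach ON v.node = reach.node
    ORDER BY v.value
""")]

print(titles)
print(all_strings)
```

Storing strings in their own relation hints at the first complication on the slide: values of other types (integers, URLs) would each need a further relation or a tagged column.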

Page 14: Semistructured Data - Case Study: Object Exchange Model

Page 15: OEM Features

• Common model for heterogeneous information exchange, self-describing
• Each object: OID | Label | Type | Value
  – OID: unique identifier or NULL
  – Label: character string descriptor
  – Type: atomic data type or set
  – Value: atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL

Page 16: Representing Semistructured Data Using OEM

    <collection, {b1, a1, ...}>
      b1: <book, {t, a}>
        t: <title, "Database and ...">
        a: <author, {n, p}>
          n: <name, "Jeff Ullman">
          p: <picture, "/gifs/ullman.gif">
      a1: <article, {v, w, x}>
        v: <author, "Gio Wiederhold">
        w: <title, "Mediators in the …">
        x: <journal, "IEEE Computer">
      ...

[Figure callouts: Label, Set Value, Atomic Value, Memory Addresses]
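One way to picture the OID, label, type and value fields is as a small Python record plus a table of objects keyed by OID. The sketch below rebuilds the book/article example under assumptions: the OEMObject class, the add and children helpers, and the OID "root" for the collection are hypothetical, and the elided titles are kept exactly as they appear on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class OEMObject:
    oid: Optional[str]            # unique identifier, or None for NULL
    label: str                    # character string descriptor
    type: str                     # "set" or the name of an atomic type
    value: Union[str, List[str]]  # atomic value, or OIDs of sub-objects


store: Dict[str, OEMObject] = {}


def add(oid: str, label: str, value: Union[str, List[str]]) -> OEMObject:
    """Create an object, deriving its type field from the value."""
    type_name = "set" if isinstance(value, list) else type(value).__name__
    store[oid] = OEMObject(oid, label, type_name, value)
    return store[oid]


def children(obj: OEMObject) -> List[OEMObject]:
    """Follow the object references of a set-valued object."""
    return [store[oid] for oid in obj.value] if obj.type == "set" else []


add("n", "name", "Jeff Ullman")
add("p", "picture", "/gifs/ullman.gif")
add("t", "title", "Database and ...")
add("a", "author", ["n", "p"])
add("b1", "book", ["t", "a"])
add("v", "author", "Gio Wiederhold")
add("w", "title", "Mediators in the ...")
add("x", "journal", "IEEE Computer")
add("a1", "article", ["v", "w", "x"])
add("root", "collection", ["b1", "a1"])  # "root" is a made-up OID

for child in children(store["root"]):
    print(child.label, [c.label for c in children(child)])
```

The example also shows the irregularity the model tolerates: the book's author is a set with a name and a picture, while the article's author is a plain string.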

Page 17: An OEM Query Language: OEM-QL

• Logic-based language for OEM
  – Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”

    P :-
      P:<article {<journal "IEEE Computer">}>

• Get titles of books by “Jeff Ullman”

    <answer_title T> :-
      <book {<author "Jeff Ullman"> <title T>}>
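The match-and-bind behaviour of such rules can be imitated in a few lines of Python. This is a loose sketch, not OEM-QL itself: objects are simplified to (label, value) pairs, the Var class and the match, match_all and query functions are invented for the example, and the book's author is flattened to a string so that the second rule has something to match.

```python
class Var:
    """A pattern variable such as T in <title T>."""
    def __init__(self, name):
        self.name = name


def match(pattern, obj, bindings):
    """Yield each extension of bindings under which obj matches pattern."""
    p_label, p_value = pattern
    o_label, o_value = obj
    if p_label != o_label:
        return
    if isinstance(p_value, Var):
        if p_value.name not in bindings:
            yield {**bindings, p_value.name: o_value}
        elif bindings[p_value.name] == o_value:
            yield bindings
    elif isinstance(p_value, list):
        if isinstance(o_value, list):
            yield from match_all(p_value, o_value, bindings)
    elif p_value == o_value:
        yield bindings


def match_all(subpatterns, children, bindings):
    """Every subpattern must match some child; extra children are allowed."""
    if not subpatterns:
        yield bindings
        return
    for child in children:
        for extended in match(subpatterns[0], child, bindings):
            yield from match_all(subpatterns[1:], children, extended)


def query(head_label, var_name, body, objects):
    """Build one new (head_label, value) object per successful match."""
    return [(head_label, b[var_name])
            for obj in objects for b in match(body, obj, {})]


collection = [
    ("book", [("title", "Database and ..."), ("author", "Jeff Ullman")]),
    ("article", [("author", "Gio Wiederhold"),
                 ("title", "Mediators in the ..."),
                 ("journal", "IEEE Computer")]),
]

titles = query("answer_title", "T",
               ("book", [("author", "Jeff Ullman"), ("title", Var("T"))]),
               collection)
articles = [obj for obj in collection
            if any(True for _ in match(
                ("article", [("journal", "IEEE Computer")]), obj, {}))]

print(titles)
print(articles)
```

A set pattern here only lists the sub-objects it cares about, so objects with additional or missing fields elsewhere do not need a schema to be queried.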

Page 18: Semistructured Data - Case Study: WWW Extraction

Page 19: Problem

• Lots of valuable information on the Web
  – irregular structure
  – highly dynamic
• Embedded in HTML
• Limited query facilities

Page 20: Data Extraction Tool

• Flexible, easy to use
• Accommodate virtually any HTML source
• Interface with existing system, e.g., data warehouse, user interface for querying

[Figure: architecture diagram with boxes labeled World Wide Web, Extractor, Specification, WH Integrator, Data Warehouse, and Query]

Page 21: Approach

• Extract Web data into OEM format
  – Query using OEM-QL
• Python-based, configurable parser
• Declarative description of HTML source
  – location of data on page
  – how to package data into OEM
• “Regular expression”-like syntax
• Human intelligence rather than A.I.

Page 22: Extractor Specification

Consists of commands of the form:

    [ "variable(s)", "source", "pattern" ]

Page 23: HTML Source File

    <HTML><HEAD>
    ...
    <TABLE>
      <TR>
        <TH><I> header 1 </I></TH>
        <TH><I> header 2 </I></TH>
        <TH><I> header 3 </I></TH>
      </TR>
      <TR>
        <TD> text 1 </TD>
        <TD><A HREF=http://www.stuff/> text 2 </A></TD>
        <TD> text 3 </TD>
      </TR>
      . . .
    </TABLE>
    ...
    </BODY></HTML>

Page 24: Specification File

    [
      ["root", "get('http://www.example.test/')", "#"],
      ["__tempvar1", "root", "*<table>#</table>*"],
      ["__tempvar2", "split(__tempvar1,'</tr>')", "#"],
      ["rows", "__tempvar2[1:-1]", "#"],
      ["header1,header2_url,header2,header3", "rows", "*<td>#</td>*<a*href=#>#</a>*<td>#</td>*"]
    ]

Page 25: Result OEM Object

    <root complex {
      <rows complex {
        <header1 string "text 1">
        <header2_url string "http://www.stuff">
        <header2 string "text 2">
        <header3 string "text 3">
      }>
      <rows complex {
        ...
      }>
      ...
    }>

Page 26: Basic Syntax: Variable

• variable(l:p:t)
  – optional parameters for specification of corresponding OEM object
    l: label name
    p: parent object
    t: type
• _variable
  – temporary data structure, does not appear as OEM object

Page 27: Basic Syntax: Source

• split(variable,token)
  – creates a list with multiple elements using token as the element separator
• get(URL)
  – obtain contents of HTML file at address URL

Page 28: Basic Syntax: Patterns

• token1 # token2
  – match and store current input (between tokens)
• token1 * token2
  – match, don’t store current input (between tokens)
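The two wildcards can be approximated by translating a pattern into a Python regular expression, with "#" becoming a capturing group and "*" a non-capturing skip. This is a sketch of one plausible reading of the syntax, not the extractor's actual implementation; compile_pattern and extract are hypothetical names, case-insensitive matching is an assumption, and the sample HTML is a condensed version of the table shown earlier in the deck.

```python
import re


def compile_pattern(spec):
    """Translate '#' (match and store) and '*' (match, don't store)."""
    parts = []
    for ch in spec:
        if ch == "#":
            parts.append("(.*?)")
        elif ch == "*":
            parts.append(".*?")
        else:
            parts.append(re.escape(ch))
    # Assumption: tag case and line breaks should not matter.
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def extract(spec, text):
    """Return the stored pieces, or None if the pattern does not match."""
    m = compile_pattern(spec).fullmatch(text)
    return [g.strip() for g in m.groups()] if m else None


html = (
    "<HTML><BODY><TABLE>"
    "<TR><TH><I> header 1 </I></TH><TH><I> header 2 </I></TH>"
    "<TH><I> header 3 </I></TH></TR>"
    "<TR><TD> text 1 </TD><TD><A HREF=http://www.stuff/> text 2 </A></TD>"
    "<TD> text 3 </TD></TR>"
    "</TABLE></BODY></HTML>"
)

# Mirror the sample specification: isolate the table, split it into rows,
# drop the header row and the trailing empty piece, then apply the pattern.
table = extract("*<table>#</table>*", html)[0]
rows = re.split("</tr>", table, flags=re.IGNORECASE)[1:-1]
names = "header1,header2_url,header2,header3".split(",")
row_spec = "*<td>#</td>*<a*href=#>#</a>*<td>#</td>*"
records = [dict(zip(names, extract(row_spec, row))) for row in rows]

print(records)
```

Each dictionary in records corresponds to one rows object in the resulting OEM structure, with the variable names serving as labels.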

Page 29: Syntactic Sugar

• Functions for extracting commonly used HTML constructs
  – extract_table(variable),pattern
  – split_table_row(variable)
  – split_table_column(variable)
  – extract_list(variable),pattern
  – split_list(variables)

Page 30: Advanced Features

• Customization of output
  – structure, label names, data type, ...
• Extraction across multiple HTML pages
• Graceful recovery from parse errors
  – resume parsing using next input from source
• Multiple patterns in single command
  – follow different parse tree depending on structure in source
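The last two features, trying several patterns per command and resuming with the next input after a failure, can be sketched with ordinary Python regular expressions. This uses the standard re module rather than the extractor's own pattern syntax, and the patterns, field names and sample rows are invented for illustration.

```python
import re

# Alternative patterns, tried in order for every input row.
PATTERNS = [
    re.compile(r"<td>(?P<city>[^<]*)</td>\s*"
               r"<td>(?P<high>-?\d+)</td>\s*<td>(?P<low>-?\d+)</td>"),
    re.compile(r"<td>(?P<city>[^<]*)</td>\s*<td>n/a</td>"),
]


def parse_rows(rows):
    """Parse what can be parsed; set aside rows that match no pattern."""
    records, skipped = [], []
    for row in rows:
        for pattern in PATTERNS:
            m = pattern.search(row)
            if m:
                records.append({k: v.strip()
                                for k, v in m.groupdict().items()})
                break
        else:
            # No pattern matched: record the failure and resume with
            # the next input instead of aborting the whole extraction.
            skipped.append(row)
    return records, skipped


rows = [
    "<td>Vienna</td><td>-2</td><td>-7</td>",
    "<td>Graz</td><td>n/a</td>",          # made-up row with missing data
    "<p>advertisement</p>",               # made-up row with no data at all
]
records, skipped = parse_rows(rows)
print(records)
print(skipped)
```

Different rows end up with different sets of fields, which is acceptable because the target model places no rigid schema on the extracted objects.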

Page 31: Sample Extraction Scenario

. . .

Page 32: Extracted OEM Data

    root complex {
      temperature complex {
        city_temp complex {
          country string "Austria"
          city_url url http://www…
          city string "Vienna"
          weather_today string "snow"
          high_today string "-2"
          low_today string "-7"
          weather_tom string "snow"
          high_tomorrow string "-2"
          low_tomorrow string "-7"
        }
        city_temp complex {
          country string "Belgium"
          city_url url http://www…
          city string "Brussels"
          …
        }
        …
      }
    }

OEM-QL query:

    <city C {<high H> <low L>}> :-
      <temperature {<city_temp
        {<country "Germany"> <city C> <high_today H> <low_today L>}>}>
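For comparison, the effect of this query can be written as a Python filter over the extracted records. The sketch assumes each city_temp object has been loaded as a dictionary; the function name cities_in is hypothetical, and the German entry is a made-up placeholder because the slide elides everything after Belgium.

```python
city_temps = [
    {"country": "Austria", "city": "Vienna",
     "high_today": "-2", "low_today": "-7"},
    # Hypothetical record, not taken from the slide.
    {"country": "Germany", "city": "Berlin",
     "high_today": "1", "low_today": "-3"},
    # Incomplete record: no temperatures, so the query pattern cannot match.
    {"country": "Germany", "city": "Bonn"},
]


def cities_in(records, country):
    """Build one answer object per record that has every requested field."""
    wanted = ("city", "high_today", "low_today")
    answers = []
    for rec in records:
        if rec.get("country") != country:
            continue
        if not all(field in rec for field in wanted):
            continue  # a missing field means the pattern does not match
        answers.append({"city": rec["city"],
                        "high": rec["high_today"],
                        "low": rec["low_today"]})
    return answers


print(cities_in(city_temps, "Germany"))
```

As in the rule, the answer objects carry new labels (high, low) rather than the source labels (high_today, low_today).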

Page 33: Evaluation

• Better than
  – writing programs
  – YACC, PERL, etc.
  – A.I.
• Can do better
  – GUI tool to simplify the generation of extractor specification
  – Machine learning or data mining techniques to automatically infer structure
  ...