why do we care about databases in a course about web?

27
Why do we care about databases in a course about Web?

Post on 20-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Why do we care about databases in a course about Web?

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Administrivia

• Midterm next class (in class)– Openbook/Open Notes/open course web (not general

web)– Everything done until Sep 28th (and covered in

homework 2)– Questions?

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Adapting old disciplines for Web-age• Information (text) retrieval

– Scale of the web– Hyper text/ Link structure– Authority/hub computations

• Databases– Multiple databases

• Heterogeneous, access limited, partially overlapping– Network (un)reliability

• Datamining [Machine Learning/Statistics/Databases]– Learning patterns from large scale data

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Why do we care about databases?

• Three reasons– Deep web is all databases…– We can do better with structured data…– Exposing databases on web changes their clientele..

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Deep Web is databases..• The crawlable web pages are just the tip of a huge ice berg that is

deep web– Many web sites have huge backend databases that generate pages dynamically

in response to queries• Airline fare databases; News paper classifieds etc.

– By some estimates, deep web is 2 orders of magnitude bigger than the shallow (“html page”) web

• We need to exploit deep web– Crawl/index deep web– Select databases relevant to a query– Provide information aggregation/integration services over deep web databases

• ..and all the big kids are trying to gobble up anyone who is even going through the motions of doing these..

• …which leads to several DB challenges not addressed in traditional DBs – Wrapper generation– Schema mapping– (automated) form filling– Query optimization

• Learning source profiles

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Web brings unwashed masses, unreliable medium as well as dirty data to databases..

• Web accessibility changes the user/data/medium profile significantly– from SQL gurus supporting financial data on dedicated DBMS to

“2.1 keyword query” instant gratification seekers working with dirty/inconsistent data over unreliable web.

• Challenges– How does one support keyword queries in databases?– How does one support imprecise queries in databases?– How do we handle incompleteness/inconsistency in databases?– Does it make sense to focus on total response time minimization

• As against a multi-objective cost/benefit optimization?

The DB community has embraced these challenges --see Lowell Report

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Databases offer lessons on exploiting structure

• We argued that structure (and semantics) help querying– If there is structure (as in databases) we can exploit it

• Databases is an existing technology for exploiting some forms of structure– SQL may not look like much, but it is more expressive than keyword queries!

– If not, we can extract structure and then exploit it• Challenges

– Techniques for extracting information (NLP-lite)– Languages for representing/handling “Semi-structured” data– Standards for supporting/exploiting semantic tagging

Before we play havoc with databases, let’s quickly review the traditional art of db management

so we know all that needs to change

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Concepts covered so far …

• Information Retrieval– Text retrieval– Hyper-linked text retrieval– Improvements…

• Information Mining– Clustering techniques to improve result presentation– Classification and filtering techniques

Slides adapted from Rao (ASU) & Franklin (Berkeley)

What Is a Database System?• Database:

a very large, integrated collection of data.• Models a real-world enterprise

– Entities (e.g., teams, games)– Relationships

(e.g., The Patriots are playing in The Superbowl)

– More recently, also includes active components , often called “business logic”. (e.g., the BCS ranking system)

• A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Functionality of a DBMS• Data Dictionary Management• Storage management

– Data storage Definition Language (DDL)• High level query and data manipulation language

– SQL/XQuery etc.– May tell us what we are missing in text-based search

• Efficient query processing– May change in the internet scenario

• Transaction processing• Resiliency: recovery from crashes,• Different views of the data, security

– May be useful to model a collection of databases together• Interface with programming languages

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Database (relational)

Database Manager(DBMS) -Storage mgmt -Query processing -View management -(Transaction processing)

Query(SQL)

Answer(relation)

Traditional Database Architecture

Services

Webpages

Structureddata

Sensors(streamingData)

Services

Webpages

Structureddata

Sensors(streamingData)

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Answers

ProbingQueries

Monitor

ExecutorNeeds to handleSource/network

Interruptions,Runtime uncertainity,

replanning

Source Fusion/Query Planning

Needs to handle:Multiple objectives,Service composition,

Source quality & overlap

Source TrustOntologies;

Source/ServiceDescriptions

Answers

ProbingQueries

Monitor

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Building an Application with a Database System

• Requirements modeling (conceptual, pictures)– Decide what entities should be part of the application and

how they should be linked.• Schema design and implementation

– Decide on a set of tables, attributes.– Define the tables in the database system.– Populate database (insert tuples).

• Write application programs using the DBMS– Now much easier, with data management API

Slides adapted from Rao (ASU) & Franklin (Berkeley)

ssn

address name field

Professor

Advises

Takes

Teaches

CourseStudent

name category

quarter

name

Conceptual Modeling

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Data Models• A data model is a collection of concepts for

describing data.

• A schema is a description of a particular collection of data, using a given data model.

• The relational model of data is the most widely used model today.– Main concept: relation, basically a table with rows and

columns.– Every relation has a schema, which describes the columns, or

fields.

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Levels of Abstraction

• Views describe how users see the data.

• Conceptual schema

defines logical structure

• Physical schema describes the files and indexes used.

Physical Schema

Conceptual Schema

View 1 View 2 View 3

DB

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Example: University Database

• Conceptual schema: – Students(sid: string, name: string,

login: string, age: integer, gpa:real)

– Courses(cid: string, cname:string, credits:integer)

• External Schema (View): – Course_info(cid:string,enrollment:int

eger)

• Physical schema:– Relations stored as unordered files. – Index on first column of Students.

Physical Schema

Conceptual Schema

View 1 View 2 View 3

DB

If five people are asked to come up with a schema for the data, what are the odds that they will come up with the same schema?

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Data Independence• Applications insulated from

how data is structured and stored.

• Logical data independence: Protection from changes in logical structure of data.

• Physical data independence: Protection from changes in physical structure of data.

• Q: Why are these particularly important for DBMS?

Physical Schema

Conceptual Schema

View 1 View 2 View 3

DB

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Schema Design & Implementation

• Table Students

• Separates the logical view from the physical view of the data.

Student Course Quarter

Charles CS 444 Fall, 1997

Dan CS 142 Winter,1998

… … …

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Terminology

tuples

Attribute namesStudents

(Arity=3)

Student Course Quarter

Charles CS 444 Fall, 1997

Dan CS 142 Winter,1998

… … …

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Querying a Database

• Find all the students taking CSE594 in Q1, 2004

• S(tructured) Q(uery) L(anguage)select E.namefrom Enroll Ewhere E.course=CS490i and E.quarter=“Winter, 2000”

• Query processor figures out how to answer the query efficiently.

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Defining Views(Virtual) Views are relations, except that they are not physically stored.

They are used mostly in order to simplify complex queries andto define conceptually different views of the database to differentclasses of users.

View: purchases of telephony products:

CREATE VIEW telephony-purchases AS SELECT product, buyer, seller, store FROM Purchase, Product WHERE Purchase.product = Product.name AND Product.category = “telephony”

Slides adapted from Rao (ASU) & Franklin (Berkeley)

A Different ViewCREATE VIEW Seattle-view AS

SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer

We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes”

What’s really happening when we query a view??

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Updating Views

How can I insert a tuple into a table that doesn’t exist?

CREATE VIEW bon-purchase AS SELECT store, seller, product FROM Purchase WHERE store = “The Bon Marche”

If we make the following insertion:

INSERT INTO bon-purchase VALUES (“the Bon Marche”, Joe, “Denby Mug”)

We can simply add a tuple (“the Bon Marche”, Joe, NULL, “Denby Mug”)to relation Purchase.

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Non-Updatable Views

CREATE VIEW Seattle-view AS SELECT seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer

How can we add the following tuple to the view?

(Joe, “Shoe Model 12345”, “Nine West”)

Given Purchase (buyer, seller, store, product) Person( name, phone-num, city)

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Materialized Views• Views whose corresponding queries have been executed

and the data is stored in a separate database– Uses: Caching

• Issues– Using views in answering queries

• Normally, the views are available in addition to database– (so, views are local caches)

• In information integration, views may be the only things we have access to. – An internet source that specializes in woody allen movies can be seen as a view

on a database of all movies. Except, there is no database out there which contains all movies..

– Maintaining consistency of materialized views

Slides adapted from Rao (ASU) & Franklin (Berkeley)

Query Optimization

Imperative query execution plan:Declarative SQL query

Ideally: Want to find best plan. Practically: Avoid worst plans!

Goal:

(Simple Nested Loops)

Purchase Person

Buyer=name

City=‘seattle’ phone>’5430000’

buyer

(Table scan) (Index scan)

SELECT S.buyerFROM Purchase P, Person QWHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’

Inputs:• the query• statistics about the data (indexes, cardinalities, selectivity factors)• available memory