why do we care about databases in a course about web?
Post on 20-Dec-2015
214 views
TRANSCRIPT
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Administrivia
• Midterm next class (in class)– Openbook/Open Notes/open course web (not general
web)– Everything done until Sep 28th (and covered in
homework 2)– Questions?
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Adapting old disciplines for Web-age• Information (text) retrieval
– Scale of the web– Hyper text/ Link structure– Authority/hub computations
• Databases– Multiple databases
• Heterogeneous, access limited, partially overlapping– Network (un)reliability
• Datamining [Machine Learning/Statistics/Databases]– Learning patterns from large scale data
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Why do we care about databases?
• Three reasons– Deep web is all databases…– We can do better with structured data…– Exposing databases on web changes their clientele..
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Deep Web is databases..• The crawlable web pages are just the tip of a huge ice berg that is
deep web– Many web sites have huge backend databases that generate pages dynamically
in response to queries• Airline fare databases; News paper classifieds etc.
– By some estimates, deep web is 2 orders of magnitude bigger than the shallow (“html page”) web
• We need to exploit deep web– Crawl/index deep web– Select databases relevant to a query– Provide information aggregation/integration services over deep web databases
• ..and all the big kids are trying to gobble up anyone who is even going through the motions of doing these..
• …which leads to several DB challenges not addressed in traditional DBs – Wrapper generation– Schema mapping– (automated) form filling– Query optimization
• Learning source profiles
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Web brings unwashed masses, unreliable medium as well as dirty data to databases..
• Web accessibility changes the user/data/medium profile significantly– from SQL gurus supporting financial data on dedicated DBMS to
“2.1 keyword query” instant gratification seekers working with dirty/inconsistent data over unreliable web.
• Challenges– How does one support keyword queries in databases?– How does one support imprecise queries in databases?– How do we handle incompleteness/inconsistency in databases?– Does it make sense to focus on total response time minimization
• As against a multi-objective cost/benefit optimization?
The DB community has embraced these challenges --see Lowell Report
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Databases offer lessons on exploiting structure
• We argued that structure (and semantics) help querying– If there is structure (as in databases) we can exploit it
• Databases is an existing technology for exploiting some forms of structure– SQL may not look like much, but it is more expressive than keyword queries!
– If not, we can extract structure and then exploit it• Challenges
– Techniques for extracting information (NLP-lite)– Languages for representing/handling “Semi-structured” data– Standards for supporting/exploiting semantic tagging
Before we play havoc with databases, let’s quickly review the traditional art of db management
so we know all that needs to change
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Concepts covered so far …
• Information Retrieval– Text retrieval– Hyper-linked text retrieval– Improvements…
• Information Mining– Clustering techniques to improve result presentation– Classification and filtering techniques
Slides adapted from Rao (ASU) & Franklin (Berkeley)
What Is a Database System?• Database:
a very large, integrated collection of data.• Models a real-world enterprise
– Entities (e.g., teams, games)– Relationships
(e.g., The Patriots are playing in The Superbowl)
– More recently, also includes active components , often called “business logic”. (e.g., the BCS ranking system)
• A Database Management System (DBMS) is a software system designed to store, manage, and facilitate access to databases.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Functionality of a DBMS• Data Dictionary Management• Storage management
– Data storage Definition Language (DDL)• High level query and data manipulation language
– SQL/XQuery etc.– May tell us what we are missing in text-based search
• Efficient query processing– May change in the internet scenario
• Transaction processing• Resiliency: recovery from crashes,• Different views of the data, security
– May be useful to model a collection of databases together• Interface with programming languages
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Database (relational)
Database Manager(DBMS) -Storage mgmt -Query processing -View management -(Transaction processing)
Query(SQL)
Answer(relation)
Traditional Database Architecture
Services
Webpages
Structureddata
Sensors(streamingData)
Services
Webpages
Structureddata
Sensors(streamingData)
ExecutorNeeds to handleSource/network
Interruptions,Runtime uncertainity,
replanning
Source Fusion/Query Planning
Needs to handle:Multiple objectives,Service composition,
Source quality & overlap
Source TrustOntologies;
Source/ServiceDescriptions
Answers
ProbingQueries
Monitor
ExecutorNeeds to handleSource/network
Interruptions,Runtime uncertainity,
replanning
Source Fusion/Query Planning
Needs to handle:Multiple objectives,Service composition,
Source quality & overlap
Source TrustOntologies;
Source/ServiceDescriptions
Answers
ProbingQueries
Monitor
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Building an Application with a Database System
• Requirements modeling (conceptual, pictures)– Decide what entities should be part of the application and
how they should be linked.• Schema design and implementation
– Decide on a set of tables, attributes.– Define the tables in the database system.– Populate database (insert tuples).
• Write application programs using the DBMS– Now much easier, with data management API
Slides adapted from Rao (ASU) & Franklin (Berkeley)
ssn
address name field
Professor
Advises
Takes
Teaches
CourseStudent
name category
quarter
name
Conceptual Modeling
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Data Models• A data model is a collection of concepts for
describing data.
• A schema is a description of a particular collection of data, using a given data model.
• The relational model of data is the most widely used model today.– Main concept: relation, basically a table with rows and
columns.– Every relation has a schema, which describes the columns, or
fields.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Levels of Abstraction
• Views describe how users see the data.
• Conceptual schema
defines logical structure
• Physical schema describes the files and indexes used.
Physical Schema
Conceptual Schema
View 1 View 2 View 3
DB
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Example: University Database
• Conceptual schema: – Students(sid: string, name: string,
login: string, age: integer, gpa:real)
– Courses(cid: string, cname:string, credits:integer)
• External Schema (View): – Course_info(cid:string,enrollment:int
eger)
• Physical schema:– Relations stored as unordered files. – Index on first column of Students.
Physical Schema
Conceptual Schema
View 1 View 2 View 3
DB
If five people are asked to come up with a schema for the data, what are the odds that they will come up with the same schema?
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Data Independence• Applications insulated from
how data is structured and stored.
• Logical data independence: Protection from changes in logical structure of data.
• Physical data independence: Protection from changes in physical structure of data.
• Q: Why are these particularly important for DBMS?
Physical Schema
Conceptual Schema
View 1 View 2 View 3
DB
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Schema Design & Implementation
• Table Students
• Separates the logical view from the physical view of the data.
Student Course Quarter
Charles CS 444 Fall, 1997
Dan CS 142 Winter,1998
… … …
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Terminology
tuples
Attribute namesStudents
(Arity=3)
Student Course Quarter
Charles CS 444 Fall, 1997
Dan CS 142 Winter,1998
… … …
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Querying a Database
• Find all the students taking CSE594 in Q1, 2004
• S(tructured) Q(uery) L(anguage)select E.namefrom Enroll Ewhere E.course=CS490i and E.quarter=“Winter, 2000”
• Query processor figures out how to answer the query efficiently.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Defining Views(Virtual) Views are relations, except that they are not physically stored.
They are used mostly in order to simplify complex queries andto define conceptually different views of the database to differentclasses of users.
View: purchases of telephony products:
CREATE VIEW telephony-purchases AS SELECT product, buyer, seller, store FROM Purchase, Product WHERE Purchase.product = Product.name AND Product.category = “telephony”
Slides adapted from Rao (ASU) & Franklin (Berkeley)
A Different ViewCREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer
We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes”
What’s really happening when we query a view??
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Updating Views
How can I insert a tuple into a table that doesn’t exist?
CREATE VIEW bon-purchase AS SELECT store, seller, product FROM Purchase WHERE store = “The Bon Marche”
If we make the following insertion:
INSERT INTO bon-purchase VALUES (“the Bon Marche”, Joe, “Denby Mug”)
We can simply add a tuple (“the Bon Marche”, Joe, NULL, “Denby Mug”)to relation Purchase.
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Non-Updatable Views
CREATE VIEW Seattle-view AS SELECT seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer
How can we add the following tuple to the view?
(Joe, “Shoe Model 12345”, “Nine West”)
Given Purchase (buyer, seller, store, product) Person( name, phone-num, city)
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Materialized Views• Views whose corresponding queries have been executed
and the data is stored in a separate database– Uses: Caching
• Issues– Using views in answering queries
• Normally, the views are available in addition to database– (so, views are local caches)
• In information integration, views may be the only things we have access to. – An internet source that specializes in woody allen movies can be seen as a view
on a database of all movies. Except, there is no database out there which contains all movies..
– Maintaining consistency of materialized views
Slides adapted from Rao (ASU) & Franklin (Berkeley)
Query Optimization
Imperative query execution plan:Declarative SQL query
Ideally: Want to find best plan. Practically: Avoid worst plans!
Goal:
(Simple Nested Loops)
Purchase Person
Buyer=name
City=‘seattle’ phone>’5430000’
buyer
(Table scan) (Index scan)
SELECT S.buyerFROM Purchase P, Person QWHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘5430000’
Inputs:• the query• statistics about the data (indexes, cardinalities, selectivity factors)• available memory