unit 5 networks
TRANSCRIPT
-
8/13/2019 Unit 5 networks
1/67
Database System Concepts, 6 th Ed . Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
UNIT- 5
http://www.db-book.com/http://www.db-book.com/http://www.db-book.com/http://www.db-book.com/ -
8/13/2019 Unit 5 networks
2/67
Silberschatz, Korth and Sudarshan22.2Database System Concepts - 6 th Edition
Centralized Systems
Run on a single computer system and do not interact with other
computer systems.General-purpose computer system: one to a few CPUs and a numberof device controllers that are connected through a common bus thatprovides access to shared memory.
Single-user system (e.g., personal computer or workstation): desk-top
unit, single user, usually has only one CPU and one or two harddisks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and amulti-user OS. Serve a large number of users who are connected tothe system vie terminals. Often called server systems.
-
8/13/2019 Unit 5 networks
3/67
Silberschatz, Korth and Sudarshan22.3Database System Concepts - 6 th Edition
A Centralized Computer System
-
8/13/2019 Unit 5 networks
4/67
Silberschatz, Korth and Sudarshan22.4Database System Concepts - 6 th Edition
Client-Server Systems
Server systems satisfy requests generated at m client systems, whose
general structure is shown below:
-
8/13/2019 Unit 5 networks
5/67
Silberschatz, Korth and Sudarshan22.5Database System Concepts - 6 th Edition
Client-Server Systems (Cont.)
Database functionality can be divided into:
Back-end : manages access structures, query evaluation andoptimization, concurrency control and recovery.
Front-end : consists of tools such as forms , report-writers , andgraphical user interface facilities.
The interface between the front-end and the back-end is through SQL orthrough an application program interface.
-
8/13/2019 Unit 5 networks
6/67
Silberschatz, Korth and Sudarshan22.6Database System Concepts - 6 th Edition
Client-Server Systems (Cont.)
Advantages of replacing mainframes with networks of workstations or
personal computers connected to back-end server machines:better functionality for the cost
flexibility in locating resources and expanding facilities
better user interfaces
easier maintenance
-
8/13/2019 Unit 5 networks
7/67 Silberschatz, Korth and Sudarshan22.7Database System Concepts - 6 th Edition
Server System Architecture
Server systems can be broadly categorized into two kinds:
transaction servers which are widely used in relational databasesystems, and
data servers , used in object-oriented database systems
-
8/13/2019 Unit 5 networks
8/67 Silberschatz, Korth and Sudarshan22.8Database System Concepts - 6 th Edition
Transaction Servers
Also called query server systems or SQL server systems
Clients send requests to the serverTransactions are executed at the server
Results are shipped back to the client.
Requests are specified in SQL, and communicated to the serverthrough a remote procedure call (RPC) mechanism.
Transactional RPC allows many RPC calls to form a transaction.
Open Database Connectivity (ODBC) is a C language applicationprogram interface standard from Microsoft for connecting to a server,sending SQL requests, and receiving results.
JDBC standard is similar to ODBC, for Java
-
8/13/2019 Unit 5 networks
9/67 Silberschatz, Korth and Sudarshan22.9Database System Concepts - 6 th Edition
Transaction Server Process Structure
A typical transaction server consists of multiple processes accessing
data in shared memory.Server processes
These receive user queries (transactions), execute them and sendresults back
Processes may be multithreaded , allowing a single process to
execute several user queries concurrentlyTypically multiple multithreaded server processes
Lock manager process
More on this later
Database writer process
Output modified buffer blocks to disks continually
-
8/13/2019 Unit 5 networks
10/67 Silberschatz, Korth and Sudarshan22.10Database System Concepts - 6 th Edition
Transaction Server Processes (Cont.)
Log writer process
Server processes simply add log records to log record bufferLog writer process outputs log records to stable storage.
Checkpoint process
Performs periodic checkpoints
Process monitor processMonitors other processes, and takes recovery actions if any ofthe other processes fail
E.g., aborting any transactions being executed by a serverprocess and restarting it
-
8/13/2019 Unit 5 networks
11/67 Silberschatz, Korth and Sudarshan22.11Database System Concepts - 6 th Edition
Transaction System Processes (Cont.)
-
8/13/2019 Unit 5 networks
12/67 Silberschatz, Korth and Sudarshan22.12Database System Concepts - 6 th Edition
Transaction System Processes (Cont.)
Shared memory contains shared data
Buffer poolLock tableLog bufferCached query plans (reused if same query submitted again)
All database processes can access shared memory
To ensure that no two processes are accessing the same data structureat the same time, databases systems implement mutual exclusion using either
Operating system semaphoresAtomic instructions such as test-and-set
To avoid overhead of interprocess communication for lockrequest/grant, each database process operates directly on the locktable
instead of sending requests to lock manager processLock manager process still used for deadlock detection
-
8/13/2019 Unit 5 networks
13/67 Silberschatz, Korth and Sudarshan22.13Database System Concepts - 6 th Edition
Data Servers
Used in high-speed LANs, in cases where
The clients are comparable in processing power to the serverThe tasks to be executed are compute intensive.
Data are shipped to clients where processing is performed, and thenshipped results back to the server.
This architecture requires full back-end functionality at the clients.
Used in many object-oriented database systems
Issues:
Page-Shipping versus Item-Shipping
Locking
Data CachingLock Caching
-
8/13/2019 Unit 5 networks
14/67 Silberschatz, Korth and Sudarshan22.14Database System Concepts - 6 th Edition
Data Servers (Cont.)
Page-shipping versus item-shipping
Smaller unit of shipping more messagesWorth prefetching related items along with requested itemPage shipping can be thought of as a form of prefetching
LockingOverhead of requesting and getting locks from server is high dueto message delaysCan grant locks on requested and prefetched items; with pageshipping, transaction is granted lock on whole page.Locks on a prefetched item can be P{called back} by the server,and returned by client transaction if the prefetched item has notbeen used.Locks on the page can be de escalated to locks on items in thepage when there are lock conflicts. Locks on unused items canthen be returned to server.
-
8/13/2019 Unit 5 networks
15/67 Silberschatz, Korth and Sudarshan22.15Database System Concepts - 6 th Edition
Data Servers (Cont.)
Data Caching
Data can be cached at client even in between transactionsBut check that data is up-to-date before it is used ( cachecoherency )
Check can be done when requesting lock on data item
Lock Caching
Locks can be retained by client system even in betweentransactions
Transactions can acquire cached locks locally, withoutcontacting server
Servercalls back
locks from clients when it receives conflictinglock request. Client returns lock once no local transaction isusing it.
Similar to deescalation, but across transactions.
-
8/13/2019 Unit 5 networks
16/67 Silberschatz, Korth and Sudarshan22.16Database System Concepts - 6 th Edition
Parallel Systems
Parallel database systems consist of multiple processors and multiple
disks connected by a fast interconnection network.A coarse-grain parallel machine consists of a small number ofpowerful processors
A massively parallel or fine grain parallel machine utilizesthousands of smaller processors.
Two main performance measures:throughput --- the number of tasks that can be completed in agiven time interval
response time --- the amount of time it takes to complete a singletask from the time it is submitted
-
8/13/2019 Unit 5 networks
17/67 Silberschatz, Korth and Sudarshan22.17Database System Concepts - 6 th Edition
Speed-Up and Scale-Up
Speedup : a fixed-sized problem executing on a small system is given
to a system which is N-times larger.Measured by:
speedup = small system elapsed time
large system elapsed time
Speedup is linear if equation equals N.
Scaleup : increase the size of both the problem and the system
N-times larger system used to perform N-times larger job
Measured by:
scaleup = small system small problem elapsed time
big system big problem elapsed timeScale up is linear if equation equals 1.
-
8/13/2019 Unit 5 networks
18/67 Silberschatz, Korth and Sudarshan22.18Database System Concepts - 6 th Edition
Speedup
-
8/13/2019 Unit 5 networks
19/67 Silberschatz, Korth and Sudarshan22.19Database System Concepts - 6 th Edition
Scaleup
-
8/13/2019 Unit 5 networks
20/67 Silberschatz, Korth and Sudarshan22.20Database System Concepts - 6 th Edition
Batch and Transaction Scaleup
Batch scaleup :
A single large job; typical of most decision support queries andscientific simulation.
Use an N-times larger computer on N-times larger problem.
Transaction scaleup :
Numerous small queries submitted by independent users to ashared database; typical transaction processing and timesharingsystems.
N-times as many users submitting requests (hence, N-times asmany requests) to an N-times larger database, on an N-timeslarger computer.
Well-suited to parallel execution.
-
8/13/2019 Unit 5 networks
21/67 Silberschatz, Korth and Sudarshan22.21Database System Concepts - 6 th Edition
Factors Limiting Speedup and Scaleup
Speedup and scaleup are often sublinear due to:
Startup costs : Cost of starting up multiple processes may dominatecomputation time, if the degree of parallelism is high.
Interference : Processes accessing shared resources (e.g., systembus, disks, or locks) compete with each other, thus spending timewaiting on other processes, rather than performing useful work.
Skew : Increasing the degree of parallelism increases the variance inservice times of parallely executing tasks. Overall execution timedetermined by slowest of parallely executing tasks.
-
8/13/2019 Unit 5 networks
22/67 Silberschatz, Korth and Sudarshan22.22Database System Concepts - 6 th Edition
Parallel Database Architectures
Shared memory -- processors share a common memory
Shared disk -- processors share a common diskShared nothing -- processors share neither a common memory norcommon disk
Hierarchical -- hybrid of the above architectures
-
8/13/2019 Unit 5 networks
23/67 Silberschatz, Korth and Sudarshan22.23Database System Concepts - 6 th Edition
Parallel Database Architectures
-
8/13/2019 Unit 5 networks
24/67
Silberschatz, Korth and Sudarshan22.24Database System Concepts - 6 th Edition
Shared Memory
Processors and disks have access to a common memory, typically via
a bus or through an interconnection network.Extremely efficient communication between processors data inshared memory can be accessed by any processor without having tomove it using software.
Downside architecture is not scalable beyond 32 or 64 processorssince the bus or the interconnection network becomes a bottleneck
Widely used for lower degrees of parallelism (4 to 8).
-
8/13/2019 Unit 5 networks
25/67
Silberschatz, Korth and Sudarshan22.25Database System Concepts - 6 th Edition
Shared Disk
All processors can directly access all disks via an interconnection
network, but the processors have private memories.The memory bus is not a bottleneck
Architecture provides a degree of fault-tolerance if aprocessor fails, the other processors can take over its taskssince the database is resident on disks that are accessible from
all processors.Examples: IBM Sysplex and DEC clusters (now part of Compaq)running Rdb (now Oracle Rdb) were early commercial users
Downside: bottleneck now occurs at interconnection to the disksubsystem.
Shared-disk systems can scale to a somewhat larger number ofprocessors, but communication between processors is slower.
-
8/13/2019 Unit 5 networks
26/67
Silberschatz, Korth and Sudarshan22.26Database System Concepts - 6 th Edition
Shared Nothing
Node consists of a processor, memory, and one or more disks.
Processors at one node communicate with another processor atanother node using an interconnection network. A node functions asthe server for the data on the disk or disks the node owns.
Examples: Teradata, Tandem, Oracle-n CUBE
Data accessed from local disks (and local memory accesses) do notpass through interconnection network, thereby minimizing theinterference of resource sharing.
Shared-nothing multiprocessors can be scaled up to thousands ofprocessors without interference.
Main drawback: cost of communication and non-local disk access;sending data involves software interaction at both ends.
-
8/13/2019 Unit 5 networks
27/67
Silberschatz, Korth and Sudarshan22.27Database System Concepts - 6 th Edition
Hierarchical
Combines characteristics of shared-memory, shared-disk, and shared-
nothing architectures.Top level is a shared-nothing architecture nodes connected by aninterconnection network, and do not share disks or memory with eachother.
Each node of the system could be a shared-memory system with afew processors.
Alternatively, each node could be a shared-disk system, and each ofthe systems sharing a set of disks could be a shared-memory system.
Reduce the complexity of programming such systems by distributedvirtual-memory architectures
Also called non-uniform memory architecture (NUMA)
-
8/13/2019 Unit 5 networks
28/67
Silberschatz, Korth and Sudarshan22.28Database System Concepts - 6 th Edition
Hybrid architecture
hybrid architecture includes:
Non-Uniform Memory Architecture (NUMA), which involves the Non-Uniform Memory Access .
Cluster (shared nothing + shared disk: SAN/NAS), which is formed bya group of connected computers.
Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing , where the memory access timedepends on the memory location relative to a processor. UnderNUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory sharedbetween processors).
NUMA architectures logically follow in scaling from symmetricmultiprocessing (SMP) architectures.
http://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Computer_storagehttp://en.wikipedia.org/wiki/Multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Symmetric_multiprocessinghttp://en.wikipedia.org/wiki/Multiprocessinghttp://en.wikipedia.org/wiki/Computer_storagehttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Accesshttp://en.wikipedia.org/wiki/Non-Uniform_Memory_Access -
8/13/2019 Unit 5 networks
29/67
Silberschatz, Korth and Sudarshan22.29Database System Concepts - 6 th Edition
XML
Using an example explain the distinction between attribute and a sub
element. Explain the purpose and use of namespaces
Give the DTD for an XML representation of the following nested-relational schema.
Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))
Children = (name, Birthday)Birthday = (day, month, year)Skills = (type, ExamsSet setof(Exams))Exams = (year, city)
Explain the limitations of DTD. Describe the alternative to overcomethis limitation.
-
8/13/2019 Unit 5 networks
30/67
Silberschatz, Korth and Sudarshan22.30Database System Concepts - 6 th Edition
Introduction
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)Derived from SGML (Standard Generalized Markup Language), butsimpler to use than SGML
Documents have tags giving extra information about sections of thedocument
E.g. XML Introduction
Extensible , unlike HTML
Users can add new tags, and separately specify how the tag should behandled for display
-
8/13/2019 Unit 5 networks
31/67
Silberschatz, Korth and Sudarshan22.31Database System Concepts - 6 th Edition
Comparison with Relational Data
Inefficient: tags, which in effect represent schema information, are
repeatedBetter than relational tuples as a data-exchange format
Unlike relational tuples, XML data is self-documenting due topresence of tags
Non-rigid format: tags can be added
Allows nested structuresWide acceptance, not only in database systems, but also inbrowsers, tools, and applications
-
8/13/2019 Unit 5 networks
32/67
Silberschatz, Korth and Sudarshan22.32Database System Concepts - 6 th Edition
Structure of XML Data
Tag : label for a section of data
Element : section of data beginning with < tagname > and ending withmatching
Elements must be properly nested
Proper nesting
.
Improper nesting
.
Formally: every start tag must have a unique matching end tag,that is in the context of the same parent element.
Every document must have a single top-level element
-
8/13/2019 Unit 5 networks
33/67
Silberschatz, Korth and Sudarshan22.33Database System Concepts - 6 th Edition
Structure of XML Data (Cont.)
Mixture of text with sub-elements is legal in XML.
Example:
This course is being offered for the first time in 2009. BIO-399 Computational Biology Biology
3 Useful for document markup, but discouraged for datarepresentation
-
8/13/2019 Unit 5 networks
34/67
Silberschatz, Korth and Sudarshan22.34Database System Concepts - 6 th Edition
Attributes
Elements can have attributes
Intro. to Computer Science Comp. Sci. 4
Attributes are specified by name=value pairs inside the starting tag of anelementAn element may have several attributes, but each attribute name canonly occur once
-
8/13/2019 Unit 5 networks
35/67
Silberschatz, Korth and Sudarshan22.35Database System Concepts - 6 th Edition
Attributes vs. Subelements
Distinction between subelement and attribute
In the context of documents, attributes are part of markup, whilesubelement contents are part of the basic document contents
In the context of data representation, the difference is unclear andmay be confusing
Same information can be represented in two ways
CS- 101
Suggestion: use attributes for identifiers of elements, and use
subelements for contents
-
8/13/2019 Unit 5 networks
36/67
Silberschatz, Korth and Sudarshan22.36Database System Concepts - 6 th Edition
Namespaces
XML data has to be exchanged between organizations
Same tag name may have different meaning in different organizations,causing confusion on exchanged documents
Specifying a unique string as an element name avoids confusion
Better solution: use unique-name:element-name
Avoid using long unique names all over document by using XMLNamespaces
CS-101 Intro. to Computer Science Comp. Sci. 4
http://www.yale.edu/http://www.yale.edu/ -
8/13/2019 Unit 5 networks
37/67
Silberschatz, Korth and Sudarshan22.37Database System Concepts - 6 th Edition
XML Document Schema
Database schemas constrain what information can be stored, and the
data types of stored valuesXML documents are not required to have an associated schema
However, schemas are very important for XML data exchange
Otherwise, a site cannot automatically interpret data received fromanother site
Two mechanisms for specifying XML schemaDocument Type Definition (DTD)
Widely used
XML Schema
Newer, increasing use
-
8/13/2019 Unit 5 networks
38/67
Silberschatz, Korth and Sudarshan22.38Database System Concepts - 6 th Edition
Document Type Definition (DTD)
The type of an XML document can be specified using a DTD
DTD constraints structure of XML dataWhat elements can occur
What attributes can/must an element have
What subelements can/must occur inside each element, and howmany times.
DTD does not constrain data types
All values represented as strings in XML
DTD syntax
-
8/13/2019 Unit 5 networks
39/67
Silberschatz, Korth and Sudarshan22.39Database System Concepts - 6 th Edition
Element Specification in DTD
Subelements can be specified as
names of elements, or#PCDATA (parsed character data), i.e., character strings
EMPTY (no subelements) or ANY (anything can be a subelement)
Example
Subelement specification may have regular expressions
Notation: | - alternatives +- 1 or more occurrences *- 0 or more occurrences
-
8/13/2019 Unit 5 networks
40/67
Silberschatz, Korth and Sudarshan22.40Database System Concepts - 6 th Edition
University DTD
]>
-
8/13/2019 Unit 5 networks
41/67
Silberschatz, Korth and Sudarshan22.41Database System Concepts - 6 th Edition
Attribute Specification in DTD
Attribute specification : for each attribute
NameType of attribute
CDATAID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
more on this later
Whethermandatory (#REQUIRED)has a default value (value),or neither (#IMPLIED)
Examples, or
course_id ID #REQUIREDdept_name IDREF #REQUIREDinstructors IDREFS #IMPLIED >
-
8/13/2019 Unit 5 networks
42/67
Silberschatz, Korth and Sudarshan22.42Database System Concepts - 6 th Edition
IDs and IDREFs
An element can have at most one attribute of type ID
The ID attribute value of each element in an XML document must bedistinct
Thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element inthe same document
An attribute of type IDREFS contains a set of (0 or more) ID values.Each ID value must contain the ID value of an element in the samedocument
h b
-
8/13/2019 Unit 5 networks
43/67
Silberschatz, Korth and Sudarshan22.43Database System Concepts - 6 th Edition
University DTD with Attributes
University DTD with ID and IDREF attribute types.
dept_name ID #REQUIRED >
IID ID #REQUIREDdept_name IDREF #REQUIRED > declarations for title, credits, building,
budget, name and salary ]>
Li i i f DTD
-
8/13/2019 Unit 5 networks
44/67
Silberschatz, Korth and Sudarshan22.44Database System Concepts - 6 th Edition
Limitations of DTDs
No typing of text elements and attributes
All values are strings, no integers, reals, etc.Difficult to specify unordered sets of subelements
Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)
(A | B)* allows specification of an unordered set, but
Cannot ensure that each of A and B occurs only once
IDs and IDREFs are untyped
The instructors attribute of an course may contain a reference toanother course, which is meaningless
instructors attribute should ideally be constrained to refer toinstructor elements
XML S h
-
8/13/2019 Unit 5 networks
45/67
Silberschatz, Korth and Sudarshan22.45Database System Concepts - 6 th Edition
XML Schema
XML Schema is a more sophisticated schema language whichaddresses the drawbacks of DTDs . Supports
Typing of values
E.g. integer, string, etc
Also, constraints on min/max values
User-defined, comlex types
Many more features, including
uniqueness and foreign key constraints, inheritance
XML Schema is itself specified in XML syntax, unlike DTDs
More-standard representation, but verbose
XML Scheme is integrated with namespacesBUT: XML Schema is significantly more complicated than DTDs.
D i i S S
-
8/13/2019 Unit 5 networks
46/67
Silberschatz, Korth and Sudarshan22.46Database System Concepts - 6 th Edition
Decision Support Systems
Decision-support systems are used to make business decisions,often based on data collected by on-line transaction-processingsystems.
Examples of business decisions:
What items to stock?
What insurance premium to change?To whom to send advertisements?
Examples of data used for making decisions
Retail sales transaction details
Customer profiles (income, age, gender, etc.)
D i i S t S t O i
-
8/13/2019 Unit 5 networks
47/67
Silberschatz, Korth and Sudarshan22.47Database System Concepts - 6 th Edition
Decision-Support Systems: OverviewData analysis tasks are simplified by specialized tools and SQLextensions
Example tasksFor each product category and each region, what were the totalsales in the last quarter and how do they compare with thesame quarter last yearAs above, for each product category and each customercategory
Statistical analysis packages (e.g., : S++) can be interfaced withdatabases
Statistical analysis is a large field, but not covered hereData mining seeks to discover knowledge automatically in the form ofstatistical rules and patterns from large databases.
A data warehouse archives information gathered from multiplesources, and stores it under a unified schema, at a single site.
Important for large businesses that generate data from multipledivisions, possibly at multiple sitesData may also be purchased externally
D t W h i
-
8/13/2019 Unit 5 networks
48/67
Silberschatz, Korth and Sudarshan22.48Database System Concepts - 6 th Edition
Data Warehousing
Data sources often store only current data, not historical data
Corporate decision making requires a unified view of all organizationaldata, including historical data
A data warehouse is a repository (archive) of information gatheredfrom multiple sources, stored under a unified schema, at a single site
Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transactionprocessing systems
D W h i
-
8/13/2019 Unit 5 networks
49/67
Silberschatz, Korth and Sudarshan22.49Database System Concepts - 6 th Edition
Data Warehousing
D i I
-
8/13/2019 Unit 5 networks
50/67
Silberschatz, Korth and Sudarshan22.50Database System Concepts - 6 th Edition
Design Issues
When and how to gather data
Source driven architecture : data sources transmit newinformation to warehouse, either continuously or periodically(e.g., at night)
Destination driven architecture : warehouse periodicallyrequests new information from data sources
Keeping warehouse exactly synchronized with data sources(e.g., using two-phase commit) is too expensive
Usually OK to have slightly out-of-date data at warehouse
Data/updates are periodically downloaded form onlinetransaction processing (OLTP) systems.
What schema to useSchema integration
-
8/13/2019 Unit 5 networks
51/67
Silberschatz, Korth and Sudarshan22.51Database System Concepts - 6 th Edition
More Warehouse Design Issues
Data cleansing
E.g., correct mistakes in addresses (misspellings, zip codeerrors)
Merge address lists from different sources and purge duplicates
How to propagate updates
Warehouse schema may be a (materialized) view of schema
from data sourcesWhat data to summarize
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by queryoptimizer to use aggregate values
-
8/13/2019 Unit 5 networks
52/67
Silberschatz, Korth and Sudarshan22.52Database System Concepts - 6 th Edition52
Why Data Mining?
The Explosive Growth of Data
Data collection and data availability
Automated data collection tools, database systems, Web, computerized
societyMajor sources of abundant data
Business: Web, e- commerce, transactions, stocks,
Science: Remote sensing, bioinformatics, scientific simulation,
Society and everyone: news, digital cameras,
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention Data mining Automated analysis ofmassive data sets
-
8/13/2019 Unit 5 networks
53/67
Silberschatz, Korth and Sudarshan22.53Database System Concepts - 6 th Edition53
Why Data Mining? Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), marketbasket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, qualitycontrol, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
-
8/13/2019 Unit 5 networks
54/67
Silberschatz, Korth and Sudarshan22.54Database System Concepts - 6 th Edition
Data Mining: A KDD Process
Data mining: the core ofknowledge discoveryprocess.
Data CleaningData Integration
Databases
Data Warehouse
Task-relevant Data Data SelectionData Preprocessing
Data Mining
Pattern Evaluation
-
8/13/2019 Unit 5 networks
55/67
Silberschatz, Korth and Sudarshan22.55Database System Concepts - 6 th Edition
Steps of a KDD Process
Learning the application domain:relevant prior knowledge and goals of application
Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation :Find useful features, dimensionality/variable reduction, invariantrepresentation.
Choosing functions of data mining summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)Data mining : search for patterns of interestPattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
-
8/13/2019 Unit 5 networks
56/67
Silberschatz, Korth and Sudarshan22.56Database System Concepts - 6 th Edition56
Data Mining Functionalities
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
-
8/13/2019 Unit 5 networks
57/67
Silberschatz, Korth and Sudarshan22.57Database System Concepts - 6 th Edition57
Data Mining Functionalities
Multidimensional concept description : Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.wet regions
association analysis
Diaper Beer [0.5%, 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes orconcepts for future prediction
E.g., classify countries based on (climate), or classify cars based on(gas mileage)
Predict some unknown or missing numerical values
-
8/13/2019 Unit 5 networks
58/67
Silberschatz, Korth and Sudarshan22.58Database System Concepts - 6 th Edition
58
Data Mining Functionalities (2)Cluster analysis
Class label is unknown: Group data to form new classes, e.g., clusterhouses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of thedata
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Periodicity analysisSimilarity-based analysis
Other pattern-directed or statistical analyses
-
8/13/2019 Unit 5 networks
59/67
Silberschatz, Korth and Sudarshan22.59Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 59
Data Cleaning
ImportanceData cleaning is one of the three biggest problemsin data warehousing Ralph KimballData cleaning is the number one problem in data
warehousing DCI surveyData cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
-
8/13/2019 Unit 5 networks
60/67
Silberschatz, Korth and Sudarshan22.60Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 60
Missing Data
Data is not always available
E.g., many tuples have no recorded value for severalattributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time ofentry
not register history or changes of the data
Missing data may need to be inferred.
-
8/13/2019 Unit 5 networks
61/67
Silberschatz, Korth and Sudarshan22.61Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 61
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (assuming
the tasks in classification not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant : e.g., unknown, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class:
smarter
the most probable value: inference-based such as Bayesian
formula or decision tree
-
8/13/2019 Unit 5 networks
62/67
Silberschatz, Korth and Sudarshan22.62Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 62
Noisy Data
Noise: random error or variance in a measured variableIncorrect attribute values may due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which requires data cleaning
duplicate records
incomplete data
inconsistent data
-
8/13/2019 Unit 5 networks
63/67
Silberschatz, Korth and Sudarshan22.63Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 63
How to Handle Noisy Data?
Binningfirst sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by binmedian, smooth by bin boundaries , etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliersCombined computer and human inspection
detect suspicious values and check by human (e.g.,deal with possible outliers)
Simple Discretization Methods: Binning
-
8/13/2019 Unit 5 networks
64/67
Silberschatz, Korth and Sudarshan22.64Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 64
Simple Discretization Methods: Binning
Equal-width (distance) partitioningDivides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B A)/ N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximatelysame number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
-
8/13/2019 Unit 5 networks
65/67
Silberschatz, Korth and Sudarshan22.65Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 65
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25-
-
8/13/2019 Unit 5 networks
66/67
Silberschatz, Korth and Sudarshan22.66Database System Concepts - 6 th EditionDecember 5, 2013 Data Mining: Concepts and 66
Regression
x
y
y = x + 1
X1
Y1
Y1
-
8/13/2019 Unit 5 networks
67/67
Cluster Analysis