unit 5 networks

8/13/2019 Unit 5 networks


Database System Concepts, 6th Ed. Silberschatz, Korth and Sudarshan

See www.db-book.com for conditions on re-use

UNIT- 5

Centralized Systems

Run on a single computer system and do not interact with other computer systems.

General-purpose computer system: one to a few CPUs and a number of device controllers that are connected through a common bus that provides access to shared memory.

Single-user system (e.g., personal computer or workstation): desktop unit, single user, usually has only one CPU and one or two hard disks; the OS may support only one user.

Multi-user system: more disks, more memory, multiple CPUs, and a multi-user OS. Serves a large number of users who are connected to the system via terminals. Often called server systems.


A Centralized Computer System




Client-Server Systems

Server systems satisfy requests generated at m client systems, whose general structure is shown below:

[Figure: general structure of a client-server system]


Client-Server Systems (Cont.)

Database functionality can be divided into:

Back-end: manages access structures, query evaluation and optimization, concurrency control and recovery.

Front-end: consists of tools such as forms, report-writers, and graphical user interface facilities.

The interface between the front-end and the back-end is through SQL or through an application program interface.


Client-Server Systems (Cont.)

Advantages of replacing mainframes with networks of workstations or personal computers connected to back-end server machines:

better functionality for the cost

flexibility in locating resources and expanding facilities

better user interfaces

easier maintenance



Server System Architecture

Server systems can be broadly categorized into two kinds:

transaction servers, which are widely used in relational database systems, and

data servers, used in object-oriented database systems



Transaction Servers

Also called query server systems or SQL server systems

Clients send requests to the server

Transactions are executed at the server

Results are shipped back to the client.

Requests are specified in SQL, and communicated to the server through a remote procedure call (RPC) mechanism.

Transactional RPC allows many RPC calls to form a transaction.

Open Database Connectivity (ODBC) is a C language application program interface standard from Microsoft for connecting to a server, sending SQL requests, and receiving results.

The JDBC standard is similar to ODBC, for Java.
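
The cycle described above (connect, send SQL, receive results) looks much the same in most client APIs. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite is an embedded library rather than a transaction server, and the `account` table and its rows are made up for illustration. With ODBC or JDBC the call names differ, but the sequence is analogous.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Send one SQL request and return all result rows."""
    cur = conn.cursor()
    cur.execute(sql, params)
    rows = cur.fetchall()
    cur.close()
    return rows

# "Connect to the server" (here: an in-memory database, an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 300)])
conn.commit()

# Several requests grouped into one transaction:
# the connection context manager commits both updates or rolls both back.
with conn:
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")

balances = run_query(conn, "SELECT id, balance FROM account ORDER BY id")
print(balances)
```

Grouping the two updates in one transaction mirrors the idea of several calls forming a single transaction.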



Transaction Server Process Structure

A typical transaction server consists of multiple processes accessing data in shared memory.

Server processes

These receive user queries (transactions), execute them and send results back

Processes may be multithreaded, allowing a single process to execute several user queries concurrently

Typically multiple multithreaded server processes

Lock manager process

More on this later

Database writer process

Outputs modified buffer blocks to disks continually



Transaction Server Processes (Cont.)

Log writer process

Server processes simply add log records to the log record buffer

Log writer process outputs log records to stable storage.

Checkpoint process

Performs periodic checkpoints

Process monitor process

Monitors other processes, and takes recovery actions if any of the other processes fail

E.g., aborting any transactions being executed by a server process and restarting it






Transaction System Processes (Cont.)

Shared memory contains shared data

Buffer pool

Lock table

Log buffer

Cached query plans (reused if same query submitted again)

All database processes can access shared memory

To ensure that no two processes are accessing the same data structure at the same time, database systems implement mutual exclusion using either

Operating system semaphores

Atomic instructions such as test-and-set

To avoid the overhead of interprocess communication for lock request/grant, each database process operates directly on the lock table instead of sending requests to the lock manager process

Lock manager process still used for deadlock detection
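
A minimal sketch of the idea that every process updates the shared lock table directly, with mutual exclusion protecting the table itself. It uses Python threads and `threading.Lock` as a stand-in for OS semaphores or a test-and-set latch; the `LockTable` class and its method names are hypothetical, and only exclusive locks are modeled.

```python
import threading

class LockTable:
    """Toy shared lock table; a mutex protects the table data structure itself."""

    def __init__(self):
        self._mutex = threading.Lock()  # stands in for a semaphore / test-and-set latch
        self._owners = {}               # data item -> transaction holding an exclusive lock

    def try_lock(self, item, txn):
        with self._mutex:               # only one thread touches the table at a time
            if self._owners.get(item, txn) != txn:
                return False            # item is held by another transaction
            self._owners[item] = txn
            return True

    def unlock(self, item, txn):
        with self._mutex:
            if self._owners.get(item) == txn:
                del self._owners[item]

table = LockTable()
granted = []

def worker(txn):
    # Each "process" operates on the lock table directly, without a lock manager round trip.
    if table.try_lock("A", txn):
        granted.append(txn)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(granted))  # exactly one transaction obtains the lock on item "A"
```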



Data Servers

Used in high-speed LANs, in cases where

The clients are comparable in processing power to the server

The tasks to be executed are compute intensive.

Data are shipped to clients where processing is performed, and the results are then shipped back to the server.

This architecture requires full back-end functionality at the clients.

Used in many object-oriented database systems

Issues:

Page-Shipping versus Item-Shipping

Locking

Data Caching

Lock Caching



Data Servers (Cont.)

Page-shipping versus item-shipping

Smaller unit of shipping means more messages

Worth prefetching related items along with requested item

Page shipping can be thought of as a form of prefetching

Locking

Overhead of requesting and getting locks from server is high due to message delays

Can grant locks on requested and prefetched items; with page shipping, transaction is granted lock on whole page.

Locks on a prefetched item can be called back by the server, and returned by client transaction if the prefetched item has not been used.

Locks on the page can be de-escalated to locks on items in the page when there are lock conflicts. Locks on unused items can then be returned to server.



Data Servers (Cont.)

Data Caching

Data can be cached at client even in between transactions

But check that data is up-to-date before it is used (cache coherency)

Check can be done when requesting lock on data item

Lock Caching

Locks can be retained by client system even in between transactions

Transactions can acquire cached locks locally, without contacting server

Server calls back locks from clients when it receives a conflicting lock request. Client returns lock once no local transaction is using it.

Similar to de-escalation, but across transactions.



Parallel Systems

Parallel database systems consist of multiple processors and multiple disks connected by a fast interconnection network.

A coarse-grain parallel machine consists of a small number of powerful processors

A massively parallel or fine-grain parallel machine utilizes thousands of smaller processors.

Two main performance measures:

throughput: the number of tasks that can be completed in a given time interval

response time: the amount of time it takes to complete a single task from the time it is submitted



Speed-Up and Scale-Up

Speedup: a fixed-sized problem executing on a small system is given to a system which is N-times larger.

Measured by:

speedup = (small system elapsed time) / (large system elapsed time)

Speedup is linear if the ratio equals N.

Scaleup: increase the size of both the problem and the system

N-times larger system used to perform N-times larger job

Measured by:

scaleup = (small system small problem elapsed time) / (big system big problem elapsed time)

Scaleup is linear if the ratio equals 1.
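
The two ratios above can be checked with a few lines of Python. The elapsed times below are made-up numbers chosen only to illustrate the linear and sublinear cases for N = 4.

```python
def speedup(small_system_time, large_system_time):
    """Same problem, larger system: ratio of elapsed times."""
    return small_system_time / large_system_time

def scaleup(small_system_small_problem_time, big_system_big_problem_time):
    """N-times larger problem on an N-times larger system: ratio of elapsed times."""
    return small_system_small_problem_time / big_system_big_problem_time

N = 4
linear_speedup = speedup(100.0, 25.0)      # equals N, so speedup is linear
sublinear_speedup = speedup(100.0, 40.0)   # less than N, so speedup is sublinear
linear_scaleup = scaleup(100.0, 100.0)     # equals 1, so scaleup is linear
sublinear_scaleup = scaleup(100.0, 125.0)  # less than 1, so scaleup is sublinear

print(linear_speedup, sublinear_speedup, linear_scaleup, sublinear_scaleup)
```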



Speedup



Scaleup



Batch and Transaction Scaleup

Batch scaleup :

A single large job; typical of most decision support queries and scientific simulation.

Use an N-times larger computer on N-times larger problem.

Transaction scaleup:

Numerous small queries submitted by independent users to a shared database; typical of transaction processing and timesharing systems.

N-times as many users submitting requests (hence, N-times as many requests) to an N-times larger database, on an N-times larger computer.

Well-suited to parallel execution.



Factors Limiting Speedup and Scaleup

Speedup and scaleup are often sublinear due to:

Startup costs: Cost of starting up multiple processes may dominate computation time, if the degree of parallelism is high.

Interference: Processes accessing shared resources (e.g., system bus, disks, or locks) compete with each other, thus spending time waiting on other processes, rather than performing useful work.

Skew: Increasing the degree of parallelism increases the variance in service times of tasks executing in parallel. Overall execution time is determined by the slowest of the tasks executing in parallel.
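
The effect of skew can be illustrated with a small calculation. This sketch assumes an idealized model in which all tasks start at once and there are no startup or interference costs, so the elapsed time is simply the slowest task; the task times are invented for the example.

```python
def parallel_elapsed_time(task_times):
    """With all tasks running at once, elapsed time is set by the slowest task."""
    return max(task_times)

serial_time = 100                # one processor does all the work
balanced = [25, 25, 25, 25]      # work split evenly over 4 processors
skewed = [10, 10, 10, 70]        # same total work, unevenly split

balanced_speedup = serial_time / parallel_elapsed_time(balanced)
skewed_speedup = serial_time / parallel_elapsed_time(skewed)

print(balanced_speedup)  # 4.0: linear speedup
print(skewed_speedup)    # well below 4 because one task dominates
```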



Parallel Database Architectures

Shared memory -- processors share a common memory

Shared disk -- processors share a common disk

Shared nothing -- processors share neither a common memory nor common disk

Hierarchical -- hybrid of the above architectures





Shared Memory

Processors and disks have access to a common memory, typically via a bus or through an interconnection network.

Extremely efficient communication between processors: data in shared memory can be accessed by any processor without having to move it using software.

Downside: architecture is not scalable beyond 32 or 64 processors since the bus or the interconnection network becomes a bottleneck

Widely used for lower degrees of parallelism (4 to 8).


Shared Disk

All processors can directly access all disks via an interconnection network, but the processors have private memories.

The memory bus is not a bottleneck

Architecture provides a degree of fault-tolerance: if a processor fails, the other processors can take over its tasks since the database is resident on disks that are accessible from all processors.

Examples: IBM Sysplex and DEC clusters (now part of Compaq) running Rdb (now Oracle Rdb) were early commercial users

Downside: bottleneck now occurs at interconnection to the disk subsystem.

Shared-disk systems can scale to a somewhat larger number of processors, but communication between processors is slower.


Shared Nothing

Node consists of a processor, memory, and one or more disks.

Processors at one node communicate with another processor at another node using an interconnection network. A node functions as the server for the data on the disk or disks the node owns.

Examples: Teradata, Tandem, Oracle-nCUBE

Data accessed from local disks (and local memory accesses) do not pass through interconnection network, thereby minimizing the interference of resource sharing.

Shared-nothing multiprocessors can be scaled up to thousands of processors without interference.

Main drawback: cost of communication and non-local disk access; sending data involves software interaction at both ends.


Hierarchical

Combines characteristics of shared-memory, shared-disk, and shared-nothing architectures.

Top level is a shared-nothing architecture: nodes connected by an interconnection network, and do not share disks or memory with each other.

Each node of the system could be a shared-memory system with a few processors.

Alternatively, each node could be a shared-disk system, and each of the systems sharing a set of disks could be a shared-memory system.

Reduce the complexity of programming such systems by distributed virtual-memory architectures

Also called non-uniform memory architecture (NUMA)


Hybrid architecture

Hybrid architecture includes:

Non-Uniform Memory Architecture (NUMA), which involves non-uniform memory access.

Cluster (shared nothing + shared disk: SAN/NAS), which is formed by a group of connected computers.

Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures.


XML

Using an example, explain the distinction between an attribute and a subelement. Explain the purpose and use of namespaces.

Give the DTD for an XML representation of the following nested-relational schema.

Emp = (ename, ChildrenSet setof(Children), SkillsSet setof(Skills))

Children = (name, Birthday)

Birthday = (day, month, year)

Skills = (type, ExamsSet setof(Exams))

Exams = (year, city)

Explain the limitations of DTD. Describe the alternative to overcome this limitation.


Introduction

XML: Extensible Markup Language

Defined by the WWW Consortium (W3C)

Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML

Documents have tags giving extra information about sections of the document

E.g. <title> XML </title> <slide> Introduction ... </slide>

Extensible, unlike HTML

Users can add new tags, and separately specify how the tag should be handled for display


Comparison with Relational Data

Inefficient: tags, which in effect represent schema information, are repeated

Better than relational tuples as a data-exchange format

Unlike relational tuples, XML data is self-documenting due to presence of tags

Non-rigid format: tags can be added

Allows nested structures

Wide acceptance, not only in database systems, but also in browsers, tools, and applications


Structure of XML Data

Tag : label for a section of data

Element: section of data beginning with <tagname> and ending with matching </tagname>

Elements must be properly nested

Proper nesting

<course> ... <title> ... </title> </course>

Improper nesting

<course> ... <title> ... </course> </title>

Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.

Every document must have a single top-level element
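
The nesting and single-root rules above are what an XML parser checks for well-formedness. A small sketch using Python's standard `xml.etree.ElementTree` parser; the sample documents and the helper name `is_well_formed` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<course><title>Databases</title></course>"
improper = "<course><title>Databases</course></title>"  # end tags cross
two_roots = "<course/><course/>"                        # no single top-level element

for doc in (proper, improper, two_roots):
    print(is_well_formed(doc), doc)
```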




Structure of XML Data (Cont.)

Mixture of text with sub-elements is legal in XML.

Example:

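<course>
This course is being offered for the first time in 2009.
<course_id> BIO-399 </course_id>
<title> Computational Biology </title>
<dept_name> Biology </dept_name>
<credits> 3 </credits>
</course>

Useful for document markup, but discouraged for data representation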


Attributes

Elements can have attributes

<course course_id="CS-101">
<title> Intro. to Computer Science </title>
<dept_name> Comp. Sci. </dept_name>
<credits> 4 </credits>
</course>

Attributes are specified by name=value pairs inside the starting tag of an element

An element may have several attributes, but each attribute name can only occur once


Attributes vs. Subelements

Distinction between subelement and attribute

In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents

In the context of data representation, the difference is unclear and may be confusing

Same information can be represented in two ways

<course course_id="CS-101"> ... </course>

<course> <course_id> CS-101 </course_id> ... </course>

Suggestion: use attributes for identifiers of elements, and use subelements for contents
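
To make the two representations concrete, the sketch below parses the same course identifier once as an attribute and once as a subelement, using Python's standard `xml.etree.ElementTree`. The sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

# Identifier carried as an attribute of the element.
as_attribute = ET.fromstring(
    '<course course_id="CS-101"><title>Intro. to Computer Science</title></course>'
)

# Same identifier carried as a subelement.
as_subelement = ET.fromstring(
    "<course><course_id>CS-101</course_id>"
    "<title>Intro. to Computer Science</title></course>"
)

id_from_attribute = as_attribute.get("course_id")        # attribute lookup
id_from_subelement = as_subelement.findtext("course_id")  # text of a child element

print(id_from_attribute, id_from_subelement)
```

Both lookups yield the same value; what differs is where an application has to look for it, which is why a consistent convention helps.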




Namespaces

XML data has to be exchanged between organizations

Same tag name may have different meaning in different organizations, causing confusion on exchanged documents

Specifying a unique string as an element name avoids confusion

Better solution: use unique-name:element-name

Avoid using long unique names all over document by using XML Namespaces

<university xmlns:yale="http://www.yale.edu">
...
<yale:course>
<yale:course_id> CS-101 </yale:course_id>
<yale:title> Intro. to Computer Science </yale:title>
<yale:dept_name> Comp. Sci. </yale:dept_name>
<yale:credits> 4 </yale:credits>
</yale:course>
...
</university>
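
A short sketch of how a namespace prefix expands to the full unique name when a document is parsed. It uses Python's standard `xml.etree.ElementTree`; the prefix `yale` and the URI follow the slide's example, and the document content is abbreviated for illustration.

```python
import xml.etree.ElementTree as ET

doc = """<university xmlns:yale="http://www.yale.edu">
  <yale:course>
    <yale:course_id>CS-101</yale:course_id>
    <yale:title>Intro. to Computer Science</yale:title>
  </yale:course>
</university>"""

root = ET.fromstring(doc)

# The parser replaces the short prefix with the full namespace URI.
expanded_tag = root[0].tag

# Queries supply their own prefix-to-URI mapping.
ns = {"yale": "http://www.yale.edu"}
course_ids = [e.text for e in root.findall("yale:course/yale:course_id", ns)]

print(expanded_tag)
print(course_ids)
```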




XML Document Schema

Database schemas constrain what information can be stored, and the data types of stored values

XML documents are not required to have an associated schema

However, schemas are very important for XML data exchange

Otherwise, a site cannot automatically interpret data received from another site

Two mechanisms for specifying XML schema

Document Type Definition (DTD)

Widely used

XML Schema

Newer, increasing use


Document Type Definition (DTD)

The type of an XML document can be specified using a DTD

DTD constrains structure of XML data

What elements can occur

What attributes can/must an element have

What subelements can/must occur inside each element, and how many times.

DTD does not constrain data types

All values represented as strings in XML

DTD syntax

<!ELEMENT element (subelements-specification) >

<!ATTLIST element (attributes) >


Element Specification in DTD

Subelements can be specified as

names of elements, or

#PCDATA (parsed character data), i.e., character strings

EMPTY (no subelements) or ANY (anything can be a subelement)

Example

<!ELEMENT department (dept_name, building, budget)>
<!ELEMENT dept_name (#PCDATA)>
<!ELEMENT budget (#PCDATA)>

Subelement specification may have regular expressions

Notation:

| - alternatives

+ - 1 or more occurrences

* - 0 or more occurrences


University DTD

]>




Attribute Specification in DTD

Attribute specification: for each attribute

Name

Type of attribute

CDATA

ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

more on this later

Whether

mandatory (#REQUIRED)

has a default value (value),

or neither (#IMPLIED)

Examples

<!ATTLIST course course_id CDATA #REQUIRED>, or

<!ATTLIST course
course_id ID #REQUIRED
dept_name IDREF #REQUIRED
instructors IDREFS #IMPLIED >


IDs and IDREFs

An element can have at most one attribute of type ID

The ID attribute value of each element in an XML document must be distinct

Thus the ID attribute value is an object identifier

An attribute of type IDREF must contain the ID value of an element in the same document

An attribute of type IDREFS contains a set of (0 or more) ID values. Each ID value must be the ID value of an element in the same document
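
The ID and IDREF rules above amount to two checks: ID values are distinct, and every reference points at an existing ID. Python's standard library does not validate against a DTD, so this sketch hard-codes which attributes play the ID, IDREF, and IDREFS roles (the three dictionaries are assumptions standing in for ATTLIST declarations), and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Assumed attribute roles, standing in for the DTD's ATTLIST declarations.
ID_ATTRS = {"department": "dept_name", "course": "course_id", "instructor": "IID"}
IDREF_ATTRS = {"course": ["dept_name"], "instructor": ["dept_name"]}
IDREFS_ATTRS = {"course": ["instructors"]}

def check_ids(root):
    """Return a list of ID/IDREF violations found in the document."""
    errors = []
    ids = set()
    # Pass 1: collect ID values and report duplicates.
    for elem in root.iter():
        attr = ID_ATTRS.get(elem.tag)
        if attr and attr in elem.attrib:
            value = elem.attrib[attr]
            if value in ids:
                errors.append(f"duplicate ID {value!r}")
            ids.add(value)
    # Pass 2: every IDREF / IDREFS value must match some ID in the same document.
    for elem in root.iter():
        refs = [elem.attrib[a] for a in IDREF_ATTRS.get(elem.tag, []) if a in elem.attrib]
        for a in IDREFS_ATTRS.get(elem.tag, []):
            refs.extend(elem.attrib.get(a, "").split())  # IDREFS is space-separated
        for ref in refs:
            if ref not in ids:
                errors.append(f"dangling reference {ref!r} in <{elem.tag}>")
    return errors

doc = ET.fromstring("""
<university>
  <department dept_name="Comp.Sci."/>
  <instructor IID="i10101" dept_name="Comp.Sci."/>
  <course course_id="CS-101" dept_name="Comp.Sci." instructors="i10101 i83821"/>
</university>""")

errors = check_ids(doc)
print(errors)
```

Note that the check only asks whether a referenced ID exists somewhere in the document, not what kind of element carries it.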



University DTD with Attributes

University DTD with ID and IDREF attribute types.

<!ATTLIST department
dept_name ID #REQUIRED >

<!ATTLIST instructor
IID ID #REQUIRED
dept_name IDREF #REQUIRED >

... declarations for title, credits, building, budget, name and salary ...

]>


Limitations of DTDs

No typing of text elements and attributes

All values are strings, no integers, reals, etc.

Difficult to specify unordered sets of subelements

Order is usually irrelevant in databases (unlike in the document-layout environment from which XML evolved)

(A | B)* allows specification of an unordered set, but

Cannot ensure that each of A and B occurs only once

IDs and IDREFs are untyped

The instructors attribute of a course may contain a reference to another course, which is meaningless

instructors attribute should ideally be constrained to refer to instructor elements
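
The weakness of (A | B)* noted above can be seen by treating the content model as an ordinary regular expression over child element names. This is only an analogy in Python's `re` module, not DTD validation; the single-letter names A and B follow the slide.

```python
import re

# Content model (A | B)* written as a regular expression over child names.
any_order = re.compile(r"(A|B)*")

# Enumerating both orders forces exactly one A and one B.
exactly_once = re.compile(r"AB|BA")

def matches(model, children):
    """True if the sequence of child element names satisfies the content model."""
    return model.fullmatch("".join(children)) is not None

for children in (["A", "B"], ["B", "A"], ["A", "A", "B"], []):
    print(children, matches(any_order, children), matches(exactly_once, children))
```

Enumerating the orders works for two subelements, but the number of alternatives grows factorially with the number of subelements, which is why this is described as difficult rather than impossible.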



XML Schema

XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports

Typing of values

E.g. integer, string, etc.

Also, constraints on min/max values

User-defined, complex types

Many more features, including

uniqueness and foreign key constraints, inheritance

XML Schema is itself specified in XML syntax, unlike DTDs

More-standard representation, but verbose

XML Schema is integrated with namespaces

BUT: XML Schema is significantly more complicated than DTDs.


Decision Support Systems

Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.

Examples of business decisions:

What items to stock?

What insurance premium to charge?

To whom to send advertisements?

Examples of data used for making decisions

Retail sales transaction details

Customer profiles (income, age, gender, etc.)


Decision-Support Systems: Overview

Data analysis tasks are simplified by specialized tools and SQL extensions

Example tasks

For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year

As above, for each product category and each customer category

Statistical analysis packages (e.g., S++) can be interfaced with databases

Statistical analysis is a large field, but not covered here

Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.

A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.

Important for large businesses that generate data from multiple divisions, possibly at multiple sites

Data may also be purchased externally
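
The first example task above (total sales per product category and region, compared with the same quarter a year earlier) can be written as a plain GROUP BY query. The sketch runs it through Python's built-in sqlite3 module; the `sales` table, its columns, the quarter labels, and the figures are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("clothing", "east", "2023Q4", 120),
        ("clothing", "east", "2022Q4", 100),
        ("clothing", "west", "2023Q4", 80),
        ("clothing", "west", "2022Q4", 90),
        ("toys", "east", "2023Q4", 50),
        ("toys", "east", "2023Q4", 25),
        ("toys", "east", "2022Q4", 60),
    ],
)

# One output row per (category, region), with the two quarters side by side.
QUERY = """
SELECT category, region,
       SUM(CASE WHEN quarter = ? THEN amount ELSE 0 END) AS last_quarter,
       SUM(CASE WHEN quarter = ? THEN amount ELSE 0 END) AS same_quarter_last_year
FROM sales
GROUP BY category, region
ORDER BY category, region
"""
report = conn.execute(QUERY, ("2023Q4", "2022Q4")).fetchall()
for row in report:
    print(row)
```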



Data Warehousing

Data sources often store only current data, not historical data

Corporate decision making requires a unified view of all organizational data, including historical data

A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site

Greatly simplifies querying, permits study of historical trends

Shifts decision support query load away from transaction processing systems




Design Issues

When and how to gather data

Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)

Destination driven architecture: warehouse periodically requests new information from data sources

Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive

Usually OK to have slightly out-of-date data at warehouse

Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use

Schema integration


More Warehouse Design Issues

Data cleansing

E.g., correct mistakes in addresses (misspellings, zip code errors)

Merge address lists from different sources and purge duplicates

How to propagate updates

Warehouse schema may be a (materialized) view of schema from data sources

What data to summarize

Raw data may be too large to store on-line

Aggregate values (totals/subtotals) often suffice

Queries on raw data can often be transformed by query optimizer to use aggregate values

Why Data Mining?

The Explosive Growth of Data

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks

Science: Remote sensing, bioinformatics, scientific simulation

Society and everyone: news, digital cameras

We are drowning in data, but starving for knowledge!

"Necessity is the mother of invention": data mining, the automated analysis of massive data sets


Why Data Mining? Potential Applications

Data analysis and decision support

Market analysis and management

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and detection of unusual patterns (outliers)

Other Applications

Text mining (news group, email, documents) and Web mining

Stream data mining

Bioinformatics and bio-data analysis


Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: the KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection and Data Preprocessing -> Task-relevant Data -> Data Mining -> Pattern Evaluation]


Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining: summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge


Data Mining Functionalities

General functionality

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted




Data Mining Functionalities

Multidimensional concept description: characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Association analysis

Diaper → Beer [0.5%, 75%] (Correlation or causality?)

Classification and prediction

Construct models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars based on (gas mileage)

Predict some unknown or missing numerical values
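
The two bracketed numbers attached to the association rule above are conventionally read as the rule's support and confidence. A minimal sketch of how both are computed, assuming that reading; the baskets below are invented and give different percentages from the slide's.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support: fraction of transactions containing both sides of the rule.
    Confidence: fraction of antecedent-containing transactions that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"beer"},
    {"bread"},
    {"milk"},
]

support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
print(support, confidence)  # 3 of 8 baskets hold both items; 3 of the 4 diaper baskets hold beer
```

A high confidence by itself says nothing about causality, which is the point of the slide's parenthetical question.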


Data Mining Functionalities (2)

Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Maximizing intra-class similarity & minimizing interclass similarity

Outlier analysis

Outlier: Data object that does not comply with the general behavior of the data

Noise or exception? Useful in fraud detection, rare events analysis

Trend and evolution analysis

Trend and deviation: e.g., regression analysis

Periodicity analysis

Similarity-based analysis

Other pattern-directed or statistical analyses

December 5, 2013 Data Mining: Concepts and

Data Cleaning

Importance

"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)

"Data cleaning is the number one problem in data warehousing" (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration


Missing Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

data that was inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data not being considered important at the time of entry

history or changes of the data not being registered

Missing data may need to be inferred.


How to Handle Missing Data?

Ignore the tuple: usually done when class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: tedious + infeasible?

Fill it in automatically with

a global constant: e.g., "unknown", a new class?!

the attribute mean

the attribute mean for all samples belonging to the same class: smarter

the most probable value: inference-based such as Bayesian formula or decision tree
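
Two of the automatic strategies above, filling with the attribute mean and filling with the mean of the same class, can be sketched in a few lines of Python. The records, the attribute names `income` and `segment`, and the helper names are made up; the class-mean version assumes every class has at least one known value.

```python
from statistics import mean

def fill_with_mean(rows, attr):
    """Replace missing (None) values of attr with the mean of the known values."""
    overall_mean = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: overall_mean if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean over rows sharing the same class label."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(values) for c, values in by_class.items()}
    return [
        dict(r, **{attr: class_means[r[label]] if r[attr] is None else r[attr]})
        for r in rows
    ]

customers = [
    {"income": 30, "segment": "budget"},
    {"income": 50, "segment": "budget"},
    {"income": None, "segment": "budget"},
    {"income": 100, "segment": "premium"},
    {"income": 120, "segment": "premium"},
    {"income": None, "segment": "premium"},
]

overall = fill_with_mean(customers, "income")
per_class = fill_with_class_mean(customers, "income", "segment")
print([r["income"] for r in overall])
print([r["income"] for r in per_class])
```

The overall mean assigns the same value to both missing entries, while the class mean keeps the two segments apart, which is why the slide calls it smarter.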




Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention

Other data problems which require data cleaning

duplicate records

incomplete data

inconsistent data


How to Handle Noisy Data?

Binning

first sort data and partition into (equal-frequency) bins

then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression

smooth by fitting the data into regression functions

Clustering

detect and remove outliers

Combined computer and human inspection

detect suspicious values and check by human (e.g., deal with possible outliers)


Simple Discretization Methods: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A) / N.

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
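
Both partitioning schemes can be sketched directly from the definitions above. The function names are hypothetical, the equal-width version assumes the attribute has at least two distinct values (so W > 0), and the sample prices are an illustrative list of twelve values.

```python
def equal_width_bins(values, n):
    """Split the range [A, B] into n intervals of width W = (B - A) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in values:
        index = min(int((v - lo) / width), n - 1)  # the maximum goes in the last bin
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins holding (nearly) the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread any remainder over the first bins
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
print(width_bins)
print(depth_bins)
```

With these prices W = (34 - 4) / 3 = 10, and the equal-width bins end up holding 3, 3, and 6 values, while the equal-depth bins hold 4 each.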



Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25

- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34
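
The smoothing steps in this example can be reproduced with a short script. It is a sketch under two assumptions: bin means are rounded to the nearest integer (as the slide's figures suggest), and a value equally close to both boundaries goes to the lower one. Applied to bin 3, the boundary rule gives 26, 26, 26, 34.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

for name, result in (("means", by_means), ("boundaries", by_boundaries)):
    print(name, result)
```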




Regression

[Figure: plot with axes x and y showing the regression line y = x + 1, with a marked point X1 and values labeled Y1]
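
Smoothing by regression replaces each observed value with the value predicted by a fitted function. The sketch below fits a straight line by ordinary least squares; the data points are invented and were chosen so that the fitted line comes out as y = x + 1, matching the line in the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0, 1, 2, 3, 4]
ys = [1.5, 1.0, 4.0, 3.0, 5.5]  # noisy observations scattered around a line

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # noisy values replaced by points on the line

print(slope, intercept)
print(smoothed)
```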



Cluster Analysis