
    ANNA UNIVERSITY- CHENNAI-JUNE 2010 & DECEMBER 2010

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    SUB CODE/SUB NAME: CS9221 / DATABASE TECHNOLOGY

    Part A (10*2=20 Marks)

    1. What is fragmentation? (JUNE 2010)

    Fragmentation is a database server feature that allows you to control where data is stored at

    the table level. Fragmentation enables you to define groups of rows or index keys within a

    table according to some algorithm or scheme. You use SQL statements to create the

    fragments and assign them to dbspaces.

    2. What is Concurrency control? (JUNE 2010) (NOV/ DEC 2010)

    Concurrency control is the activity of coordinating concurrent accesses to a database in a

    multiuser system. Concurrency control allows users to access a database in a multiprogrammed fashion while preserving the consistency of the data.

    3. What is Persistence? (JUNE 2010) (NOV/ DEC 2010)

    Persistence is the property of an object through which its existence transcends time (i.e., the object continues to exist after its creator ceases to exist) and/or space (i.e., the object's location moves from the address space in which it was created).

    4. What is Transaction Processing? (JUNE 2010)

    A Transaction Processing System (TPS) is an information system that processes data transactions in a database system and monitors transaction programs (a special kind of program). For example, when an electronic payment is made, the amount must be both withdrawn from one account and added to the other; it cannot complete only one of those steps. Either both must occur, or neither. In case of a failure preventing transaction completion, the partially executed transaction must be 'rolled back' by the TPS.

    5. What is Client/Server model? (JUNE 2010)


    The server in a client/server model is simply the DBMS, whereas the client is the database

    application serviced by the DBMS.

    The client/server model of a database system is classified into basic & distributed

    client/server model.

    6. What is the difference between data warehousing and data mining? (JUNE 2010)

    Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It thus provides the enterprise with a storage mechanism for its huge amount of data.

    Data mining: It is the process of extracting interesting patterns and knowledge from huge

    amount of data. So we can apply data mining techniques on the data warehouse of an

    enterprise to discover useful patterns.

    7. Why do we need Normalization? (JUNE 2010)

    Normalization is a process followed to eliminate redundant data and establish meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.

    8. What is Integrity? (JUNE 2010) (NOV/ DEC 2010)

    Integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modeling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.

    9. Give two features of Multimedia Databases (JUNE 2010) (NOV/ DEC 2010)

    Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video, tapes, audio records, etc.) so that they can be used (that is, efficiently accessed and searched) for as many applications as needed.

    The objects of multimedia data are text, images, graphics, sound recordings, video recordings, signals, etc., that are digitized and stored.

    10. What are Deductive Databases? (JUNE 2010) (NOV/ DEC 2010)

    A deductive database is the combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine which allows the derivation of information implied by the facts and rules.

    A deductive database system specifies rules through a declarative language - a language in which we specify what to achieve rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is related to the relational data model and also to the field of logic programming and the Prolog language.

    11. What is query processing? (NOV/ DEC 2010)

    Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the query and translating it into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain suitable execution strategies, and then actually executing the query to get the result.

    12.Give two features of object-oriented databases. (NOV/ DEC 2010)

    The features of Object Oriented Databases are,

    It provides persistent storage for objects.

    They may provide one or more of the following: a query language; indexing;

    transaction support with rollback and commit; the possibility of distributing objects

    transparently over many servers.


    13. What is Data warehousing? (NOV/ DEC 2010)

    Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It thus provides the enterprise with a storage mechanism for its huge amount of data.

    14. What is Normalization? (NOV/ DEC 2010)

    Normalization is a process followed to eliminate redundant data and establish meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.

    15. Mention two features of parallel Databases. (NOV/ DEC 2010)

    It is used to provide speedup, where queries are executed faster because more

    resources, such as processors and disks, are provided.

    It is also used to provide scaleup, where increasing workloads are handled without

    increased response time, via an increase in the degree of parallelism.


    Part B (5*16=80 Marks) (JUNE 2010) & (DECEMBER 2010)

    1. (a)Explain the architecture of Distributed Databases.(16) (JUNE 2010)

    Or

    (b) Discuss in detail the architecture of distributed database. (16)

    (NOV/DEC 2010)


    (b)Write notes on the following:

    (i) Query processing. (8) (JUNE 2010)


    (ii) Transaction processing. (8) (JUNE 2010)

    A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency. In a distributed setting this requires, among other things, concurrency transparency and failure transparency (both discussed below).


    Example Transaction (SQL version):

        Begin_transaction Reservation
        begin
            input(flight_no, date, customer_name);
            EXEC SQL UPDATE FLIGHT
                SET STSOLD = STSOLD + 1
                WHERE FNO = flight_no AND DATE = date;
            EXEC SQL INSERT
                INTO FC(FNO, DATE, CNAME, SPECIAL)
                VALUES (flight_no, date, customer_name, null);
            output("reservation completed")
        end. {Reservation}
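    The same all-or-nothing behaviour can be sketched in Python with the standard sqlite3 module. This is only a minimal illustration of the reservation transaction above, not part of the original answer; the FLIGHT/FC table layout and the make_reservation helper are assumptions mirroring the pseudocode.

        import sqlite3

        def make_reservation(conn, flight_no, date, customer_name):
            """Update the seat count and insert the booking as one atomic unit."""
            try:
                with conn:  # commits on success, rolls back on any exception
                    conn.execute(
                        "UPDATE FLIGHT SET STSOLD = STSOLD + 1 WHERE FNO = ? AND DATE = ?",
                        (flight_no, date))
                    conn.execute(
                        "INSERT INTO FC (FNO, DATE, CNAME, SPECIAL) VALUES (?, ?, ?, NULL)",
                        (flight_no, date, customer_name))
                print("reservation completed")
            except sqlite3.Error:
                print("reservation failed, transaction rolled back")

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE FLIGHT (FNO TEXT, DATE TEXT, STSOLD INTEGER)")
        conn.execute("CREATE TABLE FC (FNO TEXT, DATE TEXT, CNAME TEXT, SPECIAL TEXT)")
        conn.execute("INSERT INTO FLIGHT VALUES ('AI101', '2010-06-01', 0)")
        make_reservation(conn, "AI101", "2010-06-01", "Rose")

    Either both statements take effect and are committed together, or neither does; a failure inside the block leaves the database unchanged.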

    Properties of Transactions

    ATOMICITY: all or nothing.
    CONSISTENCY: no violation of integrity constraints.
    ISOLATION: concurrent changes are invisible, i.e. execution is equivalent to a serializable one.
    DURABILITY: committed updates persist.

    These are the ACID properties of a transaction.

    Atomicity

    Either all or none of the transaction's operations are performed. Atomicity requires that if a

    transaction is interrupted by a failure, its partial results must be undone. The activity of preserving

    the transaction's atomicity in presence of transaction aborts due to input errors, system overloads,

    or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of

    system crashes is called crash recovery.

    Consistency

    Internal consistency

    A transaction which executes alone against a consistent database leaves it in a consistent state.

    Transactions do not violate database integrity constraints.


    Transactions are correct programs.

    Isolation

    Degree 0
        Transaction T does not overwrite dirty data of other transactions.
        (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
    Degree 1
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
    Degree 2
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
        T does not read dirty data from other transactions.
    Degree 3
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
        T does not read dirty data from other transactions.
        Other transactions do not dirty any data read by T before T completes.

    Isolation

    Serializability

    If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.

    Incomplete results

    An incomplete transaction cannot reveal its results to other transactions before its commitment.

    Necessary to avoid cascading aborts.

    Durability: Once a transaction commits, the system must guarantee that the results of its

    operations will never be lost, in spite of subsequent failures.

    Database recovery


    Transaction transparency: ensures that all distributed transactions maintain the distributed database's integrity and consistency.

    A distributed transaction accesses data stored at more than one location. Each transaction is divided into a number of subtransactions, one for each site that has to be accessed. The DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions.

    Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order. Replication makes concurrency control more complex.

    Failure transparency: the DDBMS must ensure the atomicity and durability of the global transaction, i.e. that the subtransactions of the global transaction either all commit or all abort.

    Classification transparency: IBM's Distributed Relational Database Architecture (DRDA) defines four types of transactions:

    Remote request

    Remote unit of work

    Distributed unit of work

    Distributed request.

    2. (a)Discuss the Modeling and design approaches for Object Oriented Databases

    (JUNE 2010)

    Or

    (b) Describe modeling and design approaches for object oriented database. (16)

    (NOV/DEC 2010)


    MODELING AND DESIGN

    Basically, an OODBMS is an object database that provides DBMS capabilities to objects that

    have been created using an object-oriented programming language (OOPL). The basic

    principle is to add persistence to objects and to make objects persistent. Consequently

    application programmers who use OODBMSs typically write programs in a native OOPL such

    as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an

    extension of the OOPL.

    Object-oriented DBMSs, however, go much beyond simply adding persistence to any one

    object-oriented programming language. This is because, historically, many object-oriented

    DBMSs were built to serve the market for computer-aided design/computer-aided

    manufacturing (CAD/CAM) applications in which features like fast navigational access,

    versions, and long transactions are extremely important.

    Object-oriented DBMSs, therefore, support advanced object-oriented database applications

    with features like support for persistent objects from more than one programming language,

    distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

    Object data modeling

    An object consists of three parts: structure (attributes and relationships to other objects, such as aggregation and association), behavior (a set of operations) and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate structure and relationships.


    Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex object types would, in an RDBMS, usually be represented as other tables linked by keys to the Book table.

    Relationships: publishedBy and writtenBy are associations with 1:N and 1:1 relationships; composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.

    Generalization/specialization is the is-a relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book, therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

    Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123) is a request to execute the insert method on a Publisher object.

    Method: defines the behavior of an object. Methods can be used to change the object's state by modifying its attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.

    The main differences between relational database design and object-oriented database design include:

    Many-to-many relationships must be removed before entities can be translated into relations.

    Many-to-many relationships can be implemented directly in an object-oriented database.

    Operations are not represented in the relational data model. Operations are one of the main


    components in an object-oriented database. In the relational data model relationships are

    implemented by primary and foreign keys. In the object model objects communicate through their

    interfaces. The interface describes the data (attributes) and operations (methods) that are visible to

    other objects.
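    A brief Python sketch of the Book example discussed above; the class bodies and attribute values are illustrative assumptions, but the attribute names publishedBy, writtenBy and the ArtBook subclass follow the text. It shows complex-typed attributes (association), an aggregation of chapters, a message sent to a Publisher object, and the is-a relationship via subclassing.

        class Publisher:
            def __init__(self, name, city):
                self.name = name
                self.city = city

            def insert(self, name, code):
                # Method invoked when another object sends the "insert" message.
                print(f"insert({name}, {code}) executed on Publisher {self.name}")

        class Author:
            def __init__(self, name):
                self.name = name

        class Chapter:
            def __init__(self, title):
                self.title = title

        class Book:
            def __init__(self, title, publishedBy, writtenBy, chapters):
                self.title = title
                self.publishedBy = publishedBy   # complex-typed attribute (association)
                self.writtenBy = writtenBy       # 1:1 association with an Author object
                self.chapters = chapters         # aggregation: a Book is composed of chapters

        class ArtBook(Book):
            """An ArtBook is a Book: the subclass inherits all attributes and methods."""
            def __init__(self, title, publishedBy, writtenBy, chapters, plates):
                super().__init__(title, publishedBy, writtenBy, chapters)
                self.plates = plates

        p = Publisher("Rose", "Chennai")
        b = ArtBook("Temple Art", p, Author("Kumar"), [Chapter("Origins")], plates=40)
        b.publishedBy.insert("Rose", 123)   # message: ask the Publisher object to run insert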

    (b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)

    Or

    (a) Explain the Multi-Version Locks and Recovery in Query Languages.

    (DECEMBER2010)

    Multi-Version Locks

    Multiversion concurrency control (abbreviated MCC or MVCC) is, in the database field of computer science, a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

    For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.


    MVCC also provides potential "point in time" consistent views. In fact read transactions under

    MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and

    read these "versions" of the data.

    This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes are occurring at a later transaction ID.

    In other words, MVCC provides each user connected to the database with a "snapshot" of the

    database for that person to work with. Any changes made will not be seen by other users of the

    database until the transaction has been committed.

    MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp TS(Ti).

    If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; in other words, a write cannot complete if there are outstanding transactions with an earlier timestamp.

    Every object also has a read timestamp, and if a transaction Ti wants to write to object P and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), then Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

    The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
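    A hedged Python sketch of the timestamp rules just described; the VersionedObject class and its field names are assumptions made for illustration. Each version carries a write timestamp, a reader gets the newest version not later than its own timestamp, and a writer whose timestamp is older than the object's read timestamp is aborted.

        class VersionedObject:
            """Keeps every committed version of a value, tagged with timestamps."""
            def __init__(self, value, ts):
                self.versions = [(ts, value)]   # (write_ts, value) pairs, oldest first
                self.read_ts = ts               # largest timestamp of any reader so far

            def read(self, ts):
                # Return the most recent version whose write timestamp precedes ts.
                visible = [(wts, v) for wts, v in self.versions if wts <= ts]
                if not visible:
                    raise LookupError("no version visible at this timestamp")
                self.read_ts = max(self.read_ts, ts)
                return max(visible)[1]

            def write(self, value, ts):
                # Abort the writer if a younger transaction has already read the object.
                if ts < self.read_ts:
                    raise RuntimeError(f"T{ts} aborted: object already read at {self.read_ts}")
                self.versions.append((ts, value))
                self.versions.sort()

        obj = VersionedObject("Foo", ts=0)
        obj.write("Hello", ts=2)       # new version added, old one kept
        print(obj.read(ts=1))          # -> Foo   (snapshot as of t1)
        print(obj.read(ts=3))          # -> Hello (latest committed version)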

    At t1 the state of the DB could be:

        Time    Object1    Object2
        t1      "Hello"    "Bar"
        t0      "Foo"      "Bar"

    This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions; it will be deleted later.


    If a long-running transaction starts a read operation, it will operate at transaction "t1" and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

        Time    Object1    Object2      Object3
        t2      "Hello"    (deleted)    "Foo-Bar"
        t1      "Hello"    "Bar"
        t0      "Foo"      "Bar"

    Now there is a new version as of transaction ID t2. Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

    Recovery


    3. (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)

    Or

    (a) Explain the features of data warehousing and data mining. (16)

    (NOV/DEC 2010)

    Data Warehouse:

    Large organizations have complex internal organizations, and have data stored at different

    locations, on different operational (transaction processing) systems, under different

    schemas

    Data sources often store only current data, not historical data

    Corporate decision making requires a unified view of all organizational data, including

    historical data

    A data warehouse is a repository (archive) of information gathered from multiple sources,

    stored under a unified schema, at a single site

    Greatly simplifies querying, permits study of historical trends

    Shifts decision support query load away from transaction processing systems

    When and how to gather data

    Source driven architecture: data sources transmit new information to warehouse, either

    continuously or periodically (e.g. at night)

    Destination driven architecture: warehouse periodically requests new information from

    data sources

    Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit)

    is too expensive


    Usually OK to have slightly out-of-date data at the warehouse. Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

    What schema to use

    Schema integration

    Data cleansing

    E.g. correct mistakes in addresses

    E.g. misspellings, zip code errors

    Merge address lists from different sources and purge duplicates

    Keep only one address record per household (householding)

    How to propagate updates

    Warehouse schema may be a (materialized) view of schema from data sources

    Efficient techniques for update of materialized views

    What data to summarize

    Raw data may be too large to store on-line

    Aggregate values (totals/subtotals) often suffice

    Queries on raw data can often be transformed by query optimizer to use aggregate values.


    Typically warehouse data is multidimensional, with very large fact tables

    Examples of dimensions: item-id, date/time of sale, store where sale was made, customer

    identifier

    Examples of measures: number of items sold, price of items

    Dimension values are usually encoded using small integers and mapped to full values via

    dimension tables

    Resultant schema is called a star schema

    More complicated schema structures

    Snowflake schema: multiple levels of dimension tables

    Constellation: multiple fact tables
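    A small Python/sqlite3 sketch of a star schema of the kind described above; the table and column names and the sample rows are illustrative assumptions. It builds a fact table keyed by small integer dimension ids, two dimension tables, and runs the sort of aggregate query a decision-support workload would issue.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        cur = conn.cursor()

        # Dimension tables map small integer keys to full values.
        cur.execute("CREATE TABLE item_dim  (item_id INTEGER PRIMARY KEY, item_name TEXT)")
        cur.execute("CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city TEXT)")

        # Fact table: one row per sale, with measures number_sold and price.
        cur.execute("""CREATE TABLE sales_fact (
                           item_id INTEGER, store_id INTEGER, sale_date TEXT,
                           number_sold INTEGER, price REAL)""")

        cur.executemany("INSERT INTO item_dim VALUES (?, ?)", [(1, "pen"), (2, "notebook")])
        cur.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, "Chennai"), (2, "Madurai")])
        cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
            (1, 1, "2010-06-01", 10, 5.0),
            (2, 1, "2010-06-01", 3, 40.0),
            (1, 2, "2010-06-02", 7, 5.0),
        ])

        # Typical warehouse query: total revenue per city, aggregated over the fact table.
        for city, revenue in cur.execute("""
                SELECT s.city, SUM(f.number_sold * f.price)
                FROM sales_fact f JOIN store_dim s ON f.store_id = s.store_id
                GROUP BY s.city"""):
            print(city, revenue)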

    Data Mining

    Broadly speaking, data mining is the process of semi-automatically analyzing large

    databases to find useful patterns.

    Like knowledge discovery in artificial intelligence, data mining discovers statistical rules

    and patterns

    Differs from machine learning in that it deals with large volumes of data stored primarily

    on disk.

    Some types of knowledge discovered from a database can be represented by a set of rules.

    e.g.,: Young women with annual incomes greater than $50,000 are most likely to buy

    sports cars.


    Other types of knowledge represented by equations, or by prediction functions.

    Some manual intervention is usually required

    Pre-processing of data, choice of which type of pattern to find, postprocessing to

    find novel patterns

    Applications of Data Mining

    Prediction based on past history

    Predict if a credit card applicant poses a good credit risk, based on some attributes

    (income, job type, age, ..) and past history

    Predict if a customer is likely to switch brand loyalty

    Predict if a customer is likely to respond to junk mail

    Predict if a pattern of phone calling card usage is likely to be fraudulent

    Some examples of prediction mechanisms:

    Classification

    Given a training set consisting of items belonging to different classes, and a new

    item whose class is unknown, predict which class it belongs to.

    Regression formulae

    Given a set of parameter-value to function-result mappings for an unknown

    function, predict the function-result for a new parameter-value

    Descriptive Patterns

    Associations

    Find books that are often bought by the same customers. If a new customer buys

    one such book, suggest that he buys the others too.

    Other similar applications: camera accessories, clothes, etc.

    Associations may also be used as a first step in detecting causation

    E.g. association between exposure to chemical X and cancer, or new medicine and

    cardiac problems

    Clusters

    E.g. typhoid cases were clustered in an area surrounding a contaminated well

    Detection of clusters remains important in detecting epidemics


    Classification Rules

    Classification rules help assign new objects to a set of classes. E.g., given a new

    automobile insurance applicant, should he or she be classified as low risk, medium risk or

    high risk?

    Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, salary of the applicant, age of the applicant, etc.

    ∀ person P, P.degree = masters and P.income > 75,000
        ⇒ P.credit = excellent

    ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000)
        ⇒ P.credit = good

    Rules are not necessarily exact: there may be some misclassifications

    Classification rules can be compactly shown as a decision tree.
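    The two rules above can be read directly as a decision procedure. A minimal Python sketch follows; the fallback class for applicants matched by neither rule is an assumption, since the text notes that rule sets need not cover or classify every case correctly.

        def credit_class(degree, income):
            """Assign a credit class using the two classification rules in the text."""
            if degree == "masters" and income > 75_000:
                return "excellent"
            if degree == "bachelors" and 25_000 <= income <= 75_000:
                return "good"
            return "unknown"   # no rule fires; real rule sets may misclassify some cases

        print(credit_class("masters", 90_000))    # excellent
        print(credit_class("bachelors", 50_000))  # good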

    Decision Tree

    Training set: a data sample in which the grouping for each tuple is already known.

    Consider credit risk example: Suppose degree is chosen to partition the data at the root.

    Since degree has a small number of possible values, one child is created for each

    value.

    At each child node of the root, further classification is done if required. Here, partitions are

    defined by income.

    Since income is a continuous attribute, some number of intervals are chosen, and

    one child created for each interval.

    Different classification algorithms use different ways of choosing which attribute to

    partition on at each node, and what the intervals, if any, are.

    In general

    Different branches of the tree could grow to different levels.

    Different nodes at the same level may use different partitioning attributes.

    Greedy top down generation of decision trees.


    Each internal node of the tree partitions the data into groups based on a partitioning

    attribute, and a partitioning condition for the node

    More on choosing the partitioning attribute/condition shortly.

    Algorithm is greedy: the choice is made once and not revisited as more of the tree is

    constructed

    The data at a node is not partitioned further if either

    All (or most) of the items at the node belong to the same class, or

    All attributes have been considered, and no further partitioning is possible.

    Such a node is a leaf node.

    Otherwise the data at the node is partitioned further by picking an attribute for

    partitioning data at the node.

    Decision-Tree Construction Algorithm

        Procedure GrowTree(S)
            Partition(S);

        Procedure Partition(S)
            if (purity(S) > δp or |S| < δs) then return;

    Although it is possible to replace a non-binary (n-ary, n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

    Mapping Cardinalities

    Express the number of entities to which another entity can be associated via a relationship

    set.

    Most useful in describing binary relationship sets.

    For a binary relationship set the mapping cardinality must be one of the following types:

    One to one

    One to many

    Many to one

    Many to many

    We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (-), signifying "many", between the relationship set and the entity set.

    One-To-One Relationship

    A customer is associated with at most one loan via the relationship borrower

    A loan is associated with at most one customer via borrower


    One-To-Many and Many-To-One Relationship

    In the one-to-many relationship (a), a loan is associated with at most one customer via

    borrower; a customer is associated with several (including 0) loans via borrower

    In the many-to-one relationship (b), a loan is associated with several (including 0)

    customers via borrower; a customer is associated with at most one loan via borrower

    Many-To-Many Relationship

    A customer is associated with several (possibly 0) loans via borrower

    A loan is associated with several (possibly 0) customers via borrower

    Existence Dependencies


    If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.

        y is a dominant entity (in the example below, loan)
        x is a subordinate entity (in the example below, payment)

    If a loan entity is deleted, then all its associated payment entities must be deleted also.

    E-R Diagram Components

    Rectangles represent entity sets.

    Ellipses represent attributes.

    Diamonds represent relationship sets.

    Lines link attributes to entity sets and entity sets to relationship sets.

    Double ellipses represent multivalued attributes.

    Dashed ellipses denote derived attributes.

    Primary key attributes are underlined.

    Weak Entity Sets

    An entity set that does not have a primary key is referred to as a weak entity set.

    The existence of a weak entity set depends on the existence of a strong entity set; it must

    relate to the strong set via a one-to-many relationship set.

    The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.

    The primary key of a weak entity set is formed by the primary key of the strong entity set

    on which the weak entity set is existence dependent, plus the weak entity set's

    discriminator.

    We depict a weak entity set by double rectangles.

    We underline the discriminator of a weak entity set with a dashed line.

    payment-number -- discriminator of the payment entity set

    Primary key for payment -- (loan-number, payment-number)

    Specialization


    Top-down design process; we designate subgroupings within an entity set that are

    distinctive from other entities in the set.

    These subgroupings become lower-level entity sets that have attributes or participate in

    relationships that do not apply to the higher-level entity set.

    Depicted by a triangle component labeled ISA (e.g., savings-account is an account).

    Generalization

    A bottom-up design process -- combine a number of entity sets that share the same features

    into a higher-level entity set

    Specialization and generalization are simple inversions of each other; they are represented

    in an E-R diagram in the same way.

    Attribute Inheritance -- a lower-level entity set inherits all the attributes and relationship

    participation of the higher-level entity set to which it is linked.

    Design Constraints on a Generalization

    Constraint on which entities can be members of a given lower-level entity set.

    condition-defined

    user-defined

    Constraint on whether or not entities may belong to more than one lower-level entity set

    within a single generalization.

    disjoint

    overlapping

    Completeness constraint -- specifies whether or not an entity in the higher-level entity set

    must belong to at least one of the lower-level entity sets within a generalization.

    Total

    Partial

    Aggregation

    Relationship sets borrower and loan-officer represent the same information.

    Eliminate this redundancy via aggregation

    Treat relationship as an abstract entity

    Allows relationships between relationships

    Abstraction of relationship into new entity


    Without introducing redundancy , the following diagram represents that:

    A customer takes out a loan

    An employee may be a loan officer for a customer-loan pair

    E-R Design Decisions

    The use of an attribute or entity set to represent an object

    Whether a real-world concept is best expressed by an entity set or a relationship set.

    The use of a ternary relationship versus a pair of binary relationships.

    The use of a strong or weak entity set.

    The use of generalization -- contributes to modularity in the design.

    The use of aggregation -- can treat the aggregate entity set as a single unit without concern

    for the details of its internal structure.

    Reduction of an E-R schema to Tables

    Primary keys allow entity sets and relationship sets to be expressed uniformly as tables

    which represent the contents of the database.

    A database which conforms to an E-R diagram can be represented by a collection of tables.

    For each entity set and relationship set there is a unique table which is assigned the name of

    the corresponding entity set or relationship set.

    Each table has a number of columns (generally corresponding to attributes), which have unique names.

    Converting an E-R diagram to a table format is the basis for deriving a relational database

    design from an E-R diagram.

    Representing Entity Sets as Tables

    A strong entity set reduces to a table with the same attributes.

    A weak entity set becomes a table that includes a column for the primary key of the

    identifying strong entity set.

    Representing Relationship Sets as Tables

    A many-to-many relationship set is represented as a table with columns for the primary

    keys of the two participating entity sets, and any descriptive attributes of the relationship

    set.

    The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
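    A short Python/sqlite3 sketch of the reduction just described; the exact table layouts are assumptions based on the loan/payment example in the text. The strong entity set keeps its own key, while the weak entity set's table carries the identifying strong key plus the discriminator, which is why a separate loan-payment table would be redundant.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        # Strong entity set: a table with its own primary key.
        conn.execute("CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL)")
        # Weak entity set: primary key = identifying strong key + discriminator.
        conn.execute("""CREATE TABLE payment (
                            loan_number    TEXT REFERENCES loan(loan_number),
                            payment_number INTEGER,          -- discriminator
                            payment_date   TEXT,
                            payment_amount REAL,
                            PRIMARY KEY (loan_number, payment_number))""")

        conn.execute("INSERT INTO loan VALUES ('L-17', 1000.0)")
        conn.execute("INSERT INTO payment VALUES ('L-17', 1, '2010-06-05', 100.0)")
        # The loan-payment relationship is already captured by payment.loan_number,
        # so no separate loan-payment table is needed.
        print(conn.execute("SELECT * FROM payment").fetchall())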

    E-R Diagram for Banking Enterprise

    Representing Generalization as Tables

    Method 1: Form a table for the generalized (higher-level) entity set, e.g. account, and form a table for each lower-level entity set, including the primary key of the generalized entity set along with its own attributes.

    Method 2: Form a table for each lower-level entity set, containing all of its local and inherited attributes.

    (b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)


    Or

    (a) Give features of Temporal and Spatial Databases. Temporal Database.

    (DECEMBER 2010)

    Temporal Database

    Time Representation, Calendars, and Time Dimensions

    Time is considered ordered sequence of points in some granularity

    Use the term chronon instead of point to describe the minimum granularity.

    A calendar organizes time into different time units for convenience.

    Accommodates various calendars

    Gregorian (western), Chinese, Islamic, Etc.

    Point events

    Single time point event

    E.g., bank deposit

    Series of point events can form a time series data

    Duration events

    Associated with specific time period

    Time period is represented by start time and end time

    Transaction time

    The time when the information from a certain transaction becomes current in the database.

    Bitemporal database

    Databases dealing with two time dimensions

    Incorporating Time in Relational Databases Using Tuple Versioning

    Add to every tuple

    Valid start time

    Valid end time
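    A minimal Python sketch of tuple versioning as outlined above; the relation layout, the FAR_FUTURE marker and the helper name are assumptions. Every tuple carries a valid start time and a valid end time, and an update closes the current version and opens a new one instead of overwriting it.

        import datetime

        FAR_FUTURE = datetime.date(9999, 12, 31)   # "still valid" marker

        # Each row: (emp_id, salary, valid_start, valid_end)
        emp_history = [("E1", 30_000, datetime.date(2009, 1, 1), FAR_FUTURE)]

        def update_salary(history, emp_id, new_salary, change_date):
            """Close the current version of the tuple and append the new one."""
            new_history = []
            for eid, sal, start, end in history:
                if eid == emp_id and end == FAR_FUTURE:
                    new_history.append((eid, sal, start, change_date))      # old version closed
                else:
                    new_history.append((eid, sal, start, end))
            new_history.append((emp_id, new_salary, change_date, FAR_FUTURE))  # new version
            return new_history

        emp_history = update_salary(emp_history, "E1", 35_000, datetime.date(2010, 6, 1))
        for row in emp_history:
            print(row)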


    Incorporating Time in Object-Oriented Databases Using Attribute Versioning

    A single complex object stores all temporal changes of the object

    Time varying attribute

    An attribute that changes over time

    E.g., age

    Non-Time varying attribute

    An attribute that does not change over time

    E.g., date of birth

    Spatial Database


    Types of Spatial Data

    Point Data

    Points in a multidimensional space

    E.g., Raster data such as satellite imagery, where each

    pixel stores a measured value

    E.g., Feature vectors extracted from text

    Region Data

    Objects have spatial extent with location and boundary.

    DB typically uses geometric approximations constructed using line segments, polygons,

    etc., called vector data.

    Types of Spatial Queries

    Spatial Range Queries

    Find all cities within 50 miles of Madison

    Query has associated region (location, boundary)

    Answer includes overlapping or contained data regions.

    Nearest-Neighbor Queries
    Find the 10 cities nearest to Madison.

    Results must be ordered by proximity

    Spatial Join Queries

    Find all cities near a lake

    Expensive, join condition involves regions and proximity
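    A small Python sketch of a spatial range query over point data; the city coordinates and the distance threshold are made-up values used only for illustration. It simply tests Euclidean distance against the query radius, which is the answer an R-tree would produce without having to scan every point.

        import math

        cities = {
            "Madison":   (43.07, -89.40),
            "Milwaukee": (43.04, -87.91),
            "Chicago":   (41.88, -87.63),
        }

        def within_radius(points, centre, radius):
            """Naive spatial range query: keep points whose distance <= radius."""
            cx, cy = centre
            return [name for name, (x, y) in points.items()
                    if math.hypot(x - cx, y - cy) <= radius]

        # "Find all cities within (roughly) 1.5 degrees of Madison"
        print(within_radius(cities, cities["Madison"], 1.5))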

    Applications of Spatial Data

    Geographic Information Systems (GIS)

    E.g., ESRIs ArcInfo; OpenGIS Consortium

    Geospatial information

    All classes of spatial queries and data are common

    Computer-Aided Design/Manufacturing

    Store spatial objects such as surface of airplane fuselage


    Range queries and spatial join queries are common

    Multimedia Databases

    Images, video, text, etc. stored and retrieved by content

    First converted to feature vector form; high dimensionality

    Nearest-neighbor queries are the most common

    Single-Dimensional Indexes

    B+ trees are fundamentally single-dimensional indexes.

    When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.

    Multi-dimensional Indexes

    A multidimensional index clusters entries so as to exploit nearness in multidimensional space. Keeping track of entries and maintaining a balanced index structure presents a challenge.


    Motivation for Multidimensional Indexes

    Spatial queries (GIS, CAD).

    Find all hotels within a radius of 5 miles from the conference venue.

    Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.

    Find all cities that lie on the Nile in Egypt.

    Find all parts that touch the fuselage (in a plane design).

    Similarity queries (content-based retrieval). Given a face, find the five most similar faces.

    Multidimensional range queries. 50 < age < 55 AND 80K < sal < 90K

    Drawbacks

    An index based on spatial location is needed.
        One-dimensional indexes don't support multidimensional searching efficiently.
        Hash indexes only support point queries; we want to support range queries as well.
        Must support inserts and deletes gracefully.

    Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.

    R-Tree

    R-Tree Properties

    Leaf entry = < n-dimensional box, rid >

    This is Alternative (2), with key value being a box.


    Box is the tightest bounding box for a data object.

    Non-leaf entry = < n-dim box, ptr to child node >

    Box covers all boxes in child node (in fact, subtree).

    All leaves at same distance from root.

    Nodes can be kept 50% full (except root).

    Can choose a parameter m that is at most 50%, and require that every node be at least m% full.

    Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by

    avoiding some branches altogether.

    Cost of overlap test is higher than bounding box intersection, but it is a main-memory

    cost, and can actually be done quite efficiently. Generally a win.

    Insert Entry

    Start at root and go down to best-fit leaf L.

    Go to child whose box needs least enlargement to cover B; resolve ties by going to

    smallest area child.

    If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.

    Adjust entry for L in its parent so that the box now covers (only) L1.

    Add an entry (in the parent node of L) for L2. (This could cause the parent node to

    recursively split.)

    Splitting a Node during Insertion

    The entries in node L plus the newly inserted entry must be distributed between L1 and L2.

    Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.

    Idea: Redistribute so as to minimize area of L1 plus area of L2.

    R-Tree Variants

    The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a

    node overflows, instead of splitting.

    Remove some (say, 30% of the) entries and reinsert them into the tree.

    Could result in all reinserted entries fitting on some existing pages, avoiding a split.

    R* trees also use a different heuristic, minimizing box perimeters rather than box areas

    during insertion.

    Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if

    necessary.

    Searches now take a single path to a leaf, at cost of redundancy.

    GiST

    The Generalized Search Tree (GiST) abstracts the tree nature of a class of indexes

    including B+ trees and R-tree variants.


    Striking similarities in insert/delete/search and even concurrency control algorithms

    make it possible to provide templates for these algorithms that can be customized to

    obtain the many different tree index structures.

    B+ trees are so important (and simple enough to allow further specialization) that they

    are implemented specially in all DBMSs.

    GiST provides an alternative for implementing other tree indexes in an ORDBMS.

    Indexing High-Dimensional Data

    Typically, high-dimensional datasets are collections of points, not regions.

    E.g., Feature vectors in multimedia applications.

    Very sparse

    Nearest neighbor queries are common.

    The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

    As dimensionality increases contrast (ratio of distances between nearest and farthest points)

    usually decreases; nearest neighbor is not meaningful.

    In any given data set, advisable to empirically test contrast.

    5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

    Parallel Databases

    Introduction

    Parallel machines are becoming quite common and affordable

    Prices of microprocessors, memory and disks have dropped sharply

    Databases are growing increasingly large

    Large volumes of transaction data are collected and stored for later analysis.

    multimedia objects like images are increasingly stored in databases

    Large-scale parallel database systems increasingly used for:

    storing large volumes of data

    processing time-consuming decision-support queries


    providing high throughput for transaction processing

    Parallelism in Databases

    Data can be partitioned across multiple disks for parallel I/O.

    Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel

    Data can be partitioned and each processor can work independently on its own partition.

    Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.

    Different queries can be run in parallel with each other.

    Concurrency control takes care of conflicts.

    Thus, databases naturally lend themselves to parallelism.

    I/O Parallelism

    Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.

    Horizontal partitioning: the tuples of a relation are divided among many disks such that each tuple resides on one disk.

    Partitioning techniques (number of disks = n):

    Round-robin:

    Send the ith tuple inserted in the relation to disk i mod n.

    Hash partitioning:

    Choose one or more attributes as the partitioning attributes.

    Choose a hash function h with range 0...n-1.

    Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.

    Partitioning techniques (cont.):

    Range partitioning:

    Choose an attribute as the partitioning attribute.

    A partitioning vector [vo, v1, ..., vn-2] is chosen.


    Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.

    E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
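    A compact Python sketch of the three partitioning techniques; the tuple values and the vector [5, 11] follow the worked example in the text, while n = 3 disks is an assumption.

        import bisect

        n = 3                      # number of disks
        vector = [5, 11]           # range-partitioning vector from the example

        def round_robin(i, n):
            """Send the i-th inserted tuple to disk i mod n."""
            return i % n

        def hash_partition(value, n):
            """Disk chosen by a hash function with range 0..n-1."""
            return hash(value) % n

        def range_partition(value, vector):
            """Tuples with value < 5 go to disk 0, 5..10 to disk 1, >= 11 to disk 2."""
            return bisect.bisect_right(vector, value)

        for v in (2, 8, 20):
            print(v, "-> disk", range_partition(v, vector))
        # prints: 2 -> disk 0, 8 -> disk 1, 20 -> disk 2 (matching the worked example)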

    Comparison of Partitioning Techniques

    Evaluate how well partitioning techniques support the following types of data access:

    1. Scanning the entire relation.

    2. Locating a tuple associatively (point queries). E.g., r.A = 25.

    3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.

    Round robin:

    Advantages

    Best suited for sequential scan of entire relation on each query.

    All disks have almost an equal number of tuples; retrieval work is thus well balanced

    between disks.

    Range queries are difficult to process

    No clustering -- tuples are scattered across all disks

    Hash partitioning:

    Good for sequential access

    Assuming hash function is good, and partitioning attributes form a key, tuples will be

    equally distributed between disks

    Retrieval work is then well balanced between disks.

    Good for point queries on partitioning attribute

    Can lookup single disk, leaving others available for answering other queries.

    Index on partitioning attribute can be local to disk, making lookup and update more

    efficient

    No clustering, so difficult to answer range queries


    Range partitioning:
        Provides data clustering by partitioning attribute value.
        Good for sequential access.
        Good for point queries on the partitioning attribute: only one disk needs to be accessed.
        For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries.
        Good if the result tuples are from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted; this is an example of execution skew.

    Partitioning a Relation across Disks

    If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
    Large relations are preferably partitioned across all the available disks.
    If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

    Handling of Skew

    The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.

    Types of skew:

    Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.

    Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning if a good hash function is chosen.

    Handling Skew in Range-Partitioning

    To create a balanced partitioning vector (assuming partitioning attribute forms a key of the

    relation):

    Sort the relation on the partitioning attribute.

    Construct the partition vector by scanning the relation in sorted order as follows.

    " After every 1/nth of the relation has been read, the value of the partitioning attribute of

    the next tuple is added to the partition vector.


    n denotes the number of partitions to be constructed.

    Duplicate entries or imbalances can result if duplicates are present in partitioning attributes.

    Alternative technique based on histograms used in practice
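    A hedged Python sketch of the balanced-vector construction just described; the sample data is invented. Sort on the partitioning attribute and record a cut point after every 1/n-th of the tuples.

        def balanced_partition_vector(values, n_partitions):
            """Build a range-partitioning vector with roughly equal partitions."""
            ordered = sorted(values)                 # sort relation on the partitioning attribute
            step = len(ordered) // n_partitions      # tuples per partition
            # After every 1/n-th of the relation, take the next value as a cut point.
            return [ordered[i * step] for i in range(1, n_partitions)]

        skewed = [1, 1, 1, 1, 2, 2, 3, 7, 8, 9, 50, 60]   # attribute-value skew
        print(balanced_partition_vector(skewed, 3))        # -> [2, 8]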

    Handling Skew using Histograms

    A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion, assuming a uniform distribution within each range of the histogram. The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.

    Interquery Parallelism

    Queries/transactions execute in parallel with one another.

    Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.

    This is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
        Locking and logging must be coordinated by passing messages between processors.
        Data in a local buffer may have been updated at another processor.
        Cache coherency has to be maintained; reads and writes of data in the buffer must find the latest version of the data.

    Cache Coherency Protocol

    Example of a cache coherency protocol for shared-disk systems:
        Before reading/writing to a page, the page must be locked in shared/exclusive mode.
        On locking a page, the page must be read from disk.
        Before unlocking a page, the page must be written to disk if it was modified.

    More complex protocols with fewer disk reads/writes exist.

    Cache coherency protocols for shared-nothing systems are similar. Each database page is assigned a home processor. Requests to fetch the page or write it to disk are sent to the home processor.

    Intraquery Parallelism

    Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.

    Two complementary forms of intraquery parallelism:
        Intraoperation parallelism: parallelize the execution of each individual operation in the query.
        Interoperation parallelism: execute the different operations in a query expression in parallel.

    The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

    Parallel Sort

    Range-Partitioning Sort
        Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
        Create a range-partition vector with m entries on the sorting attributes.
        Redistribute the relation using range partitioning:
            all tuples that lie in the i-th range are sent to processor Pi;
            Pi stores the tuples it receives temporarily on disk Di;
            this step requires I/O and communication overhead.
        Each processor Pi sorts its partition of the relation locally.
        Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
        The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
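    A sequential Python simulation of range-partitioning sort; the processor count and data are assumptions, and a real system would ship the partitions over the network and sort them concurrently. Redistribute by range, sort each partition "locally", then concatenate.

        import bisect

        def range_partition_sort(tuples, vector):
            """Range-partition, sort each partition 'locally', then concatenate."""
            partitions = [[] for _ in range(len(vector) + 1)]   # one per processor
            for t in tuples:
                partitions[bisect.bisect_right(vector, t)].append(t)   # redistribution step
            for p in partitions:
                p.sort()                     # each processor sorts its partition independently
            merged = []
            for p in partitions:             # trivial final merge: concatenate in order
                merged.extend(p)
            return merged

        data = [17, 3, 25, 9, 14, 1, 30, 12]
        print(range_partition_sort(data, vector=[10, 20]))   # -> [1, 3, 9, 12, 14, 17, 25, 30]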

    Parallel External Sort-Merge

    Assume the relation has already been partitioned among disks D0, ..., Dn-1.
    Each processor Pi locally sorts the data on disk Di.
    The sorted runs on each processor are then merged to get the final sorted output.
    Parallelize the merging of sorted runs as follows:
        The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
        Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
        The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

    Parallel Join

    The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and

    if they do, the pair is added to the join output.

    Parallel join algorithms attempt to split the pairs to be tested over several processors. Each

    processor then computes part of the join locally.

    In a final step, the results from each processor can be collected together to produce the final result.

    Partitioned Join

For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.

Either range partitioning or hash partitioning can be used.

r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.

Partitions ri and si are sent to processor Pi.

    61

  • 7/31/2019 Dbt Anna University Qa June 2010 Dec 2010

    62/90

Each processor Pi locally computes the join of ri with si.

Any of the standard join methods can be used.
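A toy version of this partitioned equi-join is sketched below. It is a single-process sketch; the relation contents, the hash partitioning on the join attribute, and the local hash join are all illustrative assumptions.

# Sketch of a partitioned equi-join: both relations are hash-partitioned on the
# join attribute, and each "processor" joins only its own pair of partitions.

def hash_partition(relation, key_index, n):
    parts = [[] for _ in range(n)]
    for tup in relation:
        parts[hash(tup[key_index]) % n].append(tup)
    return parts

def local_join(r_part, s_part):
    """Simple local hash join of one pair of partitions on r.A = s.B."""
    index = {}
    for s_tup in s_part:
        index.setdefault(s_tup[0], []).append(s_tup)        # s.B is attribute 0 here
    return [r_tup + s_tup
            for r_tup in r_part
            for s_tup in index.get(r_tup[1], [])]           # r.A is attribute 1 here

n = 3
r = [("x", 1), ("y", 2), ("z", 1)]       # (payload, A)
s = [(1, "p"), (2, "q")]                 # (B, payload)
r_parts = hash_partition(r, 1, n)
s_parts = hash_partition(s, 0, n)
result = [t for i in range(n) for t in local_join(r_parts[i], s_parts[i])]
print(result)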

    Fragment-and-Replicate Join

    Partitioning not possible for some join conditions,

    e.g., non-equijoin conditions, such as r.A > s.B.

For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.


    Special case asymmetric fragment-and-replicate:

    One of the relations, say r, is partitioned; any partitioning technique can be used.

The other relation, s, is replicated across all the processors.

Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.

Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used.

E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
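As a small sketch of asymmetric fragment-and-replicate under an invented non-equijoin condition (r.A > s.B), each fragment of r is joined with a full copy of s; the data and the per-fragment loop below are purely illustrative.

# Sketch of asymmetric fragment-and-replicate for a non-equijoin (r.A > s.B).
# r is split into fragments (one per "processor"); s is replicated to all of them.

def fragment(relation, n):
    """Round-robin fragmentation of r; any partitioning technique would do."""
    frags = [[] for _ in range(n)]
    for i, tup in enumerate(relation):
        frags[i % n].append(tup)
    return frags

def local_theta_join(r_frag, s_copy):
    return [(a, b) for a in r_frag for b in s_copy if a[0] > b[0]]   # r.A > s.B

r = [(5,), (1,), (9,), (3,)]
s = [(2,), (4,)]
n = 2
result = [t for frag in fragment(r, n) for t in local_theta_join(frag, s)]
print(result)    # every r tuple was compared with every s tuple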

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join: assume s is smaller than r, and therefore s is chosen as the build relation.


A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors.

Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on the hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.

As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.

Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.

As the r tuples are received at the destination processors, they are repartitioned using the function h2.

Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.

Hash-join optimizations can be applied to the parallel case; e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them out and reading them back in.
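A compressed sketch of the two hash functions is shown below: h1 routes tuples to processors, h2 builds the local hash table for the build/probe phases. All names and data are illustrative assumptions, and the "processors" are simulated by a loop.

# Sketch of partitioned parallel hash-join: h1 chooses the processor,
# h2 partitions/indexes the build relation locally at that processor.

N_PROCS = 4

def h1(join_value):
    return hash(("h1", join_value)) % N_PROCS          # which processor gets the tuple

def h2(join_value):
    return hash(("h2", join_value)) % 8                # local bucket within a processor

def parallel_hash_join(r, s, r_key, s_key):
    # Distribution phase: route build (s) and probe (r) tuples by h1.
    s_at = [[] for _ in range(N_PROCS)]
    r_at = [[] for _ in range(N_PROCS)]
    for t in s:
        s_at[h1(t[s_key])].append(t)
    for t in r:
        r_at[h1(t[r_key])].append(t)

    result = []
    for p in range(N_PROCS):                           # each processor works on its own si, ri
        buckets = {}
        for t in s_at[p]:                              # build phase using h2
            buckets.setdefault((h2(t[s_key]), t[s_key]), []).append(t)
        for t in r_at[p]:                              # probe phase
            result += [t + m for m in buckets.get((h2(t[r_key]), t[r_key]), [])]
    return result

r = [("r1", 10), ("r2", 20), ("r3", 10)]
s = [(10, "s1"), (20, "s2")]
print(parallel_hash_join(r, s, r_key=1, s_key=0))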

    Parallel Nested-Loop Join

Assume that:

relation s is much smaller than relation r, and that r is stored by partitioning;

there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r retained.

Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.

Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism

Pipelined parallelism: consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4.

Set up a pipeline that computes the three joins in parallel:

Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

P2 be assigned the computation of temp2 = temp1 ⋈ r3,

and P3 be assigned the computation of temp2 ⋈ r4.

Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.
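A generator-based sketch of such a pipeline is shown below; each stage hands tuples downstream as soon as they are produced. The relations and the simple nested-loop join used inside each stage are illustrative assumptions, not a real executor.

# Sketch of pipelined parallelism: each join stage is a generator that emits
# result tuples as they are produced, so the next stage can start consuming
# them before the previous stage has finished.

def join_stage(left_stream, right_relation, left_key, right_key):
    for l in left_stream:
        for r in right_relation:
            if l[left_key] == r[right_key]:
                yield l + r                     # pass the tuple downstream immediately

r1 = [(1, "a"), (2, "b")]
r2 = [(1, "x"), (2, "y")]
r3 = [("x", 10), ("y", 20)]
r4 = [(10, "final"), (20, "final")]

temp1 = join_stage(iter(r1), r2, left_key=0, right_key=0)   # P1: r1 join r2
temp2 = join_stage(temp1, r3, left_key=3, right_key=0)      # P2: temp1 join r3
result = join_stage(temp2, r4, left_key=5, right_key=0)     # P3: temp2 join r4
print(list(result))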

    Independent Parallelism

Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4.

Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

P2 be assigned the computation of temp2 = r3 ⋈ r4,

and P3 be assigned the computation of temp1 ⋈ temp2.


P1 and P2 can work independently in parallel.

    P3 has to wait for input from P1 and P2

    Can pipeline output of P1 and P2 to P3, combining independent parallelism and

    pipelined parallelism

    Does not provide a high degree of parallelism

    useful with a lower degree of parallelism.

    less useful in a highly parallel system.

    Query Optimization

    Query optimization in parallel databases is significantly more complex than query optimization in

    sequential databases.

    Cost models are more complicated, since we must take into account partitioning costs and issues

    such as skew and resource contention.

    When scheduling execution tree in parallel system, must decide:

    How to parallelize each operation and how many processors to use for it.

    What operations to pipeline, what operations to execute independently in parallel, and what

    operations to execute sequentially, one after the other.

    Determining the amount of resources to allocate for each operation is a problem.

    E.g., allocating more processors than optimal can result in high communication overhead.

Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

    Text Databases

    A text database is a compilation of documents or other information in the form of a database in

    which the complete text of each referenced document is available for online viewing, printing, or

    downloading. In addition to text documents, images are often included, such as graphs, maps,

    photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and

    staff. Full-text databases are ideally suited to online courses of study, where the student remains at

    home and obtains course materials by downloading them from the Internet. Access to these

    databases is normally restricted to registered personnel or to people who pay a specified fee per


    viewed item. Full-text databases are also used by some corporations, law offices, and government

    agencies.
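Since a text database is searchable by keyword or phrase, a minimal illustration of how such a search could be indexed is sketched below; the documents and the simple whitespace tokenization are invented for the example.

# Minimal sketch of keyword search over a text database using an inverted index.

from collections import defaultdict

docs = {
    1: "parallel databases speed up long running queries",
    2: "text databases are searchable by keyword or phrase",
}

index = defaultdict(set)                      # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(*terms):
    """Return ids of documents containing all of the given keywords."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(keyword_search("databases"))            # {1, 2}
print(keyword_search("databases", "keyword")) # {2}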

    (b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

    Rule-based systems are used as a way to store and manipulate knowledge to interpret information

    in a useful way. They are often used in artificial intelligence applications and research.

Rule-based systems are specialized software that encapsulate human knowledge and expertise, thereby making intelligent decisions quickly and in a repeatable form. They are also known as expert systems and form part of the field of artificial intelligence.

    Rule based systems are:

    Knowledge based systems

    Part of the Artificial Intelligence field

    Computer programs that contain some subject-specific knowledge of one or more

    human experts

    Made up of a set of rules that analyze user supplied information about a specific

    class of problems.

    Systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering: building an expert system.

Knowledge Engineers: the people who build the system.

Knowledge Representation: the symbols used to represent the knowledge.

Factual Knowledge: knowledge of a particular task domain that is widely shared.

Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

    Uses of Rule based Systems

Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.

    Solves problems that would normally be tackled by a medical or other professional.

    Currently used in fields such as accounting, medicine, process control, financial service,

    production, and human resources

    Applications


    A classic example of a rule-based system is the domain-specific expert system that uses rules to

    make deductions or choices. For example, an expert system might help a doctor choose the correct

    diagnosis based on a cluster of symptoms, or select tactical moves to play a game.

    Rule-based systems can be used to perform lexical analysis to compile or interpret computer

    programs, or in natural language processing.

    Rule-based programming attempts to derive execution instructions from a starting set of data and

    rules. This is a more indirect method than that employed by an imperative programming language,

    which lists execution steps sequentially.

    Construction

    A typical rule-based system has four basic components:

A list of rules or rule base, which is a specific type of knowledge base.

An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:

Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.

    Conflict-Resolution: In this second phase, one of the production instantiations in the

    conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

    Temporary working memory.

    A user interface or other connection to the outside world through which input and output

    signals are received and sent.

Components of a Rule-Based System

Set of Rules: derived from the knowledge base and used by the interpreter to evaluate the input data.

Knowledge Engineer: decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain.

Interpreter: interprets the input data and draws a conclusion based on the user's responses.

    Problem-solving Models


Forward-chaining: starts from a set of conditions and moves towards some conclusion.

Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.

    Both problem-solving methods are built into inference engines or inference procedures
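A toy forward-chaining cycle, in the spirit of the match-resolve-act loop described above, is sketched below; the facts and the rule format are invented for the example.

# Toy forward-chaining inference: repeatedly fire rules whose conditions are
# satisfied by working memory until no new facts can be derived.

rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),      # IF conditions THEN conclusion
    ({"suspect_measles"}, "refer_to_doctor"),
]

def forward_chain(facts, rules):
    working_memory = set(facts)
    changed = True
    while changed:                                        # repeat the match-resolve-act cycle
        changed = False
        for conditions, conclusion in rules:
            if conditions <= working_memory and conclusion not in working_memory:
                working_memory.add(conclusion)            # "act": assert the new fact
                changed = True
    return working_memory

print(forward_chain({"has_fever", "has_rash"}, rules))
# {'has_fever', 'has_rash', 'suspect_measles', 'refer_to_doctor'}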

    Advantages

    Provide consistent answers for repetitive decisions, processes and tasks.

    Hold and maintain significant levels of information.

    Reduce employee training costs

    Centralize the decision making process.

    Create efficiencies and reduce the time needed to solve problems.

    Combine multiple human expert intelligences

    Reduce the amount of human errors.

    Give strategic and comparative advantages creating entry barriers to competitors

    Review transactions that human experts may overlook.

    Disadvantages

    Lack human common sense needed in some decision making.

    Will not be able to give the creative responses that human experts can give in unusual

    circumstances.

    Domain experts cannot always clearly explain their logic and reasoning.

    Challenges of automating complex processes.

    Lack of flexibility and ability to adapt to changing environments.

    Not being able to recognize when no answer is available.

    Knowledge Bases

    Knowledge-based Systems: Definition

    A system that draws upon the knowledge of human experts captured in a knowledge-base

    to solve problems that normally require human expertise.


    Heuristic rather than algorithmic

Heuristics in search vs. in KBS: general vs. domain-specific

    Highly specific domain knowledge

    Knowledge is separated from how it is used

    KBS = knowledge-base + inference engine

    KBS Architecture

    The inference engine and knowledge base are separated because:

    the reasoning mechanism needs to be as stable as possible;

    the knowledge base must be able to grow and change, as knowledge is added;

    this arrangement enables the system to be built from, or converted to, a shell.

    It is reasonable to produce a richer, more elaborate, description of the typical expert

    system.

    A more elaborate description, which still includes the components that are to be found in

    almost any real-world system, would look like this:

    Knowledge representation formalisms & Inference

KR formalism and corresponding inference method:

Logic: Resolution principle


Production rules: backward chaining (top-down, goal directed) or forward chaining (bottom-up, data-driven)

Semantic nets & Frames: Inheritance & advanced reasoning

Case-based Reasoning: Similarity based

KBS tools: Shells

    - Consist of KA Tool, Database & Development Interface

    - Inductive Shells

    - simplest

- example cases represented as a matrix of known data (premises) and resulting effects

    - matrix converted into decision tree or IF-THEN statements

    - examples selected for the tool

    Rule-based shells

    - simple to complex

    - IF-THEN rules

    Hybrid shells

- sophisticated & powerful

    - support multiple KR paradigms & reasoning schemes

    - generic tool applicable to a wide range

    Special purpose shells

    - specifically designed for particular types of problems

    - restricted to specialised problems

Building from scratch

    - require more time and effort

    - no constraints like shells

    - shells should be investigated first


    Some example KBSs

    DENDRAL (chemical)

    MYCIN (medicine)

XCON/R1 (computer)

    Typical tasks of KBS

    (1) Diagnosis - To identify a problem given a set of symptoms or malfunctions.

    e.g. diagnose reasons for engine failure

    (2) Interpretation - To provide an understanding of a situation from available information. e.g.

    DENDRAL

    (3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor,

    PLANT

    (4) Design - To develop configurations that satisfy constraints of a design problem. e.g. XCON

    (5) Planning - Both short term & long term in areas like project management, product

    development or financial planning. e.g. HRM

    (6) Monitoring - To check performance & flag exceptions.

    e.g., KBS monitors radar data and estimates the position of the space shuttle

    (7) Control - To collect and evaluate evidence and form opinions on that evidence.

e.g. control a patient's treatment

    (8) Instruction - To train students and correct their performance. e.g. give medical students

    experience diagnosing illness

    (9) Debugging - To identify and prescribe remedies for malfunctions.

    e.g. identify errors in an automated teller machine network and ways to correct the errors

    Advantages

    - Increase availability of expert knowledge

    expertise not accessible

    training future experts

    - Efficient and cost effective

    - Consistency of answers

    - Explanation of solution


    - Deal with uncertainty

    Limitations

    - Lack of common sense

    - Inflexible, Difficult to modify

    - Restricted domain of expertise

    - Lack of learning ability

    - Not always reliable

    6. (a) Compare Distributed databases and conventional databases. (16)

    (NOV/DEC 2010)

    DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralised) databases:

Mimics the organisational structure with its data.

Local access and autonomy without exclusion.

Cheaper to create and easier to expand.

Improved availability/reliability/performance by removing reliance on a central site.

Reduced communication overhead: most data access is local, which is less expensive and performs better.

Improved processing power: many machines handle the database rather than a single server.

Disadvantages compared with conventional databases:

More complex to implement.

More costly to maintain security and integrity control.

Standards and experience are lacking.

Design issues are more complex.


    7. (a) Explain the Multi-Version Locks and Recovery in Query Languages.(NOV/DEC 2010)

Multi-Version Locks: Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer "version". Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

    MVCC also provides potential "point in time" consistent views. In fact read transactions under

    MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and

    read these "versions" of the data.

This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future "version", but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes are occurring at a later transaction ID.

    In other words, MVCC provides each user connected to the database with a "snapshot" of the

    database for that person to work with. Any changes made will not be seen by other users of the

    database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp. If a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).


    The obvious drawback to this system is the cost of storing multiple versions of objects in the

    database. On the other hand reads are never blocked, which can be important for workloads mostly

    involving reading values from the database. MVCC is particularly adept at implementing true

    snapshot isolation, something which other methods of concurrency control frequently do either

    incompletely or with high performance costs.

At t1 the state of the database could be:

Time   Object1   Object2
t1     Hello     Bar
t0     Foo       Bar

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction "t1" and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1   Object2     Object3
t2     Hello     (deleted)   Foo-Bar
t1     Hello     Bar         -
t0     Foo       Bar         -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
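A minimal sketch of this versioning idea is shown below; the in-memory store, the transaction-ID counter, and the snapshot read are illustrative assumptions rather than the internals of any particular DBMS.

# Minimal sketch of multiversion storage: writes append new versions tagged
# with a transaction ID; reads at a given ID see the latest version written
# at or before that ID, so old snapshots stay coherent.

import itertools

class MVCCStore:
    def __init__(self):
        self.versions = {}                    # key -> list of (txn_id, value)
        self._txn_ids = itertools.count(1)

    def write(self, key, value):
        txn_id = next(self._txn_ids)
        self.versions.setdefault(key, []).append((txn_id, value))
        return txn_id

    def read(self, key, as_of_txn):
        """Snapshot read: newest version with txn_id <= as_of_txn."""
        for txn_id, value in reversed(self.versions.get(key, [])):
            if txn_id <= as_of_txn:
                return value
        return None                           # key did not exist at that point

db = MVCCStore()
t1 = db.write("Object1", "Foo")
t2 = db.write("Object1", "Hello")
print(db.read("Object1", as_of_txn=t1))      # Foo   (the old snapshot is still visible)
print(db.read("Object1", as_of_txn=t2))      # Hello (the latest version)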

    Recovery



    (b)Discuss client/server model and mobile databases. (16)

    (NOV/DEC 2010)


    Mobile Databases

    Recent advances in portable and wireless technology led to mobile computing, a new

    dimension in data communication and processing.

    Portable computing devices coupled with wireless communications allow clients to access

    data from virtually anywhere and at any time.

    There are a number of hardware and software problems that must be resolved before the

    capabilities of mobile computing can be fully utilized.

    Some of the software problems which may involve data management, transaction

    management, and database recovery have their origins in distributed database systems.


    In mobile computing, the problems are more difficult, mainly:

    The limited and intermittent connectivity afforded by wireless communications.

The limited life of the power supply (battery).

    The changing topology of the network.

In addition, mobile computing introduces new architectural possibilities and challenges.

    Mobile Computing Architecture

    The general architecture of a mobile platform is illustrated in Fig 30.1.

It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

    Fixed hosts are general purpose computers configured to manage mobile units.

    Base stations function as gateways to the fixed network for the Mobile Units.

    Wireless Communications

The wireless medium has bandwidth significantly lower than that of a wired network.


    The current generation of wireless technology has data rates range from the

    tens to hundreds of kilobits per second (2G cellular telephony) to tens of

    megabits per second (wireless Ethernet, popularly known as WiFi).

    Modern (wired) Ethernet, by comparison, provides data rates on the order of

    hundreds of megabits per second.

Other characteristics that distinguish wireless connectivity options:

    interference,

    locality of access,

    range,

    support for packet switching,

    seamless roaming throughout a geographical region.

    Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the

    frequency spectrum, which may cause interference with other appliances, such as

    cordless telephones.

Modern wireless networks can transfer data in units called packets, as is done in wired networks, in order to conserve bandwidth.

    Client/Network Relationships

    Mobile units can move freely in a geographic mobility domain, an area that is

    circumscribed by wireless network coverage.

To manage this, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.

Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.

    The communication architecture described earlier is designed to give the mobile

    unit the impression that it is attached to a fixed network, emulating a traditional

    client-server architecture.

    Wireless communications, however, make other architectures possible. One

    alternative is a mobile ad-hoc network (MANET), illustrated in 29.2.

    In a MANET, co-located mobile units do not need to communicate via a fixed

    network, but instead, form their own using cost-effective technologies such as

    Bluetooth.


    In a MANET, mobile units are responsible for routing their own data, effectively

    acting as base stations as well as clients.

    Moreover, they must be robust enough to handle changes in the network

    topology, such as the arrival or departure of other mobile units.

    MANET applications can be considered as peer-to-peer, meaning that a mobile unit

    is simultaneously a client and a server.

    Transaction processing and data consistency control become more difficult

    since there is no central control in this architecture.

    Resource discovery and data routing by mobile units make computing in a

    MANET even more complicated.

    Sample MANET applications are multi-user games, shared whiteboard,

    distributed calendars, and battle information sharing.

    Characteristics of Mobile Environments

    The characteristics of mobile computing include:

    Communication latency

    Intermittent connectivity

    Limited battery life

    Changing client location

    The server may not be able to reach a client.

    A client may be unreac