
    ANNA UNIVERSITY- CHENNAI-JUNE 2010 & DECEMBER 2010

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    SUB CODE/SUB NAME: CS9221 / DATABASE TECHNOLOGY

    Part A (10*2=20 Marks)

    1. What is fragmentation? (JUNE 2010)

    Fragmentation is a database server feature that allows you to control where data is stored at

    the table level. Fragmentation enables you to define groups of rows or index keys within a

    table according to some algorithm or scheme. You use SQL statements to create the

    fragments and assign them to dbspaces.

    2. What is Concurrency control? (JUNE 2010) (NOV/ DEC 2010)

    Concurrency control is the activity of coordinating concurrent accesses to a database in a

    multiuser system. Concurrency control allows users to access a database in a multiprogrammed fashion while preserving the consistency of the data.

    3. What is Persistence? (JUNE 2010) (NOV/ DEC 2010)

    Persistence is the property of an object through which its existence transcends time (i.e., the object continues to exist after its creator ceases to exist) and/or space (i.e., the object's location moves from the address space in which it was created).

    4. What is Transaction Processing? (JUNE 2010)

    A Transaction Processing System (TPS) is an information system that processes data transactions in a database system and monitors transaction programs (a special kind of program). For example, when an electronic payment is made, the amount must be both withdrawn from one account and added to the other; it cannot complete only one of those steps. Either both must occur, or neither. In case of a failure preventing transaction completion, the partially executed transaction must be 'rolled back' by the TPS.

    5. What is Client/Server model? (JUNE 2010)


    The server in a client/server model is simply the DBMS, whereas the client is the database

    application serviced by the DBMS.

    The client/server model of a database system is classified into basic & distributed

    client/server model.

    6. What is the difference between data warehousing and data mining? (JUNE 2010)

    Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It thus provides the enterprise with a storage mechanism for its huge amount of data.

    Data mining: It is the process of extracting interesting patterns and knowledge from huge

    amount of data. So we can apply data mining techniques on the data warehouse of an

    enterprise to discover useful patterns.

    7. Why do we need Normalization? (JUNE 2010)

    Normalization is a process followed to eliminate redundant data and establish meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.

    8. What is Integrity? (JUNE 2010) (NOV/ DEC 2010)

    Integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modeling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.

    9. Give two features of Multimedia Databases (JUNE 2010) (NOV/ DEC 2010)

    Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video, tapes, audio records, etc.) so that they can be used (that is, efficiently accessed and searched) for as many applications as needed.

    The objects of multimedia data are text, images, graphics, sound recordings, video recordings, signals, etc., that are digitized and stored.

    10. What are Deductive Databases? (JUNE 2010) (NOV/ DEC 2010)

    A deductive database is the combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine which allows the derivation of information implied by the facts and rules.

    A deductive database system specifies rules through a declarative language - a language in which we specify what to achieve rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is related to the relational data model and also to the field of logic programming and the Prolog language.

    11. What is query processing? (NOV/ DEC 2010)

    Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the query and translating it into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain suitable execution strategies, and then actually executing the query to get the result.

    12.Give two features of object-oriented databases. (NOV/ DEC 2010)

    The features of Object Oriented Databases are,

    It provides persistent storage for objects.

    They may provide one or more of the following: a query language; indexing;

    transaction support with rollback and commit; the possibility of distributing objects

    transparently over many servers.


    13. What is Data warehousing? (NOV/ DEC 2010)

    Data warehousing: It is the process used to integrate and combine data from multiple sources and format them into a single unified schema. It thus provides the enterprise with a storage mechanism for its huge amount of data.

    14. What is Normalization? (NOV/ DEC 2010)

    Normalization is a process followed to eliminate redundant data and establish meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and also for performance tuning.

    15. Mention two features of parallel Databases. (NOV/ DEC 2010)

    It is used to provide speedup, where queries are executed faster because more

    resources, such as processors and disks, are provided.

    It is also used to provide scaleup, where increasing workloads are handled without

    increased response time, via an increase in the degree of parallelism.


    Part B (5*16=80 Marks) (JUNE 2010) & (DECEMBER 2010)

    1. (a)Explain the architecture of Distributed Databases.(16) (JUNE 2010)

    Or

    (b) Discuss in detail the architecture of distributed database. (16)

    (NOV/DEC 2010)


    (b)Write notes on the following:

    (i) Query processing. (8) (JUNE 2010)


    (ii) Transaction processing. (8) (JUNE 2010)

    A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency. In a distributed setting this requires, among other things, concurrency transparency and failure transparency (both discussed below).


    Example Transaction (SQL version):

        Begin_transaction Reservation
        begin
            input(flight_no, date, customer_name);
            EXEC SQL UPDATE FLIGHT
                SET STSOLD = STSOLD + 1
                WHERE FNO = flight_no AND DATE = date;
            EXEC SQL INSERT
                INTO FC(FNO, DATE, CNAME, SPECIAL)
                VALUES (flight_no, date, customer_name, null);
            output("reservation completed")
        end. {Reservation}
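    The same all-or-nothing behaviour can be sketched in Python with the standard sqlite3 module. This is only a minimal illustration of the reservation transaction above, not part of the original answer; the FLIGHT/FC table layout and the make_reservation helper are assumptions mirroring the pseudocode.

        import sqlite3

        def make_reservation(conn, flight_no, date, customer_name):
            """Update the seat count and insert the booking as one atomic unit."""
            try:
                with conn:  # commits on success, rolls back on any exception
                    conn.execute(
                        "UPDATE FLIGHT SET STSOLD = STSOLD + 1 WHERE FNO = ? AND DATE = ?",
                        (flight_no, date))
                    conn.execute(
                        "INSERT INTO FC (FNO, DATE, CNAME, SPECIAL) VALUES (?, ?, ?, NULL)",
                        (flight_no, date, customer_name))
                print("reservation completed")
            except sqlite3.Error:
                print("reservation failed, transaction rolled back")

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE FLIGHT (FNO TEXT, DATE TEXT, STSOLD INTEGER)")
        conn.execute("CREATE TABLE FC (FNO TEXT, DATE TEXT, CNAME TEXT, SPECIAL TEXT)")
        conn.execute("INSERT INTO FLIGHT VALUES ('AI101', '2010-06-01', 0)")
        make_reservation(conn, "AI101", "2010-06-01", "Rose")

    Either both statements take effect and are committed together, or neither does; a failure inside the block leaves the database unchanged.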

    Properties of Transactions

    ATOMICITY: all or nothing.
    CONSISTENCY: no violation of integrity constraints.
    ISOLATION: concurrent changes are invisible, i.e. execution is equivalent to a serializable one.
    DURABILITY: committed updates persist.

    These are the ACID properties of a transaction.

    Atomicity

    Either all or none of the transaction's operations are performed. Atomicity requires that if a

    transaction is interrupted by a failure, its partial results must be undone. The activity of preserving

    the transaction's atomicity in presence of transaction aborts due to input errors, system overloads,

    or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of

    system crashes is called crash recovery.

    Consistency

    Internal consistency

    A transaction which executes alone against a consistent database leaves it in a consistent state.

    Transactions do not violate database integrity constraints.


    Transactions are correct programs.

    Isolation

    Degree 0
        Transaction T does not overwrite dirty data of other transactions.
        (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
    Degree 1
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
    Degree 2
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
        T does not read dirty data from other transactions.
    Degree 3
        T does not overwrite dirty data of other transactions.
        T does not commit any writes before EOT.
        T does not read dirty data from other transactions.
        Other transactions do not dirty any data read by T before T completes.

    Isolation

    Serializability

    If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.

    Incomplete results

    An incomplete transaction cannot reveal its results to other transactions before its commitment.

    Necessary to avoid cascading aborts.

    Durability: Once a transaction commits, the system must guarantee that the results of its

    operations will never be lost, in spite of subsequent failures.

    Database recovery


    Transaction transparency: ensures that all distributed transactions maintain the distributed database's integrity and consistency.

    A distributed transaction accesses data stored at more than one location. Each transaction is divided into a number of subtransactions, one for each site that has to be accessed. The DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions.

    Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order. Replication makes concurrency control more complex.

    Failure transparency: the DDBMS must ensure the atomicity and durability of the global transaction, i.e. that the subtransactions of the global transaction either all commit or all abort.

    Classification transparency: IBM's Distributed Relational Database Architecture (DRDA) defines four types of transactions:

    Remote request

    Remote unit of work

    Distributed unit of work

    Distributed request.

    2. (a)Discuss the Modeling and design approaches for Object Oriented Databases

    (JUNE 2010)

    Or

    (b) Describe modeling and design approaches for object oriented database. (16)

    (NOV/DEC 2010)


    MODELING AND DESIGN

    Basically, an OODBMS is an object database that provides DBMS capabilities to objects that

    have been created using an object-oriented programming language (OOPL). The basic

    principle is to add persistence to objects and to make objects persistent. Consequently

    application programmers who use OODBMSs typically write programs in a native OOPL such

    as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an

    extension of the OOPL.

    Object-oriented DBMSs, however, go much beyond simply adding persistence to any one

    object-oriented programming language. This is because, historically, many object-oriented

    DBMSs were built to serve the market for computer-aided design/computer-aided

    manufacturing (CAD/CAM) applications in which features like fast navigational access,

    versions, and long transactions are extremely important.

    Object-oriented DBMSs, therefore, support advanced object-oriented database applications

    with features like support for persistent objects from more than one programming language,

    distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

    Object data modeling

    An object consists of three parts: structure (attributes and relationships to other objects, such as aggregation and association), behavior (a set of operations) and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate structure and relationships.


    Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex object types would, in an RDBMS, usually be represented as other tables linked by keys to the Book table.

    Relationships: publishedBy and writtenBy are associations with 1:N and 1:1 relationships; composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.

    Generalization/specialization is the is-a relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book, therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

    Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123) is a request to execute the insert method on a Publisher object.

    Method: defines the behavior of an object. Methods can be used to change the object's state by modifying its attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.

    The main differences between relational database design and object-oriented database design include:

    Many-to-many relationships must be removed before entities can be translated into relations.

    Many-to-many relationships can be implemented directly in an object-oriented database.

    Operations are not represented in the relational data model. Operations are one of the main


    components in an object-oriented database. In the relational data model relationships are

    implemented by primary and foreign keys. In the object model objects communicate through their

    interfaces. The interface describes the data (attributes) and operations (methods) that are visible to

    other objects.
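    A brief Python sketch of the Book example discussed above; the class bodies and attribute values are illustrative assumptions, but the attribute names publishedBy, writtenBy and the ArtBook subclass follow the text. It shows complex-typed attributes (association), an aggregation of chapters, a message sent to a Publisher object, and the is-a relationship via subclassing.

        class Publisher:
            def __init__(self, name, city):
                self.name = name
                self.city = city

            def insert(self, name, code):
                # Method invoked when another object sends the "insert" message.
                print(f"insert({name}, {code}) executed on Publisher {self.name}")

        class Author:
            def __init__(self, name):
                self.name = name

        class Chapter:
            def __init__(self, title):
                self.title = title

        class Book:
            def __init__(self, title, publishedBy, writtenBy, chapters):
                self.title = title
                self.publishedBy = publishedBy   # complex-typed attribute (association)
                self.writtenBy = writtenBy       # 1:1 association with an Author object
                self.chapters = chapters         # aggregation: a Book is composed of chapters

        class ArtBook(Book):
            """An ArtBook is a Book: the subclass inherits all attributes and methods."""
            def __init__(self, title, publishedBy, writtenBy, chapters, plates):
                super().__init__(title, publishedBy, writtenBy, chapters)
                self.plates = plates

        p = Publisher("Rose", "Chennai")
        b = ArtBook("Temple Art", p, Author("Kumar"), [Chapter("Origins")], plates=40)
        b.publishedBy.insert("Rose", 123)   # message: ask the Publisher object to run insert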

    (b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)

    Or

    (a) Explain the Multi-Version Locks and Recovery in Query Languages.

    (DECEMBER2010)

    Multi-Version Locks

    Multiversion concurrency control (abbreviated MCC or MVCC) is, in the database field of computer science, a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

    For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.


    MVCC also provides potential "point in time" consistent views. In fact read transactions under

    MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and

    read these "versions" of the data.

    This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes are occurring at a later transaction ID.

    In other words, MVCC provides each user connected to the database with a "snapshot" of the

    database for that person to work with. Any changes made will not be seen by other users of the

    database until the transaction has been committed.

    MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp TS(Ti).

    If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; in other words, a write cannot complete if there are outstanding transactions with an earlier timestamp.

    Every object also has a read timestamp, and if a transaction Ti wants to write to object P and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), then Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

    The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
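    A hedged Python sketch of the timestamp rules just described; the VersionedObject class and its field names are assumptions made for illustration. Each version carries a write timestamp, a reader gets the newest version not later than its own timestamp, and a writer whose timestamp is older than the object's read timestamp is aborted.

        class VersionedObject:
            """Keeps every committed version of a value, tagged with timestamps."""
            def __init__(self, value, ts):
                self.versions = [(ts, value)]   # (write_ts, value) pairs, oldest first
                self.read_ts = ts               # largest timestamp of any reader so far

            def read(self, ts):
                # Return the most recent version whose write timestamp precedes ts.
                visible = [(wts, v) for wts, v in self.versions if wts <= ts]
                if not visible:
                    raise LookupError("no version visible at this timestamp")
                self.read_ts = max(self.read_ts, ts)
                return max(visible)[1]

            def write(self, value, ts):
                # Abort the writer if a younger transaction has already read the object.
                if ts < self.read_ts:
                    raise RuntimeError(f"T{ts} aborted: object already read at {self.read_ts}")
                self.versions.append((ts, value))
                self.versions.sort()

        obj = VersionedObject("Foo", ts=0)
        obj.write("Hello", ts=2)       # new version added, old one kept
        print(obj.read(ts=1))          # -> Foo   (snapshot as of t1)
        print(obj.read(ts=3))          # -> Hello (latest committed version)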

    At t1 the state of the DB could be:

        Time    Object1    Object2
        t1      "Hello"    "Bar"
        t0      "Foo"      "Bar"

    This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions; it will be deleted later.


    If a long-running transaction starts a read operation, it will operate at transaction "t1" and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

        Time    Object1    Object2      Object3
        t2      "Hello"    (deleted)    "Foo-Bar"
        t1      "Hello"    "Bar"
        t0      "Foo"      "Bar"

    Now there is a new version as of transaction ID t2. Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

    Recovery


    3. (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)

    Or

    (a) Explain the features of data warehousing and data mining. (16)

    (NOV/DEC 2010)

    Data Warehouse:

    Large organizations have complex internal organizations, and have data stored at different

    locations, on different operational (transaction processing) systems, under different

    schemas

    Data sources often store only current data, not historical data

    Corporate decision making requires a unified view of all organizational data, including

    historical data

    A data warehouse is a repository (archive) of information gathered from multiple sources,

    stored under a unified schema, at a single site

    Greatly simplifies querying, permits study of historical trends

    Shifts decision support query load away from transaction processing systems

    When and how to gather data

    Source driven architecture: data sources transmit new information to warehouse, either

    continuously or periodically (e.g. at night)

    Destination driven architecture: warehouse periodically requests new information from

    data sources

    Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit)

    is too expensive


    Usually OK to have slightly out-of-date data at the warehouse. Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

    What schema to use

    Schema integration

    Data cleansing

    E.g. correct mistakes in addresses

    E.g. misspellings, zip code errors

    Merge address lists from different sources and purge duplicates

    Keep only one address record per household (householding)

    How to propagate updates

    Warehouse schema may be a (materialized) view of schema from data sources

    Efficient techniques for update of materialized views

    What data to summarize

    Raw data may be too large to store on-line

    Aggregate values (totals/subtotals) often suffice

    Queries on raw data can often be transformed by query optimizer to use aggregate values.


    Typically warehouse data is multidimensional, with very large fact tables

    Examples of dimensions: item-id, date/time of sale, store where sale was made, customer

    identifier

    Examples of measures: number of items sold, price of items

    Dimension values are usually encoded using small integers and mapped to full values via

    dimension tables

    Resultant schema is called a star schema

    More complicated schema structures

    Snowflake schema: multiple levels of dimension tables

    Constellation: multiple fact tables
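    A small Python/sqlite3 sketch of a star schema of the kind described above; the table and column names and the sample rows are illustrative assumptions. It builds a fact table keyed by small integer dimension ids, two dimension tables, and runs the sort of aggregate query a decision-support workload would issue.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        cur = conn.cursor()

        # Dimension tables map small integer keys to full values.
        cur.execute("CREATE TABLE item_dim  (item_id INTEGER PRIMARY KEY, item_name TEXT)")
        cur.execute("CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city TEXT)")

        # Fact table: one row per sale, with measures number_sold and price.
        cur.execute("""CREATE TABLE sales_fact (
                           item_id INTEGER, store_id INTEGER, sale_date TEXT,
                           number_sold INTEGER, price REAL)""")

        cur.executemany("INSERT INTO item_dim VALUES (?, ?)", [(1, "pen"), (2, "notebook")])
        cur.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, "Chennai"), (2, "Madurai")])
        cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
            (1, 1, "2010-06-01", 10, 5.0),
            (2, 1, "2010-06-01", 3, 40.0),
            (1, 2, "2010-06-02", 7, 5.0),
        ])

        # Typical warehouse query: total revenue per city, aggregated over the fact table.
        for city, revenue in cur.execute("""
                SELECT s.city, SUM(f.number_sold * f.price)
                FROM sales_fact f JOIN store_dim s ON f.store_id = s.store_id
                GROUP BY s.city"""):
            print(city, revenue)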

    Data Mining

    Broadly speaking, data mining is the process of semi-automatically analyzing large

    databases to find useful patterns.

    Like knowledge discovery in artificial intelligence, data mining discovers statistical rules

    and patterns

    Differs from machine learning in that it deals with large volumes of data stored primarily

    on disk.

    Some types of knowledge discovered from a database can be represented by a set of rules.

    e.g.,: Young women with annual incomes greater than $50,000 are most likely to buy

    sports cars.


    Other types of knowledge represented by equations, or by prediction functions.

    Some manual intervention is usually required

    Pre-processing of data, choice of which type of pattern to find, postprocessing to

    find novel patterns

    Applications of Data Mining

    Prediction based on past history

    Predict if a credit card applicant poses a good credit risk, based on some attributes

    (income, job type, age, ..) and past history

    Predict if a customer is likely to switch brand loyalty

    Predict if a customer is likely to respond to junk mail

    Predict if a pattern of phone calling card usage is likely to be fraudulent

    Some examples of prediction mechanisms:

    Classification

    Given a training set consisting of items belonging to different classes, and a new

    item whose class is unknown, predict which class it belongs to.

    Regression formulae

    Given a set of parameter-value to function-result mappings for an unknown

    function, predict the function-result for a new parameter-value

    Descriptive Patterns

    Associations

    Find books that are often bought by the same customers. If a new customer buys

    one such book, suggest that he buys the others too.

    Other similar applications: camera accessories, clothes, etc.

    Associations may also be used as a first step in detecting causation

    E.g. association between exposure to chemical X and cancer, or new medicine and

    cardiac problems

    Clusters

    E.g. typhoid cases were clustered in an area surrounding a contaminated well

    Detection of clusters remains important in detecting epidemics


    Classification Rules

    Classification rules help assign new objects to a set of classes. E.g., given a new

    automobile insurance applicant, should he or she be classified as low risk, medium risk or

    high risk?

    Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, salary of the applicant, age of the applicant, etc.

    ∀ person P, P.degree = masters and P.income > 75,000
        ⇒ P.credit = excellent

    ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000)
        ⇒ P.credit = good

    Rules are not necessarily exact: there may be some misclassifications

    Classification rules can be compactly shown as a decision tree.
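    The two rules above can be read directly as a decision procedure. A minimal Python sketch follows; the fallback class for applicants matched by neither rule is an assumption, since the text notes that rule sets need not cover or classify every case correctly.

        def credit_class(degree, income):
            """Assign a credit class using the two classification rules in the text."""
            if degree == "masters" and income > 75_000:
                return "excellent"
            if degree == "bachelors" and 25_000 <= income <= 75_000:
                return "good"
            return "unknown"   # no rule fires; real rule sets may misclassify some cases

        print(credit_class("masters", 90_000))    # excellent
        print(credit_class("bachelors", 50_000))  # good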

    Decision Tree

    Training set: a data sample in which the grouping for each tuple is already known.

    Consider credit risk example: Suppose degree is chosen to partition the data at the root.

    Since degree has a small number of possible values, one child is created for each

    value.

    At each child node of the root, further classification is done if required. Here, partitions are

    defined by income.

    Since income is a continuous attribute, some number of intervals are chosen, and

    one child created for each interval.

    Different classification algorithms use different ways of choosing which attribute to

    partition on at each node, and what the intervals, if any, are.

    In general

    Different branches of the tree could grow to different levels.

    Different nodes at the same level may use different partitioning attributes.

    Greedy top down generation of decision trees.


    Each internal node of the tree partitions the data into groups based on a partitioning

    attribute, and a partitioning condition for the node

    More on choosing the partitioning attribute/condition shortly.

    Algorithm is greedy: the choice is made once and not revisited as more of the tree is

    constructed

    The data at a node is not partitioned further if either

    All (or most) of the items at the node belong to the same class, or

    All attributes have been considered, and no further partitioning is possible.

    Such a node is a leaf node.

    Otherwise the data at the node is partitioned further by picking an attribute for

    partitioning data at the node.

    Decision-Tree Construction Algorithm

        Procedure GrowTree(S)
            Partition(S);

        Procedure Partition(S)
            if (purity(S) > δp or |S| < δs) then return;

    Although it is possible to replace a non-binary (n-ary, n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

    Mapping Cardinalities

    Express the number of entities to which another entity can be associated via a relationship

    set.

    Most useful in describing binary relationship sets.

    For a binary relationship set the mapping cardinality must be one of the following types:

    One to one

    One to many

    Many to one

    Many to many

    We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (-), signifying "many", between the relationship set and the entity set.

    One-To-One Relationship

    A customer is associated with at most one loan via the relationship borrower

    A loan is associated with at most one customer via borrower


    One-To-Many and Many-To-One Relationship

    In the one-to-many relationship (a), a loan is associated with at most one customer via

    borrower; a customer is associated with several (including 0) loans via borrower

    In the many-to-one relationship (b), a loan is associated with several (including 0)

    customers via borrower; a customer is associated with at most one loan via borrower

    Many-To-Many Relationship

    A customer is associated with several (possibly 0) loans via borrower

    A loan is associated with several (possibly 0) customers via borrower

    Existence Dependencies


    If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.

        y is a dominant entity (in the example below, loan)
        x is a subordinate entity (in the example below, payment)

    If a loan entity is deleted, then all its associated payment entities must be deleted also.

    E-R Diagram Components

    Rectangles represent entity sets.

    Ellipses represent attributes.

    Diamonds represent relationship sets.

    Lines link attributes to entity sets and entity sets to relationship sets.

    Double ellipses represent multivalued attributes.

    Dashed ellipses denote derived attributes.

    Primary key attributes are underlined.

    Weak Entity Sets

    An entity set that does not have a primary key is referred to as a weak entity set.

    The existence of a weak entity set depends on the existence of a strong entity set; it must

    relate to the strong set via a one-to-many relationship set.

    The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.

    The primary key of a weak entity set is formed by the primary key of the strong entity set

    on which the weak entity set is existence dependent, plus the weak entity set's

    discriminator.

    We depict a weak entity set by double rectangles.

    We underline the discriminator of a weak entity set with a dashed line.

    payment-number -- discriminator of the payment entity set

    Primary key for payment -- (loan-number, payment-number)

    Specialization


    Top-down design process; we designate subgroupings within an entity set that are

    distinctive from other entities in the set.

    These subgroupings become lower-level entity sets that have attributes or participate in

    relationships that do not apply to the higher-level entity set.

    Depicted by a triangle component labeled ISA (e.g., savings-account is an account).

    Generalization

    A bottom-up design process -- combine a number of entity sets that share the same features

    into a higher-level entity set

    Specialization and generalization are simple inversions of each other; they are represented

    in an E-R diagram in the same way.

    Attribute Inheritance -- a lower-level entity set inherits all the attributes and relationship

    participation of the higher-level entity set to which it is linked.

    Design Constraints on a Generalization

    Constraint on which entities can be members of a given lower-level entity set.

    condition-defined

    user-defined

    Constraint on whether or not entities may belong to more than one lower-level entity set

    within a single generalization.

    disjoint

    overlapping

    Completeness constraint -- specifies whether or not an entity in the higher-level entity set

    must belong to at least one of the lower-level entity sets within a generalization.

    Total

    Partial

    Aggregation

    Relationship sets borrower and loan-officer represent the same information.

    Eliminate this redundancy via aggregation

    Treat relationship as an abstract entity

    Allows relationships between relationships

    Abstraction of relationship into new entity


    Without introducing redundancy , the following diagram represents that:

    A customer takes out a loan

    An employee may be a loan officer for a customer-loan pair

    E-R Design Decisions

    The use of an attribute or entity set to represent an object

    Whether a real-world concept is best expressed by an entity set or a relationship set.

    The use of a ternary relationship versus a pair of binary relationships.

    The use of a strong or weak entity set.

    The use of generalization -- contributes to modularity in the design.

    The use of aggregation -- can treat the aggregate entity set as a single unit without concern

    for the details of its internal structure.

    Reduction of an E-R schema to Tables

    Primary keys allow entity sets and relationship sets to be expressed uniformly as tables

    which represent the contents of the database.

    A database which conforms to an E-R diagram can be represented by a collection of tables.

    For each entity set and relationship set there is a unique table which is assigned the name of

    the corresponding entity set or relationship set.

    Each table has a number of columns (generally corresponding to attributes), which have unique names.

    Converting an E-R diagram to a table format is the basis for deriving a relational database

    design from an E-R diagram.

    Representing Entity Sets as Tables

    A strong entity set reduces to a table with the same attributes.

    A weak entity set becomes a table that includes a column for the primary key of the

    identifying strong entity set.

    Representing Relationship Sets as Tables

    A many-to-many relationship set is represented as a table with columns for the primary

    keys of the two participating entity sets, and any descriptive attributes of the relationship

    set.

    The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
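    A short Python/sqlite3 sketch of the reduction just described; the exact table layouts are assumptions based on the loan/payment example in the text. The strong entity set keeps its own key, while the weak entity set's table carries the identifying strong key plus the discriminator, which is why a separate loan-payment table would be redundant.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        # Strong entity set: a table with its own primary key.
        conn.execute("CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL)")
        # Weak entity set: primary key = identifying strong key + discriminator.
        conn.execute("""CREATE TABLE payment (
                            loan_number    TEXT REFERENCES loan(loan_number),
                            payment_number INTEGER,          -- discriminator
                            payment_date   TEXT,
                            payment_amount REAL,
                            PRIMARY KEY (loan_number, payment_number))""")

        conn.execute("INSERT INTO loan VALUES ('L-17', 1000.0)")
        conn.execute("INSERT INTO payment VALUES ('L-17', 1, '2010-06-05', 100.0)")
        # The loan-payment relationship is already captured by payment.loan_number,
        # so no separate loan-payment table is needed.
        print(conn.execute("SELECT * FROM payment").fetchall())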

    E-R Diagram for Banking Enterprise

    Representing Generalization as Tables

    Method 1: Form a table for the generalized (higher-level) entity set, e.g. account, and form a table for each lower-level entity set, including the primary key of the generalized entity set along with its own attributes.

    Method 2: Form a table for each lower-level entity set, containing all of its local and inherited attributes.

    (b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)


    Or

    (a) Give features of Temporal and Spatial Databases. Temporal Database.

    (DECEMBER 2010)

    Temporal Database

    Time Representation, Calendars, and Time Dimensions

    Time is considered ordered sequence of points in some granularity

    Use the term chronon instead of point to describe the minimum granularity.

    A calendar organizes time into different time units for convenience.

    Accommodates various calendars

    Gregorian (western), Chinese, Islamic, Etc.

    Point events

    Single time point event

    E.g., bank deposit

    Series of point events can form a time series data

    Duration events

    Associated with specific time period

    Time period is represented by start time and end time

    Transaction time

    The time when the information from a certain transaction becomes current in the database.

    Bitemporal database

    Databases dealing with two time dimensions

    Incorporating Time in Relational Databases Using Tuple Versioning

    Add to every tuple

    Valid start time

    Valid end time
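    A minimal Python sketch of tuple versioning as outlined above; the relation layout, the FAR_FUTURE marker and the helper name are assumptions. Every tuple carries a valid start time and a valid end time, and an update closes the current version and opens a new one instead of overwriting it.

        import datetime

        FAR_FUTURE = datetime.date(9999, 12, 31)   # "still valid" marker

        # Each row: (emp_id, salary, valid_start, valid_end)
        emp_history = [("E1", 30_000, datetime.date(2009, 1, 1), FAR_FUTURE)]

        def update_salary(history, emp_id, new_salary, change_date):
            """Close the current version of the tuple and append the new one."""
            new_history = []
            for eid, sal, start, end in history:
                if eid == emp_id and end == FAR_FUTURE:
                    new_history.append((eid, sal, start, change_date))      # old version closed
                else:
                    new_history.append((eid, sal, start, end))
            new_history.append((emp_id, new_salary, change_date, FAR_FUTURE))  # new version
            return new_history

        emp_history = update_salary(emp_history, "E1", 35_000, datetime.date(2010, 6, 1))
        for row in emp_history:
            print(row)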


    Incorporating Time in Object-Oriented Databases Using Attribute Versioning

    A single complex object stores all temporal changes of the object

    Time varying attribute

    An attribute that changes over time

    E.g., age

    Non-Time varying attribute

    An attribute that does not change over time

    E.g., date of birth

    Spatial Database


    Types of Spatial Data

    Point Data

    Points in a multidimensional space

    E.g., Raster data such as satellite imagery, where each

    pixel stores a measured value

    E.g., Feature vectors extracted from text

    Region Data

    Objects have spatial extent with location and boundary.

    DB typically uses geometric approximations constructed using line segments, polygons,

    etc., called vector data.

    Types of Spatial Queries

    Spatial Range Queries

    Find all cities within 50 miles of Madison

    Query has associated region (location, boundary)

    Answer includes overlapping or contained data regions.

    Nearest-Neighbor Queries
    Find the 10 cities nearest to Madison.

    Results must be ordered by proximity

    Spatial Join Queries

    Find all cities near a lake

    Expensive, join condition involves regions and proximity
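    A small Python sketch of a spatial range query over point data; the city coordinates and the distance threshold are made-up values used only for illustration. It simply tests Euclidean distance against the query radius, which is the answer an R-tree would produce without having to scan every point.

        import math

        cities = {
            "Madison":   (43.07, -89.40),
            "Milwaukee": (43.04, -87.91),
            "Chicago":   (41.88, -87.63),
        }

        def within_radius(points, centre, radius):
            """Naive spatial range query: keep points whose distance <= radius."""
            cx, cy = centre
            return [name for name, (x, y) in points.items()
                    if math.hypot(x - cx, y - cy) <= radius]

        # "Find all cities within (roughly) 1.5 degrees of Madison"
        print(within_radius(cities, cities["Madison"], 1.5))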

    Applications of Spatial Data

    Geographic Information Systems (GIS)

    E.g., ESRIs ArcInfo; OpenGIS Consortium

    Geospatial information

    All classes of spatial queries and data are common

    Computer-Aided Design/Manufacturing

    Store spatial objects such as surface of airplane fuselage


    Range queries and spatial join queries are common

    Multimedia Databases

    Images, video, text, etc. stored and retrieved by content

    First converted to feature vector form; high dimensionality

    Nearest-neighbor queries are the most common

    Single-Dimensional Indexes

    B+ trees are fundamentally single-dimensional indexes.

    When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.

    Multi-dimensional Indexes

    A multidimensional index clusters entries so as to exploit nearness in multidimensional space. Keeping track of entries and maintaining a balanced index structure presents a challenge.


    Motivation for Multidimensional Indexes

    Spatial queries (GIS, CAD).

    Find all hotels within a radius of 5 miles from the conference venue.

    Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.

    Find all cities that lie on the Nile in Egypt.

    Find all parts that touch the fuselage (in a plane design).

    Similarity queries (content-based retrieval). Given a face, find the five most similar faces.

    Multidimensional range queries. 50 < age < 55 AND 80K < sal < 90K

    Drawbacks

    An index based on spatial location is needed.
        One-dimensional indexes don't support multidimensional searching efficiently.
        Hash indexes only support point queries; we want to support range queries as well.
        Must support inserts and deletes gracefully.

    Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.

    R-Tree

    R-Tree Properties

    Leaf entry = < n-dimensional box, rid >

    This is Alternative (2), with key value being a box.


    Box is the tightest bounding box for a data object.

    Non-leaf entry = < n-dim box, ptr to child node >

    Box covers all boxes in child node (in fact, subtree).

    All leaves at same distance from root.

    Nodes can be kept 50% full (except root).

    Can choose a parameter m that is at most 50%, and require that every node be at least m% full.

    Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by

    avoiding some branches altogether.

    Cost of overlap test is higher than bounding box intersection, but it is a main-memory

    cost, and can actually be done quite efficiently. Generally a win.

    Insert Entry

    Start at root and go down to best-fit leaf L.

    Go to child whose box needs least enlargement to cover B; resolve ties by going to

    smallest area child.

    If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2.

    Adjust entry for L in its parent so that the box now covers (only) L1.

    Add an entry (in the parent node of L) for L2. (This could cause the parent node to

    recursively split.)

    Splitting a Node during Insertion

    The entries in node L plus the newly inserted entry must be distributed between L1 and L2.

    Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries.

    Idea: Redistribute so as to minimize area of L1 plus area of L2.

    R-Tree Variants

    The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a

    node overflows, instead of splitting.

    Remove some (say, 30% of the) entries and reinsert them into the tree.

    Could result in all reinserted entries fitting on some existing pages, avoiding a split.

    R* trees also use a different heuristic, minimizing box perimeters rather than box areas

    during insertion.

    Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if

    necessary.

    Searches now take a single path to a leaf, at cost of redundancy.

    GiST

    The Generalized Search Tree (GiST) abstracts the tree nature of a class of indexes

    including B+ trees and R-tree variants.


    Striking similarities in insert/delete/search and even concurrency control algorithms

    make it possible to provide templates for these algorithms that can be customized to

    obtain the many different tree index structures.

    B+ trees are so important (and simple enough to allow further specialization) that they

    are implemented specially in all DBMSs.

    GiST provides an alternative for implementing other tree indexes in an ORDBMS.

    Indexing High-Dimensional Data

    Typically, high-dimensional datasets are collections of points, not regions.

    E.g., Feature vectors in multimedia applications.

    Very sparse

    Nearest neighbor queries are common.

    The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

    As dimensionality increases contrast (ratio of distances between nearest and farthest points)

    usually decreases; nearest neighbor is not meaningful.

    In any given data set, advisable to empirically test contrast.

    5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

    Parallel Databases

    Introduction

    Parallel machines are becoming quite common and affordable

    Prices of microprocessors, memory and disks have dropped sharply

    Databases are growing increasingly large

    Large volumes of transaction data are collected and stored for later analysis.

    multimedia objects like images are increasingly stored in databases

    Large-scale parallel database systems increasingly used for:

    storing large volumes of data

    processing time-consuming decision-support queries


    providing high throughput for transaction processing

    Parallelism in Databases

    Data can be partitioned across multiple disks for parallel I/O.

    Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel

    Data can be partitioned and each processor can work independently on its own partition.

    Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.

    Different queries can be run in parallel with each other.

    Concurrency control takes care of conflicts.

    Thus, databases naturally lend themselves to parallelism.

    I/O Parallelism

    Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.

    Horizontal partitioning: the tuples of a relation are divided among many disks such that each tuple resides on one disk.

    Partitioning techniques (number of disks = n):

    Round-robin:

    Send the ith tuple inserted in the relation to disk i mod n.

    Hash partitioning:

    Choose one or more attributes as the partitioning attributes.

    Choose a hash function h with range 0...n-1.

    Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.

    Partitioning techniques (cont.):

    Range partitioning:

    Choose an attribute as the partitioning attribute.

    A partitioning vector [vo, v1, ..., vn-2] is chosen.


    Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0 and tuples with v ≥ vn-2 go to disk n-1.

    E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
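    A compact Python sketch of the three partitioning techniques; the tuple values and the vector [5, 11] follow the worked example in the text, while n = 3 disks is an assumption.

        import bisect

        n = 3                      # number of disks
        vector = [5, 11]           # range-partitioning vector from the example

        def round_robin(i, n):
            """Send the i-th inserted tuple to disk i mod n."""
            return i % n

        def hash_partition(value, n):
            """Disk chosen by a hash function with range 0..n-1."""
            return hash(value) % n

        def range_partition(value, vector):
            """Tuples with value < 5 go to disk 0, 5..10 to disk 1, >= 11 to disk 2."""
            return bisect.bisect_right(vector, value)

        for v in (2, 8, 20):
            print(v, "-> disk", range_partition(v, vector))
        # prints: 2 -> disk 0, 8 -> disk 1, 20 -> disk 2 (matching the worked example)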

    Comparison of Partitioning Techniques

    Evaluate how well partitioning techniques support the following types of data access:

    1. Scanning the entire relation.

    2. Locating a tuple associatively (point queries). E.g., r.A = 25.

    3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.

    Round robin:

    Advantages

    Best suited for sequential scan of entire relation on each query.

    All disks have almost an equal number of tuples; retrieval work is thus well balanced

    between disks.

    Range queries are difficult to process

    No clustering -- tuples are scattered across all disks

    Hash partitioning:

    Good for sequential access

    Assuming hash function is good, and partitioning attributes form a key, tuples will be

    equally distributed between disks

    Retrieval work is then well balanced between disks.

    Good for point queries on partitioning attribute

    Can lookup single disk, leaving others available for answering other queries.

    Index on partitioning attribute can be local to disk, making lookup and update more

    efficient

    No clustering, so difficult to answer range queries


    Range partitioning:
        Provides data clustering by partitioning attribute value.
        Good for sequential access.
        Good for point queries on the partitioning attribute: only one disk needs to be accessed.
        For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries.
        Good if the result tuples are from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted; this is an example of execution skew.

    Partitioning a Relation across Disks

    If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
    Large relations are preferably partitioned across all the available disks.
    If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

    Handling of Skew

    The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.

    Types of skew:

    Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.

    Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning if a good hash function is chosen.

    Handling Skew in Range-Partitioning

    To create a balanced partitioning vector (assuming partitioning attribute forms a key of the

    relation):

    Sort the relation on the partitioning attribute.

    Construct the partition vector by scanning the relation in sorted order as follows.

    " After every 1/nth of the relation has been read, the value of the partitioning attribute of

    the next tuple is added to the partition vector.


    n denotes the number of partitions to be constructed.

    Duplicate entries or imbalances can result if duplicates are present in partitioning attributes.

    Alternative technique based on histograms used in practice
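    A hedged Python sketch of the balanced-vector construction just described; the sample data is invented. Sort on the partitioning attribute and record a cut point after every 1/n-th of the tuples.

        def balanced_partition_vector(values, n_partitions):
            """Build a range-partitioning vector with roughly equal partitions."""
            ordered = sorted(values)                 # sort relation on the partitioning attribute
            step = len(ordered) // n_partitions      # tuples per partition
            # After every 1/n-th of the relation, take the next value as a cut point.
            return [ordered[i * step] for i in range(1, n_partitions)]

        skewed = [1, 1, 1, 1, 2, 2, 3, 7, 8, 9, 50, 60]   # attribute-value skew
        print(balanced_partition_vector(skewed, 3))        # -> [2, 8]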

    Handling Skew using Histograms

    A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion, assuming a uniform distribution within each range of the histogram. The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.

    Interquery Parallelism

    Queries/transactions execute in parallel with one another.

    Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.

    This is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
        Locking and logging must be coordinated by passing messages between processors.
        Data in a local buffer may have been updated at another processor.
        Cache coherency has to be maintained; reads and writes of data in the buffer must find the latest version of the data.

    Cache Coherency Protocol

    Example of a cache coherency protocol for shared-disk systems:
        Before reading/writing to a page, the page must be locked in shared/exclusive mode.
        On locking a page, the page must be read from disk.
        Before unlocking a page, the page must be written to disk if it was modified.

    More complex protocols with fewer disk reads/writes exist.

    Cache coherency protocols for shared-nothing systems are similar. Each database page is assigned a home processor. Requests to fetch the page or write it to disk are sent to the home processor.

    Intraquery Parallelism

    Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.

    Two complementary forms of intraquery parallelism:
        Intraoperation parallelism: parallelize the execution of each individual operation in the query.
        Interoperation parallelism: execute the different operations in a query expression in parallel.

    The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

    Parallel Sort

    Range-Partitioning Sort
        Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
        Create a range-partition vector with m entries on the sorting attributes.
        Redistribute the relation using range partitioning:
            all tuples that lie in the i-th range are sent to processor Pi;
            Pi stores the tuples it receives temporarily on disk Di;
            this step requires I/O and communication overhead.
        Each processor Pi sorts its partition of the relation locally.
        Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
        The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
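    A sequential Python simulation of range-partitioning sort; the processor count and data are assumptions, and a real system would ship the partitions over the network and sort them concurrently. Redistribute by range, sort each partition "locally", then concatenate.

        import bisect

        def range_partition_sort(tuples, vector):
            """Range-partition, sort each partition 'locally', then concatenate."""
            partitions = [[] for _ in range(len(vector) + 1)]   # one per processor
            for t in tuples:
                partitions[bisect.bisect_right(vector, t)].append(t)   # redistribution step
            for p in partitions:
                p.sort()                     # each processor sorts its partition independently
            merged = []
            for p in partitions:             # trivial final merge: concatenate in order
                merged.extend(p)
            return merged

        data = [17, 3, 25, 9, 14, 1, 30, 12]
        print(range_partition_sort(data, vector=[10, 20]))   # -> [1, 3, 9, 12, 14, 17, 25, 30]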

    Parallel External Sort-Merge

    Assume the relation has already been partitioned among disks D0, ..., Dn-1.
    Each processor Pi locally sorts the data on disk Di.
    The sorted runs on each processor are then merged to get the final sorted output.
    Parallelize the merging of sorted runs as follows:
        The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
        Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
        The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

    Parallel Join

    The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and

    if they do, the pair is added to the join output.

    Parallel join algorithms attempt to split the pairs to be tested over several processors. Each

    processor then computes part of the join locally.

    In a final step, the results from each processor can be collected together to produce the final result.

    Partitioned Join

For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.

Either range partitioning or hash partitioning can be used.

r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.

Partitions ri and si are sent to processor Pi.

    61

  • 7/31/2019 Dbt Anna University Qa June 2010 Dec 2010

    62/90

Each processor Pi locally computes the join of ri with si.

Any of the standard join methods can be used.
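A toy version of this partitioned equi-join is sketched below. It is a single-process sketch; the relation contents, the hash partitioning on the join attribute, and the local hash join are all illustrative assumptions.

# Sketch of a partitioned equi-join: both relations are hash-partitioned on the
# join attribute, and each "processor" joins only its own pair of partitions.

def hash_partition(relation, key_index, n):
    parts = [[] for _ in range(n)]
    for tup in relation:
        parts[hash(tup[key_index]) % n].append(tup)
    return parts

def local_join(r_part, s_part):
    """Simple local hash join of one pair of partitions on r.A = s.B."""
    index = {}
    for s_tup in s_part:
        index.setdefault(s_tup[0], []).append(s_tup)        # s.B is attribute 0 here
    return [r_tup + s_tup
            for r_tup in r_part
            for s_tup in index.get(r_tup[1], [])]           # r.A is attribute 1 here

n = 3
r = [("x", 1), ("y", 2), ("z", 1)]       # (payload, A)
s = [(1, "p"), (2, "q")]                 # (B, payload)
r_parts = hash_partition(r, 1, n)
s_parts = hash_partition(s, 0, n)
result = [t for i in range(n) for t in local_join(r_parts[i], s_parts[i])]
print(result)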

    Fragment-and-Replicate Join

    Partitioning not possible for some join conditions,

    e.g., non-equijoin conditions, such as r.A > s.B.

For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.


    Special case asymmetric fragment-and-replicate:

    One of the relations, say r, is partitioned; any partitioning technique can be used.

The other relation, s, is replicated across all the processors.

Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.

Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used.

E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
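As a small sketch of asymmetric fragment-and-replicate under an invented non-equijoin condition (r.A > s.B), each fragment of r is joined with a full copy of s; the data and the per-fragment loop below are purely illustrative.

# Sketch of asymmetric fragment-and-replicate for a non-equijoin (r.A > s.B).
# r is split into fragments (one per "processor"); s is replicated to all of them.

def fragment(relation, n):
    """Round-robin fragmentation of r; any partitioning technique would do."""
    frags = [[] for _ in range(n)]
    for i, tup in enumerate(relation):
        frags[i % n].append(tup)
    return frags

def local_theta_join(r_frag, s_copy):
    return [(a, b) for a in r_frag for b in s_copy if a[0] > b[0]]   # r.A > s.B

r = [(5,), (1,), (9,), (3,)]
s = [(2,), (4,)]
n = 2
result = [t for frag in fragment(r, n) for t in local_theta_join(frag, s)]
print(result)    # every r tuple was compared with every s tuple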

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join: assume s is smaller than r, and therefore s is chosen as the build relation.


A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors.

Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on the hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.

As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.

Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.

As the r tuples are received at the destination processors, they are repartitioned using the function h2.

Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.

Hash-join optimizations can be applied to the parallel case; e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them out and reading them back in.
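A compressed sketch of the two hash functions is shown below: h1 routes tuples to processors, h2 builds the local hash table for the build/probe phases. All names and data are illustrative assumptions, and the "processors" are simulated by a loop.

# Sketch of partitioned parallel hash-join: h1 chooses the processor,
# h2 partitions/indexes the build relation locally at that processor.

N_PROCS = 4

def h1(join_value):
    return hash(("h1", join_value)) % N_PROCS          # which processor gets the tuple

def h2(join_value):
    return hash(("h2", join_value)) % 8                # local bucket within a processor

def parallel_hash_join(r, s, r_key, s_key):
    # Distribution phase: route build (s) and probe (r) tuples by h1.
    s_at = [[] for _ in range(N_PROCS)]
    r_at = [[] for _ in range(N_PROCS)]
    for t in s:
        s_at[h1(t[s_key])].append(t)
    for t in r:
        r_at[h1(t[r_key])].append(t)

    result = []
    for p in range(N_PROCS):                           # each processor works on its own si, ri
        buckets = {}
        for t in s_at[p]:                              # build phase using h2
            buckets.setdefault((h2(t[s_key]), t[s_key]), []).append(t)
        for t in r_at[p]:                              # probe phase
            result += [t + m for m in buckets.get((h2(t[r_key]), t[r_key]), [])]
    return result

r = [("r1", 10), ("r2", 20), ("r3", 10)]
s = [(10, "s1"), (20, "s2")]
print(parallel_hash_join(r, s, r_key=1, s_key=0))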

    Parallel Nested-Loop Join

Assume that:

relation s is much smaller than relation r, and that r is stored by partitioning;

there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r retained.

Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.

Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism

Pipelined parallelism: consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4.

Set up a pipeline that computes the three joins in parallel:

Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

P2 be assigned the computation of temp2 = temp1 ⋈ r3,

and P3 be assigned the computation of temp2 ⋈ r4.

Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.
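A generator-based sketch of such a pipeline is shown below; each stage hands tuples downstream as soon as they are produced. The relations and the simple nested-loop join used inside each stage are illustrative assumptions, not a real executor.

# Sketch of pipelined parallelism: each join stage is a generator that emits
# result tuples as they are produced, so the next stage can start consuming
# them before the previous stage has finished.

def join_stage(left_stream, right_relation, left_key, right_key):
    for l in left_stream:
        for r in right_relation:
            if l[left_key] == r[right_key]:
                yield l + r                     # pass the tuple downstream immediately

r1 = [(1, "a"), (2, "b")]
r2 = [(1, "x"), (2, "y")]
r3 = [("x", 10), ("y", 20)]
r4 = [(10, "final"), (20, "final")]

temp1 = join_stage(iter(r1), r2, left_key=0, right_key=0)   # P1: r1 join r2
temp2 = join_stage(temp1, r3, left_key=3, right_key=0)      # P2: temp1 join r3
result = join_stage(temp2, r4, left_key=5, right_key=0)     # P3: temp2 join r4
print(list(result))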

    Independent Parallelism

Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4.

Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

P2 be assigned the computation of temp2 = r3 ⋈ r4,

and P3 be assigned the computation of temp1 ⋈ temp2.


P1 and P2 can work independently in parallel.

    P3 has to wait for input from P1 and P2

    Can pipeline output of P1 and P2 to P3, combining independent parallelism and

    pipelined parallelism

    Does not provide a high degree of parallelism

    useful with a lower degree of parallelism.

    less useful in a highly parallel system.

    Query Optimization

    Query optimization in parallel databases is significantly more complex than query optimization in

    sequential databases.

    Cost models are more complicated, since we must take into account partitioning costs and issues

    such as skew and resource contention.

    When scheduling execution tree in parallel system, must decide:

    How to parallelize each operation and how many processors to use for it.

    What operations to pipeline, what operations to execute independently in parallel, and what

    operations to execute sequentially, one after the other.

    Determining the amount of resources to allocate for each operation is a problem.

    E.g., allocating more processors than optimal can result in high communication overhead.

Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

    Text Databases

    A text database is a compilation of documents or other information in the form of a database in

    which the complete text of each referenced document is available for online viewing, printing, or

    downloading. In addition to text documents, images are often included, such as graphs, maps,

    photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and

    staff. Full-text databases are ideally suited to online courses of study, where the student remains at

    home and obtains course materials by downloading them from the Internet. Access to these

    databases is normally restricted to registered personnel or to people who pay a specified fee per


    viewed item. Full-text databases are also used by some corporations, law offices, and government

    agencies.
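Since a text database is searchable by keyword or phrase, a minimal illustration of how such a search could be indexed is sketched below; the documents and the simple whitespace tokenization are invented for the example.

# Minimal sketch of keyword search over a text database using an inverted index.

from collections import defaultdict

docs = {
    1: "parallel databases speed up long running queries",
    2: "text databases are searchable by keyword or phrase",
}

index = defaultdict(set)                      # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def keyword_search(*terms):
    """Return ids of documents containing all of the given keywords."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(keyword_search("databases"))            # {1, 2}
print(keyword_search("databases", "keyword")) # {2}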

    (b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

    Rule-based systems are used as a way to store and manipulate knowledge to interpret information

    in a useful way. They are often used in artificial intelligence applications and research.

Rule-based systems are specialized software that encapsulate human knowledge and expertise, thereby making intelligent decisions quickly and in a repeatable form. They are also known as expert systems and form part of the field of artificial intelligence.

    Rule based systems are:

    Knowledge based systems

    Part of the Artificial Intelligence field

    Computer programs that contain some subject-specific knowledge of one or more

    human experts

    Made up of a set of rules that analyze user supplied information about a specific

    class of problems.

    Systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering: building an expert system.

Knowledge Engineers: the people who build the system.

Knowledge Representation: the symbols used to represent the knowledge.

Factual Knowledge: knowledge of a particular task domain that is widely shared.

Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

    Uses of Rule based Systems

Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.

    Solves problems that would normally be tackled by a medical or other professional.

    Currently used in fields such as accounting, medicine, process control, financial service,

    production, and human resources

    Applications


    A classic example of a rule-based system is the domain-specific expert system that uses rules to

    make deductions or choices. For example, an expert system might help a doctor choose the correct

    diagnosis based on a cluster of symptoms, or select tactical moves to play a game.

    Rule-based systems can be used to perform lexical analysis to compile or interpret computer

    programs, or in natural language processing.

    Rule-based programming attempts to derive execution instructions from a starting set of data and

    rules. This is a more indirect method than that employed by an imperative programming language,

    which lists execution steps sequentially.

    Construction

    A typical rule-based system has four basic components:

A list of rules or rule base, which is a specific type of knowledge base.

An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:

Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.

    Conflict-Resolution: In this second phase, one of the production instantiations in the

    conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

    Temporary working memory.

    A user interface or other connection to the outside world through which input and output

    signals are received and sent.

Components of a Rule-Based System

Set of Rules: derived from the knowledge base and used by the interpreter to evaluate the input data.

Knowledge Engineer: decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain.

Interpreter: interprets the input data and draws a conclusion based on the user's responses.

    Problem-solving Models


Forward-chaining: starts from a set of conditions and moves towards some conclusion.

Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.

    Both problem-solving methods are built into inference engines or inference procedures
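A toy forward-chaining cycle, in the spirit of the match-resolve-act loop described above, is sketched below; the facts and the rule format are invented for the example.

# Toy forward-chaining inference: repeatedly fire rules whose conditions are
# satisfied by working memory until no new facts can be derived.

rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),      # IF conditions THEN conclusion
    ({"suspect_measles"}, "refer_to_doctor"),
]

def forward_chain(facts, rules):
    working_memory = set(facts)
    changed = True
    while changed:                                        # repeat the match-resolve-act cycle
        changed = False
        for conditions, conclusion in rules:
            if conditions <= working_memory and conclusion not in working_memory:
                working_memory.add(conclusion)            # "act": assert the new fact
                changed = True
    return working_memory

print(forward_chain({"has_fever", "has_rash"}, rules))
# {'has_fever', 'has_rash', 'suspect_measles', 'refer_to_doctor'}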

    Advantages

    Provide consistent answers for repetitive decisions, processes and tasks.

    Hold and maintain significant levels of information.

    Reduce employee training costs

    Centralize the decision making process.

    Create efficiencies and reduce the time needed to solve problems.

    Combine multiple human expert intelligences

    Reduce the amount of human errors.

    Give strategic and comparative advantages creating entry barriers to competitors

    Review transactions that human experts may overlook.

    Disadvantages

    Lack human common sense needed in some decision making.

    Will not be able to give the creative responses that human experts can give in unusual

    circumstances.

    Domain experts cannot always clearly explain their logic and reasoning.

    Challenges of automating complex processes.

    Lack of flexibility and ability to adapt to changing environments.

    Not being able to recognize when no answer is available.

    Knowledge Bases

    Knowledge-based Systems: Definition

    A system that draws upon the knowledge of human experts captured in a knowledge-base

    to solve problems that normally require human expertise.


    Heuristic rather than algorithmic

Heuristics in search vs. in KBS: general vs. domain-specific

    Highly specific domain knowledge

    Knowledge is separated from how it is used

    KBS = knowledge-base + inference engine

    KBS Architecture

    The inference engine and knowledge base are separated because:

    the reasoning mechanism needs to be as stable as possible;

    the knowledge base must be able to grow and change, as knowledge is added;

    this arrangement enables the system to be built from, or converted to, a shell.

    It is reasonable to produce a richer, more elaborate, description of the typical expert

    system.

    A more elaborate description, which still includes the components that are to be found in

    almost any real-world system, would look like this:

    Knowledge representation formalisms & Inference

KR formalism and corresponding inference method:

Logic: Resolution principle


Production rules: backward chaining (top-down, goal directed) or forward chaining (bottom-up, data-driven)

Semantic nets & Frames: Inheritance & advanced reasoning

Case-based Reasoning: Similarity based

KBS tools: Shells

    - Consist of KA Tool, Database & Development Interface

    - Inductive Shells

    - simplest

- example cases represented as a matrix of known data (premises) and resulting effects

    - matrix converted into decision tree or IF-THEN statements

    - examples selected for the tool

    Rule-based shells

    - simple to complex

    - IF-THEN rules

    Hybrid shells

- sophisticated & powerful

    - support multiple KR paradigms & reasoning schemes

    - generic tool applicable to a wide range

    Special purpose shells

    - specifically designed for particular types of problems

    - restricted to specialised problems

Building from scratch

    - require more time and effort

    - no constraints like shells

    - shells should be investigated first


    Some example KBSs

    DENDRAL (chemical)

    MYCIN (medicine)

XCON/R1 (computer)

    Typical tasks of KBS

    (1) Diagnosis - To identify a problem given a set of symptoms or malfunctions.

    e.g. diagnose reasons for engine failure

    (2) Interpretation - To provide an understanding of a situation from available information. e.g.

    DENDRAL

    (3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor,

    PLANT

    (4) Design - To develop configurations that satisfy constraints of a design problem. e.g. XCON

    (5) Planning - Both short term & long term in areas like project management, product

    development or financial planning. e.g. HRM

    (6) Monitoring - To check performance & flag exceptions.

    e.g., KBS monitors radar data and estimates the position of the space shuttle

    (7) Control - To collect and evaluate evidence and form opinions on that evidence.

e.g. control a patient's treatment

    (8) Instruction - To train students and correct their performance. e.g. give medical students

    experience diagnosing illness

    (9) Debugging - To identify and prescribe remedies for malfunctions.

    e.g. identify errors in an automated teller machine network and ways to correct the errors

    Advantages

    - Increase availability of expert knowledge

    expertise not accessible

    training future experts

    - Efficient and cost effective

    - Consistency of answers

    - Explanation of solution


    - Deal with uncertainty

    Limitations

    - Lack of common sense

    - Inflexible, Difficult to modify

    - Restricted domain of expertise

    - Lack of learning ability

    - Not always reliable

    6. (a) Compare Distributed databases and conventional databases. (16)

    (NOV/DEC 2010)

    DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralised) databases:

Mimics the organisational structure with its data.

Local access and autonomy without exclusion.

Cheaper to create and easier to expand.

Improved availability/reliability/performance by removing reliance on a central site.

Reduced communication overhead: most data access is local, which is less expensive and performs better.

Improved processing power: many machines handle the database rather than a single server.

Disadvantages compared with conventional databases:

More complex to implement.

More costly to maintain security and integrity control.

Standards and experience are lacking.

Design issues are more complex.


    7. (a) Explain the Multi-Version Locks and Recovery in Query Languages.(NOV/DEC 2010)

Multi-Version Locks: Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer "version". Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

    MVCC also provides potential "point in time" consistent views. In fact read transactions under

    MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and

    read these "versions" of the data.

This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future "version", but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes are occurring at a later transaction ID.

    In other words, MVCC provides each user connected to the database with a "snapshot" of the

    database for that person to work with. Any changes made will not be seen by other users of the

    database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp. If a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).


    The obvious drawback to this system is the cost of storing multiple versions of objects in the

    database. On the other hand reads are never blocked, which can be important for workloads mostly

    involving reading values from the database. MVCC is particularly adept at implementing true

    snapshot isolation, something which other methods of concurrency control frequently do either

    incompletely or with high performance costs.

At t1 the state of the database could be:

Time   Object1   Object2
t1     Hello     Bar
t0     Foo       Bar

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction "t1" and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1   Object2     Object3
t2     Hello     (deleted)   Foo-Bar
t1     Hello     Bar         -
t0     Foo       Bar         -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
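A minimal sketch of this versioning idea is shown below; the in-memory store, the transaction-ID counter, and the snapshot read are illustrative assumptions rather than the internals of any particular DBMS.

# Minimal sketch of multiversion storage: writes append new versions tagged
# with a transaction ID; reads at a given ID see the latest version written
# at or before that ID, so old snapshots stay coherent.

import itertools

class MVCCStore:
    def __init__(self):
        self.versions = {}                    # key -> list of (txn_id, value)
        self._txn_ids = itertools.count(1)

    def write(self, key, value):
        txn_id = next(self._txn_ids)
        self.versions.setdefault(key, []).append((txn_id, value))
        return txn_id

    def read(self, key, as_of_txn):
        """Snapshot read: newest version with txn_id <= as_of_txn."""
        for txn_id, value in reversed(self.versions.get(key, [])):
            if txn_id <= as_of_txn:
                return value
        return None                           # key did not exist at that point

db = MVCCStore()
t1 = db.write("Object1", "Foo")
t2 = db.write("Object1", "Hello")
print(db.read("Object1", as_of_txn=t1))      # Foo   (the old snapshot is still visible)
print(db.read("Object1", as_of_txn=t2))      # Hello (the latest version)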

    Recovery



    (b)Discuss client/server model and mobile databases. (16)

    (NOV/DEC 2010)


    Mobile Databases

    Recent advances in portable and wireless technology led to mobile computing, a new

    dimension in data communication and processing.

    Portable computing devices coupled with wireless communications allow clients to access

    data from virtually anywhere and at any time.

    There are a number of hardware and software problems that must be resolved before the

    capabilities of mobile computing can be fully utilized.

    Some of the software problems which may involve data management, transaction

    management, and database recovery have their origins in distributed database systems.


    In mobile computing, the problems are more difficult, mainly:

    The limited and intermittent connectivity afforded by wireless communications.

The limited life of the power supply (battery).

    The changing topology of the network.

In addition, mobile computing introduces new architectural possibilities and challenges.

    Mobile Computing Architecture

    The general architecture of a mobile platform is illustrated in Fig 30.1.

It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

    Fixed hosts are general purpose computers configured to manage mobile units.

    Base stations function as gateways to the fixed network for the Mobile Units.

    Wireless Communications

The wireless medium has bandwidth significantly lower than that of a wired network.


    The current generation of wireless technology has data rates range from the

    tens to hundreds of kilobits per second (2G cellular telephony) to tens of

    megabits per second (wireless Ethernet, popularly known as WiFi).

    Modern (wired) Ethernet, by comparison, provides data rates on the order of

    hundreds of megabits per second.

Other characteristics that distinguish wireless connectivity options:

    interference,

    locality of access,

    range,

    support for packet switching,

    seamless roaming throughout a geographical region.

    Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the

    frequency spectrum, which may cause interference with other appliances, such as

    cordless telephones.

Modern wireless networks can transfer data in units called packets, as is done in wired networks, in order to conserve bandwidth.

    Client/Network Relationships

    Mobile units can move freely in a geographic mobility domain, an area that is

    circumscribed by wireless network coverage.

To manage this, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.

Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.

    The communication architecture described earlier is designed to give the mobile

    unit the impression that it is attached to a fixed network, emulating a traditional

    client-server architecture.

    Wireless communications, however, make other architectures possible. One

    alternative is a mobile ad-hoc network (MANET), illustrated in 29.2.

    In a MANET, co-located mobile units do not need to communicate via a fixed

    network, but instead, form their own using cost-effective technologies such as

    Bluetooth.


    In a MANET, mobile units are responsible for routing their own data, effectively

    acting as base stations as well as clients.

    Moreover, they must be robust enough to handle changes in the network

    topology, such as the arrival or departure of other mobile units.

    MANET applications can be considered as peer-to-peer, meaning that a mobile unit

    is simultaneously a client and a server.

    Transaction processing and data consistency control become more difficult

    since there is no central control in this architecture.

    Resource discovery and data routing by mobile units make computing in a

    MANET even more complicated.

    Sample MANET applications are multi-user games, shared whiteboard,

    distributed calendars, and battle information sharing.

    Characteristics of Mobile Environments

    The characteristics of mobile computing include:

    Communication latency

    Intermittent connectivity

    Limited battery life

    Changing client location

    The server may not be able to reach a client.

    A client may be unreac