MC0077 Solved



1. Describe the following with respect to databases:

a. Data Management Functions:- The primary objective of a data management system (DMS) is to provide efficient and effective management of the database. This includes providing functions for data storage, retrieval, secure modification, DB integrity and maintenance. There are two principal quality measures for a DMS: efficiency and effectiveness.

DMS efficiency is typically measured in the time and machine capacity used for data retrieval and storage, respectively. Here, low time or storage requirements indicate high efficiency. Since these are somewhat conflicting objectives, trade-offs are necessary.

DMS effectiveness is typically measured in the quality of service, for example: the correctness or relevance of retrieval or modification results, smooth or seamless presentation of multiple media data (particularly for video), or the security levels supported. Techniques used to reach these goals include:

Index generation is used to increase the efficiency of data retrieval by speeding data location and thereby reducing retrieval time.

Data compression is used to reduce storage and transmission requirements.

User interfaces can enhance system effectiveness by supporting formulation of 'complete' information needs.

Similarity algorithms seek to locate only data/documents that are relevant to the user query.

b. Database Design & Creation:- Prior to establishing a database, the database administrator (DBA) should create a data model to describe the intended content and structure of the DB. It is this model that is the basis for specification of the Data Definition Language (DDL) statements necessary for construction of the DB schema and the structure of the DB storage areas. Figure 3.2a illustrates part of a graphic data model, formed using the graphic syntax of the Entity-Relationship (ER) model defined by Chen (1976), for an example administrative database containing various media attributes: images, text.

c. Information & Data Retrieval:- The terms information and data are often used interchangeably in the data management literature, causing some confusion in interpreting the goals of different data management system types. It is important to remember that, despite the name of a data management system type, it can only manage data; these data are representations of information. However, historically (since the late 1950s) a distinction has been made between:

Data Retrieval: the retrieval of 'facts', commonly represented as atomic data about some entity of interest, for example a person's name, and

Information Retrieval: the retrieval of documents, commonly text but also visual and audio, that describe objects and/or events of interest.


2. Describe the following with suitable examples:

a. Graphic vs. Declarative Data Models:- A data model is a tool used to specify the structure and (some) semantics of the information to be represented in a database. Depending on the model type used, a data model can be expressed in diverse formats, including:

Graphic, as used in most semantic data model types, such as ER and extended/enhanced ER (EER) models.

Lists of declarative statements, as used in: the relational model for relation definitions; AI/deductive systems for specification of facts and rules; metadata standards such as Dublin Core for specification of descriptive attribute-value pairs; and data definition languages (DDL).

Tabular, as used to present the content of a DB schema.

Even the implemented and populated DB is only a model of the real world as represented by the data.

Graphic models are easier for a human reader to interpret and check for completeness and correctness than list models, while list-formed models are more readily converted to a set of data definition statements for compilation and construction of a DB schema.

b. Structural Semantic Data Model – SSM:- The Structural Semantic Model (SSM), first described in Nordbotten (1993a & b), is an extension and graphic simplification of the EER modeling tool first presented in the 1989 edition of Elmasri & Navathe (2003). SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects. The following table summarizes the data modeling concepts:

Concept (synonym(s)) | Definition | Example(s)

Entity types:
Entity (object) | Something of interest to the information system about which data is collected | A person, student, customer, employee, department, product, exam, order, …
Entity type | A set of entities sharing common attributes | Citizens of Norway; PERSON {Name, Address, ...}
Subclass, superclass entity type | A subclass entity type is a specialization of, alternatively a role played by, a superclass entity type | Subclass : Superclass; Student IS_A Person; Teacher IS_A Person
Shared subclass entity type | A shared subclass entity type has characteristics of two or more parent entity types | A student-assistant IS_BOTH a student and an employee
Category entity type | A subclass entity type of two or more distinct/independent superclass entity types | An owner IS_EITHER a Person or an Organization
Weak entity type | An entity type dependent on another for its identification and existence | Education is (can be) a weak entity type dependent on Person

Attributes:
Property | A characteristic of an entity | Person.name = Joan
Attribute | The name given to a property of an entity or relationship type | Person {ID, Name, Address, Telephone, Age, Position, …}
- Atomic | An attribute having a single value | Person.Id
- Multivalued | An attribute with multiple values | Telephone# {home, office, mobile, fax}
- Composite (compound) | An attribute composed of several sub-attributes | Address {Street, Nr, City, State, Post#}; Name {First, Middle, Last}
- Derived | An attribute whose value depends on other values in the DB and/or environment | Person.age: current_date - birth_date; Person.salary: calculated in relation to current salary levels

In summary, SSM provides:

1. Three types of entity specifications: base (root), subclass, and weak.
2. Four types of inter-entity relationships: n-ary associative, and three types of classification hierarchies.
3. Four attribute types: atomic, multi-valued, composite, and derived.
4. Domain type specifications in the graphic model, including: standard data types; binary large objects (blob, text, image, ...); user-defined types (UDT) and functions (UDF).
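As an illustration of how these attribute types might be carried into DDL, here is a minimal sketch; the table layout, column names, and the SQLite dialect are assumptions for illustration, not part of the SSM specification:

```python
import sqlite3

# Illustrative mapping of the attribute types above onto SQL DDL
# (table and column names are hypothetical, not from the SSM spec).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,         -- atomic attribute
    first_name TEXT, middle_name TEXT,      -- composite attribute Name,
    last_name  TEXT,                        --   flattened into sub-columns
    street TEXT, city TEXT, post_nr TEXT,   -- composite attribute Address
    birth_date TEXT                         -- basis for the derived attribute Age
);
-- A multivalued attribute (Telephone#) becomes a separate table.
CREATE TABLE person_telephone (
    person_id INTEGER REFERENCES person(id),
    kind      TEXT,                         -- home, office, mobile, fax
    number    TEXT
);
-- A derived attribute (Age) is computed on demand rather than stored.
CREATE VIEW person_with_age AS
SELECT id, first_name, last_name,
       CAST((julianday('now') - julianday(birth_date)) / 365.25 AS INTEGER) AS age
FROM person;
""")
```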

3. Explain the following:-

a. Query Optimization:- The goal of any query processor is to execute each query as efficiently as possible. Efficiency here can be measured in both response time and correctness.

The traditional, relational DB approach to query optimization is to transform the query to an execution tree, and then execute query elements according to a sequence that reduces the search space as quickly as possible and delays execution of the most expensive (in time) elements as long as possible. A commonly used execution heuristic is:

1. Execute all select and project operations on single tables first, in order to eliminate unnecessary rows and columns from the result set.
2. Execute join operations to further reduce the result set.
3. Execute operations on media data last, since these can be very time consuming.
4. Prepare the result set for presentation.
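A minimal sketch of steps 1 and 2 of this heuristic on in-memory tables; the tables, columns, and selection condition are invented for illustration:

```python
# Step-by-step evaluation following the heuristic above.
employees = [
    {"emp_id": 1, "name": "Ann", "dept_id": 10, "salary": 52000},
    {"emp_id": 2, "name": "Bob", "dept_id": 20, "salary": 61000},
    {"emp_id": 3, "name": "Eve", "dept_id": 10, "salary": 47000},
]
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "R&D"},
]

# Step 1: select and project on single tables first, shrinking the inputs.
well_paid = [{"name": e["name"], "dept_id": e["dept_id"]}
             for e in employees if e["salary"] > 50000]

# Step 2: join the already-reduced sets.
result = [{"name": e["name"], "dept": d["dept_name"]}
          for e in well_paid for d in departments
          if e["dept_id"] == d["dept_id"]]

# Steps 3-4 (media processing, presentation) would then operate on this
# small result set rather than on the full base tables.
print(result)  # [{'name': 'Ann', 'dept': 'Sales'}, {'name': 'Bob', 'dept': 'R&D'}]
```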

b. Text Retrieval Using SQL3/TextRetrieval:- SQL3 supports storage of multimedia data, such as text documents, in an object-relational database using the blob/clob data types. However, the standard SQL3 specification does not include support for media content processing functions such as indexing or searching using elements of the media content. For example, SQL3's support for a query to retrieve documents about famous Norwegian artists is limited to a serial search of all documents using the pattern-match operator 'LIKE'. Queries using this operator are likely to miss relevant documents, such as a site dedicated to a particular composer, that do not contain the literal search pattern.


Basically, the functionality that is new to SQL3 includes:

Indexing routines for the various types of media data, as discussed in Ch. 6, for example using content terms for text data, and color, shape, and texture features for image data.

Selection operators for the SQL3 WHERE clause, for specification of selection criteria for media retrieval.

Text-processing subsystems for similarity evaluation and result ranking.
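The following sketch contrasts a serial LIKE search with retrieval through a simple content (term) index. The documents, schema, and index structure are illustrative assumptions, not SQL3/TextRetrieval syntax:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    (1, "A survey of famous Norwegian artists of the 19th century."),
    (2, "The composer's piano concertos are still widely performed."),
])

# Serial pattern match: scans every row and finds only literal matches,
# so document 2 (relevant, but lacking the phrase) is missed.
print(conn.execute(
    "SELECT id FROM docs WHERE body LIKE '%Norwegian artists%'").fetchall())
# -> [(1,)]

# A content index maps terms to documents, so a retrieval routine can
# match on indexed terms rather than raw byte patterns.
index = {}
for doc_id, body in conn.execute("SELECT id, body FROM docs"):
    for term in re.findall(r"[a-z]+", body.lower()):
        index.setdefault(term, set()).add(doc_id)
print(index["composer"])  # -> {2}
```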

4. Describe the following:

a- Data Mining Functions:- Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described in this section.

Classification: Data mining tools have to infer a model from the database and, in the case of Supervised Learning, this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

A rule is generally presented as: if the left-hand side (LHS), then the right-hand side (RHS), so that in all instances where LHS is true, RHS is also true (or at least very probable). The categories of rules are:

Exact Rule – permits no exceptions, so each object of the LHS must be an element of the RHS.

Strong Rule – allows some exceptions, but the exceptions have a given limit.

Probabilistic Rule – relates the conditional probability P(RHS|LHS) to the unconditional probability P(RHS).

Other types of rules are classification rules, where the LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.
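A minimal sketch of how these rule categories can be distinguished numerically; the records, the bread/butter rule, and the 0.9 exception limit are assumptions for illustration:

```python
# LHS: customer bought bread; RHS: customer bought butter (made-up data).
records = [
    {"bread": True,  "butter": True},
    {"bread": True,  "butter": True},
    {"bread": True,  "butter": False},
    {"bread": False, "butter": True},
]

lhs = [r for r in records if r["bread"]]
p_rhs = sum(r["butter"] for r in records) / len(records)    # P(RHS)
p_rhs_given_lhs = sum(r["butter"] for r in lhs) / len(lhs)  # P(RHS|LHS)

if p_rhs_given_lhs == 1.0:
    category = "exact rule: no exceptions"
elif p_rhs_given_lhs >= 0.9:  # illustrative exception limit
    category = "strong rule: exceptions within the given limit"
else:
    category = (f"probabilistic rule: P(RHS|LHS)={p_rhs_given_lhs:.2f} "
                f"vs P(RHS)={p_rhs:.2f}")
print(category)  # probabilistic rule: P(RHS|LHS)=0.67 vs P(RHS)=0.75
```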

b- Data Mining Techniques:-

Cluster Analysis: In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database. The first step is to discover subsets of related objects, and then find descriptions, e.g. D1, D2, D3, etc., which describe each of these subsets.


Induction: A database is a store of information, but more important is the information which can be inferred from it. There are two main inference techniques available, namely deduction and induction.

Deduction is a technique to infer information that is a logical consequence of the information in the database; e.g. the join operator applied to two relational tables, where the first relates employees and departments and the second departments and managers, infers a relation between employees and managers.

Induction has been described earlier as the technique to infer information that is generalised from the database, as in the example mentioned above, to infer that each employee has a manager. This is higher-level information, or knowledge, in that it is a general statement about objects in the database. The database is searched for patterns or regularities.
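A small sketch of both inference techniques on the employee/department/manager example; the relation contents are made up:

```python
# Toy relations (illustrative only).
works_in = {"ann": "sales", "bob": "r&d", "eve": "sales"}  # employee -> dept
managed_by = {"sales": "carol", "r&d": "dan"}              # dept -> manager

# Deduction: join the two relations to infer employee -> manager,
# a logical consequence of the stored facts.
manager_of = {emp: managed_by[dept] for emp, dept in works_in.items()}
print(manager_of)  # {'ann': 'carol', 'bob': 'dan', 'eve': 'carol'}

# Induction: generalise from the data -- the pattern holds for every
# employee, suggesting the rule "each employee has a manager".
assert all(emp in manager_of for emp in works_in)
```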

Decision Trees: Decision trees are a simple form of knowledge representation. They classify examples into a finite number of classes: the nodes are labeled with attribute names, the edges are labeled with possible values of the attribute, and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, taking the edges corresponding to the values of the attributes in the object.

[Figure: Decision tree structure]
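A minimal sketch of classification by following a path down such a tree; the tree (a play-tennis-style example) and the objects are assumptions for illustration:

```python
# Internal nodes: (attribute_name, {edge_value: subtree}); leaves: class label.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("windy", {True: "no", False: "yes"}),
})

def classify(node, obj):
    # Follow the edge matching the object's value for this node's attribute.
    while isinstance(node, tuple):
        attribute, edges = node
        node = edges[obj[attribute]]
    return node  # a leaf, i.e. the class

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
print(classify(tree, {"outlook": "rain", "windy": True}))          # no
```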

Rule Induction: A data mining system has to infer a model from the database; that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple (the predicted attributes), while the remaining attributes are the predicting attributes. A class can then be defined by a condition on the attributes. When the classes are defined, the system should be able to infer the rules that govern classification; in other words, the system should find the description of each class.


5. Describe the Data Mining Functions.

Associations: Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

A typical application, identified by IBM, that can be built using an association function is Market Basket Analysis. This is where a retailer runs an association operator over the point-of-sale transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."

Another example of the use of associations is the analysis of claim forms submitted by patients to a medical insurance company. Every claim form contains a set of medical procedures that were performed on a given patient during one visit. By defining the set of items to be the collection of all medical procedures that can be performed on a patient, and the records to correspond to each claim form, the application can find, using the association function, relationships among medical procedures that are often performed together.
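A minimal sketch of an association function computing such a confidence factor over point-of-sale records; the baskets and the toaster/gloves rule are invented, and the resulting percentage is unrelated to the figures quoted above:

```python
# Each basket = the set of product identifiers in one transaction.
baskets = [
    {"toaster", "gloves", "cover"},
    {"toaster", "gloves", "cover"},
    {"toaster", "bread"},
    {"bread", "butter"},
    {"toaster", "gloves", "cover", "bread"},
]

lhs, rhs = {"toaster"}, {"gloves", "cover"}
with_lhs = [b for b in baskets if lhs <= b]            # records containing LHS
confidence = sum(rhs <= b for b in with_lhs) / len(with_lhs)
print(f"{confidence:.0%} of baskets with {lhs} also contain {rhs}")  # 75%
```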

Sequential/Temporal Patterns: Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct-mail application where, for example, a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.
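A small sketch of a sequential pattern operator of this kind; the customer histories and the "microwave oven" target are invented:

```python
from collections import Counter

# Per-customer purchase histories, ordered in time (made-up data).
histories = {
    "c1": [{"plates"}, {"cookbook"}, {"microwave oven"}],
    "c2": [{"cookbook"}, {"microwave oven"}],
    "c3": [{"plates"}, {"bread"}],
}

# Count which items most often precede the target purchase.
preceding = Counter()
for orders in histories.values():
    for i, order in enumerate(orders):
        if "microwave oven" in order:
            for earlier in orders[:i]:
                preceding.update(earlier)
print(preceding.most_common())  # [('cookbook', 2), ('plates', 1)]
```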

Clustering/Segmentation: Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.
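A minimal sketch of partitioning by a distance metric, here a one-dimensional k-means with k = 2; the data and initial centroids are invented:

```python
data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]                        # initial guesses

for _ in range(10):                            # a few refinement rounds
    clusters = [[] for _ in centroids]
    for x in data:                             # assign each object to its
        i = min(range(len(centroids)),         # nearest centroid
                key=lambda i: abs(x - centroids[i]))
        clusters[i].append(x)
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(clusters)  # [[1.0, 1.2, 0.8], [8.0, 8.3, 7.9]]
```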

IBM – Market Basket Analysis example: IBM has used segmentation techniques in its Market Basket Analysis of POS transactions, separating a set of untagged input records into reasonable groups according to product revenue by market basket; i.e., the market baskets were segmented based on the number and type of products in the individual baskets.

6. Discuss the following with respect to Distributed Database Systems:

a- Problem Areas of Distributed Databases:- The following are the crucial areas in a distributed database environment that need to be looked into carefully in order to make it successful. We shall discuss these in much more detail in the following sections:

1. Distributed Database Design
2. Distributed Query Processing
3. Distributed Directory Management
4. Distributed Concurrency Control
5. Distributed Deadlock Management
6. Reliability in Distributed DBMS
7. Operating System Support
8. Heterogeneous Databases

b- Transaction Processing Framework:- A transaction is always part of an application. At some time after its invocation by the user, the application issues a begin_transaction primitive; from this moment, all actions performed by the application, until a commit or abort primitive is issued, are considered part of the same transaction. Alternatively, the beginning of a transaction is implicitly associated with the beginning of the application, and a commit/abort primitive ends a transaction and automatically begins a new one, so that an explicit begin_transaction primitive is not necessary.

In order to perform functions at different sites, a distributed application has to execute several processes at these sites. Let us call these processes agents of the application. An agent is therefore a local process which performs some actions on behalf of an application.

Any transaction must satisfy four properties:

Atomicity: Either all or none of the transaction's operations are performed. In other words if a transaction is interrupted by a failure, its partial results are undone.

Consistency Preservation: A transaction is consistency preserving if its complete execution takes the database from one consistent state to another.

Isolation: Execution of a transaction should not be interfered with by any other transactions executing concurrently. It should appear that a transaction is being executed in isolation from other transactions. An incomplete transaction cannot reveal its results to other transactions before its commitment. This property is needed in order to avoid the problem of cascading aborts.

Durability (Permanency): Once a transaction has committed, the system must guarantee that the results of its operations will never be lost, independent of subsequent failures.
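A minimal sketch of these primitives, showing atomicity via an undo log; the key-value store and the API are assumptions for illustration, not a real DBMS interface:

```python
class Transaction:
    def __init__(self, store):
        self.store, self.undo_log = store, []

    def write(self, key, value):
        # Record the old value so a partial transaction can be undone.
        self.undo_log.append((key, self.store.get(key)))
        self.store[key] = value

    def commit(self):
        self.undo_log.clear()   # results now permanent (durability would
                                # also require forcing them to disk)

    def abort(self):
        while self.undo_log:    # undo partial results in reverse order
            key, old = self.undo_log.pop()
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old

db = {"balance_a": 100, "balance_b": 0}
t = Transaction(db)             # begin_transaction
t.write("balance_a", 50)
t.write("balance_b", 50)
t.abort()                       # failure: partial results are undone
print(db)                       # {'balance_a': 100, 'balance_b': 0}
```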

c- Models of Failure:- Failures can be classified as:

1) Transaction Failures:
a) Error in transaction due to incorrect data input.
b) Present or potential deadlock.
c) 'Abort' of transactions due to non-availability of resources or deadlock.


2) Site Failures: From the recovery point of view, a failure has to be judged in terms of loss of memory, so failures can be classified as:

a) Failure with Loss of Volatile Storage: In these failures the content of main memory is lost; however, the information recorded on disk is not affected. Typical failures of this kind are system crashes.

b) Media Failures (Failures with Loss of Nonvolatile Storage): In these failures the content of disk storage is lost. Failures of this type can be reduced by replicating the information on several disks having 'independent failure modes'. Stable storage is the most resilient storage medium available in the system. It is implemented by replicating the same information on several disks with (i) independent failure modes, and (ii) the so-called careful replacement strategy: at every update operation, first one copy of the information is updated, then the correctness of the update is verified, and only then is the second copy updated.
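A minimal sketch of the careful replacement strategy; the file names and the verification step are illustrative:

```python
import os

def stable_write(data: bytes, copy1="replica1.dat", copy2="replica2.dat"):
    with open(copy1, "wb") as f:   # 1. update the first copy
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    with open(copy1, "rb") as f:   # 2. verify the update succeeded
        assert f.read() == data
    with open(copy2, "wb") as f:   # 3. only now update the second copy,
        f.write(data)              #    so one intact replica always exists
        f.flush()
        os.fsync(f.fileno())

stable_write(b"account=42;balance=100")
```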

3) Communication Failures: There are two basic types of possible communication errors: lost messages and partitions.