15-53 Data Warehousing and Data Mining


(DATA WAREHOUSING AND DATA MINING)
1. Define the term data warehouse. Give the three major activities of a data warehouse.
Ans: A collection of key pieces of information used to arrive at suitable managerial decisions is known as a data warehouse. The three major activities of a data warehouse are: a) populating the data, b) day-to-day management, c) accommodating changes.

2. What is the starflake schema? How is it different from the star schema?
Ans: A starflake schema is a schema that uses a combination of denormalized star and normalized snowflake schemas. These are most appropriate in decision-support data warehouses. The star schema looks like a good solution to the warehousing problem, but it simply states that one should identify the facts and store them in a read-only area.

3. What are the functions of the schedule manager?
Ans: Scheduling is the key to successful warehouse management, and the schedule manager's functions are: a) handle multiple queues, b) maintain job schedules across outages, c) support starting and stopping of queries, etc.

4. How do we categorize data mining systems?
Ans:
a) Classification according to the type of data source mined: this categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.
b) Classification according to the data model drawn on: this categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this categorizes data mining systems based on the kind of knowledge discovered.
d) Classification according to the mining techniques used: this categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented techniques, etc.

5. Explain the possible techniques for data cleaning.
Ans: 1. Data normalization: for example, decimal scaling into the range (0, 1), or standard deviation normalization.

2. Data smoothing: discretization of numeric attributes is one example; this is helpful or even necessary for logic-based methods. 3. Treatment of missing values: there is no simple and safe solution for cases where some of the attributes have a significant number of missing values. 4. Data reduction: the reasons for data reduction are in most cases twofold: either the data may be too big for the program, or the expected time for obtaining the solution might be too long.
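
For illustration, a minimal Python sketch of the two normalization methods and a simple missing-value treatment mentioned above; the sample values are invented and this is not tied to any particular tool.

```python
import statistics

values = [120.0, 85.0, 240.0, 60.0, 310.0]  # hypothetical attribute values

# Decimal scaling: divide by 10^j, with j the smallest integer such that
# every scaled value lies in (-1, 1).
j = len(str(int(max(abs(v) for v in values))))
decimal_scaled = [v / (10 ** j) for v in values]

# Standard deviation (z-score) normalization: subtract the mean and
# divide by the standard deviation.
mean, stdev = statistics.mean(values), statistics.stdev(values)
z_scored = [(v - mean) / stdev for v in values]

# One common (but not always safe) treatment of missing values:
# replace None with the mean of the known values.
raw = [120.0, None, 240.0, None, 310.0]
fill = statistics.mean([v for v in raw if v is not None])
cleaned = [v if v is not None else fill for v in raw]

print(decimal_scaled, z_scored, cleaned, sep="\n")
```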

6. Explain the concepts of data warehousing and data mining.
Ans: A data warehouse is a collection of a large amount of data, and this data consists of the pieces of information used to take suitable managerial decisions (a storehouse of data), e.g., student data, the details of the citizens of a city, the sales of previous years, or the number of patients that came to a hospital with different ailments. Such data becomes a storehouse of information.
Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules. The main concept of data mining is using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation.

7. What is metadata? Give an example.
Ans: Metadata is simply data about data, which normally describes the objects and their quantity, their size and how the data are stored. It is helpful in query management. E.g., a catalogue entry recording a table's name, its columns and their data types, its size, and when it was last loaded.

8. What are the requirements for clustering? Explain.
Ans: Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar among themselves and dissimilar to objects of other groups. In other words, clustering is a challenging and interesting field in which every potential application poses its own special requirements. The following are typical requirements of clustering.
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a large database may contain millions of objects.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented with different orderings to such an algorithm, may generate dramatically different clusters.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling only low-dimensional data, involving two to three dimensions.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that our job is to choose the locations for a given number of new automatic cash-dispensing machines (i.e., ATMs) in a city.
Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable.

9. Explain in brief the data warehouse delivery process.
Ans: The data warehouse delivery process follows these steps:
IT strategy: A data warehouse cannot work in isolation; the whole IT strategy of the company is needed.
Business case analysis: The importance of the various components of the business, and an overall understanding of the business, is a must. The designer must have sound knowledge of the business activity.
Education: It plays two roles: a) to make people comfortable with the data warehouse concept, and b) to aid the prototyping activity.
Business requirements: It is essential that the business requirements are fully understood by the data warehouse planner. This is more easily said than done, because future modifications are hardly clear even to top-level planners, let alone the IT professionals.
Technical blueprints: This is the stage where the overall architecture that satisfies the requirements is delivered.
Building the vision: Here the first physical infrastructure becomes available. The major infrastructure components are set up; the first stages of loading and generation of data start up.
History load: Here the system is made fully operational by loading the required history into the warehouse. Now the warehouse becomes fully "loaded" and is ready to take on live "queries".
Ad hoc query: Now we configure a query tool to operate against the data warehouse. The users can ask questions in a typical format.
Automation: Extracting and loading of data from the sources, transforming the data, backing up, restoration, archiving, aggregations, and monitoring query profiles are the operational processes of this phase.
Extending scope: There is no single mechanism by which this can be achieved. As and when needed, a new set of data may be added, new formats may be included, or it may even involve major changes.
Requirement evolution: Business requirements will constantly change during the life of the warehouse. Hence, the processes that support the warehouse also need to be constantly monitored and modified.

Q10. What is the need for partitioning of data? Explain the usefulness of partitioning of data.
Ans: In most warehouses the size of the fact tables tends to become very large. This leads to several problems of management, backup, processing, etc., which are addressed by partitioning each fact table into separate partitions. This technique allows the amount of data to be scanned to be minimized, without the overhead of using an index.

This improves the overall efficiency of the system. Partitioning data helps in: a) better management of data; b) ease of backup/recovery, since the volume is less; c) star schemas with partitions produce better performance.
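
As a toy illustration of why partitioning reduces the data scanned, the Python sketch below splits a fact table by month so that a query for one month touches only that slice; the partition layout and row format are invented for this example.

```python
# Fact rows grouped into partitions keyed by month (hypothetical layout).
partitions = {
    "2014-01": [{"item": "A", "amount": 10}, {"item": "B", "amount": 5}],
    "2014-02": [{"item": "A", "amount": 7}],
    "2014-03": [{"item": "C", "amount": 12}, {"item": "A", "amount": 3}],
}

def monthly_total(month):
    """Scan only the partition for the requested month, not the whole fact table."""
    return sum(row["amount"] for row in partitions.get(month, []))

print(monthly_total("2014-02"))  # scans 1 row instead of 5
```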

Q11. Explain the steps needed for designing the summary table.
Ans: Summary tables are designed by the following steps (a short sketch follows the list):

i. Decide the dimensions along which aggregation is to be done.

ii. Determine the aggregation of multiple facts.

iii. Aggregate multiple facts into the summary table.

iv. Determine the level of aggregation and the extent of embedding.

v. Design time into the table.
vi. Index the summary table.
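
A minimal Python sketch of steps (i)-(iii) and (v): choose the aggregation dimensions, aggregate the facts into a summary table, and make time part of the key. The fact rows and column names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows at the finest grain: one row per sale.
facts = [
    {"region": "North", "month": "2014-01", "units": 3, "revenue": 30.0},
    {"region": "North", "month": "2014-01", "units": 1, "revenue": 12.0},
    {"region": "South", "month": "2014-01", "units": 5, "revenue": 45.0},
    {"region": "North", "month": "2014-02", "units": 2, "revenue": 20.0},
]

# Step (i): aggregate along the region and month dimensions.
# Steps (ii)-(iii): sum the multiple facts into the summary table.
summary = defaultdict(lambda: {"units": 0, "revenue": 0.0})
for row in facts:
    key = (row["region"], row["month"])      # step (v): time is designed into the key
    summary[key]["units"] += row["units"]
    summary[key]["revenue"] += row["revenue"]

print(summary[("North", "2014-01")])         # {'units': 4, 'revenue': 42.0}
```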

Q12. What is an event? How does an event manager manage events? Name any four events.
Ans: An event is a measurable, observable occurrence of a defined action. The event manager is software that continuously monitors the system for the occurrence of the event and then takes any action that is suitable. A list of common events: i. running out of memory space; ii. a process dying; iii. a process using excessive resources; iv. I/O errors.
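
As a rough illustration of what an event manager does, the Python sketch below pairs a monitored condition with the action to take when it occurs; the threshold and the placeholder action are invented for this example.

```python
import shutil

def low_disk_space(path="/", min_free_bytes=1 << 30):
    """Event: running out of space (free space below a threshold)."""
    return shutil.disk_usage(path).free < min_free_bytes

def handle_low_disk_space():
    print("event: low disk space -> archive old partitions (placeholder action)")

# The event manager continuously checks each monitored event and,
# when it occurs, takes the suitable action.
monitored_events = [(low_disk_space, handle_low_disk_space)]

def run_once():
    for detect, act in monitored_events:
        if detect():
            act()

run_once()
```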

Q13. What are the reasons for data marting? Mention their advantages and disadvantages.
Ans: There are many reasons for data marting: since the volume of data scanned is small, data marts speed up query processing; data can be structured in a form suitable for user access; and data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.
Advantages:
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for user access.
iii) Data can be segmented or partitioned so that it can be used on different platforms.
Disadvantages:
i) The cost of setting up and operating a data mart is quite high.
ii) Once a data marting strategy is put in place, the data mart format becomes fixed.

It may be very difficult to change the strategy later, because the data mart formats also have to be changed.

Q14. What are the issues in data mining?
Ans: There are many issues in data mining, such as:
i) Security and social issues: Security is an important issue with any data collection that is shared or intended to be used for strategic decision making; correlating personal data with other information also raises privacy concerns.
ii) User interface issues: The knowledge discovered by data mining tools is useful only as long as it is interesting and, above all, understandable by the user.
iii) Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations.
iv) Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation.
v) Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem.

Q15. Define data mining query in terms of primitives.
Ans: a) Growing data volume: The main reason automated computer systems are necessary for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific, and governmental organizations around the world is daunting. b) Limitations of human analysis: Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectiveness in such an analysis. c) Low cost of machine learning: While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.

Q16. Explain in brief the data mining applications.
Ans: Data mining has many varied fields of application, some of which are listed below.
Retail/Marketing:

• Identify buying patterns from customers.
• Find associations among customer demographic characteristics.
• Predict response to mailing campaigns.
• Market basket analysis.
Banking:
• Detect patterns of fraudulent credit card use.
• Identify 'loyal' customers.
• Predict customers likely to change their credit card affiliation.
• Determine credit card spending by customer groups.
• Identify stock trading rules from historical data.
Insurance and Health Care:
• Claims analysis.
• Identify behavior patterns of risky customers.
• Identify fraudulent behavior.
Transportation:
• Determine the distribution schedules among outlets.
• Analyze loading patterns.
Medicine:
• Characterize patient behavior to predict office visits.
• Identify successful medical therapies for different illnesses.

Q17. With an example, explain the decision tree working concept.
Ans: A decision tree is a classifier in the form of a tree structure, where each node is either:
a. a leaf node, indicating the class of instances; or
b. a decision node, which specifies some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root and moving through the tree until a leaf node is reached, which provides the classification of the instance.
Example: decision making in the Bombay stock market. Assume that the major factors affecting the Bombay stock market are: what it did yesterday, what the New Delhi market is doing today, the bank interest rate, and the unemployment rate.
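
A hand-written Python sketch of this stock-market example: each decision node tests one attribute and each leaf returns a class. The particular tests, thresholds and predicted classes are invented purely to show the tree structure, not a real model.

```python
def classify(day):
    """Walk from the root to a leaf, testing one attribute per decision node."""
    # Root decision node: what the Bombay market did yesterday.
    if day["yesterday"] == "rose":
        # Next decision node: what the New Delhi market is doing today.
        if day["new_delhi_today"] == "rising":
            return "buy"                      # leaf node: class of the instance
        return "hold"
    # Otherwise branch on the bank interest rate, then the unemployment rate.
    if day["interest_rate"] > 8.0:
        return "sell"
    return "hold" if day["unemployment_rate"] > 6.0 else "buy"

sample = {"yesterday": "fell", "new_delhi_today": "rising",
          "interest_rate": 7.5, "unemployment_rate": 5.0}
print(classify(sample))   # -> "buy"
```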

Q18. What are the implementation steps of data mining with Apriori analysis, and how can the efficiency of this algorithm be improved?
Ans: Implementation steps:
i) The Apriori algorithm analyzes all the transactions in a dataset for each item's support count. Any item whose support count is less than the minimum support count is removed from the pool of candidates.
ii) Initially, every item is a member of the set of 1-candidate itemsets. The support count of each candidate itemset is calculated, and items with a support count less than the minimum required support count are removed as candidates; the remaining items are joined to create 2-candidate itemsets that each comprise two items or members.
iii) The support count of each two-member itemset is calculated from the database of transactions, and the 2-member itemsets that occur with a support count greater than or equal to the minimum support count are used to create 3-candidate itemsets. The process in the previous steps is repeated, generating 4- and 5-candidate itemsets.
iv) All candidate itemsets generated with a support count greater than the minimum support count form the set of frequent itemsets.
v) Apriori recursively generates all the subsets of each frequent itemset and creates association rules based on subsets with a confidence greater than the minimum confidence.
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm:
a) Hash-based technique: It is used to reduce the size of the candidate k-itemsets.
b) Transaction reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
c) Sampling: In this way, we trade off some degree of accuracy against efficiency.
d) Dynamic itemset counting: This technique was proposed in which the database is partitioned into blocks marked by start points.
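
A compact Python sketch of the candidate-generation and pruning loop described in steps (i)-(iv); the transactions and the minimum support count are made up, and rule generation (step v) is omitted to keep the sketch short.

```python
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2  # minimum support count (invented for this example)

def support(itemset):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Steps (i)-(ii): 1-candidate itemsets, pruned by the minimum support count.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Steps (iii)-(iv): join frequent (k-1)-itemsets into k-candidates, prune, repeat.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent[:-1]:
    print(sorted(tuple(sorted(s)) for s in level))
```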

Q19. In brief, explain the process of data preparation.

Ans: Data preparation is divided into data selection, data cleaning, construction of new data, and data formatting.
i) Select data: data quality properties such as completeness and correctness, and technical constraints such as limits on data volume or data type.
ii) Data cleaning: data normalization, data smoothing, treatment of missing values, data reduction.
iii) New data construction: this step represents constructive operations on the selected data, which include derivation of new attributes from two or more existing attributes, generation of new records, and data transformation.
iv) Data formatting: reordering of attributes or records, and changes related to the constraints of modelling tools.
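
A small Python sketch of step (iii), new data construction, and step (iv), data formatting: a new attribute is derived from two existing ones and the attributes are then reordered for a modelling tool. The record layout is hypothetical.

```python
# Hypothetical records after data selection and cleaning.
records = [
    {"revenue": 1200.0, "cost": 800.0, "region": "North"},
    {"revenue": 950.0, "cost": 700.0, "region": "South"},
]

# New data construction: derive a "margin" attribute from two existing attributes.
for r in records:
    r["margin"] = (r["revenue"] - r["cost"]) / r["revenue"]

# Data formatting: reorder the attributes into the fixed order a modelling tool expects.
column_order = ["region", "revenue", "cost", "margin"]
rows = [[r[c] for c in column_order] for r in records]
print(rows)
```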

Q20. What are the guidelines for a KDD environment?
Ans: The following are the guidelines for a KDD environment:
1. Support extremely large data sets: Data mining deals with extremely large data sets consisting of billions of records, and without proper platforms to store and handle these volumes of data, no reliable data mining is possible. Parallel servers with databases optimized for decision-support-oriented queries are useful. Fast and flexible access to large data sets is very important.
2. Support hybrid learning: Learning tasks can be divided into three areas: a. classification tasks, b. knowledge engineering tasks, c. problem-solving tasks. Not all algorithms perform well in all the above areas, as discussed in previous chapters. Depending on the requirement, one has to choose the appropriate one.
3. Establish a data warehouse: A data warehouse contains historic data and is subject-oriented and static; that is, users do not update the data, but it is created on a regular time-frame on the basis of the operational data of an organization.
4. Introduce data cleaning facilities: Even when a data warehouse is in operation, the data is certain to contain a heterogeneous mixture of all sorts. Special tools for cleaning data are necessary, and some advanced tools are available, especially in the field of de-duplication of client files.
5. Facilitate working with dynamic coding: Creative coding is the heart of the knowledge discovery process.

The environment should enable the user to experiment with different coding schemes, store partial results, make attributes discrete, create time series out of historic data, select random sub-samples, separate test sets, and so on.
6. Integrate with decision support systems: Data mining looks for hidden data that cannot easily be found using normal query techniques. A knowledge discovery process always starts with traditional decision support system activities, and from there we zoom in on interesting parts of the data set.
7. Choose an extendible architecture: New techniques for pattern recognition and machine learning are under development, and we also see many developments in the database area. It is advisable to choose an architecture that enables us to integrate new tools at later stages.
8. Support heterogeneous databases: Not all the necessary data is to be found in the data warehouse. Sometimes we will need to enrich the data warehouse with information from unexpected sources, such as information brokers, or with operational data that is not stored in our regular data warehouse.
9. Introduce client/server architecture: A data mining environment needs extensive reporting facilities. Client/server is a much more flexible system, which moves the burden of visualization and graphical techniques from the servers to the local machine. We can then optimize our database server completely for data mining.
10. Introduce cache optimization: The learning algorithms in a data mining environment should be optimized for this type of database access, for example by storing the data in separate tables or by caching large portions of it in internal memory.

Q21. Explain data mining for financial data analysis.
Ans: Financial data collected in the banking and financial industries is often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. The various issues are:
a) Design and construction of data warehouses for multidimensional data analysis and data mining: Data warehouses need to be constructed for banking and financial data.

Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses and outlier analyses all play important roles in financial data analysis and mining.
b) Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as feature selection and attribute relevance ranking, may help identify important factors and eliminate irrelevant ones.
c) Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing. Effective clustering and collaborative filtering methods can help identify customer groups, associate a new customer with an appropriate customer group and facilitate targeted marketing.
d) Detection of money laundering and other financial crimes: To detect money laundering and other financial crimes, it is important to integrate information from multiple databases, as long as they are potentially related to the study. Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash flow at certain periods by certain groups of people, and so on.

Q22. What is a data warehouse? Explain the architecture of a data warehouse.
Ans: It is a large collection of data and a set of process managers that use this data to make the information available.

The architecture of a data warehouse covers only the major items that make up a data warehouse; the size and complexity of each item depend on the actual size of the warehouse. The extracting and loading processes are taken care of by the load manager. The processes of clean-up and transformation of data, as well as backup and archiving, are duties of the warehouse manager, while the query manager, as the name implies, takes care of query management.

Q23. What is the importance of the period of retention of data?
Ans: A businessman says he wants the data to be retained for as long as possible: 5, 10, 15 years, the longer the better. The more data we have, the better the information generated. But such a view is unnecessarily simplistic. If a company wants to have an idea of its reorder levels, details of sales for the last six months to one year may be enough; a sales pattern of five years ago is unlikely to be relevant today. So it is important to determine the retention period for each function; once this is drawn up, it becomes easy to decide on the optimum volume of data to be stored.

Q25. Give the advantages and disadvantages of equal segment partitioning.
Ans: The advantage is that the slots are reusable. Suppose we are sure that we will no longer need the data from 10 years back; then we can simply delete the data of that slot and use it again. Of course, there is a serious drawback in the scheme if the partitions tend to differ too much in size. The number of visitors visiting a hill station, say, in the summer months will be much larger than in the winter months, and hence the size of the segment should be big enough to take care of the summer rush.

Q24. What are the factors to optimize the cost-benefit ratio?
Ans: The factors to optimize the cost-benefit ratio are:
i) Understand the significance of the data stored with respect to time. Only data that is still needed for processing needs to be stored.
ii) Find out whether maintaining statistical samples of each of the subsets could be resorted to, instead of storing the entire data.
iii) Remove certain columns of the data if they are no longer essential.
iv) Determine the use of intelligent and non-intelligent keys.
v) Incorporate time as one of the factors in the data table. This can help in indicating the usefulness of the data over a period of time and the removal of obsolete data.
vi) Partition the fact table. A record may contain a large number of fields, only a few of which are actually needed in each case. It is desirable to group those fields which will be useful into smaller tables and store them separately.

Q26. Explain query generation.
Ans: Metadata is also required to generate queries. The query manager uses the metadata to build a history of all queries run and generates a query profile for each user, or group of users. We simply list a few of the commonly used metadata for the query; the names are self-explanatory.
o Query: table accessed, column accessed, name, reference identifier.
o Restrictions applied: column name, table name, reference identifier, restrictions.
o Join criteria applied: column name, table name, reference identifier, column name, table name, reference identifier.
o Aggregate function used: column name, reference identifier, aggregate function.
o Syntax
o Resources
o Disk
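
One possible shape for such a per-query metadata record, as a Python sketch of how a query manager might accumulate the query history and per-user profiles; the field values below are invented, and only the categories come from the list above.

```python
# One metadata record per executed query (illustrative values only).
query_record = {
    "reference_identifier": "Q-000123",
    "tables_accessed": ["sales_fact", "store_dim"],
    "columns_accessed": ["store_id", "sale_date", "amount"],
    "restrictions_applied": [
        {"table": "sales_fact", "column": "sale_date", "restriction": ">= 2014-01-01"},
    ],
    "join_criteria": [
        {"left": ("sales_fact", "store_id"), "right": ("store_dim", "store_id")},
    ],
    "aggregate_functions": [{"column": "amount", "function": "SUM"}],
    "resources": {"disk_reads": 1520, "elapsed_ms": 840},
}

# A per-user query profile is then just the accumulation of such records.
user_profile = {"user": "analyst1", "history": [query_record]}
print(len(user_profile["history"]))
```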

Q27. Explain data mining for retail industry applications.
Ans: The retail industry is a major application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service records, and so on. The quantity of data collected continues to expand rapidly, due to the web and e-commerce. Today, many stores also have web sites where customers can make purchases on-line. Retail data mining can help identify customer buying behaviors, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of the business. The following are a few data mining activities carried out in the retail industry.
a) Design and construction of data warehouses based on the benefits of data mining: The first aspect is to design a warehouse. This involves deciding which dimensions and levels to include and what preprocessing to perform in order to facilitate quality and efficient data mining.
b) Multidimensional analysis of sales, customers, products, time and region: The retail industry requires timely information regarding customer needs, product sales, trends and fashions, as well as the quality, cost, profit and service of commodities. It is therefore important to provide powerful multidimensional analysis and visualization tools, including the construction of sophisticated data cubes according to the needs of data analysis.

c) Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using advertisements, coupons and various kinds of discounts and bonuses to promote products and attract customers. Careful analysis of the effectiveness of sales campaigns can help improve company profits. Multidimensional analysis can be used for these purposes by comparing the amount of sales and the number of transactions containing the sales items during the sales period versus those containing the same items before or after the sales campaign.
d) Customer retention (analysis of customer loyalty): With customer loyalty card information, one can register sequences of purchases of particular customers. Customer loyalty and purchase trends can be analyzed in a systematic way, and goods purchased at different periods by the same customer can be grouped into sequences. Sequential pattern mining can then be used to investigate changes in customer consumption or loyalty and suggest adjustments to the pricing and variety of goods in order to help retain customers and attract new customers.
e) Purchase recommendations and cross-reference of items: Using association mining on sales records, one may discover that a customer who buys a particular brand of bread is likely to buy another set of items. Such information can be used to form purchase recommendations. Purchase recommendations can be advertised on the web, in weekly flyers or on sales receipts to help improve customer service, aid customers in selecting items and increase sales.

37. Define aggregation. Explain the steps required to design a summary table.
Ans: Association: given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. Summary tables are designed by the following steps: a) decide the dimensions along which aggregation is to be done; b) determine the aggregation of multiple facts; c) aggregate multiple facts into the summary table; d) determine the level of aggregation and the extent of embedding; e) design time into the table; f) index the summary table.

Q29. Explain hardware partitioning.

Ans: The data warehouse design process should try to maximize the performance of the system. One of the ways to ensure this is to optimize the database design with respect to a specific hardware architecture. Obviously, the exact details of optimization depend on the hardware platform. Normally the following guidelines are useful: i) maximize the use of processing, disk and I/O operations; ii) reduce bottlenecks at the CPU and I/O.

31. With a diagram, explain the architecture of the warehouse manager.
Ans: The warehouse manager is a component that performs all operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.

Q28. Explain the system management tools.
Ans: 1. Configuration managers: This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several concepts like machine configuration, compatibility, etc. have to be taken care of, as well as the platform on which the system operates. Most configuration managers have a single interface to allow the control of all types of issues.
2. Schedule managers: Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system will have its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse. Hence it is more desirable to have specially designed schedulers to manage the operations. Some of the capabilities that such a manager should have include: handling multiple queues; interqueue processing capabilities; maintaining job schedules across system outages; dealing with time zone differences.
3. Event managers: An event is defined as a measurable, observable occurrence of a defined action. If this definition is quite vague, it is because it encompasses a very large set of operations. A partial list of the common events that need to be monitored is as follows: running out of memory space; a process dying; a process using excessive resources; I/O errors.

4. Database managers: The database manager will normally also have a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows: ability to add/remove users; user management; manipulate user quotas; assign and de-assign user profiles; ability to perform database space management.
5. Backup recovery managers: Since the data stored in a warehouse is invaluable, the need to back up and recover lost data cannot be overemphasized. There are three main features for the management of backups: scheduling, backup data tracking, and database awareness.
6. Back propagation: Back propagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer.

Q30. Explain horizontal and vertical partitioning and differentiate them.
Ans: HORIZONTAL PARTITIONING - This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Thus horizontal partitioning helps to reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
a) Partition by time into equal segments: This is the most straightforward method of partitioning, by months or years, etc. This will help if the queries often concern fortnightly or monthly performance/sales, etc.
b) Partition by time into different-sized segments: This is a very useful technique to keep the physical table small and the operating cost low.
c) Partition on another dimension: Data collection and storage need not always be partitioned based on time, though it is a very safe and relatively straightforward method.
d) Partition by the size of the table: We may not be sure of any dimension on which partitions can be made. In this case it is ideal to partition by size.

e) Using round-robin partitions: Once the warehouse is holding the full amount of data, if a new partition is required, it can be created only by reusing the oldest partition. Metadata is then needed to note the beginning and end of the historical data.
VERTICAL PARTITIONING - A vertical partitioning scheme divides the table vertically: each row is divided into two or more partitions.
i) We may not need to access all the data pertaining to a student all the time. For example, we may need either only his personal details, like age, address, etc., or only the examination details of marks scored, etc. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This will speed up access.
ii) The number of fields in a row may become inconveniently large, each field itself being made up of several subfields, etc. In such a scenario, it is always desirable to split the row into two or more smaller tables.
The vertical partitioning itself can be achieved in two different ways: (i) normalization and (ii) row splitting.
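
A toy Python illustration of the row-splitting form of vertical partitioning, using the student example above: the wide row is split into two narrower tables that share a key, so a query for personal details never reads the marks columns. All field names and values are made up.

```python
# Original wide rows (hypothetical student records).
students = [
    {"student_id": 1, "name": "Asha", "age": 20, "address": "Pune",
     "sub1_marks": 71, "sub2_marks": 64},
    {"student_id": 2, "name": "Ravi", "age": 21, "address": "Delhi",
     "sub1_marks": 55, "sub2_marks": 80},
]

# Row splitting: each row is divided into two partitions that share the key.
personal = [{k: r[k] for k in ("student_id", "name", "age", "address")}
            for r in students]
marks = [{k: r[k] for k in ("student_id", "sub1_marks", "sub2_marks")}
         for r in students]

# A query about addresses now scans only the narrow 'personal' partition.
print([r["address"] for r in personal])
```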

32. Explain the steps needed for designing the summary table.
Ans: Summary tables are designed by the following steps:

i. Decide the dimensions along which aggregation is to be done.

ii. Determine the aggregation of multiple facts.

iii. Aggregate multiple facts into the summary table.

iv. Determine the level of aggregation and the extent of embedding.

v. Design time into the table.
vi. Index the summary table.

33. What are the reasons for data marting? Mention their advantages and disadvantages.
Ans: There are many reasons for data marting: since the volume of data scanned is small, data marts speed up query processing; data can be structured in a form suitable for user access; and data can be segmented or partitioned so that it can be used on different platforms, and different control strategies become applicable.

34. Define data mining query in terms of primitives.
Ans: A data mining query is defined in terms of the following primitives:
i. Task-relevant data: This is the portion of the database to be investigated.

ii. The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering and evolution analysis.
iii. Background knowledge: Users can specify background knowledge, or knowledge about the domain to be mined, which is useful for guiding the KDD process.
iv. Interestingness measures: These functions are used to separate uninteresting patterns from knowledge. Different kinds of knowledge may have different interestingness measures.
v. Presentation and visualization of discovered patterns: Users can choose from different forms for knowledge presentation, such as rules, tables, charts, graphs, decision trees and cubes.

35. What are the implementation steps of data mining with Apriori analysis, and how can the efficiency of this algorithm be improved?
Ans: Implementation steps:
i) The Apriori algorithm analyzes all the transactions in a dataset for each item's support count. Any item whose support count is less than the minimum support count is removed from the pool of candidates.
ii) Initially, every item is a member of the set of 1-candidate itemsets. The support count of each candidate itemset is calculated, and items with a support count less than the minimum required support count are removed as candidates; the remaining items are joined to create 2-candidate itemsets that each comprise two items or members.
iii) The support count of each two-member itemset is calculated from the database of transactions, and the 2-member itemsets that occur with a support count greater than or equal to the minimum support count are used to create 3-candidate itemsets. The process in the previous steps is repeated, generating 4- and 5-candidate itemsets.
iv) All candidate itemsets generated with a support count greater than the minimum support count form the set of frequent itemsets.
v) Apriori recursively generates all the subsets of each frequent itemset and creates association rules based on subsets with a confidence greater than the minimum confidence.
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm:
a) Hash-based technique: It is used to reduce the size of the candidate k-itemsets.
b) Transaction reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
c) Sampling: In this way, we trade off some degree of accuracy against efficiency.

d) Dynamic itemset counting: This technique was proposed in which the database is partitioned into blocks marked by start points.

36. Explain multidimensional schemas.
Ans: This is a very convenient method of analyzing data when it goes beyond the normal tabular relations. For example, a store maintains a table of each item it sells over a month in each of its 10 outlets; this is a 2-dimensional table. On the other hand, if the company wants data on all items sold by its outlets, this can be done simply by superimposing the 2-dimensional tables for each of these items, one behind the other. Then it becomes a 3-dimensional view, and a query, instead of looking for a 2-dimensional rectangle of data, will look for a 3-dimensional cuboid of data. There is no reason why the dimensioning should stop at 3 dimensions. In fact, almost all queries can be thought of as accessing a multidimensional unit of data from a multidimensional volume of the schema. A lot of design effort goes into optimizing such searches.
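
A minimal Python sketch of the 3-dimensional view described above: cells are addressed by (item, outlet, month), and a query asks for a sub-cuboid rather than a 2-dimensional slice. The dimension values and cell counts are invented.

```python
# A cube stored sparsely as {(item, outlet, month): units_sold} (hypothetical data).
cube = {
    ("soap", "outlet1", "2014-01"): 40,
    ("soap", "outlet2", "2014-01"): 25,
    ("soap", "outlet1", "2014-02"): 35,
    ("tea",  "outlet1", "2014-01"): 60,
}

def sub_cuboid(item=None, outlet=None, month=None):
    """Return the cells matching the fixed dimensions; None leaves a dimension free."""
    def match(key):
        i, o, m = key
        return ((item is None or i == item) and
                (outlet is None or o == outlet) and
                (month is None or m == month))
    return {k: v for k, v in cube.items() if match(k)}

# A 2-dimensional slice: all sales of soap (item fixed, outlet and month free).
print(sub_cuboid(item="soap"))
```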