
August 2012 Master of Computer Application (MCA) – Semester 6

MC0088 – Data Mining – 4 Credits (Book ID: B1009)

Assignment Set – 1 (60 Marks). Answer all questions. Each question carries TEN marks.

1. What is operational intelligence?

Operational intelligence (OI) is a form of real-time, dynamic business analytics that delivers visibility and insight into business operations. Operational intelligence solutions run query analysis against live feeds and event data to deliver real-time, actionable information. This real-time information can be acted upon in a variety of ways: alerts can be sent or executive decisions can be made using real-time dashboards.

Purpose

The purpose of OI is to monitor business activities and to identify and detect situations relating to inefficiencies, opportunities, and threats. Some definitions describe operational intelligence as an event-centric approach to delivering information that empowers people to make better decisions. OI helps quantify:

the efficiency of the business activities
how the IT infrastructure and unexpected events affect the business activities (resource bottlenecks, system failures, events external to the company, etc.)
how the execution of the business activities contributes to revenue gains or losses.

This is achieved by observing the progress of the business activities, computing several metrics in real time using these progress events, and publishing the metrics to one or more channels (e.g., a dashboard that can display the metrics as charts and graphs, autonomic software that can receive these updates and fine-tune the processes in real time, or email, mobile, and messaging systems that can notify users). Thresholds can also be placed on these metrics to create notifications or new events. In addition, these metrics act as the starting point for further analysis (drilling down into details, performing root cause analysis, and tying anomalies to specific transactions and steps of the business activity).

Sophisticated OI systems also provide the ability to associate metadata with metrics, process steps, channels, etc. With this, it becomes easy to get related information, e.g., "retrieve the contact information of the person that manages the application that executed the step in the business transaction that took 60% more time than the norm," "view the acceptance/rejection trend for the customer who was denied approval in this transaction," or "launch the application that this process step interacted with."
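To make the metric-and-threshold idea concrete, the following is a minimal Python sketch (not from the course material): it maintains a rolling average of step latencies computed from incoming progress events and publishes a notification when an assumed threshold is crossed. The event fields, the window size, and the threshold value are all illustrative assumptions.

# Illustrative sketch: compute a real-time metric from progress events and
# raise a notification when it crosses a threshold. All values are made up.
from collections import deque
from statistics import mean

WINDOW_SECONDS = 300          # rolling window over which the metric is computed
LATENCY_THRESHOLD_MS = 2000   # threshold that triggers a notification

events = deque()              # (timestamp, latency_ms) pairs inside the window

def on_event(timestamp, latency_ms, notify):
    """Update the rolling metric with one progress event and check the threshold."""
    events.append((timestamp, latency_ms))
    # drop events that have fallen out of the window
    while events and events[0][0] < timestamp - WINDOW_SECONDS:
        events.popleft()
    avg_latency = mean(lat for _, lat in events)
    # publish to a channel (here a callback standing in for a dashboard or email)
    if avg_latency > LATENCY_THRESHOLD_MS:
        notify(f"Average step latency {avg_latency:.0f} ms exceeds {LATENCY_THRESHOLD_MS} ms")
    return avg_latency

# usage: feed events as they arrive
on_event(100, 1800, print)
on_event(130, 2600, print)    # pushes the rolling average above the threshold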

Features

Different operational intelligence solutions may use many different technologies and be implemented in different ways. This section lists the common features of an operational intelligence solution:

Real-time monitoring
Real-time situation detection
Real-time dashboards for different user roles
Correlation of events
Industry-specific dashboards
Multidimensional analysis
Root cause analysis
Time series and trending analysis

Comparison

OI is often linked to or compared with business intelligence (BI) or real-time business intelligence, in the sense that both help make sense out of large amounts of information. But there are some basic differences: OI is primarily activity-centric, whereas BI is primarily data-centric. (As with most technologies, each of these could be sub-optimally coerced to perform the other's task.) OI is, by definition, real-time, unlike BI, which is traditionally an after-the-fact and report-based approach to identifying patterns, and unlike real-time BI, which relies on a database as the sole source of events.


2. What is Business Intelligence? Explain the components of BI architecture.

Business intelligence is actually an environment in which business users receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data, business users are able to conduct analyses that yield an overall understanding of where the business has been, where it is now and where it will be in the near future. Business intelligence serves two main purposes: it monitors the financial and operational health of the organization (through reports, alerts, alarms, analysis tools, key performance indicators and dashboards), and it regulates the operation of the organization, providing two-way integration with operational systems and information feedback analysis. There are various definitions given by the experts; some of them are given below:

Converting data into knowledge and making it available throughout the organization are the jobs of processes and applications known as Business Intelligence.

BI is a term that encompasses a broad range of analytical software and solutions for gathering, consolidating, analyzing and providing access to information in a way that is supposed to let the users of an enterprise make better business decisions.

Business Intelligence Infrastructure

Business organizations can gain a competitive advantage with a well-designed business intelligence (BI) infrastructure. Think of the BI infrastructure as a set of layers that begin with the operational systems' information and metadata and end in the delivery of business intelligence to various business user communities. See Fig. 3.1 & Fig. 3.2.

Based on the overall requirements of business intelligence, the data integration layer is required to extract, cleanse and transform data into load files for the information warehouse.

This layer begins with transaction-level operational data and metadata about these operational systems.

Typically this data integration is done using a relational staging database and utilizing flat file extracts from source systems.

The product of a good data-staging layer is high-quality data, a reusable infrastructure and metadata supporting both business and technical users.

The information warehouse is usually developed incrementally over time and is architected to include key business variables and business metrics in a structure that meets all business analysis questions required by the business groups.


Fig 3.1: BI infrastructure

1. The information warehouse layer consists of relational and/or OLAP cube services that allow business users to gain insight into their areas of responsibility in the organization.

2. Customer Intelligence relates to customer, service, sales and marketing information viewed along time periods, location/geography, and product and customer variables.

3. Business decisions that can be supported with customer intelligence range from pricing, forecasting, promotion strategy and competitive analysis to up-sell strategy and customer service resource allocation.

4. Operational Intelligence relates to finance, operations, manufacturing, distribution, logistics and human resource information viewed along time periods, location/geography, product, project, supplier, carrier and employee.

5. The most visible layer of the business intelligence infrastructure is the applications layer, which delivers the information to business users.

6. Business intelligence requirements include scheduled report generation and distribution, query and analysis capabilities to pursue special investigations and graphical analysis permitting trend identification. This layer should enable business users to interact with the information to gain new insight into the underlying business variables to support business decisions.


7. Presenting business intelligence on the Web through a portal is gaining considerable momentum. Portals are usually organized around communities of users such as suppliers, customers, employees and partners.

8. Portals can reduce the overall infrastructure costs of an organization as well as deliver great self-service and information access capabilities.

9. Web-based portals are becoming commonplace as a single personalized point of access for key business information.


3. Differentiate between database management systems (DBMS) and data mining.

A DBMS is a "Database Management System". This is the software that manages data on physical storage devices. The software provides the ability to store, access and modify the data, and it also provides a suite of utilities to manage and monitor the performance of those actions against the data. Examples of DBMSs in the relational (RDBMS) world are Oracle, SQL Server, DB2 and Informix.

Data Mining: A hot buzzword for a class of database applications that look for hidden patterns in a group of data. For example, data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that presents data in new ways. True data mining software doesn't just change the presentation, but actually discovers previously unknown relationships among the data.

Table 5.1: Differences between Database Management Systems (DBMS) and Data Mining.

The nexus between data warehouse and data mining is indisputable. Popular business organizations use these technologies together. The current section describes the relation between data warehouse and data mining. Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases. Data mining has emerged as a key business intelligence technology. Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization. The aim of data mining is to extract implicit, previously unknown and potentially useful (or actionable) patterns from data. Data mining consists of many up-to-date techniques such as classification (decision trees, naïve Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel, and constraint-based association). Many years of practice show that data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summary, presentation), a good understanding of problem domains, and domain expertise. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. A data warehouse is a relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product. It is an architectural construct of information that is hard to access or present in traditional operational data stores.


4. What is Neural Network? Explain in detail.

An Artificial Neural Network (ANN) is an information-processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information.

The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well.

Neural Networks are made up of many artificial neurons. An artificial neuron is simply an electronically modelled biological neuron. How many neurons are used depends on the task at hand. It could be as few as three or as many as several thousand. One optimistic researcher has even hard wired 2 million neurons together in the hope he can come up with something as intelligent as a cat although most people in the AI community doubt he will be successful (Update: he wasn't!). There are many different ways of connecting artificial neurons together to create a neural network.

There are different types of Neural Networks, each of which has different strengths particular to their applications. The abilities of different networks can be related to their structure, dynamics and learning methods.

Basic structure of Artificial Neurons
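In place of the figure that accompanied this section, the following Python fragment gives a minimal, illustrative model of a single artificial neuron (a sketch, not code from the course text): inputs are combined by a weighted sum plus a bias and passed through a non-linear activation function. The input values, weights, and bias are made-up examples.

# Minimal illustrative artificial neuron: weighted sum + bias, then a sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Weighted sum of inputs followed by a non-linear activation."""
    activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(activation)

# example call with invented inputs and weights
print(neuron(inputs=[0.5, 0.1, 0.9], weights=[0.4, -0.6, 0.2], bias=0.1))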


5. What is partition algorithm? Explain with the help of suitable example.

The partition algorithm is based on the observation that the frequent sets are normally very few in number compared to the set of all itemsets. As a result, if we partition the set of transactions into smaller segments such that each segment can be accommodated in main memory, then we can compute the set of frequent sets of each of these partitions. It is assumed that these sets (the sets of local frequent sets) contain a reasonably small number of itemsets. Hence, we can read the whole database (the unsegmented one) once, to count the support of the set of all local frequent sets.

The partition algorithm uses two scans of the database to discover all frequent sets. In one scan, it generates a set of all potentially frequent itemsets by scanning the database once. This set is a superset of all frequent itemsets, i.e., it may contain false positives; but no false negatives are reported. During the second scan, counters for each of these itemsets are set up and their actual support is measured in one scan of the database.

The algorithm executes in two phases. In the first phase, the partition algorithm logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time and all frequent itemsets for that partition are generated. Thus, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of Phase I, these frequent itemsets are merged to generate a set of all potential frequent itemsets. In this step, the local frequent itemsets of the same length from all n partitions are combined to generate the global candidate itemsets. In Phase II, the actual support for these itemsets is computed and the frequent itemsets are identified. The algorithm reads the entire database once during Phase I and once during Phase II. The partition sizes are chosen such that each partition can be accommodated in main memory, so that the partitions are read only once in each phase.

A partition P of the database refers to any subset of the transactions contained in the database. Any two partitions are non-overlapping. We define the local support for an itemset as the fraction of the transactions in a partition that contain that particular itemset. We define a local frequent itemset as an itemset whose local support in a partition is at least the user-defined minimum support. A local frequent itemset may or may not be frequent in the context of the entire database.

Partition Algorithm

P = partition_database(T)
n = Number of partitions
// Phase I
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    Li = all frequent itemsets of Ti, generated using the a priori method in main memory
end
// Merge Phase
for (k = 2; Lik ≠ ∅ for some i = 1, 2, …, n; k++) do begin
    CGk = ∪i=1..n Lik
end
// Phase II
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    for all candidates c ∈ CG compute s(c)Ti
end
LG = {c ∈ CG | Σi s(c)Ti ≥ σ}
Answer = LG

The partition algorithm is based on the premise that the size of the global candidate set is considerably smaller than the set of all possible itemsets. The intuition behind this is that the size of the global candidate set is bounded by n times the size of the largest of the sets of locally frequent sets. For sufficiently large partition sizes, the number of local frequent itemsets is likely to be comparable to the number of frequent itemsets generated for the entire database. If the data characteristics are uniform across partitions, then large numbers of itemsets generated for individual partitions may be common.
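A compact Python sketch of the same two-phase idea follows. It is illustrative only: each partition is mined with a brute-force local frequent-itemset routine instead of a full a priori implementation, and the small example database at the end is invented.

# Illustrative sketch of the two-phase partition idea (not the book's code).
from itertools import combinations
from collections import Counter

def local_frequent_itemsets(partition, min_support):
    """Phase I helper: all itemsets whose local support meets min_support."""
    counts = Counter()
    for transaction in partition:
        for k in range(1, len(transaction) + 1):
            for itemset in combinations(sorted(transaction), k):
                counts[itemset] += 1
    threshold = min_support * len(partition)
    return {itemset for itemset, c in counts.items() if c >= threshold}

def partition_algorithm(transactions, n_partitions, min_support):
    size = -(-len(transactions) // n_partitions)            # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Phase I + merge: the union of local frequent itemsets is the global candidate set
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_itemsets(part, min_support)

    # Phase II: one more scan of the whole database to count actual supports
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    threshold = min_support * len(transactions)
    return {c for c in candidates if counts[c] >= threshold}

# usage with a tiny made-up database and a 20% minimum support
db = [{1, 5, 6}, {2, 3}, {4, 7}, {2, 4, 8}, {1, 8},
      {2, 6, 9}, {3, 5, 7}, {2, 7}, {6, 7, 9}, {3, 4}]
print(partition_algorithm(db, n_partitions=2, min_support=0.2))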

Example

Let us take the same database T, given in Example 6.2, and the same σ. Let us partition, for the sake of illustration, T into three partitions T1, T2, and T3, each containing 5 transactions. The first partition T1 contains transactions 1 to 5, T2 contains transactions 6 to 10 and, similarly, T3 contains transactions 11 to 15. We fix the local supports as equal to the given support, that is, 20%. Thus, σ1 = σ2 = σ3 = σ = 20%. Any itemset that appears in even a single transaction of a partition is a local frequent set in that partition.

The local frequent sets of the T1 partition are the itemsets X such that s(X)T1 ≥ σ1.

L1:= {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {1,5}, {1,6}, {1,8}, {2,3}, {2,4}, {2,8}, {4,5}, {4,7}, {4,8}, {5,6}, {5,8}, {5,7}, {6,7}, {6,8}, {1,6,8},{1,5,6}, {1,5,8}, {2,4,8}, {4,5,7}, {5,6,8},{5,6,7},{1,5,6,8}}

Similarly,

L2:= {{2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {2,3}, {2,4}, {2,6}, {2,7}, {2,9}, {3,4}, {3,5}, {3,7}, {5,7}, {6,7}, {6,9}, {7,9}, {2,3,4}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {2,6,7,9}}


L3:= {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,7}, {2,3}, {2,4}, {2,6},{2,7}, {2,9}, {3,5}, {3,7}, {3,9}, {4,6}, {4,7}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {1,3,5}, {1,3,7}, {1,5,7}, {2,3,9}, {2,4,6}, {2,4,7}, {3,5,7}, {4,6,7}, {5,6,8}, {1,3,5,7}, {2,4,6,7}}

In Phase II, we have the candidate set as

C := L1 ∪ L2 ∪ L3

C := {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {1,3}, {1,5}, {1,6}, {1,7}, {1,8}, {2,3}, {2,4}, {2,6}, {2,7}, {2,8}, {2,9}, {3,4}, {3,5}, {3,7}, {3,9}, {4,5}, {4,6}, {4,7}, {4,8}, {5,6}, {5,7}, {5,8}, {6,7}, {6,8}, {6,9}, {7,9}, {1,3,5}, {1,3,7}, {1,5,6}, {1,5,7}, {1,5,8}, {1,6,8}, {2,3,4}, {2,3,9}, {2,4,6}, {2,4,7}, {2,4,8}, {2,6,7}, {2,6,9}, {2,7,9}, {3,5,7}, {4,5,7}, {4,6,7}, {5,6,7}, {5,6,8}, {1,3,5,7}, {1,5,6,8}, {2,4,6,7}, {2,6,7,9}}

Read the database once to compute the global support of the sets in C and get the final set of frequent sets.


6. Describe the following with respect to Web Mining:

a. Categories of Web Mining

Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. Web mining can be broadly divided into three categories:

1. Web Content Mining

2. Web Structure Mining

3. Web Usage Mining.

All of the three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the web. Each of them focuses on different mining objects of the web.

Fig 9.1 shows the web categories and their objects. Also provided are the brief introductions about each of the categories.

Content mining is used to search, collate and examine data by search engine algorithms (this is done by using Web Robots). Structure mining is used to examine the structure of a particular website and collate and analyze related data. Usage mining is used to examine data related to the client end, such as the profiles of the visitors of the website, the browser used, the specific time and period that the site was being surfed, the specific areas of interests of the visitors to the website, and related data from the form data submitted during web transactions and feedback.


Web Content Mining

Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, which are embedded in or linked to web pages. It is also quite different from data mining because web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, while text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques as well as its own unique approaches. In the past few years there has been a rapid expansion of activity in the Web content mining area. This is not surprising because of the phenomenal growth of Web contents and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web Structure Mining

Web structure mining focuses on analysis of the link structure of the Web, and one of its purposes is to identify more preferable documents. The different objects are linked in some way. The intuition is that a hyperlink from document A to document B implies that the author of document A thinks document B contains worthwhile information. Web structure mining helps in discovering similarities between web sites, discovering important sites for a particular topic or discipline, or discovering web communities. Simply applying the traditional processes and assuming that the events are independent can lead to wrong conclusions. However, appropriate handling of the links could reveal potential correlations and thereby improve the predictive accuracy of the learned models. The goal of Web structure mining is to generate a structural summary about the Web site and Web page. Technically, Web content mining mainly focuses on the inner-document structure, while Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web structure mining categorizes the Web pages and generates information such as the similarity and relationship between different Web sites.


Web structure mining can also have another direction: discovering the structure of the Web document itself. This type of structure mining can be used to reveal the structure (schema) of Web pages; this would be good for navigation purposes and would make it possible to compare and integrate Web page schemas. This type of structure mining will facilitate introducing database techniques for accessing information in Web pages by providing a reference schema.

Web Usage Mining

Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. Web usage mining, which discovers user navigation patterns from web data, tries to extract useful information from the secondary data derived from the interactions of users while surfing the Web. It collects data from Web log records to discover user access patterns of web pages. There are several available research projects and commercial tools that analyze those patterns for different purposes. The resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence and usage characterization. The only information left behind by many users visiting a Web site is the path through the pages they have accessed. Most Web information retrieval tools use only the textual information, while they ignore the link information that could be very valuable.

In general, there are mainly four kinds of data mining techniques applied to the web mining domain to discover the user navigation pattern:

Association rule mining
Sequential pattern mining
Clustering
Classification
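As a small illustration of the usage-mining idea (a Python sketch with invented log records, not a tool from the text), the fragment below groups a simplified clickstream into per-session navigation paths and counts page-to-page transitions, an elementary form of sequential-pattern analysis.

# Illustrative sketch: derive navigation paths from simplified web-log records
# and count frequent page-to-page transitions. All records are made up.
from collections import defaultdict, Counter

# each record: (session_id, timestamp, requested_page)
log = [
    ("u1", 1, "/home"), ("u1", 2, "/products"), ("u1", 3, "/cart"),
    ("u2", 1, "/home"), ("u2", 2, "/products"), ("u2", 3, "/reviews"),
    ("u3", 1, "/home"), ("u3", 2, "/products"), ("u3", 3, "/cart"),
]

# group the clickstream by session and order it by time
paths = defaultdict(list)
for session, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
    paths[session].append(page)

# count consecutive page transitions across all sessions
transitions = Counter()
for pages in paths.values():
    for a, b in zip(pages, pages[1:]):
        transitions[(a, b)] += 1

for (a, b), n in transitions.most_common():
    print(f"{a} -> {b}: {n} sessions")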


b. Applications of Web Mining

With the rapid growth of the World Wide Web, Web mining has become a very hot and popular topic in Web research. E-commerce and E-services are claimed to be the killer applications for Web mining, and Web mining now also plays an important role in helping E-commerce websites and E-services understand how their websites and services are used and in providing better services for their customers and users.

A few applications are:

E-commerce customer behavior analysis
E-commerce transaction analysis
E-commerce website design
E-banking
M-commerce
Web advertisement
Search engines
Online auctions

c. Web Mining Software

Open source software for web mining includes RapidMiner, which provides modules for text clustering, text categorization, information extraction, named entity recognition, and sentiment analysis. RapidMiner is used, for example, in applications like automated news filtering for personalized news surveys, automated content-based document and e-mail routing, and sentiment analysis of web blogs and product reviews in internet discussion groups. Information extraction from web pages also utilizes RapidMiner to create mash-ups which combine information from various web services and web pages, and to perform web log mining and web usage mining.

SAS Data Quality Solution provides an enterprise solution for profiling, cleansing, augmenting and integrating data to create consistent, reliable information. With SAS Data Quality Solution you can automatically incorporate data quality into data integration and business intelligence projects to dramatically improve returns on your organization's strategic initiatives.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.


August 2012 Master of Computer Application (MCA) – Semester 6

MC0088 – Data Mining – 4 Credits (Book ID: B1009)

Assignment Set – 2 (60 Marks). Answer all questions. Each question carries TEN marks.

1. Explain the following with respect to Data Warehousing:

a. Data Warehouse

The construction of data warehouse, which involves data cleaning and data integration, can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitate effective data mining. Furthermore, many other data mining functions such as classification, prediction, association and clustering can be integrated with OLAP operation to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and online analytical processing and will provide an effective platform for data mining. Therefore, prior to presenting a systematic coverage of data mining technology in the remainder of this book, we devote this unit to an overview of data warehouse technology. Such an overview is essential for understanding data mining technology.

Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. A data warehouse refers to a database that is maintained separately from an organization’s operational databases. Data warehouse systems allow for the integration of a variety of application systems.

According to W. H. Inmon, a leading architect in the construction of data warehouse systems, “A data warehouse is a subject – oriented, integrated, time – variant, and nonvolatile collection of data in support of management’s decision making process”.

Let’s take a closer look at each of these key features.

Subject – oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operation and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.


Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on – line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time – variant: Data are stored to provide information from a historical perspective (e.g., the past 5 – 10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases (examples include IBM Data Joiner and Informix Data Blade). When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and competes for resources with processing at local sources. It is inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations.

b. Multidimensional Data Model

Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. In this section, you will learn how data cubes model n – dimensional data. You will also learn about concept hierarchies and how they can be used in basic OLAP operations to allow interactive mining at multiple levels of abstraction.

A data cube allows data to be modeled and viewed in multiple dimensions. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, All Electronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for item may contain the attributes item_name, brand, and type. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions.

A multidimensional data model is typically organized around a central theme, like sales, for instance. This theme is represented by a fact table. Facts are numerical measures. Think of them as the quantities by which we want to analyze relationships between dimensions. Examples of facts for a sales data warehouse include dollars_sold (sales amount in dollars), units_sold (number of units sold), and amount_budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
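The following is a small, hedged illustration of these ideas in Python (the rows and figures are invented, and pandas is used only for convenience): a tiny fact table with the measures dollars_sold and units_sold is aggregated along the time and location dimensions, the kind of roll-up a data cube precomputes.

# Illustrative sketch: a tiny fact table whose measures are aggregated along
# two dimensions. All rows below are made-up example data.
import pandas as pd

fact_sales = pd.DataFrame([
    # time,  item,           branch, location,    dollars_sold, units_sold
    ("Q1", "home theater",   "B1",  "Vancouver",  1200, 2),
    ("Q1", "computer",       "B1",  "Vancouver",   800, 1),
    ("Q2", "computer",       "B2",  "Toronto",    1600, 2),
    ("Q2", "phone",          "B2",  "Toronto",     400, 4),
], columns=["time", "item", "branch", "location", "dollars_sold", "units_sold"])

# one face of the cube: total sales by (time, location)
cube_slice = fact_sales.groupby(["time", "location"])[["dollars_sold", "units_sold"]].sum()
print(cube_slice)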


c. Data Warehouse Architecture

The Design of a Data Warehouse: A Business Analysis Framework

“What does the data warehouse provide for business analysts?” First, having a data warehouse may provide a competitive advantage by presenting relevant information from which to measure performance and make critical adjustments in order to help win over competitors. Second, a data warehouse can enhance business productivity since it is able to quickly and efficiently gather information that accurately describes the organization. Third, a data warehouse facilitates customer relationship management since it provides a consistent view of customers and items across all lines of business, all departments, and all markets. Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and exceptions over long periods of time in a consistent and reliable manner.

To design an effective data warehouse one needs to understand and analyze business needs and construct a business analysis framework. The construction of a large and complex information system can be viewed as the construction of a large and complex building, for which the owner, architect, and builder have different views. These views are combined to form a complex framework that represents the top – down, business – driven, or owner’s perspective, as well as the bottom – up, builder – driven, or implementor’s view of the information system.

Four different views regarding the design of a data warehouse must be considered: the top – down view, the data source view, the data warehouse view, and the business query view.

The top – down view allows the selection of the relevant information necessary for the data warehouse. This information matches the current and coming business needs.

The data source view exposes the information being captured, stored, and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the entity – relationship model or CASE (computer – aided software engineering) tools.

The data warehouse view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including precalculated totals and counts, as well as information regarding the source, date and time of origin, added to provide historical context.

Finally, the business query view is the perspective of data in the data warehouse from the viewpoint of the end user.


Figure 2.10: A three – tier data warehousing architecture

Data warehouses often adopt a three – tier architecture, as presented in Figure 2.10

1. The bottom tier is a warehouse database server that is almost always a relational database system. "How are the data extracted from this tier in order to create the data warehouse?" Data from operational databases and external sources (such as customer profile information provided by external consultants) are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or (2) a multidimensional OLAP (MOLAP) model, that is, a special – purpose server that directly implements multidimensional data and operations. OLAP servers are discussed in Section 2.3.3

3. The top tier is a client, which contains query and reporting tools, analysis tools, and / or data mining tools (e.g., trend analysis, prediction, and so on).
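To make the gateway idea concrete, here is a hedged Python sketch: a client program hands SQL to the warehouse server through an ODBC gateway using the pyodbc driver. The DSN name, credentials, and table name are assumptions for illustration, not details from the text.

# Hedged sketch: client-generated SQL executed at the server through an ODBC
# gateway. The DSN, user, password, and table are assumed placeholders.
import pyodbc

conn = pyodbc.connect("DSN=warehouse_dsn;UID=report_user;PWD=secret")  # assumed DSN
cursor = conn.cursor()

# the SQL is generated on the client but executed at the warehouse server
cursor.execute(
    "SELECT location, SUM(dollars_sold) FROM sales_fact GROUP BY location"
)
for location, total in cursor.fetchall():
    print(location, total)

conn.close()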

2. Describe the following algorithms of Association Rule Mining:

a. A Priori Algorithm

It is also called the level-wise algorithm. It was proposed by Agrawal and Srikant in 1994. It is the most popular algorithm for finding all the frequent sets. It makes use of the downward closure property. As the name suggests, the algorithm is a bottom-up search, moving upward level-wise in the lattice. However, the important feature of the method is that, before reading the database at every level, it prunes many of the sets that are unlikely to be frequent sets.

The first pass of the algorithm simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the a priori candidate generation procedure described below. Next, the database is scanned and the support of candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck contained in a given transaction t. The set of candidate itemsets is subjected to a pruning process to ensure that all the subsets of the candidate sets are already known to be frequent itemsets. The candidate generation process and the pruning process are the most important parts of this algorithm. We shall describe these two processes separately below.

Candidate Generation

Given Lk-1, the set of all frequent (k-1)-itemsets, we want to generate a superset of the set of all frequent k-itemsets. The intuition behind the a priori candidate-generation procedure is that if an itemset X has minimum support, so do all subsets of X. Let us assume that the set of frequent 3-itemsets is {1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}, {2, 3, 4}. Then the 4-itemsets that are generated as candidate itemsets are the supersets of these 3-itemsets; in addition, all the 3-itemset subsets of any candidate 4-itemset (so generated) must already be known to be in L3. The first part, and part of the second, is handled by the a priori candidate-generation method. The following pruning algorithm prunes those candidate sets which do not meet the second criterion. The candidate-generation method is described below.

gen_candidate_itemsets, with the given Lk-1, works as follows:

Ck = ∅
for all itemsets l1 ∈ Lk-1 do
    for all itemsets l2 ∈ Lk-1 do
        if l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1] then
            c = l1[1], l1[2], …, l1[k-1], l2[k-1]
            Ck = Ck ∪ {c}

Using this algorithm, C4 = {{1, 2, 3, 5}, {2, 3, 4, 5}} is obtained from L3 := {{1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}, {2, 3, 4}}. {1, 2, 3, 5} is generated from {1, 2, 3} and {1, 2, 5}. Similarly, {2, 3, 4, 5} is generated from {2, 3, 4} and {2, 3, 5}. No other pair of 3-itemsets satisfies the condition.

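An illustrative Python version of the join and prune steps is given below (a sketch, not the book's code). Itemsets are represented as sorted tuples so the join condition can compare all but the last item directly; the small example reuses L3 from the text.

# Illustrative a priori candidate generation and pruning.
from itertools import combinations

def gen_candidates(prev_frequent):
    """Join step: combine (k-1)-itemsets that agree on all but the last item."""
    prev = sorted(prev_frequent)
    candidates = set()
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidates.add(l1 + (l2[-1],))
    return candidates

def prune(candidates, prev_frequent):
    """Prune step: keep a candidate only if every (k-1)-subset is frequent."""
    prev = set(prev_frequent)
    return {c for c in candidates
            if all(sub in prev for sub in combinations(c, len(c) - 1))}

# example from the text: L3 yields C4 = {(1,2,3,5), (2,3,4,5)} before pruning;
# (2,3,4,5) is then pruned because, e.g., (2,4,5) is not in L3.
L3 = {(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (2, 3, 4)}
C4 = gen_candidates(L3)
print(C4)              # {(1, 2, 3, 5), (2, 3, 4, 5)}
print(prune(C4, L3))   # {(1, 2, 3, 5)}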

b. Partition Algorithm

The partition algorithm is based on the observation that the frequent sets are normally very few in number compared to the set of all itemsets. As a result, if we partition the set of transactions into smaller segments such that each segment can be accommodated in main memory, then we can compute the set of frequent sets of each of these partitions. It is assumed that these sets (the sets of local frequent sets) contain a reasonably small number of itemsets. Hence, we can read the whole database (the unsegmented one) once, to count the support of the set of all local frequent sets.

The partition algorithm uses two scans of the database to discover all frequent sets. In one scan, it generates a set of all potentially frequent itemsets by scanning the database once. This set is a superset of all frequent itemsets, i.e., it may contain false positives; but no false negatives are reported. During the second scan, counters for each of these itemsets are set up and their actual support is measured in one scan of the database.

The algorithm executes in two phases. In the first phase, the partition algorithm logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time and all frequent itemsets for that partition are generated. Thus, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of Phase I, these frequent itemsets are merged to generate a set of all potential frequent itemsets. In this step, the local frequent itemsets of the same length from all n partitions are combined to generate the global candidate itemsets. In Phase II, the actual support for these itemsets is computed and the frequent itemsets are identified. The algorithm reads the entire database once during Phase I and once during Phase II. The partition sizes are chosen such that each partition can be accommodated in main memory, so that the partitions are read only once in each phase.

A partition P of the database refers to any subset of the transactions contained in the database. Any two partitions are non-overlapping. We define the local support for an itemset as the fraction of the transactions in a partition that contain that particular itemset. We define a local frequent itemset as an itemset whose local support in a partition is at least the user-defined minimum support. A local frequent itemset may or may not be frequent in the context of the entire database.

Partition Algorithm

P = partition_database(T)
n = Number of partitions
// Phase I
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    Li = all frequent itemsets of Ti, generated using the a priori method in main memory
end
// Merge Phase
for (k = 2; Lik ≠ ∅ for some i = 1, 2, …, n; k++) do begin
    CGk = ∪i=1..n Lik
end
// Phase II
for i = 1 to n do begin
    read_in_partition(Ti ∈ P)
    for all candidates c ∈ CG compute s(c)Ti
end
LG = {c ∈ CG | Σi s(c)Ti ≥ σ}
Answer = LG

The partition algorithm is based on the premise that the size of the global candidate set is considerably smaller than the set of all possible itemsets. The intuition behind this is that the size of the global candidate set is bounded by n times the size of the largest of the sets of locally frequent sets. For sufficiently large partition sizes, the number of local frequent itemsets is likely to be comparable to the number of frequent itemsets generated for the entire database. If the data characteristics are uniform across partitions, then large numbers of itemsets generated for individual partitions may be common.


c. Pincers – Search Algorithm

One can see that the a priori algorithm operates in a bottom-up, breadth-first search method. The computation starts from the smallest frequent itemsets and moves upward till it reaches the largest frequent itemset; the number of database passes is equal to the size of the largest frequent itemset. When the frequent itemsets become long, the algorithm has to go through many iterations and, as a result, the performance decreases.

A natural way to overcome this difficulty is to somehow incorporate a bi-directional search, which takes advantage of both the bottom-up and the top-down processes. The pincer-search algorithm is based on this principle. It attempts to find the frequent itemsets in a bottom-up manner but, at the same time, it maintains a list of candidate maximal frequent itemsets. While making a database pass, it also counts the support of these candidate maximal frequent itemsets to see if any one of them is actually frequent. In that event, it can conclude that all the subsets of these frequent sets are going to be frequent and, hence, they are not verified for the support count in the next pass.

In this algorithm, in each pass, in addition to counting the supports of the candidates in the bottom-up direction, it also counts the supports of certain itemsets using a top-down approach. These itemsets form the Maximal Frequent Candidate Set (MFCS). This process helps in pruning the candidate sets very early on in the algorithm. If we find a maximal frequent set in this process, then it is recorded in the MFS, the set of maximal frequent sets.

Consider a pass k, during which item sets of size k are to be classified. If some item set that is an element of the MFCS, say X, of cardinality greater than k is found to be frequent in this pass, then all its subsets must be frequent. Therefore all of its subsets of cardinality k can be pruned from the set of candidate sets considered in the bottom – up direction in the pass. They, and their supersets, will never be candidates throughout the rest of the execution, potentially improving the performance.

Similarly, when a new infrequent itemset is found in the bottom-up direction, the algorithm will use it to update the MFCS: no element of the MFCS should contain this infrequent itemset as a subset.

The MFCS initially contains a single element, the itemset of cardinality n containing all the elements of the database. If some m 1-itemsets are found to be infrequent after the first pass (after reading the database once), the MFCS will have one element of cardinality n-m; this element is generated by removing the m infrequent items from the initial element of the MFCS. In this case, the top-down search goes down m levels in one pass. Unlike the search in the bottom-up direction, which moves up only one level per pass, the top-down search can go down many levels in one pass. This is because we may discover a maximal frequent set very early in the algorithm.

Pincer – Search Method

L0 := ∅; k := 1; C1 := {{i} | i ∈ I}; S0 := ∅;
MFCS := {{1, 2, …, n}}; MFS := ∅;
do until Ck = ∅ and Sk-1 = ∅
    read database and count supports for Ck and MFCS;
    MFS := MFS ∪ {frequent itemsets in MFCS};
    Sk := {infrequent itemsets in Ck};
    call the MFCS-gen algorithm if Sk ≠ ∅;
    call the MFS-pruning procedure;
    generate candidates Ck+1 from Ck;   // similar to a priori's generate & prune
    if any frequent itemset in Ck is removed in the MFS-pruning procedure
        call the recovery procedure to recover candidates to Ck+1;
    call the MFCS-prune procedure to prune candidates in Ck+1;
    k := k + 1;
return MFS

MFCS-gen
for all itemsets s ∈ Sk
    for all itemsets m ∈ MFCS
        if s is a subset of m
            MFCS := MFCS \ {m};
            for all items e ∈ itemset s
                if m \ {e} is not a subset of any itemset in MFCS
                    MFCS := MFCS ∪ {m \ {e}};
return MFCS

Recovery
for all itemsets l ∈ Ck
    for all itemsets m ∈ MFS
        if the first k-1 items in l are also in m    /* suppose m.itemj = l.itemk-1 */
            for i from j+1 to |m|
                Ck+1 := Ck+1 ∪ {{l.item1, l.item2, …, l.itemk, m.itemi}}

MFCS-prune
for all itemsets c in Ck+1
    if c is not a subset of any itemset in the current MFCS
        delete c from Ck+1;
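The MFCS-gen step is the least obvious part of the procedure, so a small Python rendering of just that step follows (a sketch with invented data): every MFCS element that contains the newly found infrequent itemset is replaced by its maximal subsets that avoid it.

# Illustrative MFCS-gen step: update the maximal frequent candidate set when
# one new infrequent itemset has been discovered bottom-up.
def mfcs_gen(mfcs, infrequent):
    """Remove the infrequent itemset from every MFCS element that contains it."""
    for m in list(mfcs):
        if infrequent <= m:                      # m contains the infrequent itemset
            mfcs.remove(m)
            for e in infrequent:
                reduced = m - {e}
                if not any(reduced <= other for other in mfcs):
                    mfcs.add(frozenset(reduced))
    return mfcs

# example: items 1..5, and {2, 4} has just been found infrequent
mfcs = {frozenset({1, 2, 3, 4, 5})}
print(mfcs_gen(mfcs, frozenset({2, 4})))
# expected elements: {1, 3, 4, 5} and {1, 2, 3, 5}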


3. Explain the various Categories of Web Mining along with their real time applications.

Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web.

Web mining can be broadly divided into three categories:

1. Web Content Mining

2. Web Structure Mining

3. Web Usage Mining.

All of the three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the web. Each of them focuses on different mining objects of the web.

Fig 9.1 shows the web categories and their objects. Also provided are the brief introductions about each of the categories.

Content mining is used to search, collate and examine data by search engine algorithms (this is done by using Web Robots).

Structure mining is used to examine the structure of a particular website and collate and analyze related data.

Usage mining is used to examine data related to the client end, such as the profiles of the visitors of the website, the browser used, the specific time and period during which the site was being surfed, the specific areas of interest of the visitors to the website, and related data from the form data submitted during web transactions and feedback.

Web Content Mining

Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, which are embedded in or linked to web pages.

It is also quite different from data mining because web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, while text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques as well as its own unique approaches. In the past few years there has been a rapid expansion of activity in the Web content mining area. This is not surprising because of the phenomenal growth of Web contents and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems.

Web content mining can be approached from two points of view: the agent-based approach or the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web into a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web Structure Mining

Web structure mining focuses on analysis of the link structure of the Web, and one of its purposes is to identify more preferable documents. The different objects are linked in some way. The intuition is that a hyperlink from document A to document B implies that the author of document A thinks document B contains worthwhile information. Web structure mining helps in discovering similarities between web sites, discovering important sites for a particular topic or discipline, or discovering web communities.

Simply applying the traditional processes and assuming that the events are independent can lead to wrong conclusions. However, the appropriate handling of the links could lead to potential correlations, and then improve the predictive accuracy of the learned models.


The goal of Web structure mining is to generate a structural summary about the Web site and Web page. Technically, Web content mining mainly focuses on the inner-document structure, while Web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web structure mining categorizes the Web pages and generates information such as the similarity and relationship between different Web sites.

Web structure mining can also have another direction: discovering the structure of the Web document itself. This type of structure mining can be used to reveal the structure (schema) of Web pages; this would be good for navigation purposes and would make it possible to compare and integrate Web page schemas. This type of structure mining will facilitate introducing database techniques for accessing information in Web pages by providing a reference schema.

Web Usage Mining

Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. Web usage mining, which discovers user navigation patterns from web data, tries to extract useful information from the secondary data derived from the interactions of users while surfing the Web. It collects data from Web log records to discover user access patterns of web pages. There are several available research projects and commercial tools that analyze those patterns for different purposes. The resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence and usage characterization.

The only information left behind by many users visiting a Web site is the path through the pages they have accessed. Most of the Web information retrieval tools only use the textual information, while they ignore the link information that could be very valuable.

In general, there are mainly four kinds of data mining techniques applied to the web mining domain to discover the user navigation pattern:

Association rule mining
Sequential pattern mining
Clustering
Classification


4. Define Multimedia Data Mining. Also explain the theory and applications of the same.

Multimedia Database

A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages. Multimedia database systems are increasingly common owing to the popular use of audio video equipment, digital cameras, CD-ROMs, and the Internet. Typical multimedia database systems include NASA’s EOS (Earth Observation System), various kinds of image and audio-video databases, and Internet databases.

Mining Associations in Multimedia Data

“What kinds of associations can be mined in multimedia data?” Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed:

Associations between image content and non-image content features: A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to represent sky” belongs to this category since it links the image content to the keyword sky.

Associations among image contents that are not related to spatial relationships: A rule like “If a picture contains two blue squares, then it is likely to contain one red circle as well” belongs to this category since the associations are all regarding image contents.

Associations among image contents related to spatial relationships: A rule like “If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath” belongs to this category since it associates objects in the image with spatial relationships.

To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images.
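As a toy illustration of treating each image as a transaction (a Python sketch with invented feature sets, not data from the text), the fragment below counts feature pairs that co-occur in at least half of the images, the simplest form of the non-spatial content associations described above.

# Illustrative sketch: each image is a transaction of descriptive features;
# frequently co-occurring feature pairs are counted. All data are made up.
from itertools import combinations
from collections import Counter

images = [
    {"sky", "blue", "cloud"},
    {"sky", "blue", "sea"},
    {"sky", "blue", "mountain"},
    {"grass", "tree"},
]

min_support = 0.5                      # a pair must appear in at least half the images
pair_counts = Counter()
for features in images:
    for pair in combinations(sorted(features), 2):
        pair_counts[pair] += 1

threshold = min_support * len(images)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= threshold}
print(frequent_pairs)                  # e.g. {('blue', 'sky'): 3}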

“What are the differences between mining association rules in multimedia databases versus in transaction databases?” There are some subtle differences. First, an image may contain multiple objects, each with many features such as color, shape, texture, keyword, and spatial location, so there could be many possible associations. In many cases, a feature may be considered the same in two images at a certain level of resolution but different at a finer resolution level. Therefore, it is essential to adopt a progressive resolution refinement approach. That is, we can first mine frequently occurring patterns at a relatively rough resolution level, and then focus only on those that have passed the minimum support threshold when mining at a finer resolution level. This is because the patterns that are not frequent at a rough level cannot be frequent at finer resolution levels. Such a multiresolution mining strategy substantially reduces the overall data mining cost without loss of the quality and completeness of data mining results. This leads to an efficient methodology for mining frequent itemsets and associations in large multimedia databases.
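A compact sketch of this refinement idea, assuming a hypothetical mapping from fine-grained feature labels to coarser parent labels, might look as follows; only items whose coarse version passed the support threshold are counted again at the finer level.

from collections import Counter

# Hypothetical feature labels at two resolution levels; each fine label maps
# to a coarser parent label (e.g. "navy" and "cyan" both refine "blue").
parent = {"navy": "blue", "cyan": "blue", "circle": "circle", "square": "square"}
fine_images = [{"navy", "circle"}, {"cyan", "square"}, {"navy", "circle"}]

min_support = 2

# Pass 1: count support at the coarse level by mapping every item upward.
coarse_counts = Counter(parent[i] for img in fine_images for i in img)
coarse_frequent = {i for i, c in coarse_counts.items() if c >= min_support}

# Pass 2: refine only items whose coarse version passed the threshold; a
# pattern infrequent at the rough level cannot be frequent at a finer one.
fine_counts = Counter(
    i for img in fine_images for i in img if parent[i] in coarse_frequent
)
fine_frequent = {i for i, c in fine_counts.items() if c >= min_support}
print(coarse_frequent, fine_frequent)

Because support can only shrink as labels are refined, pruning at the coarse level discards no frequent fine-level pattern.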

Second, because a picture containing multiple recurrent objects is an important feature in image analysis, recurrence of the same objects should not be ignored in association analysis. For example, a picture containing two golden circles is treated quite differently from that containing only one.

Third, there often exist important spatial relationships among multimedia objects, such as above, beneath, between, nearby, left-of, and so on. These features are very useful for exploring object associations and correlations. Spatial relationships together with other content-based multimedia features, such as color, shape, texture, and keywords, may form interesting associations.

Similarity Search in Multimedia Data

When searching for similarities in multimedia data, we can search on either the data description or the data content. For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems:

1) Description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation; and

2) Content-based retrieval systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

Description-based retrieval is labor-intensive if performed manually. If automated, the results are typically of poor quality. For example, the assignment of keywords to images can be a tricky and arbitrary task. Recent development of Web-based image clustering and classification methods has improved the quality of description-based Web image retrieval, because the text surrounding images as well as Web linkage information can be used to extract proper descriptions and to group images describing a similar theme together.

Content-based retrieval uses visual features to index images and promotes object retrieval based on feature similarity, which is highly desirable in many applications. In a content-based image retrieval system, there are often two kinds of queries:


1) Image-sample-based queries, and

2) Image feature specification queries.

Image-sample-based queries find all of the images that are similar to the given image sample. This search compares the feature vector (or signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. Based on this comparison, images that are close to the sample image are returned.
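A minimal sketch of such a query, assuming signatures have already been extracted as fixed-length numeric vectors (the file names and vectors below are hypothetical), ranks indexed images by Euclidean distance to the sample's signature.

import numpy as np

# Hypothetical precomputed signatures (e.g. short color histograms), one per
# indexed image; in a real system these come from the feature-extraction stage.
db_signatures = {
    "img_001.jpg": np.array([0.7, 0.2, 0.1]),
    "img_002.jpg": np.array([0.1, 0.8, 0.1]),
    "img_003.jpg": np.array([0.6, 0.3, 0.1]),
}

def sample_based_query(query_signature, signatures, k=2):
    """Return the k images whose signatures are closest to the query sample."""
    dists = {
        name: float(np.linalg.norm(query_signature - sig))
        for name, sig in signatures.items()
    }
    return sorted(dists, key=dists.get)[:k]

query = np.array([0.65, 0.25, 0.10])   # signature extracted from the sample image
print(sample_based_query(query, db_signatures))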

Image feature specification queries specify or sketch image features like color, texture, or shape, which are translated into a feature vector to be matched with the feature vectors of the images in the database. Content-based retrieval has wide applications, including medical diagnosis, weather prediction, TV production, Web search engines for images, and e-commerce.

Some systems, such as QBIC (Query By Image Content), support both sample-based and image feature specification queries. There are also systems that support both content-based and description-based retrieval.

Several approaches have been proposed and studied for similarity-based retrieval in image databases, based on image signature:

Color histogram–based signature: In this approach, the signature of an image includes color histograms based on the color composition of an image, regardless of its scale or orientation. This method does not contain any information about shape, image topology, or texture. Thus, two images with similar color composition but very different shapes or textures may be identified as similar, although they could be completely unrelated semantically.

Multifeature composed signature: In this approach, the signature of an image includes a composition of multiple features: color histogram, shape, image topology, and texture. The extracted image features are stored as metadata, and images are indexed based on such metadata. Often, separate distance functions can be defined for each feature and subsequently combined to derive the overall results. Multidimensional content-based search often uses one or a few probe features to search for images containing such (similar) features. It can therefore be used to search for similar images. This is the most popularly used approach in practice.

Wavelet-based signature: This approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework. This improves efficiency and reduces the need for providing multiple search primitives (unlike the second method above). However, since this method computes a single signature for an entire image, it may fail to identify images containing similar objects where the objects differ in location or size.

Wavelet-based signature with region-based granularity: In this approach, the computation and comparison of signatures are at the granularity of regions, not the entire image. This is based on the observation that similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other. Therefore, a similarity measure between the query image Q and a target image T can be defined in terms of the fraction of the area of the two images covered by matching pairs of regions from Q and T. Such a region-based similarity search can find images containing similar objects, where these objects may be translated or scaled.
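As a small illustration, assuming a hypothetical region matcher has already produced the matching region pairs and their areas, the similarity measure reduces to a covered-area fraction.

def region_similarity(area_q, area_t, matched_pairs):
    """Fraction of the combined image area covered by matching region pairs.

    matched_pairs: list of (area_of_region_in_Q, area_of_region_in_T) for
    regions judged similar after allowing translation or scaling (hypothetical
    output of a region matcher).
    """
    covered = sum(aq + at for aq, at in matched_pairs)
    return covered / (area_q + area_t)

# Two matched region pairs cover 3000+2500 pixels in Q and 2800+2600 pixels in T.
print(region_similarity(
    area_q=10000, area_t=12000,
    matched_pairs=[(3000, 2800), (2500, 2600)],
))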

Multidimensional Analysis of Multimedia Data

“Can we construct a data cube for multimedia data analysis?” To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.
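As a rough sketch of one cuboid of such a cube, assuming per-image descriptors (dominant color, size category, keyword) have already been extracted, a simple group-by aggregation counts images along two multimedia dimensions.

import pandas as pd

# Hypothetical per-image descriptors; a multimedia data cube would treat
# columns like these as dimensions and aggregate measures over them.
images = pd.DataFrame({
    "dominant_color": ["blue", "blue", "green", "blue"],
    "size_category":  ["large", "small", "large", "large"],
    "keyword":        ["sky", "sea", "forest", "sky"],
})

# One cuboid of the cube: number of images per (dominant_color, size_category).
cuboid = images.groupby(["dominant_color", "size_category"]).size()
print(cuboid)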

Let’s examine a multimedia data mining system prototype called Multimedia Miner, which extends the DBMiner system by handling multimedia data. The example database tested in the Multimedia Miner system is constructed as follows. Each image contains two descriptors: a feature descriptor and a layout descriptor. The original image is not stored directly in the database; only its descriptors are stored. The description information encompasses fields like image file name, image URL, image type (e.g., gif, tiff, jpeg, mpeg, bmp, avi), a list of all known Web pages referring to the image (i.e., parent URLs), a list of keywords, and a thumbnail used by the user interface for image and video browsing.

The feature descriptor is a set of vectors for each visual characteristic. The main vectors are a color vector containing the color histogram quantized to 512 colors (8x8x8 for RxGxB), an MFC (Most Frequent Color) vector, and an MFO (Most Frequent Orientation) vector. The MFC and MFO vectors contain five color centroids and five edge-orientation centroids for the five most frequent colors and the five most frequent orientations, respectively. The edge orientations used are 0°, 22.5°, 45°, 67.5°, 90°, and so on.

The layout descriptor contains a color layout vector and an edge layout vector. Regardless of their original size, all images are assigned an 8x8 grid. The most frequent color for each of the 64 cells is stored in the color layout vector, and the number of edges for each orientation in each of the cells is stored in the edge layout vector. Other sizes of grids, like 4x4, 2x2, and 1x1, can easily be derived.
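A minimal sketch of these two color descriptors, assuming the image is available as an H x W x 3 RGB NumPy array, quantizes each channel to 8 levels for the 512-bin color histogram and records the most frequent quantized color in each cell of an 8x8 grid. The MFC, MFO, and edge-layout vectors would be derived analogously and are omitted here for brevity.

import numpy as np

def color_descriptors(img):
    """img: H x W x 3 uint8 RGB array; returns (512-bin histogram, 8x8 color layout)."""
    q = (img // 32).astype(np.int64)                      # quantize each channel to 8 levels (0..7)
    codes = q[..., 0] * 64 + q[..., 1] * 8 + q[..., 2]    # one color code in 0..511 per pixel

    histogram = np.bincount(codes.ravel(), minlength=512)

    h, w = codes.shape
    layout = np.zeros((8, 8), dtype=int)                  # most frequent color code per grid cell
    for i in range(8):
        for j in range(8):
            cell = codes[i * h // 8:(i + 1) * h // 8, j * w // 8:(j + 1) * w // 8]
            layout[i, j] = np.bincount(cell.ravel()).argmax()
    return histogram, layout

demo = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3)).astype(np.uint8)
hist, layout = color_descriptors(demo)
print(hist.sum(), layout.shape)                           # 4096 pixels counted, (8, 8) grid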


5. Describe agent based approach to web mining.

The agent-based approach to Web mining involves the development of sophisticated AI (Artificial Intelligence) systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Generally, the agent-based Web mining systems can be placed into the following three categories.

1. Intelligent Search Agents

Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information. For example, agents such as Harvest, FAQ-Finder, Information Manifold, OCCAM, and Parasite rely either on pre-specified, domain-specific information about particular types of documents or on hard-coded models of the information sources to retrieve and interpret documents. Other agents, such as ShopBot and ILA (Internet Learning Agent), attempt to interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA, on the other hand, learns models of various information sources and translates these into its own internal concept hierarchy.

2. Information Filtering/Categorization

A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. For example, HyPursuit uses semantic information embedded in link structures as well as document content to create cluster hierarchies of hypertext documents, and structure an information space. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.

3. Personalized Web Agents

Another category of Web agents includes those that obtain or learn user preferences and discover Web information sources corresponding to these preferences, and possibly to those of other individuals with similar interests (using collaborative filtering). Recent examples of such agents include WebWatcher, Syskill & Webert, and others. For example, Syskill & Webert is a system that utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
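The idea of rating pages with a Bayesian classifier can be sketched with scikit-learn's CountVectorizer and MultinomialNB on toy page texts; this illustrates only the general technique, not the actual Syskill & Webert implementation, and the texts and ratings are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: page texts the user has already rated.
pages = [
    "data mining algorithms association rules clustering",
    "web usage mining navigation patterns logs",
    "celebrity gossip fashion news",
    "sports scores football results",
]
ratings = ["interesting", "interesting", "boring", "boring"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(pages)

model = MultinomialNB()
model.fit(X, ratings)

# Rate an unseen page by the learned word probabilities.
new_page = ["frequent pattern mining on web logs"]
print(model.predict(vectorizer.transform(new_page)))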


6. Explain how Data Mining is useful in telecommunications.

Data Mining flourishes in telecommunications due to the availability of vast quantities of high-quality data. A significant stream consists of call records collected at network switches and used primarily for billing; it enables Data Mining applications in toll-fraud detection and consumer marketing. Perhaps the best-known marketing application of Data Mining, albeit via unconfirmed anecdote, concerns MCI’s “Friends & Family” promotion launched in the domestic US market in 1991. As the anecdote goes, market researchers observed relatively small subgraphs in this long-distance phone company’s large call graph of network activity, revealing the promising strategy of adding entire calling circles to the company’s subscriber base, rather than the traditional and costly approach of seeking individual customers one at a time. Indeed, MCI increased its domestic US market share in the succeeding years by exploiting the “viral” nature of calling circles: one infected member causes others to become infected. Interestingly, the promotion was abandoned some years later (it has not been available since 1997), possibly because the virus had run its course, but more likely due to other competitive forces.
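The calling-circle idea can be sketched as extracting connected components from an undirected call graph built from (caller, callee) records; the records below are toy data, and a production call graph would require specialised, scalable graph infrastructure.

from collections import defaultdict

# Toy call records (caller, callee).
calls = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "A")]

adjacency = defaultdict(set)
for caller, callee in calls:
    adjacency[caller].add(callee)
    adjacency[callee].add(caller)

# Depth-first search over the undirected call graph; each connected
# component is a candidate "calling circle" to target as a group.
def calling_circles(adj):
    seen, circles = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, circle = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            circle.add(node)
            stack.extend(adj.get(node, set()) - seen)
        circles.append(circle)
    return circles

print(calling_circles(adjacency))   # e.g. [{'A', 'B', 'C'}, {'D', 'E'}]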

In toll-fraud detection, Data Mining has been instrumental in completely changing the landscape of how anomalous behaviors are detected. Nearly all fraud detection systems in the telecommunications industry ten years ago were based on global threshold models; they can be expressed as rule sets of the form “If a customer makes more than X calls per hour to country Y, then apply treatment Z.” The placeholders X, Y, and Z are parameters of these rule sets, which are applied to all customers. Given the range of telecommunication customers, blanket application of these rules produces many false positives. Data Mining methods for customized monitoring of land and mobile phone lines were subsequently developed by leading service providers, including AT&T, MCI, and Verizon, whereby each customer’s historic calling patterns are used as a baseline against which all new calls are compared. So, for customers routinely calling country Y more than X times a day, such alerts would be suppressed, but if they ventured to call a different country Y′, an alert might be generated.
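The contrast between global thresholds and customized monitoring can be sketched as follows; the baselines, thresholds, and deviation factor are hypothetical, and a real system would model calling behaviour far more richly (time of day, call duration, destination mix, and so on).

# Hypothetical historical baselines: average daily calls by (customer, country).
baseline = {
    ("cust_1", "Y"): 12.0,   # routinely calls country Y heavily
    ("cust_2", "Y"): 0.1,    # almost never calls country Y
}

GLOBAL_THRESHOLD = 5.0       # the old blanket "more than X calls to country Y" rule
DEVIATION_FACTOR = 3.0       # alert only when today's volume far exceeds the baseline

def customized_alert(customer, country, calls_today):
    """Alert only if today's volume is well above this customer's own baseline."""
    usual = baseline.get((customer, country), 0.0)
    return calls_today > max(GLOBAL_THRESHOLD, DEVIATION_FACTOR * usual)

print(customized_alert("cust_1", "Y", 20))   # False: heavy calling is this customer's norm
print(customized_alert("cust_2", "Y", 20))   # True: sharply above this customer's baseline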

Methods of this type were presumably in place for the credit card industry a few years before emerging in telecom. But the size of the transaction streams is far greater in telecom, necessitating new approaches to the problem. It is expected that algorithms based on call-graph analysis and customized monitoring will become more prevalent in both toll-fraud detection and marketing of telecommunications services. The emphasis on so-called “relational data” is an emerging area for Data Mining research, and telecom provides relational data of unprecedented size and scope.


These applications are enabled by data from the billing stream. As the industry transforms itself from a circuit-switched to a packet-switched paradigm, the Data Mining community could well experience a dearth of data, since billing is likely to become increasingly insensitive to usage. Moreover, the number of records that could potentially be recorded in a packet-switched network (such as packet headers) is orders of magnitude greater than in today’s circuit-switched networks. Thus, unless a compelling business need is identified, the cost of collecting, transmitting, parsing, and storing this data will be too great for the industry to willingly accept. A dearth of data could well spell the end of significant future Data Mining innovations in telecommunications.