data mining by chris grav lee

Data Mining Slides by Chris Gravlee

Upload: umeshkathuria

Post on 01-Oct-2015

TRANSCRIPT

  • Data Mining

    Slides by Chris Gravlee

  • Overview

      • Sometimes called data discovery or knowledge discovery.
      • The process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both.
      • Users are able to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.
      • Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

  • Data

    Data are any facts, numbers, or text that can be processed by a computer. Organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

      • Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
      • Nonoperational data, such as industry sales, forecast data, and macroeconomic data
      • Metadata: data about the data itself, such as logical database design or data dictionary definitions

  • Information

    The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

  • Knowledge

    Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

  • What is Data Mining?

    Data mining is the process of extracting hidden knowledge from large volumes of raw data. The importance of collecting data that reflect our business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies. However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about the system we study from the collected data. Human analysts with no special tools can no longer make sense of the enormous volumes of data that must be processed in order to make informed business decisions. Data mining automates the process of finding relationships and patterns in raw data and delivers results that can be either used in an automated decision support system or assessed by a human analyst.

    Data mining can help answer questions such as:

      • What goods should be promoted to this customer?
      • What is the probability that a certain customer will respond to a planned promotion?
      • Can one predict the most profitable securities to buy or sell during the next trading session?
      • Will this customer default on a loan or pay it back on schedule?
      • What medical diagnosis should be assigned to this patient?
      • How large will the peak loads of a telephone or energy network be?
      • Why has the facility suddenly started to produce defective goods?

  • What is Data Mining?

    Modeling the investigated system and discovering the relations that connect variables in a database are the subject of data mining. Modern data mining systems learn from the previous history of the investigated system, formulating and testing hypotheses about the rules that this system obeys. Once concise and valuable knowledge about the system of interest has been discovered, it can and should be incorporated into a decision support system that helps the manager make wise and informed business decisions.

  • Why use data mining?

    Data might be one of the most valuable assets of an organization, but only if we know how to reveal the valuable knowledge hidden in raw data. Data mining allows us to extract diamonds of knowledge from historical data and predict the outcomes of future situations. In business, it helps us optimize decisions, increase the value of each customer and communication, and improve customer satisfaction with our services. In the medical domain, it may help discover previously unknown causes of particular diseases.

    Data mining can be used on any type of data (e.g., financial, medical, educational, communication, industrial). The data that require analysis differ for companies in different industries. Examples include:

      • Sales and contact histories
      • Call support data
      • Demographic data on customers and prospects
      • Patient diagnoses and prescribed drugs data
      • Clickstream and transactional data from a website

    In all these cases data mining can help reveal knowledge hidden in data and turn this knowledge into a crucial competitive advantage.

  • What can Data Mining do for us?

      • Identify our best prospects and then retain them as customers. By concentrating marketing efforts only on the best prospects we will save time and money, thus increasing the effectiveness of the marketing operation.
      • Predict cross-sell opportunities and make recommendations. Whether we have a traditional or web-based operation, we can help customers quickly locate products of interest to them, and simultaneously increase the value of each communication with a customer.
      • Learn parameters influencing trends in sales and margins. One may think this can be done with OLAP (Online Analytical Processing) tools. True, OLAP can help prove a hypothesis, but only if we know what questions to ask in the first place. In the majority of cases we may have no clue what combination of parameters influences our operation. In these situations data mining is the only real option.
      • Segment markets and personalize communications. There might be distinct groups of customers, patients, or natural phenomena that require different approaches in their handling. If we have a broad customer range, we would need to address teenagers in California and married homeowners in Minnesota with different products and messages in order to optimize a marketing campaign.

  • Reasons for the growing popularity of Data Mining: Growing Data Volume

    The main reason automated systems for intelligent data analysis are needed is the enormous volume of existing and newly appearing data that require processing. The amount of data accumulated each day by business, scientific, and governmental organizations around the world is daunting. According to information from the GTE research center, scientific organizations alone store about 1 TB (terabyte) of new information each day.

  • Reasons for the growing popularity of Data Mining: Limitations of Human Analysis

    Two other problems surface when human analysts process data: the inadequacy of the human brain when searching for complex multifactor dependencies in data, and the lack of objectivity in such an analysis. A human expert is always a hostage to previous experience of investigating other systems. Sometimes this helps and sometimes it hurts, but it is almost impossible to escape.

  • Reasons for the growing popularity of Data Mining: Low Cost of Machine Learning

    An additional benefit of automated data mining systems is that the process costs much less than hiring an army of highly trained (and highly paid) professional statisticians. While data mining does not eliminate human participation completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics or programming to manage the process of extracting knowledge from data.

  • Knowledge Discovery in Databases

      • A process of six or more steps: data warehousing, data selection, data preprocessing, data transformation, data mining, interpretation/evaluation
      • Data mining is sometimes referred to as KDD
      • DM and KDD tend to be used as synonyms
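
The steps listed above can be pictured as a chain of small functions, each feeding the next. The following is only a minimal sketch: the records, field names, and the trivial "pattern" (average spend per age band) are hypothetical and chosen purely for illustration, not taken from the slides.

```python
# Hypothetical toy "warehouse"; every name and value here is illustrative.
warehouse = [
    {"customer": "a", "age": "34", "spend": "120.0"},
    {"customer": "b", "age": "", "spend": "80.0"},   # missing age
    {"customer": "c", "age": "51", "spend": "300.0"},
    {"customer": "d", "age": "29", "spend": "40.0"},
]

def select(rows):
    """Data selection: keep only the fields relevant to the question."""
    return [{"age": r["age"], "spend": r["spend"]} for r in rows]

def preprocess(rows):
    """Data preprocessing: drop records with missing values."""
    return [r for r in rows if all(v != "" for v in r.values())]

def transform(rows):
    """Data transformation: convert text fields to numbers."""
    return [(int(r["age"]), float(r["spend"])) for r in rows]

def mine(pairs):
    """Data mining: a deliberately trivial pattern, mean spend per age band."""
    bands = {}
    for age, spend in pairs:
        band = "under 40" if age < 40 else "40 and over"
        bands.setdefault(band, []).append(spend)
    return {band: sum(v) / len(v) for band, v in bands.items()}

def interpret(pattern):
    """Interpretation/evaluation: turn the pattern into a readable statement."""
    top = max(pattern, key=pattern.get)
    return f"Highest average spend: {top} ({pattern[top]:.2f})"

pattern = mine(transform(preprocess(select(warehouse))))
summary = interpret(pattern)
print(summary)
```

In a real project each step would be far more involved; the point is only that data mining proper is one stage in a longer pipeline.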

  • Typical Applications of Data Mining

    Sales/Marketing
      • Provide better customer service
      • Improve cross-selling opportunities (beer and
      • Increase direct mail response rates
    Customer Retention
      • Identify patterns of defection
      • Predict likely defections
    Risk Assessment and Fraud
      • Identify inappropriate or unusual behavior

  • Motivation: The Sizes

    Databases today are huge:
      • More than 1,000,000 entities/records/rows
      • From 10 to 10,000 fields/attributes/variables
      • Gigabytes and terabytes
    Databases are growing at an unprecedented rate
    The corporate world is a cut-throat world
      • Decisions must be made rapidly
      • Decisions must be made with maximum knowledge

  • Motivation for doing Data Mining

    Investment in Data Collection/Data Warehouse
      • Add value to the data holding
      • Competitive advantage
      • More effective decision making
    OLTP -> Data Warehouse -> Decision Support
      • Work to add value to the data holding
      • Support high-level and long-term decision making
      • Fundamental move in use of databases

  • Importance of Data Mining

    By applying data mining techniques, which draw on statistics, artificial intelligence, and machine learning, organizations are able to identify trends within their data that they did not know existed. Data mining can best be described as a business intelligence (BI) technology that uses various techniques to extract comprehensible, hidden, and useful information from a population of data. This BI technology makes it possible to discover hidden trends and patterns in large amounts of data. The output of a data mining exercise can take the form of patterns, trends, or rules that are implicit in the data. Through data mining and the new knowledge it provides, individuals are able to leverage the data to create new opportunities or value for their organizations. The following are examples of practical uses of data mining and the value it provides to those who use it.

  • Fraud Detection

    Credit card issuers have been using data mining techniques to detect potentially fraudulent credit card transactions. When a credit transaction is executed, the transaction and all data elements describing it are analyzed using a sophisticated data mining technique, the neural network, to determine whether the transaction is potentially fraudulent, based on known fraudulent charges. By using data mining, credit card issuers have reduced and mitigated losses due to fraudulent charges.

  • Inventory Logistics

    By incorporating data mining techniques, retailers can improve their inventory logistics and thereby reduce the cost of handling inventory. Through data mining, a retailer can identify the demographics of its customers, such as gender, marital status, and number of children, and the products that they buy. This information can be extremely beneficial in stocking merchandise in new store locations, as well as in identifying "hot" selling products in one demographic market that should also be displayed in stores with similar demographic characteristics. For nationwide retailers, this information can have a tremendous positive impact on operations by decreasing inventory movement and by placing inventory in locations where it is likely to sell.

  • Defect Analysis

    Through the use of data mining techniques, manufacturers are able to identify the characteristics surrounding defective products, such as the day of the week and time of the manufacturing run, the components being used, and the individuals working on the assembly line. By understanding these characteristics, changes can be made to the manufacturing process to improve the quality of the products being produced. High-quality products improve the reputation of the organization within its industry and help to drive sales. In addition, profitability improves through the reduction of return materials allowances and field service calls.

  • Focused Hiring

    Some employers use data mining techniques to understand the characteristics of their top-performing individuals. By understanding the characteristics of this group, such as education, years of experience, skills, and personality traits, a hiring profile can be established to help recruit and hire individuals with characteristics similar to those of the best performers. While this technique has been used, one must realize that profiling is based on historical data, which may not be indicative of future top-performing individuals because of changes in social, economic, and environmental conditions.

  • Techniques Used in Data Mining

      • Link Analysis: association rules, sequential patterns, time sequences
      • Predictive Modeling: tree induction, neural networks, regression
      • Database Segmentation: clustering, k-means
      • Deviation Detection: visualization, statistics

  • Data Mining Techniques - Cluster Analysis

    Many data mining applications make use of clustering according to similarity, for example to segment a client or customer base. Clustering according to the optimization of set functions is also used in data analysis. Clustering, or segmentation, in databases is the process of separating a data set into components that reflect a consistent pattern of behavior. Once the patterns have been established, they can be used to "deconstruct" data into more understandable subsets. They also provide subgroups of a population for further analysis or action, which is important when dealing with very large databases. For example, a database could be used to generate profiles for target marketing: previous responses to mailing campaigns can be used to build a profile of the people who responded, and this profile can then be used to predict response and filter mailing lists to achieve the best response.
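
As an illustration of clustering by similarity, here is a minimal sketch of k-means, one of the segmentation techniques named on the techniques slide. The customer data (orders per year, mailings answered) is hypothetical, and seeding the centroids with the first k points is a simplifying assumption made here to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (orders per year, mailings answered).
customers = [(1, 1), (9, 8), (2, 1), (8, 9), (1, 2), (9, 9)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each resulting cluster is a candidate segment, for example occasional buyers versus frequent responders, that can be profiled or targeted separately.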

  • Data Mining Techniques - Induction

    A database is a store of information, but more important is the information that can be inferred from it. There are two main inference techniques: deduction and induction. Deduction infers information that is a logical consequence of the information in the database. For example, applying the join operator to two relational tables, the first relating employees to departments and the second relating departments to managers, infers a relation between employees and managers. Induction has been described earlier as the technique of inferring information that is generalized from the database, as in inferring from the example above that each employee has a manager. This is higher-level information, or knowledge, in that it is a general statement about objects in the database. The database is searched for patterns or regularities. Induction has been used in the following ways within data mining:

  • Decision Trees

    Decision trees are a simple knowledge representation that classifies examples into a finite number of classes. The nodes are labeled with attribute names, the edges are labeled with the possible values of that attribute, and the leaves are labeled with the different classes. Objects are classified by following a path down the tree, taking the edges that correspond to the values of the attributes in the object. The following is an example of objects that describe the weather at a given time.
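
The slide's own weather example did not survive in this transcript, so the tree below is a hypothetical stand-in. It is a minimal sketch of how an object is classified by following a path from the root to a leaf; the attribute names, values, and class labels are all assumptions made for illustration.

```python
# Internal node: (attribute name, {attribute value: subtree}).
# Leaf: a class label stored as a plain string.
# This particular tree is hypothetical, not the one from the slides.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "don't play", "normal": "play"}),
    "overcast": "play",
    "rainy": ("windy", {"true": "don't play", "false": "play"}),
})

def classify(node, obj):
    """Follow the edges matching obj's attribute values until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[obj[attribute]]
    return node

today = {"outlook": "sunny", "humidity": "normal", "windy": "false"}
label = classify(tree, today)
print(label)
```

Here the object takes the "sunny" edge from the root, then the "normal" edge from the humidity node, and ends at a leaf.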

  • Rule Induction

    A data mining system has to infer a model from the database. That is, it may define classes such that the database contains one or more attributes that denote the class of a tuple (the predicted attributes), while the remaining attributes are the predicting attributes. A class can then be defined by conditions on the attributes. Once the classes are defined, the system should be able to infer the rules that govern classification; in other words, it should find the description of each class. Production rules have been widely used to represent knowledge in expert systems, and they have the advantage of being easily interpreted by human experts because of their modularity: a single rule can be understood in isolation and does not need reference to other rules. The proposition-like structure of such rules has been described earlier and can be summed up as if-then rules.
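
To make the idea of inferring if-then rules concrete, here is a minimal sketch using OneR, a very simple rule learner that the slides do not name; it is swapped in here only because it is short. For each predicting attribute it maps every value to the majority class, then keeps the attribute whose rules make the fewest mistakes. The mailing-response data is hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (attribute, rules, errors) for the single best predicting attribute."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attribute in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attribute]][row[target]] += 1
        # One rule per value: predict the majority class for that value.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
        if best is None or errors < best[2]:
            best = (attribute, rules, errors)
    return best

# Hypothetical tuples; "responded" is the predicted attribute.
rows = [
    {"homeowner": "yes", "age_group": "young", "responded": "yes"},
    {"homeowner": "yes", "age_group": "old", "responded": "yes"},
    {"homeowner": "no", "age_group": "young", "responded": "no"},
    {"homeowner": "no", "age_group": "old", "responded": "no"},
    {"homeowner": "yes", "age_group": "young", "responded": "yes"},
    {"homeowner": "no", "age_group": "old", "responded": "yes"},
]

attribute, rules, errors = one_r(rows, target="responded")
for value, label in rules.items():
    print(f"IF {attribute} = {value} THEN responded = {label}")
```

Each printed line is a production rule that can be read on its own, which is the modularity the slide refers to.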

  • Neural Networks

    Neural networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

  • Neural Networks

    Neural networks have broad applicability to real-world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, including:

      • sales forecasting
      • industrial process control
      • customer research
      • data validation
      • risk management
      • target marketing
      • etc.

  • Neural Networks

    Neural networks use a set of processing elements (or nodes) analogous to neurons in the brain. These processing elements are interconnected in a network that can identify patterns in data once it is exposed to the data; that is, the network learns from experience just as people do. This distinguishes neural networks from traditional computer programs, which simply follow instructions in a fixed sequential order.
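
What "learning from experience" means can be shown with a single processing element. The following is a minimal sketch of a perceptron, the simplest such element, trained with the classic perceptron learning rule; real networks connect many of these and use more elaborate training. The task (output 1 only when both inputs are 1) and the learning rate are assumptions chosen to keep the example tiny.

```python
def train_perceptron(samples, epochs=20, rate=1.0):
    """Adjust weights and bias from examples using the perceptron learning rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = bias + sum(w * x for w, x in zip(weights, inputs))
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when the node already answers correctly
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    return 1 if bias + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

# Toy experience: the node should fire only when both inputs are on.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print(predictions)
```

Nothing in the code states the rule "both inputs must be on"; the weights arrive at it by being corrected after each wrong answer, which is the contrast with a fixed sequence of instructions.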