Data Mining
Kathy S Schwaig
Outline
• Motivation
• Definitions
• Techniques
• Applications
Portions of this presentation are adapted from J. Han, Simon Fraser University, Canada.
Motivation
Data found in data warehouses is not, by itself, of great intrinsic value.
Value comes from the knowledge that can be discovered from data.
What do you do with it?
Data Volume Problems
• Magnitude of data due to machine-readable text disseminated across networks.
• Difficult to distill information for analysis.
• Tools needed to 'mine' information to bring out key, relevant facts.
• Users need to rapidly filter and assimilate useful information from a variety of data sources.
Data Mining
The process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Extraction of hidden, predictive information from large databases.
Provides answers to questions a decision maker had previously not thought to ask.
Data Mining
Search for relationships, patterns, and trends which, prior to the search, were not known to exist or were not visible.
E.g. “Find related buying patterns.”
“There is a pattern that occurs X% of the time that when someone buys window coverings (not shades, blinds, or other specifics), and within 1 to 3 months buys linens, within the next 4 months buys furniture.”
Data Mining
[Figure: data mining as one step in the knowledge discovery process. Stages shown: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation.]
Data Mining and Business Intelligence
[Figure: layered pyramid, with increasing potential to support business decisions toward the top.]
Roles: End User, Business Analyst, Data Analyst, DBA
Layers, top to bottom:
Making Decisions
Data Presentation
Visualization Techniques
Data Mining: Information Discovery
Data Exploration
OLAP
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining Analysis Techniques Examples
• Characterization
• Association
• Classification
• Prediction
• Clustering (Data Segmentation)
Characterization
Demographics: address, income, recreational equipment ownership, etc.
Psychographics: lifestyle/personality characteristics like “highly protective of children” or “impulsive shopper”
Technographics (web-based): attributes of your computer system, such as browser, operating system, modem speed, etc.
Association
Occurrences linked to a single event. Identify items that are likely to be purchased or viewed in the same session (web).
Example: Amazon.com. Customers who bought The Grapes of Wrath also bought The Great Gatsby.
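The "customers who bought X also bought Y" idea can be sketched with two standard association measures: support (how often the pair occurs across all sessions) and confidence (how often Y appears given X). This is a minimal illustration, not any retailer's actual method; the session data, the `pair_rules` helper, and the 0.5 support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical session baskets (illustrative data only).
sessions = [
    {"Grapes of Wrath", "Great Gatsby"},
    {"Grapes of Wrath", "Great Gatsby", "Moby-Dick"},
    {"Grapes of Wrath", "Moby-Dick"},
    {"Great Gatsby"},
]

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of all sessions containing both items
        if support < min_support:
            continue
        # Confidence is directional: P(consequent | antecedent).
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

rules = pair_rules(sessions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Note that confidence depends on direction: in this toy data every session with Moby-Dick also has Grapes of Wrath, but not the reverse.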
Classification
Recognize patterns that describe the group to which an item belongs by examining existing items that have already been classified and inferring a set of rules.
Example: Credit Card companies have discovered the characteristics of customers likely to leave and have provided a model to help predict who will leave in the future.
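As a rough sketch of "inferring a set of rules from already-classified items", the snippet below uses OneR, a very simple rule learner that picks the single attribute whose value-to-majority-class rules make the fewest mistakes. The customer records and attribute names are hypothetical, and real churn models are far richer than this.

```python
from collections import Counter, defaultdict

# Hypothetical labelled history: (customer attributes, did_the_customer_leave)
history = [
    ({"late_payments": "many", "tenure": "short"}, True),
    ({"late_payments": "many", "tenure": "long"}, True),
    ({"late_payments": "few", "tenure": "short"}, False),
    ({"late_payments": "few", "tenure": "long"}, False),
    ({"late_payments": "few", "tenure": "short"}, True),
]

def learn_one_rule(examples):
    """OneR: choose the attribute whose per-value majority rules err least."""
    best = None
    for attr in examples[0][0]:
        by_value = defaultdict(Counter)
        for features, label in examples:
            by_value[features[attr]][label] += 1
        # For each attribute value, predict the most common class seen.
        value_rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(label != value_rules[f[attr]] for f, label in examples)
        if best is None or errors < best[2]:
            best = (attr, value_rules, errors)
    return best

attr, value_rules, errors = learn_one_rule(history)

def predict(customer):
    """Apply the learned rule to a new, unclassified customer."""
    return value_rules[customer[attr]]

print(attr, value_rules, errors)
print(predict({"late_payments": "many", "tenure": "long"}))
```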
Prediction
Guesses an unknown value such as income when you know other things about a person.
Example: lifetime monetary value. Often used in demographic data to fill in blank information. For example, we know someone’s address, car preference, and job title but not their income. We can look at others with similar characteristics and, from their data, infer the missing income figure.
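One simple way to "look at others with similar characteristics" is a nearest-neighbour estimate: rank known records by how many attributes they share with the person, then average the incomes of the closest few. The records, field names, and the choice of k=2 below are assumptions for illustration only.

```python
# Hypothetical records with known income (illustrative values only).
known = [
    {"zip": "30144", "car": "sedan", "job": "teacher", "income": 48000},
    {"zip": "30144", "car": "sedan", "job": "nurse", "income": 52000},
    {"zip": "30301", "car": "suv", "job": "manager", "income": 90000},
    {"zip": "30301", "car": "suv", "job": "teacher", "income": 50000},
]

def similarity(a, b, fields=("zip", "car", "job")):
    """Count how many of the compared fields match exactly."""
    return sum(a[f] == b[f] for f in fields)

def infer_income(person, records, k=2):
    """Average the income of the k records most similar to the person."""
    ranked = sorted(records, key=lambda r: similarity(person, r), reverse=True)
    nearest = ranked[:k]
    return sum(r["income"] for r in nearest) / len(nearest)

estimate = infer_income({"zip": "30144", "car": "sedan", "job": "teacher"}, known)
print(estimate)
```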
Clustering
Identify people who share common characteristics. A way of identifying differing groups within the data.
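A minimal sketch of clustering, assuming a single numeric attribute (customer age) and the k-means procedure: assign each value to its nearest center, then move each center to the mean of its group. The ages and starting centers are made up; k-means is one of many clustering methods and is not named in the slides.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on one numeric attribute; returns (centers, groups)."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical customer ages.
ages = [19, 21, 23, 62, 65, 70]
centers, groups = kmeans_1d(ages, centers=[20.0, 60.0])
print(centers, groups)
```

Unlike classification, no labels are supplied here: the two age groups emerge from the data alone.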
Patterns
Scuba gear and Australian vacations
Skim milk and whole wheat bread
AT&T’s stock rises at least 2% after every 3-day slump in the Dow
Camelot Music Inc.
• Discovered what appeared to be a curious purchasing trend.
• The music retailer’s 493 stores were selling a lot of rap and alternative CDs to people older than 65.
Are All the “Discovered” Patterns Interesting?
A data mining query may generate thousands of patterns.
Are they interesting? Why or why not?
Interesting if:
• easily understood by humans
• valid on new or test data with some degree of certainty
• potentially useful
• novel
• validates some hypothesis that a user seeks to confirm
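The "valid on new or test data" criterion can be checked by measuring a pattern on the data it was found in and again on held-out data. The sketch below does this for a rule of the form "baskets with A also contain B"; the baskets, the `holds_up` helper, and the 0.6 confidence threshold are assumptions for illustration.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also hold the consequent."""
    matching = [b for b in baskets if antecedent in b]
    if not matching:
        return None  # the rule never applies in this data
    return sum(consequent in b for b in matching) / len(matching)

def holds_up(train, test, antecedent, consequent, min_conf=0.6):
    """Keep a pattern only if it clears the threshold on both data sets."""
    scores = [confidence(d, antecedent, consequent) for d in (train, test)]
    return all(s is not None and s >= min_conf for s in scores)

# Hypothetical baskets: the pattern is found in `train`, checked on `test`.
train = [
    {"skim milk", "whole wheat bread"},
    {"skim milk", "whole wheat bread"},
    {"skim milk"},
    {"eggs"},
]
test = [
    {"skim milk", "whole wheat bread"},
    {"skim milk", "eggs"},
    {"whole wheat bread"},
    {"skim milk", "whole wheat bread"},
]

print(holds_up(train, test, "skim milk", "whole wheat bread"))
```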
Applications: MCI
How to find the customers you want to keep from among the millions?
Comb marketing data on 140 million households, each evaluated on as many as 10,000 attributes, e.g. income, lifestyle, and details about past calling habits.
But which set of those attributes is the most important to monitor, and within what range of values?
MCI
• Using an IBM SP/2 supercomputer as its data warehouse, MCI has identified the variables it finds most telling about its customers and, from those, compiled a set of 22 very detailed and highly confidential statistical customer profiles, none of which could have been developed without data mining programs.
Wal-Mart
Point-of-sale transaction data is captured at each retail store and transmitted to Wal-Mart’s Arkansas data warehouse.
Over 3,500 independent suppliers have online access to information about their respective products in that data warehouse. They may query that data to analyze trends by item and store, using that information to find the products that need replenishment and thus get the right products to each store on time.
Data Mining Should Not be Used Blindly!
Data mining finds regularities from history, but history is not the same as the future.
Association does not dictate trend or causality! Does drinking diet drinks lead to obesity?
David Heckerman’s counter-example (1997): barbecue sauce, hot dogs, and hamburgers.
Web Mining: Lots To Be Done!
Types of Web mining
• Web usage mining: which page or graphic was served (URL), linked to date, time, and browser information.
• Web content mining: how visitors are responding to your content (which links they select, where they spend time, which search terms they use, where they browse).
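A minimal sketch of web usage mining, assuming access-log records have already been parsed into (timestamp, URL, browser) tuples. The record layout and the sample entries are hypothetical; real server logs need a parsing step first.

```python
from collections import Counter

# Hypothetical, already-parsed access-log records: (timestamp, url, browser)
log = [
    ("2001-03-01 09:15", "/index.html", "Netscape"),
    ("2001-03-01 09:16", "/products.html", "Netscape"),
    ("2001-03-01 21:40", "/index.html", "IE"),
    ("2001-03-02 09:05", "/index.html", "IE"),
]

# Which pages were served most often?
hits_by_url = Counter(url for _, url, _ in log)

# At which hour of the day do requests arrive? (hour = first two chars of HH:MM)
hits_by_hour = Counter(ts.split()[1][:2] for ts, _, _ in log)

# Which browsers are visitors using?
hits_by_browser = Counter(browser for _, _, browser in log)

print(hits_by_url.most_common(1))
print(hits_by_hour)
print(hits_by_browser)
```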
Other than managers, who could REALLY use this information?
Challenges to Web Mining
Web: a huge, widely-distributed, highly heterogeneous, semi-structured, interconnected, evolving, hypertext/hypermedia information repository.
Problems:
• the “abundance” problem
• limited coverage of the Web (hidden Web sources)
• limited query interface: keyword-oriented search
• limited customisation to individual users
DBMS and data miners will play an increasingly important role in the new generation of the Internet.
Summary
• Need for data mining
• Approaches
• Problems
• Applications
• Web data mining
Appendix: Market Analysis and Management
Data sources
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, studies.
Target marketing
• Clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Customer purchasing patterns
• Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
• Associations/correlations between product sales.
• Prediction based on the association information.
Appendix: Market Analysis and Management (Cont’d)
Customer profiling
• Data mining can tell you what types of customers buy what products (clustering or classification).
Customer requirements
• Identify best products for different customers.
• Prediction to find what factors will attract new customers.
Summary information
• Multi-dimensional summary reports.
• Statistical summary information.
Appendix: Corporate Analysis and Risk Management
Finance planning and asset evaluation
• Cash flow analysis and prediction.
• Contingent claim analysis to evaluate assets.
• Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.).
Resource planning
• Summarize and compare resources and spending.
Competition
• Monitor competitors and market directions.
• Segment customers into classes with class-based pricing procedure.
• Set pricing strategy in a highly competitive market.
Appendix: Fraud Detection and Management
Applications
• Widely used in health care, retail, credit card services, telecommunications (phone card fraud).
Approach
• Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
Examples
• Auto insurance: detect a group of people who stage accidents to collect insurance.
• Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
• Medical insurance: detect professional patients and a ring of doctors and a ring of references.
Appendix: Fraud Detection and Management (Cont’d)
Telephone fraud
• Telephone call model: destination of call, duration, time of day or week.
• Analyze patterns that deviate from expected norm.
• British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
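"Patterns that deviate from the expected norm" can be sketched as a simple outlier test: model a caller's usual behavior from history, then flag new calls that fall far outside it. The example below uses only call duration and a z-score threshold of 3; both are assumptions for illustration, and a fuller call model would also use destination and time of day or week.

```python
from statistics import mean, stdev

def flag_unusual(history, new_calls, threshold=3.0):
    """Return new call durations far from the caller's historical norm."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return []  # no variation in history, so no scale to compare against
    return [c for c in new_calls if abs(c - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one caller.
past_calls = [3, 5, 4, 6, 5, 4, 5, 4]
todays_calls = [5, 95, 4]

flagged = flag_unusual(past_calls, todays_calls)
print(flagged)
```

A flagged call is only a candidate for review, not proof of fraud.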
Appendix: Other Applications
Internet Web Surf-Aid
• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Appendix: Decision Support and OLAP
DSS: Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
What were the sales volumes by region and product category for the last year?
How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
Will a 10% discount increase sales volume sufficiently?
• OLAP (on-line analytical processing): refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multi-dimensional databases. An element of a decision support system.
• Data mining is a powerful, high-performance data analysis tool for decision support.
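The first DSS question above (sales volumes by region and product category for the last year) is a typical OLAP-style roll-up: filter on one dimension and sum a measure over the chosen grouping dimensions. The sketch below assumes a tiny in-memory fact table; the rows, column order, and `roll_up` helper are hypothetical, and a real OLAP system would run this against a multi-dimensional database.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, category, year, sales)
facts = [
    ("East", "Linens", 2000, 120),
    ("East", "Furniture", 2000, 300),
    ("West", "Linens", 2000, 80),
    ("West", "Linens", 1999, 70),
]

def roll_up(rows, dims, year=None):
    """Sum sales grouped by column indexes in dims (0=region, 1=category, 2=year)."""
    totals = defaultdict(int)
    for row in rows:
        if year is not None and row[2] != year:
            continue  # slice: keep only the requested year
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

# Sales by region and product category for one year.
by_region_category = roll_up(facts, dims=(0, 1), year=2000)

# Rolling up further: drop the category dimension to get totals per region.
by_region = roll_up(facts, dims=(0,), year=2000)

print(by_region_category)
print(by_region)
```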