data mining issues and applications

32
Copyright © 2006, SAS Institute Inc. All rights reserved. Data Mining Sue Walsh Higher Education Consulting SAS

Upload: tommy96

Post on 07-May-2015

1.452 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Data MiningSue WalshHigher Education ConsultingSAS

Page 2: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

OverviewBrief Historical Perspective

Defining Data Mining

Issues• Data Collection and Data Organization• Modeling Issues and Data Difficulties• Skepticism and Communication

Applications

SAS Enterprise Miner Demonstration

SAS Enterprise Miner versus SAS/STAT

Another Kind of Data Mining - Text Mining

Page 3: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

History

Page 4: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Data Mining, circa 1963

IBM 7090 600 cases

“Machine storage limitationsrestricted the total number ofvariables which could beconsidered at one time to 25.”

“Machine storage limitationsrestricted the total number ofvariables which could beconsidered at one time to 25.”

Page 5: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Since 1963

Moore’s Law:The information density on silicon-integrated circuits doubles every 18 to 24 months.

Parkinson’s Law:Work expands to fill the time available for its completion.

Page 6: Data Mining Issues and Applications

6

elec

tron

ic p

oint

-of-sa

le d

ata

hosp

ital p

atient

reg

istrie

s

cata

log

orde

rs

ban

k tran

sact

ions

rem

ote

sens

ing

imag

es

tax

ret

urns

airli

ne res

erva

tions

c

redi

t ca

rd c

harg

es

stoc

k trad

es

OLT

P t

elep

hone

cal

ls

Data Deluge

Page 7: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

The DataExperimental Opportunistic

Purpose Research Operational

Value Scientific Commercial

Generation Actively Passivelycontrolled observed

Size Small Massive

Hygiene Clean Dirty

State Static Dynamic

Page 8: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

The Origins of Data Mining

Databases

StatisticsPatternRecognition

KDD

MachineLearning AI

Neurocomputing

Data Mining

Page 9: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Solving the Data Puzzle- a Step-by-Step Approach

Data collection• Transactional systems• Customer information systems

Data organization

Data analysis

Reporting

The Result

Business Decisions

Page 10: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Definition

Page 11: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

What Is Data Mining?

• IT− Complicated database queries

• ML− Inductive learning from examples

• Stat− What we were taught not to do

Page 12: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Data Mining – The SAS Definition

Advanced methods for exploring and modeling relationships in large amounts of data.

Page 13: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Solving the Data Puzzle- a Step-by-Step Approach

Data collection• Transactional systems• Customer information systems

Data organization - data warehousing

Data analysis - data mining

Reporting

Action

Page 14: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

The SAS Approach to Data MiningSEMMA

Sample

Explore

Modify

Model

Assess

Page 15: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Issues

Page 16: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Data Collection and Data Organization

What data has been collected and where is it?

How do I combine legacy systems with current data systems?• Customer Story

What is the meaning of some of these data values?

Page 17: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Modeling Issues and Data DifficultiesData PreparationRare or Unknown Targets• Over Sampling

UndercoverageDirty Data• Errors• Missing Values

Dimension Reduction (Variable Selection)Under and Over FittingTemporal InfidelityModel Evaluation

Page 18: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Skepticism and Communication

Skepticism• Breaking the Rules (statisticians)• Magic (non-analytical individuals)

Communication

Page 19: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Applications

Page 20: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Health CareDrug development – to help uncover less expensive but equally effective drug treatments.

Medical diagnostics – imaging, real-time monitoring (e.g., predicting women at high risk for emergency C-section).

Insurance claims analysis – identify customers likely to buy new policies; define behavior patterns of risky customers.

Page 21: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Business and Finance

Banks - to detect which customers are using which products so they can offer the right mix of products and services to better meet customer needs – cross sell and up sell.

Credit card companies - to assist in mailing promotional materials to people who are most likely to respond.

Lenders - to determine which applicants are most likely to default on a loan.

Page 22: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

The Absa Group (a South African Bank)

Challenge:Reduce operating expenses and cut losses by leveraging data to improve security and enhance customer relationships.

Solution:SAS helped Absa reduce armed robberies by 41 percent over two years, netting a 38 percent reduction in cash loss and an 11 percent increase in customer satisfaction ratings.

Page 23: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Sports and GamblingSports teams – to analyze data to determine favorable player match ups and call the best plays

Gaming industry - to analyze customer gambling trends at casinos.

Sports Fanatics – to predict which teams will be chosen for tournament berths as well as to predict game winners.

Page 24: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

EducationEnrollment Management – which students are likely to attend

Retention/Graduation Analysis – which students will remain enrolled after the first year and/or through graduation

Donation Prediction – who is likely to donate and how much might they donate

Faculty Churn – what faculty members are most likely to leave the institution

Page 25: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Other Application AreasInsurance – pricing, fraud detection, risk analysis

Stock Market – market timing, stock selection, risk analysis

Transportation – performance & network optimization to predict life-cycle costs of road pavementTelecommunications – churn reductionRetail – market basket analysis to help determine marketing strategies

Page 26: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Demonstration

Page 27: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Data Mining with SAS Enterprise Miner versus with SAS/STAT

Features in SAS Enterprise Miner not in SAS/STAT• Decision trees• Neural networks• Automatic data splitting• Automatic score code• Model comparison tool

Features in SAS/STAT not in SAS Enterprise Miner• Diagnostic statistics

The products offer different model evaluation statistics because of the difference in purpose.

Page 28: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Another Kind of Data Mining

Page 29: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Text Mining – What is it?

Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.

“SAS defines text mining as the process of investigating a large collection of free-form documents in order to discover and use the knowledge that exists in the collection as a whole.” (SAS® Text Miner: Distilling Textual Data for Competitive Business Advantage)

Page 30: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Another View of Text Mining

TextText

AAMiracleMiracleOccursOccurs

NumbersNumbers

Page 31: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Text Mining Applications

Automotive Early Warning System• Wallace and Cermack (2004) describe the use of text

mining for warranty analysis related to the TREAD act.

Medical Information Management• TextWise Labs uses sophisticated text mining

methodology to extract medical information from disparate data sources on the Internet.

• Computer Science Innovations Inc. is developing an application for the National Cancer Institute that automatically converts medical records into XML data.

Page 32: Data Mining Issues and Applications

Copyright © 2006, SAS Institute Inc. All rights reserved.

Text Mining ApplicationsInsurance Claim Fraud• Insurance companies employ Special Investigative

Units (SIU) to investigate claims for fraud. Data mining methods can be employed to automate the process of referral. Text mining methods are applied to claims examiner notes, physician reports, and other textual data to enhance predictive accuracy.

Technical Support• Sanders and DeVault (2004) describe a process that

employs text mining to improve efficiency in a technical support environment.