CIT 858: Data Mining and Data Warehousing
Course Instructor: Bajuna Salehe
Email: [email protected]
Web: www.ifm.ac.tz/staff/bajuna/courses/
Introduction to Data Mining and Data Warehousing
Data Mining and Data Warehousing Agenda
What is Data Mining?
What is Data Warehousing?
The source of invention of Data Mining and Data Warehousing.
Drowning in Data, Starving for Knowledge.
Evolution of Database Technology to the current state. (Home Work)
What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
Data mining: a misnomer? It should have been named “knowledge mining from data”, which is too long, or “knowledge mining”, which does not reflect the emphasis on mining from huge amounts of data.
What Is Data Mining?
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data/Databases (KDD).
The KDD process is depicted below:
[Figure: The KDD Process. Databases -> Cleaning & Integration -> Data Warehouse -> Selection & Transformation -> Data Mining -> Evaluation & Presentation -> Knowledge]
KDD Process
1) Data cleaning
   To remove noise and inconsistent data.
2) Data integration
   Where multiple data sources may be combined.
3) Data selection
   Where data relevant to the analysis task are retrieved from the database.
4) Data transformation
   Where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations.
5) Data mining
   An essential process where intelligent methods are applied in order to extract data patterns.
6) Pattern evaluation
   To identify the truly interesting patterns representing knowledge.
7) Knowledge presentation
   Where visualization and knowledge representation techniques are used to present the mined knowledge to the users.
8) Use of discovered knowledge
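The first five KDD steps can be sketched as a small pipeline. This is only an illustrative sketch: the records, the two "stores", and all function names are hypothetical and not part of the course material, and the "mining" step is deliberately trivial.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (invented for illustration).
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.0},
    {"customer": "c2", "item": "cheese", "amount": None},  # missing value
    {"customer": "c1", "item": "cheese", "amount": 5.0},
]
store_b = [
    {"customer": "c3", "item": "bread", "amount": 2.5},
    {"customer": "c3", "item": "milk", "amount": 1.0},
]

def clean(records):
    """Step 1, data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3, data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4, data transformation: aggregate total amount per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """Step 5, data mining: a trivial 'pattern', the item with the highest total."""
    return max(totals, key=totals.get)

integrated = integrate(clean(store_a), clean(store_b))
relevant = select(integrated, {"bread", "cheese"})
totals = transform(relevant)
top_item = mine(totals)
print(totals, top_item)
```

Steps 6 to 8 (evaluation, presentation, use) are left out because they depend on the analyst and the application rather than on code alone.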
Data Mining: On What Kinds Of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
  Spatial and temporal data
  Stream data
  Multimedia database
  Text databases & WWW
Data Mining Functionalities
Association (correlation and causality)
  Cheese & Bread
Classification and Prediction
  Construct models that describe and distinguish classes or concepts for future prediction
  Predict some unknown or missing numerical values
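To make the Cheese & Bread association concrete, here is a minimal sketch that computes support and confidence, two measures commonly used to score association rules. The slides do not name these measures, and the market baskets below are invented for illustration.

```python
# Hypothetical market baskets (invented for illustration).
transactions = [
    {"cheese", "bread", "milk"},
    {"cheese", "bread"},
    {"bread", "butter"},
    {"cheese", "wine"},
    {"milk", "butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule under test: cheese -> bread
rule_support = support({"cheese", "bread"}, transactions)
rule_confidence = confidence({"cheese"}, {"bread"}, transactions)
print(rule_support, rule_confidence)
```

With these baskets, two of five contain both items and two of the three cheese baskets also contain bread.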
Data Mining Functionalities (cont…)
Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? No! Useful in fraud detection and rare event analysis
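As a sketch of cluster analysis on the house example, the following groups house prices without any class labels. It uses k-means, which is one possible clustering method and is not named in the slides. The prices and the starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house prices, in thousands.
house_prices = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(house_prices, centroids=[100, 320])
print(centroids, clusters)
```

The two resulting groups play the role of the "new classes" mentioned above: nothing in the input says which houses are cheap or expensive.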
Necessity Is The Mother Of Invention
Data explosion problem
  Automated data collection tools and mature database technology lead to huge amounts of data being accumulated
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution Of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s: Stream data management and mining, data mining with a variety of applications, Web technology and global information systems
Potential Applications
Data analysis and decision support
  Market analysis and management
  Risk analysis and management
  Fraud detection and detection of unusual patterns
Other applications
  Text mining (email, documents) and Web mining
  Stream data mining
  DNA and bio-data analysis
Fraud Detection & Mining Unusual Patterns
Applications: Health care, retail, credit card service, telecommunications
  Auto insurance: ring of collisions
  Money laundering: suspicious monetary transactions
  Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism
Approaches: Clustering, model construction, outlier analysis, etc.
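The idea of flagging calls that deviate from an expected norm can be sketched with a simple deviation test on call duration. This is an assumed, simplified approach (mean and standard deviation of one attribute), not the method used by any particular fraud system. The durations and the threshold are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical call durations in minutes for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 60]

def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

suspicious = flag_outliers(durations)
print(suspicious)
```

A fuller phone call model would also look at destination and time of day or week, as listed above. A single extreme value inflates the standard deviation, so robust statistics such as the median are often preferred in practice.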
Other Applications
Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, to help analyze the effectiveness of Web marketing, improve Web site organization, etc.
What Is a Data Warehouse?
Defined in many different ways, but not rigorously
  A decision support database that is maintained separately from the organization’s operational database
  Supports information processing by providing a solid platform of consolidated, historical data for analysis
“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”
—Bill Inmon
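A rough sketch of what consolidated, historical data for analysis can look like in practice, using an in-memory SQLite database. The table name `sales_fact`, its columns, and the rows are hypothetical. A real warehouse would be loaded from operational systems and kept separate from them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subject-oriented: one table about sales. Time-variant: every row carries a date.
conn.execute("""
    CREATE TABLE sales_fact (
        sale_date TEXT,
        region    TEXT,
        product   TEXT,
        amount    REAL
    )
""")

# Hypothetical historical rows, appended once and not updated afterwards (non-volatile).
rows = [
    ("2023-01-15", "north", "bread", 120.0),
    ("2023-01-20", "south", "bread", 80.0),
    ("2023-02-03", "north", "cheese", 200.0),
    ("2023-02-10", "north", "bread", 50.0),
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)

# The same data summarized along two different dimensions.
monthly = conn.execute("""
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount)
    FROM sales_fact
    GROUP BY month
    ORDER BY month
""").fetchall()

by_region = conn.execute("""
    SELECT region, SUM(amount)
    FROM sales_fact
    GROUP BY region
    ORDER BY region
""").fetchall()

print(monthly)
print(by_region)
```

Summaries like these, over time and over region, are the kind of analysis query a decision-support database is meant to serve.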
The Source of Invention of DW and Data Mining
Data explosion problem
  Automated data collection tools and mature database technology lead to huge amounts of data being accumulated
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Drowning In Data, Starving For Knowledge
[Figure labels: DATA, KNOWLEDGE]
Importance of Data Mining
By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles.
The discovered knowledge can be applied to the decision-making process.