Methods and tools of data analysis
Lecture introduction
Dr Eng. Anna Lamek
Labs notes and grading policy >> www.ii.pwr.wroc.pl/~anna.lamek
See more details: http://pwr.edu.pl/en/students/academic-calendar
7 meetings, term of our lecture: Monday ODD 9:15-10:45 A1, room no 329
Organized classes commence on 1 Oct with an even week and last for 15 weeks (8 even weeks and 7 odd weeks) until 30 Jan 2019.
Lecture content and evaluation form
• Data warehousing / Data Mining – methods and practical applications: examples.
• Warehouse data pre-processing (how to prepare a decision matrix?)
• Multivariate analysis (optimal and acceptable variants)
• Decision trees.
• Regression trees.
• Seasonal decomposition in forecasting.
• Association rules methods.
• Last lecture: written test (14.01.2018), results will be announced online
• You will work in teams of 2-3 students. Every group will represent a single store (id 1-24) in our warehouse of daily items holding, running its business in the USA, Canada and Mexico. The team organizes its own work. Each group will get its own number, and the list of tasks will be available on the website (www.ii.pwr.wroc.pl/~anna.lamek).
• There will be 3 tasks:
– Intro task (today, at the end of the lab intro presentation)
– TASK 1: Examples of aggregation & work with queries (will be announced)
– TASK 2: Decision trees algorithm task (will be announced)
• The solution of each task should be documented by the group in a report file; all the implementation files should also be prepared for the lecturer by the due dates.
• Deadlines for each task will be announced on the website (www.ii.pwr.wroc.pl/~anna.lamek).
• Lateness penalty: 25% of the max points for the task per day!
• The final grade depends on partial grades for each phase:
– Quality of your documentation: 30%.
– Quality of the implementation: 70%.
• Your attendance during labs is obligatory!
• To solve the lab tasks it is necessary to know the theory (LECTURE); students are required to familiarize themselves with the theoretical characteristics of each part of the lab before they start their lab work. It is also necessary to have some skill in working with queries and decision tree algorithms in MS Access and MS Excel.
Labs grading policy
Lecture / Laboratory additional notes, office hours, important announcements
www.ii.pwr.wroc.pl/~anna.lamek
Password: kaczorekdonaldek
• Office hours (officially confirmed on Friday, 12th of October): B4, room no 5.13
– Monday ODD 7:00-9:00
– Monday EVEN 9:00-11:00
– Tuesday ODD 11:00-13:00
– Tuesday EVEN 13:00-15:00
Confirm your meeting by e-mail, providing details of your problem!
PRIMARY LITERATURE:
[1] Hand D., Mannila H., Smyth P.: Principles of Data Mining, MIT Press, 2001.
[2] Han J., Kamber M.: Data Mining: Concepts and Techniques, Elsevier Morgan Kaufmann Publishers, 2006.
[3] Han J.: Data Mining: Concepts and Techniques, 2006.
[4] Larose D.T.: Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005.
[5] Shmueli G.: Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Wiley-Interscience, 2006.
[6] Sumathi S.: Introduction to Data Mining and Its Applications, 2006.
SECONDARY LITERATURE:
[1] Cook D.J., Holder L.B.: Mining Graph Data, Hoboken, N.J.: Wiley-Interscience, 2007.
[2] Morrison D.F.: Multivariate Statistical Methods, McGraw-Hill, 1990.
[3] Olson D.L.: Advanced Data Mining Techniques, Springer, 2008.
[4] Larose D.T.: Data Mining Methods and Models, IEEE Computer Society Press, 2006.
An Introduction to Data Warehousing / Data Mining with some basic examples
SOURCE: Prof. Li Yang's Homepage @ UTC/CSE. Associate Professor and Graduate Program Coordinator in the Department of Computer Science and Engineering at the University of Tennessee at Chattanooga. She is the Director of the UTC Information Security (InfoSec) Center.
Data, Data everywhere now...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I’ve found
– available data poorly documented
• I can’t use the data I’ve found
– results are unexpected
– data needs to be transformed from one form to another
Where is all the Customer Data?
[Figure: customer data spread across separate systems; labels: EMCS, legacy, packaged apps]
It’s obvious…
• It’s impossible to process those data manually in an effective way to reach business goals
• A set of tools that can help managers & CEOs make tactical and strategic decisions is needed >>> DATA WAREHOUSE
So What Is a Data Warehouse?
Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]
• By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.
• OLTP systems are usually designed independently of each other and it is difficult for them to share information.
• Thanks to OLAP (online analytical processing) we are able to provide information at the right place, at the right time and at the right cost
Why Do We Need Data Warehouses?
• Consolidation of information resources
• Improved query performance
• Separate research and decision support functions from the operational systems
• Foundation for data mining, data visualization, advanced reporting and OLAP tools
Why Data Warehousing?
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
What Is a Data Warehouse Used for?
• Knowledge discovery
– Making consolidated reports
– Finding relationships and correlations (even unexpected ones >> data mining)
– Examples
• Banks identifying credit risks
• Insurance companies searching for fraud
• Medical research (disease reasons/causes)
How Do Data Warehouses Differ From Operational Systems?
• Goals
• Structure
• Size
• Performance optimization
• Technologies used
Comparison Chart of Database Types
Data warehouse | Operational system
Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries
Design Differences
[Figure: the operational system is modeled as an ER diagram, the data warehouse as a star schema]
Supporting a Complete Solution
• Operational System – Data Entry
• Data Warehouse – Data Retrieval
Data Warehouses, Data Marts, and Operational Data Stores
• Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.
• Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.
• Operational Data Store (ODS) – A point of integration for operational systems that were developed independently of each other. Since an ODS supports day-to-day operations, it needs to be continually updated.
Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
What are the users saying...
• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
Data Warehousing -- It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
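A minimal sketch of the "assembling data from various sources" step, assuming two hypothetical source systems (a CRM export and a legacy system) that record the same kind of sale in different shapes. All field names, formats and values are invented for illustration; a real warehouse loader would be considerably more involved.

```python
from datetime import datetime

# Hypothetical extracts from two independently designed source systems.
crm_rows = [{"customer": " alice ", "amount_usd": "20.50", "date": "2018-10-01"}]
legacy_rows = [("BOB", 1500, "01/10/2018")]  # amount in cents, date as day/month/year


def from_crm(row):
    """Cleanse a CRM row: trim the name, parse the amount, keep the ISO date."""
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(row["amount_usd"]),
        "date": row["date"],
    }


def from_legacy(row):
    """Cleanse a legacy row: convert cents to dollars and the date to ISO format."""
    name, cents, day_month_year = row
    return {
        "customer": name.strip().title(),
        "amount": cents / 100,
        "date": datetime.strptime(day_month_year, "%d/%m/%Y").date().isoformat(),
    }


# "Load": one consistent store, kept apart from the operational sources.
warehouse = [from_crm(r) for r in crm_rows] + [from_legacy(r) for r in legacy_rows]
```

After cleansing, both rows share one format, so a single query can answer a business question across both sources.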
Data Warehouse Architecture
[Figure: data warehouse architecture; labels: sources (relational databases, legacy data, purchased data), extraction and cleansing, optimized loader, data warehouse engine, metadata repository, analyze and query]
From the Data Warehouse to Data Marts
[Figure: from the organizationally structured data warehouse to departmentally structured and individually structured data marts; scale labels: less / more; history, normalized, detailed; data / information]
Users have different views of Data
[Figure labels: organizationally structured; OLAP]
• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
• Farmers: Harvest information from known access paths
• Tourists: Browse information harvested by farmers
Schema Design
Schema Types of DW
– Star schema
– Snowflake schema
– Fact constellation
Star Schema
• A single fact table and for each dimension one dimension table
• Does not capture hierarchies directly
[Figure: star schema. A central fact table (date, custno, prodno, cityname, sales) linked to the dimension tables time, prod, cust and city]
Dimension Tables
• Dimension tables
– Define business in terms already familiar to users
– Wide rows with lots of descriptive text
– Small tables (about a million rows)
– Joined to fact table by a foreign key
– Heavily indexed
– Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Fact Table
• Central table
– Typical example: individual sales records
– mostly raw numeric items
– narrow rows, a few columns at most
– large number of rows (millions to a billion)
– Access via dimensions
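The star schema above can be sketched with Python's built-in sqlite3 module. The fact columns (date, custno, prodno, cityname, sales) follow the slide; the dimension-table names, their descriptive columns and all sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member (hypothetical columns).
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, custname TEXT);
CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, state TEXT, country TEXT);
-- Fact table: narrow rows, one foreign key per dimension plus the numeric measure.
CREATE TABLE fact (
    date     TEXT    REFERENCES time_dim(date),
    custno   INTEGER REFERENCES cust_dim(custno),
    prodno   INTEGER REFERENCES prod_dim(prodno),
    cityname TEXT    REFERENCES city_dim(cityname),
    sales    REAL
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [("2018-10-01", "2018-10", 2018), ("2018-11-05", "2018-11", 2018)])
conn.executemany("INSERT INTO prod_dim VALUES (?, ?, ?)",
                 [(1, "Milk", "Dairy"), (2, "Cream", "Dairy"), (3, "Cola", "Drinks")])
conn.executemany("INSERT INTO cust_dim VALUES (?, ?)", [(10, "Alice"), (11, "Bob")])
conn.executemany("INSERT INTO city_dim VALUES (?, ?, ?)",
                 [("Seattle", "WA", "USA"), ("Toronto", "ON", "Canada")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)", [
    ("2018-10-01", 10, 1, "Seattle", 20.0),
    ("2018-10-01", 11, 3, "Toronto", 15.0),
    ("2018-11-05", 10, 2, "Seattle", 30.0),
    ("2018-11-05", 11, 1, "Toronto", 5.0),
])

# "Access via dimensions": join the fact table to two dimensions and aggregate.
sales_by_category_country = conn.execute("""
    SELECT p.category, c.country, SUM(f.sales)
    FROM fact f
    JOIN prod_dim p ON f.prodno = p.prodno
    JOIN city_dim c ON f.cityname = c.cityname
    GROUP BY p.category, c.country
    ORDER BY p.category, c.country
""").fetchall()
```

Note that the geographic hierarchy (city, state, country) sits flat inside city_dim, which is what "does not capture hierarchies directly" means for a star schema.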
Snowflake schema
• Represent dimensional hierarchy directly by normalizing tables.
• Easy to maintain and saves storage
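[Figure: snowflake schema. The fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, prod, cust and city; the city dimension is further normalized into a region table]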
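A minimal sketch of the snowflake idea using Python's sqlite3 module: the city dimension is normalized so that each city points to a separate region table. Only the city and region tables are shown; the time, product and customer dimensions are left out for brevity, and the sales measure, column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The hierarchy city -> region is stored explicitly in two normalized tables.
CREATE TABLE region (regionname TEXT PRIMARY KEY, country TEXT);
CREATE TABLE city (
    cityname   TEXT PRIMARY KEY,
    regionname TEXT REFERENCES region(regionname)
);
CREATE TABLE fact (
    date     TEXT,
    custno   INTEGER,
    prodno   INTEGER,
    cityname TEXT REFERENCES city(cityname),
    sales    REAL  -- assumed measure; the slide only shows "..."
);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [("West", "USA"), ("East", "USA")])
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Seattle", "West"), ("Portland", "West"), ("Boston", "East")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)", [
    ("2018-10-01", 10, 1, "Seattle", 20.0),
    ("2018-10-01", 11, 2, "Portland", 10.0),
    ("2018-10-02", 12, 1, "Boston", 7.5),
])

# Summarizing by region now needs one extra join compared with a star schema.
sales_by_region = conn.execute("""
    SELECT r.regionname, SUM(f.sales)
    FROM fact f
    JOIN city c ON f.cityname = c.cityname
    JOIN region r ON c.regionname = r.regionname
    GROUP BY r.regionname
    ORDER BY r.regionname
""").fetchall()
```

Each region's attributes are stored once rather than repeated per city, which is the storage and maintenance benefit the slide refers to; the price is the additional join.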
Fact Constellation
• Fact Constellation
– Multiple fact tables that share many dimension tables
– Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact constellation. The fact tables Booking and Checkout share the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]
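The hotel example can be sketched with sqlite3 as two fact tables that reference the same dimension tables. Only two of the shared dimensions (hotel and customer) are modeled here, and every column name, measure and sample row is a hypothetical choice for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE hotel (hotel_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
-- Two fact tables that reuse the same dimensions.
CREATE TABLE booking (
    hotel_id      INTEGER REFERENCES hotel(hotel_id),
    customer_id   INTEGER REFERENCES customer(customer_id),
    nights_booked INTEGER
);
CREATE TABLE checkout (
    hotel_id    INTEGER REFERENCES hotel(hotel_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    amount_paid REAL
);
""")
conn.executemany("INSERT INTO hotel VALUES (?, ?)",
                 [(1, "Harbor Inn"), (2, "City Lodge")])
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO booking VALUES (?, ?, ?)",
                 [(1, 1, 3), (1, 2, 2), (2, 1, 1)])
conn.executemany("INSERT INTO checkout VALUES (?, ?, ?)",
                 [(1, 1, 300.0), (2, 1, 90.0)])

# Because both facts share the hotel dimension, they can be compared per hotel.
# Each fact table is aggregated separately to avoid double counting in a join.
per_hotel = conn.execute("""
    SELECT h.name,
           (SELECT COALESCE(SUM(b.nights_booked), 0)
              FROM booking b WHERE b.hotel_id = h.hotel_id),
           (SELECT COALESCE(SUM(c.amount_paid), 0)
              FROM checkout c WHERE c.hotel_id = h.hotel_id)
    FROM hotel h
    ORDER BY h.name
""").fetchall()
```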
Which structure is the best one?
Deploying Data Warehouses
• What business information keeps you in business today? What business information can put you out of business tomorrow?
• What business information should be a mouse click away?
• What business conditions are driving the need for business information?
Cultural Considerations
• Not just a technology project
• New way of using information to support daily activities and decision making
• Care must be taken to prepare organization for change
• Must have organizational backing and support
User Training
• Users must have a higher level of IT proficiency than for operational systems
• Training to help users analyze data in the warehouse effectively
Summary: Building a Data Warehouse
Data Warehouse Lifecycle
– Analysis
– Design
– Import data
– Install front-end tools
– Test and deploy
A case -- the STORET Central Warehouse
• Improved performance and faster data retrieval
• Ability to produce larger reports
• Ability to provide more data query options
• Streamlined application navigation
Old Web Application Flow vs. Central Warehouse Application Flow
[Figure: application flow; steps: search criteria selection, report size feedback / report customization, report generation]
http://epa.gov/storet/dw_home.html
STORET Central Warehouse: Web Application Demo
STORET Central Warehouse – Potential Future Enhancements
• More query functionality
• Additional report types
• Web Services
• Additional source systems?
[Figure: source systems: STORET, State System A, State System B]
Data Warehouse Components
[Figure: data warehouse components in four stages. Source systems (legacy): data is extracted. Data staging area: data clean-up and processing. Presentation servers ("the data warehouse"): data marts #1, #2 and #3, each populated, replicated and recovered from the staging area and linked by conformed dimensions and conformed facts. End user data access: end user applications, report writers, ad hoc query tools and data mining, fed by the data marts; model results and cleaned dimensions are uploaded back]
SOURCE: Ralph Kimball
Online analytical processing (OLAP)
Nature of OLAP Analysis
• Aggregation -- (total sales, percent-to-total)
• Comparison -- Budget vs. Expenses
• Ranking – "Top 10 customers"
• Access to detailed and aggregate data
• Complex criteria specification
• Visualization
• Need interactive response to aggregate queries
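Three of the analysis types listed above (aggregation, percent-to-total and ranking) can be sketched in plain Python. The customer names and amounts are invented sample data, and a "top 2" stands in for the "top 10" on the slide.

```python
from collections import defaultdict

# Hypothetical detailed data: (customer, product, sales amount).
sales = [
    ("Alice", "Milk", 40.0),
    ("Bob", "Cola", 30.0),
    ("Alice", "Soap", 10.0),
    ("Carol", "Juice", 20.0),
]

# Aggregation: total sales per customer.
total_by_customer = defaultdict(float)
for customer, _product, amount in sales:
    total_by_customer[customer] += amount

# Percent-to-total: each customer's share of overall sales.
grand_total = sum(total_by_customer.values())
percent_to_total = {
    customer: 100 * total / grand_total
    for customer, total in total_by_customer.items()
}

# Ranking: top N customers by total sales (ties broken by name).
top_customers = sorted(
    total_by_customer.items(), key=lambda item: (-item[1], item[0])
)[:2]
```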
Multi-dimensional Data
• Measure - sales (actual, plan, variance)
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product | Region | Time
Industry | Country | Year
Category | Region | Quarter
Product | City | Month, Week
 | Office | Day
Conceptual Model for OLAP
• Numeric measures to be analyzed
– e.g. Sales (Rs), sales (volume), budget, revenue, inventory
• Dimensions
– other attributes of data, define the space
– e.g., store, product, date-of-sale
– hierarchies on dimensions
• e.g. branch -> city -> state
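A small sketch of this conceptual model: one numeric measure (sales) and a store dimension that carries the hierarchy branch -> city -> state. The branch codes, cities, states and amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy on the store dimension: branch -> city -> state.
BRANCH_TO_CITY = {"B1": "Seattle", "B2": "Seattle", "B3": "Portland"}
CITY_TO_STATE = {"Seattle": "WA", "Portland": "OR"}


def value_at_level(branch, level):
    """Map a branch to its ancestor at the requested hierarchy level."""
    if level == "branch":
        return branch
    city = BRANCH_TO_CITY[branch]
    if level == "city":
        return city
    if level == "state":
        return CITY_TO_STATE[city]
    raise ValueError(f"unknown level: {level}")


# Facts: dimension values (branch, date-of-sale) plus the measure (sales).
facts = [
    ("B1", "2018-10-01", 10.0),
    ("B2", "2018-10-01", 5.0),
    ("B3", "2018-10-02", 7.0),
]


def total_sales_by(level):
    """Sum the sales measure at one level of the store hierarchy."""
    totals = defaultdict(float)
    for branch, _date, amount in facts:
        totals[value_at_level(branch, level)] += amount
    return dict(totals)
```

For example, `total_sales_by("city")` groups the two Seattle branches together, while `total_sales_by("state")` moves one level further up the same hierarchy.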
Operations
• Rollup: summarize data
– e.g., given sales data, summarize sales for last year by product category and region
• Drill down: get more details
– e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within a specific region
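The two operations can be sketched as the same grouping function applied at different levels of detail. The rows below are invented sample data, and the "last year" filter from the slide's example is omitted to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical detailed sales rows.
sales = [
    {"category": "Dairy", "region": "West", "city": "Seattle", "amount": 20.0},
    {"category": "Dairy", "region": "West", "city": "Portland", "amount": 10.0},
    {"category": "Dairy", "region": "East", "city": "Boston", "amount": 5.0},
    {"category": "Drinks", "region": "West", "city": "Seattle", "amount": 15.0},
]


def summarize(rows, *dims):
    """Total the amount for every combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


# Roll-up: summarize sales by product category and region.
by_category_region = summarize(sales, "category", "region")

# Drill-down: within one specific region, break the same totals down by city.
west_only = [row for row in sales if row["region"] == "West"]
west_by_city = summarize(west_only, "category", "region", "city")
```

The drill-down totals for a group add up to the corresponding roll-up cell: Dairy in the West is 30.0 overall, split into 20.0 (Seattle) and 10.0 (Portland).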
More OLAP Operations
• Hypothesis-driven search: e.g. factors affecting defaulters
– view the defaulting rate by age, aggregated over the other dimensions
– for a particular age segment, detail along profession
• Need interactive response to aggregate queries
– => precompute various aggregates
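One way to picture "precompute various aggregates" is to total the measure once for every subset of the dimensions, so that later queries become dictionary lookups. This is a naive sketch under assumed dimension names and sample facts; real OLAP engines typically choose which aggregates to materialize rather than computing all of them.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "month")  # assumed dimension names

# Hypothetical atomic facts.
facts = [
    {"product": "Milk", "region": "West", "month": 1, "sales": 20.0},
    {"product": "Milk", "region": "East", "month": 1, "sales": 5.0},
    {"product": "Cola", "region": "West", "month": 2, "sales": 15.0},
]


def precompute_cube(rows):
    """Aggregate sales for every subset of DIMENSIONS (2**3 = 8 groupings)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row["sales"]
            cube[dims] = dict(totals)
    return cube


cube = precompute_cube(facts)

# An aggregate query is now a lookup instead of a scan over the facts.
west_total = cube[("region",)][("West",)]
```

The number of groupings doubles with each added dimension, which is why precomputation trades storage and load time for interactive response.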
OLAP: 3 Tier DSS
• Database Layer (Data Warehouse): Store atomic data in industry standard Data Warehouse.
• Application Logic Layer (OLAP Engine): Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
• Presentation Layer (Decision Support Client): Obtain multi-dimensional reports from the DSS Client.
Strengths of OLAP
• It is a powerful visualization tool
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful for finding clusters and outliers
• Many vendors offer OLAP tools
OLAP and Executive Information Systems
• Andyne Computing -- Pablo
• Arbor Software -- Essbase
• Cognos -- PowerPlay
• Comshare -- Commander OLAP
• Holistic Systems -- Holos
• Information Advantage -- AXSYS, WebOLAP
• Informix -- Metacube
• MicroStrategy -- DSS/Agent
• Oracle -- Express
• Pilot -- LightShip
• Planning Sciences -- Gentium
• Platinum Technology -- Prodea Beacon, Forest & Trees
• SAS Institute -- SAS/EIS, OLAP++
• Speedware -- Media
Thank you for your attention