Methods and tools of data analysis

Lecture introduction

Dr Eng. Anna Lamek

Labs notes and grading policy >> www.ii.pwr.wroc.pl/~anna.lamek

[email protected]


See more details: http://pwr.edu.pl/en/students/academic-calendar

7 meetings; lecture slot: Monday (odd weeks), 9:15-10:45, A1, room no 329

Organized classes commence on 1 Oct with an even week and last for 15 weeks (8 even weeks and 7 odd weeks) until 30 Jan 2019.


Lecture content and evaluation form

• Data warehousing / Data Mining – methods and practical applications: examples.

• Warehouse data pre-processing (how to prepare a decision matrix?)

• Multivariate analysis (optimal and acceptable variants)

• Decision trees.

• Regression trees.

• Seasonal decomposition in forecasting.

• Association rules methods.

• Last lecture: written test (14.01.2018), results will be announced online


Labs grading policy

• You will work in teams of 2-3 students. Each team will represent a single store (id 1-24) from our warehouse of daily item holdings; the stores operate in the USA, Canada and Mexico. Each team organizes its own work. Each team will get its own number, and the list of tasks will be available on the website (www.ii.pwr.wroc.pl/~anna.lamek).

• There will be 3 tasks:
– Intro task (today, at the end of the lab intro presentation)
– TASK 1: Examples of aggregation and work with queries (will be announced)
– TASK 2: Decision tree algorithm task (will be announced)

• The solution of each task should be documented by the team in a report file; all implementation files must also be delivered to the lecturer by the due date.

• Deadlines for each task will be announced on the website (www.ii.pwr.wroc.pl/~anna.lamek).

• Late penalty: 25% of the task's maximum points for each day of delay!

• The final grade depends on the partial grades for each phase:
– Quality of your documentation: 30%
– Quality of the implementation: 70%

• Your attendance during labs is obligatory!

• Solving the lab tasks requires knowing the theory (LECTURE): students are required to familiarize themselves with the theoretical background of each part of the lab before they start their lab work. You also need some skill in working with queries and decision tree algorithms in MS Access and MS Excel.


Lecture / Laboratory additional notes, office hours, important announcements

www.ii.pwr.wroc.pl/~anna.lamek

Password: kaczorekdonaldek

• Office hours (officially confirmed on Friday, 12th of October): B4, room no 5.13
– Monday ODD 7:00-9:00
– Monday EVEN 9:00-11:00
– Tuesday ODD 11:00-13:00
– Tuesday EVEN 13:00-15:00

[email protected] CONFIRM YOUR MEETING, PROVIDING DETAILS OF YOUR PROBLEM!!


PRIMARY LITERATURE:
[1] Hand D., Mannila H., Smyth P.: Principles of Data Mining, MIT Press, 2001.
[2] Han J., Kamber M.: Data Mining: Concepts and Techniques, Elsevier Morgan Kaufmann Publishers, 2006.
[3] Han J.: Data Mining: Concepts and Techniques, 2006.
[4] Larose D.T.: Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005.
[5] Shmueli G.: Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Wiley-Interscience, 2006.
[6] Sumathi S.: Introduction to Data Mining and Its Applications, 2006.

SECONDARY LITERATURE:
[1] Cook D.J., Holder L.B.: Mining Graph Data, Hoboken, N.J.: Wiley-Interscience, 2007.
[2] Morrison D.F.: Multivariate Statistical Methods, McGraw-Hill, 1990.
[3] Olson D.L.: Advanced Data Mining Techniques, Springer, 2008.
[4] Larose D.T.: Data Mining Methods and Models, IEEE Computer Society Press, 2006.


An Introduction to Data Warehousing / Data Mining with some basic examples

SOURCE: Prof. Li Yang's homepage @ UTC/CSE. Li Yang is an Associate Professor and Graduate Program Coordinator in the Department of Computer Science and Engineering at the University of Tennessee at Chattanooga. She is the Director of the UTC Information Security (InfoSec) Center.


Data, Data everywhere now...

• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences

• I can’t get the data I need
– need an expert to get the data

• I can’t understand the data I’ve found
– available data poorly documented

• I can’t use the data I’ve found
– results are unexpected
– data needs to be transformed from one form to another


Where is all the Customer Data?

[Figure: customer data scattered across systems; labels: EMCS; Legacy, packaged apps]


It’s obvious…

• It’s impossible to process such data manually in an effective way to reach business goals

• A set of tools that helps managers and CEOs make tactical and strategic decisions is needed >>> DATA WAREHOUSE


So What Is a Data Warehouse?

Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]

• By comparison: an OLTP (online transaction processing) or operational system is used to deal with the everyday running of one aspect of an enterprise.

• OLTP systems are usually designed independently of each other and it is difficult for them to share information.

• Thanks to OLAP we are able to provide information in the right place, at the right time and at the right cost


Why Do We Need Data Warehouses?

• Consolidation of information resources

• Improved query performance

• Separate research and decision support functions from the operational systems

• Foundation for data mining, data visualization, advanced reporting and OLAP tools


Why Data Warehousing?

• Which are our lowest/highest margin customers?

• Who are my customers and what products are they buying?

• Which customers are most likely to go to the competition?

• What impact will new products/services have on revenue and margins?

• What product promotions have the biggest impact on revenue?

• What is the most effective distribution channel?


What Is a Data Warehouse Used for?

• Knowledge discovery

– Making consolidated reports

– Finding relationships and correlations (even unexpected ones >> data mining)

– Examples

• Banks identifying credit risks

• Insurance companies searching for fraud

• Medical research (disease reasons/causes)


How Do Data Warehouses Differ From Operational Systems?

• Goals

• Structure

• Size

• Performance optimization

• Technologies used


Comparison Chart of Database Types

Data warehouse | Operational system
Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries


Design Differences

• Operational System: ER Diagram

• Data Warehouse: Star Schema


Supporting a Complete Solution

• Operational System: Data Entry

• Data Warehouse: Data Retrieval


Data Warehouses, Data Marts, and Operational Data Stores

• Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.

• Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.

• Operational Data Store (ODS) – A point of integration for operational systems that were developed independently of each other. Since an ODS supports day-to-day operations, it needs to be continually updated.


Decision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgements


What are the users saying...

• Data should be integrated across the enterprise

• Summary data has real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required


Data Warehousing -- It is a process

• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

• A decision support database maintained separately from the organization’s operational database


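Data Warehouse Architecture

[Figure: source data (relational databases, legacy data, purchased data) goes through extraction and cleansing and an optimized loader into the data warehouse engine, which is backed by a metadata repository; users analyze and query the warehouse]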


From the Data Warehouse to Data Marts

[Figure: from the organizationally structured data warehouse to departmentally structured and individually structured data marts; scale labels "Less" / "More" for "History, Normalized, Detailed", and "Data" / "Information"]


Users have different views of Data

[Figure labels: organizationally structured data, OLAP]

• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

• Farmers: Harvest information from known access paths

• Tourists: Browse information harvested by farmers


Schema Design

Schema types of a DW:

– Star schema

– Snowflake schema

– Fact constellation


Star Schema

• A single fact table and for each dimension one dimension table

• Does not capture hierarchies directly

[Figure: star schema with the fact table (date, custno, prodno, cityname, sales) in the centre, linked to the dimension tables time, prod, cust and city]
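The star layout above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the fact columns (date, custno, prodno, cityname, sales) come from the slide, while the `dim_` table names, the extra dimension attributes (`category`, `country`, `custname`, `month`, `year`) and all sample rows are hypothetical.

```python
import sqlite3


def build_star_schema():
    """Create an in-memory star schema: one fact table, one table per dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_time (date TEXT PRIMARY KEY, month INTEGER, year INTEGER);
        CREATE TABLE dim_cust (custno INTEGER PRIMARY KEY, custname TEXT);
        CREATE TABLE dim_prod (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
        CREATE TABLE dim_city (cityname TEXT PRIMARY KEY, country TEXT);
        -- The fact table holds foreign keys to every dimension plus the measure.
        CREATE TABLE fact (
            date     TEXT    REFERENCES dim_time(date),
            custno   INTEGER REFERENCES dim_cust(custno),
            prodno   INTEGER REFERENCES dim_prod(prodno),
            cityname TEXT    REFERENCES dim_city(cityname),
            sales    REAL
        );
        """
    )
    # Hypothetical sample rows, for illustration only.
    conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                     [("2019-01-07", 1, 2019), ("2019-02-04", 2, 2019)])
    conn.executemany("INSERT INTO dim_cust VALUES (?, ?)",
                     [(10, "Alice"), (11, "Bob")])
    conn.executemany("INSERT INTO dim_prod VALUES (?, ?, ?)",
                     [(1, "Cola", "Drinks"), (2, "Milk", "Dairy")])
    conn.executemany("INSERT INTO dim_city VALUES (?, ?)",
                     [("Seattle", "USA"), ("Toronto", "Canada")])
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?, ?)", [
        ("2019-01-07", 10, 1, "Seattle", 5.0),
        ("2019-01-07", 11, 2, "Seattle", 3.0),
        ("2019-02-04", 10, 1, "Toronto", 4.0),
        ("2019-02-04", 11, 1, "Seattle", 2.5),
    ])
    return conn


def sales_by_category_and_country(conn):
    """Join the fact table to two dimensions and total the sales measure."""
    return conn.execute(
        """
        SELECT p.category, c.country, SUM(f.sales)
        FROM fact f
        JOIN dim_prod p ON f.prodno = p.prodno
        JOIN dim_city c ON f.cityname = c.cityname
        GROUP BY p.category, c.country
        ORDER BY p.category, c.country
        """
    ).fetchall()


rows = sales_by_category_and_country(build_star_schema())
```

Every query reaches the facts through the dimensions: the descriptive attributes live in the small dimension tables, and the fact table only carries keys and the numeric measure.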


Dimension Tables

– Define business in terms already familiar to users

– Wide rows with lots of descriptive text

– Small tables (about a million rows)

– Joined to fact table by a foreign key

– Heavily indexed

– Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.


Fact Table

• Central table

– Typical example: individual sales records

– mostly raw numeric items

– narrow rows, a few columns at most

– large number of rows (millions to a billion)

– Access via dimensions


Snowflake schema

• Represent dimensional hierarchy directly by normalizing tables.

• Easy to maintain and saves storage

[Figure: snowflake schema with the fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city, plus a normalized region table]
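A minimal sketch of the snowflake idea, again with `sqlite3`. It assumes that the region table hangs off the city dimension (the slide only names the tables); the `dim_` prefixes, the `regionno` key and the sample rows are hypothetical. The hierarchy city -> region is stored once in its own table, so a query by region needs one more join than in a star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The region level of the hierarchy gets its own normalized table.
    CREATE TABLE dim_region (regionno INTEGER PRIMARY KEY, regionname TEXT);
    CREATE TABLE dim_city (
        cityname TEXT PRIMARY KEY,
        regionno INTEGER REFERENCES dim_region(regionno)
    );
    CREATE TABLE fact (cityname TEXT REFERENCES dim_city(cityname), sales REAL);

    -- Hypothetical sample rows, for illustration only.
    INSERT INTO dim_region VALUES (1, 'West'), (2, 'East');
    INSERT INTO dim_city VALUES ('Seattle', 1), ('Portland', 1), ('Boston', 2);
    INSERT INTO fact VALUES ('Seattle', 5.0), ('Portland', 2.0),
                            ('Boston', 4.0), ('Seattle', 1.5);
    """
)

# fact -> city -> region: the extra join is the price of the saved storage.
sales_by_region = conn.execute(
    """
    SELECT r.regionname, SUM(f.sales)
    FROM fact f
    JOIN dim_city c ON f.cityname = c.cityname
    JOIN dim_region r ON c.regionno = r.regionno
    GROUP BY r.regionname
    ORDER BY r.regionname
    """
).fetchall()
```

Renaming a region now means updating a single row in `dim_region`, which is the maintenance benefit the slide mentions.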


Fact Constellation

– Multiple fact tables that share many dimension tables

– Booking and Checkout may share many dimension tables in the hotel industry

[Figure: fact tables Booking and Checkout sharing the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]


Which structure is the best one?


Deploying Data Warehouses

• What business information keeps you in business today? What business information can put you out of business tomorrow?

• What business information should be a mouse click away?

• What business conditions are driving the need for business information?


Cultural Considerations

• Not just a technology project

• New way of using information to support daily activities and decision making

• Care must be taken to prepare organization for change

• Must have organizational backing and support


User Training

• Users must have a higher level of IT proficiency than for operational systems

• Training to help users analyze data in the warehouse effectively


Summary: Building a Data Warehouse

Data Warehouse Lifecycle

– Analysis

– Design

– Import data

– Install front-end tools

– Test and deploy


A case -- the STORET Central Warehouse

• Improved performance and faster data retrieval

• Ability to produce larger reports

• Ability to provide more data query options

• Streamlined application navigation


Old Web Application Flow


Central Warehouse Application Flow

Search Criteria Selection -> Report Size Feedback / Report Customization -> Report Generation


STORET Central Warehouse: Web Application Demo

http://epa.gov/storet/dw_home.html


STORET Central Warehouse – Potential Future Enhancements

• More query functionality

• Additional report types

• Web Services

• Additional source systems?

[Figure: source systems STORET, State System A, State System B]


Data Warehouse Components

[Figure: four stages, left to right]

• Source Systems (Legacy): data is extracted into the staging area (arrows labelled "extract")

• Data Staging Area: data clean-up and processing (arrows labelled "populate, replicate, recover" lead to the presentation servers)

• Presentation Servers (“The Data Warehouse”): Data Mart #1, Data Mart #2, Data Mart #3, connected through conformed dimensions and conformed facts (arrows labelled "feed" lead to the end-user tools)

• End User Data Access: ad hoc query tools, report writers, end user applications, data mining

• Feedback arrows: upload cleaned dimensions, upload model results

SOURCE: Ralph Kimball


Online analytical processing (OLAP)


Nature of OLAP Analysis

• Aggregation -- (total sales, percent-to-total)

• Comparison -- Budget vs. Expenses

• Ranking -- "Top 10 customers"

• Access to detailed and aggregate data

• Complex criteria specification

• Visualization

• Need interactive response to aggregate queries
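Three of the analysis types listed above (aggregation, percent-to-total and ranking) can be sketched in plain Python. The customer names and amounts are hypothetical sample data, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical sales records: (customer, amount).
SALES = [("Alice", 120.0), ("Bob", 80.0), ("Alice", 30.0), ("Carol", 170.0)]


def total_by_customer(rows):
    """Aggregation: total sales per customer."""
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)


def percent_to_total(totals):
    """Each customer's share of the grand total, in percent."""
    grand_total = sum(totals.values())
    return {key: 100.0 * value / grand_total for key, value in totals.items()}


def top_n(totals, n):
    """Ranking: the n customers with the highest totals, best first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


totals = total_by_customer(SALES)
shares = percent_to_total(totals)
top_two = top_n(totals, 2)
```

An OLAP tool runs the same kinds of computations, but over much larger data and usually against precomputed summaries so that the response stays interactive.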


Multi-dimensional Data

• Measure - sales (actual, plan, variance)

Dimensions: Product, Region, Time

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1-7)]

Hierarchical summarization paths

• Product: Industry -> Category -> Product

• Region: Country -> Region -> City -> Office

• Time: Year -> Quarter -> Month / Week -> Day


Conceptual Model for OLAP

• Numeric measures to be analyzed

– e.g. Sales (Rs), sales (volume), budget, revenue, inventory

• Dimensions

– other attributes of data, define the space

– e.g., store, product, date-of-sale

– hierarchies on dimensions

• e.g. branch -> city -> state
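The conceptual model can be sketched as facts that carry dimension values plus a numeric measure, with a hierarchy defined on one dimension. The branch -> city -> state hierarchy is the slide's example; the concrete branches, cities, products and volumes below are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy on the store dimension: branch -> city -> state.
BRANCH_TO_CITY = {"B1": "Austin", "B2": "Austin", "B3": "Dallas", "B4": "Denver"}
CITY_TO_STATE = {"Austin": "TX", "Dallas": "TX", "Denver": "CO"}

# Each fact: dimension values (branch, product, date-of-sale) and a measure (sales volume).
FACTS = [
    ("B1", "Soap", "2019-01-07", 10),
    ("B2", "Soap", "2019-01-07", 5),
    ("B3", "Milk", "2019-01-08", 7),
    ("B4", "Milk", "2019-01-08", 3),
]


def lift(branch, level):
    """Map a branch to its ancestor at the requested hierarchy level."""
    if level == "branch":
        return branch
    city = BRANCH_TO_CITY[branch]
    if level == "city":
        return city
    if level == "state":
        return CITY_TO_STATE[city]
    raise ValueError(f"unknown level: {level}")


def sales_volume_at(level):
    """Total the measure at one level of the store hierarchy."""
    totals = defaultdict(int)
    for branch, _product, _date, volume in FACTS:
        totals[lift(branch, level)] += volume
    return dict(totals)


by_city = sales_volume_at("city")
by_state = sales_volume_at("state")
```

The dimensions define the space in which each fact sits, and the hierarchy decides how coarse or fine a view of the measure is returned.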


Operations

• Rollup: summarize data

– e.g., given sales data, summarize sales for last year by product category and region

• Drill down: get more details

– e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within one specific region
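Roll-up and drill-down can both be expressed as the same grouping step applied at a coarser or finer level. The sketch below is one possible way to write it; the `aggregate` helper, the field names and the sample rows are hypothetical.

```python
from collections import defaultdict

FIELDS = ("category", "region", "city")

# Hypothetical sales rows: (category, region, city, amount).
SALES = [
    ("Drinks", "North", "Oslo", 10.0),
    ("Drinks", "North", "Bergen", 4.0),
    ("Drinks", "South", "Rome", 6.0),
    ("Dairy", "North", "Oslo", 3.0),
    ("Dairy", "South", "Rome", 5.0),
    ("Dairy", "South", "Naples", 2.0),
]


def aggregate(rows, dims, where=None):
    """Sum the amount grouped by `dims`, optionally keeping only rows matching `where`."""
    positions = [FIELDS.index(d) for d in dims]
    totals = defaultdict(float)
    for row in rows:
        if where and any(row[FIELDS.index(k)] != v for k, v in where.items()):
            continue
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


# Roll-up: summarize by product category and region.
rollup = aggregate(SALES, ("category", "region"))

# Drill-down: break sales down by city within each region...
by_city = aggregate(SALES, ("region", "city"))

# ...or within one specific region only.
south_only = aggregate(SALES, ("city",), where={"region": "South"})
```

Moving from `rollup` to `by_city` adds detail along the region hierarchy; moving back removes it.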


More OLAP Operations

• Hypothesis-driven search: e.g. factors affecting defaulters

– view the default rate by age, aggregated over the other dimensions

– for a particular age segment, drill down by profession

• Need interactive response to aggregate queries

– => precompute various aggregates
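One way to picture the precomputation is to build the default rate for every combination of dimensions once, so that later queries become dictionary lookups. This is a toy sketch with two dimensions, not how a production OLAP engine chooses which aggregates to store; the loan records and names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("age_group", "profession")

# Hypothetical loan records: (age_group, profession, defaulted as 0 or 1).
LOANS = [
    ("20-29", "clerk", 1),
    ("20-29", "clerk", 0),
    ("20-29", "engineer", 0),
    ("30-39", "clerk", 1),
    ("30-39", "engineer", 0),
    ("30-39", "engineer", 0),
]


def precompute_default_rates(rows):
    """Build the default rate for every subset of DIMS (a tiny data cube)."""
    cube = {}
    for size in range(len(DIMS) + 1):
        for positions in combinations(range(len(DIMS)), size):
            counts = defaultdict(lambda: [0, 0])  # key -> [defaults, loans]
            for row in rows:
                key = tuple(row[p] for p in positions)
                counts[key][0] += row[-1]
                counts[key][1] += 1
            group = tuple(DIMS[p] for p in positions)
            cube[group] = {k: d / n for k, (d, n) in counts.items()}
    return cube


cube = precompute_default_rates(LOANS)

# Default rate by age, aggregated over profession.
rate_by_age = cube[("age_group",)]

# Detail for one age segment along profession: a lookup, not a rescan of LOANS.
rate_young_clerks = cube[("age_group", "profession")][("20-29", "clerk")]
```

With d dimensions there are 2^d such groupings, which is why real systems precompute only a selection of them.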


OLAP: 3 Tier DSS

• Data Warehouse (Database Layer): Store atomic data in industry standard Data Warehouse.

• OLAP Engine (Application Logic Layer): Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.

• Decision Support Client (Presentation Layer): Obtain multi-dimensional reports from the DSS Client.


Strengths of OLAP

• It is a powerful visualization tool

• It provides fast, interactive response times

• It is good for analyzing time series

• It can be useful for finding clusters and outliers

• Many vendors offer OLAP tools


OLAP and Executive Information Systems

• Andyne Computing -- Pablo

• Arbor Software -- Essbase

• Cognos -- PowerPlay

• Comshare -- Commander OLAP

• Holistic Systems -- Holos

• Information Advantage -- AXSYS, WebOLAP

• Informix -- Metacube

• MicroStrategy -- DSS/Agent

• Oracle -- Express

• Pilot -- LightShip

• Planning Sciences -- Gentium

• Platinum Technology -- Prodea Beacon, Forest & Trees

• SAS Institute -- SAS/EIS, OLAP++

• Speedware -- Media


Thank you for your attention