Post on 29-Jan-2020

Methods and tools of data analysis

Lecture introduction

Dr Eng. Anna Lamek

Labs notes and grading policy >> www.ii.pwr.wroc.pl/~anna.lamek

anna.lamek@pwr.wroc.pl

See more details: http://pwr.edu.pl/en/students/academic-calendar

7 meetings, term of our lecture: Monday ODD 9:15-10:45 A1, room no 329

Organized classes commence on 1 Oct with an even week and last for 15 weeks (8 even weeks and 7 odd weeks) until 30 Jan 2019.

Lecture content and evaluation form

• Data warehousing / Data Mining – methods and practical applications: examples.

• Warehouse data pre-processing (how to prepare a decision matrix?)

• Multivariate analysis (optimal and acceptable variants)

• Decision trees.

• Regression trees.

• Seasonal decomposition in forecasting.

• Association rules methods.

• Last lecture: written test (14.01.2018), results will be announced online

Labs grading policy

• You will work in teams of 2-3 students. Every group will represent a single store (id 1-24) in our warehouse of daily items holding, running its business in the USA, Canada and Mexico. The team organizes its own work. Each group will get its own number, and the list of tasks will be available on the website (www.ii.pwr.wroc.pl/~anna.lamek)

• There will be 3 tasks:

– Intro task (today, at the end of the lab intro presentation)

– TASK 1: Examples of aggregation and work with queries (will be announced)

– TASK 2: Decision trees algorithm task (will be announced)

• The solution of each task should be documented by the group in a report file; all implementation files should also be prepared for the lecturer by the due dates

• Deadlines for each task will be announced on the website (www.ii.pwr.wroc.pl/~anna.lamek)

• Penalty for lateness: 25% of the maximum points for the task per day!

• The final grade depends on partial grades for each phase:

– Quality of your documentation: 30%

– Quality of the implementation: 70%

• Your attendance during labs is obligatory!

• To solve the lab tasks it is necessary to know the theory (LECTURE). Students are required to familiarize themselves with the theoretical characteristics of each part of the lab before they start their lab work. It is also necessary to have some skill in working with queries and decision tree algorithms in MS Access and MS Excel

Lecture / Laboratory additional notes, office hours, important announcements

www.ii.pwr.wroc.pl/~anna.lamek

Password: kaczorekdonaldek

• Office hours (officially confirmed on Friday, 12th of October): B4, room no 5.13

– Monday ODD 7:00-9:00

– Monday EVEN 9:00-11:00

– Tuesday ODD 11:00-13:00

– Tuesday EVEN 13:00-15:00

anna.lamek@pwr.wroc.pl

PLEASE CONFIRM YOUR MEETING, PROVIDING DETAILS OF YOUR PROBLEM!

PRIMARY LITERATURE:

[1] Hand D., Mannila H., Smyth P.: Principles of Data Mining, MIT Press, 2001.

[2] Han J., Kamber M.: Data Mining: Concepts and Techniques, Elsevier Morgan Kaufmann Publishers, 2006.

[3] Han J., Jiawei: Data Mining: Concepts and Techniques, 2006.

[4] Larose D.T.: Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005.

[5] Shmueli G.: Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Wiley-Interscience, 2006.

[6] Sumathi S.: Introduction to Data Mining and Its Applications, 2006.

SECONDARY LITERATURE:

[1] Cook D.J., Holder L.B.: Mining Graph Data, Hoboken, N.J.: Wiley-Interscience, 2007.

[2] Morrison D.F.: Multivariate Statistical Methods, McGraw-Hill, 1990.

[3] Olson D.L.: Advanced Data Mining Techniques, Springer, 2008.

[4] Larose D.T.: Data Mining Methods and Models, IEEE Computer Society Press, 2006.

An Introduction to Data Warehousing / Data Mining with some basic examples

SOURCE: Prof. Li Yang's Homepage @ UTC/CSE. Associate Professor and Graduate Program Coordinator in the Department of Computer Science and Engineering at the University of Tennessee at Chattanooga. She is the Director of the UTC Information Security (InfoSec) Center.

Data, Data everywhere now...

• I can’t find the data I need

– data is scattered over the network

– many versions, subtle differences

• I can’t get the data I need

– need an expert to get the data

• I can’t understand the data I’ve found

– available data poorly documented

• I can’t use the data I’ve found

– results are unexpected

– data needs to be transformed from one form to another

[Figure: Where is all the customer data? Labels: EMCS; legacy and packaged apps.]

It’s obvious…

• It’s impossible to process this data manually in an effective way to reach business goals

• A set of tools that can help managers and CEOs make tactical and strategic decisions is needed >>> DATA WAREHOUSE

So What Is a Data Warehouse?

Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context. [Barry Devlin]

• By comparison: an OLTP (online transaction processing) or operational system is used to deal with the everyday running of one aspect of an enterprise.

• OLTP systems are usually designed independently of each other and it is difficult for them to share information.

• Thanks to OLAP (online analytical processing) we are able to provide information in the right place, at the right time and at the right cost

Why Do We Need Data Warehouses?

• Consolidation of information resources

• Improved query performance

• Separate research and decision support functions from the operational systems

• Foundation for data mining, data visualization, advanced reporting and OLAP tools

Why Data Warehousing?

• Which are our lowest/highest margin customers?

• Who are my customers and what products are they buying?

• Which customers are most likely to go to the competition?

• What impact will new products/services have on revenue and margins?

• What product promotions have the biggest impact on revenue?

• What is the most effective distribution channel?

What Is a Data Warehouse Used for?

• Knowledge discovery

– Making consolidated reports

– Finding relationships and correlations (even unexpected ones >> data mining)

– Examples

• Banks identifying credit risks

• Insurance companies searching for fraud

• Medical research (disease reasons/causes)

How Do Data Warehouses Differ From Operational Systems?

• Goals

• Structure

• Size

• Performance optimization

• Technologies used

Comparison Chart of Database Types

Data warehouse | Operational system
Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries

Design Differences

• Operational System: ER Diagram

• Data Warehouse: Star Schema

Supporting a Complete Solution

• Operational System: Data Entry

• Data Warehouse: Data Retrieval

Data Warehouses, Data Marts, and Operational Data Stores

• Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.

• Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.

• Operational Data Store (ODS) – A point of integration for operational systems that were developed independently of each other. Since an ODS supports day-to-day operations, it needs to be continually updated.

Decision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgements

What are the users saying...

• Data should be integrated across the enterprise

• Summary data has a real value to the organization

• Historical data holds the key to understanding data over time

• What-if capabilities are required

Data Warehousing -- It is a process

• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

• A decision support database maintained separately from the organization’s operational database

Data Warehouse Architecture

[Figure: relational databases, legacy data and purchased data pass through extraction, cleansing and an optimized loader into the data warehouse engine, which supports analysis and query; a metadata repository accompanies the engine.]
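
The extraction, cleansing and loading steps named in the architecture can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline of any particular product: the two source lists, the field names `customer` and `amount`, the `sales` table and the cleansing rules are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical extracts from two source systems (invented rows and fields).
relational_rows = [{"customer": " Alice ", "amount": "10.50"},
                   {"customer": "BOB", "amount": "7"}]
legacy_rows = [{"customer": "alice", "amount": "4.50"},
               {"customer": "", "amount": "n/a"}]  # unusable record


def extract():
    """Extraction: collect raw rows from every source."""
    return relational_rows + legacy_rows


def cleanse(rows):
    """Cleansing: normalize names, parse amounts, drop unusable rows."""
    clean = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount cannot be parsed
        if name:
            clean.append((name, amount))
    return clean


def load(conn, rows):
    """Loading: write the cleansed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, cleanse(extract()))

# Query: the consolidated view that neither source could give alone.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

After cleansing, the differently spelled "Alice" records from both sources end up under one customer, which is the kind of consolidation the warehouse is meant to provide.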

From the Data Warehouse to Data Marts

[Figure: from the data warehouse (organizationally structured) to data marts (departmentally structured, then individually structured). Axis labels: less / more history, normalized, detailed; data / information.]

Users have different views of Data

• Explorers: seek out the unknown and previously unsuspected rewards hiding in the detailed data

• Farmers: harvest information from known access paths

• Tourists: browse information harvested by farmers

[Figure labels: organizationally structured data; OLAP.]

Schema Design

Schema Types of DW

– Star schema

– Snowflake schema

– Fact constellation

Star Schema

• A single fact table and for each dimension one dimension table

• Does not capture hierarchies directly

[Figure: star schema. The dimension tables time, prod, cust and city surround the fact table (date, custno, prodno, cityname, sales).]

Dimension Tables

• Dimension tables

– Define business in terms already familiar to users

– Wide rows with lots of descriptive text

– Small tables (about a million rows)

– Joined to fact table by a foreign key

– Heavily indexed

– Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table

• Central table

– Typical example: individual sales records

– mostly raw numeric items

– narrow rows, a few columns at most

– large number of rows (millions to a billion)

– Access via dimensions
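
A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The fact columns (date, custno, prodno, cityname, sales) come from the slide; the table names (`dim_time`, `dim_cust`, `dim_prod`, `dim_city`, `fact_sales`), the descriptive columns and all rows are assumptions invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, wide and descriptive.
CREATE TABLE dim_time (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_cust (custno INTEGER PRIMARY KEY, custname TEXT);
CREATE TABLE dim_prod (prodno INTEGER PRIMARY KEY, prodname TEXT, category TEXT);
CREATE TABLE dim_city (cityname TEXT PRIMARY KEY, region TEXT);
-- Fact table: narrow rows, foreign keys to every dimension plus the measure.
CREATE TABLE fact_sales (
    date     TEXT    REFERENCES dim_time(date),
    custno   INTEGER REFERENCES dim_cust(custno),
    prodno   INTEGER REFERENCES dim_prod(prodno),
    cityname TEXT    REFERENCES dim_city(cityname),
    sales    REAL
);
""")
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [("2018-10-01", "2018-10", 2018), ("2018-11-05", "2018-11", 2018)])
conn.executemany("INSERT INTO dim_cust VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO dim_prod VALUES (?, ?, ?)",
                 [(10, "Milk", "Dairy"), (11, "Cream", "Dairy"), (12, "Cola", "Drinks")])
conn.executemany("INSERT INTO dim_city VALUES (?, ?)",
                 [("Seattle", "W"), ("Miami", "S")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    ("2018-10-01", 1, 10, "Seattle", 4.0),
    ("2018-10-01", 2, 12, "Miami", 2.5),
    ("2018-11-05", 1, 11, "Seattle", 3.0),
    ("2018-11-05", 2, 10, "Miami", 1.5),
])


def sales_by_category_and_region(conn):
    """Access the facts via two dimensions: product category and city region."""
    return conn.execute("""
        SELECT p.category, c.region, SUM(f.sales)
        FROM fact_sales AS f
        JOIN dim_prod AS p ON p.prodno = f.prodno
        JOIN dim_city AS c ON c.cityname = f.cityname
        GROUP BY p.category, c.region
        ORDER BY p.category, c.region
    """).fetchall()


print(sales_by_category_and_region(conn))
```

Note that the region is stored as a plain column of the city dimension here, which is what "does not capture hierarchies directly" means in practice.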

Snowflake schema

• Represent dimensional hierarchy directly by normalizing tables.

• Easy to maintain and saves storage

[Figure: snowflake schema. The fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, prod, cust and city, with an additional region table.]
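
The normalization step can be sketched as follows, again with `sqlite3`. It assumes that the region table hangs off the city dimension (city -> region); the table names, the `regionno` key and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The hierarchy city -> region is represented directly by a separate table.
CREATE TABLE dim_region (regionno INTEGER PRIMARY KEY, regionname TEXT);
CREATE TABLE dim_city (
    cityname TEXT PRIMARY KEY,
    regionno INTEGER REFERENCES dim_region(regionno)
);
CREATE TABLE fact_sales (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES dim_city(cityname),
    sales REAL
);
INSERT INTO dim_region VALUES (1, 'West'), (2, 'South');
INSERT INTO dim_city VALUES ('Seattle', 1), ('Portland', 1), ('Miami', 2);
INSERT INTO fact_sales VALUES
    ('2018-10-01', 1, 10, 'Seattle', 4.0),
    ('2018-10-01', 2, 12, 'Portland', 2.0),
    ('2018-10-02', 1, 11, 'Miami', 3.0);
""")


def sales_by_region(conn):
    """Reaching the region now takes one extra join through the city table."""
    return conn.execute("""
        SELECT r.regionname, SUM(f.sales)
        FROM fact_sales AS f
        JOIN dim_city AS c ON c.cityname = f.cityname
        JOIN dim_region AS r ON r.regionno = c.regionno
        GROUP BY r.regionname
        ORDER BY r.regionname
    """).fetchall()


print(sales_by_region(conn))
```

The region name is stored once per region instead of once per city, which is where the storage saving comes from; the price is the additional join at query time.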

Fact Constellation

• Multiple fact tables that share many dimension tables

• Booking and Checkout may share many dimension tables in the hotel industry

[Figure: fact tables Booking and Checkout connected to the dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer.]
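
A reduced version of the hotel example can be sketched like this. Only two of the dimensions are kept, and the column names, the measures (`nights`, `amount`) and all rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE dim_hotel (hotelno INTEGER PRIMARY KEY, hotelname TEXT);
CREATE TABLE dim_customer (custno INTEGER PRIMARY KEY, custname TEXT);
-- Two fact tables that reference the same dimensions.
CREATE TABLE fact_booking (
    hotelno INTEGER REFERENCES dim_hotel(hotelno),
    custno  INTEGER REFERENCES dim_customer(custno),
    nights  INTEGER
);
CREATE TABLE fact_checkout (
    hotelno INTEGER REFERENCES dim_hotel(hotelno),
    custno  INTEGER REFERENCES dim_customer(custno),
    amount  REAL
);
INSERT INTO dim_hotel VALUES (1, 'Grand'), (2, 'Plaza');
INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO fact_booking VALUES (1, 1, 3), (1, 2, 2), (2, 1, 1);
INSERT INTO fact_checkout VALUES (1, 1, 300.0), (2, 1, 120.0);
""")


def summary_per_hotel(conn):
    """Combine both fact tables through the shared hotel dimension."""
    return conn.execute("""
        SELECT h.hotelname,
               (SELECT COALESCE(SUM(b.nights), 0)
                  FROM fact_booking AS b WHERE b.hotelno = h.hotelno),
               (SELECT COALESCE(SUM(c.amount), 0)
                  FROM fact_checkout AS c WHERE c.hotelno = h.hotelno)
        FROM dim_hotel AS h
        ORDER BY h.hotelname
    """).fetchall()


print(summary_per_hotel(conn))
```

Each fact table is aggregated separately before being lined up on the hotel dimension; joining the two fact tables directly to each other would multiply rows and inflate the sums.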

Which structure is the best one?

Deploying Data Warehouses

• What business information keeps you in business today? What business information can put you out of business tomorrow?

• What business information should be a mouse click away?

• What business conditions are driving the need for business information?

Cultural Considerations

• Not just a technology project

• New way of using information to support daily activities and decision making

• Care must be taken to prepare organization for change

• Must have organizational backing and support

User Training

• Users must have a higher level of IT proficiency than for operational systems

• Training to help users analyze data in the warehouse effectively

Summary: Building a Data Warehouse

– Analysis

– Design

– Import data

– Install front-end tools

– Test and deploy

Data Warehouse Lifecycle

A case -- the STORET Central Warehouse

• Improved performance and faster data retrieval

• Ability to produce larger reports

• Ability to provide more data query options

• Streamlined application navigation

[Figure: Old Web Application Flow compared with Central Warehouse Application Flow. Steps shown: Search Criteria Selection; Report Size Feedback / Report Customization; Report Generation.]

http://epa.gov/storet/dw_home.html

STORET Central Warehouse:

Web Application Demo

STORET Central Warehouse – Potential Future Enhancements

• More query functionality

• Additional report types

• Web Services

• Additional source systems?

[Figure labels: STORET; State System A; State System B.]

Data Warehouse Components

[Figure: data flows through four stages.

– Source Systems (Legacy): data is extracted

– Data Staging Area: data clean-up and processing; populate, replicate, recover

– "The Data Warehouse" Presentation Servers: Data Mart #1, Data Mart #2 and Data Mart #3, tied together by conformed dimensions and conformed facts

– End User Data Access: the data marts feed end user applications, report writers, ad hoc query tools and data mining

Further arrow labels: upload model results; upload cleaned dimensions.]

SOURCE: Ralph Kimball

Online analytical processing (OLAP)

Nature of OLAP Analysis

• Aggregation -- (total sales, percent-to-total)

• Comparison -- Budget vs. Expenses

• Ranking -- "Top 10 customers"

• Access to detailed and aggregate data

• Complex criteria specification

• Visualization

• Need interactive response to aggregate queries

Multi-dimensional Data

• Measure: sales (actual, plan, variance)

• Dimensions: Product, Region, Time

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7).]

Hierarchical summarization paths

Product | Region | Time
Industry | Country | Year
Category | Region | Quarter
Product | City | Month, Week
 | Office | Day

Conceptual Model for OLAP

• Numeric measures to be analyzed

– e.g. Sales (Rs), sales (volume), budget, revenue, inventory

• Dimensions

– other attributes of data, define the space

– e.g., store, product, date-of-sale

– hierarchies on dimensions

• e.g. branch -> city -> state

Operations

• Rollup: summarize data

– e.g., given sales data, summarize sales for last year by product category and region

• Drill down: get more details

– e.g., given summarized sales as above, find the breakdown of sales by city within each region, or within a specific region
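
Both operations are the same aggregation run at different levels of the hierarchy, which a short sketch makes visible. The detail records, the column order and the helper `aggregate` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical detail records: (category, region, city, sales).
SALES = [
    ("Dairy", "North", "Oslo", 10.0),
    ("Dairy", "North", "Bergen", 5.0),
    ("Dairy", "South", "Rome", 8.0),
    ("Drinks", "North", "Oslo", 4.0),
    ("Drinks", "South", "Rome", 6.0),
]
CATEGORY, REGION, CITY, AMOUNT = 0, 1, 2, 3


def aggregate(records, key_columns):
    """Sum sales grouped by the given column positions."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in key_columns)
        totals[key] += record[AMOUNT]
    return dict(totals)


# Rollup: summarize by product category and region.
rollup = aggregate(SALES, (CATEGORY, REGION))

# Drill down: add the city level below each region.
drill_down = aggregate(SALES, (CATEGORY, REGION, CITY))

# Drill down within one specific region only.
north_only = aggregate(
    [r for r in SALES if r[REGION] == "North"], (CATEGORY, REGION, CITY)
)

print(rollup)
print(north_only)
```

Dropping a column from the grouping key rolls up; adding one drills down.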

More OLAP Operations

• Hypothesis driven search: e.g. factors affecting defaulters

– view defaulting rate on age aggregated over other dimensions

– for a particular age segment, detail along profession

• Need interactive response to aggregate queries

– => precompute various aggregates
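
One way to picture precomputation is to build the aggregates for every combination of dimensions once, so that each later question becomes a dictionary lookup. The sketch below does this for the defaulters example; the loan records, the dimension names `age_group` and `profession`, and the helper functions are assumptions, and real OLAP engines usually materialize only a selected subset of such aggregates.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age_group", "profession")

# Hypothetical loan records: (age_group, profession, defaulted as 0 or 1).
LOANS = [
    ("18-30", "student", 1),
    ("18-30", "engineer", 0),
    ("18-30", "student", 0),
    ("31-50", "engineer", 0),
    ("31-50", "teacher", 1),
    ("31-50", "engineer", 0),
]


def precompute(records):
    """Build [defaults, loans] counts for every subset of the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            cells = defaultdict(lambda: [0, 0])
            for record in records:
                key = tuple(record[i] for i in dims)
                cells[key][0] += record[-1]  # number of defaults
                cells[key][1] += 1           # number of loans
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cells)
    return cube


def default_rate(cube, dims, key):
    """Answer an aggregate query by lookup instead of rescanning the records."""
    defaults, loans = cube[dims][key]
    return defaults / loans


CUBE = precompute(LOANS)

# Defaulting rate by age, aggregated over the other dimensions.
print(default_rate(CUBE, ("age_group",), ("18-30",)))
# For one age segment, detail along profession.
print(default_rate(CUBE, ("age_group", "profession"), ("18-30", "student")))
```

With two dimensions the cube holds four groupings (none, age, profession, both); the number doubles with every added dimension, which is why the choice of what to precompute matters.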

OLAP: 3 Tier DSS

• Database Layer (Data Warehouse): store atomic data in an industry-standard data warehouse.

• Application Logic Layer (OLAP Engine): generate SQL execution plans in the OLAP engine to obtain OLAP functionality.

• Presentation Layer (Decision Support Client): obtain multi-dimensional reports from the DSS client.

Strengths of OLAP

• It is a powerful visualization tool

• It provides fast, interactive response times

• It is good for analyzing time series

• It can be useful for finding clusters and outliers

• Many vendors offer OLAP tools

OLAP and Executive Information Systems

• Andyne Computing -- Pablo

• Arbor Software -- Essbase

• Cognos -- PowerPlay

• Comshare -- Commander OLAP

• Holistic Systems -- Holos

• Information Advantage -- AXSYS, WebOLAP

• Informix -- Metacube

• MicroStrategy -- DSS/Agent

• Oracle -- Express

• Pilot -- LightShip

• Planning Sciences -- Gentium

• Platinum Technology -- Prodea Beacon, Forest & Trees

• SAS Institute -- SAS/EIS, OLAP++

• Speedware -- Media

Thank You for Your attention
