penn state student chapter of the association for computing machinery we welcome all interested...

Penn State Student Chapter of theAssociation for

Computing Machinery

We welcome all interested students to our 4th general meeting of the Spring 2005 semester!

When: Monday, April 11th, 2005 from 7-8 pm Where: Cybertorium (213 IST)

Agenda:• Brief overview of our ACM chapter• New officer introductions• Special topic presentation: No Pain, No Game

Presented by IST Professor Brian K. Smith• Co-op/Intern presentation: Working at IBM

Presented by Rick Osowski

Free refreshments will be provided

2

IST 210

Data Warehousing, Data Mining, and Advanced Applications

3

IST 210 Data Rich, but Information Poor

Data is stored, not explored : by its volume and complexity it represents a burden, not a support

Data overload results in uninformed decisions, contradictory information, higher overhead, wrong decisions, increased costs

Data is not designed and is not structured for successful management decision making

4

IST 210 Improving Decision Making

Data

Information

Decisions

Data Warehouse

5

IST 210 Data Warehouse Concepts

6

IST 210 What’s a Data Warehouse?

A data warehouse is a single, integrated source of decision support information formed by collecting data from multiple sources, internal to the organization as well as external, and transforming and summarising this information to enable improved decision making.

A data warehouse is designed for easy access by users to large amounts of information, and data access is typically supported by specialized analytical tools and applications.

7

IST 210

Data Warehouse Characteristics

Key Characteristics of a Data Warehouse

Subject-oriented Integrated Time-variant Non-volatile

8

IST 210 Subject Oriented• Example for an insurance company :

PolicyPolicyCustomerCustomer

Data

LossesLosses PremiumPremium

Commercial and Life

Insurance Systems

Commercial and Life

Insurance Systems

Auto and Fire Policy

Processing Systems

Auto and Fire Policy

Processing Systems

Data

Accounting System

Accounting System

Claims Processing

System

Claims Processing

SystemBilling System

Billing System

Applications Area Data Warehouse

9

IST 210 Integrated

• Data is stored once in a single integrated location(e.g. insurance company)

Data WarehouseDatabase

Subject = Customer

Auto PolicyProcessing

System

Auto PolicyProcessing

System

Customer data stored in severaldatabases

Fire PolicyProcessing

System

Fire PolicyProcessing

System

FACTS, LIFECommercial, Accounting

Applications

FACTS, LIFECommercial, Accounting

Applications

10

IST 210 Time - Variant

Data is tagged with some element of time - creation date, as of date, etc.

Data is available on-line for long periods of time for trend analysis and forecasting. For example, five or more years

Data Warehouse Data

Time Data

{Key

• Data is stored as a series of snapshots or views which record how it is collected across time.

11

IST 210 Non-Volatile

• Existing data in the warehouse is not overwritten or

updated. External Sources

• Read-Only

DataWarehouseDatabaseData

WarehouseEnvironment

Data Warehouse

Environment

ProductionDatabases

ProductionApplications

ProductionApplications

• Update• Insert• Delete

• Load

12

IST 210 Transaction System vs. Data Warehouse

13

IST 210

On-line, real time update into disparate systems

Day-to-day operations System Experts

UsersData Manipulation

Unix

VMS

MVS

Other

Transaction-Based Reporting System

14

IST 210

BENEFIT: Reduce data processing

costs

BENEFIT: Integrated, consistent data

available for analysis

BENEFIT: Improve Network Reporting processes and

analytical capabilities

Data Staging, Transformation and Cleansing

Data Staging, Transformation and Cleansing

Interfaces

Executive Reporting and On-Line Analysis

EnvironmentOther

VMS

MVS

Unix

Su

mm

arization

OLAP

DataWarehouse

Warehouse-Based Reporting System

15

IST 210

Transaction - Warehouse Process

TransformSummarize &

Refine

On-line, real time update.

“Transaction Based Process”

Day-to-day operations

Detailed Information to operational systems.

“Warehouse Based Process”

Decision support for management use.

Batch Load

16

IST 210

Supports management analysis and decision-

making processes Contains summarized, refined, and cleansed

information Non-volatile -- provides a data “snapshot”;

adjustments are not permitted, or are limited Business analysis requirements drive the data

structure and system design Integrated, consistent information on a single

technology platform Users have direct, fast access via On-line

Analytical Processing tools Minimal impact on operational processes

Data Warehouse

Supports day-to-day operational processes Contains raw, detailed data that has not been

refined or cleansed Volatile -- data changes from day-to-day, with

frequent updates Technical issues drive the data structure and

system design Disparate data structures, physical locations,

query types, etc. Users rely on technical analysts for reporting

needs Operational processes impacted by queries

run off of system

Transaction System

Transaction System vs. Data Warehouse

17

IST 210 Data Warehouse Architecture

18

IST 210 Data Warehouse Architecture

Conversion & Interface

OLAPCubes

Ad-hocReporting

CannedReports

Data MartsStaging AreaODS

Operational System Data Warehouse

19

IST 210

Map source data to target Data scrubbing Derive new data Data Extraction Transform / convert data Create / modify metadata

Conversion& Cleansing

Data Warehouse ArchitectureConversion and Cleansing Activities

20

IST 210

DetailedData

Metadata

Ranges from detailed to summarized data

Contains metadata Many views of the data Subject-Oriented Time-variant

SummaryData

Data Warehouse ArchitectureData Warehouse Components

21

IST 210

Requirements Gathering Process Business Measure Definition

Standard definition and related business rules and formulas

Source data element(s), including quality constraints

Data granularity levels (e.g., county detail for state)

Data retention (e.g., one month, one quarter, one year, multiple years)

Priority of the information (For example, is the information necessary to derive other business measures?)

Data load frequency (e.g., monthly, quarterly, etc.)

22

IST 210 Star Join Schema

Region_Dimension_Tableregion _id

NENWSESW

region _doc

NortheastNorthwestSoutheastSouthwest

account _id

100000110000120000130000140000

account _doc

ABC ElectronicsMidway ElectricVictor ComponentsWashburn, Inc.Zerox

Account_Dimension_Table

Product_Dimension_Tableprod_grp_id

102030

prod_id

100140220

prod_grp_desc

Fewer devicesCircuit boardsComponents

prod_desc

Power supplyMotherboardCo-processor

month

01-199602-199603-1996

mo_in_fiscal_yr

456

month_name

JanuaryFebruaryMarch

prod_id

100140220

region_id

SWNESW

account_id

100000110000100000

vend_id

100200300

net-sales

30,00023,00032,000

gross_sales

50,00042,00049,000

Monthly_Sales_Summary_Table

Time_Dimension_Table

Fact Table

Dimension Tables

Vendor_Dimension_Tablevend_id

100200300

vendor_desc

PowerAge, Inc.Advanced Micro DevicesFarad Incorporated

month

01-199602-199603-1996

23

IST 210 Multi-Dimensional Analysis

Zip Code

County

Region

State

ProductFamily

Client Type

Account

Store

ProductLine Brand

Category

GroupItem

Class of Trade

Net Sales by Brand byRegion by Client Type

Geography Dimension

Customer Dimension

Product DimensionProduct Dimension

Business Measure:Net Sales

DW0117

24

IST 210 Application Solution Classes

Executive information system (EIS) : Present information at the highest level of summarization using

corporate business measures. They are designed for extreme ease-of-use and, in many cases, only a mouse is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance

Decision Support Systems (DSS) : They ideally present information in graphical and tabular

form, providing the user with the ability to drill down on selected information. Note the increased detail and data manipulation options presented

25

IST 210

1

Data Mining

26

IST 210 Data Mining The process of extracting valid, previously

unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions, (Simoudis,1996).

Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

27

IST 210 Data Mining Reveals information that is hidden and

unexpected, as little value in finding patterns and relationships that are already intuitive.

Patterns and relationships are identified by examining the underlying rules and features in the data.

Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.

Relatively new technology, however already used in a number of industries.

28

IST 210 Examples of Applications of Data Mining

Retail / Marketing Identifying buying patterns of customers Finding associations among customer demographic

characteristics Predicting response to mailing campaigns Market basket analysis

Banking Detecting patterns of fraudulent credit card use Identifying loyal customers Predicting customers likely to change their credit card

affiliation Determining credit card spending by customer groups

29

IST 210 Examples of Applications of Data Mining

Insurance Claims analysis Predicting which customers will buy new policies

Medicine Characterizing patient behavior to predict surgery

visits Identifying successful medical therapies for different

illnesses

30

IST 210 Data Mining Operations and Associated Techniques

31

IST 210 Database Segmentation Aim is to partition a database into an unknown number

of segments, or clusters, of similar records. Uses unsupervised learning to discover homogeneous

sub-populations in a database to improve the accuracy of the profiles.

Less precise than other operations thus less sensitive to redundant and irrelevant features.

Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance or by assigning a weighting factor to each variable.

Applications of database segmentation include customer profiling, direct marketing, and cross selling.

32

IST 210 Scatterplot

33

IST 210 Visualization

34

IST 210

Data Mining and Data Warehousing

Major challenge to exploit data mining is identifying suitable data to mine.

Data mining requires single, separate, clean, integrated, and self-consistent source of data.

A data warehouse is well equipped for providing data for mining.

Data quality and consistency is a pre-requisite for mining to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.

35

IST 210 Data Mining and Data Warehousing It is advantageous to mine data from multiple sources

to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.

Selecting the relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse.

The results of a data mining study are useful if there is some way to further investigate the uncovered patterns. Data warehouses provide the capability to go back to the data source.

36

IST 210 Advanced Database Topics

37

IST 210 A Little History Prior to the 1980s hierarchical and network

databases. Hardware dumb terminals using private

networks Database centralized and stored on the disk

packs End user terminals simply input/output devices

Processing at the mainframe Data text data Networks had to handle text data No access from outside to the organization's

private network.

38

IST 210

Microcomputer enabled workstation processing power.

Satellite and network technology provided for very high speed, high traffic, and low cost long distance communications networks.

Internet in the late 1990s and the corresponding phenomenal growth in electronic commerce (E-commerce) necessitated public access to data in people's homes.

The volume of data needed to be transmitted increased greatly.

New Needs

39

IST 210

Business environment changed during the last two decades

Information stored at different locations, on different hardware and operating systems, with different commercial DBMS products, and with different underlying data models had to be combined

The centralized database was no longer feasible to handle these new demands

New Needs

40

IST 210

Distributed Database Scenario

There are many advantages to using a distributed database rather than a centralized database. They are:

Improved performance, because high traffic data are stored locally.

More efficient data management, because the DBA workload is shared.

Better network integrity, because the whole system does not stop if one computer goes down.

Expansion of the database is facilitated when the organization grows, since new data does not have to be centralized. It can remain and be administered in the original location.

Data for the whole organization can still be accessed from any location.

41

IST 210

Data administration is improved (??) In a distributed database system even a

simple task like creating a backup copy of the database can take a considerable amount of time.

If the database is divided among several locations the time and workload for this task can be shared.

Distributed Database

42

IST 210 Replication of Data System failure in one location should not stop

processing in other locations Replicate all or parts of the database in more than one

location. Database replication improves performance and

provides a fail-safe option, but it involves considerable complexity

Replication of frequently used data improves response time and reduces network traffic

If the data changes at one location it must be changed at all locations

43

IST 210

Distributed Systems in an Ideal World

C. J. Date established rules for the ideal distributed DBMS system

Rules are a goal that distributed systems strive toward, but have not yet reached

According to Date's rules: Each site is responsible for its own portion of the

distributed database, including security, backup, and recovery.

Each site has equal capabilities and does not rely on any other site.

The system should work regardless of the computer hardware, operating system, or network installed at any site.

44

IST 210

Date's Rules of Distributed Databases:

1. Local site independence2. Central site independence3. Failure independence4. Location transparency5. Fragmentation transparency6. Replication transparency7. Distributed query processing8. Distributed transaction processing9. Hardware independence10. Operating system independence11. Network independence12. Database independence

45

IST 210

Complexities of Distributed Databases

There also are many complications involved in the management of distributed database systems.

The distributed database must be carefully designed to insure the following:

Store data as close as possible to where it is used most often.

Make the location of the data transparent to the end user.

Make the system easy to expand. Optimize queries to improve response time in the

distributed environment.

46

IST 210 Database Design The designer must analyze the organization's

needs and business processes to determine the best way to distribute the database.

There are several possibilities for storing the data in more than one location:

Centralized master database Replication of the entire or part of the database in

several locations Horizontal partitions Vertical partitions Mixture of the above

47

IST 210 Fragmentation

Horizontal fragmentation of the database

means that rows of a table(s) may be stored in different locations

Similar to the separation of the customer table in the retailing example above.

Vertical fragmentation means that columns of a table ( i.e., attributes or groups of attributes of an entity) are stored in different locations.

48

IST 210 Query Formulation Distributed databases require a considerable

amount of network overhead Poorly formulated query it may cause

unnecessary data retrieval from the database Query optimization is ideally performed by the

distributed database management system

49

IST 210 OODB In traditional relational databases E-R Modeling and

normalization focuses on identifying entities, their attributes, and the relationships between entities

This works well for most organizational data, especially business data

The advent of the microcomputer and processing power on the desktop

Computer aided design, CAD, became the norm for engineering work, so it became necessary to store drawings

Powerful multimedia PCs with sound cards and color monitors enabled the manipulation of sound and video files

Many other applications were developed that required more than just text and numeric processing

50

IST 210 Why?? These new applications were facilitated by the

development of Object-Oriented Programming Still evolving development of object-oriented data

modeling, object-oriented databases, and object-oriented database management systems

OODBMS and O/R DBMS are two types of database management systems that are currently available

O/R DBMS uses the basic theory of relational database management systems with object-oriented features added

OODBMS is more object-oriented and was developed separately from the relational products

OODMBS suffers from a lack of standardization that is available with relational database systems

penn state student chapter of the association for computing machinery we welcome all interested...

Documents

data access

data warehouseist

data mining

data rich

consistent data available

nonvolatileexisting

contradictory information

information poordata