business intelligence session

Post on 10-Dec-2015

34 Views

Category:

Documents

4 Downloads

Preview:

Click to see full reader

DESCRIPTION

What is Business Intelligence and its uses as a Business Manager

TRANSCRIPT

Business Intelligence

What is Business Intelligence (BI)

• Business Intelligence (BI) refers to skills, processes, technologies, applications and practices used to support decision making.

• Systems that provide directed background data and reporting tools to support and improve the decision-making process.

• A popularized, umbrella term used to describe a set of concepts and methods to improve business decision making by using fact-based support systems. The term is sometimes used interchangeably with briefing books and executive information systems.

• Business Intelligence is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help clients make better business decisions.

• A system that collects, integrates, analyses and presents business information to support better business decision making.

• Business Intelligence is an environment in which business users receive information that is reliable, secure, consistent, understandable, easily manipulated and timely...facilitating more informed decision making

What is Business Intelligence (BI)

Improving organizations by providing business insights to all employees leading to better, faster, more relevant decisions

What is Business Intelligence (BI)

Business Intelligence enables the business to make intelligent, fact-based decision

Aggregate Data

Database, Data Mart, Data Warehouse, ETL Tools,

Integration Tools

Present Data

EnrichData

Inform a Decision

Reporting Tools, Dashboards, Static

Reports, Mobile Reporting, OLAP Cubes

Add Context to Create Information, Descriptive Statistics, Benchmarks, Variance to Plan or LY

Decisions are Fact-based and Data-driven

Content

The business determines the “what”, BI enables the “how”

Performance

Minimize report creation and collection times (near zero)

Usability

Delivery Method Push vs Pull

Medium Excel, PDF, Dashboard, Cube, Mobile Device

Enhance Digestion Fewer clicks

Tell a Story Trend, Context, Related Metrics, Multiple Views

CPU – Content, Performance, Usability

Why is Business Intelligence So Important?

Time

With Business Intelligence, we can get data to you in a timely manner.

Making Business Decisions is a Balance

Data Opinion

(aka Best Professional Judgment)

In the absence of data, business decisions are often made by the HiPPO.

Core Capabilities of BI

Why do companies need BI?

Tactical / Strategic BI

What’s the best that can happen?

What will happen next?

What if these trends continue?

Why is this happening?

What actions are needed?

Where exactly is the problem?

How many, how often, where?

What happened?

Sophistication of Intelligence

Operational BI

Optimization

Predictive Modeling

Forecasting/extrapolation

Statistical analysis

Alerts

Query/drill down

Ad hoc reports

Standard reportsCo

mp

etit

ive

Ad

van

tag

e

Benefits of Business Intelligence

• Improve Management Processes– planning, controlling, measuring and/or changing resulting in increased revenues and reduced costs

• Improve Operational Processes– fraud detection, order processing, purchasing.. resulting in increased revenues and reduced costs

• Predict the Future

BI Golden Rules

• Data Quality & Accuracy

• Data Consistency

• Data Timeliness

“Get the right information to the right people at the right time”

Gartner BI Maturity Model

Business Intelligence

Reporting

Business Intelligence

Data Mining

Knowledge Management

Expert Systems

Reporting ToolsSoftware that reads data

Processes data by:FilteringSortingGroupingSimple calculations

Produces formatted summaries of data

OLAPOn-Line Analytical Processing

Consolidation

Drill-Down

Slicing and Dicing

Drill Down

Slice and Dice

OLAP ExampleReport A

Report B

Data WarehouseLarge Database

Subject-Oriented

Integrated

Time-Variant

Nonvolatile

User-Friendly Interface

Data Warehouse System

Oper-ational DB

OtherDB

ExternalDB

DataWare-house

Reporting

Data Mining

KM

Expert

Data MartsMini data warehouses

Hold subsets of data from the data warehouse

Data focuses on a specific aspect of the company

Data Mining ToolsSoftware that searches through data

Uses complex statistical calculations

OutputsTrendsPatternsCorrelationsExceptions

Knowledge Management• Process

• Creating Value from Intellectual Property

• Sharing Knowledge with Others

Expert SystemsKnowledge captured as rules.

BI ChallengesPoor data quality

Context

User resistance

Database Management System (DMBS)

Database Management System (DMBS)

DBMS contains information about a particular enterprise• Collection of interrelated data• Set of programs to access the data • An environment that is both convenient and efficient to use

Database Applications:• Banking: all transactions• Airlines: reservations, schedules• Universities: registration, grades• Sales: customers, products, purchases• Online retailers: order tracking, customized recommendations• Manufacturing: production, inventory, orders, supply chain• Human resources: employee records, salaries, tax deductions

Databases touch all aspects of our lives

Purpose of Database Systems

In the early days, database applications were built directly on top of file systemsDrawbacks of using file systems to store data:

Data redundancy and inconsistencyMultiple file formats, duplication of information in different files

Difficulty in accessing data Need to write a new program to carry out each new task

Data isolation — multiple files and formatsIntegrity problems

Integrity constraints (e.g. account balance > 0) become “buried” in program code rather than being stated explicitlyHard to add new constraints or change existing ones

Purpose of Database Systems (contd.)

Drawbacks of using file systems (cont.) Atomicity of updates

Failures may leave database in an inconsistent state with partial updates carried outExample: Transfer of funds from one account to another should either complete or not happen at all

Concurrent access by multiple usersConcurrent accessed needed for performanceUncontrolled concurrent accesses can lead to inconsistencies

Example: Two people reading a balance and updating it at the same time

Security problemsHard to provide user access to some, but not all, data

Database systems offer solutions to all the above problems

View of DataView of DataAn architecture for a database system

View 1

Physicallevel

Logical level

View 2 View n…

View level

Levels of Abstraction

Physical level: describes how a record (e.g. customer) is stored.Logical level: describes data stored in database, and the relationships among the data.

type customer = record name: string; street: string; city: integer;end;

View level: application programs hide details of data types. Views can also hide information (e.g. salary) for security purposes.

Instances and SchemasSimilar to types and variables in programming languages

Schema – the logical structure of the database Example: The database consists of information about a set of customers and accounts and the relationship between them)Analogous to type information of a variable in a programPhysical schema: database design at the physical levelLogical schema: database design at the logical level

Instance – the actual content of the database at a particular point in time Analogous to the value of a variable

Physical Data Independence – the ability to modify the physical schema without changing the logical schema

Applications depend on the logical schemaIn general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.

Data IndependenceAbility to modify a schema definition in one level without affecting a schema definition in the other levels.The interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.Two levels of data independence

Physical data independenceLogical data independence

Data ModelsA collection of tools for describing:

DataData relationshipsData semanticsData constraints

Object-based logical modelsEntity-relationship modelObject-oriented modelSemantic modelFunctional model

Record-based logical modelsRelational model (e.g., SQL/DS, DB2)Network modelHierarchical model (e.g., IMS)

The Entity-Relationship Model

Models an enterprise as a collection of entities and relationships

Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects

Described by a set of attributesRelationship: an association among several entities

Represented diagrammatically by an entity-relationship diagram:

Relational Model

Example of tabular data in the relational model:

name ssn street city account-numberJohnson 192-83-7465 Alma Palo Alto A-101Smith 019-28-3746 North Rye A-215Johnson 192-83-7465 Alma Palo Alto A-201Jones 321-12-3123 Main Harrison A-217Smith 019-28-3746 North Rye A-201

account-number balanceA-101 500A-201 900A-215 700A-217 750

Data Definition Language (DDL)Specification notation for defining the database schemaDDL compiler generates a set of tables stored in a data dictionaryData dictionary contains metadata (data about data)Data storage and definition language – special type of DDL in which the storage structure and access methods used by the database system are specified

Data Manipulation Language (DML)

Language for accessing and manipulating the data organized by the appropriate data modelTwo classes of languages

Procedural – user specifies what data is required and how to get those dataNonprocedural – user specifies what data is required without specifying how to get those data

Transaction ManagementA transaction is a collection of operations that performs a single logical function in a database application.Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g. power failures and operating system crashes) and transaction failures.Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.

Storage ManagementStorage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.The storage manager is responsible to the following tasks:

Interaction with the file manager Efficient storing, retrieving and updating of data

Issues:Storage accessFile organizationIndexing and hashing

Database AdministratorCoordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs:

Database administrator’s duties include:Schema definitionStorage structure and access method definitionSchema and physical organization modificationGranting user authority to access the databaseSpecifying integrity constraintsActing as liaison with usersMonitoring performance and responding to changes in requirements

Database Users

Users are differentiated by the way they expect to interact with the system.Application programmers: interact with system through DML calls.Specialized users: write specialized database applications that do not fit into the traditional data processing framework Sophisticated users: form requests in a database query language.Naive users: invoke one of the permanent application programs that have been written previously

Overall System Structure

naïve users application sophisticated database (tellers, agents, etc) programmers users administrator

application interface

Application program

query database scheme

users

application program object code query evaluation

engine

Embedded DML precompiler

DML compiler

DDL interpreter

transaction manager

buffer manager

query processor

storage manager

database- management system

File manager

indices Statistical data

Data files Data dictionary

disk storage

SQL (Structured Query Language)

SQL: widely used non-procedural languageExample: Find the name of the customer with customer-id 192-83-7465

select customer.customer_namefrom customerwhere customer.customer_id = ‘192-83-7465’

Example: Find the balances of all accounts held by the customer with customer-id 192-83-7465

select account.balancefrom depositor, accountwhere depositor.customer_id = ‘192-83-7465’ and

depositor.account_number = account.account_number

Application programs generally access databases through one ofLanguage extensions to allow embedded SQLApplication program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database

Database Design

The process of designing the general structure of the database:

Logical Design – Deciding on the database schema. Database design requires that we find a “good” collection of relation schemas.

Business decision – What attributes should we record in the database?Computer Science decision – What relation schemas should we have and how should the attributes be distributed among the various relation schemas?

Physical Design – Deciding on the physical layout of the database

Data Warehousing

In the Beginning, life was simple…

But…

Our information needs…

Kept growing. (The Spider web)

SOURCE: William H. Inmon

Purpose

To explore and discuss the purpose and principles of

data warehousing.

So What Is a Data Warehouse?

A data warehouse is the data repository of an enterprise. It is generally used for research and decision support.

By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.

OLTP systems are usually designed independently of each other and it is difficult for them to share information.

Why Do We Need Data Warehouses?

Consolidation of information resourcesImproved query performanceSeparate research and decision support functions from the operational systemsFoundation for data mining, data visualization, advanced reporting and OLAP tools

What Is a Data Warehouse Used for?

Knowledge discoveryMaking consolidated reportsFinding relationships and correlationsData miningExamples

Banks identifying credit risksInsurance companies searching for fraudMedical research

How Do Data Warehouses Differ From Operational Systems?

Data warehouse Operational systemSubject oriented Transaction oriented

Large (hundreds of GB up to several TB)

Small (MB up to several GB)

Historic data Current data

De-normalized table structure (few tables, many columns per table)

Normalized table structure (many tables, few columns per table)

Batch updates Continuous updates

Usually very complex queries Simple to complex queries

Design Differences

Star Schema

Data WarehouseOperational System

ER Diagram

Supporting a Complete Solution

Operational System-Data Entry

Data Warehouse-Data Retrieval

What Is a Data Warehouse?In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information

For doing data validation, complex mining, analysis, prediction, …This is the data warehouse

To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database

For performance reasons, availability reasons, archival reasons, …

Master Data ManagementOne of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization

A cleaned, validated repository of what we know… which can be linked to by data sources… which may help with data cleaning

… and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)

There is an emerging field called master data management out the process of creating these

Data Warehouses, Data Marts, and Operational Data Stores

Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.

Building a Data Warehouse

AnalysisDesign

Import dataInstall front-end tools

Test and deploy

Data Warehouse Lifecycle

Stage 1: Analysis

Identify:Target Questions

Data needsTimeliness of data

Granularity

Create an enterprise-level data dictionaryDimensional analysisIdentify facts and dimensions

Analysis– Design– Import data– Install front-end tools– Test and deploy

Stage 2: Design

Star schemaData TransformationAggregates Pre-calculated ValuesHW/SW Architecture

– Analysis

Design– Import data– Install front-end tools– Test and deploy

Dimensional Modeling

Dimensional Modeling

Fact Table – The primary table in a dimensional model that is meant to contain measurements of the business.Dimension Table – One of a set of companion tables to a fact table. Most dimension tables contain many textual attributes that are the basis for constraining and grouping within data warehouse queries.

Stage 3: Import Data

Identify data sourcesExtract the needed data from existing systems to a data staging areaTransform and Clean the data

Resolve data type conflictsResolve naming and key conflictsRemove, correct, or flag bad dataConform Dimensions

Load the data into the warehouse

– Analysis– Design

Import data– Install front-end tools– Test and deploy

Importing Data Into the Warehouse

OLTP 1

OLTP 2

OLTP 3

Data Staging Area DataWarehouse

Operational Systems(source systems)

Stage 4: Install Front-end Tools

Reporting toolsData mining toolsGISEtc.

– Analysis– Design– Import data

Install front-end tools– Test and deploy

Stage 5: Test and Deploy

Usability testsSoftware installationUser trainingPerformance tweaking based on usage

– Analysis– Design– Import data– Install front-end tools

Test and deploy

On-Line Analytical Processing (OLAP)

Making Decision Support Possible

69

Strengths of OLAP

• It is a powerful visualization paradigm

• It provides fast, interactive response times

• It is good for analyzing time series

• It can be useful to find some clusters and outliers

• Many vendors offer OLAP tools

70

Month1 2 3 4 76 5

Pro

du

ct

Toothpaste

JuiceColaMilk

Cream

Soap

Regio

n

WS

N

Dimensions: Product, Region, TimeHierarchical summarization paths

Product Region TimeIndustry Country Year

Category Region Quarter

Product City Month Week

Office Day

Multi-dimensional Data

• “Hey…I sold $100M worth of goods”

71

Data Cube Lattice• Cube lattice

• ABC AB AC BC A B C none

• Can materialize some groupbys, compute others on demand• Question: which groupbys to materialze?• Question: what indices to create• Question: how to organize data (chunks, etc)

72

Visualizing Neighbors is simpler

1 2 3 4 5 6 7 8AprMayJunJulAugSepOctNovDecJanFebMar

Month Store SalesApr 1Apr 2Apr 3Apr 4Apr 5Apr 6Apr 7Apr 8May 1May 2May 3May 4May 5May 6May 7May 8Jun 1Jun 2

73

A Visual Operation: Pivot (Rotate)

10

47

30

12

Juice

Cola

Milk

Cream

NYLA

SF

3/1 3/2 3/3 3/4

Date

Month

Reg

ion

Product

74

“Slicing and Dicing”

Product

Sales Channel

Regio

ns

Retail Direct Special

Household

Telecomm

Video

Audio IndiaFar East

Europe

The Telecomm Slice

75

Roll-up and Drill Down

• Sales Channel• Region• Country• State • Location Address• Sales Representative

Roll

Up

Higher Level ofAggregation

Low-levelDetails

Drill-D

ow

n

76

Nature of OLAP Analysis• Aggregation -- (total sales,

percent-to-total)• Comparison -- Budget vs. Expenses• Ranking -- Top 10, quartile analysis• Access to detailed and aggregate

data• Complex criteria specification• Visualization

77

Organizationally Structured Data

• Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

marketing

manufacturing

sales

finance

78

Multidimensional Spreadsheets

• Analysts need spreadsheets that support

• pivot tables (cross-tabs)• drill-down and roll-up• slice and dice• sort• selections• derived attributes

• Popular in retail domain

79

OLAP - Data Cube

• Idea: analysts need to group data in many different ways• eg. Sales(region, product, prodtype, prodstyle, date, saleamount)• saleamount is a measure attribute, rest are dimension attributes• groupby every subset of the other attributes

• materialize (precompute and store) groupbys to give online response• Also: hierarchies on attributes: date -> weekday,

date -> month -> quarter -> year

80

Relational OLAP: 3 Tier DSS

Data Warehouse ROLAP Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in industry standard RDBMS.

Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.

Obtain multi-dimensional reports from the DSS Client.

81

MD-OLAP: 2 Tier DSSMDDB Engine MDDB Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.

Obtain multi-dimensional reports from the DSS Client.

OLAPNeed for More Intensive Decision Support4 Main Characteristics

Multidimensional data analysisAdvanced Database SupportEasy-to-use end-user interfacesSupport Client/Server architecture

Multidimensional Data Analysis Techniques

Advanced Data Presentation Functions3-D graphics, Pivot Tables, Crosstabs, etc.Compatible with Spreadsheets & Statistical packagesAdvanced data aggregations, consolidation and classification across time dimensionsAdvanced computational functionsAdvanced data modeling functions

Advanced Database Support

Advanced Data Access FeaturesAccess to many kinds of DBMS’s, flat files, and internal and external data sourcesAccess to aggregated data warehouse dataAdvanced data navigation (drill-downs and roll-ups)Ability to map end-user requests to the appropriate data sourceSupport for Very Large Databases

Easy-to-Use End-User Interface

Graphical User InterfacesMuch more useful if access is kept simple

Client/Server ArchitectureFramework for the new systems to be designed, developed and implementedDivide the OLAP system into several components that define its architecture

Same ComputerDistributed among several computer

OLAP Architecture3 Main Modules

GUI (Graphical User Interface)Analytical Processing LogicData-processing Logic

OLAP Client/Server Architecture

Relational OLAPRelational Online Analytical Processing

OLAP functionality using relational database and familiar query tools to store and analyze multidimensional data

Multidimensional data schema supportData access language & query performance for multidimensional dataSupport for Very Large Databases

Multidimensional Data Schema Support

Decision Support Data tends to beNonnormalizedDuplicatedPreaggregated

Star SchemaSpecial Design technique for multidimensional data representationsOptimize data query operations instead of data update operations

Star SchemasData Modeling Technique to map multidimensional decision support data into a relational databaseCurrent Relational modeling techniques do not serve the needs of advanced data requirements

Data Warehouse Architecture

At the top – a centralized database

Generally configured for queries and appends – not transactionsMany indices, materialized views, etc.

Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools

Data Warehouse

ETL ETL ETL ETL

RDBMS1 RDBMS2

HTML1 XML1

ETL pipelineoutputs

ETL

ETL ToolsETL tools are the equivalent of schema mappings in virtual integration, but are more powerfulArbitrary pieces of code to take data from a source, convert it into data for the warehouse:

import filters – read and convert from data sourcesdata transformations – join, aggregate, filter, convert datade-duplication – finds multiple records referring to the same entity, merges themprofiling – builds tables, histograms, etc. to summarize dataquality management – test against master values, known business rules, constraints, etc.

Example ETL Tool Chain

• This is an example for e-commerce loading

• Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load

Invoice line items

SplitDate-time

Filterinvalid

JoinFilterinvalid

Invalid dates/ times

Invaliditems

Itemrecords

Filternon -

match

Invalidcustomers

Group by customer

Customerbalance

Customerrecords

An Introduction to Text Mining

Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.

— Gareth Herschel, Research Director, Gartner Group

Predictive Analytics: Defined

Predictive analysis:•Leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data•It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities•It encourage actions that demonstrably change how people behave as your customers, employees, patients, students, and citizensBottom line: it turns data into effective actions that positively impact your bottom line

Unstructured Data Management

Text Mining is a subset of Unstructured Data Management.

UDM can be broken down into:Content and Document ManagementSearch and Retrieval XML database and toolsCategorization, Classification, and Visualization

Data Warehouse vs. Document Warehouse

Data warehouseWho, what, when, where, how muchInternally focusedOperational information Rarely include external information

Document warehouseWhyMay not be internally focusedMay contain a range of informationOften integrate external information

Document Warehouse Features

There is no single document structure or document typeDocuments are drawn from multiple sourcesEssential features of documents are automatically extracted and explicitly stored in the document warehouseDocument warehouses are designed to integrate semantically related documents

100

Building the Document Warehouse

IdentifySources

RetrieveDocument

TextAnalysis

Pre-processDocument

CompileMetadata

101

Predict, Impact, Deploy

Customer

Data

Attitudes

Actions

Attributes

Business User

Grow

Retain

Fraud

Outcomes

Attract

Data Collection

Text

Surveys

WebChannel

OperationalSystems

Text

Bu

sin

ess

UI

Expert UIExpert UI

Concepts

Concept Maps

Clustering

Categoriza-tion

Trending

Information Extraction

Prediction

NLP

The Building Blocks of Language

MorphologySyntaxSemanticsPhonologyPragmatics

MorphologyUnderstanding words

StemsAffixes

PrefixSuffix

Inflectional elements

Reducing complexity of

analysis

Reduces complexity of

representation

Supports text mining

Noun

PrefixNoun Stem

Suffix

- abledisputein -

SyntaxThe Bank of Canada will curb inflation with higher interest rates

Prepositional phrase

Adjective

Sentence

Noun phrase Verb phrase

NounVerbAux

Noun phrase

NounAdjective

Noun

The Bank ofCanada

inflationcurbwill

Interest rateshigher

with

SemanticsThe meaning of it allApproaches to meaning

Semantic networksDeductive logicRule-based systems

Useful for classification

Problems with NLPLimitations of Natural Language Processing

Correctly identifying the role of noun phrasesRepresenting abstract conceptsClassifying synonymsRepresenting the number of concepts

Problems with NLPLimitations of technology

Language specific designs are requiredClassification speedClassifying hybrid words and sentences

Underlying Technology is Based on Linguistics

The Linguistic Approach:Does not treat a document as a bag of words

Removes ambiguity by extracting structured concepts

Concepts are the DNA of text.

Text is unstructured, ambiguous, and language dependent.

From Text to Concepts

Morphology

Syntax

Semantics StatisticsLinguistic

Terminology

Extractor

ScalableAccurate

Customizable Discovery-Oriented

•Compound words•Proper nouns•Figures•Named entities•Domain specifics

•Speed•Multiple formats•Multiple languages

•SPSS dictionaries•User dictionaries•Extraction rules•Extraction patterns

•Known terms•Unknown terms•New terms

•Trends

“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”- The Gartner group.

Data Mining

Why data mining?Data Mining software generally employs modeling algorithms

designed to handle non-linearities and unusual patterns in dataAs opposed to classical linear models (e.g., linear regression) that aren’t as

capable

A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output

Use the cross industry standard process for data mining (CRISP-DM)

Based on real-world lessons:Focus on business issues

User-centric & interactiveFull processResults are used

A Data Mining Methodology

Data Mining is not…Keep in mind that data mining is not…

“Blind” application of analysis/modeling algorithmsBrute-force crunching of bulk dataBlack box technologyMagic

114

Back to the Process

Text Mining

UnderstandingBusiness Understanding

Determine objectiveAssess situationDetermine data mining goalsProduce project plan

Data UnderstandingCollect initial dataDescribe dataExplore dataVerify data quality

Data PreparationData

Data setData set descriptionSelect dataClean dataConstruct data set / Integrate dataFormat data

TextConcept extractionConcept combinationConcept assessment

ModelingSelect modeling technique

Universe of techniquesAppropriate techniques

DataText

RequirementsConstraintsSelected tools

Generate test designRun model(s)Assess model(s)

EvaluationResults = Models + Findings

Evaluate resultsReview processDetermine next steps

DeploymentPlan deploymentPlan monitoring and maintenanceFinal reportProject review

Unsupervised methods:Group patients by drugs and demographic information and try to find

unusual patients

Supervised methods:Attempt to predict amount due and find sets of cases where the amount

due is very different from the predicted amount

Data Mining Approaches

Copyright 2003-4, SPSS Inc. 121

What Does Data Mining Do?• Data mining uses existing data to:

• Predict• Category membership• Numeric Value• Ie. Credit risk

• Group• Cluster (group) things together based

on their characteristics• Ie. Different types of TV viewers

• Associate• Find events that occur together, or in

a sequence• Ie. Beer and diapers

• Find outliers• Identify cases that don’t follow

expected behavior• Ie. Fraudulent behaviour

Data Mining vs. Text MiningData Mining

• Process directly• Identify causal

relationship• Structured numeric

transaction data residing in rational data warehouse

• Text Mining• Linguistic processing or

natural language processing (NLP)

• Discover heretofore unknown information[2]

• Applications deal with much more diverse and eclectic collections of systems and formats[4]

ConfusionIs text mining the same as information extraction? No!Information Extraction (IE)

Extract facts about pre-specified entities, events or relationships from unrestricted text sources.No novelty: only information is already present is extracted.

Two Foundations

Information Retrieval (IR)

Artificial Intelligence (AI)

Information RetrievalThe science of searching for

• Information in documents• Documents themselves• Metadata which describe documents• Text, sound, images or data, within database:

relational stand-alone database or hypertext networked databases such as the Internet or intranets.

Information RetrievalGerard Salton

Functional overview of IR

Artificial IntelligenceArtificial intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.

Natural Language ProcessingNatural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties : e.g. text segmentation, word sense disambiguation

Data Warehousing, OLAP and data mining: what and why?

130

Data Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Data Warehouse for Decision Support & OLAP

Putting Information technology to help the knowledge worker make faster and better decisions

Which of my customers are most likely to go to the competition?

What product promotions have the biggest impact on revenue?

How did the share price of software companies correlate with profits over last 10 years?

Decision SupportUsed to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

We want to know ...Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

If I raise the price of my product by Rs. 2, what is the effect on my ROI?

If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

What makes data mining possible?

Advances in the following areas are making data mining deployable:• data warehousing • better and more data (i.e., operational, behavioral, and

demographic) • the emergence of easily deployed data mining tools and • the advent of new data mining techniques.

Why Separate Data Warehouse?

Missing data: Decision support requires historical data, which op dbs do not typically maintain.

Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.

Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

top related