unit 6 · 2016-05-10 · is less time consuming and more machine driven. score carding, dash...

Business intelligence and Data Mining Unit 6


Post on 31-Mar-2020



Business intelligence and Data Mining

Unit 6


Objectives

Upon completion of this unit you will be able to:

define business intelligence

discuss the importance of data mining

identify the drivers for business intelligence initiatives in modern organizations

recognize the structure, components, and process of BI


Business Intelligence (BI)

Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.

More generally, according to Al-Azmi (2013), business intelligence refers to the skills, processes, technologies, applications, and practices used to leverage an institution's internal and external information assets to support and improve decision making.

Business Intelligence solutions can be divided into two groups of analysis types:

Query-Reporting-Analysis: This type of analysis is often query based and is normally used for determining “What happened?” in a business over a given period of time. Because queries are used, the user already knows what kind of information to search for. Additionally, Business Intelligence solutions of this kind are generally operated manually and are therefore time consuming.

Intelligent Analysis (Data Mining): While Query-Reporting-Analysis provides answers to questions of the “What happened?” kind, Data Mining uses algorithms for a much deeper analysis of data. BI solutions using Data Mining techniques are then capable of handling “What will happen?” and “How/why did this happen?” questions. All this is done in a semi- or fully automatic process, saving both time and resources.


Business Intelligence (BI) Cont...

BI applications include the activities of decision support systems,

query and reporting,

online analytical processing (OLAP),

statistical analysis,

forecasting and data mining.

Business intelligence applications can be:

Mission-critical and integral to an enterprise's operations or occasional to meet a special requirement

Enterprise-wide or local to one division, department

Centrally initiated or driven by user demand


Business Intelligence (BI) Cont...

The term BI was used as early as September 1996, when a Gartner Group report said:

“By 2000, Information Democracy will emerge in forward-thinking enterprises, with Business Intelligence information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users wade through a sea of data to synthesize valuable information from it - today these tools collectively fall into a category called ‘Business Intelligence’.”

• The terms business intelligence and business analytics are often used interchangeably. However, there are some key differences.


Business Intelligence (BI) Cont...

Business Intelligence answers questions like:

What happened?

When?

Who?

How many?

Business Analytics answers questions like:

Why did it happen?

Will it happen again?

What will happen if we change x?

What else does the data tell us that we never thought to ask?


Business Intelligence (BI) Cont...

BI includes:

Reporting (KPIs, metrics)

Automated Monitoring/Alerting (thresholds)

Dashboards

Scorecards

OLAP (Cubes, Slice & Dice, Drilling)

Ad hoc query

BA includes:

Statistical/Quantitative Analysis

Data Mining

Predictive Modelling

Multivariate Testing


Business Intelligence (BI) Cont...

Why are institutions concerned about Business Intelligence?

The purpose of business intelligence is to support the mission and goals of the institution by enabling fact-based decision making.

An effective business intelligence solution can be used to:

provide insight and measurement regarding strategic and tactical efforts

provide the ability to see the big picture and to find the needle in the haystack

support fact-based decision making

provide rapid feedback regarding actions

validate or discredit assumptions

discover non-intuitive relationships


Business Intelligence (BI) Cont...

Business Intelligence Techniques (1)

A wide variety of BI solutions and techniques is available:

AQL (Associative Query Logic): An analytical data processing tool that, compared to OLAP, is less time consuming and more machine driven.

Scorecarding, Dashboarding and Information Visualization: Scorecarding is a method that allows managers to get a broad view of the performance of a business, while Dashboarding/Information Visualization deals with the visual representation of abstract data.

Business Performance Management: A tool for analyzing the current state of a business and for improving future strategies.

DM (Data mining): Numerous methods for automatically searching large amounts of data for patterns and other interesting relations.

Data warehouses - Logical collections of information with structures that favour efficient data analysis (such as OLAP).

DSS (Decision Support Systems): Machine driven system that aids the decision making process in a business.

Document warehouses: Instead of informing the business what things have happened (like the data warehouse does) the document warehouse is able to state why things have happened.


Business Intelligence (BI) Cont... Business Intelligence Techniques (2)

EIS (Executive Information Systems): These systems are often considered as a specialized form of DSS with the purpose of facilitating the information and decision making needs of senior executives.

MIS (Management Information Systems): A machine driven system for processing data and providing analysis reports for decision making and planning. In order to retrieve data the system has access to all communication channels in a business.

GIS (Geographic Information Systems): A computer system for working with geographical data (e.g. satellite images) with editing, analyzing and displaying functionality.

OLAP (Online Analytical Processing): A tool for quick analytical processing of multidimensional data by running queries against structured OLAP cubes that are built from a set of data sources.

Text mining: The process of extracting interesting and nontrivial information/knowledge from unstructured text.
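
The OLAP item above can be made concrete with a small sketch. It is not how a real OLAP server is implemented; it only imitates the "slice" and "roll-up" operations on a tiny in-memory fact table. The table contents, the dimension names and the helper names roll_up and slice_cube are all assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales amount).
FACTS = [
    ("North", "Bikes", "Q1", 120.0),
    ("North", "Bikes", "Q2", 150.0),
    ("North", "Helmets", "Q1", 30.0),
    ("South", "Bikes", "Q1", 90.0),
    ("South", "Helmets", "Q2", 45.0),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum the sales measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[3]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension is fixed to a single value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in facts if row[position] == value]


# Total sales per region (roll-up over product and quarter).
by_region = roll_up(FACTS, ["region"])
# Sales per product for the first quarter only (slice, then roll-up).
q1_by_product = roll_up(slice_cube(FACTS, "quarter", "Q1"), ["product"])

print(by_region)
print(q1_by_product)
```

A production cube would precompute and index such aggregates; the sketch recomputes them on every call to keep the idea visible.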


Data Mining What is it?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.

Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

A major goal of data mining is to discover previously unknown relationships among the data.


Data Mining Cont..

Data mining parameters include:

Association - looking for patterns where one event is connected to another event

Sequence or path analysis - looking for patterns where one event leads to another later event

Classification - looking for new patterns (may result in a change in the way the data is organized, but that's ok)

Clustering - finding and visually documenting groups of facts not previously known

Forecasting - discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics)

• Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behaviour.


Data Mining Cont..

Organizations are using data mining to help manage all phases of the customer life cycle, including acquiring new customers, increasing revenue from existing customers, and retaining good customers.

By determining characteristics of good customers (profiling), a company can target prospects with similar characteristics.

By profiling customers who have bought a particular product, a company can focus attention on similar customers who have not bought that product (cross-selling).

By profiling customers who have left, a company can act to retain customers who are at risk of leaving (reducing churn or attrition), because it is usually far less expensive to retain a customer than to acquire a new one.


Data Mining Cont..

Successful data mining

There are two keys to success in data mining.

The first is coming up with a precise formulation of the problem you are trying to solve. A focused statement usually results in the best payoff.

The second key is using the right data. After choosing from the data available to you, or perhaps buying external data, you may need to transform and combine it in significant ways.


Data Mining Cont..

How does data mining work

1. Classification (1)

Classification is arguably the most popular Data Mining task, considering its broad application domain.

Its main purpose is to classify one or more data samples that may consist of few or many features (dimensions). The latter case makes the classification task more complex due to the large number of dimensions.

The actual number of classes is not always given or obvious in a classification task. Therefore, it is possible to distinguish between supervised and unsupervised classification.

For supervised classification the number of classes is known along with the properties of each class. Neither of these is given in unsupervised classification which makes this task the more challenging one of the two.


Data Mining Cont..

1. Classification (2)

The list below further exemplifies the use of the classification

task.

Is a given credit card transaction fraudulent?

What type of subscription should be offered to a given customer?

What type of structure does a specific protein have?

Is this customer likely to buy a bicycle?

Why is my system failing?


Data Mining Cont..

2. Estimation

Estimation is somewhat similar to classification, algorithm-wise. However, estimation does not deal with determining a class for a particular data sample. Instead, it tries to predict a certain measure for a given data sample.

The list below further exemplifies the use of the estimation task.

What is the turnover of a company going to be?

What is the density of a given fluid?

When will a pregnant woman give birth?

For how long will this product work before failing?

How much is a specific project going to cost?
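
To illustrate what "predicting a certain measure" can look like, the sketch below fits a straight line by ordinary least squares and uses it to estimate a project cost. Least squares is only one possible estimation method and is chosen here for brevity; the data (team size against cost) and the function names fit_line and estimate are invented for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x and the co-variation of x and y around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate(model, x):
    """Predict the measure for a new sample from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: team size -> project cost (in thousands).
team_sizes = [1, 2, 3, 4]
costs = [12.0, 22.0, 32.0, 42.0]

model = fit_line(team_sizes, costs)
print(model)               # fitted slope and intercept
print(estimate(model, 5))  # estimated cost for a team of five
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of classes.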


Data Mining Cont..

3. Segmentation

Segmentation basically deals with the task of grouping a given data set into a few main groups (clusters).

The task of describing a large multidimensional data set (say customers) will therefore benefit from the use of segmentation. Moreover, many algorithm types can be used in segmentation systems.

The list below further exemplifies the use of the segmentation task.

How can a given buyer/supplier group be differentiated?

Which types of ground does a given satellite image contain?

Is a specific transaction an outlier?

Which segments is a market based on?

Which groups of visitors are using a given search engine?
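
As a rough illustration of segmentation, the sketch below groups one-dimensional values with a basic k-means loop. The slides do not name a specific algorithm, so k-means is an assumption, as are the customer spend figures, the starting centres and the function name k_means_1d.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest centre, then move each centre to its group mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign the value to the closest current centre.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Recompute each centre; keep the old one if its cluster is empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 50, 52, 49]
centers, clusters = k_means_1d(spend, centers=[10, 50])
print(centers)
print(clusters)
```

With real multidimensional customer data the distance would be computed over several features, and the number of clusters and the starting centres would need more care.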


Data Mining Cont..

4. Forecasting

Forecasting is another important Data Mining task that is used for predicting future data values given a time series of prior data.

Forecasting is a popular task often performed using simple statistical methods.

However, forecasting done in the Data Mining domain uses advanced (learning) methods (e.g. Neural Networks, Hidden Markov Models) that in many cases are more accurate and informative than the standard statistical methods (e.g. moving averages).

The list below further exemplifies the use of the forecasting task.

What will the weather be like tomorrow?

Will a particular stock price rise over the next couple of days?

What are the inventory levels next month?

How many sunspots will occur next year?

How will the average temperature on earth evolve throughout the next 10 years?
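
The moving average mentioned above as a standard statistical method can be sketched in a few lines. This is the simple baseline, not one of the learning methods (neural networks, Hidden Markov Models); the inventory figures and the function name moving_average_forecast are assumptions for illustration.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical month-end inventory levels, oldest first.
inventory = [100, 104, 98, 102, 106, 110]

next_month = moving_average_forecast(inventory, window=3)
print(next_month)
```

A larger window smooths out noise but reacts more slowly to a genuine trend, which is one reason learning methods can outperform this baseline.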


Data Mining Cont..

5. Association

Association deals with the task of locating events that frequently occur together, so that a business can benefit from this knowledge.

One of the most popular examples of association is probably Amazon.com's web shop, which is able to recommend related products to customers.

The list below further exemplifies the use of the association task.

Which products should I recommend to my customers?

Which services are used together?

Which products are highly likely to be purchased together in a supermarket?

Which books are highly likely to be borrowed together in a library?

Which dishes from a cookbook go well together?
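
A minimal way to look for items that occur together is to count item pairs across baskets and compute two common measures: support (the share of baskets containing the pair) and confidence (how often the second item appears given the first). The baskets and the function names pair_support and confidence below are invented for illustration; real recommenders use far more elaborate methods.

```python
from collections import Counter
from itertools import combinations

# Hypothetical supermarket baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


support = pair_support(BASKETS)
print(support[("bread", "butter")])
print(confidence(BASKETS, "bread", "butter"))
```

Pairs with high support and high confidence are candidates for "bought together" recommendations.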


Data Mining Cont..

Classes:

Stored data is used to locate data in predetermined groups.

For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.

This information could be used to increase traffic by having daily specials.

Clusters:

Data items are grouped according to logical relationships or consumer preferences.

For example, data can be mined to identify market segments or consumer affinities

Sequential patterns:

Data is mined to anticipate behaviour patterns and trends.

For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.


Data Mining Cont..

6. Text Analysis

Another key Data Mining task is text analysis. Text analysis has several purposes and is often used for finding key terms and phrases in text bits.

The text analysis can convert unstructured text into useful structured data that can be further processed by other Data Mining tasks (e.g. classification, segmentation, association).

The list below further exemplifies the use of the text analysis task.

Which segments does a given mailbox contain?

How is a document classified?

Which subjects does a specific web page contain?

How is a quick overview of multiple lecture notes from a classmate gained?

Which terms are likely to occur together?
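
Finding key terms, the simplest form of text analysis described above, can be sketched as counting words after dropping very common ones. The stop-word list, the sample sentence and the function name key_terms are assumptions for this example; the result is structured data (term, count) that other tasks could consume.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, assumed for this example only.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def key_terms(text, top_n=3):
    """Return the `top_n` most frequent non-stop-words as (term, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "Data mining turns raw data into patterns. Mining needs clean data."
print(key_terms(sample, top_n=2))
```

Practical systems add stemming, phrase detection and weighting schemes on top of raw counts.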


Data Mining Cont..

Data types

In order to implement and use the different Data Mining tasks, knowledge about their input (data) is a necessity.

Qualitative (also categorical, nominal, modal)

Variables explicitly describe various properties of an object, e.g. hair colour Rv = (blonde, brown, brunette, red, etc.) or Boolean values Rv = (0, 1).

Note that qualitative variables have no intrinsic ordering, i.e. there is no agreed way of ordering hair colours from highest to lowest.


Data Mining Cont..

Quantitative (also real, numeric, continuous)

Variables may describe properties such as age, profit, or temperature. An m-dimensional data set in which Rv = R holds for each dimension has an m-dimensional real object space U = R^m.

Set valued

Variables have multiple attributes, e.g. variables for describing movies where each variable has 3 attributes (movie title, leading actors and year).


Data Mining Cont..

Ordinal

Variables are similar to qualitative variables, but with a clear ordering of the values, e.g. the fuel level in a car given by 3 values Rv = (low, medium, high). Various quantitative variables are also regarded as ordinal.

Cyclic

Variables are of a periodic nature, e.g. the 60 minutes in an hour, for which Rv = Z/(60Z). For cyclic variables, standard mathematical operations such as addition and subtraction are not directly applicable; certain periodicity precautions are necessary when processing such variables.

Seasonal

A characteristic of a time series in which the data experiences regular and predictable changes that recur seasonally, e.g. the months of a calendar year, for which Rv = Z/(12Z). Any predictable change or pattern in a time series that recurs or repeats over a one-year period can be said to be seasonal.

Note that seasonal effects are different from cyclical effects, as seasonal cycles are contained within one calendar year, while cyclical effects (such as boosted sales due to low unemployment rates) can span time periods shorter or longer than one calendar year.
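
The periodicity precautions needed for cyclic variables amount to doing arithmetic modulo the period. The sketch below shows addition and distance on Z/(60Z) for minutes and reuses the same helpers with period 12 for months. The function names add_cyclic and cyclic_distance, and the convention of numbering months from 0, are assumptions for illustration.

```python
def add_cyclic(value, delta, period=60):
    """Add `delta` to a cyclic value, wrapping around at `period`."""
    return (value + delta) % period


def cyclic_distance(a, b, period=60):
    """Shortest distance between two cyclic values, going either way round."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


# Minute 55 plus 10 minutes wraps past the hour instead of giving 65.
print(add_cyclic(55, 10))
# Minutes 58 and 3 are 5 apart on the clock face, not 55.
print(cyclic_distance(58, 3))
# Months numbered 0..11: December (11) and January (0) are neighbours.
print(cyclic_distance(11, 0, period=12))
```

Feeding raw minute or month numbers into a distance-based method without this wrap-around would treat 58 and 3 as far apart.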


Data Mining Cont..

It is also important to know the difference between structured data sets and unstructured ones.

Examples of both structured and unstructured data:

Structured Data

Dimensional Data

Buyer demographics

Supplier demographics

Product properties

Transactional Data

Order headers with buyer, supplier, ship-to address, etc.

Order lines with product id, unit price, number of items, etc.

Credit card purchases

Unstructured Data

Textual Data

Descriptions of order lines

Buyer comments

A doctor's notes


Data Mining Cont..

Different levels of analysis are available:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.

Nearest neighbour method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
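
The nearest neighbour method listed above can be sketched as follows. Two choices here are assumptions rather than part of the definition: similarity is measured with Euclidean distance, and the classes of the k neighbours are combined by majority vote. The historical records and the function name knn_classify are invented for the example.

```python
import math
from collections import Counter

# Hypothetical historical dataset: (features, class label).
HISTORY = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 8.5), "high"),
    ((8.5, 9.0), "high"),
]


def knn_classify(history, query, k=3):
    """Label `query` with the majority class of its k closest historical records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort records by Euclidean distance to the query and keep the k nearest.
    neighbours = sorted(history, key=lambda record: math.dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_classify(HISTORY, (2.0, 2.0)))
print(knn_classify(HISTORY, (8.0, 9.0)))
```

Choosing an odd k for two classes avoids tied votes; features on very different scales would normally be normalised before computing distances.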


Thank you