Advanced Data Analytics: An Introduction
Data Mining in Advanced Analytics
Dr Paul Kennedy
[email protected]
Centre for Quantum Computation & Intelligent Systems
School of Software, Faculty of Engineering & IT
Friday, 5 July 2013
Outline
• What is Data Analytics (DA)?
• Motivation for DA
• Main approaches
• DA professionals
• Links to other topics
• Overview of techniques
Paul Kennedy - [email protected]
• Data Analytics is the analysis of large databases to find novel, commercially valuable and exploitable patterns.
• Aim: discover meaningful insights and knowledge from data.
• Discoveries expressed as models.
• Data mining = process of building models.
Models
• A model
• Captures the essence of the discovered knowledge.
• Can assist in understanding the world.
• Can be used to make predictions.
Where applied?
• Who by?
• Business, government, financial services, biology, medicine, risk and intelligence, science and engineering.
• Data collected about
• Businesses, customers, human resources, products, manufacturing processes, suppliers, business partners, local and international markets & competitors.
• Why?
• To better support managers, find fraudulent behaviour, understand scientific processes and find opportunities.
Collecting Data
• We have always collected, checked and organised data.
• 5500 years ago Sumerians marked tax records onto dried mud tablets.
• Scientists have looked through microscopes and telescopes and drawn what they saw.
• Market researchers ran surveys or had TV diaries
• Medical laboratories take dozens of measurements per patient
Data
• Analysing
• Since then, people have sought ways to use the recorded information to improve their lives (financially, health, ...)
• Understanding
• People can make sense of data in these amounts.
• But nowadays, there is a data explosion.
Data explosion
• Most data now go straight to computers without humans seeing them.
• Tax records are submitted electronically.
• Telescopes are operated remotely and digital images go to computer files.
• Market and POS data go to data warehouses.
• High-throughput technologies make simultaneous measurements of 1000s of genes per patient.
• This deluge of data is useless to unaided people!
Big Data ...
[Figure: cover page of “Demystifying Big Data: A Practical Guide To Transforming The Business of Government”, TechAmerica Foundation: Federal Big Data Commission]
• Huge global interest currently.
• In 2012 the Obama administration announced $200m for Big Data R&D in the US.
• The TechAmerica Foundation released a report describing the “transformational” power of Big Data, with recommendations for training the huge number of data scientists & analysts urgently needed.
Source: http://www.techamericafoundation.org/bigdata
Is it really an “explosion”?
• 2011: 1.8 zettabytes of information created globally and expected to double each year
• = 200 billion 2-hour HD movies that one person could watch for 47 million years straight!
• From sensors, satellites, social media, mobile comms, email, RFID and enterprise applications.
• Source: Demystifying Big Data, TechAmerica Foundation, 2012.
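A quick back-of-the-envelope check of the movie comparison above, offered only as a sketch. It assumes a decimal zettabyte (10^21 bytes) and a 365.25-day year; the slide does not state the per-movie size or the rounding used by the source report, so the result only needs to land in the same ballpark as the quoted figure.

```python
# Assumption: 1 zettabyte = 10**21 bytes (SI definition).
ZETTABYTE = 10**21
total_bytes = 1.8 * ZETTABYTE

movies = 200e9            # 200 billion movies, as quoted on the slide
hours_per_movie = 2

# Size each movie would need to have for the comparison to hold.
bytes_per_movie = total_bytes / movies

# Time for one person to watch all of them back to back.
hours_per_year = 24 * 365.25
years_of_viewing = movies * hours_per_movie / hours_per_year

print(f"Implied size per movie: {bytes_per_movie / 1e9:.1f} GB")
print(f"Continuous viewing time: {years_of_viewing / 1e6:.1f} million years")
```

Under these assumptions each movie is about 9 GB and the viewing time is roughly 46 million years, the same order as the 47 million years quoted.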
Helping to catch the backpacker killer
• Australia’s most notorious serial murder case
• Early 1990s, 7 young backpackers murdered.
• Police had developed a profile.
• Huge dataset generated of vehicle records, gym memberships, gun licensing and police records.
• Link analysis software from Sydney company NetMap Analytics narrowed the list of suspects from 18 million to 32, which included the murderer, Ivan Milat.
Predicting the 2012 US election result
• Nate Silver used predictive analytics & statistics to correctly predict outcomes of 50 out of 50 states from polling and related data.
• Republican pundits were confident in their landslide-win predictions. Democrat pundits predicted razor-thin victory.
• Shows the power of a data-centric approach over “gut-feeling”.
Fitting to the business
• Understanding the business context and, going further, framing a business question.
• Translating the business question into a data analytics question.
• Collecting, understanding and processing data from across the business and possibly externally.
• Building models and evaluating them.
• Deploying the results in the business to deliver benefits.
• Iterative process.
Fitting to the business
[Diagram: a Mathematical Model is used to
• predict the ‘class’ of unseen rows, e.g. customers to target
• find relationships between rows or columns, e.g. customer groups]
Two main approaches
• Unsupervised methods
• Model tries to make sense of the data set or characterise it.
• Supervised methods
• Model learns a relationship between inputs and outputs from historical data.
• Model can then be used to predict output for new data.
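The contrast between the two approaches can be sketched in a few lines of plain Python. Everything here is illustrative: the data values, the function names `fit_line` and `split_at_largest_gap`, and the choice of a least-squares line and a largest-gap split are assumptions made for the example, not methods prescribed by the slides.

```python
from statistics import mean

# Hypothetical historical data: advertising spend (input) and sales (output).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

# --- Supervised: learn an input -> output relationship from labelled history ---
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(spend, sales)
predicted_sales = slope * 6.0 + intercept   # prediction for a new, unseen input

# --- Unsupervised: no outputs are given, so we only characterise the data ---
def split_at_largest_gap(values):
    """Split sorted values into two groups at the widest gap between neighbours."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

ages = [19, 21, 22, 45, 47, 50]
young, older = split_at_largest_gap(ages)
```

The supervised half needs the `sales` column to learn from; the unsupervised half works on `ages` alone and simply reports the structure it finds.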
Data Warehousing to Data Mining
• Data Warehouse: organisation-wide, integrated access to a centralised repository plus data models
• On-Line Analytical Processing (OLAP):
• statistical summaries and basic analytical modeling
• build and cache fixed ‘cubes’ (business intelligence)
• restructure data for efficient analysis
• fast summarisation and aggregation at different levels
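Aggregation at different levels can be illustrated with a tiny in-memory "cube". This is a sketch only: the fact table, the dimension names and the `roll_up` helper are hypothetical, and a real OLAP engine would precompute and cache such summaries rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales amount).
facts = [
    ("NSW", "DVD", "Q1", 120.0),
    ("NSW", "DVD", "Q2", 80.0),
    ("NSW", "CD", "Q1", 50.0),
    ("VIC", "DVD", "Q1", 70.0),
    ("VIC", "CD", "Q2", 30.0),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(rows, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)

by_region_product = roll_up(facts, ["region", "product"])  # finer level
by_region = roll_up(facts, ["region"])                      # rolled up one level
grand_total = roll_up(facts, [])                            # fully aggregated
```

Moving from `by_region` to `by_region_product` corresponds to a drill-down; the reverse direction is a roll-up.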
Data Mining to Knowledge Discovery
• Data: raw uninterpreted facts
e.g. Tom, 20 years old, student
• Information relates items of Data together
e.g. Tom is 20 years old
• Knowledge relates items of Information together
e.g. Tom is 20 years old → Tom pays > $1500 insurance
• Modeling the world (= generalising)
e.g. [18 - 25] years old → P(accident) = high
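The last step, generalising from individual records to a statement about a group, might look like the following sketch. The records, the `age_band` helper and the two bands are invented for illustration; only the 18 to 25 band echoes the slide's example.

```python
# Hypothetical records (data): (name, age, had_accident).
records = [
    ("Tom", 20, True), ("Ann", 23, True), ("Raj", 19, False),
    ("Mei", 41, False), ("Lee", 38, False), ("Sam", 52, True),
    ("Kim", 24, True), ("Zoe", 47, False),
]

def age_band(age):
    """Illustrative banding; the 18-25 band mirrors the slide's example."""
    return "18-25" if 18 <= age <= 25 else "other"

def accident_rate_by_band(rows):
    """Estimate P(accident) for each age band from the observed records."""
    counts = {}
    for _name, age, accident in rows:
        band = age_band(age)
        total, hits = counts.get(band, (0, 0))
        counts[band] = (total + 1, hits + int(accident))
    return {band: hits / total for band, (total, hits) in counts.items()}

rates = accident_rate_by_band(records)
```

With this toy data the 18-25 band has a much higher estimated accident rate than the rest, which is the kind of generalisation the slide describes. With so few records such an estimate would of course be unreliable in practice.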
Data Mining - a Business Intelligence view
[Diagram labels: Business Problem, Data mining problem(s), Data Mining, Patterns, Business Intelligence, Domain; supporting layers: Data & Information Visualisation, Data Warehousing, Methods and Frameworks, Knowledge Discovery Techniques]
CRISP-DM view
[Figure: CRISP-DM process diagram]
Source: Kenneth Jensen / Wikimedia Commons / Public Domain
The rising profession of Data Analyst
• “Data mining as a profession is definitely growing because data is growing. Data is becoming more and more usable because of data warehousing (where information from many locations can be centrally mined). So the only way is up.” - Eugene Dubossarsky (Ernst & Young)
• “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis.” - Prof. Hal Varian, UC Berkeley, Chief Economist at Google.
• The ATO has a network of 30+ data miners working with another 70 or so analytics staff. - Dr Warwick Graco, Australian Taxation Office
Data Miners / Data Analysts
• Typical data mining jobs pay six-figure salaries. The required blend of skills makes good data miners a rare breed. - Ronnie Chan, senior IT specialist, IBM's DB2 team
• Data miners are the SAS of the IT industry, and it's not a job for beginners. Demand is strong for people who have the technical skills combined with business knowledge. “To produce useable results, data miners must draw on advanced analytical approaches such as predictive modelling, association discovery and sequence discovery.” - Peter Norris, Business Manager, Computer Associates
10 Hot IT Skills for 2013
• ComputerWorld, 24/9/12
• #5 Business Intelligence / Analytics
• “Big data is one of the top priorities for many companies, but getting the right people to analyze all that information is challenging, says Jerry Luftman, managing director at the Global Institute for IT Management and a leader in the Society for Information Management.
• The best candidates have technical know-how, business knowledge and strong statistical and mathematical backgrounds -- an uncommon mix of skills, Luftman says. In fact, some companies are hiring statisticians and teaching them about technology and business.”
Gartner Top 10 Strategic Technology Trends for 2013
• Gartner identifies the Top 10 Strategic Technology Trends for 2013, October 23, 2012
• Of the 10 strategic trends, two were for data analytics.
• Strategic Big Data
• Actionable Analytics
Gartner Top 10 Strategic Technology Trends for 2013
• Strategic Big Data
• “Big Data is moving from a focus on individual projects to an influence on enterprises’ strategic information architecture. Dealing with data volume, variety, velocity and complexity is forcing changes to many traditional approaches. This realization is leading organizations to abandon the concept of a single enterprise data warehouse containing all information needed for decisions. Instead they are moving towards multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, which will become the "logical" enterprise data warehouse.”
Gartner Top 10 Strategic Technology Trends for 2013
• Actionable Analytics
• “Analytics is increasingly delivered to users at the point of action and in context. With the improvement of performance and costs, IT leaders can afford to perform analytics and simulation for every action taken in the business. The mobile client linked to cloud-based analytic engines and big data repositories potentially enables use of optimization and simulation everywhere and every time. This new step provides simulation, prediction, optimization and other analytics, to empower even more decision flexibility at the time and place of every business process action.”
Institute of Analytics Professionals of Australia
• “Our mission is to unite, inform, support and promote analytics professionals in Australia. We provide information sources, a virtual community, a networking hub and a professional identity. We promote the benefits of analytics in modern business.”
• www.iapa.org.au
Privacy
• Privacy is important and it is an ethical concern for data analysts.
• Laws directly govern data mining in Australia and overseas.
• Some basic principles from OECD:
• Collection limitation: data should be obtained lawfully and fairly.
• Data quality: data should be relevant to the stated purposes, accurate, complete and up-to-date.
• Purpose specification: the purpose for which data will be used should be stated, and data should be destroyed once it no longer serves that purpose.
• Use limitation: use of data for purposes other than those specified is forbidden.
Market analysis & management
• Data sources?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, social media, plus (public) lifestyle studies
• Target marketing
• Find clusters of ‘model’ customers who share same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
• e.g. conversion of single to joint bank account: marriage, ...
• Cross-market analysis
• Associations / correlations between product sales
• Prediction based on the association information.
Market analysis & management (cont’d)
• Customer profiling
• Data analytics can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers.
• Provide summary information
• Various multidimensional summary reports
• Statistical summary information (mean and variance ...)
The Knowledge Discovery Process
[Diagram: Databases → Data Cleaning & Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Patterns → Pattern Evaluation → Knowledge]
Note: iterative process, not waterfall!
The KDD Process
• Learn the application domain (prior knowledge & goals)
• Create target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant representation
• Choose functions of data mining: the “data mining problem”
• Summarisation, classification, regression, association, clustering
• Choose the data mining algorithm(s)
• Data Mining: find patterns of interest
• Pattern evaluation and knowledge presentation
• Visualisation, transformation, remove redundant patterns, ...
• Use of discovered knowledge
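The steps listed above can be sketched as a chain of small functions. This is a deliberately simplified illustration, not a recommended pipeline: the records, the function names, the drop-missing-rows policy, the min-max rescaling and the toy "pattern" are all assumptions chosen to keep the example short.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, annual_spend); None marks a missing value.
raw = [(1, 25, 300.0), (2, None, 120.0), (3, 41, 950.0),
       (4, 38, None), (5, 29, 410.0), (6, 55, 1020.0)]

def select(rows):
    """Data selection: keep only the task-relevant columns (age, spend)."""
    return [(age, spend) for _id, age, spend in rows]

def clean(rows):
    """Data cleaning: drop rows with missing values (one simple policy among many)."""
    return [r for r in rows if None not in r]

def transform(rows):
    """Transformation: rescale each column to [0, 1]; assumes each column varies."""
    cols = list(zip(*rows))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - low) / (high - low) for v, low, high in zip(r, lows, highs))
            for r in rows]

def mine(rows):
    """Data mining: a trivial pattern, mean spend of older vs younger customers."""
    older = [s for a, s in rows if a >= 0.5]
    younger = [s for a, s in rows if a < 0.5]
    return {"older_mean_spend": mean(older), "younger_mean_spend": mean(younger)}

def evaluate(pattern, min_gap=0.2):
    """Pattern evaluation: keep the pattern only if the difference is large enough."""
    return abs(pattern["older_mean_spend"] - pattern["younger_mean_spend"]) >= min_gap

cleaned = clean(select(raw))
pattern = mine(transform(cleaned))
interesting = evaluate(pattern)
```

In practice the chain is rarely run once from left to right: an uninteresting pattern usually sends the analyst back to the selection or cleaning steps.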
[Diagram: Data Mining surrounded by contributing fields: Database Technology, Statistics, Artificial Intelligence, Visualisation, Information Science and Other Disciplines]
• HCI
• High Performance Computing
• Software Engineering
Database technology
• OLTP → OLAP → OLAM
• Data Warehouses
• Subject-oriented, integrated, time-variant, non-volatile
• Excellent starting point for data mining
• Data Marts: specialised, smaller data store
• OLAP: drill-down, roll-up, slice-n-dice, data cubes
OLAP vs Data Mining
OLAP - On-Line Analytical Processing
• Emphasis on Query
• Generally know what you want to find.
• Expressible in SQL
• Drill-down, data cubes
Data Mining
• Emphasis on Exploration
• General idea of target but not how to find it.
• Let the machine drive the exploration
Statistics
• Data, Counting, Probabilities, Hypothesis Testing
• Correlation and regression analyses
• Exploratory data analysis
• Predictive models
• CART: Classification And Regression Trees
• MARS: Multivariate Adaptive Regression Splines
• TreeNet
• Random Forest
• Important foundations for data mining and knowledge discovery
• Ensemble methods
• Computational requirements → Sampling
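Two of the ideas above, correlation and sampling to reduce computation, can be shown with the standard library alone. The data and the helper name `pearson_r` are made up for the example, and the fixed random seed is only there to make the sketch repeatable.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours of study and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 70, 72]
r = pearson_r(hours, marks)

# Sampling: estimate a population statistic from a small random subset
# instead of scanning every record.
rng = random.Random(0)                  # fixed seed so the sketch is repeatable
population = list(range(1, 10_001))
sample = rng.sample(population, 500)
sample_mean = mean(sample)
population_mean = mean(population)
```

Here `r` comes out close to 1, indicating a strong positive linear relationship, and `sample_mean` approximates `population_mean` using 5% of the records.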
Artificial Intelligence (AI)
• Brings to data analytics
• The inductive approach (machine learning) - the design cycle for predictive modeling
• Knowledge representation
• Inference
• Generalisation: everyone who drank beer in Sydney in 1900 is now dead.
• Inference: Therefore, beer is fatal.
• Warning: it’s easy to get into a similar situation in data analytics!
• Uses Data Analytics
• e.g. as supporting components in multi-agent systems.
• e.g. in multi-agent electronic markets: negotiation agents request information about their opponents & text mining bots deliver that kind of information.
Artificial Intelligence (AI)
• The design cycle for predictive modeling
• Issues:
• Algorithms developed for toy datasets (< few hundred points)
• Prior knowledge (e.g. bias)
• Model deviation from true model
• Sampling distributions
• Computational complexity
[Diagram: Collect data → Select features → Select model type → "Train" classifier → Evaluate classifier]
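The five stages of the design cycle can be walked through with a toy classifier. The sketch below uses synthetic data and a single-threshold rule (a "decision stump") purely for illustration; the column meanings, the noise level and the 150/50 split are assumptions, not part of the original material.

```python
import random

# 1. Collect data: hypothetical (income_k, age, bought) rows, generated synthetically.
rng = random.Random(42)
data = []
for _ in range(200):
    income = rng.uniform(20, 120)
    age = rng.uniform(18, 70)
    bought = income + rng.gauss(0, 10) > 70   # label depends on income plus noise
    data.append((income, age, bought))

# 2. Select features: here we simply choose income (column 0).
FEATURE = 0

# 3. Select model type: a one-threshold rule ("decision stump").
def predict(threshold, row):
    return row[FEATURE] > threshold

def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(threshold, r) == r[2] for r in rows) / len(rows)

# Hold out data that the model never sees during training.
rng.shuffle(data)
train, test = data[:150], data[150:]

# 4. "Train" classifier: pick the threshold with the best training accuracy.
candidates = sorted(r[FEATURE] for r in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# 5. Evaluate classifier on the held-out test set.
test_accuracy = accuracy(best_threshold, test)
```

Evaluating on held-out rows matters because accuracy on the training rows alone tends to be optimistic.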
Visualisation
• Deals with visual presentation of the data.
• “A picture is worth a thousand words” - true?
• Taps into human strengths
• In Data Analytics
• Understanding data
• Visualising the process
• Visualising and communicating the results
Data Analytics: Techniques (unsupervised)
• Association analysis (correlation and causality)
• Identify attribute-value conditions that frequently occur in the data
• Examples:
• age(P, “20..29”) ^ income(P, “20..29K”) → buys(P, “DVDs”) [support = 2%, confidence = 60%]
• contains(T, “MP3 player”) → contains(T, “sound processing software”) [support = 1%, confidence = 75%]
• Support: fraction of all records in which both the left-hand side (the ‘attribute’ conditions) and the right-hand side (the ‘value’) of the rule occur.
• Confidence: fraction of the records matching the left-hand side for which the right-hand side also holds (i.e. where attribute → value).
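Support and confidence are easy to compute directly for a single rule. The sketch below uses a handful of invented market-basket transactions; the function names and the data are illustrative assumptions, and a real association-mining tool would search over many candidate rules rather than score just one.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"MP3 player", "sound processing software", "headphones"},
    {"MP3 player", "headphones"},
    {"MP3 player", "sound processing software"},
    {"DVD", "headphones"},
    {"MP3 player", "sound processing software", "DVD"},
]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both sides of the rule."""
    both = antecedent | consequent
    return sum(both <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= b for b in with_antecedent) / len(with_antecedent)

lhs = {"MP3 player"}
rhs = {"sound processing software"}
rule_support = support(lhs, rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
```

For this toy data the rule holds in 3 of 5 baskets (support 0.6) and in 3 of the 4 baskets that contain an MP3 player (confidence 0.75).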
Data Analytics: Techniques (unsupervised)
• Clustering (cluster analysis)
• Identify groups within data where data points in the group are similar to one another but different to those in other groups.
• Identify groups within data that maximise intraclass similarity and minimise interclass similarity.
• Examples:
• cluster crime locations based on characteristics of the crimes.
• cluster students based on their marks in assignments for all the core subjects of their degree.
• Building models from unlabelled data: unsupervised learning
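As one concrete instance of clustering, here is a minimal one-dimensional k-means sketch applied to student marks. The marks, the starting centres and the helper name `k_means_1d` are assumptions for illustration; k-means is only one of many clustering algorithms and the slides do not single it out.

```python
from statistics import mean

def k_means_1d(values, centres, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centre, then recompute centres."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters

# Hypothetical assignment marks for eight students.
marks = [35, 40, 42, 68, 70, 75, 90, 92]
centres, clusters = k_means_1d(marks, centres=[30, 60, 95])
```

No grade labels are supplied; the three groups emerge purely from how close the marks are to one another. The result depends on the starting centres, which is a known limitation of this method.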
Data Analytics: Techniques (supervised)
• Classification and Prediction
• Using historical data find a model which describes and distinguishes data classes or concepts for the purpose of using the model to classify or predict the class of unknown entities.
• Examples:
• Build a model to classify countries based on climate or cars based on engine efficiency and on-road behaviour.
• Build a model to predict whether customers are likely to purchase a download of a particular music file.
• Build a model to predict the grade (Z, P, C, D, H) of a student based on students who previously did a subject.
• Building models from labelled data: supervised learning.
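The student-grade example can be sketched with a k-nearest-neighbours classifier, chosen here only because it fits in a few lines. The marks, the grades attached to them and the helper name `knn_predict` are invented for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical labelled history: (assignment mark, exam mark) -> final grade.
history = [
    ((30, 35), "Z"), ((40, 38), "Z"),
    ((52, 55), "P"), ((58, 50), "P"),
    ((68, 66), "C"), ((66, 70), "C"),
    ((78, 80), "D"), ((88, 90), "H"),
]

def knn_predict(point, labelled, k=3):
    """Predict the label of `point` by majority vote among its k nearest labelled examples."""
    nearest = sorted(labelled, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

predicted = knn_predict((55, 53), history)
```

Unlike the clustering case, every historical row carries a known grade, and the model's job is to carry those labels over to a student it has not seen before.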
Data Analytics: Techniques
• Outlier analysis
• Identify entities that are different to other entities or to a model of data.
• i.e. Find exceptions to the rule!
• Example: odd patterns can be easily hidden among 10 million transactions, but may indicate fraud.
• Statistics usually treats them as noise or exceptions.
• Data analytics: rare and unusual events or items are generally interesting.
• Time-series analysis
• Identify similar patterns over time - trends, deviation, sequential patterns, periodicity analysis
• Example: predicting trends in share prices
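Both techniques can be illustrated with very small helpers. The outlier check below uses a modified z-score based on the median absolute deviation, with the commonly used constant 0.6745 and cutoff 3.5; that choice, the moving-average window and all data values are assumptions for the sketch rather than anything specified in the slides.

```python
from statistics import mean, median

def mad_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score (median absolute deviation based) exceeds the cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

# Hypothetical transaction amounts with one odd entry.
amounts = [20, 22, 19, 21, 23, 20, 18, 950, 22, 21]
suspicious = mad_outliers(amounts)

def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive points."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

# Hypothetical daily share prices.
prices = [10, 11, 13, 12, 15, 16, 18]
trend = moving_average(prices, window=3)
```

A median-based score is used because a single extreme value would inflate an ordinary mean and standard deviation and could hide itself. The moving average smooths out the dip on the fourth day and exposes the upward trend.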
[Diagram: techniques arranged between “Understandable by Humans” and “‘Understandable’ by Computers”: Association Rules, Bayesian Networks, Decision Trees, Neural Networks]