
Introduction

Motivation: Business Intelligence

Jian Pei: CMPT 741/459 Data Mining -- Introduction 2

Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (Product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:

Techniques: Business Intelligence

•  Multidimensional data analysis
•  Online query answering
•  Interactive data exploration
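A minimal sketch of multidimensional data analysis, assuming a handful of made-up sales facts. The record layout, the `roll_up` helper and every value below are illustrative and not taken from the course material. The idea shown is a group-by aggregate over whichever dimensions the analyst picks, which is the basic step behind rolling up and drilling down.

```python
from collections import defaultdict

# Hypothetical sales facts: (customer gender, product category, #units, unit price).
SALES = [
    ("F", "grocery", 3, 2.50),
    ("M", "grocery", 1, 2.50),
    ("F", "electronics", 1, 199.00),
    ("M", "electronics", 2, 199.00),
    ("F", "grocery", 2, 4.00),
]

# Position of each dimension inside a sales record (an assumption of this sketch).
DIMENSIONS = {"gender": 0, "category": 1}


def roll_up(sales, dims):
    """Total revenue grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for row in sales:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[2] * row[3]  # units * unit price
    return dict(totals)


by_category = roll_up(SALES, ["category"])                        # coarse view
by_gender_and_category = roll_up(SALES, ["gender", "category"])   # drill-down
grand_total = roll_up(SALES, [])                                  # fully rolled up

print(by_category)
print(by_gender_and_category)
print(grand_total)
```

Passing an empty dimension list collapses everything into one total; adding dimensions refines the view without changing the code.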


Motivation: Store Layout Design


http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Techniques: Store Layout Design

•  Customer purchase patterns
•  Business strategies
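One simple way to look for customer purchase patterns is to count which products are bought together. The sketch below assumes five invented shopping baskets and a hypothetical `min_support` threshold; it is only an illustration of the idea, not the method used in the course.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per customer visit.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def frequent_pairs(baskets, min_support):
    """Return product pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("beer", "bread") and ("bread", "beer") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(BASKETS, min_support=3)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Pairs that co-occur often are candidates for being shelved near each other, or deliberately apart, depending on the business strategy.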


Motivation: Community Detection


http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

Techniques: Community Detection

•  Similarity between objects
•  Partitioning objects into groups
– No guidance about what a group is
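The two ingredients above can be sketched as follows, assuming Jaccard similarity on invented interest sets and a hypothetical `threshold`. Objects that are similar enough are linked, and each connected set of linked objects becomes a group. No labels are supplied, so the groups emerge from the data alone. This is one of many possible grouping schemes, chosen here only because it is short.

```python
def jaccard(a, b):
    """Similarity of two non-empty sets: shared items / all items."""
    return len(a & b) / len(a | b)


# Hypothetical members and their interests.
INTERESTS = {
    "ann": {"hiking", "jazz", "chess"},
    "bob": {"hiking", "jazz"},
    "cat": {"jazz", "chess"},
    "dan": {"soccer", "poker"},
    "eve": {"soccer", "poker", "golf"},
}


def group_by_similarity(items, threshold):
    """Link objects whose similarity reaches threshold; return connected groups."""
    unvisited = set(items)
    groups = []
    while unvisited:
        seed = min(unvisited)          # deterministic starting point
        unvisited.remove(seed)
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            linked = {other for other in unvisited
                      if jaccard(items[current], items[other]) >= threshold}
            unvisited -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups


communities = group_by_similarity(INTERESTS, threshold=0.5)
print(communities)
```

Raising the threshold splits the groups apart; lowering it merges them, which is why "what a group is" has to be decided by the analyst.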


Motivation: Disease Prediction


Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …

What medical problems does this patient have?

Techniques: Disease Prediction

•  Features
•  Model


Motivation: Fraud Detection


http://i.imgur.com/ckkoAOp.gif

Techniques: Fraud Detection

•  Features
•  Dissimilarity
•  Groups and noise
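A rough sketch of how these three ideas fit together, assuming each transaction is described by two invented features and that dissimilarity is plain Euclidean distance. A transaction with too few close neighbours is treated as noise and flagged. The feature choice, `radius` and `min_neighbors` are all assumptions made for illustration.

```python
import math

# Hypothetical features per transaction: (amount in dollars, distance from home in km).
TRANSACTIONS = {
    "t1": (20.0, 1.0),
    "t2": (22.0, 1.5),
    "t3": (19.0, 0.8),
    "t4": (21.0, 1.2),
    "t5": (950.0, 40.0),
}


def dissimilarity(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)


def find_noise(points, radius, min_neighbors):
    """Names of points with fewer than min_neighbors other points within radius."""
    noise = []
    for name, p in points.items():
        neighbors = sum(
            1 for other, q in points.items()
            if other != name and dissimilarity(p, q) <= radius
        )
        if neighbors < min_neighbors:
            noise.append(name)
    return sorted(noise)


suspicious = find_noise(TRANSACTIONS, radius=5.0, min_neighbors=2)
print(suspicious)
```

In practice the features would need to be scaled so that no single one dominates the distance; that step is left out to keep the sketch short.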


http://i.stack.imgur.com/tRDGU.png

What Is Data Science About?

•  Data
•  Extraction of knowledge from data
•  Continuation of data mining and knowledge discovery from data (KDD)


What Is Data?

•  Values of qualitative or quantitative variables belonging to a set of items

•  Represented in a structure, e.g., tabular, tree or graph structure

•  Typically the results of measurements
•  As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived


What Is Information?

•  “Knowledge communicated or received concerning a particular fact or circumstance”

•  Conceptually, information is the message (utterance or expression) being conveyed

•  Cannot be predicted
•  Can resolve uncertainty


What Is Knowledge?

•  Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education

•  Implicit knowledge: practical skill or expertise
•  Explicit knowledge: theoretical understanding of a subject


Data Systems

•  A data system answers queries based on data acquired in the past

•  Base data – the rawest data not derived from anywhere else

•  Knowledge – information derived from the base data


Dealing with Data – Querying

•  Given a set of student records about name, age, courses taken and grades

•  Simple queries
– What is John Doe’s age?
•  Aggregate queries
– What is the average GPA of all students at this school?
•  Queries can be arbitrarily complicated
– Find the students X and Y whose grades are less than 3% apart in as many courses as possible
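The three kinds of query can be sketched on a tiny in-memory data set. Everything below is invented for illustration: the student names other than John Doe, the `gpa` field, the grades, and the reading of "less than 3% apart" as an absolute difference of fewer than 3 percentage points.

```python
from itertools import combinations
from statistics import mean

# Hypothetical student records; grades are percentages per course.
STUDENTS = {
    "John Doe": {"age": 20, "gpa": 3.5,
                 "grades": {"math": 80, "physics": 75, "history": 90}},
    "Jane Roe": {"age": 22, "gpa": 3.0,
                 "grades": {"math": 82, "physics": 74, "history": 60}},
    "Max Poe": {"age": 21, "gpa": 2.5,
                "grades": {"math": 60, "physics": 70, "history": 91}},
}


def close_courses(a, b, tolerance=3):
    """Number of shared courses where the two grades differ by less than tolerance."""
    grades_a, grades_b = STUDENTS[a]["grades"], STUDENTS[b]["grades"]
    shared = grades_a.keys() & grades_b.keys()
    return sum(1 for c in shared if abs(grades_a[c] - grades_b[c]) < tolerance)


# Simple query: look up one attribute of one individual.
john_age = STUDENTS["John Doe"]["age"]

# Aggregate query: summarize one attribute over all individuals.
average_gpa = mean(s["gpa"] for s in STUDENTS.values())

# Complicated query: compare every pair of students and keep the best match.
best_pair = max(combinations(sorted(STUDENTS), 2),
                key=lambda pair: close_courses(*pair))

print(john_age, average_gpa, best_pair, close_courses(*best_pair))
```

The first two queries touch each record at most once, while the third compares every pair of students, so its cost grows quadratically with the number of records.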


Queries

•  A precise request for information
•  Subjects in databases and information retrieval
– Databases: structured queries on structured (e.g., relational) data
– Information retrieval: unstructured queries on unstructured (e.g., text, image) data
•  Important assumptions
– Information needs
– Query languages


Data-driven Exploration

•  What should be the next strategy of a company?
– A lot of data: sales, human resources, production, tax, service cost, …
•  The question cannot be translated into a precise request for information (i.e., a query)
•  Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data


Data-driven Thinking

•  Starting with some simple queries
•  New queries are raised by consuming the results of previous queries
•  No ultimate query in design!
– But many queries can be answered using DB/IR techniques


The Art of Data-driven Thinking

•  The way of generating queries remains an art!
– Different people may derive different results using the same data

“If you torture the data long enough, it will confess” – Ronald H. Coase

•  More often than not, more data may be needed – datafication


Queries for Data-driven Thinking

•  Probe queries – finding information about specific individuals

•  Aggregation – finding information about groups
•  Pattern finding – finding commonality in population
•  Association and correlation – finding connections among individuals and groups
•  Causality analysis – finding causes and consequences


What Is Data Mining?

•  Broader sense: the art of data-driven thinking

•  Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
– Methods and tools of answering various types of queries in the data mining process in the broader sense


Machine Learning

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

– Tom M. Mitchell

•  Essentially, learn the distribution of data


Data mining vs. Machine Learning

•  Machine learning focuses on prediction, based on known properties learned from the training data

•  Data mining focuses on the discovery of (previously) unknown properties in the data


The KDD Process

Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
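The stages of the KDD process can be read as a chain of functions, each turning one data product into the next. The sketch below assumes a toy customer table; the records, the decade age bands, and the choice of "average spend per band" as the pattern are all invented for illustration.

```python
# Hypothetical raw data: one record per customer visit.
RAW = [
    {"id": 1, "age": 23, "spend": 40.0, "note": "walk-in"},
    {"id": 2, "age": 31, "spend": None, "note": "online"},
    {"id": 3, "age": 37, "spend": 90.0, "note": "online"},
    {"id": 4, "age": 25, "spend": 60.0, "note": "walk-in"},
    {"id": 5, "age": 35, "spend": 110.0, "note": "online"},
]


def select(records):
    """Selection: data -> target data (keep only the relevant attributes)."""
    return [(r["age"], r["spend"]) for r in records]


def preprocess(rows):
    """Preprocessing: drop rows with missing values."""
    return [(age, spend) for age, spend in rows
            if age is not None and spend is not None]


def transform(rows):
    """Transformation: replace exact age by a decade band such as '20s'."""
    return [(f"{age // 10 * 10}s", spend) for age, spend in rows]


def mine(rows):
    """Data mining: average spend per age band (the 'patterns')."""
    by_band = {}
    for band, spend in rows:
        by_band.setdefault(band, []).append(spend)
    return {band: sum(v) / len(v) for band, v in by_band.items()}


def interpret(patterns):
    """Interpretation/evaluation: pick the band with the highest average spend."""
    return max(patterns, key=patterns.get)


patterns = mine(transform(preprocess(select(RAW))))
knowledge = interpret(patterns)
print(patterns, knowledge)
```

In a real project the process is iterative: what is learned at the interpretation stage usually sends the analyst back to an earlier stage.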

Data Mining R&D

•  New problem identification
•  Data collection and transformation
•  Algorithm design and implementation
•  Evaluation
– Effectiveness evaluation
– Efficiency & scalability evaluation

•  Deployment and business solution


Data Mining on Big Data

“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”

– Hal Varian, Google’s Chief Economist


What Is Big Data?

•  No quantitative definition!
•  “Big data is like teenage sex
– everyone talks about it,
– nobody really knows how to do it,
– everyone thinks everyone else is doing it,
– so everyone claims they are doing it...”

– Dan Ariely


Data Volume vs. Storage Cost

•  The unit cost of disk storage decreases dramatically


Year    Unit cost
1956    $10,000/MB
1980    $193/MB
1990    $9/MB
2000    $6.9/GB
2010    $0.08/GB
2013    $0.06/GB

http://ns1758.ca/winch/winchest.html
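To compare the figures in the table, the per-MB prices have to be converted to the same unit as the later rows. The sketch below does that arithmetic, assuming 1 GB = 1024 MB and reading the 2013 entry as dollars; both are assumptions of this example, and using 1000 MB per GB would change the results by about 2%.

```python
MB_PER_GB = 1024  # assumption: binary gigabyte

# (year, price in dollars, unit) as listed in the table above.
COSTS = [
    (1956, 10_000.0, "MB"),
    (1980, 193.0, "MB"),
    (1990, 9.0, "MB"),
    (2000, 6.9, "GB"),
    (2010, 0.08, "GB"),
    (2013, 0.06, "GB"),  # assumed to be in dollars like the other rows
]


def cost_per_gb(amount, unit):
    """Express a price in dollars per GB."""
    return amount * MB_PER_GB if unit == "MB" else amount


per_gb = {year: cost_per_gb(amount, unit) for year, amount, unit in COSTS}

# How many times cheaper a gigabyte became between 1956 and 2013.
drop_factor = per_gb[1956] / per_gb[2013]

for year, price in per_gb.items():
    print(f"{year}: ${price:,.2f}/GB")
print(f"1956 -> 2013: about {drop_factor:.2e} times cheaper")
```

Under these assumptions a gigabyte that would have cost about ten million dollars in 1956 cost a few cents in 2013, a drop of roughly eight orders of magnitude.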

Big Data – Volume

“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”

— Wikipedia


H1N1 Pandemic Crisis (2009)

•  A new flu virus combining elements of the viruses that cause bird flu and swine flu
•  The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week
– A two-week lag in data collection
•  Google used user search keywords to predict the spread of winter flu
– A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
•  Some things can be done with large-scale data but cannot be done with smaller-scale data


Detecting Hurricane – Unsupervised


D. Kang, D. Jiang, J. Pei, Z. Liao, X. Sun, and H-J. Choi. "Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach". In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM'11), Hong Kong, China, February 9-12, 2011.

Big Data: Volume

•  Every day, about 7 billion shares change hands on US equity markets
– About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
•  In Q2 2015
– Facebook has 1.49 billion active users
– WeChat has 600 million active users, 100 million outside China
– LinkedIn has 380 million active users
– Twitter has 304 million active users


Velocity

•  Google processes 24+ petabytes of data per day

•  Facebook gets 10+ million new photos uploaded every hour

•  Facebook members like or leave a comment 3+ billion times per day

•  YouTube users upload 1+ hour of video every second

•  400+ million tweets per day


What Has Changed?

•  The 1880 census in the US took 8 years to complete
– The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
•  It is essential to get data that is not only accurate but also timely
– Statisticians use sampling to estimate

•  Recently, with the new technologies, the ways of data collection and transmission have been fundamentally changed


Sampling for Volume/Velocity?

•  Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
– The sample should be truly random
•  On a data set of hundreds or thousands of attributes, can sampling help in
– Finding subcategories of attribute combinations?
– Finding outliers and exceptions?
•  Big data contains signals of different strengths
– Not noise, but weaker and weaker signals that may still be interesting and important
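Why sampling struggles with outliers and weak signals can be made concrete with a little counting. The sketch below computes the probability that a uniform random sample, drawn without replacement, contains at least one of a few rare records. The function name and the example sizes are assumptions chosen for illustration.

```python
from math import comb


def chance_sample_sees_rare(population, rare, sample_size):
    """Probability that a uniform sample (without replacement) of sample_size
    records contains at least one of the `rare` special records."""
    # Number of samples that miss every rare record, over all possible samples.
    misses = comb(population - rare, sample_size)
    return 1 - misses / comb(population, sample_size)


# A 1% sample of 10,000 records that hide 5 outliers finds any of them
# only rarely (roughly 5% of the time).
p_small = chance_sample_sees_rare(population=10_000, rare=5, sample_size=100)

# Sampling half of the data still misses all 5 outliers now and then.
p_half = chance_sample_sees_rare(population=10_000, rare=5, sample_size=5_000)

print(f"1% sample:  {p_small:.3f}")
print(f"50% sample: {p_half:.3f}")
```

A small random sample is usually enough to estimate an average, but the rarer the signal, the more of the data has to be examined before it shows up at all.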


Big Data – Lytro Pictures

•  Lytro pictures record the whole light field
– Photographers can decide later which parts to focus on
•  Big data tries to record as much information as possible
– Analysts can decide later what to extract from big data
– Both advantages and challenges


Veracity

•  “1 in 3 business leaders don't trust the information they use to make decisions”

•  Assuming a slowly growing total cost budget, tradeoff between data volume and data quality

•  Loss of veracity in combining different types of information from different sources

•  Loss of veracity in data extraction, transformation, and processing


Variety

•  Integrating data capturing different aspects of a data object
– Vancouver Canucks: game video, technical statistics, social media, …
– Different pieces are in different formats
•  Different views of the same data object from different sources
– Did the soccer ball pass the goal line?
– The views may not be consistent


Four V-challenges

•  Volume: massive scale and growth, 40% per year in global data generated

•  Velocity: real time data generation and consumption

•  Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources

•  Veracity


Is Big Data Really New?

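•  People were aware of the existence of big data a long time ago, but no one could access it until very recently
– (Genesis 28:15) “I am with you and will watch over you wherever you go”
– “密室私语,天闻如雷;暗室欺心,神目如电;善恶之报,如影随行” (roughly: “Whispers in a sealed room reach heaven like thunder; deceit in a dark room is seen by the gods like lightning; the rewards of good and evil follow like a shadow”)
– Similar statements in Quran and Sutra
•  What has changed?
– How data is connected with people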


Diversity in Data Usage

•  In the past, only very few projects could afford to be data-intensive

•  Nowadays, a great many applications are (naturally) data-intensive


Datafication

•  Extract data about an object or event in a quantified way so that it can be analyzed
– Different from digitalization
•  An important feature of big data
•  Key: new data, new applications, new opportunities


New Values of Datafication

•  Example: Captcha and ReCaptcha (Luis von Ahn)

•  How to create new values of data and datafication?
– Connecting data with new users
– Connecting different pieces of data to present a bigger picture
•  Important techniques
– Data aggregation
– Extended datafication


Big Data Players

•  Data holders
•  Data specialists
•  Big-data mindset leaders
•  A capable company may play 2 or 3 roles at the same time
•  What is most important: big-data mindset, skills, or data itself?


Privacy

•  “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”

— Executive Office of the (US) President


A Beautiful Story about Big Data


Source: http://abcnews.go.com/blogs/lifestyle/2014/01/the-genius-okcupid-hack-that-led-to-true-love/

Romantics in the Big Data Age

•  Datafication and feature selection
•  Using data about many people (e.g., 20,000 women in McKinlay’s story)
•  Ranking and drilling down into groups
•  Connecting data analytics with practice (Chris McKinlay dated 88 until he met Christine Wang)


Keep in Mind

“Our industry does not respect tradition – it only respects innovation.”

– Satya Nadella


Goals of This Course

•  Data-driven thinking – towards being a (big) data scientist

•  Principles and hands-on skills of data mining, particularly in the context of big data
– Identifying new data mining problems
– Data mining algorithm design
– Data mining applications

•  Novel problems for upcoming research

Format

•  Due to the fast progress in data mining, we will go beyond the textbook substantially

•  Active classroom discussion
•  Open questions and brainstorming
•  Textbook: Data Mining – Concepts and Techniques (3rd ed)


Read – Try – Think

•  Reading
– (required) Textbook and a small number of research papers
– You have to have the 3rd ed of the textbook!
– (open-ended, not covered by the exam) Technical and non-technical materials
•  Trying
– Assignments and a project
•  Thinking
– Examine everything from a data scientist's angle, starting today


Data Mining: History

•  1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
•  91-94 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)


Data Mining: History (cont’d)

•  95-98 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
•  ACM SIGKDD conferences since 1998 and SIGKDD Explorations
•  More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
•  ACM Transactions on KDD starting in 2007


KDD Conferences

•  ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining (KDD) (best)

•  IEEE International Conference on Data Mining (ICDM)

•  SIAM Data Mining Conference (SDM)


Regional Conferences

•  Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)
– European KDD
– Co-organized with ECML (European Conference on Machine Learning)
•  Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
– Asian KDD


Journals

•  ACM Transactions on KDD
•  IEEE Transactions on Knowledge and Data Engineering (TKDE)
•  Data Mining and Knowledge Discovery (DAMI or DMKD)
•  Knowledge and Information Systems
•  KDD Explorations

Differences between 459 and 741

•  CMPT 459 – undergraduate version
– Basic concepts and methods
– What, why, how
– Focus: essential data mining methods and variations
•  CMPT 741 – graduate version
– Focus: how to use the principles and ideas to solve new problems – new methods may be needed!
– For course-based/Big Data Professional program students, something in between


Student Groups

•  459 students
•  741 course-based/Big Data Professional program students
•  741 thesis-based students
•  Different groups will be trained differently to meet their objectives


Knowing Your Peers


Preparation


Workload


Evaluation

•  5 regular assignments
– Exam questions will be similar to those in regular assignments
•  5 mini assignments
– Team work (2 students at a time)
– One has to team up with different students in different mini assignments
•  Project
– Mining a real data set
•  Exam
– Solving questions using the materials covered in the class or their simple combinations


Lectures

•  Cover major ideas and critical details
•  To-do-list specifies the materials one should understand
•  Assignments are the hints for the final exam
•  Extended materials are only for students who want to learn more, and are not required in the exam


To-Do-List

•  Read Chapter 1 in the textbook
•  Understand the concepts mentioned in Section 1.8 (some of them are omitted in the lecture notes)
