introduction - computing science - simon fraser … business intelligence jian pei: cmpt 741/459...

Introduction

Motivation: Business Intelligence

Jian Pei: CMPT 741/459 Data Mining -- Introduction 2

Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (Product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:

Techniques: Business Intelligence

•  Multidimensional data analysis
•  Online query answering
•  Interactive data exploration
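As a rough illustration of multidimensional analysis over the three tables above, the sketch below totals revenue along any chosen combination of dimensions. The rows, the dictionary layout, and the helper name `revenue_by` are all hypothetical; only the column names follow the slide.

```python
from collections import defaultdict

# Hypothetical rows following the customer, product, and sales schemas.
customers = {1: {"gender": "F", "age": 34}, 2: {"gender": "M", "age": 51}}
products = {10: {"category": "dairy"}, 11: {"category": "bakery"}}
sales = [
    {"customer_id": 1, "product_id": 10, "units": 2, "unit_price": 3.0},
    {"customer_id": 2, "product_id": 10, "units": 1, "unit_price": 3.0},
    {"customer_id": 1, "product_id": 11, "units": 4, "unit_price": 2.5},
]

def revenue_by(*dims):
    """Total revenue grouped by the chosen dimensions, e.g. 'category', 'gender'."""
    totals = defaultdict(float)
    for s in sales:
        # Join each sale with its customer and product attributes.
        attrs = {**customers[s["customer_id"]], **products[s["product_id"]]}
        key = tuple(attrs[d] for d in dims)
        totals[key] += s["units"] * s["unit_price"]
    return dict(totals)

print(revenue_by("category"))            # roll up to one dimension
print(revenue_by("category", "gender"))  # drill down to two dimensions
```

Calling `revenue_by()` with no dimensions gives the grand total, which is the coarsest view of the same data.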


Motivation: Store Layout Design


http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

Techniques: Store Layout Design

•  Customer purchase patterns
•  Business strategies


Motivation: Community Detection


http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

Techniques: Community Detection

•  Similarity between objects
•  Partitioning objects into groups
    – No guidance about what a group is
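A minimal sketch of the two ingredients above: a similarity measure, and a partition built from it without any labels. Jaccard similarity and threshold-based linking are just one possible choice, not a method prescribed by these slides; the users, pages, and the function `group_by_similarity` are hypothetical.

```python
def jaccard(a, b):
    """Similarity between two non-empty sets: shared elements over all elements."""
    return len(a & b) / len(a | b)

# Hypothetical users described by the pages they follow.
follows = {
    "ann": {"p1", "p2", "p3"},
    "bob": {"p1", "p2"},
    "cat": {"p7", "p8"},
    "dan": {"p7", "p8", "p9"},
}

def group_by_similarity(items, threshold=0.5):
    """Partition items so that any two with similarity >= threshold share a group."""
    groups = []
    for name in sorted(items):
        # Existing groups that contain at least one sufficiently similar member.
        linked = [g for g in groups
                  if any(jaccard(items[name], items[m]) >= threshold for m in g)]
        merged = {name}.union(*linked)
        groups = [g for g in groups if g not in linked] + [merged]
    return sorted(sorted(g) for g in groups)

print(group_by_similarity(follows))
```

No group labels are given in advance; the groups emerge only from the similarity threshold, which is itself an assumption the analyst has to choose.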


Motivation: Disease Prediction


Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …

What medical problems does this patient have?

Techniques: Disease Prediction

•  Features
•  Model


Motivation: Fraud Detection


http://i.imgur.com/ckkoAOp.gif

Techniques: Fraud Detection

•  Features
•  Dissimilarity
•  Groups and noise
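One simple way to separate groups from noise, sketched under stated assumptions: describe each transaction by numeric features, use Euclidean distance as the dissimilarity, and flag a point as noise when it has too few close neighbours. The data, the parameters `eps` and `min_neighbors`, and the function name are hypothetical, and the features are left unscaled for brevity.

```python
from math import dist

# Hypothetical transactions as (amount, hour-of-day) feature vectors.
points = [(20, 9), (22, 10), (21, 9), (19, 11), (950, 3)]

def noise_points(points, eps=5.0, min_neighbors=2):
    """Return points with fewer than `min_neighbors` other points within distance `eps`."""
    noise = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and dist(p, q) <= eps)
        if neighbors < min_neighbors:
            noise.append(p)
    return noise

print(noise_points(points))
```

The four small daytime purchases sit close together and form a group; the large 3 a.m. transaction has no neighbours and is the only point reported.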


http://i.stack.imgur.com/tRDGU.png

What Is Data Science About?

•  Data
•  Extraction of knowledge from data
•  Continuation of data mining and knowledge discovery from data (KDD)


What Is Data?

•  Values of qualitative or quantitative variables belonging to a set of items

•  Represented in a structure, e.g., tabular, tree or graph structure

•  Typically the results of measurements
•  As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived


What Is Information?

•  “Knowledge communicated or received concerning a particular fact or circumstance”

•  Conceptually, information is the message (utterance or expression) being conveyed

•  Cannot be predicted
•  Can resolve uncertainty


What Is Knowledge?

•  Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education

•  Implicit knowledge: practical skill or expertise
•  Explicit knowledge: theoretical understanding of a subject


Data Systems

•  A data system answers queries based on data acquired in the past

•  Base data – the rawest data not derived from anywhere else

•  Knowledge – information derived from the base data


Dealing with Data – Querying

•  Given a set of student records about name, age, courses taken and grades

•  Simple queries
    – What is John Doe’s age?
•  Aggregate queries
    – What is the average GPA of all students at this school?
•  Queries can be arbitrarily complicated
    – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
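The three kinds of queries can be sketched over a tiny in-memory table. The records, course names, and function names below are hypothetical, and "less than 3% apart" is interpreted here as a difference of fewer than 3 points on a 0-100 grade scale, which is an assumption.

```python
from itertools import combinations

# Hypothetical student records; names and grades are illustrative only.
students = {
    "John Doe": {"age": 21, "gpa": 3.2,
                 "grades": {"CMPT 101": 80, "MATH 151": 72, "STAT 270": 65}},
    "Jane Roe": {"age": 22, "gpa": 3.6,
                 "grades": {"CMPT 101": 82, "MATH 151": 90, "STAT 270": 66}},
    "Sam Poe": {"age": 20, "gpa": 2.8,
                "grades": {"CMPT 101": 60, "MATH 151": 71}},
}

def age_of(name):
    """Simple query: one attribute of one individual."""
    return students[name]["age"]

def average_gpa():
    """Aggregate query: summarize an attribute over all students."""
    return sum(s["gpa"] for s in students.values()) / len(students)

def closest_pair(threshold=3):
    """Complex query: the pair of students whose grades differ by less than
    `threshold` points in the largest number of shared courses."""
    def close_courses(a, b):
        ga, gb = students[a]["grades"], students[b]["grades"]
        return sum(1 for c in ga.keys() & gb.keys() if abs(ga[c] - gb[c]) < threshold)
    return max(combinations(sorted(students), 2), key=lambda p: close_courses(*p))

print(age_of("John Doe"), average_gpa(), closest_pair())
```

The complex query compares every pair of students, so its cost grows quadratically with the number of students; this is part of why arbitrarily complicated queries are harder to answer than lookups and aggregates.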


Queries

•  A precise request for information
•  Subjects in databases and information retrieval
    – Databases: structured queries on structured (e.g., relational) data
    – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
•  Important assumptions
    – Information needs
    – Query languages


Data-driven Exploration

•  What should be the next strategy of a company?
    – A lot of data: sales, human resources, production, tax, service cost, …
•  The question cannot be translated into a precise request for information (i.e., a query)
•  Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data


Data-driven Thinking

•  Starting with some simple queries
•  New queries are raised by consuming the results of previous queries
•  No ultimate query in design!
    – But many queries can be answered using DB/IR techniques


The Art of Data-driven Thinking

•  The way of generating queries remains an art!
    – Different people may derive different results using the same data

“If you torture the data long enough, it will confess” – Ronald H. Coase

•  More often than not, more data may be needed
    – datafication


Queries for Data-driven Thinking

•  Probe queries – finding information about specific individuals

•  Aggregation – finding information about groups
•  Pattern finding – finding commonality in population
•  Association and correlation – finding connections among individuals and groups
•  Causality analysis – finding causes and consequences


What Is Data Mining?

•  Broader sense: the art of data-driven thinking

•  Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
    – Methods and tools for answering various types of queries in the data mining process in the broader sense


Machine Learning

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

– Tom M. Mitchell

•  Essentially, learn the distribution of data


Data mining vs. Machine Learning

•  Machine learning focuses on prediction, based on known properties learned from the training data

•  Data mining focuses on the discovery of (previously) unknown properties in the data



The KDD Process

Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
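The five steps of the KDD process can be read as a pipeline of functions, each consuming the previous stage's output. The sketch below is only an illustration: the records, the age cut-off, and the "frequent pair" notion of a pattern are all assumptions chosen to keep each stage to one line.

```python
from collections import Counter

# Hypothetical raw data.
raw = [
    {"id": 1, "age": 23, "item": "milk"},
    {"id": 2, "age": None, "item": "bread"},
    {"id": 3, "age": 37, "item": "milk"},
    {"id": 4, "age": 41, "item": "milk"},
    {"id": 5, "age": 29, "item": "bread"},
]

def select(data):
    """Selection: keep only the attributes of interest (target data)."""
    return [{"age": r["age"], "item": r["item"]} for r in data]

def preprocess(data):
    """Preprocessing: drop records with missing values."""
    return [r for r in data if r["age"] is not None]

def transform(data):
    """Transformation: discretize age into two groups."""
    return [("under 30" if r["age"] < 30 else "30+", r["item"]) for r in data]

def mine(data):
    """Data mining: count how often each (age group, item) pair occurs."""
    return Counter(data)

def interpret(patterns, min_count=2):
    """Interpretation/evaluation: keep only patterns seen at least `min_count` times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

knowledge = interpret(mine(transform(preprocess(select(raw)))))
print(knowledge)
```

In practice the process is iterative: what is learned at the evaluation step often sends the analyst back to change the selection or transformation.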

Data Mining R&D

•  New problem identification
•  Data collection and transformation
•  Algorithm design and implementation
•  Evaluation
    – Effectiveness evaluation
    – Efficiency & scalability evaluation
•  Deployment and business solution


Data Mining on Big Data

“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”

– Hal Varian, Google’s Chief Economist


What Is Big Data?

•  No quantitative definition!
•  “Big data is like teenage sex
    – everyone talks about it,
    – nobody really knows how to do it,
    – everyone thinks everyone else is doing it,
    – so everyone claims they are doing it...”

– Dan Ariely


Data Volume vs. Storage Cost

•  The unit cost of disk storage decreases dramatically


Year    Unit cost
1956    $10,000/MB
1980    $193/MB
1990    $9/MB
2000    $6.9/GB
2010    $0.08/GB
2013    $0.06/GB

http://ns1758.ca/winch/winchest.html
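To compare the figures above on one scale, they can be converted to dollars per gigabyte. This is a small arithmetic sketch; it assumes decimal units (1 GB = 1,000 MB), and the name `dollars_per_gb` is hypothetical.

```python
# Unit costs copied from the figures above, as (amount in dollars, unit).
costs = {
    1956: (10_000, "MB"),
    1980: (193, "MB"),
    1990: (9, "MB"),
    2000: (6.9, "GB"),
    2010: (0.08, "GB"),
    2013: (0.06, "GB"),
}

MB_PER_GB = 1_000  # assumption: decimal units; use 1_024 for binary units

def dollars_per_gb(year):
    """Convert the listed unit cost for `year` to dollars per gigabyte."""
    amount, unit = costs[year]
    return amount * MB_PER_GB if unit == "MB" else amount

# How many times cheaper a gigabyte was in 2013 than in 1956.
drop_factor = dollars_per_gb(1956) / dollars_per_gb(2013)
print(f"{drop_factor:,.0f}x cheaper")
```

Under these assumptions, a gigabyte cost $10,000,000 in 1956, so the 2013 price is lower by a factor of more than one hundred million.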

Big Data – Volume

“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”

— Wikipedia


H1N1 Pandemic Crisis (2009)

•  A new flu virus combining elements of the viruses that cause bird flu and swine flu
•  The US Centers for Disease Control and Prevention (CDC) requested reports from doctors, tabulated the numbers once a week
    – A two-week lag in data collection
•  Google used user search keywords to predict the spread of winter flu
    – A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
•  Some things can be done with large-scale data but cannot be done with smaller-scale data


Detecting Hurricane – Unsupervised


D. Kang, D. Jiang, J. Pei, Z. Liao, X. Sun, and H-J. Choi. "Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach". In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM'11), Hong Kong, China, February 9-12, 2011.

Big Data: Volume

•  Every day, about 7 billion shares change hands on US equity markets
    – About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
•  In Q2 2015
    – Facebook has 1.49 billion active users
    – WeChat has 600 million active users, 100 million outside China
    – LinkedIn has 380 million active users
    – Twitter has 304 million active users


Velocity

•  Google processes 24+ petabytes of data per day

•  Facebook gets 10+ million new photos uploaded every hour

•  Facebook members like or leave a comment 3+ billion times per day

•  YouTube users upload 1+ hour of video every second

•  400+ million tweets per day


What Has Changed?

•  The 1880 census in the US took 8 years to complete
    – The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
•  It is essential to get data that is not only accurate but also timely
    – Statisticians use sampling to estimate
•  Recently, with new technologies, the ways of data collection and transmission have fundamentally changed


Sampling for Volume/Velocity?

•  Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
    – The sample should be truly random
•  On a data set of hundreds or thousands of attributes, can sampling help in
    – Finding subcategories of attribute combinations?
    – Finding outliers and exceptions?
•  Big data contains signals of different strengths
    – Not noise, but weaker and weaker signals that may still be interesting and important
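A small experiment illustrates both sides of the sampling argument: a random sample estimates a population-wide average well, but it tends to miss rare records. The synthetic population, the 1% sample size, and the `rare` flag are assumptions made for the sketch.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 records, 10 of which carry a rare flag.
population = [{"value": i % 100, "rare": i % 10_000 == 0} for i in range(100_000)]

# A simple random sample of 1% of the records, drawn without replacement.
sample = rng.sample(population, 1_000)

pop_mean = mean(r["value"] for r in population)
sample_mean = mean(r["value"] for r in sample)
rare_in_population = sum(r["rare"] for r in population)
rare_in_sample = sum(r["rare"] for r in sample)

print(f"mean: population {pop_mean}, sample {sample_mean}")
print(f"rare records: population {rare_in_population}, sample {rare_in_sample}")
```

With 10 rare records and a 1% sample, the expected number of rare records in the sample is 10 × 0.01 = 0.1, so most samples contain none of them even though the sample mean is close to the population mean.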


Big Data – Lytro Pictures

•  Lytro pictures record the whole light field
    – Photographers can decide later which parts to focus on
•  Big data tries to record as much information as possible
    – Analysts can decide later what to extract from big data
    – Both advantages and challenges


Veracity

•  “1 in 3 business leaders don't trust the information they use to make decisions”

•  Assuming a slowly growing total cost budget, there is a tradeoff between data volume and data quality

•  Loss of veracity in combining different types of information from different sources

•  Loss of veracity in data extraction, transformation, and processing


Variety

•  Integrating data capturing different aspects of a data object
    – Vancouver Canucks: game video, technical statistics, social media, …
    – Different pieces are in different formats
•  Different views of the same data object from different sources
    – Did the soccer ball pass the goal line?
    – The views may not be consistent


Four V-challenges

•  Volume: massive scale and growth, 40% per year in global data generated

•  Velocity: real time data generation and consumption

•  Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources

•  Veracity


Is Big Data Really New?

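•  People were aware of the existence of big data a long time ago, but no one could access it until very recently
    – (Genesis 28:15) “I am with you and will watch over you wherever you go”
    – “Whispers in a secret room are heard in Heaven like thunder; deceit in a dark room is seen by the divine eye like lightning; the rewards of good and evil follow like a shadow” (密室私语，天闻如雷；暗室欺心，神目如电；善恶之报，如影随行)
    – Similar statements in the Quran and the Sutras
•  What has changed?
    – How data is connected with people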


Diversity in Data Usage

•  In the past, only very few projects could afford to be data-intensive

•  Nowadays, a great many applications are (naturally) data-intensive


Datafication

•  Extract data about an object or event in a quantified way so that it can be analyzed
    – Different from digitalization
•  An important feature of big data
•  Key: new data, new applications, new opportunities


New Values of Datafication

•  Example: Captcha and ReCaptcha (Luis von Ahn)

•  How to create new values of data and datafication?
    – Connecting data with new users
    – Connecting different pieces of data to present a bigger picture
•  Important techniques
    – Data aggregation
    – Extended datafication


Big Data Players

•  Data holders
•  Data specialists
•  Big-data mindset leaders
•  A capable company may play 2 or 3 roles at the same time
•  What is most important: big-data mindset, skills, or data itself?


Privacy

•  “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”

— Executive Office of the (US) President


A Beautiful Story about Big Data


Source: http://abcnews.go.com/blogs/lifestyle/2014/01/the-genius-okcupid-hack-that-led-to-true-love/

Romantics in the Big Data Age

•  Datafication and feature selection
•  Using data about many people (e.g., 20,000 women in McKinlay’s story)
•  Ranking and drilling down into groups
•  Connecting data analytics with practice (Chris McKinlay dated 88 until he met Christine Wang)


Keep in Mind

“Our industry does not respect tradition – it only respects innovation.”

– Satya Nadella



Goals of This Course

•  Data-driven thinking – towards being a (big) data scientist

•  Principles and hands-on skills of data mining, particularly in the context of big data
    – Identifying new data mining problems
    – Data mining algorithm design
    – Data mining applications

•  Novel problems for upcoming research

Format

•  Due to the fast progress in data mining, we will go beyond the textbook substantially

•  Active classroom discussion
•  Open questions and brainstorming
•  Textbook: Data Mining – Concepts and Techniques (3rd ed)


Read – Try – Think

•  Reading
    – (required) Textbook and a small number of research papers
    – You have to have the 3rd ed of the textbook!
    – (open end, not covered by the exam) Technical and non-technical materials
•  Trying
    – Assignments and a project
•  Thinking
    – Examine everything from a data scientist's angle, starting today



Data Mining: History

•  1989 IJCAI Workshop on Knowledge Discovery in Databases
    – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
•  91-94 Workshops on Knowledge Discovery in Databases
    – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)


Data Mining: History (cont’d)

•  95-98 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    – Journal of Data Mining and Knowledge Discovery (1997)
•  ACM SIGKDD conferences since 1998 and SIGKDD Explorations
•  More conferences on data mining
    – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
•  ACM Transactions on KDD starting in 2007


KDD Conferences

•  ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining (KDD) (best)

•  IEEE International Conference on Data Mining (ICDM)

•  SIAM Data Mining Conference (SDM)


Regional Conferences

•  Conference on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
    – European KDD
    – Co-organized with ECML (European Conference on Machine Learning)
•  Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
    – Asian KDD


Journals

•  ACM Transactions on KDD
•  IEEE Transactions on Knowledge and Data Engineering (TKDE)
•  Data Mining and Knowledge Discovery (DAMI or DMKD)
•  Knowledge and Information Systems
•  KDD Explorations

Differences between 459 and 741

•  CMPT 459 – undergraduate version
    – Basic concepts and methods
    – What, why, how
    – Focus: essential data mining methods and variations
•  CMPT 741 – graduate version
    – Focus: how to use the principles and ideas to solve new problems – new methods may be needed!
    – For course-based/Big Data Professional program students, something in between


Student Groups

•  459 students
•  741 course-based/Big Data Professional program students
•  741 thesis-based students
•  Different groups will be trained differently to meet their objectives


Knowing Your Peers


Preparation


Workload


Evaluation

•  5 regular assignments
    – Exam questions will be similar to those in regular assignments
•  5 mini assignments
    – Team work (2 students at a time)
    – One has to team up with different students in different mini assignments
•  Project
    – Mining a real data set
•  Exam
    – Solving questions using the materials covered in the class or their simple combinations


Lectures

•  Cover major ideas and critical details
•  To-do-list specifies the materials one should understand
•  Assignments are the hints for the final exam
•  Extended materials are only for students who want to learn more, and are not required in the exam


To-Do-List

•  Read Chapter 1 in the textbook
•  Understand the concepts mentioned in Section 1.8 (some of them are omitted in the lecture notes)

