introduction - computing science - simon fraser … business intelligence jian pei: cmpt 741/459...
TRANSCRIPT
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Mining -- Introduction 2
Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
Product information (Product-id, category, manufacturer, made-in, stock-price, …)
Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
Business queries:
Techniques: Business Intelligence
• Multidimensional data analysis
• Online query answering
• Interactive data exploration
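A minimal sketch of multidimensional analysis over the three tables above. The attribute names follow the slide; the toy rows and the `rollup` helper are hypothetical, chosen only to show one group-by at two levels of detail.

```python
from collections import defaultdict

# Hypothetical toy records using attribute names from the slide.
customers = {1: {"gender": "F", "age": 34}, 2: {"gender": "M", "age": 51}}
products = {10: {"category": "dairy"}, 11: {"category": "bakery"}}
sales = [
    {"customer_id": 1, "product_id": 10, "units": 2, "unit_price": 3.0},
    {"customer_id": 1, "product_id": 11, "units": 1, "unit_price": 4.0},
    {"customer_id": 2, "product_id": 10, "units": 5, "unit_price": 3.0},
]

def rollup(sales, dims):
    """Total revenue grouped by the chosen dimensions (join, then group by)."""
    totals = defaultdict(float)
    for s in sales:
        # Join each sale with its customer and product attributes.
        row = {**customers[s["customer_id"]], **products[s["product_id"]]}
        key = tuple(row[d] for d in dims)
        totals[key] += s["units"] * s["unit_price"]
    return dict(totals)

by_category = rollup(sales, ["category"])                     # coarse view
by_gender_category = rollup(sales, ["gender", "category"])    # drill-down
```

Changing the `dims` list is the interactive part: each choice of dimensions answers a different business query over the same data.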
Motivation: Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Techniques: Store Layout Design
• Customer purchase patterns
• Business strategies
Motivation: Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Techniques: Community Detection
• Similarity between objects
• Partitioning objects into groups
– No guidance about what a group is
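A minimal sketch of partitioning objects into groups using only a similarity measure, with no labels saying what a group is. The points, the distance threshold, and the `group_by_similarity` helper are hypothetical; this is one simple grouping rule (linking chains of close neighbours), not the method used in the course.

```python
import math

def group_by_similarity(points, max_dist):
    """Put two points in the same group when a chain of close neighbours links them."""
    groups = []
    unvisited = set(range(len(points)))
    while unvisited:
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            # Unvisited points close enough to point i join its group.
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= max_dist]
            for j in near:
                unvisited.remove(j)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    return sorted(groups)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = group_by_similarity(points, max_dist=1.5)
```

The number of groups is not given in advance; it falls out of the similarity threshold, which is the only guidance the procedure receives.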
Motivation: Disease Prediction
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …
What medical problems does this patient have?
Techniques: Disease Prediction
• Features
• Model
Motivation: Fraud Detection
http://i.imgur.com/ckkoAOp.gif
Techniques: Fraud Detection
• Features
• Dissimilarity
• Groups and noise
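A minimal sketch of how features, dissimilarity, and "groups and noise" fit together. The two features, the thresholds, and the `find_noise` helper are hypothetical; the rule here (too few neighbours within a radius) is just one simple way to separate noise from groups.

```python
import math

def find_noise(points, eps, min_neighbours):
    """Return indices of points with fewer than min_neighbours other points within eps."""
    noise = []
    for i, p in enumerate(points):
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps
        )
        if count < min_neighbours:
            noise.append(i)
    return noise

# Hypothetical features per transaction: (amount in $100s, hour of day / 10).
transactions = [(0.2, 1.2), (0.3, 1.3), (0.25, 1.1), (0.3, 1.2), (9.0, 0.3)]
suspects = find_noise(transactions, eps=0.5, min_neighbours=2)
```

Transactions that sit inside a dense group look normal; the one far from every group is flagged for inspection.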
http://i.stack.imgur.com/tRDGU.png
What Is Data Science About?
• Data
• Extraction of knowledge from data
• Continuation of data mining and knowledge discovery from data (KDD)
What Is Data?
• Values of qualitative or quantitative variables belonging to a set of items
• Represented in a structure, e.g., tabular, tree or graph structure
• Typically the results of measurements
• As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived
What Is Information?
• “Knowledge communicated or received concerning a particular fact or circumstance”
• Conceptually, information is the message (utterance or expression) being conveyed
• Cannot be predicted
• Can resolve uncertainty
What Is Knowledge?
• Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
• Implicit knowledge: practical skill or expertise
• Explicit knowledge: theoretical understanding of a subject
Data Systems
• A data system answers queries based on data acquired in the past
• Base data – the rawest data not derived from anywhere else
• Knowledge – information derived from the base data
Dealing with Data – Querying
• Given a set of student records about name, age, courses taken and grades
• Simple queries
– What is John Doe’s age?
• Aggregate queries
– What is the average GPA of all students at this school?
• Queries can be arbitrarily complicated
– Find the students X and Y whose grades are less than 3% apart in as many courses as possible
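The three kinds of query can be sketched on a toy data set. Everything below is hypothetical: the records, the course names, the use of percentage grades in place of GPA, and the reading of "less than 3% apart" as fewer than 3 percentage points.

```python
from itertools import combinations
from statistics import mean

# Hypothetical student records; grades are percentages per course.
students = {
    "John Doe": {"age": 20, "grades": {"CMPT 459": 82, "CMPT 354": 75, "MATH 232": 90}},
    "Jane Roe": {"age": 22, "grades": {"CMPT 459": 84, "CMPT 354": 77, "MATH 232": 70}},
    "Ann Poe": {"age": 21, "grades": {"CMPT 459": 60, "CMPT 354": 76}},
}

# Simple query: look up one attribute of one individual.
john_age = students["John Doe"]["age"]

# Aggregate query: average, over all students, of each student's mean grade.
average_grade = mean(mean(s["grades"].values()) for s in students.values())

def close_courses(a, b, tolerance=3):
    """Number of shared courses where the two students' grades differ by less than tolerance."""
    ga, gb = students[a]["grades"], students[b]["grades"]
    shared = ga.keys() & gb.keys()
    return sum(1 for c in shared if abs(ga[c] - gb[c]) < tolerance)

# Complicated query: the pair of students close in as many courses as possible.
best_pair = max(combinations(students, 2), key=lambda pair: close_courses(*pair))
```

The first two queries map directly onto a lookup and an aggregate; the third needs a comparison over all pairs, which is why such queries get expensive as the data grows.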
Queries
• A precise request for information
• Subjects in databases and information retrieval
– Databases: structured queries on structured (e.g., relational) data
– Information retrieval: unstructured queries on unstructured (e.g., text, image) data
• Important assumptions
– Information needs
– Query languages
Data-driven Exploration
• What should be the next strategy of a company?
– A lot of data: sales, human resources, production, tax, service cost, …
• The question cannot be translated into a precise request for information (i.e., a query)
• Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data
Data-driven Thinking
• Starting with some simple queries
• New queries are raised by consuming the results of previous queries
• No ultimate query in design!
– But many queries can be answered using DB/IR techniques
The Art of Data-driven Thinking
• The way of generating queries remains an art!
– Different people may derive different results using the same data
“If you torture the data long enough, it will confess” – Ronald H. Coase
• More often than not, more data may be needed
– Datafication
Queries for Data-driven Thinking
• Probe queries – finding information about specific individuals
• Aggregation – finding information about groups
• Pattern finding – finding commonality in population
• Association and correlation – finding connections among individuals and groups
• Causality analysis – finding causes and consequences
What Is Data Mining?
• Broader sense: the art of data-driven thinking
• Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
– Methods and tools for answering various types of queries in the data mining process in the broader sense
Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”
– Tom M. Mitchell
• Essentially, learn the distribution of data
Data mining vs. Machine Learning
• Machine learning focuses on prediction, based on known properties learned from the training data
• Data mining focuses on the discovery of (previously) unknown properties in the data
The KDD Process
Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
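The stages of the KDD process can be sketched as a chain of small functions, one per step. The toy records, the function names, and the choice of item frequency as the mining step are all hypothetical; the sketch only illustrates how each stage hands its output to the next.

```python
from collections import Counter

# Hypothetical raw data: purchase records from several departments.
raw = [
    {"dept": "grocery", "items": ["Milk", "bread "]},
    {"dept": "grocery", "items": ["milk", "Bread", "eggs"]},
    {"dept": "grocery", "items": None},  # missing value
    {"dept": "pharmacy", "items": ["aspirin"]},
]

def select(data):
    """Selection: data -> target data (keep only the records of interest)."""
    return [r for r in data if r["dept"] == "grocery"]

def preprocess(data):
    """Preprocessing: drop records with missing item lists."""
    return [r for r in data if r["items"]]

def transform(data):
    """Transformation: normalise item names and represent each record as a set."""
    return [frozenset(i.strip().lower() for i in r["items"]) for r in data]

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    return Counter(item for b in baskets for item in b)

def interpret(patterns, n_baskets, min_support=1.0):
    """Interpretation/evaluation: keep items whose support reaches the threshold."""
    return sorted(i for i, c in patterns.items() if c / n_baskets >= min_support)

baskets = transform(preprocess(select(raw)))
knowledge = interpret(mine(baskets), len(baskets))
```

In practice the process is iterative: a disappointing result at the interpretation step usually sends the analyst back to an earlier stage with a different selection or transformation.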
Data Mining R&D
• New problem identification
• Data collection and transformation
• Algorithm design and implementation
• Evaluation
– Effectiveness evaluation
– Efficiency & scalability evaluation
• Deployment and business solution
Data Mining on Big Data
“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”
– Hal Varian, Google’s Chief Economist
What Is Big Data?
• No quantitative definition!
• “Big data is like teenage sex
– everyone talks about it,
– nobody really knows how to do it,
– everyone thinks everyone else is doing it,
– so everyone claims they are doing it...”
– Dan Ariely
Data Volume vs. Storage Cost
• The unit cost of disk storage decreases dramatically
Year    Unit cost
1956    $10,000/MB
1980    $193/MB
1990    $9/MB
2000    $6.9/GB
2010    $0.08/GB
2013    $0.06/GB
http://ns1758.ca/winch/winchest.html
Big Data – Volume
“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”
— Wikipedia
H1N1 Pandemic Crisis (2009)
• A new flu virus combining elements of the viruses that cause bird flu and swine flu
• The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week
– A two-week lag in data collection
• Google used user search keywords to predict the spread of winter flu
– A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
• Some things can be done with large-scale data but cannot be done with data at a smaller scale
Detecting Hurricane – Unsupervised
D. Kang, D. Jiang, J. Pei, Z. Liao, X. Sun, and H-J. Choi. "Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach". In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM'11), Hong Kong, China, February 9-12, 2011.
Big Data: Volume
• Every day, about 7 billion shares change hands on US equity markets
– About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
• In Q2 2015
– Facebook has 1.49 billion active users
– Wechat has 600 million active users, 100 million outside China
– LinkedIn has 380 million active users
– Twitter has 304 million active users
Velocity
• Google processes 24+ petabytes of data per day
• Facebook gets 10+ million new photos uploaded every hour
• Facebook members like or leave a comment 3+ billion times per day
• YouTube users upload 1+ hour of video every second
• 400+ million tweets per day
What Has Been Changed?
• The 1880 census in the US took 8 years to complete
– The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
• It is essential to get data that is not only accurate but also timely
– Statisticians use sampling to estimate
• Recently, with new technologies, the ways of data collection and transmission have been fundamentally changed
Sampling for Volume/Velocity?
• Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
– The sample should be truly random
• On a data set of hundreds or thousands of attributes, can sampling help in
– Finding subcategories of attribute combinations?
– Finding outliers and exceptions?
• Big data contains signals of different strengths
– Not noise, but weaker and weaker signals that may still be interesting and important
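Both sides of the sampling argument can be put into numbers. The sketch below uses two standard facts: the standard error of a sample mean is sigma / sqrt(n), and the chance that a random sample drawn without replacement misses every one of k rare records is a product of k ratios. The population size, sample sizes, sigma, and function names are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean of a random sample of size n."""
    return sigma / math.sqrt(n)

def prob_sample_misses_all(population_size, sample_size, rare_count):
    """Probability that a random sample without replacement contains none of the rare records."""
    p = 1.0
    for i in range(rare_count):
        p *= (population_size - sample_size - i) / (population_size - i)
    return p

# Marginal gain shrinks: the same 4x increase in data buys less and less accuracy.
gain_small = standard_error(10, 100) - standard_error(10, 400)
gain_large = standard_error(10, 10_000) - standard_error(10, 40_000)

# A 1% sample of one million records, five of which are outliers.
miss = prob_sample_misses_all(1_000_000, 10_000, 5)
```

Here `gain_small` is ten times `gain_large`, which supports sampling for aggregate estimates, while `miss` is roughly 0.95: a 1% sample would most likely contain none of the five outliers.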
Big Data – Lytro Pictures
• Lytro pictures record the whole light field
– Photographers can decide later which parts to focus on
• Big data tries to record as much information as possible
– Analysts can decide later what to extract from big data
– Both advantages and challenges
Veracity
• “1 in 3 business leaders don't trust the information they use to make decisions”
• Assuming a slowly growing total cost budget, tradeoff between data volume and data quality
• Loss of veracity in combining different types of information from different sources
• Loss of veracity in data extraction, transformation, and processing
Variety
• Integrating data capturing different aspects of a data object
– Vancouver Canucks: game video, technical statistics, social media, …
– Different pieces are in different formats
• Different views of the same data object from different sources
– Did the soccer ball pass the goal line?
– The views may not be consistent
Four V-challenges
• Volume: massive scale and growth, 40% per year in global data generated
• Velocity: real time data generation and consumption
• Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources
• Veracity
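A quick piece of arithmetic on the volume figure above: what 40% growth per year means when compounded. Only the 40% rate comes from the slide; the 10-year horizon is a hypothetical choice.

```python
import math

annual_growth = 0.40  # from the slide: 40% per year

# Years needed for the data volume to double under compound growth.
doubling_years = math.log(2) / math.log(1 + annual_growth)

# Growth factor after a hypothetical 10-year horizon.
ten_year_factor = (1 + annual_growth) ** 10
```

At this rate the volume doubles in about 2.06 years and grows roughly 29-fold in a decade.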
Is Big Data Really New?
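• People were aware of the existence of big data a long time ago, but no one could access it until very recently
– (Genesis 28:15) “I am with you and will watch over you wherever you go”
– “密室私语,天闻如雷;暗室欺心,神目如电;善恶之报,如影随行” (roughly: whispers in a sealed room are heard in heaven like thunder; deceit in a dark room is seen by the gods like lightning; the reward for good and evil follows like a shadow)
– Similar statements in Quran and Sutra
• What has changed?
– How data is connected with people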
Diversity in Data Usage
• In the past, only very few projects could afford to be data-intensive
• Nowadays, a great many applications are (naturally) data-intensive
Datafication
• Extract data about an object or event in a quantified way so that it can be analyzed
– Different from digitalization
• An important feature of big data
• Key: new data, new applications, new opportunities
New Values of Datafication
• Example: Captcha and ReCaptcha (Luis von Ahn)
• How to create new values of data and datafication?
– Connecting data with new users
– Connecting different pieces of data to present a bigger picture
• Important techniques
– Data aggregation
– Extended datafication
Big Data Players
• Data holders
• Data specialists
• Big-data mindset leaders
• A capable company may play 2 or 3 roles at the same time
• What is most important: big-data mindset, skills, or data itself?
Privacy
• “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”
— Executive Office of the (US) President
A Beautiful Story about Big Data
Source: http://abcnews.go.com/blogs/lifestyle/2014/01/the-genius-okcupid-hack-that-led-to-true-love/
Romantics in the Big Data Age
• Datafication and feature selection
• Using data about many people (e.g., 20,000 women in McKinlay’s story)
• Ranking and drilling down into groups
• Connecting data analytics with practice (Chris McKinlay dated 88 until he met Christine Wang)
Keep in Mind
“Our industry does not respect tradition – it only respects innovation.”
– Satya Nadella
Goals of This Course
• Data-driven thinking – towards being a (big) data scientist
• Principles and hands-on skills of data mining, particularly in the context of big data
– Identifying new data mining problems
– Data mining algorithm design
– Data mining applications
• Novel problems for upcoming research
Format
• Due to the fast progress in data mining, we will go beyond the textbook substantially
• Active classroom discussion
• Open questions and brainstorming
• Textbook: Data Mining – Concepts and Techniques (3rd ed)
Read – Try – Think
• Reading
– (required) Textbook and a small number of research papers
– You have to have the 3rd ed of the textbook!
– (open end, not covered by the exam) Technical and non-technical materials
• Trying
– Assignments and a project
• Thinking
– Examine everything from a data scientist angle from today
Data Mining: History
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 91-94 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
Data Mining: History (cont’d)
• 95-98 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007
KDD Conferences
• ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining (KDD) (best)
• IEEE International Conference on Data Mining (ICDM)
• SIAM Data Mining Conference (SDM)
Regional Conferences
• Conference on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– European KDD
– Co-organized with ECML (European Conference on Machine Learning)
• Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
– Asian KDD
Journals
• ACM Transactions on KDD
• IEEE Transactions on Knowledge and Data Engineering (TKDE)
• Data Mining and Knowledge Discovery (DAMI or DMKD)
• Knowledge and Information Systems
• KDD Explorations
Differences between 459 and 741
• CMPT 459 – undergraduate version
– Basic concepts and methods
– What, why, how
– Focus: essential data mining methods and variations
• CMPT 741 – graduate version
– Focus: how to use the principles and ideas to solve new problems – new methods may be needed!
– For course-based/Big Data Professional program students, something in between
Student Groups
• 459 students
• 741 course-based/Big Data Professional program students
• 741 thesis-based students
• Different groups will be trained differently to meet their objectives
Evaluation
• 5 regular assignments
– Exam questions will be similar to those in regular assignments
• 5 mini assignments
– Team work (2 students at a time)
– One has to team up with different students in different mini assignments
• Project
– Mining a real data set
• Exam
– Solving questions using the materials covered in the class or their simple combinations
Lectures
• Cover major ideas and critical details
• To-do-list specifies the materials one should understand
• Assignments are the hints for the final exam
• Extended materials are only for students who want to learn more, and are not required in the exam