leskovec - stanford universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced...

51
CS345a: Data Mining Jure Leskovec Stanford University

Upload: dokiet

Post on 23-Apr-2018

229 views

Category:

Documents


7 download

TRANSCRIPT

Page 1: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

CS345a: Data MiningJure LeskovecStanford University

Page 2: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Instructors: Instructors: Jure Leskovec A d R j Anand Rajaraman

TAs:Abhi h k G t Abhishek Gupta Roshan SumbalyR h t 345 i 0910 t ff@ Reach us at cs345a‐win0910‐staff@ lists.stanford.eduM i f t f d d / l / 345 More info on www.stanford.edu/class/cs345a

21/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 3: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Homework: 20% Homework: 20% Gradiance and other 3 l t d f th t 3 late days for the quarter All homeworks must be handed in

Project: 40% Project: 40% Start early

k l f Takes lots of time Final Exam: 40%

31/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 4: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Basic databases: CS145 Basic databases: CS145 Algorithms: Dynamic programming basic data structures Dynamic programming, basic data structures

Basic statistics: Moments t pi al distrib tions re ression Moments, typical distributions, regression, …

Programming:Y h i b t C /J ill b f l Your choice, but C++/Java will be very useful

We provide some background, but the classWe provide some background, but the class will be fast paced

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 4

Page 5: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Software implementation related to course Software implementation related to course subject matter

Should involve an original component or Should involve an original component or experiment

More later about available data and More later about available data and computing resources

It’s going to be fun and hard work

51/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 6: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Many past projects have dealt with Many past projects have dealt with collaborative filtering (advice based on what similar people do)similar people do) E.g., Netflix Challenge

Others have dealt with engineering solutions Others have dealt with engineering solutions to machine‐learning problems

Lots of interesting project ideas Lots of interesting project ideas If you can’t think of one please come talk to us

61/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 7: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Data: Data: Netflix WebBaseWebBase Wikipedia TRECC ShareThis Googleg

Infrastructure: Aster Data cluster on Amazon EC2 Supports both MapReduce and SQL

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 7

Page 8: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

ML generally requires a large ML generally requires a large “training set” of correctly classified data:classified data: Example: classify Web pages by topic

Hard to find well‐classified data: Open Directory works for page topics,Open Directory works for page topics, because work is collaborative and shared by many. Other good exceptions?

81/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 9: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Many problems require thought: Many problems require thought:1. Tell important pages from unimportant 

(PageRank)(PageRank)2. Tell real news from publicity (how?)3 Distinguish positive from negative product3. Distinguish positive from negative product 

reviews (how?)4 Feature generation in ML4. Feature generation in ML5. Etc., etc.

91/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 10: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Working in pairs OK but Working in pairs OK, but …1. No more than two per project.2 W ill t f i th f2. We will expect more from a pair than from an 

individual.3 The effort should be roughly evenly distributed3. The effort should be roughly evenly distributed.

101/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 11: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Map‐Reduce and HadoopMap Reduce and Hadoop Recommendation systems Collaborative filtering

Dimensionality reduction Dimensionality reduction Finding nearest neighbors Finding similar sets Minhashing, Locality‐Sensitive hashing

Clustering PageRank and measures of importance in graphs PageRank and measures of importance in graphs (link analysis) Spam detection Topic‐specific search

111/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 12: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Large scale machine learning Large scale machine learning Association rules, frequent itemsets Extracting structured data (relations) from the Extracting structured data (relations) from the Web

Clustering data Clustering data Graph partitioning Spam detection Spam detection Managing Web advertisements Mining data streams Mining data streams

121/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 13: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Lots of data is being collectedLots of data is being collected and warehoused  Web data, e‐commerce

h d / purchases at department/grocery stores Bank/Credit Card transactions

Computers are cheap and powerfulp p p Competitive Pressure is Strong  Provide better, customized services for an edge (e.g. in g ( gCustomer Relationship Management)

1/5/2010 13Jure Leskovec, Stanford CS345a: Data Mining

Page 14: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Data collected and stored at enormous speeds (GB/hour) remote sensors on a satellite telescopes scanning the skies microarrays generating gene expression datap

scientific simulations generating terabytes of data

T di i l h i i f ibl f Traditional techniques infeasible for raw data

Data mining helps scientists  in classifying and segmenting data in Hypothesis Formation

1/5/2010 14Jure Leskovec, Stanford CS345a: Data Mining

Page 15: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

There is often information “hidden” in the data that is not readily evidentnot readily evident

Human analysts take weeks to discover useful informationM h f th d t i l d t ll Much of the data is never analyzed at all

3,500,000

4,000,000

2,000,000

2,500,000

3,000,000 The Data Gap

T t l di k (TB) i 1995

00 000

1,000,000

1,500,000Total new disk (TB) since 1995

Number of

0

500,000

1995 1996 1997 1998 1999

analysts

1/5/2010 15Jure Leskovec, Stanford CS345a: Data Mining

Page 16: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Many Definitions Many Definitions Non‐trivial extraction of implicit, previously unknown and useful information from dataunknown and useful information from data

Exploration & analysis, by automatic or semi automatic means ofsemi‐automatic means, of large quantities of data in order to discoverin order to discover meaningful patterns 

1/5/2010 16Jure Leskovec, Stanford CS345a: Data Mining

Page 17: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Process of semi automatically analyzing large Process of semi‐automatically analyzing large databases to find patterns that are: valid: hold on new data with some certainty valid:  hold on new data with some certainty novel:  non‐obvious to the system

f l h ld b ibl t t th it useful:  should be possible to act on the item  understandable: humans should be able to interpret the patterninterpret the pattern

1/5/2010 17Jure Leskovec, Stanford CS345a: Data Mining

Page 18: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

A big data mining risk is that you will A big data‐mining risk is that you will “discover” patterns that are meaningless.

Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than yourmore places for interesting patterns than your amount of data will support, you are bound to find crapfind crap.

181/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 19: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

A parapsychologist in the 1950’s hypothesized A parapsychologist in the 1950 s hypothesized that some people had Extra‐Sensory PerceptionPerception

He devised an experiment where subjects were asked to guess 10 hidden cards – red orwere asked to guess 10 hidden cards – red or blue

He discovered that almost 1 in 1000 had ESP – He discovered that almost 1 in 1000 had ESP –they were able to get all 10 right

191/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 20: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

He told these people they had ESP and called He told these people they had ESP and called them in for another test of the same type

Alas he discovered that almost all of them Alas, he discovered that almost all of them had lost their ESP

What did he conclude? What did he conclude?

He concluded that you shouldn’t tell people He concluded that you shouldn t tell people they have ESP; it causes them to lose it. 

201/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 21: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Banking: loan/credit card approval:g / pp predict good customers based on old customers

Customer relationship management:id tif th h lik l t l f tit identify those who are likely to leave for a competitor

Targeted marketing:  identify likely responders to promotionsidentify likely responders to promotions

Fraud detection: telecommunications, finance from an online stream of event identify fraudulent 

tevents Manufacturing and production: automatically adjust knobs when process parameterautomatically adjust knobs when process parameter changes

1/5/2010 21Jure Leskovec, Stanford CS345a: Data Mining

Page 22: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Medicine: disease outcome, effectiveness ofMedicine: disease outcome, effectiveness of treatments analyze patient disease history: find relationship between diseasesbetween diseases 

Molecular/Pharmaceutical: id tif d identify new drugs

Scientific data analysis: id if l i b hi f b l identify new galaxies by searching for sub clusters

Web site/store design and promotion: find affinity of visitor to pages and modify layout

1/5/2010 22Jure Leskovec, Stanford CS345a: Data Mining

Page 23: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Overlaps with machine learning statisticsOverlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on scalability of number of features and instances stress on algorithms and

Machine Learning/Pattern 

Statistics/AI

stress on algorithms and architectures whereas foundations of  methods 

Recognition

Data Mining

and formulations provided by statistics and machine learning automation for handling large Database automation for handling large, heterogeneous data

systems

1/5/2010 23Jure Leskovec, Stanford CS345a: Data Mining

Page 24: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Prediction Methods Prediction Methods Use some variables to predict unknown or future values of other variablesfuture values of other variables.

Description MethodsDescription Methods Find human‐interpretable patterns that describe the data.

1/5/2010 24Jure Leskovec, Stanford CS345a: Data Mining

Page 25: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Classification Classification Clustering Association Rule Discovery: Association Rule Discovery: Sequential Pattern Discovery Regression Regression Anomaly Detection

1/5/2010 25Jure Leskovec, Stanford CS345a: Data Mining

Page 26: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Early Class: Attributes:

Courtesy: http://aps.umn.edu

y

Intermediate

• Stages of Formation • Image features, • Characteristics of light

waves received, etc.

Intermediate

LateLate

Data Size: • 72 million stars, 20 million galaxies• Object Catalog: 9 GB• Image Database: 150 GB

1/5/2010 26Jure Leskovec, Stanford CS345a: Data Mining

Page 27: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Observe Stock MovementsObserve Stock Movements Cluster them: Stock‐{UP/DOWN} Similarity Measure: 

T i t i il if th t d ib d b Two points are more similar if the events described by them frequently happen together on the same day. 

Discovered Clusters Industry Group

1Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,

Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,

Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,

Sun-DOWN

Technology1‐DOWN

A l C DOWN A t d k DOWN DEC DOWN

2Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,

ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,Computer-Assoc-DOWN,Circuit-City-DOWN,

Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

Technology2-DOWN

3Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,MBNA Corp DOWN Morgan Stanley DOWN i i l3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

4Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,

Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,Schlumberger-UP

Oil-UP

1/5/2010 27Jure Leskovec, Stanford CS345a: Data Mining

Page 28: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Given database of user preferences predict Given database of user preferences,  predict preference of new user 

Example: p Predict what new movies you will like based on your past preferences others with similar past preferences their preferences for the new movies

Example: Example:  Predict what books/CDs a person may want to buy  (and suggest it or give discounts to tempt (and suggest it, or give discounts to tempt customer)

1/5/2010 28Jure Leskovec, Stanford CS345a: Data Mining

Page 29: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Detect significant deviations Detect significant deviations from normal behavior

Applications: Applications: Credit Card Fraud Detection

Network Intrusion DetectionDetection

1/5/2010 29Jure Leskovec, Stanford CS345a: Data Mining

Page 30: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Supermarket shelf management. Goal: To identify items that are bought together by sufficiently many customers.

Approach: Process the point of sale data collected with Approach: Process the point‐of‐sale data collected with barcode scanners to find dependencies among items.

A classic rule ‐‐ If a customer buys diaper and milk, then he is likely to buy beer. So, don’t be surprised if you find six‐packs stacked next to diapers!

TID Items

1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk

Rules Discovered:{Milk} --> {Coke}{Diaper, Milk} --> {Beer}

Rules Discovered:{Milk} --> {Coke}{Diaper, Milk} --> {Beer}p

4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

1/5/2010 30Jure Leskovec, Stanford CS345a: Data Mining

Page 31: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Network intrusion detection using a combination of i l l di d l ifi i 4 Gsequential rule discovery and classification tree on 4 GB 

DARPA data Won over (manual) knowledge engineering approach http://www cs columbia edu/~sal/JAM/PROJECT/ provides good http://www.cs.columbia.edu/ sal/JAM/PROJECT/ provides good detailed description of the entire process

Major US bank: Customer attrition prediction Segment customers based on financial behavior: 3 segments Segment customers based on financial behavior: 3 segments Build attrition models for each of the 3 segments 40‐50% of attritions were predicted == factor of 18 increase 

T d di k i j US b k Targeted credit marketing: major US banks find customer segments based on 13 months credit balances build another response model based on surveys increased response 4 times 2% increased response 4 times – 2%

1/5/2010 31Jure Leskovec, Stanford CS345a: Data Mining

Page 32: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Scalability Scalability Dimensionality Complex and Heterogeneous Data Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Data Ownership and Distribution Privacy Preservation Streaming Data Streaming Data

1/5/2010 32Jure Leskovec, Stanford CS345a: Data Mining

Page 33: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

[Leskovec et al., TWEB ’07]

Senders and followers of recommendations receive discounts on products

10% credit 10% off

R d i d b f Recommendations are made to any number of people at the time of purchaseO l h i i h b fi

Jure Leskovec, Stanford CS345a: Data Mining

Only the recipient who buys first gets a discount

1/5/2010 33

Page 34: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Product recommendation 

knetwork 

purchase following a recommendation

customer recommending acustomer recommending a product

customer not buying a recommended productrecommended product

341/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 35: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Large online retailer (June 2001 to May 2003) Large online retailer (June 2001 to May 2003)

15,646,121 recommendationsd 3,943,084 distinct customers

548,523 products recommended 99% of them belonging 4 main product 99% of them belonging 4 main product groups: books books DVDs musicmusic VHS

351/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 36: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

RecommendationsRecommendations sender (shadowed) recipient (shadowed) recommendation time buy bit purchase time purchase time product price

Additional product info (from the retailer’s website) Additional product info (from the retailer s website) categories reviews ratings

361/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 37: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

What role does the product category play? What role does the product category play?      

products customers recommenda-tions edges

buy + getdiscount

buy + no discounttions discount discount

Book 103,161 2,863,977 5,741,611 2,097,809 65,344 17,769

DVD 19,829 805,285 8,180,393 962,341 17,232 58,189

Music 393,598 794,148 1,443,847 585,738 7,837 2,739

Video 26,131 239,583 280,270 160,683 909 467

F ll 542 719 3 943 084 15 646 121 3 153 676 91 322 79 164Full 542,719 3,943,084 15,646,121 3,153,676 91,322 79,164

peoplerecommendations

Jure Leskovec, Stanford CS345a: Data Mining

highlow

1/5/2010 37

Page 38: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

There are relatively few DVD titles, but DVDs account for ~ 50% of recommendationsrecommendations.

recommendations per person DVD: 10 books and music: 2 VHS: 1 VHS: 1

recommendations per purchase books: 69 DVDs: 108 music: 136 music: 136 VHS: 203

Overall there are 3.69 recommendations per node on 3.85 different products.

Music recommendations reached about the same number of people as DVDs but used only 1/5 as many recommendations 

Book recommendations reached by far the most people – 2.8 million.  All networks have a very small number of unique edges For books videos All networks have a very small number of unique edges. For books, videos 

and music the number of unique edges is smaller than the number of nodes – the networks are highly disconnected

381/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 39: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

12x 104

6

10

12ne

nt

4 x 106

6

8

t com

pon

2n

# nodes6

4

6

e of

gia

nt

0 10 200

m (month)

1.7*106m

0

2siz

by monthquadratic fit

39

0 1 2 3 4x 106

0number of nodes

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 40: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

94% of users make first recommendation without having94% of users make first recommendation without having received one previously

linear growth: ~ 165,000 new users added each monthsize of giant connected component increases from 1% to 2 5% size of giant connected component increases from 1% to 2.5% of the network (100,420 users) – small!

some sub‐communities are better connected 24% out of 18,000 users for westerns on DVD 26% of 25,000 for classics on DVD 19% of 47,000 for anime (Japanese animated film) on DVD19% of 47,000 for anime (Japanese animated film) on DVD

others are just as disconnected 3% of 180,000 home and gardening

2 7% f hild ’ d fit DVD 2‐7% for children’s and fitness DVDs

401/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 41: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Does sending more recommendations Does sending more recommendations influence more purchases?

5

6

7

ases

3

4

ber o

f Pur

cha

20 40 60 80 100 120 1400

1

2

Num

Jure Leskovec, Stanford CS345a: Data Mining

20 40 60 80 100 120 140Outgoing Recommendations

1/5/2010 41

Page 42: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

consider whether sender has at least one successful d irecommendation

controls for sender getting credit for purchase that resulted from others recommending the same product to the same personperson

0.1

0.12

it

probability of i i    

0.06

0.08

bilit

y of

Cre

d receiving a credit levels off for DVDs

0.02

0.04

Pro

bab

42

10 20 30 40 50 60 70 800

Outgoing Recommendations

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 43: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

DVD recommendations

asing

0.090.1

DVD recommendations(8.2 million observations)

of purcha

0 050.060.070.08

bability o

0 020.030.040.05

Prob

00.010.02

0 10 20 30 40

43

0 10 20 30 40# recommendations received

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 44: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Effectiveness of subsequent recommendations? Effectiveness of subsequent recommendations? Multiple recommendations between two individuals weaken the impact of the bond on purchases

0.06

0.07

g

p p

0.05

lity

of b

uyin

g

0.03

0.04

Pro

babi

l

5 10 15 20 25 30 35 400.02

Exchanged recommendationsJure Leskovec, Stanford CS345a: Data Mining1/5/2010 44

Page 45: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Consider successful recommendations in terms of av # senders of recommendations per book categoryav. # senders of recommendations per book category av. # of recommendations accepted

books overall have a 3% success rate  (2% with discount, 1% without)

Lower than average success rate Lower than average success rate fiction romance (1.78), horror (1.81) teen (1.94), children’s books (2.06) i (2 30) i fi (2 34) t d th ill (2 40) comics (2.30), sci‐fi (2.34), mystery and thrillers (2.40)

nonfiction sports (2.26) home & garden (2.26) travel (2 39) travel (2.39)

Higher than average success rate professional & technical medicine (5.68) professional & technical (4 54) professional & technical (4.54) engineering (4.10), science (3.90),  computers & internet (3.61) law (3.66), business & investing (3.62)

451/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 46: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Professional & technical book recommendations are more often accepted

Some organized contexts other than professional also have higher success rate, e.g. religion g , g g overall success rate 3.13% Christian themed books Christian living and theology (4.7%)Christian living and theology (4.7%) Bibles (4.8%)

not‐as‐organized religion new age (2.5%)g ( ) occult  spirituality (2.2%)

Well organized hobbies books on orchids recommended successfully twice as often as books 

on tomato growing

461/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 47: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Variable transformation Coefficient

const -0.940 ***# recommendations ln(r) 0 426 ***# recommendations ln(r) 0.426 # senders ln(ns) -0.782 ***# recipients ln(n ) -1 307 ***# recipients ln(nr) 1.307 product price ln(p) 0.128 ***# reviews ln(v) -0 011 ***# reviews ln(v) -0.011 avg. rating ln(t) -0.027 *

R2 0 74

1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining 47

R2 0.74significance at the 0.01 (***), 0.05 (**) and 0.1 (*) levels 

Page 48: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

47 000 customers responsible for the 2 5 out of47,000 customers responsible for the 2.5 out of 16 million recommendations in the system

29% success rate per recommender of an anime DVD

Giant component covers 19% of the nodes

Overall, recommendations for DVDs are more likely to result in a purchase (7%), but the anime 

i dcommunity stands out

Jure Leskovec, Stanford CS345a: Data Mining1/5/2010 48

Page 49: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Three colors: blue, white & red, showing purchasers only

491/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 50: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Small community Small community few reviews, senders, and recipients b t di d ti h l but sending more recommendations helps

Pricey products Rating doesn’t play as much of a role Rating doesn t play as much of a role

501/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

Page 51: Leskovec - Stanford Universityweb.stanford.edu/class/cs345a/slides/01-intro.pdfwill be fast paced 1/5/2010 Jure Leskovec, Stanford CS345a: ... Minhashing, Locality ... Topic ‐specific

Observations for diffusion models:Observations for diffusion models: purchase decision more complex than threshold or simple infection

influence saturates as the number of contacts expands

links user effectiveness if they are overused links user effectiveness if they are overused

Conditions for successful recommendations: professional and organizational contexts discounts on expensive items

ll i h l k i i i small, tightly knit communities

Jure Leskovec, Stanford CS345a: Data Mining1/5/2010 51