08-31 fayyad-data mining techniques in the analysis · 28 research in the u.s, broadband hits 68%...

59
0 Data Mining Techniques in the Analysis of Massive Data sets Usama Fayyad, Ph.D. Chief Data Officer & Sr. VP Research and Strategic Data Solutions Yahoo! Inc. Research

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

0

Data Mining Techniques in the Analysis of Massive Data sets

Usama Fayyad, Ph.D.Chief Data Officer &

Sr. VP Research and Strategic Data Solutions

Yahoo! Inc.

Research

Page 2: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

1

Research

Overview of this talk

• Introduction

• Case Studies in Science Data Analysis– Work done while at NASA-JPL (Jet Propulsion Lab), Pasadena, CA,

USA

– In collaboration with Caltech Astronomy

– In collaboration with Brown University Planetary Geology

• Case study in dealing with massive numbers of Consumers at Yahoo!

• The case for building out new sciences for the Internet– Yahoo! Research overview

• If we have time (unlikely)– Actual case studies from Yahoo! targeting applications

Page 3: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

2

Research

What is Data Mining?

Finding interesting structure in data• Structure: refers to statistical patterns, predictive

models, hidden relationships• Interesting: ?

• Examples of tasks addressed by Data Mining– Predictive Modeling (classification, regression)– Segmentation (Data Clustering )– Affinity (Summarization)

• relations between fields, associations, visualization

Page 4: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

3

Research

Example: Cataloging Sky Objects

Page 5: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

Research

Data Mining Based Solution

• Automated recognition of over 2 billion sky objects

• 94% accuracy in recognizing sky objects

• Classify objects that are at least one magnitude fainter than “state-of-art”.

• Tripled the “data yield”

• Generate sky catalogs with much richer content:

–on order of billions of objects: > 2x107 galaxies > 2x108 stars, 105 quasars

• Discovered new quasars 40 times more efficiently

So why did it work???

Page 6: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

Research

Page 7: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

6

Research

Beyond Traditional Data Analysis

• Scaling analysis to large databases– How to deal with data without having to move it out? – Are there abstract primitive accesses to the data, in database

systems, that can provide mining algorithms with the information to drive the search for patterns?

– How do we minimize--or sometimes even avoid--having to scan the large database in its entirety?

• Automated search– Enumerate and create numerous hypotheses– Fast search– Useful data reductions

• More emphasis on understandable models– Finding patterns and models that are “interesting” or “novel” to

users.

• Scaling to high-dimensional data and models.

Page 8: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

Research

SKICAT Clustering

Use clustering algorithm to address the more difficult scientific analysis questions:

• Do astronomer-specified classes indeed arise naturally in data?

–Are there new classes in data?–Are there interesting clusters in data?

Page 9: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

Research

Clustering Results

Clean Test• Assembled data of two classes: stars and

galaxies; minimal data massaging/normalization, - 1,500 points.

• Clustering algorithm discovered 4 classes consistently!!!

• Initially disappointing result, but…

• Very reassuring to us and to Astronomers.

Page 10: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

9

Research

Page 11: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

10

Research

Clustering

• In search of low probability (rare) classes≈ 100 objects per 106 - 107 sky objects

• E.g. high-redshift quasars (z>4)– Using SKICAT discovered 20 new z>4 quasars

– reduced observation time by factor of 40.

How does one find minority classes?How does one find minority classes?––Most clustering algorithms would ignore Most clustering algorithms would ignore

them as noise or undesirable sidethem as noise or undesirable side--effectseffects––sampling is NOT usefulsampling is NOT useful––Need to scale to > millions of casesNeed to scale to > millions of cases

Page 12: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

11

Research

In Search of Minority Classes

• One Approach:– Randomly sample

– Cluster on ≈ 103 - 104 sky objects to model background classes

– Classify all 106 - 107 objects

– Discard high-probability objects, collect residue

– Focus on the residuals “misfits”

– Iterate...

Page 13: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

12

Research

Page 14: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

13

Research

Concensus Labels (FF05)

Page 15: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

14

Research

There is no “Ground Truth”

Page 16: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

15

Research

Overview

Page 17: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

16

Research

Matched Filter – Rough Detection

Page 18: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

17

Research

Detection stage

Page 19: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

18

Research

Page 20: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

19

Research

Page 21: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

20

Research

Principal Components

Page 22: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

21

Research

Clustering can help classification…

Page 23: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

22

Research

FROC evaluation

FROC = Free Response ROCReceiver Operating Characteristic

+, * indicate human scientist performanceFOA is best possible rate

Focus of AttentionHomogenous image performance

Page 24: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

23

Research

Grand Challenges in Data Mining

• Pragmatic:– Achieving integration and invisibility

• Research:– Solving some serious unaddressed problems

Page 25: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

24

Summary of Grand Challenges

Page 26: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

25

Research

Challenges

1. Where’s the Data?

2. In Situ mining

3. Domain knowledge

4. Life-cycle maintenance

5. Metrics

1. Understanding “large”

2. Simplicity knob

3. Interestingness

4. Scalability

5. Theory of what we do

0. Public and challenging benchmark data sets

A Scorecard for the field: At least 2 advances in the next 10 years!!!

Pragmatic Technical

Page 27: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

26

Yahoo! Case Study

Evolving the Data Strategy

Page 28: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

27

Research

506.3602.4

702.4787.5

868.3974.3

1,076.7

0.0

200.0

400.0

600.0

800.0

1,000.0

1,200.0

2001 2002 2003 2004 2005 2006 2007

Worldwide Total JapanRest of World Western EuropeAsia/Pacific United States

Globally, Internet Users Will Number Over 1 Billion by 2007

Source: IDC, December 2003.

Internet Users in Millions:

Page 29: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

28

Research

In the U.S, Broadband Hits 68% in ’04, and Continues to Grow

Source: Forrester Research, June 2004.

0.6 2.66.1

10.515.6

19.527.4

38

48.3

5763.6

68.2

101.1 102.1 103.2 104.3 105.5 106.6 107.7 108.9 110.1 111.2 112.4 113.6

80.277.174.372.1

6867.263.5

59.2

44.3

33.4

83.787.2

0.0

20.0

40.0

60.0

80.0

100.0

120.0

1998

1999

2000

2001

2002

2003

2004

2005

2006

2007

2008

2009

US

Hou

seho

lds

(mill

ions

)

All US households (millions)Internet householdsBroadband households

Actual Forecast

• In 1998, 33% of US households were online.

• At the beginning of 2004, 68% of US Households were online.

• By the end of 2009, 77% of US Households will be online.

• Broadband is expected to grow at a 20% CAGR between over the next 5 years.

Page 30: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

29

Research

Yahoo! is the #1 Destination on the Web

73% of the U.S. Internet population uses Yahoo!

– About 500 million usrs per month globally!

• Global network of content, commerce, media, search and access products

• 100+ properties including mail, TV, news, shopping, finance, autos, travel, games, movies, health, etc.

• 12 terabytes of data collected each day… and growing

• Representing over 5,000 cataloged consumer behaviors

More people visited Yahoo! in the past month than:

• Use coupons• Vote• Recycle• Exercise regularly• Have children living at

home• Wear sunscreen regularly

Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

Data is used to develop content, consumer, category and campaign insights for our key content partners and large advertisers

Page 31: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

30

Research

DNA of a Yahoo! User

Male, age 23

Lives in NYC

Interested in music

Student

Searches fpradventure

travel destinations

Heavy userof IM

Visited Y! Autos last month

Listens to Pop and Rockon Y! Launch

Looks up PGA results

Responds most toads in Yahoo! News

Clicked on an auto Ad

Registration Campaign Yahoo! Behavior

The Y! User DNA

All of this data combines to paint a picture of each Yahoo! User

Page 32: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

31

Research

Addressing the Challenge: Reaching your customers

Yahoo! ID #xxx1

Target ‘thumbprint’

Poor match, not likely to be in the desired target

Yahoo! ID #xxx2

Target ‘thumbprint’

Good match, likely to be in the desired target

Yahoo!User DB-Scored & Ranked

Target Group

In addition to directly targeting customers, we can also build look-alike models using the data match as its base.

Page 33: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

32

Research

Inventing The New Sciences of the Internet at

Yahoo! Research

Page 34: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

33

Research

New Science?

• The Internet touches all of our lives: personal, commercial, corporate, educational, government, etc…

• Yet many of the basic notions we talk about:– Search, Community, Personalization,

Engagement, Interactive Content, Information Navigation

– Are not at all understood, or well-defined– These are not disciplines that academia or any

industry research labs focus on…

Page 35: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

34

Research

Areas of Research

• Community:– How do you know what to believe on the Internet?– Trust models on-line and trust propagation– What makes communities thrive? Whither?– Social media, tagging, image and video sharing

• Microeconomics: a new generation of economics driven by massive interactions– Auction marketplaces– The web as a new LEI of activities and economies

• Information Navigation and Search– We are in the early days of search and retrieval

• User Experience

Page 36: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

35

Research

Mission & Vision

Vision: Where the Internet’s future is invented– with innovative economic models for advertisers, publishers

and consumers.

Mission: NEXT -- Invent theNext generation Internet by defining the future media to Engage consumers andeXtend the economics for advertisers and publishers

through new sciences that establish theTechnical leadership of Yahoo!

Page 37: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

36

Research

How we get there

• Scientific excellence– World-recognized leadership through Business

impact

– Build the Largest, Deepest and Smartest Research Organization focused on a few chosen areas

– Explore areas that nobody else is exploring

– Open model with strong emphasis on publication, peer review, and real problems

Page 38: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

37

Research

A sampling of the Top Researchers now at Yahoo!

– Prabhakar Raghavan: CTO Verity, Web Research architect at IBM, Head of Y!R

– Andrei Broder: inventor of key search, web spam, technologies

– Andrew Tomkins: chief Scientist of WebFountain, inventor of key algorithms, structure of Web graph

– Raghu Ramakrishnan : world authority on data mining, database systems, and social search

– Ricardo Baeza-Yates: renowned expert in text and query mining, authored seminal texts in IR

– Michael Schwarz: renowned economist, auctions, web, Harvard and U.C. Berkeley

Page 39: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

38

New Economics

Auction Markets and Sponsored Search

Page 40: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

39

Research

Burgeoning research area

• Understanding auctions, incentive mechanisms, pricing, optimization – a huge deal– Already multi-billion dollar business, growing fast– Interface of microeconomics and CS– Early work – Edelman, Ostrovsky, Schwarz

• Advertiser’s perspective – spending budget optimally– Many open problems, a few papers, some of them

quite realistic– Marketplace design

Page 41: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

40

Research

Generic questions

• Of the various advertisers for a keyword, which one(s) get shown?

• What do they pay on a click through?

• The answers turn out to draw on insights from microeconomics

Page 42: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

41

Research

Ads go in slots like this one

and this one.

Page 43: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

42

Research

Advertisers generally prefer this slot

to this one.

Page 44: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

43

“Social” Search

Is the Turing test always the right question?

Page 45: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

44

Research

Page 46: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

45

Research

The power of social media

• Flickr – community phenomenon• Millions of users share and tag each

others’ photographs (why???)• The wisdom of the crowd can be used to

search• The principle is not new – anchor text

used in “standard” search• Don’t try to pass the Turing test?

Page 47: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

46

Research

Challenges in social media

• How do we use these tags for better search?

• What’s the ratings and reputation system?

• How do you cope with spam?

• The bigger challenge: where else can you exploit the power of the people?

• What are the incentive mechanisms?

Page 48: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

47

Questions?

Page 49: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

48

Research

Advertising: Brand and DR

Knowledge of users & their behavior throughout the purchase funnel can grow brand & direct response revenue

Awareness

Purchase

Consideration

≈ $200B Brand Advertising Market

≈ $200B Direct Marketing Market

Most time & activity is in consideration & engagement, but there are limited metrics & reach strategies

Page 50: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

49

Why is search-related advertising so powerful?

A question for the Audience:

Page 51: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

50

Research

Moving Customers up the Funnel

Impulse Banners– Target users based on their activity – both search and

property -- within the NEXT HOUR• Behavioral Categories – Apparel, Computers, Home Appliances –

all the same categories that you can use for regular behavioral targeting!

Awareness

Purchase

Consideration

Page 52: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

51

Research

Impulse Banner Example

All within1 HOUR2

Sees that “Credit Card” falls under

the category: “Finance/Credit

and Credit Services”

25% - 261% higher CTR than RON

1

User searches on

the word “Credit Card”

3

serves “Finance/Credit and

Credit Services” banners to User anywhere on the Yahoo! network within 1 HOUR

Page 53: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

52

Research

Way Impulse Works

• Searches are not at all associated or tracked through personally identifiable information

• No long-term memory of search terms, all stored on client cookie.

• We generalize the category is targeting is at generic category: e.g. Financial Services, not “credit card”

• All targeting done in anonymous mode

Page 54: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

53

Research

Moving Down the Funnel

• New generation marketing solutions to take brand advertisers down the marketing funnel

Awareness

Purchase

Consideration

Page 55: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

54

Research

Types of Data Captured Across the Yahoo Network

Demographics (Registration)

Behavior (Clickstream)

Interests (Declared)

Campaign Response

(View/Click)

Interests (Inferred)

Shopping (Research/

Transactions)

Search Interests

(Key Words)

Attitudinal (Survey)

Page 56: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

55

Research

Identifying Car Shoppers

1. Identify relevant actions that indicate buying intent and aggregate them on an user basis

2. Compute a “purchase intent” score for each user

3. Segment results by score to identify top prospects

- Browse specs

- Configure and price- Compare vehicles

- Loan Calculator

- Car manufacturers- Car buying guides

- Car dealers - Local dealer lookups

Page 57: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

56

Research

What Can We Do With This Data?

• Using rich click stream data on Y!, we can identify consumers shopping for a car…– Capability unique to online medium; hard to

identify “in the market” consumers off-line

• …with a reasonable degree of accuracy– 70% identified “in the market” users looking to buy

within 3 months

– 24% users said they actually made a purchase within a month (results from a self-reported survey)

Page 58: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

57

Research

How Big is the Opportunity?

• Can identify ~250,000 “leads” a month on Yahoo (at 90% confidence level)– About 30% of total new cars sold in the U.S.

every month to individuals• More opportunities

– Car financing, Car insurance, Used car listings– Other categories: travel intenders, high net worth

individuals, small businesses, etc…

Page 59: 08-31 Fayyad-Data Mining Techniques in the Analysis · 28 Research In the U.S, Broadband Hits 68% in ’04, and Continues to Grow Source: Forrester Research, June 2004. 0.6 2.6 6.1

58

Research

An Example

• A test ad-campaign with a major Euro automobile manufacturer– Designed a test that served the same ad creative to test and

control groups on Yahoo– Success metric: performing specific actions on Jaguar

website

• Test results: 900% conversion lift vs. control group– Purchase Intenders were 9 times more likely to configure a

vehicle, request a price quote or locate a dealer than consumers in the control group

– ~3x higher click through rates vs. control group