data cloud yury lifshits yahoo! research

Post on 23-Dec-2015


Data Cloud

Yury Lifshits

Yahoo! Research

http://yury.name

My Beliefs

The key challenge in web search is structured search

Part 1: What is structured search?

The key challenge in structured search is collecting data

Part 2: Data distribution & idea of Data Cloud

Part 3: Demo: numeric data distribution

The key challenge in collecting data is incentive design

Part 4: Economics of data distribution

Structured Search

Data

Structured data

Entity unit:

• Identifier

• Metadata:

– Explicit key-value pairs

– Relational properties

– Evaluation

Semi-structured data

Content unit:

• Body: text, video, audio, or image

• Metadata:

– Explicit key-value pairs

– Relational properties

– Evaluation

Data = data of entities + data of content
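The two unit types above can be sketched as plain record types. This is only an illustration of the slide's vocabulary: the class names, field names, and sample values are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Metadata:
    """Metadata shared by entity units and content units (field names are assumptions)."""
    key_values: Dict[str, str] = field(default_factory=dict)        # explicit key-value pairs
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (relation, target identifier)
    evaluation: Optional[float] = None                              # e.g. a rating or vote score


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body (text, or a reference to media) plus metadata."""
    body: str
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical sample data: one entity and one piece of content that refers to it.
concert = EntityUnit(
    identifier="event:example-concert",
    metadata=Metadata(
        key_values={"city": "SF", "price_usd": "15"},
        relations=[("performed_by", "artist:example-band")],
        evaluation=4.5,
    ),
)
review = ContentUnit(
    body="Great show last night.",
    metadata=Metadata(relations=[("about", concert.identifier)]),
)

# Data = data of entities + data of content
data = {"entities": [concert], "content": [review]}
```

The only difference between the two types in this sketch is what anchors the metadata: an identifier for entities, a body for content.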

Structured Search

Factoid search

"What's the value of property X of object Y?"

Entity hubs

– Domain hubs

Structured object search

"All concerts this weekend in SF under $20, sorted by popularity"

– Time focus

– Ranking focus

– Relations focus

Structured content search

"All videos with Tom Brady"

"All comments and blog posts about Bing"
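As a rough illustration of the factoid and structured object queries above, the sketch below runs them over a tiny in-memory list. The record layout, field names, and sample events are all made up for the example; a real system would run such queries against an index rather than a Python list.

```python
from datetime import date

# Hypothetical event records; every field name and value here is an assumption.
events = [
    {"name": "Indie Night", "type": "concert", "city": "SF",
     "date": date(2009, 6, 6), "price_usd": 15, "popularity": 120},
    {"name": "Jazz Trio", "type": "concert", "city": "SF",
     "date": date(2009, 6, 7), "price_usd": 18, "popularity": 300},
    {"name": "Arena Show", "type": "concert", "city": "SF",
     "date": date(2009, 6, 6), "price_usd": 60, "popularity": 900},
    {"name": "Folk Evening", "type": "concert", "city": "Oakland",
     "date": date(2009, 6, 6), "price_usd": 10, "popularity": 50},
    {"name": "Late Set", "type": "concert", "city": "SF",
     "date": date(2009, 6, 12), "price_usd": 12, "popularity": 80},
]


def factoid(records, name, prop):
    """Factoid search: the value of property `prop` of the object called `name`."""
    for record in records:
        if record["name"] == name:
            return record.get(prop)
    return None


def search_concerts(records, city, start, end, max_price):
    """Structured object search: filter on type, place, time, and price; rank by popularity."""
    hits = [
        r for r in records
        if r["type"] == "concert"
        and r["city"] == city
        and start <= r["date"] <= end
        and r["price_usd"] < max_price
    ]
    return sorted(hits, key=lambda r: r["popularity"], reverse=True)


cheap_sf = search_concerts(events, "SF", date(2009, 6, 6), date(2009, 6, 7), 20)
names = [r["name"] for r in cheap_sf]
```

The time window and the sort key correspond to the "time focus" and "ranking focus" bullets; a relations focus would add joins across entities, which this sketch leaves out.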

Yury’s Wishlist

Business-generated data

• Products, services, news, wishlists, contact data

Reality stream, sensors

• What has happened, and where

Expert knowledge

• Glossary, issues, typical solutions, object databases, related objects graph

Events

• Sport, concerts, education, corporate, community, private

Market graph & signals

• Like, interested, use, following, want to buy; votes and ratings

Search as a Platform

[Diagram. Labels: Classic search, App 1, App 2, App 3, App 4; Web index, Structured Data (twice); Query analysis, Post analysis]

Data Cloud

How to collect all structured data in one place?

Data Producers

• People: forums, wiki, mail groups, blogs, social networks

• Enterprises: product profiles, corporate news, professional content

• Sensors: GPS modules, web cameras, traffic sensors, RFID

• Transactional data

Data Distributors

A data distributor is any technical solution that accumulates, organizes, and provides access to structured and semi-structured data

Data publisher: the original distributor of some data

Data retailer: a consumer-facing distributor of some data

Data Consumers

• Humans

– Email

– Aggregators: news, friend feeds, RSS readers

– Search

– Browsing / random walks

• Intelligence projects

– Recommendation systems

– Trend mining

Data Cloud

Data Cloud is a centralized, fully functional data distribution service

Success metric for data cloud strategy = the total “value” of data on the cloud

To-Cloud Solutions

• Extraction

– DBpedia.org, “web tables”

• Semantic markup, data APIs

– Yahoo! SearchMonkey

• Feeds

– Yahoo! Shopping

– Disqus.com, js-kit.com, Facebook Connect

• Direct publishing

On-Cloud Solutions

• Ontology maintenance

– Freebase

• Normalization, de-duplication, antispam

• Named entity recognition, metadata inference, ranking

• Data recycling (cross-references)

– Amazon Public Data Sets

– Viral license

• Hosted search

– Yahoo! BOSS

From-Cloud Solutions

• Search, audience

– Y! SearchMonkey, Google Base

• Data API, dump access, update stream

• Custom notifications

– Gnip.com

• Data cloud as a primary backend

• Access control

– Ad distribution (AT&T and Yahoo! Local deal)

Demo: webNumbr.com

Joint work with Paul Tarjan

webNumbr.com: Import

• Crawl numbers from the web: URL + XPath + regex

• Create “numbr pages”

• Update their values every hour

• Keep the history

Anyone can create a numbr: http://webnumbr.com/create
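The "URL + XPath + regex" recipe can be sketched with the Python standard library alone. This is not webNumbr's code: the function name, the sample markup, and the history format are assumptions, and the page is passed in as a string rather than fetched, so the example runs offline. `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup; a real crawler would more likely use a full HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def extract_number(markup, xpath, pattern):
    """Locate a node with a (limited) XPath, then pull a number out of its text with a regex."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        return None
    text = "".join(node.itertext())
    match = re.search(pattern, text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Hypothetical page content standing in for a fetched URL.
PAGE = """<html><body>
<div id="stats"><span class="count">Followers: 1,234</span></div>
</body></html>"""

history = []  # (timestamp, value) pairs; a scheduler would append one entry per hour

value = extract_number(PAGE, ".//span[@class='count']", r"([\d,]+(?:\.\d+)?)")
history.append((datetime.now(timezone.utc), value))
```

Splitting the work this way keeps each part simple: the XPath narrows the page to one element, and the regex only has to handle the text inside it.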

webNumbr.com: Export

• Embed code

• Graphs

• Search & browse

• RSS

Economics of Data Distribution

Joint work with Ravi Kumar and Andrew Tomkins

Network Effect in Two-Sided Markets

Two-sided market = every product serves consumers of two types, A and B

Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers and vice versa

Examples: operating systems, credit cards, e-commerce marketplaces

Two-sided network effects: A theory of information product design

G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne

Basic model

• Distributors D1, … Dk

• Producer/consumer joins only one distributor

• Initial shares (p1,c1) … (pk,ck)

• New consumer selects a distributor with a probability proportional to pi

• New producer selects a distributor with probability proportional to ci
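The basic model can be simulated in a few lines. The sketch below assumes that arrivals alternate (one new consumer, then one new producer, per step) and that shares are tracked as counts; the slides do not specify either detail, and the function name and starting values are illustrative.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Simulate the basic model for k distributors.

    producers[i] and consumers[i] are the initial counts (p_i, c_i) at distributor D_i.
    """
    rng = random.Random(seed)
    p, c = list(producers), list(consumers)
    k = len(p)
    for _ in range(steps):
        # A new consumer joins one distributor with probability proportional to p_i.
        i = rng.choices(range(k), weights=p)[0]
        c[i] += 1
        # A new producer joins one distributor with probability proportional to c_i.
        j = rng.choices(range(k), weights=c)[0]
        p[j] += 1
    return p, c


p, c = simulate([5, 3, 2], [5, 3, 2], steps=1000)
producer_shares = [x / sum(p) for x in p]
```

Running it with different seeds gives a feel for how much the final shares depend on early random arrivals.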

Basic model

[Figure: diagram labeled a1, a2, a3, a4]

Market Shares Dynamics

Theorem 1: Market shares will stabilize

Theorem 2: With a super-linear preference rule, the market will tip to one of the distributors

Theorem 3: With a sub-linear preference rule, market shares will flatten
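One way to build intuition for Theorems 2 and 3 is a deterministic, expected-value version of the model. The sketch assumes the preference weight is the current count raised to a power alpha (alpha > 1 super-linear, alpha < 1 sub-linear) and that each step adds one unit of producers and one unit of consumers, split in proportion to those weights. Both the power-law form and the averaging are assumptions made for illustration; this is not the proof technique from the talk.

```python
def leader_share(alpha, steps=5000, start=(6.0, 4.0)):
    """Producer share of distributor D1 after `steps` rounds of expected-value updates."""
    p = list(start)  # producer mass per distributor
    c = list(start)  # consumer mass per distributor
    for _ in range(steps):
        wp = [x ** alpha for x in p]  # weights seen by a new consumer
        wc = [x ** alpha for x in c]  # weights seen by a new producer
        sp, sc = sum(wp), sum(wc)
        # Update both sides simultaneously from the old state.
        new_c = [ci + w / sp for ci, w in zip(c, wp)]
        new_p = [pi + w / sc for pi, w in zip(p, wc)]
        p, c = new_p, new_c
    return p[0] / sum(p)


linear = leader_share(1.0)      # proportional rule
superlinear = leader_share(2.0)  # leader keeps gaining share
sublinear = leader_share(0.5)    # shares drift toward equal
```

Starting from a 60 : 40 split, the leader's share holds steady under the proportional rule, grows under the super-linear rule, and shrinks toward one half under the sub-linear rule.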

External Factor

Preference rule with external factor:

ei+ci/(c1+…+ck)

Theorem 4 Market shares will stabilize on e1 : e2 : … : ek
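The same expected-value approach gives a quick numerical check of Theorem 4. The sketch applies the weight e_i + c_i/(c_1+…+c_k) to new producers and, as an assumption, the mirror-image weight e_i + p_i/(p_1+…+p_k) to new consumers. The function name, starting counts, and external factors are illustrative.

```python
def shares_with_external_factor(e, start, steps=20000):
    """Producer shares after `steps` expected-value updates with external factors e."""
    p = list(start)
    c = list(start)
    for _ in range(steps):
        tp, tc = sum(p), sum(c)
        # Weight for a new producer: e_i + c_i / (c_1 + ... + c_k)
        wp = [ei + ci / tc for ei, ci in zip(e, c)]
        # Assumed symmetric weight for a new consumer: e_i + p_i / (p_1 + ... + p_k)
        wc = [ei + pi / tp for ei, pi in zip(e, p)]
        zp, zc = sum(wp), sum(wc)
        p = [pi + w / zp for pi, w in zip(p, wp)]
        c = [ci + w / zc for ci, w in zip(c, wc)]
    total = sum(p)
    return [x / total for x in p]


# Two distributors that start equal but have external factors in ratio 3 : 1.
shares = shares_with_external_factor(e=[0.75, 0.25], start=[1.0, 1.0])
```

In this averaged model, a share s_i stops moving when s_i = (e_i + s_i)/(E + 1) with E = e_1 + … + e_k, which gives s_i = e_i/E, the ratio stated in the theorem.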

Coalition

Data Cloud

Coalitions

Theorem 5: If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors

Corollary

Coalitions are not monotone

Example: 5 : 4 : 1 : 1
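As a small worked check, the condition of Theorem 5 can be evaluated for the 5 : 4 : 1 : 1 example. The sketch only tests the stated hypothesis (every share below 1/sqrt(k)); it does not model profit or decide which sub-coalitions pay off, since the slides do not give those details. The function name is illustrative.

```python
import math


def below_coalition_threshold(sizes):
    """Return (condition holds, shares, threshold) for the Theorem 5 hypothesis."""
    k = len(sizes)
    total = sum(sizes)
    threshold = 1 / math.sqrt(k)
    shares = [s / total for s in sizes]
    return all(share < threshold for share in shares), shares, threshold


ok, shares, threshold = below_coalition_threshold([5, 4, 1, 1])
dominated, _, _ = below_coalition_threshold([8, 1, 1])
```

With k = 4 the threshold is 1/2 and the largest share is 5/11, so the condition holds for the example; in the second call one distributor holds 80% of a three-way market and the condition fails.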

Model Variations

• Same-side network effect

• Different p-to-c and c-to-p rules

• Multi-homing (overlapping audiences)

• n^2 vs. n log n revenue models

• Mature market: newcomer rate = departing rate

• Diverse market (many types of producers and consumers)

• Entering and departing distributors

• Directed coalitions

Challenges

Marketing

• Data demand?

• Data offerings?

• Requirements for distribution technology?

Incentive design

• Incentives for data sharing?

• Centralized or distributed?

– For profit or non-profit?

• Data licensing and ownership?

• Monetizing data cloud?

More Challenges

Prototyping:

• Data marketplace: open data & data demand

• Search plugins: related objects, glossaries, object timelines

• Publishing tools for structured data

• Data client: structured news, bookmarking, notifications

Tech design:

• Access management

• Namespace design

User interface:

• Structured search UI

• Discovery UI

Thanks!

Follow my research:

http://twitter.com/yurylifshits

http://yury.name/blog
