big data meetup

Geoffrey Hendrey @geoffhendrey Architecture for real-time ad-hoc query on distributed filesystems

Upload: geoff-hendrey

Post on 02-Jul-2015


DESCRIPTION

Vertascale Real-time big data hadoop AUDIO: http://files.meetup.com/7151882/Real-Time%20Big%20Data%20130312%20Geoff%20Hendrey.opus

TRANSCRIPT

Page 1: Big data meetup

Geoffrey Hendrey

@geoffhendrey

Architecture for real-time ad-hoc query on distributed filesystems

Page 2: Big data meetup

Motivation

• Big Data is more opaque than small data

– Spreadsheets choke

– BI tools can’t scale

– Small samples often fail to replicate issues

• Engineers, data scientists, analysts need:

– Faster “time to answer” on Big Data

– Rapid “find, quantify, extract”

• Solve “I don’t know what I don’t know”

• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)

Page 3: Big data meetup

Scaling search with classic sharding

Page 4: Big data meetup

Classic “side system” approach

• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster

[Diagram labels: Hadoop, Search Cluster, “?????”]

Page 5: Big data meetup

Classic “search toolkit”

• Built around fulltext use case

• Inverted indexes optimized for on-the-fly ranking of results

– TF-IDF

– Okapi BM25

• Yet never able to fully realize Google-style search capability

• Issues:

– Phrase detection

– Pseudo synonymy

– Open loop architecture
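For readers unfamiliar with the ranking functions named on this slide, here is a minimal Python sketch of Okapi BM25 scoring. It is not from the talk: the function name `bm25_scores`, the sample documents, and the parameter defaults `k1=1.2` and `b=0.75` are illustrative assumptions, and the IDF uses the common "+1 inside the log" variant.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            )
            score += idf * norm
        scores.append(score)
    return scores

# Hypothetical sample documents, already tokenized.
docs = [
    "error connecting to athens datacenter".split(),
    "athens athens athens".split(),
    "all systems nominal".split(),
]
scores = bm25_scores(["athens"], docs)
```

The point of the slide is that this kind of relevance scoring is computed per query over the inverted index, which is work an ad-hoc query with an explicit sort order does not need.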

Page 6: Big data meetup

Big data ad-hoc query

• Not typically a fulltext “document search” problem

• Data is structured, mixed structured, and denormalized

– Log lines

– JSON records

– CSV files

– Hadoop native formats (SequenceFile)

• Ranking is explicit (ORDER BY), not relevance based

• Sometimes “needle in haystack” (support, debugging)

• Sometimes “haystack in haystack” (summary analytics, segmentation)
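The two query shapes above can be illustrated with a small Python sketch over CSV records. The sample data and its column layout (date, host, city, status, latency) are invented for illustration and do not come from the talk.

```python
import csv
import io
from collections import Counter

# Hypothetical log extract: date, host, city, HTTP status, latency in ms.
RAW = """2013-03-12,web01,athens,500,120
2013-03-12,web02,sparta,200,35
2013-03-12,web01,athens,200,48
2013-03-12,web03,athens,500,310
"""

rows = list(csv.reader(io.StringIO(RAW)))

# "Needle in haystack": find the failing requests, then rank explicitly
# (the equivalent of ORDER BY latency DESC), with no relevance scoring.
needles = sorted(
    (r for r in rows if r[2] == "athens" and r[3] == "500"),
    key=lambda r: int(r[4]),
    reverse=True,
)

# "Haystack in haystack": summarize every record by status code.
status_counts = Counter(r[3] for r in rows)
```

Neither query involves TF-IDF or BM25: the first is a filter plus an explicit sort, the second is a group-and-count over the whole data set.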

Page 7: Big data meetup

Dremel MPP query execution tree

Page 8: Big data meetup

Finer points of Dremel architecture

• MapReduce friendly

• In-Situ approach is DFS friendly

• Excels at aggregation. Not so much for needle-in-haystack.

• Column storage format accelerates MapReduce (less extraneous data pushed through)

• But in some regards still a “side system”

• Applications must explicitly store their data in a columnar format

• “massive” is both a benefit and a hazard

– Complex (operationally and WRT query execution)

– Queries can execute quickly…on huge clusters
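The claim that column storage pushes less extraneous data through an aggregation can be shown with a toy Python model. This is only an in-memory illustration with invented records; it counts field values touched rather than measuring real I/O, and it says nothing about Dremel's actual on-disk format.

```python
# Hypothetical records in a row-oriented layout.
rows = [
    {"ts": "2013-03-12T10:00", "city": "athens", "latency_ms": 120},
    {"ts": "2013-03-12T10:01", "city": "sparta", "latency_ms": 35},
    {"ts": "2013-03-12T10:02", "city": "athens", "latency_ms": 310},
]

# Column layout: one contiguous list per field.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row layout: summing one field still drags every field of every record along.
row_values_read = sum(len(r) for r in rows)
row_total = sum(r["latency_ms"] for r in rows)

# Column layout: only the one needed column is read.
col_values_read = len(columns["latency_ms"])
col_total = sum(columns["latency_ms"])
```

Both layouts give the same total, but in this model the row layout touches nine values and the column layout three. The trade-off noted on the slide still applies: applications have to write their data in the columnar format first.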

Page 9: Big data meetup

Crawled In-Situ Index Architecture

[Diagram labels: Hadoop (HDFS, MapReduce), Data Crawl, In-situ Index, SimpleSearch, Application]

Page 10: Big data meetup

Benefits to crawled In-Situ index

• No changes to application data format

– CSV

– JSON

– SequenceFile

• Clear “separation of concerns” between data and index

• Indexes become “disposable”: easily built, easily thrown away

• There is no “side system” that needs to be maintained

• Use the MapReduce “hammer” to pound a nail
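A minimal single-machine sketch of the crawled in-situ idea, in Python: the crawl reads the data file where it lies and records only byte offsets, and lookups seek back into the unmodified file. The function names `crawl` and `lookup`, the temporary CSV file, and the single-column index are assumptions made for illustration; the talk describes the crawl running as MapReduce over HDFS, which this sketch does not attempt.

```python
import os
import tempfile
from collections import defaultdict

def crawl(path, column, sep=","):
    """Map each value of one column to the byte offsets of its lines."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            fields = line.decode("utf-8").rstrip("\r\n").split(sep)
            if column < len(fields):
                index[fields[column]].append(offset)
    return index

def lookup(path, index, value):
    """Fetch matching records by seeking into the untouched data file."""
    records = []
    with open(path, "rb") as f:
        for offset in index.get(value, []):
            f.seek(offset)
            records.append(f.readline().decode("utf-8").rstrip("\r\n"))
    return records

# Hypothetical application data, written in its own format (plain CSV).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"1,alice,athens\n2,bob,sparta\n3,carol,athens\n")
    data_path = tmp.name

index = crawl(data_path, column=2)
hits = lookup(data_path, index, "athens")
os.remove(data_path)
```

The index here holds values and offsets only, so it can be thrown away and rebuilt by crawling again, and the data file never changes format.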

Page 11: Big data meetup

Architect for Elasticity

[Diagram labels: AWS S3, Elastic MapReduce, JetS3t, EC2 (M1.large), Application, Crawl, Index, HTTP]

Interesting: you don’t actually need to have Hadoop installed…

Page 12: Big data meetup

Declarative Crawl Indexing

[Diagram labels: Hadoop (HDFS, MapReduce), Data Crawl, In-situ Index, SimpleSearch, Application]

Parse.json:

{
  "filter": "column[4]==\"athens\""
}

• Indexer reads declarative instructions from in-situ file

• “pull” vs. traditional “push” indexing approach
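As a rough illustration of the "pull" model, the Python sketch below reads the filter from the instruction file's JSON and applies it to records during a crawl. The slide shows only one filter expression, so the grammar handled here (exactly `column[N]=="value"`), the helper name `compile_filter`, and the sample records are all assumptions; the real indexer's filter language is not described in the source.

```python
import json
import re

# Only the one form shown on the slide is handled: column[N]=="value".
FILTER_RE = re.compile(r'^\s*column\[(\d+)\]\s*==\s*"([^"]*)"\s*$')

def compile_filter(expression):
    """Turn a filter string into a predicate over a list of column values."""
    match = FILTER_RE.match(expression)
    if match is None:
        raise ValueError(f"unsupported filter: {expression!r}")
    position, expected = int(match.group(1)), match.group(2)
    return lambda columns: position < len(columns) and columns[position] == expected

# Contents of the instruction file, matching the slide's example.
parse_json = '{"filter": "column[4]==\\"athens\\""}'
instructions = json.loads(parse_json)
keep = compile_filter(instructions["filter"])

# Hypothetical records sitting next to the instruction file.
records = [
    "7,2013-03-12,web01,GET,athens",
    "8,2013-03-12,web02,GET,sparta",
]
indexed = [r for r in records if keep(r.split(","))]
```

Parsing the expression with a fixed pattern, rather than evaluating it as code, keeps the instruction file purely declarative: the data owner states what to index and the crawler decides how.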

Page 13: Big data meetup

Thin index

• Index size is small because data is a holistic part of the system

• Data does not need to be “put into” the search system and replicated in the index.

[Diagram labels: HDFS, MapReduce, Data Crawl, In-situ Index, Data, Index]

Page 14: Big data meetup

Lazy data loading

[Diagram labels: HDFS, MapReduce, Data Crawl, Execution Runtime, Data, Index, LRU Index Cache, “Lazy Pull” (shown twice)]
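The slide gives only diagram labels, so the following Python sketch is one plausible reading of "lazy pull" combined with an LRU index cache: index blocks are fetched from storage on first use and the least recently used block is evicted when the cache is full. The class name `LazyIndexCache`, the `load_block` stand-in for a storage read, and the block contents are all hypothetical.

```python
from collections import OrderedDict

class LazyIndexCache:
    """LRU cache that pulls index blocks from storage only on first use."""

    def __init__(self, loader, capacity):
        self._loader = loader        # callable: block_id -> block contents
        self._capacity = capacity
        self._blocks = OrderedDict()
        self.loads = 0               # number of pulls from backing storage

    def get(self, block_id):
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)   # mark as most recently used
            return self._blocks[block_id]
        block = self._loader(block_id)           # lazy pull on a miss
        self.loads += 1
        self._blocks[block_id] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)     # evict least recently used
        return block

def load_block(block_id):
    # Stand-in for reading an index block from the distributed filesystem.
    return {"id": block_id, "offsets": [block_id * 100]}

cache = LazyIndexCache(load_block, capacity=2)
cache.get(1)   # miss: pulled from storage
cache.get(2)   # miss: pulled from storage
cache.get(1)   # hit: served from memory
cache.get(3)   # miss: evicts block 2, the least recently used
cache.get(2)   # miss again, because block 2 was evicted
```

Nothing is loaded until a query asks for it, so startup cost stays low and memory use is bounded by the cache capacity rather than by the index size.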

Page 15: Big data meetup

Column Oriented Approach

Page 16: Big data meetup

Contact Info

Email: [address obscured in the source]

Private Beta

http://vertascale.com