Big Data Meetup
DESCRIPTION
Vertascale: Real-Time Big Data on Hadoop
AUDIO: http://files.meetup.com/7151882/Real-Time%20Big%20Data%20130312%20Geoff%20Hendrey.opus
TRANSCRIPT
Geoffrey Hendrey
@geoffhendrey
Architecture for real-time ad-hoc query on distributed filesystems
Motivation
• Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can’t scale
– Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
– Faster “time to answer” on Big Data
– Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)
Scaling search with classic sharding
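Before the side-system problem, a minimal sketch of what classic sharding means here may help: documents are hash-routed to shards at index time, and queries are scattered to all shards and merged. The shard count, hash choice, and merge step below are illustrative assumptions, not details from the talk.

```python
import zlib

# Minimal sketch of classic sharded search: hash-based routing at index
# time, scatter-gather at query time. All choices here are illustrative.

NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # per-shard inverted index

def shard_for(doc_id):
    """Route a document to a shard with a stable hash of its id."""
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

def index(doc_id, text):
    postings = shards[shard_for(doc_id)]
    for term in text.lower().split():
        postings.setdefault(term, set()).add(doc_id)

def search(term):
    """Scatter the query to every shard, then gather and merge results."""
    hits = set()
    for postings in shards:
        hits |= postings.get(term, set())
    return hits

index("log-1", "error in athens datacenter")
index("log-2", "checkout latency spike")
print(search("error"))  # {'log-1'}
```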
Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster
[Diagram: a Hadoop cluster wired to a separate search cluster, with an unclear (“?????”) synchronization path between them]
Classic “search toolkit”
• Built around the fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results (see the sketch after this list):
– TF-IDF
– Okapi BM-25
• Yet never able to fully realize Google-style search capability
• Issues:
– Phrase detection
– Pseudo-synonymy
– Open loop architecture
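As a hedged illustration of the on-the-fly ranking these toolkits perform, here is a minimal TF-IDF scorer. The tokenization and the exact weighting (raw term frequency times inverse document frequency) are simplified assumptions, not the talk’s implementation.

```python
import math

# Minimal TF-IDF scorer, sketching the on-the-fly ranking a classic
# fulltext toolkit performs. Tokenization and weighting are simplified.

docs = {
    "d1": "error in athens datacenter",
    "d2": "athens athens latency error",
    "d3": "checkout latency spike",
}

def tf(term, doc):
    words = docs[doc].split()
    return words.count(term) / len(words)

def idf(term):
    n = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / n) if n else 0.0

def score(query, doc):
    return sum(tf(t, doc) * idf(t) for t in query.split())

# Rank all documents for a query, highest score first.
query = "athens error"
for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
    print(doc, round(score(query, d := doc), 3))
```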
Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed-structure, and denormalized:
– Log lines
– JSON records
– CSV files
– Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance-based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation); both shapes are sketched after this list
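To make the two query shapes concrete, a hedged sketch over JSON log records; the record layout and field names are invented for illustration.

```python
import json

# Hedged sketch: a "needle in haystack" filter vs. a "haystack in
# haystack" aggregation over the same JSON records. Fields are invented.

lines = [
    '{"city": "athens", "status": 500, "latency_ms": 120}',
    '{"city": "boston", "status": 200, "latency_ms": 35}',
    '{"city": "athens", "status": 200, "latency_ms": 40}',
]
parsed = [json.loads(l) for l in lines]

# Needle in haystack: find the specific failing record (support, debugging).
needles = [r for r in parsed if r["status"] == 500]

# Haystack in haystack: summarize a whole segment (analytics, segmentation).
athens = [r for r in parsed if r["city"] == "athens"]
avg_latency = sum(r["latency_ms"] for r in athens) / len(athens)

print(needles)
print(avg_latency)
```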
Dremel MPP query execution tree
Finer points of Dremel architecture
• MapReduce-friendly
• In-situ approach is DFS-friendly
• Excels at aggregation; not so much for needle-in-haystack
• Column storage format accelerates MapReduce (less extraneous data pushed through); see the sketch after this list
• But in some regards still a “side system”
• Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard:
– Complex (operationally and with respect to query execution)
– Queries can execute quickly…on huge clusters
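A hedged sketch of why columnar layout pushes less extraneous data through a scan: to aggregate one field, a column store reads only that column, while a row store reads every field of every row. The layout below is a toy in-memory model, not Dremel’s actual format.

```python
# Toy model of row vs. column storage; not Dremel's actual format.

rows = [
    ("athens", 500, 120),
    ("boston", 200, 35),
    ("athens", 200, 40),
]

# Row-oriented scan: every field of every row passes through the scan,
# even though only the latency field is needed.
total = sum(latency for (_city, _status, latency) in rows)

# Column-oriented layout: each column is stored contiguously, so an
# aggregation touches only the one column it needs.
columns = {
    "city":    [r[0] for r in rows],
    "status":  [r[1] for r in rows],
    "latency": [r[2] for r in rows],
}
total_columnar = sum(columns["latency"])

assert total == total_columnar
print(total_columnar)
```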
Crawled In-Situ Index Architecture
[Diagram: on Hadoop, a MapReduce crawl job reads application data in HDFS and builds an in-situ index, which a simple search application queries]
Benefits to crawled In-Situ index
• No changes to application data format:
– CSV
– JSON
– SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail (a sketch follows this list)
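A hedged sketch of using the MapReduce “hammer” to build a disposable in-situ index: mappers emit (term, location) pairs pointing back into the original files, and the reduce step groups them into postings lists. File names and the tokenizer are illustrative; this is not Vertascale’s implementation.

```python
from collections import defaultdict

# Hedged sketch: build an in-situ index with a map/reduce pattern.
# The index stores pointers (file, byte offset) back into the data;
# the data files themselves are never rewritten or reformatted.

files = {
    "logs/a.csv": "athens,500,120\nboston,200,35\n",
    "logs/b.csv": "athens,200,40\n",
}

def map_phase():
    """Emit (term, (file, offset)) for every field of every line."""
    for name, text in files.items():
        offset = 0
        for line in text.splitlines(keepends=True):
            for field in line.strip().split(","):
                yield field, (name, offset)
            offset += len(line)

def reduce_phase(pairs):
    """Group locations by term into postings lists."""
    index = defaultdict(list)
    for term, location in pairs:
        index[term].append(location)
    return index

index = reduce_phase(map_phase())
print(index["athens"])  # [('logs/a.csv', 0), ('logs/b.csv', 0)]
```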
Architect for Elasticity
[Diagram: application data lives in AWS S3; an Elastic MapReduce crawl job builds the index; EC2 m1.large instances serve queries, reading S3 over HTTP via JetS3t]
Interesting: you don’t actually need to have Hadoop installed…
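Since the point is that the query side can read S3 directly over HTTP, here is a hedged sketch using ranged GETs with boto3; the bucket and key names are invented, and boto3 itself is an assumption (the talk’s stack uses JetS3t, a Java S3 client, for the same idea).

```python
import boto3

# Hedged sketch: read index blocks straight from S3 over HTTP with
# ranged GETs, with no Hadoop installation on the query node.
# Bucket/key names are invented for illustration.

s3 = boto3.client("s3")

def read_block(bucket, key, offset, length):
    """Fetch just the bytes we need, not the whole object."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()

# e.g. pull one 4 KB index block on demand
block = read_block("example-bucket", "indexes/logs.idx", 0, 4096)
```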
Declarative Crawl Indexing
[Diagram: the same crawl architecture as above, with the crawler reading declarative indexing instructions from an in-situ file alongside the data]
Parse.json:
{
  "filter": "column[4]==\"athens\""
}
• Indexer reads declarative instructions from an in-situ file (a sketch follows)
• “Pull” vs. traditional “push” indexing approach
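A hedged sketch of how a crawler might act on the Parse.json filter above: it pulls the instruction file found next to the data and indexes only matching rows. Only the "filter" key comes from the slide; the instruction schema beyond it and the filter evaluator are assumptions.

```python
import json

# Hedged sketch: a crawler "pulls" its instructions from an in-situ file
# (like Parse.json above) instead of having data "pushed" into it.

instructions = json.loads('{"filter": "column[4]==\\"athens\\""}')

def matches(column, filter_expr):
    # Evaluate the filter against one row; `column` is the row's fields.
    # eval() is for sketch brevity only; a real indexer would parse safely.
    return eval(filter_expr, {"column": column})

rows = [
    ["2013-03-12", "GET", "/cart", "200", "athens"],
    ["2013-03-12", "GET", "/cart", "200", "boston"],
]

indexed = [row for row in rows if matches(row, instructions["filter"])]
print(indexed)  # only the "athens" row is indexed
```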
Thin index
• Index size is small because the data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated inside the index
[Diagram: the in-situ index holds only pointers into the data; the data itself stays in HDFS and is not copied into the index]
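A hedged sketch of what “thin” means here: the index stores (file, offset) pointers, and a lookup seeks back into the original file to read the record, so no record bytes are duplicated in the index. The file layout and pointer format are illustrative.

```python
import io

# Hedged sketch of a thin index: postings hold (file, offset) pointers,
# and lookups seek into the original data rather than reading a copy
# stored inside the index. Layout and pointer format are illustrative.

data = {"logs/a.csv": io.BytesIO(b"athens,500,120\nboston,200,35\n")}

# Thin posting lists: term -> pointers, no record bytes duplicated.
index = {"athens": [("logs/a.csv", 0)], "boston": [("logs/a.csv", 15)]}

def lookup(term):
    for name, offset in index.get(term, []):
        f = data[name]
        f.seek(offset)          # jump straight to the record...
        yield f.readline()      # ...and read it from the data itself

print(list(lookup("boston")))  # [b'boston,200,35\n']
```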
Lazy data loading
[Diagram: at query time, the execution runtime lazily pulls index blocks from HDFS into an LRU index cache rather than loading everything up front]
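A hedged sketch of the lazy-pull pattern with an LRU index cache: blocks are fetched from the filesystem only when a query first touches them, and the least recently used block is evicted when the cache is full. The capacity and the fetch function are invented for illustration.

```python
from collections import OrderedDict

# Hedged sketch: an LRU cache that lazily pulls index blocks on first
# access. Capacity and the fetch function are invented for illustration.

CAPACITY = 2
cache = OrderedDict()  # block_id -> block bytes, oldest first

def fetch_from_dfs(block_id):
    """Stand-in for a ranged read from HDFS/S3."""
    return f"<bytes of block {block_id}>"

def get_block(block_id):
    if block_id in cache:
        cache.move_to_end(block_id)        # mark as recently used
        return cache[block_id]
    block = fetch_from_dfs(block_id)       # lazy pull on first touch
    cache[block_id] = block
    if len(cache) > CAPACITY:
        cache.popitem(last=False)          # evict least recently used
    return block

get_block("idx-0"); get_block("idx-1")
get_block("idx-0")          # hit, refreshes recency
get_block("idx-2")          # evicts idx-1
print(list(cache))          # ['idx-0', 'idx-2']
```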
Column Oriented Approach
Contact Info
Email:
Private Beta
http://vertascale.com