Big Data Meetup
DESCRIPTION
Vertascale: Real-Time Big Data on Hadoop
AUDIO: http://files.meetup.com/7151882/Real-Time%20Big%20Data%20130312%20Geoff%20Hendrey.opus
TRANSCRIPT
Geoffrey Hendrey
@geoffhendrey
Architecture for real-time ad-hoc query on distributed filesystems
Motivation
• Big Data is more opaque than small data
– Spreadsheets choke
– BI tools can’t scale
– Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
– Faster “time to answer” on Big Data
– Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)
Scaling search with classic sharding
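Before the side-system problem, a minimal sketch of what classic sharding means here may help: documents are hash-routed to shards at index time, and queries are scattered to all shards and merged. The shard count, hash choice, and merge step below are illustrative assumptions, not details from the talk.

```python
import zlib

# Minimal sketch of classic sharded search: hash-based routing at index
# time, scatter-gather at query time. All choices here are illustrative.

NUM_SHARDS = 4
shards = [{} for _ in range(NUM_SHARDS)]  # per-shard inverted index

def shard_for(doc_id):
    """Route a document to a shard with a stable hash of its id."""
    return zlib.crc32(doc_id.encode()) % NUM_SHARDS

def index(doc_id, text):
    postings = shards[shard_for(doc_id)]
    for term in text.lower().split():
        postings.setdefault(term, set()).add(doc_id)

def search(term):
    """Scatter the query to every shard, then gather and merge results."""
    hits = set()
    for postings in shards:
        hits |= postings.get(term, set())
    return hits

index("log-1", "error in athens datacenter")
index("log-2", "checkout latency spike")
print(search("error"))  # {'log-1'}
```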
Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster
[Diagram: a Hadoop cluster wired to a separate search cluster, with an unclear (“?????”) synchronization path between them]
Classic “search toolkit”
• Built around the fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results (see the sketch after this list):
– TF-IDF
– Okapi BM-25
• Yet never able to fully realize Google-style search capability
• Issues:
– Phrase detection
– Pseudo-synonymy
– Open loop architecture
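As a hedged illustration of the on-the-fly ranking these toolkits perform, here is a minimal TF-IDF scorer. The tokenization and the exact weighting (raw term frequency times inverse document frequency) are simplified assumptions, not the talk’s implementation.

```python
import math

# Minimal TF-IDF scorer, sketching the on-the-fly ranking a classic
# fulltext toolkit performs. Tokenization and weighting are simplified.

docs = {
    "d1": "error in athens datacenter",
    "d2": "athens athens latency error",
    "d3": "checkout latency spike",
}

def tf(term, doc):
    words = docs[doc].split()
    return words.count(term) / len(words)

def idf(term):
    n = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / n) if n else 0.0

def score(query, doc):
    return sum(tf(t, doc) * idf(t) for t in query.split())

# Rank all documents for a query, highest score first.
query = "athens error"
for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
    print(doc, round(score(query, d := doc), 3))
```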
Big data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed-structure, and denormalized:
– Log lines
– JSON records
– CSV files
– Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance-based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation); both shapes are sketched after this list
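To make the two query shapes concrete, a hedged sketch over JSON log records; the record layout and field names are invented for illustration.

```python
import json

# Hedged sketch: a "needle in haystack" filter vs. a "haystack in
# haystack" aggregation over the same JSON records. Fields are invented.

lines = [
    '{"city": "athens", "status": 500, "latency_ms": 120}',
    '{"city": "boston", "status": 200, "latency_ms": 35}',
    '{"city": "athens", "status": 200, "latency_ms": 40}',
]
parsed = [json.loads(l) for l in lines]

# Needle in haystack: find the specific failing record (support, debugging).
needles = [r for r in parsed if r["status"] == 500]

# Haystack in haystack: summarize a whole segment (analytics, segmentation).
athens = [r for r in parsed if r["city"] == "athens"]
avg_latency = sum(r["latency_ms"] for r in athens) / len(athens)

print(needles)
print(avg_latency)
```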
Dremel MPP query execution tree
Finer points of Dremel architecture
• MapReduce-friendly
• In-situ approach is DFS-friendly
• Excels at aggregation; not so much for needle-in-haystack
• Column storage format accelerates MapReduce (less extraneous data pushed through); see the sketch after this list
• But in some regards still a “side system”
• Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard:
– Complex (operationally and with respect to query execution)
– Queries can execute quickly…on huge clusters
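A hedged sketch of why columnar layout pushes less extraneous data through a scan: to aggregate one field, a column store reads only that column, while a row store reads every field of every row. The layout below is a toy in-memory model, not Dremel’s actual format.

```python
# Toy model of row vs. column storage; not Dremel's actual format.

rows = [
    ("athens", 500, 120),
    ("boston", 200, 35),
    ("athens", 200, 40),
]

# Row-oriented scan: every field of every row passes through the scan,
# even though only the latency field is needed.
total = sum(latency for (_city, _status, latency) in rows)

# Column-oriented layout: each column is stored contiguously, so an
# aggregation touches only the one column it needs.
columns = {
    "city":    [r[0] for r in rows],
    "status":  [r[1] for r in rows],
    "latency": [r[2] for r in rows],
}
total_columnar = sum(columns["latency"])

assert total == total_columnar
print(total_columnar)
```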
Crawled In-Situ Index Architecture
[Diagram: on Hadoop, a MapReduce crawl job reads application data in HDFS and builds an in-situ index, which a simple search application queries]
Benefits to crawled In-Situ index
• No changes to application data format:
– CSV
– JSON
– SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail (a sketch follows this list)
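A hedged sketch of using the MapReduce “hammer” to build a disposable in-situ index: mappers emit (term, location) pairs pointing back into the original files, and the reduce step groups them into postings lists. File names and the tokenizer are illustrative; this is not Vertascale’s implementation.

```python
from collections import defaultdict

# Hedged sketch: build an in-situ index with a map/reduce pattern.
# The index stores pointers (file, byte offset) back into the data;
# the data files themselves are never rewritten or reformatted.

files = {
    "logs/a.csv": "athens,500,120\nboston,200,35\n",
    "logs/b.csv": "athens,200,40\n",
}

def map_phase():
    """Emit (term, (file, offset)) for every field of every line."""
    for name, text in files.items():
        offset = 0
        for line in text.splitlines(keepends=True):
            for field in line.strip().split(","):
                yield field, (name, offset)
            offset += len(line)

def reduce_phase(pairs):
    """Group locations by term into postings lists."""
    index = defaultdict(list)
    for term, location in pairs:
        index[term].append(location)
    return index

index = reduce_phase(map_phase())
print(index["athens"])  # [('logs/a.csv', 0), ('logs/b.csv', 0)]
```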
Architect for Elasticity
[Diagram: application data lives in AWS S3; an Elastic MapReduce crawl job builds the index; EC2 m1.large instances serve queries, reading S3 over HTTP via JetS3t]
Interesting: you don’t actually need to have Hadoop installed…
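Since the point is that the query side can read S3 directly over HTTP, here is a hedged sketch using ranged GETs with boto3; the bucket and key names are invented, and boto3 itself is an assumption (the talk’s stack uses JetS3t, a Java S3 client, for the same idea).

```python
import boto3

# Hedged sketch: read index blocks straight from S3 over HTTP with
# ranged GETs, with no Hadoop installation on the query node.
# Bucket/key names are invented for illustration.

s3 = boto3.client("s3")

def read_block(bucket, key, offset, length):
    """Fetch just the bytes we need, not the whole object."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()

# e.g. pull one 4 KB index block on demand
block = read_block("example-bucket", "indexes/logs.idx", 0, 4096)
```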
Declarative Crawl Indexing
[Diagram: the same crawl architecture as above, with the crawler reading declarative indexing instructions from an in-situ file alongside the data]
Parse.json:
{
  "filter": "column[4]==\"athens\""
}
• Indexer reads declarative instructions from an in-situ file (a sketch follows)
• “Pull” vs. traditional “push” indexing approach
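A hedged sketch of how a crawler might act on the Parse.json filter above: it pulls the instruction file found next to the data and indexes only matching rows. Only the "filter" key comes from the slide; the instruction schema beyond it and the filter evaluator are assumptions.

```python
import json

# Hedged sketch: a crawler "pulls" its instructions from an in-situ file
# (like Parse.json above) instead of having data "pushed" into it.

instructions = json.loads('{"filter": "column[4]==\\"athens\\""}')

def matches(column, filter_expr):
    # Evaluate the filter against one row; `column` is the row's fields.
    # eval() is for sketch brevity only; a real indexer would parse safely.
    return eval(filter_expr, {"column": column})

rows = [
    ["2013-03-12", "GET", "/cart", "200", "athens"],
    ["2013-03-12", "GET", "/cart", "200", "boston"],
]

indexed = [row for row in rows if matches(row, instructions["filter"])]
print(indexed)  # only the "athens" row is indexed
```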
Thin index
• Index size is small because the data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated inside the index
[Diagram: the in-situ index holds only pointers into the data; the data itself stays in HDFS and is not copied into the index]
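A hedged sketch of what “thin” means here: the index stores (file, offset) pointers, and a lookup seeks back into the original file to read the record, so no record bytes are duplicated in the index. The file layout and pointer format are illustrative.

```python
import io

# Hedged sketch of a thin index: postings hold (file, offset) pointers,
# and lookups seek into the original data rather than reading a copy
# stored inside the index. Layout and pointer format are illustrative.

data = {"logs/a.csv": io.BytesIO(b"athens,500,120\nboston,200,35\n")}

# Thin posting lists: term -> pointers, no record bytes duplicated.
index = {"athens": [("logs/a.csv", 0)], "boston": [("logs/a.csv", 15)]}

def lookup(term):
    for name, offset in index.get(term, []):
        f = data[name]
        f.seek(offset)          # jump straight to the record...
        yield f.readline()      # ...and read it from the data itself

print(list(lookup("boston")))  # [b'boston,200,35\n']
```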
Lazy data loading
[Diagram: at query time, the execution runtime lazily pulls index blocks from HDFS into an LRU index cache rather than loading everything up front]
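A hedged sketch of the lazy-pull pattern with an LRU index cache: blocks are fetched from the filesystem only when a query first touches them, and the least recently used block is evicted when the cache is full. The capacity and the fetch function are invented for illustration.

```python
from collections import OrderedDict

# Hedged sketch: an LRU cache that lazily pulls index blocks on first
# access. Capacity and the fetch function are invented for illustration.

CAPACITY = 2
cache = OrderedDict()  # block_id -> block bytes, oldest first

def fetch_from_dfs(block_id):
    """Stand-in for a ranged read from HDFS/S3."""
    return f"<bytes of block {block_id}>"

def get_block(block_id):
    if block_id in cache:
        cache.move_to_end(block_id)        # mark as recently used
        return cache[block_id]
    block = fetch_from_dfs(block_id)       # lazy pull on first touch
    cache[block_id] = block
    if len(cache) > CAPACITY:
        cache.popitem(last=False)          # evict least recently used
    return block

get_block("idx-0"); get_block("idx-1")
get_block("idx-0")          # hit, refreshes recency
get_block("idx-2")          # evicts idx-1
print(list(cache))          # ['idx-0', 'idx-2']
```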
Column Oriented Approach
Contact Info
Email:
Private Beta
http://vertascale.com