Big Data 2.0 - Milwaukee Big Data User Group Presentation
TRANSCRIPT
Big Data 2.0
Milwaukee Big Data Users Group
12.1.2014
DBMS Technology Overview
Goal
• Provide a technology recommendation for serving reporting needs for the next 3 – 5 years
• Explore different technologies for their suitability for a strategic reporting data platform
DBMS Technology Overview
Vendor agnostic approach
• Vendor agnostic DBMS technologies evaluated
– Categories
  • RDBMS
    – Row based vs column based
  • In-memory database
    – Row based vs column based
  • NoSQL
    – Document based (Disk)
    – Key value based (IMDG)
    – Graph based (IMDG)
– Criteria
  • Overall design
  • Pros/cons
DBMS Technology Overview
Representative vendor evaluation criteria
• …followed by quick evaluation of two vendors representing each technology
– Thought leadership
– Market share / # of production customers
– Capacity / scalability
– Functionality
– Expertise availability
– Resilience
– Cost (license, infrastructure & expertise)
– Interface compatibility (drop-in-ability)
DBMS Technology Overview
Open Discussion on drop-in-ability
• Re-tooling interfaces is expensive
– Focus is on query/reporting tools (in my evaluation)
– List of possible solutions drastically reduced by this criterion
– SQL compatibility (very important syntactic sugar)
– ACID compliance (dual use technology for OLTP needs)
• A cost-effective, performant, resilient solution that requires interface re-tooling is DOA for my client’s environment
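A minimal sketch of what the SQL compatibility and ACID criteria buy a reporting tool, using Python's built-in sqlite3 module purely as a stand-in for whichever DBMS is chosen. The shipments table, its columns, and the simulated failure are hypothetical and not from the evaluation.

```python
import sqlite3

# In-memory database as a stand-in for the candidate DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (product TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?)",
    [("Gown112", 10), ("Mask221", 5)],
)
conn.commit()


def total_qty(connection):
    """A plain SQL report query: any SQL-compatible engine can serve it."""
    return connection.execute("SELECT SUM(qty) FROM shipments").fetchone()[0]


total_before = total_qty(conn)

# Atomicity: the update and the failure sit in one transaction, so the
# update is rolled back when the block exits with an exception.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE shipments SET qty = qty + 100 WHERE product = 'Gown112'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

total_after_failure = total_qty(conn)
```

The report query never changes, which is the point of drop-in-ability: only the connection line would differ between SQL-compatible candidates.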
Striking phrases
• Disk is the new tape, memory is the new disk
• IMDGs (in-memory data grids) are increasingly being referred to as Big Data 2.0
RDBMS
row based
• OLAP needs typically serviced by partitioning (row & column)
• 30 years old (proven technology)
• IMDB implementations typically have same pros/cons, although cost and performance characteristics are different
• Pros
– Great OLTP performance
– Efficient at whole-row operations
• Cons
– Inefficient at data set operations
– Scalability is typically not linear
Row-based
Data Cols   Time        Location   Product   Vendor
Block 1     2/23 0900   IL023      Gown112   ML
Block 2     2/23 0423   OH12       Mask221   123
Block 3     2/24 1543   CN881      Swab993   456
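The row-based layout above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (one row per block, blocks as tuples in a list); fetch_row, scan_column and the blocks_read counter are hypothetical names, not part of any product.

```python
# Each block holds one complete row, as in the table above.
COLUMNS = ("time", "location", "product", "vendor")
row_blocks = [
    ("2/23 0900", "IL023", "Gown112", "ML"),
    ("2/23 0423", "OH12", "Mask221", "123"),
    ("2/24 1543", "CN881", "Swab993", "456"),
]


def fetch_row(block_index):
    """Whole-row operation: a single block read returns every column."""
    return dict(zip(COLUMNS, row_blocks[block_index]))


def scan_column(name):
    """Data set operation: every block must be read to collect one column."""
    position = COLUMNS.index(name)
    values = []
    blocks_read = 0
    for block in row_blocks:
        blocks_read += 1
        values.append(block[position])
    return values, blocks_read
```

Fetching one row touches one block, while reading a single column touches all of them, which is the pro/con split listed on this slide.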
RDBMS
column based
• Optimized for OLAP needs as it’s optimized to answer questions on data characteristics
• Great performance on aggregate functions (avg, count, sum, min, max)
• IMDB implementations typically have same pros/cons, although cost and performance characteristics are different
• Pros
– Aggregate functions are very fast as entire column can be fetched quickly
– Efficient at data set operations
– Easily compressed, especially for data that is sparsely populated
• Cons
– Inefficient at retrieving many columns of a single row
– Row functions are slower
Column-based
Block 0   Time       2/23 0900   2/23 0423   2/24 1543
Block 1   Location   IL023       OH12        CN881
Block 2   Product    Gown112     Mask221     Swab993
Block 3   Vendor     ML          123         456
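A rough Python sketch of the column-based layout above, assuming one column per block. It shows why an aggregate needs only one block, why rebuilding a row needs all of them, and how a sparsely populated column compresses. The priority column and the run-length scheme are illustrative assumptions, not something the slide specifies.

```python
from itertools import groupby

# Each block holds one complete column, as in the table above.
column_blocks = {
    "time": ["2/23 0900", "2/23 0423", "2/24 1543"],
    "location": ["IL023", "OH12", "CN881"],
    "product": ["Gown112", "Mask221", "Swab993"],
    "vendor": ["ML", "123", "456"],
}


def count_values(name):
    """Aggregate over one column: only that column's block is read."""
    return len(column_blocks[name])


def fetch_row(index):
    """Row function: one value must be pulled from every column block."""
    return {name: values[index] for name, values in column_blocks.items()}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical sparse column: mostly empty, so it collapses to three runs.
sparse_priority = [None, None, None, "RUSH", None, None]
encoded = run_length_encode(sparse_priority)
```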
NoSQL
document/XML based
• Focus is typically on sharding strategy as opposed to up-front data modeling (models typically evolve greatly during construction)
• Similar to key-value stores, where values are stored in a standardized structure (although document stores keep metadata as well)
• An example of data in a document database:
– {officeName:"3Pillar Noida", {Street:"B-25", City:"Noida", State:"UP", Pincode:"201301"}}
– {officeName:"3Pillar Timisoara", {Boulevard:"Coriolan Brediceanu No. 10", Block:"B, Ist Floor", City:"Timisoara", Pincode:"300011"}}
– {officeName:"3Pillar Cluj", {Latitude:"40.748328", Longitude:"-73.985560"}}
• Pros
– Not limited to querying by keys (can query inside documents using JSON/XML query mechanisms)
– Maps well to semi-structured or variable structured data
• Cons
– Sharding strategy can be challenging
– Doesn’t support relations (no RI), as opposed to key value or graph stores
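Querying inside documents, rather than only by key, can be sketched with plain Python dictionaries standing in for a document store. The address key and the find helper are assumptions added for illustration (the slide's nested objects have no key); real document databases expose their own query languages.

```python
# The three office documents from the slide; "address" is an assumed key.
offices = [
    {"officeName": "3Pillar Noida",
     "address": {"Street": "B-25", "City": "Noida", "State": "UP", "Pincode": "201301"}},
    {"officeName": "3Pillar Timisoara",
     "address": {"Boulevard": "Coriolan Brediceanu No. 10", "Block": "B, Ist Floor",
                 "City": "Timisoara", "Pincode": "300011"}},
    {"officeName": "3Pillar Cluj",
     "address": {"Latitude": "40.748328", "Longitude": "-73.985560"}},
]

_MISSING = object()  # sentinel for "this document has no such field"


def find(docs, path, value):
    """Return documents whose field at a dotted path equals value.

    Documents that lack the path are skipped, which is how variable
    structure is tolerated without a fixed schema.
    """
    matches = []
    for doc in docs:
        node = doc
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = _MISSING
                break
        if node is not _MISSING and node == value:
            matches.append(doc)
    return matches


noida_offices = find(offices, "address.City", "Noida")
```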
NoSQL
IMDG (common to key-value / graph)
• IMDGs referred to as “Big Data 2.0”
• Host data in memory and distribute across cluster of commodity servers
• Employ an object-oriented data model that provides read/write times << 1 ms
• As data is stored in a virtual memory pool, parallel data computations are easily performed
• As in document databases, focus is on sharding strategy as opposed to up-front physical data modeling
• Majority of implementations utilize JVMs (although a handful of .NET implementations are out there)
• GC (garbage collection), specifically the unpredictability of GC, is a major concern
– Vendors utilize off-heap storage to alleviate this by moving LRU data to off-heap JVMs, relying on high-speed messaging for data transport
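The distribution and parallel-computation points can be illustrated with a toy grid in Python. This is a single-process sketch under strong assumptions: each "node" is just a dictionary, keys are placed with a CRC32 hash, and threads stand in for cluster members. The Grid class and its methods are hypothetical and do not model any vendor's product.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class Grid:
    """Toy in-memory data grid: keys are hash-partitioned across nodes."""

    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash decides which node owns the key (the sharding strategy).
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

    def parallel_sum(self):
        # Each node sums its own partition; the partial sums are then combined.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = pool.map(lambda node: sum(node.values()), self.nodes)
            return sum(partials)


grid = Grid(node_count=3)
for i in range(100):
    grid.put(f"order-{i}", i)
```

Every read or write goes straight to the owning node, and the aggregate runs where the data lives, which is why the sharding key matters more than the physical data model.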
NoSQL
IMDG (key-value)
• Typically stored as a set of distributable maps
• Pros
– Data distribution is designed from the ground up
– Keys and values are Java (or .Net) objects
– No bias between OLTP and OLAP
• Cons
– Alternate lookup mechanisms require a map with an alternate key (although main data payload can be shared as values are objects that support multiple pointers)
– Expertise is typically harder to find (on characteristics of memory structure behavior at larger sizes)
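The alternate-key con can be sketched with two Python dictionaries that point at the same value objects. The Shipment class, its fields, and the put helper are hypothetical; the sketch only shows that a second lookup path needs a second map while the payload itself is stored once.

```python
from dataclasses import dataclass


@dataclass
class Shipment:
    shipment_id: str
    product: str
    location: str


by_id = {}       # primary map: shipment_id -> Shipment
by_product = {}  # alternate-key map: product -> list of the same Shipment objects


def put(shipment):
    """Register one payload object under both keys without copying it."""
    by_id[shipment.shipment_id] = shipment
    by_product.setdefault(shipment.product, []).append(shipment)


put(Shipment("S1", "Gown112", "IL023"))
put(Shipment("S2", "Mask221", "OH12"))
put(Shipment("S3", "Gown112", "CN881"))

# Both maps reference one object, so an update through either is seen by both.
by_id["S1"].location = "WI001"
```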
NoSQL
IMDG (graph)
• Allow a set of nodes (object instances) with dynamic properties (cols/attributes) to be arbitrarily linked to other nodes through edges (associations)
• Each node only knows its adjacent nodes
• As the number of nodes increases, cost of a local hop remains constant
• Whereas an RDBMS is optimized for aggregation, a graph database is optimized for connections
• Fastest growth area in NoSQL in the last year: 250%
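A small Python sketch of the node and edge model described above. The Node class, the edge labels, and the sample data are assumptions for illustration; the point is that a hop only inspects the edges stored on the node itself, so its cost does not depend on how many other nodes exist.

```python
class Node:
    """A node with dynamic properties that knows only its adjacent nodes."""

    def __init__(self, **properties):
        self.properties = dict(properties)  # arbitrary attributes per node
        self.edges = []                     # (label, target Node) pairs

    def link(self, label, other):
        self.edges.append((label, other))

    def neighbors(self, label=None):
        # A local hop: only this node's own edge list is examined.
        return [target for edge_label, target in self.edges
                if label is None or edge_label == label]


vendor = Node(name="ML")
product = Node(name="Gown112", category="apparel")
location = Node(code="IL023")

vendor.link("SUPPLIES", product)
product.link("STOCKED_AT", location)

# Two hops: vendor -> product -> location.
stocked_codes = [
    place.properties["code"]
    for item in vendor.neighbors("SUPPLIES")
    for place in item.neighbors("STOCKED_AT")
]
```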
NoSQL
IMDG (graph cont’d)
• 60% of Facebook graph is hosted on one instance of Neo4J
• Pros
– Powerful general purpose (reusable) data model
– Connected data locally indexed
– Easy to query
– Optimized for recursive structures (think BoM)
– Great at use cases with complex relationships (supply chain management)
• Cons
– Sharding strategy is difficult
– Requires re-wiring your brain (think object model instead of data model)
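The BoM (bill of materials) case mentioned in the pros is a recursive walk over "contains" edges. The sketch below uses a plain dictionary as the graph; the part names and quantities are invented for illustration and the explode function is a hypothetical helper.

```python
# Hypothetical bill of materials: parent part -> [(child part, quantity per parent)].
bom = {
    "kit": [("gown", 2), ("mask", 4)],
    "mask": [("strap", 2), ("filter", 1)],
}


def explode(part, quantity=1):
    """Follow the edges recursively and total up the leaf parts required."""
    children = bom.get(part)
    if not children:
        return {part: quantity}  # a leaf part: nothing further to expand
    totals = {}
    for child, per_parent in children:
        for leaf, needed in explode(child, quantity * per_parent).items():
            totals[leaf] = totals.get(leaf, 0) + needed
    return totals


kit_requirements = explode("kit")
```

In a relational model the same question typically needs a recursive query or one self-join per level, whereas here each level is just another hop.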
Particular vendor evaluations
<<Vendor evaluation.xls>>
Recap
• In addition to normal criteria (scalability, functionality, cost, etc.), drop-in-ability should be considered as well
• Niche technologies are now available for more mainstream use cases, due to falling hardware prices
Questions/Comments
?
Thank you
… for your time
Michael Vogt
Director, Data [email protected]
(o) 414.347.1303 or 312.985.8100
(c) 312.772.4762