![Page 1: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/1.jpg)
Evolution of Big Data Architectures
Architecture Summit, Aug 2012
Ashish Thusoo
![Page 2: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/2.jpg)
Outline
Demand for Big Data
Architectural Trade Offs and Evolution
Where next?
![Page 3: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/3.jpg)
The Changing Planet
3 Technology Drivers
Devices
Infrastructure
Applications
![Page 4: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/4.jpg)
Evolution: Devices
![Page 5: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/5.jpg)
Evolution: Devices
Key Capabilities
Connected
Location Aware
Sensory & Powerful
![Page 6: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/6.jpg)
Evolution: Devices
![Page 7: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/7.jpg)
Evolution: Connectivity
Mobile Subscription Density 2004
![Page 8: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/8.jpg)
Evolution: Connectivity
Mobile Subscription Density 2010
![Page 9: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/9.jpg)
Evolution: Bandwidth
![Page 10: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/10.jpg)
Evolution: Applications
Salient Traits
Cloud based
Web scale
![Page 11: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/11.jpg)
Explosion in Data
Big Data
Volume
Velocity
Variety
![Page 12: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/12.jpg)
Big Data: Volume
Volume:
2011: 1.8 zettabytes of digital universe
2009 - 2020: 35 zettabytes
![Page 13: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/13.jpg)
Big Data: Velocity
Velocity
340 million tweets per day
72 hours of video uploaded every minute on YouTube
2.9 million emails a second
![Page 14: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/14.jpg)
Big Data: Variety
Variety
Video
Pictures
Application Logs
etc. etc...
![Page 15: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/15.jpg)
Disruptive Architectures
![Page 16: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/16.jpg)
Disruptions in Data Arch
Change in Focus (1990s -> 2000s)
Performance -> Scalability & Availability
Rigid/Structured -> Flexible/Semistructured
![Page 17: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/17.jpg)
Scalability & Availability
![Page 18: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/18.jpg)
Towards Scalability
Problem
10K ops/sec -> 1M ops/sec
TB of data -> PB of data
![Page 19: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/19.jpg)
Towards Scalability
Solution: SHARDING (Divide and Conquer)
![Page 20: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/20.jpg)
Towards Scalability
How do we quickly route a record to a shard?
fn( )
- Consistent Hashing
- Mapping Table
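As an illustration of the routing function fn( ), here is a minimal consistent hashing sketch. The class name `ConsistentHashRing`, the shard names, the use of MD5, and the `points_per_shard` parameter are all assumptions made for this example, not details from the talk. A mapping table would instead be a plain lookup from key range to shard.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Hypothetical router: each shard owns several points on a hash ring."""

    def __init__(self, shards, points_per_shard=64):
        self._ring = []  # (position, shard) pairs, sorted by position
        for shard in shards:
            for i in range(points_per_shard):
                self._ring.append((_hash(f"{shard}#{i}"), shard))
        self._ring.sort()
        self._positions = [position for position, _ in self._ring]

    def shard_for(self, record_key: str) -> str:
        """Return the shard owning the first ring point at or after the key."""
        index = bisect.bisect_left(self._positions, _hash(record_key))
        # Wrap around to the first point when the key hashes past the last one.
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
owner = ring.shard_for("user:42")  # one of the three shard names
```

The appeal of this scheme is that adding a shard only moves the keys that land on the new shard's ring points; every other key keeps its owner.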
![Page 21: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/21.jpg)
Towards Scalability
What happens if part of the record is in one shard and part in another?
![Page 22: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/22.jpg)
Towards Scalability
Keep it Simple: Application deals with atomicity & consistency semantics
![Page 23: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/23.jpg)
Towards Availability
What if my shard is down? Where do I put my record?
![Page 24: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/24.jpg)
Towards Availability
Let's just replicate the shards and pray that one is available :)
![Page 25: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/25.jpg)
Towards Availability
Replication strategies
What should be the number of replicas?
How to rebuild a replica?
How to propagate a record to a replica?
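To make the three replication questions concrete, the following is a small in-memory sketch. The `Replica` and `ReplicatedShard` classes, the default of three replicas, and the "write to every live replica, read from the first live one" policy are assumptions chosen for illustration; real systems pick among many other strategies.

```python
class Replica:
    """Hypothetical in-memory replica of one shard."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ReplicatedShard:
    """One logical shard backed by several replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica(f"replica-{i}") for i in range(replica_count)]

    def _live(self):
        return [replica for replica in self.replicas if replica.up]

    def put(self, key, value):
        """Propagate the record to every live replica; fail if none is up."""
        live = self._live()
        if not live:
            raise RuntimeError("no replica available")
        for replica in live:
            replica.data[key] = value
        return len(live)  # number of replicas that received the write

    def get(self, key):
        """Read from the first live replica."""
        live = self._live()
        if not live:
            raise RuntimeError("no replica available")
        return live[0].data.get(key)

    def rebuild(self, index):
        """Bring a replica back by copying the state of a live peer."""
        live = self._live()
        if not live:
            raise RuntimeError("no live replica to copy from")
        target = self.replicas[index]
        target.data = dict(live[0].data)
        target.up = True
```

In this sketch a replica that was down during a write stays stale until `rebuild` copies a live peer's state, which is exactly the gap the rebuild question points at.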
![Page 26: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/26.jpg)
1990s vs 2000s
Different Focus:
1990s (Raw Performance)
Optimal I/O structures
Cache Sensitive Algorithms
2000s (Scalability, Availability)
Sharding
Replication
![Page 27: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/27.jpg)
Flexibility/Semi-structure
![Page 28: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/28.jpg)
Towards Flexibility
Problem
Does structure in a database make it slower to write applications (sprint vs waterfall model)?
What if my data is not records and tables?
![Page 29: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/29.jpg)
Towards Flexibility
How does knowing my record structure help my data system?
Helps to optimize execution plans
Helps to optimize my storage layouts
Trade off?
Application change means database schema change, rebuilding indexes etc. etc.
![Page 30: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/30.jpg)
Towards Flexibility
Most of my operations are simple lookups, range lookups and updates
Since the execution is simple we don’t need all the structure
Keep enough structure to support fast gets and puts
![Page 31: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/31.jpg)
Towards Flexibility
Solution: Key-Value Stores (NoSQL)
(Diagram: a table with KEY and VALUE columns)
- Sorted HashMaps
- Sorted Files
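A rough sketch of how keeping keys sorted supports the simple lookups, range lookups and updates mentioned earlier. `SortedKVStore` is a hypothetical in-memory stand-in for a sorted map or sorted file; values are treated as opaque.

```python
import bisect


class SortedKVStore:
    """Hypothetical in-memory store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> opaque value

    def put(self, key, value):
        """Insert a new key in sorted position, or overwrite an existing one."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key, default=None):
        """Point lookup by key."""
        return self._values.get(key, default)

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]
```

The store needs to know nothing about the value beyond its key, which is the "just enough structure" trade-off: schema changes in the application never touch the store.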
![Page 32: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/32.jpg)
Towards Flexibility
Need to update related “values” of a key (Some Atomicity)
(Diagram: a table with KEY and VALUE columns)
![Page 33: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/33.jpg)
Towards Flexibility
Need to update related “values” of a key (Some Atomicity)
(Diagram: a table with KEY and VALUE columns, plus a TAG column)
TAG = COLUMN FAMILY
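One way to picture the tag idea is a two-level map, key -> {tag -> value}, where all the tagged values of one key are updated together. The `ColumnFamilyStore` class below, its single lock, and its rule that a `None` value rejects the whole update are assumptions for this sketch only; they are not how any particular column-family database is implemented.

```python
import threading


class ColumnFamilyStore:
    """Hypothetical store: key -> {tag -> value}, guarded by one lock."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, updates):
        """Apply all tag updates for one key together, or none of them."""
        with self._lock:
            row = dict(self._rows.get(key, {}))  # work on a copy of the row
            for tag, value in updates.items():
                if value is None:
                    raise ValueError(f"tag {tag!r} has no value")
                row[tag] = value
            self._rows[key] = row  # swap in the new row in one step

    def get(self, key, tag=None):
        """Return one tagged value, or a copy of the whole row."""
        with self._lock:
            row = self._rows.get(key, {})
            return dict(row) if tag is None else row.get(tag)
```

Because the new row is built on a copy and swapped in last, a rejected update leaves the stored row untouched. Atomicity here stops at a single key, which keeps the "some atomicity" promise without cross-shard coordination.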
![Page 34: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/34.jpg)
Towards Flexibility
gets and puts are fine for online applications BUT..
What about Analytics?
Transformations can be really complicated...
![Page 35: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/35.jpg)
Towards Flexibility
Is there a simple construct that can solve a number of analytics queries?
Of course: SORT
And it can be parallelized too
![Page 36: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/36.jpg)
Towards Flexibility
MAP/REDUCE (Scalable Parallel Pluggable SORT)
(Diagram: Mappers m{ } feeding Reducers r{ })
m: user defined map function
r: user defined reduce function
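The "pluggable sort" view can be sketched in a single process: run m over every record, sort the emitted pairs by key, then run r once per key. The helper name `map_reduce` and the word-count functions are illustrative assumptions; a real framework would distribute each phase across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, m, r):
    """Run m over every record, sort by key, then run r once per key."""
    # Map phase: each record may emit any number of (key, value) pairs.
    pairs = [pair for record in records for pair in m(record)]
    # Shuffle phase: the sort brings all values for a key together.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one call to r per distinct key.
    return [
        r(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def m(line):
    """User defined map function: emit (word, 1) for each word."""
    return [(word, 1) for word in line.split()]


def r(word, counts):
    """User defined reduce function: sum the counts for one word."""
    return (word, sum(counts))


result = map_reduce(["big data", "big sort"], m, r)
# result == [("big", 2), ("data", 1), ("sort", 1)]
```

Only m and r change from one analytics query to the next; the sort in the middle stays the same, which is what makes it the reusable, parallelizable core.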
![Page 37: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/37.jpg)
Towards Flexibility
MAP/REDUCE and Failures
(Diagram: Mappers and Reducers, with one failed node marked X)
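A minimal sketch of one common way to cope with a failed task: re-run it on the same input split, and keep the outputs of tasks that already finished so a restart does not repeat them. The function names, the `max_attempts` parameter and the `completed` dictionary are hypothetical and only illustrate the idea.

```python
def run_with_retries(task, split, max_attempts=3):
    """Re-run a task on its input split until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(split)
        except Exception as error:  # a real scheduler would be more selective
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def run_map_phase(splits, map_task, completed=None):
    """Run map_task once per split, skipping splits already in `completed`."""
    completed = {} if completed is None else completed
    for index, split in enumerate(splits):
        if index not in completed:  # checkpoint: keep finished outputs
            completed[index] = run_with_retries(map_task, split)
    return [completed[index] for index in range(len(splits))]
```

This works because each task depends only on its own input split, so re-running it cannot disturb the others.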
![Page 38: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/38.jpg)
1990s vs 2000s
Different Focus:
1990s (Raw Performance)
Structure important for speed optimizations
Stream everything through Query plan
2000s (Sprint mode of application development)
Support dev efficiency and data variety
Checkpointing for restartability
![Page 39: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/39.jpg)
Where now?
![Page 40: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/40.jpg)
The New Meets The Old
Disruption?
Well we still need SQL
We still need to make these work with other components
Guess what? Efficiency is also important at scale
![Page 41: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/41.jpg)
Where Does New Fail?
Transactions?
Moving money from one account to another
Graphs?
Networks everywhere
How to do second order analysis on graphs?
![Page 42: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/42.jpg)
Thank You!
![Page 43: Ashish thusoo evolution of big data architectures](https://reader036.vdocument.in/reader036/viewer/2022062514/55a60d151a28abd77b8b48c1/html5/thumbnails/43.jpg)