Getting your head around big data

Post on 19-Aug-2014

Category:

Engineering

DESCRIPTION

My talk on Big Data from Dallas Day of .NET 2014

TRANSCRIPT

  • https://github.com/glennblock https://twitter.com/gblock "I should be tweeting"
  • Make machine data accessible, usable and valuable to everyone.
  • Platform for Machine Data: Any Machine Data, HA Indexes and Storage, Search and Investigation, Proactive Monitoring, Operational Visibility, Real-time Business Insights, Commodity Servers, Online Services, Web Services, Servers, Security, GPS Location, Storage, Desktops, Networks, Packaged Applications, Custom Applications, Messaging, Telecoms, Online Shopping Cart, Web Clickstreams, Databases, Energy Meters, Call Detail Records, Smartphones and Devices, RFID
  • DATA
  • 15,000 BC: Pictures (Lascaux, France)
  • 6,000 BC: Symbols
  • 3,500 BC: Language
  • 1,275 BC: Papyrus
  • 1st to 13th century: Codex
  • 13th century: Movable type
  • 15th century: Printing press
  • 19th to 20th century: Babbage Analytical Engine
  • 1936: Turing machine
  • 1945: ENIAC
  • 1947: The first bug
  • 1977: ARPANET
  • 1990s: Internet
  • Phones and Tablets
  • RFID
  • Cloud
  • Services
  • New consumer devices
  • 90 percent of all the data in the world has been generated over the last two years (source: sciencedaily.com)
  • Every day, 2.5 quintillion bytes of data are generated. 1 quintillion = 1 followed by 18 zeros! 57.5 billion 32 GB iPads (source: storagenewsletter.com)
  • 2.7 zettabytes exist in the digital universe. 1 zettabyte = 1 followed by 21 zeros! 42 ZB = all human speech digitized (source: highscalability.com)
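To get a feel for these magnitudes, the figures quoted on the slides can be converted with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units, where 1 GB = 10^9 bytes, and the variable names are illustrative rather than taken from the talk. Under that assumption, 57.5 billion 32 GB iPads hold roughly 1.84 zettabytes, so the iPad comparison appears to describe a zettabyte-scale total rather than a single day's output.

```python
# Decimal (SI) byte units; this is an assumption, the slides do not say which units they use.
GB = 10**9
EB = 10**18  # 1 exabyte = 1 quintillion bytes
ZB = 10**21  # 1 zettabyte

daily_bytes = 25 * 10**17        # 2.5 quintillion bytes generated per day
ipad_bytes = 32 * GB             # capacity of one 32 GB iPad
universe_bytes = 27 * 10**20     # 2.7 zettabytes in the digital universe

# How many 32 GB iPads would one day's worth of data fill?
ipads_per_day = daily_bytes // ipad_bytes

# How much data do 57.5 billion 32 GB iPads hold, in zettabytes?
ipad_stack_zb = (575 * 10**8 * ipad_bytes) / ZB

# How many days at 2.5 EB/day does it take to produce 2.7 ZB?
days_to_universe = universe_bytes // daily_bytes

print(f"iPads filled per day: {ipads_per_day:,}")
print(f"57.5 billion iPads hold about {ipad_stack_zb:.2f} ZB")
print(f"Days to generate 2.7 ZB at this rate: {days_to_universe:,}")
```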
  • How big is big?
  • That's A LOT of data!
  • How do you harness it?
  • This is what big data is really about.
  • Asking questions and getting answers
  • VOLUME: Massive amounts of data. Machine generated.
  • VARIETY: Data is coming from a multitude of sources. Mix of structured and unstructured (JSON, XML, CSV, plain text). Need a way to store it and query it.
  • VARIETY: Log files, activity feeds, emails, device streams, audio files, videos
  • VELOCITY: Data arrives at many different frequencies. Need to be able to process it in real time.
  • VERACITY: Not all data that is stored is useful. Need to identify the useful data. Need to wade through all the noise.
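One way to picture the "variety" problem is to pull a few differently shaped inputs into a common record format that can then be searched. The sketch below is an illustration only: the helper names (`from_json`, `from_csv`, `from_plain_text`, `search`) and the sample data are hypothetical and are not how any of the tools named in this talk work. It uses only the Python standard library.

```python
import csv
import io
import json


def from_json(text):
    """Parse newline-delimited JSON: one object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def from_csv(text):
    """Parse CSV with a header row into one dict per row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_plain_text(text):
    """Unstructured lines: keep the raw text under a single 'message' field."""
    return [{"message": line} for line in text.splitlines() if line.strip()]


def search(records, term):
    """Return records where any field value contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if any(term in str(v).lower() for v in r.values())]


# Hypothetical sample inputs in three different shapes.
records = (
    from_json('{"user": "alice", "action": "login"}\n{"user": "bob", "action": "error"}')
    + from_csv("user,action\ncarol,logout\n")
    + from_plain_text("disk error on node 7\n")
)

for hit in search(records, "error"):
    print(hit)
```

The point of the sketch is that once every source is reduced to key/value records, a single query can span structured and unstructured data alike.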
  • SOLUTIONS
  • Map/Reduce
      function map(String name, String document):
        // name: document name
        // document: document contents
        for each word w in document:
          emit (w, 1)
      function reduce(String word, Iterator partialCounts):
        // word: a word
        // partialCounts: a list of aggregated partial counts
        sum = 0
        for each pc in partialCounts:
          sum += ParseInt(pc)
        emit (word, sum)
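The word-count pseudocode on the Map/Reduce slide can be tried out in a single process. The following is a minimal sketch, not a distributed implementation and not the API of Hadoop or any other framework: `map_fn`, `reduce_fn` and `map_reduce` are illustrative names, the grouping step stands in for the shuffle that a real framework performs across machines, and lowercasing words is an added assumption.

```python
from collections import defaultdict


def map_fn(name, document):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in document.split():
        yield word.lower(), 1  # lowercasing is an assumption, not in the slide


def reduce_fn(word, partial_counts):
    """Reduce step: sum the partial counts collected for one word."""
    return word, sum(partial_counts)


def map_reduce(documents):
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    grouped = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_fn(name, document):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: document name -> document contents.
    docs = {
        "a.txt": "big data big questions",
        "b.txt": "Data is big",
    }
    print(map_reduce(docs))
```

Because each call to `map_fn` depends only on its own document and each call to `reduce_fn` only on its own word, both steps can be spread across many machines, which is what makes the pattern suitable for large datasets.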
  • High-scale, high-availability databases
  • Distributed processing of large datasets
  • Data Visualization and analysis
  • End to end tools
  • More information: www.mongodb.org, www.memsql.com, cassandra.apache.org, hadoop.apache.org, www.tableausoftware.com, www.elasticsearch.org, splunk.com
  • @gblock http://github.com/glennblock http://www.flickr.com/photos/[email protected]/4050576435