Tech4Africa: Opportunities Around Big Data

1. Big Data. Steve Watt, Technology Strategy @ HP. [email protected] | @wattsteve

Posted on 28-Nov-2014









2. Agenda: Hardware, Software, Data, Big Data, Situational Applications
3. Situational Applications (photo: eaghra, Flickr)
4. Web 2.0 Era Topic Map: inexpensive storage and the data explosion, LAMP stacks, social platforms, publishing platforms, situational applications, Web 2.0 mashups, enterprise SOA
5.
6. Big Data (photo: blmiers2, Flickr)
7. The data just keeps growing: 1024 gigabytes = 1 terabyte; 1024 terabytes = 1 petabyte; 1024 petabytes = 1 exabyte. 1 petabyte = 13.3 years of HD video. 20 petabytes = amount of data processed by Google daily. 5 exabytes = all words ever spoken by humanity.
8. The Fractured Web opportunity. Web 1.0, connecting machines: infrastructure. Web 2.0, connecting people: an API foundation and the web as a platform (Facebook, Twitter, LinkedIn, Google, Netflix, New York Times, eBay, Pandora, PayPal); a service economy (a service for this, a service for that); a data exhaust of historical and real-time data. Mobile: an app economy for devices (an app for this, an app for that); set-top boxes, tablets, etc.; multiple sensors in your pocket. Sensor web: an instrumented and monitored world; real-time data.
9. Data deluge! But filter patterns can help (photo: Kakadu, Flickr)
10. Filtering with search
11. Filtering socially
12. Filtering visually
13. But filter patterns force you down a pre-processed path (photo: M.V. Jantzen, Flickr)
14. What if you could ask your own questions? (photo: wowwzers, Flickr)
15. And go from discovering something about everything... (photo: MrB-MMX, Flickr)
16. ...to discovering everything about something?
17. How do we do this? Let's examine a few techniques for gathering, storing, processing & delivering data at scale.
18. Gathering data: data marketplaces
19.
20.
21. Gathering data: Apache Nutch (web crawler)
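Nutch itself is a Java crawler, but the fetch-parse-enqueue loop it runs at scale can be sketched in a few lines. Below is a toy breadth-first crawler; the `fetch` function and the in-memory link graph (the `*.example` URLs) are invented stand-ins for real HTTP fetching, so the sketch stays self-contained:

```python
from collections import deque

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: visit each page once, queue its out-links."""
    seen, queue, pages = set(seeds), deque(seeds), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        links = fetch(url)          # in Nutch: the fetch + parse phases
        pages.append(url)
        for link in links:          # in Nutch: updating the crawl DB
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Stub "web" standing in for real HTTP fetches (hypothetical URLs).
web = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": [],
}
print(crawl(["http://a.example"], lambda u: web.get(u, [])))
# ['http://a.example', 'http://b.example', 'http://c.example']
```

Nutch adds what this toy omits: politeness delays, robots.txt handling, link scoring, and running the whole loop as Hadoop jobs so the crawl scales across a cluster.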
22. Storing, Reading and Processing: Apache Hadoop
    - A cluster technology with a single master that scales out to multiple slaves.
    - It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce.
    - As data is copied onto HDFS, it is split into blocks and replicated to other machines to provide redundancy.
    - A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster.
    - Jobs run on data held on the local disks of the machines they are sent to, ensuring data locality.
    - Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a task on any node in the cluster.
    Want to know more? Hadoop: The Definitive Guide (2nd Edition)
23. Delivering Data @ Scale
    - Structured data with low latency & random access.
    - Column stores (Apache HBase or Apache Cassandra): faster seeks, better compression, simpler scale-out.
    - Data is de-normalized: it is written as it is intended to be queried.
    Want to know more? HBase: The Definitive Guide & Cassandra High Performance
24. Storing, Processing & Delivering: Hadoop + NoSQL
    - Gather: web data via a Nutch crawl, log files via the Flume connector, and relational data (JDBC, e.g. MySQL) via the Sqoop connector.
    - Read/Transform: copy onto HDFS and run Apache Hadoop jobs to clean and filter the data, then transform and enrich it (often multiple Hadoop jobs).
    - Serve: load the results through a NoSQL connector into a NoSQL repository, which a low-latency application queries via the NoSQL API.
25. Some things to keep in mind (photo: Kanaka Menehune, Flickr)
26. Some things to keep in mind
    - Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing the data with many different kinds of readers. Hadoop is really great at this!
    - However, readers won't really help you process truly unstructured data such as prose. For that you're going to have to get handy with Natural Language Processing, which is really hard. Consider using parsing services & APIs like OpenCalais.
    Want to know more? Programming Pig (O'Reilly)
27. OpenCalais (Gnosis)
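The Map/Reduce flow described in slide 22 can be simulated without a cluster: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. A minimal sketch in plain Python of the canonical word-count example, with no Hadoop dependency; on a real cluster, Hadoop runs these same three phases in parallel across the slaves:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big deal", "big data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 2, 'deal': 1}
```

Because the mapper sees one record at a time and the reducer sees one key's group at a time, neither needs the whole dataset in memory, which is what lets the same program scale from this toy to petabytes.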
28. Statistical Real-Time Decision Making
    - Capture historical information.
    - Use machine learning to build decision-making models (such as classification, clustering & recommendation).
    - Mesh real-time events (such as sensor data) against the models to make automated decisions.
    Want to know more? Mahout in Action
29. (photo: Pascal Terjan, Flickr)
30. 31. 32. 33.
34. Making the Data Structured
    - Retrieving the HTML
    - Preliminary filtering on URL
    - Company POJO, then tab-delimited output
35. Aargh! My viz tool requires zip codes to plot geospatially!
36. Apache Pig script to join on city to get the zip code and write the results to Vertica:

    ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS
        (State:chararray, City:chararray, ZipCode:int);
    CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS
        (Company:chararray, City:chararray, State:chararray, Sector:chararray,
         Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);
    CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);
    STORE CrunchBaseZip INTO
        '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40),
          Sector varchar(40), Round varchar(40), Month int, Year int,
          Investor varchar(40), Amount int)}'
    USING com.vertica.pig.VerticaStorer('VerticaServer','OSCON','5433','dbadmin','');

37. Total Tech Investments by Year
38. Investment Funding by Sector
39. Total Investments by Zip Code, All Sectors: $1.2 billion in Boston, $7.3 billion in San Francisco, $2.9 billion in Mountain View, $1.7 billion in Austin
40. Total Investments by Zip Code, Consumer Web: $600 million in Seattle, $1.2 billion in Chicago, $1.7 billion in San Francisco
41. Total Investments by Zip Code, BioTech: $1.3 billion in Cambridge, $528 million in Dallas, $1.1 billion in San Diego
42. Questions?
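To close with the recommendation idea from the Mahout slide: one of the simplest models of this kind is item-based co-occurrence, where items that frequently appear together in user histories are recommended alongside an item the user liked. A toy sketch in plain Python; the purchase histories and item names are invented for illustration, and Mahout computes the same kind of co-occurrence counts as distributed Hadoop jobs rather than in memory:

```python
from collections import Counter
from itertools import combinations

def recommend(histories, liked, top_n=2):
    """Recommend the items that most often co-occur with `liked`."""
    cooc = Counter()
    for items in histories:
        # Count every unordered pair of distinct items in one history.
        for a, b in combinations(sorted(set(items)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    scores = Counter()
    for (a, b), count in cooc.items():
        if a == liked:
            scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

# Invented purchase histories for illustration.
histories = [
    ["hadoop-book", "pig-book", "hbase-book"],
    ["hadoop-book", "pig-book"],
    ["hadoop-book", "mahout-book"],
]
print(recommend(histories, "hadoop-book", top_n=1))  # ['pig-book']
```

Keeping only the aggregate pair counts, rather than raw user histories, is what makes this model cheap to serve: scoring a real-time event against it is a lookup, matching the "mesh real-time events against models" step on slide 28.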