Big Data analytics with Nginx, Logstash, Redis, Google BigQuery and Neo4j, javier ramirez, datawaki

Upload: javier-ramirez

Post on 28-Jul-2015

Category:

Data & Analytics


TRANSCRIPT

1. javier ramirez @supercoco9 https://datawaki.com Big Data analytics with Nginx, Logstash, Redis, Google BigQuery, and Neo4j datawaki
2. moral of the story: you can do big, if you know how
3. javier ramirez @supercoco9 https://datawaki.com
4. Apache Hadoop, Apache Cassandra, Apache Spark, Apache Storm, HBase, Kafka
5. bigdata is cool but... expensive cluster; hard to set up and monitor; not interactive enough
6. Data analysis as a service: Google BigQuery
7. The right-now data analytics platform for your website, your backend, and your business. datawaki
8. The Challenge: Several thousands of req./s. From many devices/apps. Provide real-time alerts. Analyze billions of rows interactively. Extract graph information.
9. The real challenge: Cheap
10. data from many sources: HTTP. Libraries available for virtually any programming language. De facto standard for inter-system comms. Easy to script from command-line tools.
11. Free, open-source, high-performance HTTP server and reverse proxy. Nginx is known for its high performance, stability, rich feature set, simple configuration, and low resource consumption. Used by Netflix, Hulu, Pinterest, CloudFlare, Airbnb, WordPress.com, GitHub, SoundCloud, Zynga, Eventbrite, Zappos, Media Temple, Heroku, RightScale, Engine Yard and MaxCDN.
12. [Diagram: data input; NGINX x2; Log x2] Several hundred thousand requests per second per server. Limited by network bandwidth. Just add more servers ($5) and balance.
13. logstash: handle the log data. Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
14. Logstash: highly scalable (JRuby process); Input/Output/Codecs/Filters; easily extendable using Ruby.
15. [Diagram: data input; NGINX x2; Log x2; Logstash; Redis] Data verification: we discard invalid inputs in Logstash. We complete messages with basic info (timestamp, origin...).
16. open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. http://redis.io Started in 2009 by Salvatore Sanfilippo (@antirez). 100+ contributors at https://github.com/antirez/redis codemotion 2013
17. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
    $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
    SET: 552,028 requests per second
    GET: 707,463 requests per second
    LPUSH: 767,459 requests per second
    LPOP: 770,119 requests per second
    Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
    $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
    SET: 122,556 requests per second
    GET: 123,601 requests per second
    LPUSH: 136,752 requests per second
    LPOP: 132,424 requests per second
18. Redis keeps everything in memory all the time
19. what it's being used for
20. twitter [Diagram: write API (from browser or client app); fanout (flockDB), one per follower; rpushx to Redis (user id, tweet id, metadata); user info from gizmoduck (memcached); tweet info from tweetypie (memcached + mysql); your twitter timeline]
21. products using Redis: Pinterest, SnapChat, World of Warcraft, GitHub, HipChat, SoundCloud, Tumblr, Booking.com, YouPorn...
22. [Diagram: data input; NGINX x2; Log x2; Logstash; Redis; Ruby Worker x2; Alert system]
23. javier ramirez @supercoco9 https://datawaki.com
24. Google BigQuery: data analysis as a service. http://developers.google.com/bigquery
25. Based on Dremel. Specifically designed for interactive queries over petabytes of real-time data.
26. What Dremel has been used for in Google: Analysis of crawled web documents. Tracking install data for applications on Android Market. Crash reporting for Google products. OCR results from Google Books. Spam analysis. Debugging of map tiles on Google Maps. Tablet migrations in managed Bigtable instances. Results of tests run on Google's distributed build system. Disk I/O statistics for hundreds of thousands of disks. Resource monitoring for jobs run in Google's data centers. Symbols and dependencies in Google's codebase.
27. INPUT / OUTPUT: Big Data's #1 Enemy
28. INDEXES: Data Scientists' #1 Enemy
29. Columnar storage
30. Colossus filesystem: distributed/redundant; parallel reads; ultra fast network
31. highly distributed execution using a tree
32. loading data: you can feed flat CSV-like files or nested JSON objects
33. web console screenshot
34. analytical SQL functions. correlations. window functions. views. JSON fields. timestamped tables.
35. Things you always wanted to try but were too scared to
    select count(*) from publicdata:samples.wikipedia
    where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
    223,163,387
    Query complete (5.6s elapsed, 9.13 GB processed)
36. Global Database of Events, Language and Tone: quarter billion rows; 30 years; updated daily. http://gdeltproject.org/data.html#googlebigquery
37. SELECT Year, Actor1Name, Actor2Name, Count FROM (
      SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
             RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
      FROM
        (SELECT Actor1Name, Actor2Name, Year
         FROM [gdelt-bq:full.events]
         WHERE Actor1Name < Actor2Name
           and Actor1CountryCode != '' and Actor2CountryCode != ''
           and Actor1CountryCode!=Actor2CountryCode),
        (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
         FROM [gdelt-bq:full.events]
         WHERE Actor1Name > Actor2Name
           and Actor1CountryCode != '' and Actor2CountryCode != ''
           and Actor1CountryCode!=Actor2CountryCode),
      WHERE Actor1Name IS NOT null AND Actor2Name IS NOT null
      GROUP EACH BY 1, 2, 3
      HAVING Count > 100
    )
    WHERE rank=1
    ORDER BY Year
38. BigQuery pricing
    $20 per stored TB: 1000000 rows => $0.004 / month
    $5 per processed TB: 1 full scan (1MM rows) ~ 200 MB; 1 count = 0 MB; 1 full scan over 1 column ~ 15 MB
    *the 1st TB every month is free of charge
39. [Diagram: data input; NGINX x2; Log x2; Logstash; Redis; Ruby Worker x2; Alert system; BigQuery]
40. Neo4j is a high performance graph store with all the features expected of a mature and robust database, like a friendly query language and ACID transactions. The programmer works with a flexible network structure of nodes and relationships rather than static tables, yet enjoys all the benefits of an enterprise-quality database. For many applications, Neo4j offers orders of magnitude performance benefits compared to relational DBs.
41. How are we using Neo4j: Define data flows (funnels) for users or devices. Check if the data points are part of a funnel. Store the BigQuery ID on the graph so we can cross analytical queries with data flows.
42. Cypher Query Language
    MATCH startPath=(root)-[:`2010`]->()-[:`12`]->()-[:`31`]->(startLeaf),
          endPath=(root)-[:`2011`]->()-[:`01`]->()-[:`03`]->(endLeaf),
          valuePath=(startLeaf)-[:NEXT*0..]->(middle)-[:NEXT*0..]->(endLeaf),
          vals=(middle)-[:VALUE]->(event)
    WHERE root.name = 'Root'
    RETURN event.name
    ORDER BY event.name ASC
43. Neo4j web console
44. [Diagram: data input; NGINX x2; Log x2; Logstash; Redis; Ruby Worker x2; Alert system; BigQuery; Neo4j]
45. datawaki in a nutshell [Diagram: data input; NGINX x2; Log x2; Logstash; Redis; Ruby Worker x2; Alert system; BigQuery; Neo4j; PostgreSQL; Rails App; Report system; user interaction]
46. Cost of a minimum system
    Nginx: $5 per server
    Logstash: $10 per server
    Redis: $5
    Ruby workers: $5 per server
    BigQuery: $5 per 500MM rows
    Neo4j: $10 per server
    Rails: $5 per server
    total: $45 / month + backups
47. ig
48. Thanks! datawaki
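
The pricing figures on slides 38 and 46 can be reproduced with a few lines of arithmetic. The sketch below is not from the talk. It assumes decimal units (1 TB = 1,000,000 MB), takes the slide's "1MM rows ~ 200 MB" as the data size, ignores the free first TB of processing, and assumes one server per component when summing the minimum system. The function and dictionary names are made up for illustration, and the prices are the ones quoted on the slides, not current list prices.

```python
# Back-of-the-envelope check of the pricing slides (38 and 46).
# Prices are the figures quoted in the talk, not current list prices.

MB_PER_TB = 1_000_000             # assumption: decimal units
STORAGE_USD_PER_TB_MONTH = 20.0   # slide 38: $20 per stored TB
QUERY_USD_PER_TB = 5.0            # slide 38: $5 per processed TB
MB_PER_MILLION_ROWS = 200.0       # slide 38: full scan of 1MM rows ~ 200 MB


def table_size_mb(rows: int) -> float:
    """Estimated table size, scaling the slide's 200 MB per million rows."""
    return rows / 1_000_000 * MB_PER_MILLION_ROWS


def storage_cost_per_month(rows: int) -> float:
    """Monthly storage cost in USD for a table with the given row count."""
    return table_size_mb(rows) / MB_PER_TB * STORAGE_USD_PER_TB_MONTH


def full_scan_cost(rows: int) -> float:
    """Cost in USD of one full scan, ignoring the free first TB per month."""
    return table_size_mb(rows) / MB_PER_TB * QUERY_USD_PER_TB


# Slide 46: monthly cost of each component, assuming one server of each.
MINIMUM_SYSTEM_USD = {
    "Nginx": 5,
    "Logstash": 10,
    "Redis": 5,
    "Ruby workers": 5,
    "BigQuery": 5,
    "Neo4j": 10,
    "Rails": 5,
}

if __name__ == "__main__":
    print(f"storage, 1MM rows: ${storage_cost_per_month(1_000_000):.3f} / month")
    print(f"one full scan, 1MM rows: ${full_scan_cost(1_000_000):.3f}")
    print(f"minimum system: ${sum(MINIMUM_SYSTEM_USD.values())} / month")
```

Under these assumptions, storing one million rows comes to the $0.004 per month shown on slide 38, and the component prices add up to the $45 per month on slide 46 (backups excluded).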