data analytics infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · no central...
TRANSCRIPT
![Page 1: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/1.jpg)
Data Analytics Infrastructure
Le Nguyen The Dat
@lenguyenthedat
Data Science SG
Nov 2015 Meetup
![Page 2: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/2.jpg)
Backgrounds ZALORA Group (2013 – 2014)
o Biggest online fashion retails in South East Asia o Data Infrastructure & Data Science
![Page 3: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/3.jpg)
BackgroundsCommercialize.TV (2015 – )
o Multi Channel media network – focusing on China audiences
o Data Infrastructure & Insights
![Page 4: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/4.jpg)
Challenges No central data source:
o Data stored in multiple locations o Unclear ownership
Data definition and quality: o Little to none documentation
o Different formula, rules owned by different departments o Always dirty no matter what
Reporting – Descriptive analytics:
o Immediate needs, automations o Important to do it right (and quick!)
![Page 5: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/5.jpg)
Data Warehouse
![Page 6: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/6.jpg)
Database TechnologiesSQL – Relational Databases:
o MySQL, PostgreSQL o MS SQL Server, Oracle SQL
NoSQL: o Redis o Cassandra o MongoDB o DynamoDB (AWS) o RethinkDB
Map Reduce ecosystem:
o Hadoop: HDFS – Pig – Hive – Hbase o Spark: RDD – Spork – Shark (Spark SQL) – Hbase-Spark
Massively Parallel Processing (MPP): o Vertica (HP) - Greenplum (EMC) – Netezza (IBM) o ParAccel (Amazon Redshift)
![Page 7: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/7.jpg)
Database TechnologiesSQL – Relational Databases: ✔
o MySQL, PostgreSQL o MS SQL Server, Oracle SQL
NoSQL: (?) o Redis ✖ o Cassandra ✔ o MongoDB ✖ o DynamoDB (AWS) ✖ o Neo4j ✖
Map Reduce ecosystem: ✔
o Hadoop: HDFS – Pig – Hive – Hbase o Spark: RDD – Spork – Shark (Spark SQL) – Hbase-Spark
Massively Parallel Processing (MPP): ✔ o Vertica (HP) - Greenplum (EMC) – Netezza (IBM) o ParAccel (Amazon Redshift)
![Page 8: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/8.jpg)
Data WarehouseAmazon Redshift aws.amazon.com/redshift
o Cloud-based, Fully managed o SQL (PostgreSQL 8.0.2) o On-demand ($2000/year) o Scalable (Petabyte-scale) o FAST! amplab.cs.berkeley.edu/benchmark/
All in ONE place!
o Product information o Customer information o Tracking data o External data sources (Social Media, 3rd Party datasets)
![Page 9: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/9.jpg)
Extract-Transform-Load (ETL)
![Page 10: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/10.jpg)
ETLAmazon Redshift’s COPY command. docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
![Page 11: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/11.jpg)
ETL Custom made:
o Simple bash script, python, SQL o Use cases:
• Scrappers • Excel / CSV imports
Data Pipeline Frameworks:
o Large scale, more complicated o Examples:
• Spotify’s Luigi – github.com/spotify/luigi • Yelp’s Mycroft – github.com/Yelp/mycroft
3rd Party Services:
o aws.amazon.com/redshift/partners • Flydata • Rjmetrics
![Page 12: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/12.jpg)
Applications
![Page 13: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/13.jpg)
Self-services Dashboards Intermediate users
o SQL / Excel / WYSIWYG Tools o Re:dash – www.redash.io
• Open source: github.com/getredash/redash • Try it out: demo.redash.io • Self-manage & deployment:
o Docker o Pre-baked AMI (Amazon Web Services) o Google Cloud Images
• Supports lots of database types (Redshift, MySQL, PostgreSQL, Big Query, MongoDB…)
• Users need to know SQL • Web-based, collaborative work type
![Page 14: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/14.jpg)
Demo: re:dash data sources usage http://demo.redash.io/queries/756
![Page 15: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/15.jpg)
Demo: NYC Taxis Tip Amounts http://demo.redash.io/queries/753
![Page 16: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/16.jpg)
Self-services Dashboards Intermediate users
o SQL / Excel / WYSIWYG Tools o Tableau – www.tableau.com
• Licensed software (14 days trial) • Tableau Public (Free: public.tableau.com) • Self-host or Tableau host (fully managed) • Supports a lot more database types • Group, User management – customized access right • Drag & Drop software as well as web-based
![Page 17: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/17.jpg)
Demo: Social media dashboard Baidu è Import.io API èData Warehouse è Tableau
![Page 18: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/18.jpg)
Demo: Market Research - video platform performance SimilarWeb API èData Warehouse è Tableau
![Page 19: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/19.jpg)
Others (Tableau Public): • SEA Games Result history tiny.cc/seagames • Rakuten – Viki data challenge tiny.cc/viki-viz
![Page 20: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/20.jpg)
Advanced applications Advanced users
o Data Warehouse connection (JDBC - PostgreSQL) o Automated, highly customized reports. o Data Science:
• Recommendation engine • Predictive modeling • Classifications
![Page 21: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/21.jpg)
Advanced applicationsInternal reporting tool Data Warehouse è SQL, Python (Django), JS è Product-Finder
[ Black dress | SKU or ID | Tiffany| Atmosphere ]
Sales Info Tracking Info
Product Info
![Page 22: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/22.jpg)
Advanced applicationsRecommendation Engine Data Warehouse è SQL, Python, Haskell è ZALORA Website
Similar products: Similar products:
Similar products: Similar products:
![Page 23: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/23.jpg)
Conclusions
![Page 24: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/24.jpg)
Team & Technology stacko Small team of 1-4 programmers
o Amazon Web Services • No upfront cost
• Low maintenance
• Scalability
• Integrations
o Shell Scripts, Python, Haskell, D3.js
o Unix, Open-source technologies
![Page 25: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/25.jpg)
Takeawayso (Good) data infrastructure is important:
• Build it first (before you hire a data scientist!) • Build it right: stable – fast – scalable.
o There is no silver bullet: • Understand what you need • Always do more research
o Data infrastructure is NOT that hard! • Utilize existing, modern technologies • Avoid old, proprietary technology that were built for the 90s!
![Page 26: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/26.jpg)
ReferencesEngineering Blogs and Tutorials
o https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/ o https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83 o https://engineering.pinterest.com/blog/powering-interactive-data-analysis-
redshift o http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/ o https://blog.rjmetrics.com/2015/10/15/the-data-infrastructure-meta-analysis-
how-top-engineering-organizations-built-their-big-data-stacks/ o https://www.youtube.com/watch?v=reQtXquDpzo o https://www.periscopedata.com/amazon-redshift-guide
Benchmarks
o https://amplab.cs.berkeley.edu/benchmark/ o https://www.flydata.com/blog/with-amazon-redshift-ssd-querying-a-tb-of-data-
took-less-than-10-seconds/ o https://www.flydata.com/blog/hive-and-redshift-a-brief-comparison/
![Page 27: Data Analytics Infrastructurelenguyenthedat.com/extras/decks/dssg-data-infras.pdf · No central data source: o Data stored in multiple locations o Unclear ownership Data definition](https://reader033.vdocument.in/reader033/viewer/2022053000/5f045f9a7e708231d40da858/html5/thumbnails/27.jpg)
Thank you!