data-driven development era and its technologies
TRANSCRIPT
Data-Driven Development Era and Its Technologies
Developers Summit 2015 Autumn (Oct 14, 2015)
Satoshi Tagomori (@tagomoris)
Main Topics around "Data"• Data collection
• Storage
• Data processing • Batch distributed processing • Stream processing • Machine Learning • Near real-time query & Data lake
• Visualization
Using Services or Not• Using services fully-managed:
• Google BigQuery & Dataflow • Treasure Data services
• Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc
• Using your own environment & cluster
Using Services or Not• Using services fully-managed:
• Google BigQuery & Dataflow • Treasure Data services
• Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc
• Using your own environment & cluster
a bit more cost extremely less efforts
fully controlled by self extremely more efforts
less cost less efforts
Why should we use services?
• About distributed systems: • hard to operate & upgrade • impossible to "small-start" • very hard to hire professional engineer
• Data Driven Development: • collect/store data at first! • consider output data at second! • "before building your own environment"
Really? Are you TD guy?
• ...Really!
• But it requires very long discussions :P
• "スタートアップのデータ処理基盤、作るか、使うか"http://tsuchinoko.dmmlabs.com/?p=1770
"What" decides "How"
• Distributed systems are to solve problems • There're many kind of data • There're many problems
• Systems solve different problems from each other • There are no "Silver bullet"!
What First, How Second
• What do you want to do? • Reporting? Analytics? Recommendation? or ...
• What type of data you wan to process? • Stored large log? Stream sensor data? or ...
• What is you need as result? • CSV? Spreadsheet? Graph? DB Relation? or ...
How? (just for example)
• MapReduce, Tez • Large batch jobs, big JOINs, high stability
• Spark • Small/Middle batch jobs, machine learning
• Impala, Presto, Drill, Redshift, BigQuery • Near-real-time search, small-to-large analytics
• Storm, Spark streaming • Stream data conversion/aggregation
Data Collection• Data Driven Development -> collect at first!
• As batch: Data already exists as files • Easily integrated with existing batch systems • Sqoop, Embulk, ...
• As stream: Data just generated now • Easily connected with monitoring systems • Without burst network traffic • Flume, Logstash, Fluentd, ...
Other Important Topics
• Storage: Performance, Availability, Schema management • Apache Hadoop HDFS, Apache HBase, Amazon S3, Cloudera Kudu, ...
• Visualization: Functionality, Connectivity, Visibility • Tableau, Pentaho, Many other enterprise products, ...
• Distributed Queues: Performance, Stability, Connectivity • Apache Kafka, Amazon Kinesis, ...