Demystifying Data Engineering

Uploaded by: nathanmarz

Posted on 15-Aug-2015


TRANSCRIPT

Page 1: Demystifying Data Engineering

Demystifying Data Engineering

Page 2: Demystifying Data Engineering

Data engineering

• Software engineering with an emphasis on dealing with large amounts of data

• A “specialty” of software engineering

Page 3: Demystifying Data Engineering

Why now?

• There has always been value in operating at scale, but it was previously too difficult and expensive

• Advances in economics and technology now make these scales accessible

Page 4: Demystifying Data Engineering

Enable others to answer questions on a dataset within latency constraints

Page 5: Demystifying Data Engineering

Data engineering

• Distributed systems – consensus, consistency, availability, etc.

• Parallel processing

• Databases

• Queuing

Page 6: Demystifying Data Engineering

Data engineering

• Human-fault tolerance

• Metrics and monitoring

• Multi-tenancy

Page 7: Demystifying Data Engineering

BackType

• When I joined:

• Comment search by keyword

• Comment search by user

• Basic stats on commenters

• Link search on Twitter

Page 8: Demystifying Data Engineering

BackType

• Kyoto Cabinet

• Custom workers

• Custom crawlers

Page 9: Demystifying Data Engineering

BackType

• Inflexible

• Prone to corruption

• Heavy operational burden

• Not scalable

• Not fault-tolerant

Page 10: Demystifying Data Engineering

BackType

• Enables asking any question (with high latency)

• Allows exploration and experimentation

• Establishes human-fault tolerance

Page 11: Demystifying Data Engineering

(Diagram: multiple Collector processes)
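These collectors append raw events to an immutable master dataset. Because original records are only ever appended, never modified, any future question can still be answered from them, and a buggy downstream job can simply be recomputed, which is where the human-fault tolerance above comes from. A minimal sketch of such a collector, assuming newline-delimited JSON files with local disk standing in for HDFS:

```python
import json
import os
import time

class Collector:
    """Appends raw events, untouched, to hourly files.

    Local disk stands in for HDFS here; records are never updated
    in place, only appended, so any downstream computation can be
    redone from the originals.
    """

    def __init__(self, root="raw-data"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def collect(self, event: dict) -> None:
        # One file per hour keeps files a manageable size and makes
        # it easy for batch jobs to pick up newly arrived data.
        hour = time.strftime("%Y-%m-%d-%H", time.gmtime())
        path = os.path.join(self.root, f"events-{hour}.jsonl")
        with open(path, "a") as f:
            f.write(json.dumps(event) + "\n")

# Usage: several independent collector processes can append to their
# own files without coordinating with each other.
if __name__ == "__main__":
    c = Collector()
    c.collect({"type": "comment", "url": "http://example.com/post", "user": "alice"})
```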

Page 12: Demystifying Data Engineering
Page 13: Demystifying Data Engineering

ElephantDB

• Export results of MapReduce pipelines for querying

• Low-latency querying, but results are out of date by many hours

• Incredibly simple
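The serving model is simple enough to sketch with a toy stand-in: the batch layer writes out a read-only key/value index, and the query side only does point lookups. This is a plain-Python illustration of that pattern, not ElephantDB's actual API; the key layout and file name are assumptions.

```python
import shelve

def export_index(results: dict, path: str) -> None:
    """Batch side: write the output of a batch job as a read-only
    key/value index. This runs every few hours, so served data is
    always somewhat stale."""
    with shelve.open(path, flag="n") as db:  # "n": always build a fresh index
        for key, value in results.items():
            db[key] = value

def lookup(path: str, key: str):
    """Serving side: low-latency point lookups, no writes."""
    with shelve.open(path, flag="r") as db:
        return db.get(key)

# Usage: the "url|hour" key layout is purely illustrative
export_index({"http://example.com/a|2015-08-15-10": 42}, "urlcounts")
print(lookup("urlcounts", "http://example.com/a|2015-08-15-10"))  # 42
```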

Page 14: Demystifying Data Engineering
Page 15: Demystifying Data Engineering

Data engineering

• Infrastructure

• Data pipelines

• Abstractions

Page 16: Demystifying Data Engineering

Data pipeline example

Tweets (S3) → Normalize URLs → Compute hour bucket → Sum by hour/url → Emit ElephantDB indexes
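A minimal batch version of this flow in plain Python, standing in for a MapReduce/Cascalog job reading tweets from S3; the tweet format and helper names are assumptions:

```python
import json
from collections import Counter
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Toy normalization: lowercase the host, drop query string and fragment.
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc.lower(), p.path, "", ""))

def hour_bucket(ts: float) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")

def count_urls_by_hour(lines):
    """Batch job: read all tweets, sum pageviews per (url, hour).
    In production this would run over the full dataset and the
    result would be exported as ElephantDB indexes."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        for url in tweet.get("urls", []):
            counts[(normalize_url(url), hour_bucket(tweet["timestamp"]))] += 1
    return counts

# Usage with a couple of fake tweets
sample = [
    json.dumps({"timestamp": 1439632800, "urls": ["http://Example.com/a?x=1"]}),
    json.dumps({"timestamp": 1439632860, "urls": ["http://example.com/a"]}),
]
print(count_urls_by_hour(sample))
```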

Page 17: Demystifying Data Engineering

Data pipeline example

Tweets (Kafka) → Normalize URLs → Compute hour bucket → Update hour/url bucket → Cassandra
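And a streaming counterpart: the same normalize-and-bucket logic applied one message at a time, with an in-memory counter standing in for Cassandra and the Kafka consumer left out. This is an illustrative sketch, not Storm, Kafka, or Cassandra APIs:

```python
import json
from collections import Counter
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

counts = Counter()  # stand-in for a Cassandra counter table

def normalize_url(url: str) -> str:
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc.lower(), p.path, "", ""))

def hour_bucket(ts: float) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")

def process_tweet(raw: str) -> None:
    """Handle one message: normalize, bucket, increment.
    In production this logic would run inside a Storm topology
    reading from Kafka and incrementing counters in Cassandra."""
    tweet = json.loads(raw)
    for url in tweet.get("urls", []):
        counts[(normalize_url(url), hour_bucket(tweet["timestamp"]))] += 1

# Usage: feed messages one at a time as they arrive
process_tweet(json.dumps({"timestamp": 1439632800, "urls": ["http://Example.com/a?x=1"]}))
print(counts)
```

The trade-off mirrors the two slides: the batch job recomputes everything and lags by hours, while the streaming job updates incrementally and stays close to real time.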

Page 18: Demystifying Data Engineering

Abstraction example

MapReduce → Cascading → Cascalog
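The same layering in miniature: a word count written against explicit map/reduce-style functions versus the identical computation as one higher-level expression. This is a plain-Python analogy for what Cascading and Cascalog provide on top of MapReduce, not their actual APIs:

```python
from collections import Counter, defaultdict

docs = ["the quick brown fox", "the lazy dog"]

# Low level (MapReduce-style): explicit map and reduce phases.
def mapper(doc):
    for word in doc.split():
        yield word, 1

def reduce_counts(pairs):
    grouped = defaultdict(int)
    for word, n in pairs:
        grouped[word] += n
    return grouped

low_level = reduce_counts(kv for doc in docs for kv in mapper(doc))

# High level (Cascading/Cascalog-style): state *what* you want and
# let the abstraction compile it down to map/reduce steps.
high_level = Counter(word for doc in docs for word in doc.split())

assert dict(low_level) == dict(high_level)
```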

Page 19: Demystifying Data Engineering
Page 20: Demystifying Data Engineering

Infrastructure

• HDFS

• MapReduce

• Kafka

• Storm

• Spark

• Cassandra

• HBase

• ElephantDB

• ZooKeeper

Page 21: Demystifying Data Engineering

Streaming compute team at Twitter

• Started the streaming compute team at Twitter

• One shared Storm cluster for the entire company

Page 22: Demystifying Data Engineering

Multi-tenancy

• Independent applications on the same cluster

• Topologies should not affect one another

Page 23: Demystifying Data Engineering

Resource allocation

• Topologies should be given an appropriate amount of resources

Page 24: Demystifying Data Engineering

Initial approach

• Use Mesos to provide resource guarantees

• Users include resources needed as part of topology submission

Page 25: Demystifying Data Engineering
Page 26: Demystifying Data Engineering

Solution

• Implement a new scheduler that gives production topologies dedicated hardware (sketched below)

• Only the Storm team can configure production topologies

• Left-over machines are used for failover or for in-development topologies
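A toy sketch of that scheduling policy, using made-up names and data structures rather than Storm's real pluggable-scheduler interfaces:

```python
def assign_machines(machines, topologies):
    """Toy version of the policy described above: production
    topologies get dedicated machines, sized by configuration that
    only the Storm team controls; whatever is left over serves as
    failover capacity and hosts in-development topologies.

    `machines` is a list of host names; each topology is a dict like
    {"name": ..., "production": bool, "machines_needed": int}.
    All names and fields here are illustrative, not Storm's API.
    """
    free = list(machines)
    assignments = {}

    # Production topologies are placed first, on dedicated hardware.
    for topo in (t for t in topologies if t["production"]):
        n = topo["machines_needed"]
        assignments[topo["name"]] = free[:n]
        free = free[n:]

    # Leftover machines are shared by in-development topologies
    # (and double as failover capacity for production).
    for topo in (t for t in topologies if not t["production"]):
        assignments[topo["name"]] = free  # shared pool, no isolation
    return assignments, free

# Usage
machines = [f"node{i}" for i in range(10)]
topos = [
    {"name": "prod-analytics", "production": True, "machines_needed": 4},
    {"name": "experiment-1", "production": False, "machines_needed": 1},
]
print(assign_machines(machines, topos))
```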

Page 27: Demystifying Data Engineering
Page 28: Demystifying Data Engineering

Data Engineering vs Data Science

• Well-defined problems

• No special statistics skills required

• Larger scope

• Not just analytics

Page 29: Demystifying Data Engineering

Open source

• Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase, etc.)

• Many have commercial support

Page 30: Demystifying Data Engineering

Open source

• Very important for recruiting data engineers

• Strong developers want to work at places where they can be involved with open source

Page 31: Demystifying Data Engineering

Open source

• Develop a technology brand for the company (in conjunction with a tech blog)

• Creating a popular open source project can give you access to lots of strong engineers

Page 32: Demystifying Data Engineering

Open source

• Identify strong engineers in the community you may want to recruit

• Learn best practices and get help from the people who know the tools the best

• *Do not* expect to get “free work” on your projects

Page 33: Demystifying Data Engineering

Ideal data engineer

• Strong software engineering skills

• Abstraction

• Testing

• Version control

• Refactoring

Page 34: Demystifying Data Engineering

Ideal data engineer

• Strong software engineering skills

• Strong algorithm skills

Page 35: Demystifying Data Engineering

Ideal data engineer

• Strong software engineering skills

• Strong algorithm skills

• Good at digging into open source code

Page 36: Demystifying Data Engineering

Ideal data engineer

• Strong software engineering skills

• Strong algorithm skills

• Good at digging into open source code

• Good at stress testing

Page 37: Demystifying Data Engineering
Page 38: Demystifying Data Engineering

Finding strong data engineers

• Standard “coding on the whiteboard” interviews are nearly useless

• Use take-home projects to gauge general programming ability

• The best signal is projects that require data engineering

Page 39: Demystifying Data Engineering

Questions?