Demystifying Data Engineering

Data engineering

• Software engineering with an emphasis on dealing with large amounts of data

• A “specialty” of software engineering

Why now?

• There has always been value in scale, but working at scale was previously too difficult or expensive

• Economic and technological advances now make these scales accessible

Enable others to answer questions on datasets within latency constraints

Data engineering

• Distributed systems – consensus, consistency, availability, etc.

• Parallel processing

• Databases

• Queuing

Data engineering

• Human-fault tolerance

• Metrics and monitoring

• Multi-tenancy

BackType

• When I joined:

• Comment search by keyword

• Comment search by user

• Basic stats on commenters

• Link search on Twitter

BackType

Kyoto Cabinet

Custom workers

Custom crawlers

BackType

• Inflexible

• Prone to corruption

• Heavy operational burden

• Not scalable

• Not fault-tolerant

BackType

• Enables asking any question (with high latency)

• Allows exploration and experimentation

• Establishes human-fault tolerance

Collector

ElephantDB

• Export results of MapReduce pipelines for querying

• Low latency querying but out of date by many hours

• Incredibly simple
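The idea of exporting batch results for low-latency reads can be sketched as follows. This is only an illustration and not ElephantDB's actual API: the function names (`shard_for`, `build_shards`, `lookup`), the CRC32-based sharding, and the in-memory dictionaries standing in for on-disk index files are all assumptions.

```python
import zlib


def shard_for(key, num_shards):
    """Pick a shard with a stable hash so writers and readers agree."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shards(results, num_shards):
    """Batch side: split precomputed key/value results into shards.

    A real export would write each shard to a file; a dict stands in here.
    """
    shards = [dict() for _ in range(num_shards)]
    for key, value in results.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


def lookup(shards, key):
    """Serving side: a read is one hash plus one lookup in one shard."""
    return shards[shard_for(key, len(shards))].get(key)


batch_results = {"example.com/a": 12, "example.com/b": 3}
shards = build_shards(batch_results, num_shards=4)
```

Because the shards are rebuilt wholesale by each batch run and never updated in place, reads stay simple, at the cost of results being as old as the last run.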

Data engineering

• Infrastructure

• Data pipelines

• Abstractions

Data pipeline example

Tweets(S3)

Normalize URLs

Compute hour bucket

Sum by hour/url

Emit ElephantDB indexes
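The batch steps above can be sketched in plain Python. This is a minimal single-process illustration, not the MapReduce job itself: the tweet record shape (`timestamp` in epoch seconds, `urls` as a list), the normalization rules, and the function names are assumptions, and the returned dictionary stands in for the exported indexes.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def hour_bucket(epoch_seconds):
    """Number of whole hours since the Unix epoch."""
    return int(epoch_seconds) // 3600


def count_by_hour_and_url(tweets):
    """Sum occurrences per (hour bucket, normalized URL) over the full dataset."""
    counts = Counter()
    for tweet in tweets:
        bucket = hour_bucket(tweet["timestamp"])
        for url in tweet["urls"]:
            counts[(bucket, normalize_url(url))] += 1
    return dict(counts)


tweets = [
    {"timestamp": 0, "urls": ["HTTP://Example.com/a/"]},
    {"timestamp": 10, "urls": ["http://example.com/a#frag"]},
    {"timestamp": 3600, "urls": ["http://example.com/a"]},
]
hourly_counts = count_by_hour_and_url(tweets)
```

Since the job recomputes every count from the raw tweets, a bug in `normalize_url` can be fixed and the whole result regenerated on the next run.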

Data pipeline example

Tweets(Kafka)

Normalize URLs

Compute hour bucket

Update hour/url bucket

Cassandra
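The streaming variant applies the same steps one tweet at a time and updates a stored counter instead of recomputing totals. The sketch below is an assumption-laden illustration: a Python list stands in for the Kafka stream, a dictionary stands in for the Cassandra table, and the class name `StreamingUrlCounter` and tweet record shape are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class StreamingUrlCounter:
    """Keeps a running count per (hour bucket, URL) as tweets arrive."""

    def __init__(self):
        # Stand-in for a counter table in a database.
        self.store = defaultdict(int)

    def process(self, tweet):
        """Handle one tweet: normalize, bucket, then increment in place."""
        bucket = int(tweet["timestamp"]) // 3600
        for url in tweet["urls"]:
            self.store[(bucket, normalize_url(url))] += 1


counter = StreamingUrlCounter()
stream = [
    {"timestamp": 5, "urls": ["http://Example.com/a/"]},
    {"timestamp": 7200, "urls": ["http://example.com/a"]},
    {"timestamp": 30, "urls": ["http://example.com/a"]},
]
for tweet in stream:
    counter.process(tweet)
```

Counts are available as soon as each tweet is processed, but unlike a batch recomputation, an incorrect increment stays in the store until something explicitly repairs it.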

Abstraction example

MapReduce → Cascading → Cascalog

Infrastructure

• HDFS

• MapReduce

• Kafka

• Storm

• Spark

• Cassandra

• HBase

• ElephantDB

• Zookeeper

Streaming compute team at Twitter

• Started streaming compute team at Twitter

• One shared Storm cluster for entire company

Multi-tenancy

• Independent applications on same cluster

• Topologies should not affect one another

Resource allocation

• Topologies should be given an appropriate amount of resources

Initial approach

• Use Mesos to provide resource guarantees

• Users include resources needed as part of topology submission

Solution

• Implement new scheduler which gives production topologies dedicated hardware

• Only Storm team can configure production topologies

• Left-over machines are used as failover or for in-development topologies
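The allocation policy described above can be sketched as a small function. This is a hypothetical illustration of the policy only, not Storm's scheduler interface: the function name `schedule`, the machine and topology names, and the choice to let all in-development topologies share the leftover pool are assumptions.

```python
def schedule(machines, production, development):
    """Assign machines to topologies.

    machines:    list of machine names in the cluster
    production:  dict of topology name -> number of dedicated machines
                 (in this sketch, only the cluster's operators set this)
    development: list of in-development topology names
    Returns (assignments, leftover).
    """
    free = list(machines)
    assignments = {}

    # Production topologies each get their own machines, shared with no one.
    for name, needed in production.items():
        if needed > len(free):
            raise ValueError(f"not enough machines for {name}")
        assignments[name] = free[:needed]
        free = free[needed:]

    # Whatever is left is the shared pool for development topologies
    # (and can serve as failover capacity).
    for name in development:
        assignments[name] = list(free)

    return assignments, free


assignments, leftover = schedule(
    machines=["m1", "m2", "m3", "m4", "m5"],
    production={"ads": 2, "search": 1},
    development=["experiment"],
)
```

Dedicated machines mean a misbehaving development topology cannot take resources from a production one, which is the isolation that multi-tenancy requires.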

Data Engineering vs Data Science

• Well-defined problems

• No special statistics skills required

• Larger scope

• Not just analytics

Open source

• Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase)

• Many have commercial support

Open source

• Very important for recruiting data engineers

• Strong developers want to work at places where they can be involved with open source

Open source

• Develop a technology brand for company (in conjunction with a tech blog)

• Creating a popular open source project can give you access to lots of strong engineers

Open source

• Identify strong engineers in the community you may want to recruit

• Learn best practices and get help from the people who know the tools the best

• *Do not* expect to get “free work” on your projects

Ideal data engineer

• Strong software engineering skills

• Abstraction

• Testing

• Version control

• Refactoring

Ideal data engineer

• Strong algorithm skills

• Good at digging into open source code

• Good at stress testing

Finding strong data engineers

• Standard “coding on the whiteboard” interviews are nearly useless

• Use take-home projects to gauge general programming ability

• Best of all is to see past projects that required data engineering

Questions?
