Demystifying Data Engineering
Data engineering
• Software engineering with an emphasis on dealing with large amounts of data
• A “specialty” of software engineering
Why now?
• Always value in scale, but it was previously too difficult / expensive
• Economics and technology advances make these scales accessible
Enable others to answer questions on datasets within latency constraints
Data engineering
• Distributed systems – consensus, consistency, availability, etc.
• Parallel processing
• Databases
• Queuing
Data engineering
• Human-fault tolerance
• Metrics and monitoring
• Multi-tenancy
BackType
• When I joined:
• Comment search by keyword
• Comment search by user
• Basic stats on commenters
• Link search on Twitter
BackType
Kyoto Cabinet
Custom workers
Custom crawlers
BackType
• Inflexible
• Prone to corruption
• Heavy operational burden
• Not scalable
• Not fault-tolerant
BackType
• Enable asking any question (with high latency)
• Allows exploration and experimentation
• Establishes human-fault tolerance
Collector
ElephantDB
• Export results of MapReduce pipelines for querying
• Low latency querying but out of date by many hours
• Incredibly simple
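The trade-off above (fast reads, but results that lag by hours) can be modelled with a read-only view that is replaced wholesale after each batch run. This is only an illustrative sketch: the class name `BatchView` and its methods are hypothetical and do not reflect ElephantDB's actual API or storage format.

```python
class BatchView:
    """Hypothetical read-only key/value view, replaced wholesale by each batch export."""

    def __init__(self):
        self._data = {}
        self._version = 0

    def publish(self, batch_output):
        # Swap in a fresh snapshot; nothing is updated in place between batch runs.
        self._data = dict(batch_output)
        self._version += 1

    def get(self, key, default=None):
        # Reads are plain lookups against the most recently published snapshot.
        return self._data.get(key, default)

    @property
    def version(self):
        return self._version


view = BatchView()
view.publish({"example.com/a": 2})          # output of one batch run
stale = view.get("example.com/b", 0)        # data seen after that run is not visible yet
view.publish({"example.com/a": 3, "example.com/b": 1})  # next batch run, hours later
fresh = view.get("example.com/b", 0)
```

Because readers only ever see a complete snapshot, the serving side stays simple; the cost is that new data is invisible until the next export.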
Data engineering
• Infrastructure
• Data pipelines
• Abstractions
Data pipeline example
Tweets(S3)
Normalize URLs
Compute hour bucket
Sum by hour/url
Emit ElephantDB indexes
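The batch pipeline on this slide can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions: the tweet record layout (`timestamp` in epoch seconds, a `urls` list), the normalization rules, and the function names are all hypothetical, and an in-memory dict stands in for the ElephantDB index that a real MapReduce job would emit.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def hour_bucket(epoch_seconds):
    """Number of whole hours since the Unix epoch."""
    return epoch_seconds // 3600


def count_urls_by_hour(tweets):
    """Sum URL mentions per (hour bucket, normalized URL) over the whole dataset."""
    counts = Counter()
    for tweet in tweets:
        for url in tweet["urls"]:
            counts[(hour_bucket(tweet["timestamp"]), normalize_url(url))] += 1
    return dict(counts)


# Hypothetical sample records standing in for tweets stored on S3.
tweets = [
    {"timestamp": 7200, "urls": ["HTTP://Example.com/a/"]},
    {"timestamp": 7300, "urls": ["http://example.com/a#top"]},
    {"timestamp": 10800, "urls": ["http://example.com/a"]},
]
index = count_urls_by_hour(tweets)
```

The first two tweets fall into the same hour bucket and normalize to the same URL, so they are counted together; the third lands in the next bucket.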
Data pipeline example
Tweets(Kafka)
Normalize URLs
Compute hour bucket
Update hour/url bucket
Cassandra
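The streaming variant does the same computation one tweet at a time, incrementing a stored counter instead of summing over the full dataset. In this sketch an in-memory class stands in for a Cassandra counter table and a plain list stands in for a Kafka topic; `InMemoryCounterStore`, `process_tweet`, and the record layout are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


class InMemoryCounterStore:
    """Hypothetical stand-in for a Cassandra counter table keyed by (hour, url)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def increment(self, key, amount=1):
        self._counts[key] += amount

    def get(self, key):
        return self._counts.get(key, 0)


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def process_tweet(tweet, store):
    """Handle a single tweet as it arrives: bucket by hour, then bump each URL's counter."""
    bucket = tweet["timestamp"] // 3600
    for url in tweet["urls"]:
        store.increment((bucket, normalize_url(url)))


store = InMemoryCounterStore()
stream = [  # hypothetical records standing in for messages read from Kafka
    {"timestamp": 7200, "urls": ["HTTP://Example.com/a/"]},
    {"timestamp": 7300, "urls": ["http://example.com/a#top"]},
    {"timestamp": 10800, "urls": ["http://example.com/a"]},
]
for tweet in stream:
    process_tweet(tweet, store)
```

Counts are queryable as soon as each tweet is processed, at the price of mutable state in the database.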
Abstraction example
MapReduce Cascading Cascalog
Infrastructure
• HDFS
• MapReduce
• Kafka
• Storm
• Spark
• Cassandra
• HBase
• ElephantDB
• Zookeeper
Streaming compute team at Twitter
• Started streaming compute team at Twitter
• One shared Storm cluster for entire company
Multi-tenancy
• Independent applications on same cluster
• Topologies should not affect one another
Resource allocation
• Topologies should be given an appropriate amount of resources
Initial approach
• Use Mesos to provide resource guarantees
• Users include resources needed as part of topology submission
Solution
• Implement a new scheduler that gives production topologies dedicated hardware
• Only Storm team can configure production topologies
• Left-over machines are used as failover or for in-development topologies
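The allocation policy in these bullets can be sketched as a small function. This is an illustration of the policy only, not Storm's real scheduler interface: the function name `assign_machines`, the topology spec format, and the machine and topology names are all assumptions.

```python
def assign_machines(machines, topologies):
    """Give each production topology dedicated machines; the rest form a shared pool.

    `topologies` maps a topology name to {"production": bool, "machines": int}.
    Returns (dedicated, pool): dedicated maps production topology names to their
    machines, and pool lists left-over machines for failover or development use.
    """
    free = list(machines)
    dedicated = {}
    for name, spec in sorted(topologies.items()):
        if not spec["production"]:
            continue  # in-development topologies only ever run on the shared pool
        needed = spec["machines"]
        if needed > len(free):
            raise ValueError(f"not enough machines for production topology {name!r}")
        dedicated[name], free = free[:needed], free[needed:]
    return dedicated, free


# Hypothetical cluster and topology configuration.
machines = ["m1", "m2", "m3", "m4", "m5", "m6"]
topologies = {
    "ads-counts": {"production": True, "machines": 2},
    "search-index": {"production": True, "machines": 3},
    "experiment": {"production": False, "machines": 1},
}
dedicated, pool = assign_machines(machines, topologies)
```

Because no machine is shared between production topologies, one topology cannot starve another, which is the isolation property the multi-tenancy slide asks for.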
Data Engineering vs Data Science
• Well-defined problems
• No special statistics skills required
• Larger scope
• Not just analytics
Open source
• Almost all major Big Data tools are open source (e.g. Hadoop, Storm, Spark, Kafka, Cassandra, HBase, etc.)
• Many have commercial support
Open source
• Very important for recruiting data engineers
• Strong developers want to work at places where they can be involved with open source
Open source
• Develop a technology brand for company (in conjunction with a tech blog)
• Creating a popular open source project can give you access to lots of strong engineers
Open source
• Identify strong engineers in the community you may want to recruit
• Learn best practices and get help from the people who know the tools the best
• *Do not* expect to get “free work” on your projects
Ideal data engineer
• Strong software engineering skills
• Abstraction
• Testing
• Version control
• Refactoring
• Strong algorithm skills
• Good at digging into open source code
• Good at stress testing
Finding strong data engineers
• Standard “coding on the whiteboard” interviews are nearly useless
• Use take home projects to gauge general programming ability
• Best of all is to see a candidate's past projects that required data engineering
Questions?