How to Be a Productive Data Engineer

Rafal Wojdyla - rav@spotify.com

Note: My views are my own and don't necessarily represent those of Spotify.

• Operations

• Development

• Organization

• Culture

What is Spotify?

For everyone:

• Streaming Service

• Launched in October 2008

• 60 Million Monthly Users

• 15 Million Paid Subscribers

+ and for me:

• 1.3K-node Hadoop cluster

Automation

Apache Ambari

Cloudera Manager

+ Puppet

Not Invented Here

Never Invented Here

Wild Wild West

Apache Bigtop

Enable log aggregation

To enable log aggregation

yarn.log-aggregation-enable = true

yarn.log-aggregation.retain-seconds = ?

+ <property>
+   <name>yarn.log-aggregation-enable</name>
+   <value>true</value>
+ </property>
+
+ <property>
+   <name>yarn.log-aggregation.retain-seconds</name>
+   <value>315569260</value>
+ </property>
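To sanity-check settings like these before rolling them out, a small script can read the property fragment and turn the retention into a readable figure. This is only a minimal sketch: the `parse_properties` helper, the inline XML string and the `<configuration>` wrapper are illustrative assumptions, not part of the original deck; the two property names and values are the ones shown above.

```python
import xml.etree.ElementTree as ET

# Fragment mirroring the two properties above. The <configuration> root is
# assumed here, as in a typical yarn-site.xml.
YARN_SITE_FRAGMENT = """
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>315569260</value>
  </property>
</configuration>
"""


def parse_properties(xml_text):
    """Return a {name: value} dict built from Hadoop-style <property> elements."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


props = parse_properties(YARN_SITE_FRAGMENT)
retain_seconds = int(props["yarn.log-aggregation.retain-seconds"])
retain_days = retain_seconds / 86400  # 86400 seconds per day

print("aggregation enabled:", props["yarn.log-aggregation-enable"] == "true")
print("retention: %.1f days" % retain_days)
```

With the value from the slide, the retention works out to about ten years.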

Heap Memory used is 97%

Hellelephant

Custom logs

• Profiling

• Garbage collection

Right tool for the job

Right abstraction for the job

Scaling machines is easy, scaling people is hard

• Map split size

• Number of reducers

• HDFS data retention

• User feedback (ongoing)
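The deck does not say how these knobs are chosen automatically. As one illustration of what automating the "number of reducers" item could look like, the sketch below derives a reducer count from the input size. The `estimate_reducers` function, the 1 GiB-per-reducer target and the cap of 999 are hypothetical values chosen for the example; `mapreduce.job.reduces` is the standard Hadoop 2 property for the reducer count.

```python
# Hypothetical target: how much input a single reducer should handle.
DEFAULT_BYTES_PER_REDUCER = 1024 ** 3  # 1 GiB, an assumed value


def estimate_reducers(input_bytes,
                      bytes_per_reducer=DEFAULT_BYTES_PER_REDUCER,
                      max_reducers=999):
    """Estimate a reducer count from input size, clamped to [1, max_reducers]."""
    if input_bytes < 0:
        raise ValueError("input_bytes must be non-negative")
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    # Integer ceiling division avoids float rounding on very large inputs.
    wanted = -(-input_bytes // bytes_per_reducer)
    return max(1, min(wanted, max_reducers))


# Example: a job reading 250 GiB of input.
reducers = estimate_reducers(250 * 1024 ** 3)
print("-D mapreduce.job.reduces=%d" % reducers)
```

The same pattern (measure the input, apply a simple rule, emit the setting) could apply to map split size, again as an assumption rather than a description of the actual tooling.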

Automation

Organization

Ownerless

Squad

Upgrades

Getting there

Culture

Experiment

Fail Fast

Embrace Failure

But we have tried!

Non grata

spark.storage.memoryFraction (0.6)

spark.shuffle.memoryFraction (0.2)

In shuffle-heavy algorithms, reduce the cache fraction in favour of shuffle.
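As a rough sketch of that advice, the snippet below moves part of the heap fraction from storage (cache) to shuffle and renders the result as `spark-submit --conf` arguments. The helper names and the shift of 0.2 are assumptions made for illustration; only the two property names and their defaults (0.6 and 0.2) come from the slide. These properties belong to the memory model of the Spark versions current when the deck was written, so check the configuration documentation for the version in use.

```python
def shift_cache_to_shuffle(storage=0.6, shuffle=0.2, delta=0.2):
    """Move `delta` of the heap fraction from storage (cache) to shuffle."""
    if not 0 <= delta <= storage:
        raise ValueError("delta must be between 0 and the storage fraction")
    return {
        # round() hides float noise such as 0.39999999999999997
        "spark.storage.memoryFraction": round(storage - delta, 2),
        "spark.shuffle.memoryFraction": round(shuffle + delta, 2),
    }


def as_submit_args(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "%s=%s" % (key, value)]
    return args


conf = shift_cache_to_shuffle()  # hypothetical shift of 0.2
print(" ".join(as_submit_args(conf)))
```

The sum of the two fractions stays the same, so the change only trades cache space for shuffle space rather than claiming more of the heap.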

spark.executor.heartbeatInterval (10K)

spark.core.connection.ack.wait.timeout (60)

Increase in case of long GC pauses.
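A minimal sketch of raising both values together is shown below. It assumes the defaults on the slide mean 10000 milliseconds for the heartbeat interval and 60 seconds for the ack wait timeout; the `scale_timeouts` and `render_defaults` helpers and the factor of 2 are hypothetical. The output uses the whitespace-separated `key value` layout of `spark-defaults.conf`.

```python
def scale_timeouts(factor, heartbeat_ms=10000, ack_wait_s=60):
    """Scale both timeouts by `factor` (assumed units: ms and seconds)."""
    if factor < 1:
        raise ValueError("factor should be >= 1 when compensating for GC pauses")
    return {
        "spark.executor.heartbeatInterval": int(heartbeat_ms * factor),
        "spark.core.connection.ack.wait.timeout": int(ack_wait_s * factor),
    }


def render_defaults(conf):
    """Render a config dict as spark-defaults.conf lines ("key value")."""
    return "\n".join("%s %s" % (key, value) for key, value in sorted(conf.items()))


timeouts = scale_timeouts(2)  # hypothetical: double both defaults
print(render_defaults(timeouts))
```

Longer timeouts only mask the pauses; the garbage collection logs remain the place to look for the underlying cause.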

Learnings

• Operations: Automation

• Development: Abstraction

• Organization: Team

• Culture: Experiment

Join the band

Engineers wanted in NYC & Stockholm

http://spotify.com/jobs
