![Page 1: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/1.jpg)
Data Pipeline Architect
Data Pipelines
For small, messy and tedious data.
Vladislav Supalov, 27th October 2016
![Page 2: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/2.jpg)
How to tell if this talk is for you?
● Big Data
  ○ Pretty fascinating
  ○ “Good problem to have”
● Most companies
  ○ Not quite there
  ○ Should not start at this level
● This is for you, if you are close to the data at a
  ○ Startup
  ○ Growing company
  ○ Established company which is about to start an initiative
● Working with a new CDO, CAO, Head of BI
![Page 3: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/3.jpg)
I want to help you achieve better results!
● What will help you to deal with …?
  ○ small data (not much is needed to be valuable)
  ○ messy data (multiple data sources, no overview)
  ○ tedious-to-handle data (multiple data sources, lots of manual work)
● “Use <tech X> in <way Y> and you will be fine”. Nope.
  ○ Just dealing with data is not a magic bullet
  ○ This will not guarantee good results for your company
  ○ You might get lucky of course. That’s not a safe bet.
● How can we improve your chances? Reduce risk.
  ○ Focus on what matters
![Page 4: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/4.jpg)
Jumping straight to tech, we would dive too deep, too early.
● What people tend to think about first:
  ○ Dashboards
  ○ Tools
  ○ Technical solutions, best practices & tricks
● That’s tactics
● We should not jump into implementation details right away.
● Let’s not.
![Page 5: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/5.jpg)
The Craft of Designing & Building Data Pipelines
Should start with understanding the business.
![Page 6: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/6.jpg)
Hi, I’m Vladislav!
● Data background
  ○ Machine learning, computer vision, data mining
● Fascination with DevOps
  ○ Efficient, reliable infrastructure setups
  ○ Monitoring, automation, processes
● Currently: Co-founding a startup - Pivii Technologies
  ○ Startup, accelerated by Axel Springer Plug and Play
  ○ Artificial intelligence for content marketing
  ○ AI, ML, CV, data!
  ○ pivii.co
● Previously: Building a data engineering consulting business
  ○ datapipelinearchitect.com
vsupalov
![Page 7: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/7.jpg)
Preferred consulting situation:
● Mobile application marketing agency
  ○ Not necessarily huge data
  ○ Very valuable and worthwhile (from a certain point)
● “We built prototype analytics tools in-house and they are mostly functional”
  ○ “We have seen the value!”
  ○ But the tools are painful to work with and partly broken
  ○ “Time and money are still being wasted.”
● Tools were created out of an actual need
  ○ Organically, little planning
  ○ “How can we do better?”
  ○ “Where do we go from here?”
![Page 8: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/8.jpg)
Common Success Pattern: Business Value was Created.
Already achieved visible and measurable impact for the company.
Or have gotten VERY close to doing so. Are thinking about ROI.
![Page 9: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/9.jpg)
Business first. Tech follows.
● Key to successful data projects
  ○ Especially with limited resources
  ○ And small data
● Technical decisions should be informed by business needs and goals
● Handling data is a very small part of the whole
  ○ Straightforward once business needs are clear
● It starts with the mindset
  ○ Don't consider data plumbing in isolation
![Page 10: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/10.jpg)
Key: being conscious and deliberate about the intention of creating business value.
Let’s take a brief detour.
![Page 11: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/11.jpg)
Consider sword fighting.
● A great samurai sword master
● 1584 - 1645
● Miyamoto Musashi
  ○ Martial artist
  ○ Tactician
  ○ Strategist
  ○ Artist
  ○ Sculptor
  ○ Calligrapher
  ○ Writer
  ○ Philosopher
  ○ ...
Images: Miyamoto Musashi, self-portrait, http://sv-musashi1.com/about_Musashi.htm, Musashi Miyamoto with two Bokken, http://www.akinokai.org/images/Images.htm?Musashi.jpg
![Page 12: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/12.jpg)
“The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means.”
- Miyamoto Musashi, The Book of Five Rings
![Page 13: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/13.jpg)
“Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement.”
- Miyamoto Musashi, The Book of Five Rings
![Page 14: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/14.jpg)
“It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him.”
- Miyamoto Musashi, The Book of Five Rings
![Page 15: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/15.jpg)
“More than anything, you must be thinking of carrying your movement through to cutting him. You must thoroughly research this.”
- Miyamoto Musashi, The Book of Five Rings
![Page 16: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/16.jpg)
The Goal of swordfighting is to cut the opponent.
● Stating this makes it seem very obvious.
  ○ Why the effort and emphasis?
● It’s not. Even for aspiring practitioners.
  ○ Results suffer.
● Mindset is essential for mastery
● The core advice (to my understanding):
  ○ Attain, cultivate and apply a goal-oriented mindset
  ○ Aim every step you take towards the goal
![Page 17: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/17.jpg)
Back to the world of data-handling businesses!
● When working with company data
  ○ Before starting out on a project
  ○ Understand what you want and can achieve
  ○ Aim to create a positive impact on the business
  ○ Make it a constant, conscious goal
● The main tasks to do so are:
  ○ Understand the business
  ○ Understand the people
    ■ It’s about communication
  ○ Understand current processes
  ○ Be prepared to learn and revise
![Page 18: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/18.jpg)
Use this process when approaching a new project:
● Qualify client/project
  ○ Does it make sense to get involved?
  ○ Is it evident that we can create value?
● Perform conversations/interviews
  ○ Find out more about the context
    ■ company, status, goals, limitations...
  ○ Learn from first-hand experience
● Summarize information, learnings and plans in writing
  ○ Roadmap document
  ○ Depicting the situation and ways forward
![Page 19: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/19.jpg)
Is there potential for a good fit?
Do budget, topic and goals seem in order?
![Page 20: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/20.jpg)
Qualifying considerations. Learning about the client and project.
● What are you working on?
● What part of the project would you like help with?
● What needs to happen to make this a success for you?
● Why was this project started? What are the business goals?
● Is there an event that triggered it?
● Why especially now?
● What’s the budget? (ballpark estimate)
● When are you looking to get started?
![Page 21: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/21.jpg)
Still good? Let’s start a business relationship.
Initial research and planning. Roadmapping consulting package.
![Page 22: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/22.jpg)
Four people to talk to:
● Project owner
  ○ We want this person to be successful
● Business owner or C-level perspective
  ○ Knows what’s best for the business
  ○ “What could the CEO ask you in the hallway?”
● Data wrangler - tales from the trenches
  ○ Insights into day-to-day business and data details
● Engineering side
  ○ Current tech stack
  ○ Information on constraints and preferences
  ○ Last touches
● Conversation focus, questions and duration vary from person to person.
![Page 23: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/23.jpg)
Interviews completed, situation understood and put into writing.
● After a bit of focused communication, we have a great foundation!
  ○ Project motivation
  ○ Business goals
  ○ Who should benefit
  ○ How to make it happen
● Different perspectives on the project and business.
● Time for tech!
  ○ Context clear (goals, constraints)
● Best case:
  ○ Very few choices left to make
![Page 24: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/24.jpg)
Here’s what I would have told myself when starting out:
● Learn about the company
  ○ Easier with fresh eyes
● Understand the business
  ○ Multiple perspectives
● Keep the goal in mind
  ○ Helps learning the right things
  ○ Cultivate a business mindset (help earn more/lose less)
  ○ Aim for results
    ■ I will not stop saying this anytime soon :)
● Have a process laid out
![Page 25: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/25.jpg)
Finally: Tactical Advice Which Fits the Remaining Time.
That’s the right proportion :)
![Page 26: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/26.jpg)
Don’t roll your own home-baked scripts.
● "Quick and easy" isn't
● Uniqueness is bad, boring is good
  ○ Learning curve for others
  ○ Original author leaving
  ○ Maintenance time, tricky bugs, code duplication
  ○ Unexpected failure modes
● Extensibility?
● Growth?
● Metadata?
![Page 27: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/27.jpg)
You should know about workflow engines.
● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
  ○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)

[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
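To make the list above concrete, here is a minimal sketch in plain Python of three things a workflow engine handles for you: dependency ordering, reuse of finished steps, and failure handling. It is only an illustration, not how any particular engine is implemented. The function `run_pipeline` and the task names are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_pipeline(tasks, dependencies, done=None):
    """Run tasks in dependency order.

    tasks:        dict mapping task name -> callable (hypothetical tasks)
    dependencies: dict mapping task name -> set of upstream task names
    done:         names of tasks that already finished in an earlier run
    Returns (done, failed) as two sets of task names.
    """
    done = set(done or ())
    failed = set()
    for name in TopologicalSorter(dependencies).static_order():
        if name in done:
            # Reuse of intermediate steps: finished work is not repeated.
            log.info("skip %s (already complete)", name)
            continue
        if set(dependencies.get(name, ())) & failed:
            # Failure handling: never run a task on broken upstream data.
            log.warning("skip %s (upstream failure)", name)
            failed.add(name)
            continue
        try:
            tasks[name]()
            done.add(name)
        except Exception:
            log.exception("task %s failed", name)
            failed.add(name)
    return done, failed


if __name__ == "__main__":
    def broken_clean():
        raise RuntimeError("broken data")

    example_tasks = {
        "extract": lambda: log.info("extracting"),
        "clean": broken_clean,
        "report": lambda: log.info("reporting"),
    }
    example_deps = {"extract": set(), "clean": {"extract"}, "report": {"clean"}}
    finished, broken = run_pipeline(example_tasks, example_deps)
    print(sorted(finished), sorted(broken))
```

Even this toy version leaves out alerting, monitoring, retries and persistence of the `done` state between runs, which is exactly the effort a mature engine has already invested.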
![Page 28: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/28.jpg)
If in doubt, try Luigi.
● Spotify
  ○ Lots of data!
  ○ 10k+ Hadoop jobs every day [1]
● Battle hardened
  ○ Open-sourced in 2012
  ○ Has been used in production by large companies for a while
● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks
[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”
![Page 29: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/29.jpg)
Usually worthwhile pipeline properties:
● Keep it small and lean
● Make learning and iterating easy
  ○ Changes should be cheap to accommodate (in both time and money)
● Build something to start learning
● Get data into one place
● Don’t reinvent the wheel
  ○ The tools are out there
  ○ ETL and workflow engines
● Create quick positive results, be efficient (lazy)
  ○ Many small improvements everywhere
  ○ Instead of solving everything for one group
  ○ More bang-for-the-buck
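For small data, "get data into one place" can start very modestly. The sketch below is one possible way, not a recommendation of a specific stack: it loads CSV exports from two made-up sources into a single SQLite database using only the Python standard library, so they can be queried together. The helper `load_csv`, the table names and the sample data are all hypothetical.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text into a table, replacing any previous contents.

    All columns are stored as TEXT. Table and column names are assumed
    to come from trusted code, not from user input.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    conn.commit()
    return len(data)


# Two hypothetical exports that previously lived in separate tools.
CUSTOMERS_CSV = "id,name\n1,Ada\n2,Bob\n"
ORDERS_CSV = "customer_id,amount\n1,10.0\n1,2.5\n2,4.0\n"

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
load_csv(conn, "customers", CUSTOMERS_CSV)
load_csv(conn, "orders", ORDERS_CSV)

# With everything in one place, cross-source questions become one query.
revenue_per_customer = conn.execute(
    """
    SELECT c.name, SUM(CAST(o.amount AS REAL))
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name ORDER BY c.name
    """
).fetchall()
print(revenue_per_customer)
```

A step like this is cheap to change later, which fits the "small and lean" point above. Once it needs scheduling and failure handling, it can become a task in a workflow engine instead of a standalone script.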
![Page 30: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/30.jpg)
In conclusion:
● Don’t dive into tactics right away
● Aim to create business value
  ○ Make it a conscious goal
● Understand the business, people and processes
  ○ This will take some time. It’s a good investment.
  ○ Have a process yourself
  ○ Tech choices will follow
● Try to make it easy to learn and iterate
● Get data in one place
● Don’t go with home-baked scripts
● Consider workflow engines
  ○ Luigi in particular
![Page 31: "Data Pipelines for Small, Messy and Tedious Data", Vladislav Supalov, CAO & Co-Founder of Pivii Technologies](https://reader034.vdocument.in/reader034/viewer/2022051709/58713ad51a28abf0568b6c41/html5/thumbnails/31.jpg)
Thanks! Want to learn more?
“What questions to ask? Am I missing something?”
For your future interviews and planning:
I want to share my seed-question lists with you!
Just drop me your email address at: http://datapipelinearchitect.com/datanatives/