Democratization of Data @Indix
![Page 1: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/1.jpg)
Democratization of Data
Why and how we built an internal data pipeline platform @Indix
![Page 2: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/2.jpg)
About me
Manoj Mahalingam
Principal Engineer @Indix
![Page 3: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/3.jpg)
Six Business Critical Indexes
● People
● Documents
● Businesses
● Places
● Products
● Connected Devices
![Page 4: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/4.jpg)
Enabling businesses to build location-aware software.
~3.6 million websites use Google Maps
Enabling businesses to build product-aware software.
Indix catalogs over 2.1 billion product offers
Indix - The “Google Maps” of Products
![Page 5: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/5.jpg)
Data Pipeline @Indix
● Crawling Pipeline: Brand & Retailer Websites, Seed, Crawl, Crawl Data, Parse
● Feeds Pipeline: Brand & Retailer Feeds, Feed Data, Transform, Clean, Connect
● Data Pipeline (ML): Aggregate, Match, Standardize, Extract Attributes, Classify, Dedupe
● Indix Product Catalog
● Indexing Pipeline (Real Time): Index, Analyze, Derive, Join
● Customizable Feeds, Search & Analytics, API (Bulk & Synchronous), Product Data Transformation Service
![Page 6: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/6.jpg)
Democratization of Data
Enable everyone in the organization to know what data is available, and then understand and work with it.
![Page 7: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/7.jpg)
At Indix, we have and work with a lot of data.
![Page 8: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/8.jpg)
Scale of Data @ Indix
● 2.1 Billion Product URLs
● 8 TB HTML Data Crawled Daily
● 1B Unique Products
● 7000 Categories
● 120 B Price Points
● 3000 Sites
![Page 9: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/9.jpg)
● We have data in different shapes and sizes.
● HTML pages, Thrift and Avro records.
● And also the usual suspects - CSVs and plain text data.
![Page 10: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/10.jpg)
● Datasets can be in TBs or a few hundred KBs.
● A few billion records or a couple of hundred.
![Page 11: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/11.jpg)
But...the data’s potential couldn’t be realized
![Page 12: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/12.jpg)
Data wasn’t discoverable
● The biggest problem was in knowing what data exists and where.
● Some of the data was in S3. Some in HDFS. Some in Google Sheets.
● There was no way to know when or how frequently the data was changed or updated.
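To make the discoverability problem concrete, here is a minimal sketch of what a dataset catalog entry could record: where the data lives, its format, and when it last changed. This is not MDA's actual data model; `DatasetEntry`, `DatasetCatalog`, the `CategoryMapping` dataset and the example locations are hypothetical names chosen for illustration.

```scala
import java.time.Instant

// Hypothetical catalog entry: where a dataset lives, its format, and when it last changed.
final case class DatasetEntry(
  namespace: String,
  name: String,
  location: String,     // e.g. an S3 or HDFS URI (example values below are made up)
  format: String,       // e.g. "avro", "thrift", "csv"
  lastUpdated: Instant
)

// Minimal in-memory catalog; a real one would be backed by a persistent store.
final class DatasetCatalog(entries: Seq[DatasetEntry]) {
  // Look up one dataset by namespace and name.
  def find(namespace: String, name: String): Option[DatasetEntry] =
    entries.find(e => e.namespace == namespace && e.name == name)

  // Case-insensitive search over dataset names.
  def search(term: String): Seq[DatasetEntry] =
    entries.filter(_.name.toLowerCase.contains(term.toLowerCase))

  // Datasets that have not been updated since the given instant.
  def staleSince(cutoff: Instant): Seq[DatasetEntry] =
    entries.filter(_.lastUpdated.isBefore(cutoff))
}

val catalog = new DatasetCatalog(Seq(
  DatasetEntry("Indix", "PriceHistoryProfile", "s3://example-bucket/price-history/",
    "avro", Instant.parse("2017-08-01T00:00:00Z")),
  DatasetEntry("Indix", "CategoryMapping", "hdfs:///example/category-mapping/",
    "csv", Instant.parse("2017-01-15T00:00:00Z"))
))
```

With something like this in place, "what data exists, where, and how fresh is it" becomes a lookup (`catalog.search("price")`, `catalog.staleSince(...)`) rather than tribal knowledge.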
![Page 13: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/13.jpg)
The schema wasn’t readily known
● The schema of the data, as expected, kept changing and it was difficult to keep track of which version of data had which schema.
● While Thrift and Avro alleviate this to an extent, access to data wasn’t simple, especially for non-engineers.
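One way to picture "which version of data had which schema" is to keep each schema version as plain data and compute what changed between versions. The sketch below assumes a simplified schema model (field name mapped to a type name); `Schema`, `SchemaDiff` and the example fields are hypothetical and are not taken from Thrift, Avro or MDA.

```scala
// Hypothetical schema model: a version number plus field name -> type name.
final case class Schema(version: Int, fields: Map[String, String])

// What changed between two schema versions.
final case class SchemaDiff(added: Set[String], removed: Set[String], retyped: Set[String])

def diffSchemas(older: Schema, newer: Schema): SchemaDiff = {
  val oldNames = older.fields.keySet
  val newNames = newer.fields.keySet
  val common = oldNames.intersect(newNames)
  SchemaDiff(
    added = newNames -- oldNames,
    removed = oldNames -- newNames,
    // Fields present in both versions whose type name differs.
    retyped = common.filter(f => older.fields(f) != newer.fields(f))
  )
}

// Made-up example versions of one dataset's schema.
val schemaV1 = Schema(1, Map("productId" -> "string", "price" -> "double"))
val schemaV2 = Schema(2, Map("productId" -> "string", "price" -> "string", "currency" -> "string"))
val schemaChanges = diffSchemas(schemaV1, schemaV2)
```

Surfacing a diff like `schemaChanges` next to each dataset version would let a non-engineer see that `currency` was added and `price` changed type without reading any IDL.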
![Page 14: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/14.jpg)
Writing code limited scope
● We use Scalding and Spark for our MapReduce jobs. Having to code and tweak the jobs limited who could write and run them.
● “Readymade” jobs may not allow the tweaks people need, affecting productivity and increasing dependencies.
● Having to write code and ship jars hinders ad hoc data experimentation.
![Page 15: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/15.jpg)
Cost control wasn’t trivial
● While data came in various sizes and shapes, what people did with the data also varied - some use cases needed a sample of the data, while others wanted aggregations on the entire data.
● It wasn’t trivial to handle all the different workloads while minimizing costs.
● There was also the problem of ad hoc jobs starving production jobs in our existing Hadoop clusters.
![Page 16: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/16.jpg)
Goals of Internal Data Pipeline Platform
● Enable easy discovery of data.
● Allow schema to be transparent and easy to create while also allowing introspection.
● Minimal coding - have prebuilt transformations for common tasks and enable SQL-based workflow.
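The "prebuilt transformations" goal can be sketched as a small library of reusable steps that are chained by configuration instead of by writing a new job each time. This is only an illustration on in-memory records, not MDA's implementation; `PipelineRecord`, `Transform`, `Transforms` and `runPipeline` are hypothetical names.

```scala
// A record is a flat map of field name -> value; a transform maps a dataset to a dataset.
type PipelineRecord = Map[String, String]
type Transform = Seq[PipelineRecord] => Seq[PipelineRecord]

// Hypothetical library of prebuilt, parameterized transforms.
object Transforms {
  // Keep records whose field equals the given value.
  def filterEquals(field: String, value: String): Transform =
    _.filter(_.get(field).contains(value))

  // Keep only the listed fields of each record.
  def select(fields: String*): Transform =
    _.map(r => r.filter { case (k, _) => fields.contains(k) })

  // Keep the first record seen for each distinct value of the field.
  def dedupeBy(field: String): Transform =
    _.distinctBy(_.get(field))
}

// Run the steps in order, feeding each step's output to the next.
def runPipeline(input: Seq[PipelineRecord], steps: Seq[Transform]): Seq[PipelineRecord] =
  steps.foldLeft(input)((data, step) => step(data))

// Made-up sample data.
val sampleProducts: Seq[PipelineRecord] = Seq(
  Map("id" -> "1", "brand" -> "Acme", "title" -> "Widget"),
  Map("id" -> "1", "brand" -> "Acme", "title" -> "Widget (duplicate)"),
  Map("id" -> "2", "brand" -> "Other", "title" -> "Gadget")
)

val pipelineResult = runPipeline(sampleProducts, Seq(
  Transforms.filterEquals("brand", "Acme"),
  Transforms.dedupeBy("id"),
  Transforms.select("id", "title")
))
```

Because each step is just a named function with parameters, a UI or wizard can assemble the `steps` list without the user writing or shipping any code.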
![Page 17: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/17.jpg)
Goals of Internal Data Pipeline Platform
● UI and wizard-based workflow to enable ANYONE in the organization to run pipelines and extract data.
● Manage underlying clusters and resources transparently while optimizing for costs.
● Support data experimentation as well as production / customer use cases.
![Page 18: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/18.jpg)
MDA - Marketplace of Datasets and Algorithms
![Page 19: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/19.jpg)
Tech Stack
![Page 20: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/20.jpg)
MDA - DEMO!!!
![Page 21: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/21.jpg)
MDA with our Data Pipeline
Match, Attributes, Brand, Classify, Dedup
![Page 22: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/22.jpg)
MDA with our Data Pipeline
Match, Attributes, Brand, Classify, Dedup
Feed data from customer
Enrich Data, Classify, Brand
Feed output to customer
![Page 23: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/23.jpg)
MDA for ML Training Data
Filter, Sample, Preprocess
Training Data
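The filter, sample and preprocess steps on this slide can be sketched on plain Scala collections. This is an illustrative stand-in rather than an MDA pipeline; `Product`, `TrainingExample`, `buildTrainingData` and the specific filter and tokenization rules are assumptions.

```scala
import scala.util.Random

// Hypothetical input and output shapes.
final case class Product(title: String, category: String)
final case class TrainingExample(tokens: Seq[String], label: String)

// Filter out unusable rows, keep a random fraction, then tokenize titles.
// A fixed seed makes the sample reproducible across runs.
def buildTrainingData(products: Seq[Product], fraction: Double, seed: Long): Seq[TrainingExample] = {
  val rng = new Random(seed)
  products
    // Filter: drop rows with a blank title or a missing category label.
    .filter(p => p.title.trim.nonEmpty && p.category.nonEmpty)
    // Sample: keep each remaining row with probability `fraction`.
    .filter(_ => rng.nextDouble() < fraction)
    // Preprocess: lowercase the title and split it on whitespace.
    .map(p => TrainingExample(p.title.trim.toLowerCase.split("\\s+").toSeq, p.category))
}

// Made-up sample rows.
val rawProducts = Seq(
  Product("Red Widget", "Tools"),
  Product("   ", "Tools"),
  Product("Blue Gadget", "")
)
val trainingData = buildTrainingData(rawProducts, fraction = 1.0, seed = 42L)
```

Keeping the seed as an explicit parameter means the same training set can be regenerated later, which matters when comparing models trained at different times.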
![Page 24: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/24.jpg)
Notebooks
// Set up the MDA client (`spark` is the notebook's existing SparkSession)
import com.indix.holonet.core.client.SDKClient
val host = "holonet.force.io"
val port = 80
val client = SDKClient(host, port, spark)
// Create a DataFrame from any MDA dataset
val df = client.toDF("Indix", "PriceHistoryProfile")
// Print the first rows of the dataset
df.show
![Page 25: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/25.jpg)
● Dec 2015: Start work on MDA
● Mar 2016: First release
● Jul 2016: Lot more transforms including sampling, full Hive SQL support and UX fixes
● Late 2016: Performance improvements, Spark and infra upgrades
● Early 2017: Completely redesign the UI based on over a year of feedback and learnings. GraphQL for the UI.
● June 2017: Ability to run pipelines in customer’s cloud infra
● Aug 2017: First closed preview of MDA for a customer
![Page 26: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/26.jpg)
What does the future hold?
● We are far from done - things like automatic schema inference and better caching are already planned.
● And, as in the original vision, make it fully self-serve for our customers (internal and external).
● Integration with other tools out there like Superset.
● Open source as much as possible. First cut - http://github.com/indix/sparkplug
![Page 27: Democratization of Data @Indix](https://reader034.vdocument.in/reader034/viewer/2022052514/5a6479447f8b9a4c568b4649/html5/thumbnails/27.jpg)
Questions?
I blog at https://stacktoheap.com
Twitter and most other platforms @manojlds