Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks
A Use Case Guided Explanation
Chris Herrera, Hashmap
Topics
• Who - Key Hashmap Team Members
• The Use Case - Our Need for a Memory Grid
• Requirements
• Approach V1
• Approach V1.5
• Approach V2
• Lessons Learned
• What’s Next
• Questions
Who - Hashmap
WHO
● Big Data, IIoT/IoT, AI/ML Services since 2012
● HQ Atlanta area with offices in Houston, Toronto, and Pune
● Consulting Services and Managed Services
REACH
● 125 Customers across 25 Industries
PARTNERS
● Cloud and technology platform providers
Who - Hashmap Team Members
Chris Herrera, Chief Architect/Innovation Officer, Hashmap, Houston, TX
Akshay Mhetre, Team Lead, Hashmap, Pune, India
Jay Kapadnis, Lead Architect, Hashmap, Pune, India
The Use Case
Oilfield Drilling Data Processing
Why - Oilfield Drilling Data Processing
[Diagram: The Process. Labels: Plan, Execute, Optimize, WITSML Server, Plan Store]
Why - Oilfield Drilling Data Processing
[Diagram labels: Vendors, Financial, Homegrown; TDM, EDM, WellView, Homegrown; Data Analyst]
The Plan
● How to match the data
● Deduplication
● Missing information
● Various formats
● Various ingest paths
Why - Oilfield Drilling Data Processing
Rig Site Data Flow
[Diagram labels: Mud Logger, Cement, Wireline, MWD; CSV and DLIS files; WITSML Server (x2); Magic; Data Analyst]
● Operational Data
● Missing classification
● Unknown quality
● Various formats
● Various ingest paths
● Unknown completeness
Why - Oilfield Drilling Data Processing
Oilfield Drilling Data Processing - Office
[Diagram labels: Vendors, Financial, Homegrown; TDM, EDM, WellView, Homegrown; Data Analyst]
● Impossible to generate insights without huge data cleansing operations
● Extracting value is a very expensive operation that has to be done with a combination of experts
● Generating reports requires a huge number of man-hours
Why - Oilfield Drilling Data Processing
BUT WAIT…
Why - Oilfield Drilling Data Processing
● Parse: Parse the data from CSV, WITSML, DLIS, etc.
● Identify & Enrich: Understand where the data came from and what its global key should be
● Load: Load the data into a staging area to start understanding what to do with it
● Clean: Deduplicate, interpolate, pivot, split, aggregate
● Feature Engineering: Generate additional features that are required to get useful insights into the data
● Persist & Report: Land the data into a store that allows for BI reports and interactive queries
We still have all the compute to deal with, some of which is very legacy code
Requirements
What do we have to do?
Functional Requirements
Cleaning and Feature Engineering (the legacy code I referred to)
• Parse WITSML / DLIS
• Attribute Mapping
• Unit Conversions
• Null Value Handling
• Rig Operation Enrichment
• Rig State Detection
• Invisible Lost Time Analysis
• Anomaly Detection
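To make two of the simpler items concrete, here is a minimal Scala sketch of null value handling followed by a unit conversion. It is an illustration only, not the code from the talk: the object and method names are hypothetical, and the -999.25 sentinel is an assumption (a common convention in well-log data), since the slides do not say which null value the pipeline used.

```scala
object CleaningSketch {
  // Assumption: -999.25 marks a missing reading. The source does not state
  // which sentinel value its pipeline actually handled.
  val NullSentinel: Double = -999.25

  // Null value handling: turn sentinel readings into None so that later
  // steps can decide whether to drop or interpolate them.
  def handleNull(raw: Double): Option[Double] =
    if (raw == NullSentinel) None else Some(raw)

  // Unit conversion: feet to meters (1 ft is defined as 0.3048 m).
  def feetToMeters(feet: Double): Double = feet * 0.3048

  // Apply both steps to a channel of depth readings recorded in feet.
  def cleanDepths(rawFeet: Seq[Double]): Seq[Option[Double]] =
    rawFeet.map(v => handleNull(v).map(feetToMeters))

  def main(args: Array[String]): Unit =
    println(cleanDepths(Seq(1000.0, -999.25, 2500.0)))
}
```

Keeping missing readings as `None` rather than silently converting them means a later interpolation step can still see where the gaps were.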
Non-Functional Requirements
| # | Requirement | Description |
|---|-------------|-------------|
| 1 | Heterogeneous Data Ingest | Very flexible ingest; flexible simple transformations |
| 2 | Robust Data Pipeline | Easy to debug; trusted |
| 3 | Extensible Feature Engineering | Be able to support existing computational frameworks / runtimes |
| 4 | Scalable | Scales up; scales down |
| 5 | Reliable | If a data processing workflow fails at a step, it does not continue with erroneous data |
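The reliability requirement (a failed step must not pass erroneous data on) can be sketched with Scala's `Either`. This is a hedged illustration under assumed names, not the workflow engine from the talk: `Step`, `run`, and the three example steps are all hypothetical.

```scala
object FailFastSketch {
  // A step either produces the next dataset or an error message.
  type Step = Seq[Double] => Either[String, Seq[Double]]

  // Run the steps in order. Once a step returns Left, flatMap no longer
  // applies the remaining steps, so no later step sees bad data.
  def run(input: Seq[Double], steps: Seq[Step]): Either[String, Seq[Double]] =
    steps.foldLeft[Either[String, Seq[Double]]](Right(input)) {
      (acc, step) => acc.flatMap(step)
    }

  // Hypothetical steps, for illustration only.
  val rejectEmpty: Step = data =>
    if (data.isEmpty) Left("no data to process") else Right(data)
  val rejectNaN: Step = data =>
    if (data.exists(_.isNaN)) Left("NaN found") else Right(data)
  val toMeters: Step = data => Right(data.map(_ * 0.3048))

  def main(args: Array[String]): Unit = {
    val steps = Seq(rejectEmpty, rejectNaN, toMeters)
    println(run(Seq(100.0, Double.NaN), steps)) // stops at rejectNaN
    println(run(Seq(100.0, 200.0), steps))      // all steps applied
  }
}
```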
Approach V1
How Then?
Solution V1
[Diagram labels: sources TDM, EDM, WellView, Homegrown, WITSML Server, CSV Files; Staging on HDFS (TDM, EDM, WellView, WITSML); Reporting Marts on HDFS and Hive; Spark, Zeppelin, BI]
● Heterogeneous ingest implemented through a combination of NiFi processors/flows and Spark jobs
● Avro files loaded as external tables
● BI connected via ODBC (Tableau)
● Zeppelin Hive interpreter was used to access the data in Hive
Issues with the Solution
● Very Slow BI
● Tough to debug cleansing
● Tough to debug feature extractions
● A lot of overhead for limited benefit
● Painful data loading process
● Incremental refresh was challenging
● Chaining the jobs together in a workflow was very hard
○ Mostly achieved via Jupyter Notebooks
● In order to achieve the functional requirements, all of the computations were implemented in Spark, even if there was little benefit
V1 Achieved Requirements
| # | Requirement | Description |
|---|-------------|-------------|
| 1 | Heterogeneous Data Ingest | Very flexible ingest; flexible simple transformations |
| 2 | Robust Data Pipeline | Hard to debug; hard to modify |
| 3 | Extensible Feature Engineering | Hard to support other frameworks; hard to modify current computations |
| 4 | Scalable | Scales up but not down |
| 5 | Robust | Hard to debug |
Approach V1.5
An Architectural Midstep
A Quick Architectural Midstep (V1.5)
[Diagram labels: sources TDM, EDM, WellView, Homegrown, WITSML Server, CSV Files; Staging on HDFS (TDM); Ignite (WITSML, EDM, WellView) with In-Memory MapReduce; Reporting Marts on HDFS/IGFS and Hive; Spark, Jupyter, BI]
● Complicated an already complex system
● Did not solve all of the problems
● Needed a simpler way to solve all of the issues
● Ignite persistence was released while we were investigating this
Approach V2
How Now?
Approach V2
[Diagram labels: Kubernetes; Docker; HDFS; Ignite (Service Grid, Memory Grid); Caches; Workflow Cache (x2); Functions; Workflow API, Scheduler API, Functions API; Spark, Flink, Zeppelin; Persistent Storage (Configurable)]
● Allows for very interactive workflows
● Workflows can be scheduled
● Each workflow is made up of functions (microservices)
● Each instance of a workflow contains its own cache
● Zeppelin via the Ignite interpreter
● Workflows loaded data and also processed data
Approach V2 - The Workflow
[Diagram labels: Source; Function 1, Function 2, Function 3, each running as a Service; Apache Ignite (Key/Val, SQL / DF) between the functions]
● Source is the location the data is coming from
● The workflow is the data that goes from function to function
● Data stored as data frames can be queried by an API or another function
Approach – The Workflow
• Each function runs as a service using Service Grid
• The function receives input from any source
  • Kafka*
  • JDBC
  • Ignite Cache
• Once the function is applied, store the result into the Ignite cache store
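The read, apply, store pattern described above can be sketched without any Ignite dependency. In this minimal illustration a mutable map stands in for the Ignite cache, and the `Source` trait, `InMemorySource`, `CacheSource`, and `runFunction` are hypothetical names; none of them come from the talk or from Ignite's API.

```scala
import scala.collection.mutable

object WorkflowFunctionSketch {
  // Stand-in for an Ignite cache: an in-process map keyed by name.
  // Assumption: it mimics only get/put, not distribution or persistence.
  val cache: mutable.Map[String, Seq[Double]] = mutable.Map.empty

  // A source is anything that can yield a batch of values.
  trait Source { def read(): Seq[Double] }

  // Hypothetical sources. A Kafka or JDBC source would implement the same trait.
  final case class InMemorySource(values: Seq[Double]) extends Source {
    def read(): Seq[Double] = values
  }
  final case class CacheSource(key: String) extends Source {
    def read(): Seq[Double] = cache.getOrElse(key, Seq.empty)
  }

  // A workflow function: read from a source, apply the computation,
  // and store the result in the cache under outputKey.
  def runFunction(source: Source, outputKey: String)(f: Seq[Double] => Seq[Double]): Unit =
    cache.update(outputKey, f(source.read()))

  def main(args: Array[String]): Unit = {
    runFunction(InMemorySource(Seq(1.0, 2.0, 3.0)), "step1")(_.map(_ * 2))
    // The second function reads the first function's output from the cache.
    runFunction(CacheSource("step1"), "step2")(_.filter(_ > 2.0))
    println(cache("step2"))
  }
}
```

Because every function writes to a named cache entry, the next function (or an API) can read that entry without knowing which runtime produced it.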
Workflow Capabilities
● Start / Stop / Restart
● Execute single functions within a workflow
● Pause execution to validate intermediate steps
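One way to picture these capabilities is a runner that keeps every intermediate result, so a workflow can stop after any function, be inspected, and then resume or restart. The sketch below is an assumption-laden illustration, not the Workflow API from the talk: `Step`, `Workflow`, `runNext`, `runUntil`, and `restart` are hypothetical names, and step names are assumed to be unique.

```scala
import scala.collection.mutable

object WorkflowRunnerSketch {
  final case class Step(name: String, f: Seq[Double] => Seq[Double])

  // Hypothetical runner that records each step's output by step name.
  final class Workflow(source: Seq[Double], steps: Vector[Step]) {
    private val results = mutable.LinkedHashMap.empty[String, Seq[Double]]
    private var next = 0

    def isDone: Boolean = next >= steps.length

    // Execute a single function; its input is the previous step's output.
    def runNext(): Unit = if (!isDone) {
      val input = if (next == 0) source else results(steps(next - 1).name)
      results(steps(next).name) = steps(next).f(input)
      next += 1
    }

    // Run until finished, or until the named step has completed (a pause).
    def runUntil(pauseAfter: Option[String] = None): Unit =
      while (!isDone && !pauseAfter.exists(results.contains)) runNext()

    // Restart discards intermediate results and begins again from the source.
    def restart(): Unit = { results.clear(); next = 0 }

    def result(step: String): Option[Seq[Double]] = results.get(step)
  }

  def main(args: Array[String]): Unit = {
    val wf = new Workflow(Seq(1.0, -1.0, 3.0), Vector(
      Step("dropNegatives", _.filter(_ >= 0.0)),
      Step("double", _.map(_ * 2.0))))
    wf.runUntil(Some("dropNegatives")) // pause to validate the intermediate step
    println(wf.result("dropNegatives"))
    wf.runUntil()                      // resume to completion
    println(wf.result("double"))
  }
}
```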
Approach - Spark Based Functions - Persistence
• After each function has completed its computation the Spark DataFrame is stored via distributed storage
• The table is backed by an Ignite cache named SQL_PUBLIC_<tableName>
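    import org.apache.ignite.spark.IgniteDataFrameSettings._

    // Write the function's output DataFrame to Ignite as a SQL table.
    // Ignite's Spark integration typically also needs OPTION_CONFIG_FILE (path to the Ignite config).
    df.write
      .format(FORMAT_IGNITE)
      .option(OPTION_TABLE, tableName) // table name to store data
      .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
      .save()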
[Diagram labels: Spark Function; Service; Apache Ignite (Key/Val, DF)]
Approach – Intermediate Querying
• Once the data is in the cache, the data can be optionally persisted using the Ignite persistence module
• The data can be queried using the Ignite SQL grid module as well
• Allows for intermediate validation of the data as it proceeds through the workflow
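    import org.apache.ignite.cache.query.SqlFieldsQuery

    // Query the first 20 rows of the intermediate table through the cache API.
    val cache = ignite.getOrCreateCache(cacheConfig)
    val cursor = cache.query(new SqlFieldsQuery(s"SELECT * FROM $tableName LIMIT 20"))
    val data = cursor.getAll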
[Diagram labels: Spark Function; Service; Apache Ignite (Key/Val, DF); API]
Approach - Applied to the Use Case
[Diagram labels: WITSML Server; Java WITSML Client (Docker); Channel Mapping / Unit Conversion (Docker); Rig State Detection / Enrichment / Pivot (Spark); each function running as a Service; Apache Ignite (Key/Val, SQL) between the functions; Workflow API, Scheduler API]
V2 Achieved Requirements
| # | Requirement | Description |
|---|-------------|-------------|
| 1 | Heterogeneous Data Ingest | Very flexible ingest; flexible transformations |
| 2 | Robust Data Pipeline | Easy to debug; easy to modify |
| 3 | Extensible Feature Engineering | Easy to add; easy to experiment |
| 4 | Scalable | Scales up; scales down |
| 5 | Robust | Easy to debug; reliable |
Solution Benchmark Setup
• Dimension tables already loaded
• 8 functions (6 wells of data – 5.7 billion points)
  • Ingest / Parse WITSML
  • Null Value Handling
  • Interpolation
  • Depth Adjustments
  • Drill State Detection
  • Rig State Detection
  • Anomaly Detection
  • Pivot Dataset
• For V1 everything was implemented as a Spark application
• For V2 the computations remained close to their original format
Solution Comparison
V1 - Execute Time
• 9 Hours
• Without WITSML Download: 7 Hours
V2 - Execute Time
• 2 Hours
• Without WITSML Download: 22 minutes
19x improvement from V1 to V2 (without WITSML download: 7 hours down to 22 minutes)
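The headline figure follows from the execute times without the WITSML download. A short Scala check of the arithmetic, using only the numbers on this slide (the object and value names are illustrative):

```scala
object SpeedupSketch {
  // Speedup = old duration / new duration, both in minutes.
  def speedup(beforeMinutes: Double, afterMinutes: Double): Double =
    beforeMinutes / afterMinutes

  // Execute times taken from the comparison above.
  val withoutDownload: Double = speedup(7 * 60, 22)     // 420 / 22
  val withDownload: Double    = speedup(9 * 60, 2 * 60) // 540 / 120

  def main(args: Array[String]): Unit = {
    println(f"Without WITSML download: $withoutDownload%.1fx")
    println(f"With WITSML download:    $withDownload%.1fx")
  }
}
```

420 / 22 is roughly 19.1, which matches the quoted 19x. Including the download, the ratio is 540 / 120 = 4.5, so the download accounts for most of the remaining V2 run time.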
Lessons Learned
How Now?
Lessons Learned
● Apache Ignite is a great tool to speed up data processing without a wholesale replacement of technology
● Apache Ignite does have a learning curve; it is definitely worth doing an analysis beforehand to understand what it means to operationalize it
● Accelerating Hive via Ignite was not straightforward and at times made it very difficult to debug the actual issues that we were facing
● Spatial querying, while great, is LGPL-licensed, so be aware of that before using it in your specific implementation
● Understanding data locality in Ignite is crucial for larger data sets
● Ignite works very well inside of Kubernetes due to its peer-to-peer clustering mechanism
● The thin client JDBC driver does not have affinity awareness, so in multi-node configurations, the thick client is preferred
What’s Next
How Now?
What’s Next
● Implementation of a UI on top of the computational framework
● Implementation of a standard set of “functions” that can be leveraged on top of the memory grid
● Implementation of streaming sources via Kafka Ignite Sink
Questions
Chris Herrera, Hashmap