Transforming data processing at Penton
Raj Nair, Director, Data Platform @ Penton
About Penton
• Professional information services company
• Provide actionable information to five core markets:
Agriculture, Transportation, Natural Products, Infrastructure, Industrial Design & Manufacturing
EquipmentWatch.com - Prices, Specs, Costs, Rental
Govalytics.com - Analytics around Gov’t capital spending down to county level
SourceESB - Vertical directory of electronic parts
NextTrend.com - Identify new product trends in the natural products industry
Practical Hadoop: Hadoop at Penton
What got us thinking?
• Business units process data in silos
• Heavy ETL – Hours to process, in some cases days
• Not even using all the data we want
• Not logging what we needed to
• Can’t scale for future requirements
Data Processing Pipeline
The pipeline delivers new features, new insights, and new products: business value through assembly-line processing.
Data Processing Pipeline: Penton Examples
• Daily Inventory data, ingested throughout the day (tens of thousands of parts)
• Auction and survey data gathered daily
• Aviation Fleet data, varying frequency
Pipeline stages: Ingest, Store → Clean, Validate → Apply Business Rules → Map → Analyze → Report → Distribute
Slow Extract, Transform and Load = frustration + missed business SLAs; it won’t scale for the future
Various data formats, mostly unstructured
Two use cases
• Daily model data – upload and map
– Ingest data, build buckets
– Map data (batch and interactive)
– Build Aggregates (dynamic)
• Inventory data for electronics parts
– Hundreds of thousands of parts daily
– Ingest, map, apply biz rules, distribute
Issues (model data):
- Mapping time
Issues (inventory):
- Business-rules processing
- Indexing time
- Little insight into data quality
- Little insight into failures
Up until today…
• Ingest raw CSVs as tables in RDBMS
• Run stored procedures over batches of data
• Build new tables for website queries
• Build new tables for loading Solr/Search
• Set retention dates to reduce database “clog”
Challenges for Models:
- Mapping, both batch and interactive
- On-the-fly aggregations
- Post-mapping distribution of data
Challenges for Inventory:
- Windows-based source systems
- A large number of small files daily
- File names contain metadata
What are the options? And keep in mind…
Where did we land?
Adopt the Hadoop ecosystem:
- M/R: ideal for batch processing
- Flexible storage
- NoSQL: scale, usability and flexibility
Expanding RDBMS options was ruled out:
- Expensive
- Complex
[Diagram: the chosen stack – HBase and Drools alongside the existing Oracle and SQL Server systems]
Models
   Type                           Files    Records      Projected time to map
1. Auction                        5,000    1,400,000    3 days
2. Rental Rate                    5,000    3,700,000    8 days
3. Resale                         5,000    6,535,000    2 days
4. Serial Number - Manufacturer   5,000    13,700,000   4 days
5. Serial Number - Web            5,000    8,220,000    12 days
   Totals                         25,000   33,555,000
Mapping Operations:
1. By file: run the map operation for a single file
2. By type: run map operations for all files of a specific type
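A minimal sketch of how the two operations can reduce to input-path selection in the MapReduce driver (the HDFS layout, class name, and job wiring here are hypothetical, not our production code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical driver: "by file" maps one uploaded CSV, "by type" globs
// over every file of that type. FileInputFormat expands glob patterns.
public class MapJobLauncher {
    public static Job buildJob(String type, String file) throws Exception {
        Job job = Job.getInstance(new Configuration(), "model-mapping-" + type);
        Path input = (file != null)
                ? new Path("/models/" + type + "/" + file)   // mode 1: by file
                : new Path("/models/" + type + "/*.csv");    // mode 2: by type
        FileInputFormat.addInputPath(job, input);
        // mapper, output format, and HBase sink configuration omitted
        return job;
    }
}
```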
4-Node Hadoop/HBase cluster
The entire set of 25,000 files maps in 52 minutes
On-the-fly aggregations had been materialized views in Oracle; reproducing them required some complicated coprocessor coding in HBase to get the needed performance.
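For reference, a hedged sketch of the server-side aggregation idea using HBase’s bundled AggregateImplementation endpoint (HBase 1.x client API; the table and column names are hypothetical, and values are assumed to be stored as 8-byte longs). Our actual coprocessors were more involved than this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: aggregate region-side so only partial results cross the network.
// Assumes AggregateImplementation is registered on the (hypothetical) table:
//   alter 'model_data', 'coprocessor' =>
//     '|org.apache.hadoop.hbase.coprocessor.AggregateImplementation||'
public class ModelAggregates {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("resale_price")); // hypothetical family:qualifier
        try (AggregationClient agg = new AggregationClient(conf)) {
            long sum  = agg.sum(TableName.valueOf("model_data"), new LongColumnInterpreter(), scan);
            long rows = agg.rowCount(TableName.valueOf("model_data"), new LongColumnInterpreter(), scan);
            System.out.printf("avg resale price = %.2f over %d rows%n", (double) sum / rows, rows);
        }
    }
}
```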
Architecture (Models)
[Diagram: the Data Upload UI and existing business applications make REST API calls to CSV and rule-management endpoints; the API launches MR jobs over CSV files in Hadoop HDFS, with mapped data stored in HBase; accepted data is inserted into the master database of products/parts, which pushes updates to the current Oracle schema.]
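The hand-off from the REST layer to Hadoop can be as simple as an asynchronous submission; a hypothetical flavor (the surrounding REST framework and job wiring are omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical endpoint body: submit() returns immediately, so the HTTP
// request is not held open while the job runs; the caller polls the job ID.
public class MappingJobResource {
    public String launchMappingJob(String csvPath) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-" + csvPath);
        // ... input path, mapper, and HBase sink configured here ...
        job.submit();                      // asynchronous, unlike waitForCompletion()
        return job.getJobID().toString();  // status endpoint can poll this ID
    }
}
```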
But we are not done …
• Our vision is a data platform
– More on that later
• First, the practical aspects of this journey
Allocate Time for Detailed Research
• Know, know, know your source details
- What’s the source of your source?
- What are the different formats?
- What’s the frequency?
- What are their vectors (web, FTP, e-mail, streaming)?
- What metadata do you need or have currently?
- Where’s your metadata?
- What lookup data do you need? What format are they in?
- What data “sinks” do you distribute to (post-processing)?
For instance:
- In Inventory processing, some metadata was part of the filename – this had a big influence on the design (see the sketch below)
- We had lookup data in SQL Server
- We had to distribute data out to SQL Server, Solr, and a data mart
- We had a very large number of small files
- We receive files via e-mail, web, and FTP – for simplicity we converge all vectors to FTP
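Since filenames carry metadata, each mapper needs to know which file its records came from; the standard trick is to pull it from the input split. A sketch (the class name and naming scheme are hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper that recovers filename metadata from the InputSplit,
// e.g. "distributor42_20140310.csv" encodes a distributor id and a date.
public class FilenameAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text sourceFile;

    @Override
    protected void setup(Context context) {
        sourceFile = new Text(((FileSplit) context.getInputSplit()).getPath().getName());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the filename alongside the record so the metadata survives
        context.write(sourceFile, value);
    }
}
```

Note the cast only holds for plain FileInputFormat splits; combined-input formats wrap their splits differently.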
Allocate Time for Detailed Research
• Understand your processing patterns – In detail
- What portions are batch-processing vs interactive?
- Do you need to deal with joins, merges and updates?
- Do you need to process in “near” real-time?
- Are you going to reuse any existing processing workflows?
- How much logging do you want to capture?
- How much of the processing do you want analytics on?
- Revisit with business owners – very important
For Inventory:
- Logging of inventory rejections by business rule (sketch below)
- Operational tracking of processing performance
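Hadoop’s built-in counters are one cheap way to get that per-rule visibility without extra log plumbing; a sketch (the parsing and rule checks are hypothetical stand-ins for the real Drools evaluation):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One counter per business rule: rejection totals show up in the job UI and
// job history, answering *why* records were dropped, not just how many.
public class InventoryRulesMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String failedRule = firstFailedRule(value.toString());
        if (failedRule != null) {
            context.getCounter("InventoryRejections", failedRule).increment(1);
            return; // drop the record
        }
        context.getCounter("Inventory", "Accepted").increment(1);
        context.write(key, value);
    }

    // Hypothetical stand-in for the rule engine; returns the first rule violated.
    private String firstFailedRule(String line) {
        if (line.trim().isEmpty()) return "EmptyLine";
        if (line.split(",", -1).length < 5) return "TooFewFields";
        return null;
    }
}
```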
For Models:
- The business team needed to perform mapping functions interactively
- Aggregations had to be built in real time after upload
Allocate Time for Detailed Research
• Investigate the different workloads
- What’s the volume of your transactional workloads?
- What’s the nature of the workloads (read, write, read-heavy, write-heavy…)?
- Is there a requirement for Exploratory BI / DW?
- Is there a requirement for high-performance BI?
- What’s the expected data growth rate?
Skills and Expertise
Invest in learning
Hadoop
Get used to:
- File-based processing
- Key-value pairs
- Distributed computing
Pay special attention to:
- InputFormats and InputSplits
- OutputFormats and controlling output
- The small-files problem (sketch after this list)
- “Append only” storage
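As one mitigation for the small-files problem, many small inputs can be packed into fewer splits so each mapper does meaningful work. A sketch using CombineTextInputFormat from Hadoop 2.x (paths and the 128 MB target are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Thousands of small CSVs become a handful of ~128 MB splits instead of
// one mapper per file, avoiding per-task startup overhead.
public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-csvs");
        job.setJarByClass(SmallFilesJob.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // mapper/reducer omitted; the identity defaults suffice for a smoke test
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```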
NoSQL
Keep an open mind. Get acquainted with:
- Flexible schemas (illustrated below)
- Fewer joins
- The CAP theorem
- Constraints relative to an RDBMS
- Sharding and clustering
Make sure you understand:
- When you actually reap the benefits
- Indexing, or the lack of it
- Performance benchmarks from unbiased studies
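“Flexible schemas” is concrete in HBase: columns exist per row, so two rows in the same table can carry different columns with no schema migration. A sketch (HBase 1.x client API; table and column names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Two rows, different columns, one table: no ALTER TABLE required.
public class FlexibleSchemaDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table parts = conn.getTable(TableName.valueOf("parts"))) {
            Put p1 = new Put(Bytes.toBytes("part-001"));
            p1.addColumn(Bytes.toBytes("d"), Bytes.toBytes("price"), Bytes.toBytes("12.50"));
            Put p2 = new Put(Bytes.toBytes("part-002"));
            p2.addColumn(Bytes.toBytes("d"), Bytes.toBytes("price"), Bytes.toBytes("7.00"));
            p2.addColumn(Bytes.toBytes("d"), Bytes.toBytes("rohs"), Bytes.toBytes("Y")); // only this row has it
            parts.put(p1);
            parts.put(p2);
        }
    }
}
```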
And good luck finding skills in the market
Be prepared to do POCs – there are many dangers in not testing with your own data set
Enough already… no more MapReduce
Watch the Trends!
Hadoop is becoming the OS of Big Data
Production use cases are still MR for batch
Abstractions will rise to save the day
Some interesting challenges lie ahead
• Infusing more context into the processing pipeline → Data Enrichment
• Content recommendations → Machine Learning
• Moving the pipeline from batch to near real-time → Real-time ingestion and processing, real-time Solr indexing
Incubate, Innovate … But keep application integration seamless
[Diagram: a shared data infrastructure – ingestion, ETL, metadata, logging – managing data flow across the Hadoop ecosystem, data marts, the OLTP store, and NoSQL]
Data as a Platform
We are building a data platform
And so…
We are hiring!!!
Java, Hadoop, ETL, Data Warehousing, NoSQL, Machine Learning, Python, PHP, Spark, Drools
Comp Sci, Comp Eng.