CIGNEX Datamatics Confidential www.cignex.com
Big Data Offerings by CIGNEX Datamatics
December 2012
Presented by
Name: Munwar Shariff
Email: [email protected]
Title: CTO
BIG DATA CASE STUDIES
CIGNEX Datamatics
Patent Search – Situational Analysis using Big Data
• Leading Global Manufacturing Company operating in chemicals, plastics, catalysts
• Challenges Faced
– Proprietary enterprise search engine developed on Oracle Database & IBM FileNet
• Information on over 80 million patents (Patent Family, Facet, Proximity, Highlight etc.) stored in various repositories
• Slow, non-scalable and expensive
• Assorted interfaces for accessing data
• Solution
– Hadoop, Solr and Neo4j based solution
• Synchronization layer – for interfacing with REST web services
• Persistence layer – HBase/Hadoop
• Indexing layer – Solr search
• Neo4j to handle Patent Family calculation
• Migrated millions of records to Apache Hadoop using a CD HBase Hook for Liferay
• Client Benefits
– 10x increase in search performance
• Average search time reduced to 6.8 ms (from 70 ms)
• Document throughput increased to 62/second (from 6/second)
– 20x reduction in TCO
• Replaced expensive IBM FileNet and Oracle DB based infrastructure with open source tools
• Use of commodity hardware
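The slides do not say how the Patent Family calculation is done in Neo4j. As a rough illustration only, the sketch below assumes a family is the set of patents connected, directly or indirectly, through shared priority claims, and groups them with a plain breadth-first search in Python. The patent numbers and the `priority_links` data are made up.

```python
from collections import defaultdict, deque

def patent_families(priority_links):
    """Group patents into families = connected components of the link graph.

    priority_links: iterable of (patent, priority_patent) pairs (hypothetical data).
    Returns a list of sorted families, ordered by their first member.
    """
    graph = defaultdict(set)
    for patent, priority in priority_links:
        graph[patent].add(priority)
        graph[priority].add(patent)

    seen = set()
    families = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        queue = deque([start])
        seen.add(start)
        family = []
        while queue:
            node = queue.popleft()
            family.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        families.append(sorted(family))
    return families

# Made-up links: two patents claim priority from US-1, one from JP-4.
links = [("US-2", "US-1"), ("EP-3", "US-1"), ("JP-5", "JP-4")]
families = patent_families(links)
print(families)
```

A graph database suits this kind of query because family membership is a reachability question rather than a single join.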
Patent Search – Situational Analysis using Big Data
[Architecture diagram. Labels: Client Applications; Authentication Server; API Server; Status Server; Queue Manager; Repository Server; Models; Locks; Processors; Controllers; Services; Status; Persistence (MongoDB & Neo4j); Indexing Engines; External Data & Applications; External Indexes; Specialized Service Providers]
Managing high volume data feeds through MongoDB
• Leading provider of a software-based CMS that manages the entire video publishing lifecycle
• Challenges Faced
– 30 Million INSERTS / hour
– 10 Million UPDATES / hour
– ~150 GB/day & ~5 TB/month – high volume data growth
– CONCURRENT high volume CRUD operations in REAL TIME
– Poor performance of READ queries
– Difficulty in identifying the Shard Keys, Indexes & Cluster configuration
• Solution
– Identify collections that require sharding, and choose shard keys that balance write distribution and query isolation
– Identify collections that require Indexes to speed up READS
– Scientific Cluster sizing & optimal hardware recommendations
– Options around Data archival for better utilization of Cluster configuration
– Performance Tuning tips around collection design
– Performance benchmarking of suggested shard keys, indexes through Load Testing
• Client Benefits
– 5x improved WRITE / UPDATE Performance through SHARDING
– Better utilization of existing infrastructure in the cluster configuration (mongos and arbiter on app servers)
– Automated Performance Tuning scripts for Testing the recommended approach
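To illustrate why the choice of shard key drives write distribution, the sketch below simulates inserts over four shards in plain Python; it does not talk to MongoDB. It compares a range-partitioned, monotonically increasing key (for example a counter or timestamp) with a hashed key. The shard count, range boundaries and key values are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumption for illustration

def range_shard(key, boundaries):
    """Range partitioning: pick the first range whose upper bound exceeds the key."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # last shard holds everything above the final boundary

def hashed_shard(key):
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Existing data covered keys 0..29_999, split into ranges beforehand.
boundaries = [10_000, 20_000, 30_000]

# New inserts arrive with ever-increasing keys.
new_keys = range(30_000, 40_000)

range_load = Counter(range_shard(k, boundaries) for k in new_keys)
hash_load = Counter(hashed_shard(k) for k in new_keys)

print("range-partitioned:", dict(range_load))
print("hashed:           ", dict(sorted(hash_load.items())))
```

In this toy model every new insert lands on the last shard under range partitioning, while the hashed key spreads inserts across all four. The trade-off is that a hashed key scatters range queries across shards, which is the tension between write distribution and query isolation named in the solution.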
Managing high volume data feeds through MongoDB
Solution Architecture
[Diagram. Labels: Load Balancer; App Tier (App Servers, mongos); Data Tier (Shard 1, Shard 2, Shard 3, Shard 4; Replica Set of mongod Primary, mongod Secondary and mongod Arbiter); Config Servers (three mongod). Legend: routed requests from mongos to shards; routed for non-sharded collections]
Real-time intelligence for fleet management & worksites
• Leading provider of advanced location-based solutions
• Challenges Faced
– Varying formats & sizes of data feeds from different devices on the sites
– ~5 million inserts / day from ~200,000 devices
– Improve performance of READS every hour
– Handle disaster recovery from multi-geography data centers
– 24*7 support
• Solution
– Overall health check of the system & recommendations
– Efficient indexes based on read patterns
– Robust disaster recovery & failover plan considering different scenarios
– Multi data center deployment planning
– Disaster recovery & Failover testing
– MongoDB Monitoring Service (MMS) setup for cluster administration & maintenance
• Client Benefits
– 2x improved performance through the RIGHT indexes
– 24*7 support of MongoDB cluster with instant response to issues and 99% uptime
– Real time & instant intelligence on key monitoring metrics
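The failover scenarios mentioned above can be reasoned about with a small model. A MongoDB replica set can elect a primary only while a majority of its voting members can reach each other. The sketch below assumes a three-member set with two members in DC-1 and one in DC-2; the slides do not state the placement, so the layout and member names are hypothetical.

```python
def can_elect_primary(members, failed_sites):
    """Return True if the surviving voting members still form a strict majority."""
    total = len(members)
    alive = sum(1 for site in members.values() if site not in failed_sites)
    return alive > total // 2

# Hypothetical placement: member name -> data center.
replica_set = {"node-a": "DC-1", "node-b": "DC-1", "node-c": "DC-2"}

scenarios = {
    "no outage": set(),
    "DC-2 down": {"DC-2"},
    "DC-1 down": {"DC-1"},
}

results = {name: can_elect_primary(replica_set, failed)
           for name, failed in scenarios.items()}

for name, ok in results.items():
    print(f"{name}: {'primary can be elected' if ok else 'no primary (writes unavailable)'}")
```

Under this assumed layout the set survives the loss of DC-2 but not of DC-1, which is the kind of case a failover test plan would need to cover.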
Real-time intelligence for fleet management & worksites
Solution Architecture
[Diagram. Labels: Connected GPS Devices; Load Balancer; App Servers; Replica Set 1 (Primary, Secondary, Secondary); DC - 1; DC - 2]
Hadoop based Log Processing & Analysis
• Global IT Services Company
• Challenges Faced
– Existing RDBMS solution was incapable of aggregating and managing large unstructured logs generated from different systems
• Lack of control over collection and manipulation of log files due to high volume
• Adding a new log cluster to the existing system was difficult and slowed system performance
• High maintenance costs due to investment in high-end storage
• Solution
– Log Processing and Analysis
• Apache Flume – distributed system for aggregating streaming data
• HDFS – primary Hadoop storage system
• MapReduce – framework for processing large amounts of data in parallel
• Sqoop – efficient bulk data transfer between Hadoop & structured data stores
• Pentaho – Open Source Data Integration
• Client Benefits
– Seamless aggregation and archival of log files, regardless of the environment in which they are generated
• IT team received a 360-degree view into employee usage patterns
• Rich user interface with accessibility through mobile devices and tablets
• Cost advantage through non dependence on high end storage networks
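As a rough picture of the MapReduce step, the sketch below runs the map, shuffle and reduce phases in a single Python process over a few made-up log lines. The log format (`date source level message`) is an assumption; a real job would run on Hadoop over files in HDFS.

```python
from collections import defaultdict

# Hypothetical log lines: "<date> <source> <level> <message>"
LOG_LINES = [
    "2012-12-01 web INFO GET /index.html",
    "2012-12-01 web ERROR GET /missing",
    "2012-12-01 mail INFO delivered",
    "2012-12-02 web INFO GET /index.html",
    "2012-12-02 firewall WARN blocked port 23",
]

def map_phase(line):
    """Emit ((date, source), 1) for every log line."""
    date, source, _level, _message = line.split(" ", 3)
    yield (date, source), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one key."""
    return key, sum(values)

mapped = (pair for line in LOG_LINES for pair in map_phase(line))
summary = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

for (date, source), count in sorted(summary.items()):
    print(date, source, count)
```

The resulting per-day, per-source counts are the kind of summary that could then be exported to a relational store for reporting.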
Hadoop based Log Processing & Analysis
[Diagram. Stages: Data Sources; Collection and Analytics; Reporting. Data sources: Mail Logs, Server Syslogs, Web Logs, Firewall Logs, VoIP Call Logs. Other labels: Scheduler; Dashboard]
1) Fetch logs from server to HDFS using Flume
2) Run MapReduce on logs collected daily and generate summary in HDFS
3) Export summary from HDFS to MySQL using Sqoop
4) Generate reports on dashboard using Pentaho on MySQL
Hadoop based Log Processing & Analysis (Pentaho Reporting for Mobile Devices)
BIG DATA SOLUTIONS
CIGNEX Datamatics
BigArchive - Enterprise Scale Archival Solution
• Scalable distributed repository to archive a large number and variety of documents
• Low cost and high performance – uses open standards and open source technologies such as MongoDB, Solr, Apache Tika
• Dynamically captures content and metadata from the documents at load time, stores them in MongoDB and indexes them in Solr
• Provides enterprise search and high performance retrieval of documents
• REST-based API that interoperates with custom client applications built on Java, PHP, .NET
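The load-time flow described above (extract content and metadata, store the document, index it for search) can be sketched with in-memory stand-ins. In the sketch below a dict plays the role of the document store and an inverted index plays the role of the search engine; metadata extraction is reduced to a trivial function. All class and function names are hypothetical, and none of this is the BigArchive API.

```python
from collections import defaultdict

def extract_metadata(name, content):
    """Stand-in for a content analyser: derive a few simple metadata fields."""
    return {
        "name": name,
        "extension": name.rsplit(".", 1)[-1].lower(),
        "length": len(content),
    }

class ToyArchive:
    def __init__(self):
        self.store = {}                # doc_id -> document (document store stand-in)
        self.index = defaultdict(set)  # term -> doc_ids (search index stand-in)

    def archive(self, doc_id, name, content):
        """Capture metadata at load time, persist the document, then index its terms."""
        metadata = extract_metadata(name, content)
        self.store[doc_id] = {"content": content, "metadata": metadata}
        for term in set(content.lower().split()) | {metadata["extension"]}:
            self.index[term].add(doc_id)
        return metadata

    def search(self, term):
        """Return matching document ids in sorted order."""
        return sorted(self.index.get(term.lower(), set()))

    def retrieve(self, doc_id):
        """Return the stored document with its metadata."""
        return self.store[doc_id]

archive = ToyArchive()
archive.archive(1, "Invoice.PDF", "invoice for catalyst shipment")
archive.archive(2, "notes.txt", "meeting notes about shipment delays")

print(archive.search("shipment"))
print(archive.search("pdf"))
print(archive.retrieve(2)["metadata"])
```

Keeping storage and indexing as separate steps mirrors the split between a document store and a search engine in the description above.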
BigArchive - Architecture
[Diagram. Labels: User Interface; Web Service (Request, Response); RESTful Service Layer API (Jersey); Controller (Custom Java + Netty); Repository. Operations: CUD; Search & Retrieve; Persist Object; Retrieve Object; Index Metadata; Search Metadata; Content & Metadata]
Mobile Media site with Drupal + MongoDB
Mobile Explosion
By 2015, at least 60% of information workers will interact with their content applications via a mobile device.
Employees work on proposals and presentations on mobile devices while travelling.
People use digital assets (videos, images) longer on tablets and mobiles than on desktops.
Mobile Media site with Drupal + MongoDB
Mobile Media Site
• Velocity: fast performance, large user base, concurrent CRUD, access through various channels
• Volume: millions of digital assets, variety of content, complexity of data
• User experience: rich UI features, social features, mobile access, fast search
• Scalability: elastic scaling, cost effectiveness, centralized storage, ease of maintenance
• Security & Availability: high availability, automatic failover, user management
• Flexibility & Agility: easy integration, shorter dev cycle, faster deployment, ease of schema design
Big Data Portal with Liferay + MongoDB
• A Big Data Portal built with MongoDB and Liferay provides lower TCO and higher ROI to enterprises
• MongoDB gives portals scalability (for huge volumes of content) and flexibility (schema-less content)
• Liferay’s rich user interface, content management, security, social and mobile features complement MongoDB’s powerful storage features
Big Data Portal with Liferay + MongoDB
Instantaneous Streaming & Playback for Videos
• RDBMS – video (size: 30 MB) stored as a single collection; on an incoming request the entire video (30 MB) is loaded into the user device, causing buffer and playback problems
• NoSQL (MongoDB) – video (size: 30 MB) stored in collections as 3 MB chunks; on an incoming request individual chunks are loaded, leading to no playback problems
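A minimal sketch of the chunking idea, assuming fixed-size chunks as in the slide (3 MB chunks of a 30 MB video): the file is split into numbered chunks, and a playback request for a byte range reads only the chunks that overlap it. The in-memory dict stands in for a chunk collection, and all names are hypothetical.

```python
MB = 1024 * 1024
CHUNK_SIZE = 3 * MB  # chunk size taken from the slide

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return {chunk_number: bytes} covering the whole payload."""
    return {n: data[offset:offset + chunk_size]
            for n, offset in enumerate(range(0, len(data), chunk_size))}

def read_range(chunks, start, length, chunk_size=CHUNK_SIZE):
    """Read `length` bytes from `start`, touching only the chunks that overlap the range."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    touched = list(range(first, last + 1))
    joined = b"".join(chunks[n] for n in touched)
    begin = start - first * chunk_size
    return joined[begin:begin + length], touched

video = bytes(30 * MB)  # placeholder payload standing in for a 30 MB video
chunks = split_into_chunks(video)

data, touched = read_range(chunks, start=4 * MB, length=1 * MB)
print(len(chunks), "chunks; request touched chunks", touched)
```

A player that asks for a small window of the file therefore pulls one or two chunks rather than the full 30 MB.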
Big Data – Integral to CIGNEX Datamatics UXP solution
Integrated Business Ecosystem (IBE) Blueprint
[Diagram. Labels: UXP Components (Mobile, Social, Cloud, Rich Experience, Browser friendly, Real time, Contextual); Platform; Analytics; Integration; Shaping Languages; Metadata; Data Integration; Indexing; Graph Database; EDW; MapReduce; Databases including; Business Process Management; Business Intelligence; Enterprise Resource Planning; Customer Relationship Management; Enterprise Content Management; Portals; E-commerce; CMS; Legacy Solutions; Proprietary SW; .NET Systems; CMS Repositories]
Name: Munwar Shariff
Email: [email protected]
Title: CTO
Thank You. Any Questions?
Making Open Source Work