(ATS3-PLAT06) Handling “Big Data” with Pipeline Pilot (MapReduce/NoSQL)
DESCRIPTION
Pipeline Pilot has wrangled large volumes of scientific data for many years. The emergence of "Big Data" challenges in other fields has brought many new tools and techniques to the table. This session will demonstrate various approaches to handling big data in Pipeline Pilot and show how Pipeline Pilot can integrate with "NoSQL" data stores such as Apache Cassandra and MongoDB. The second half of this session will focus on audience participation and open discussion around big data tools and techniques, to help inform our community and our future product road map.
TRANSCRIPT
(ATS3-PLAT06) Handling “Big Data” with Accelrys Enterprise Platform (MapReduce/NoSQL)
Jason Benedict
Sr. Architect, Platform R&D
[email protected]
The information on the roadmap and future software development efforts are intended to outline general product direction and should not be relied on in making a purchasing decision.
Agenda
• Intended Audience: Pipeline Pilot Protocol Authors and Software Developers
• Domains of Applicability
• Infrastructure
– Parallel Processing & Grid Support
– MapReduce-like Operations with Parallel Processing
– Hadoop?
– NoSQL (Reader, Writer, ETL, etc.)
• MongoDB (document-based)
• Cassandra (big table)
• Paradigm 4 (array-based)
• Survey of other systems
• Algorithms
Big Data & Big Compute in the Scientific Domains
• Next Generation Sequencing
– Mapping
– SNP Calling
• High Content Screening
– Image Analysis
• Modeling & Simulations in Life Science
– Exploring 3D conformation space with Pharmacophore and Virtual Docking
HPC & Big Data within Pipeline Pilot
Cluster
• Built into Pipeline Pilot
Grid
• Leverages an existing grid engine
– Sun Grid Engine
– PBS Pro
– LSF
– Custom Scripts
• Batches incoming records into processing units and dispatches each batch to one of the servers
• Leverages computing power
– Multiprocessors/Multicore
– Clusters
– Grids
– Loosely Coupled Distributed Systems (peer-to-peer)
• Processing data on different platforms or different bit sizes
– For instance: the main PP server is Win64, but an application you want to integrate only runs on Linux. Use a Parallel Sub-protocol to send that specific processing to a Linux PP server.
– Example 2: the main PP server is Win64, but an application only requires 32-bit. Use parallel processing to send it to a Win32 PP server.
• Improving performance for operations that block on I/O:
– For instance, making remote SOAP calls that take several seconds to complete.
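The batch-and-dispatch model described above can be sketched in plain Python. This is a conceptual illustration only, not Pipeline Pilot's implementation: `fetch_remote` is a hypothetical stand-in for a slow remote call (such as a SOAP request), and the batch size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_remote(record):
    """Hypothetical stand-in for a slow remote call (e.g. a SOAP request)."""
    time.sleep(0.01)  # simulated network latency
    return {**record, "result": record["id"] * 2}

def process_batch(batch):
    """Process one batch of records on a worker."""
    return [fetch_remote(r) for r in batch]

def run_parallel(records, batch_size=4, workers=3):
    # Batch incoming records into processing units...
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    # ...and dispatch each batch to one of the workers.
    # Threads help here because the work blocks on I/O, not CPU.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = []
        for out in pool.map(process_batch, batches):  # order preserved
            results.extend(out)
    return results

records = [{"id": i} for i in range(10)]
print(len(run_parallel(records)))  # 10
```

Because `pool.map` preserves batch order, the output stream keeps the input record order, mirroring how batched records are reassembled after parallel dispatch.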
Parallel Sub-protocols
• When not to use it
– Operations with heavy file I/O. For instance: do NOT read a large image, pass it into a parallel subprotocol, process it, and pass it back out. Each data record is serialized into and out of the parallel subprotocol. Instead, pass a REFERENCE to the image as a property on the data record.
– Trying to speed up file reading.
– Operations that require the FULL data set (e.g. Merge Data, Sort Data, etc.)
• For multiprocessor, grid, and cluster deployments, use “Localhost” as the server.
• LOTS of information is contained in the User Help docs, including pitfalls and suggestions. The topic is “Parallel Processing Subprotocols” under the “User Help” section.
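The pass-a-reference advice above can be made concrete with a small sketch. This is a generic illustration, not Pipeline Pilot code: the file name and record fields are hypothetical, and a temporary directory stands in for shared storage such as NFS.

```python
import os
import tempfile

# Write the large payload once to "shared" storage (a temp dir here).
tmpdir = tempfile.mkdtemp()
image_path = os.path.join(tmpdir, "scan_001.dat")
with open(image_path, "wb") as f:
    f.write(b"\x00" * 1_000_000)  # stand-in for a large image

# Anti-pattern: the record carries the whole payload, so the payload
# is serialized into and out of every parallel subprotocol boundary.
with open(image_path, "rb") as f:
    heavy_record = {"id": 1, "image_bytes": f.read()}

# Preferred: the record carries only a reference (a path property);
# each worker reads the file from shared storage when it needs it.
light_record = {"id": 1, "image_path": image_path}

def process(record):
    """Worker-side processing: dereference the path, then work locally."""
    with open(record["image_path"], "rb") as f:
        data = f.read()
    return len(data)

print(process(light_record))  # 1000000
```

Only the small `light_record` dictionary crosses the serialization boundary; the megabyte payload moves over the shared file system once instead of being copied with every record.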
Public / Private Clusters & IP-based Load Balancing

[Diagram: clients log in to the primary Pipeline Pilot server, which executes jobs on secondary Pipeline Pilot servers; users and the file system are shared over NFS.]
Server Clustering Detail

[Diagram: the head node runs Apache with authentication & authorization (A&A), the XMLDB service (user data and job data), the locator service, the logging service, and protocol web services. Runner nodes 1..N each run Apache (A&A), a runner service, and file access; the runner service hosts a parent job and its protocol jobs.]
Server Grid Integration

[Diagram: clients log in to the primary Pipeline Pilot server, which executes jobs through a grid head node onto job execution servers; users and the file system are shared over NFS.]
Server Grid Integration Detail

[Diagram: the server head node runs Apache with authentication & authorization, the XMLDB service (user data and job data), the locator, logging, protocol web, runner, and file access services. Grid nodes 1..N each run protocol jobs, dispatched via a launch script and the queue manager.]

• Jobs (scisvr command) are launched through scripts (qsub).
• Each scisvr process messages its status back to the head node (stats file).
• Results are written to a shared job directory.
Performing Map-Reduce-Like Operations

• Map-Reduce uses distributed file systems
– Search across many commodity hardware nodes
– Data is redundant
– A data directory service finds the data nodes with the appropriate data sets
• Map finds hits on each node and returns key/value pairs
• Reduce filters unique keys to produce the final result set
• Demo
• What about Hadoop?

[Diagram: Map Set 1, Map Set 2, and Map Set 3 feed into a Reduced Set.]
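The map and reduce steps above can be sketched generically in Python. This is a toy illustration of the pattern, not an AEP component: the "nodes" are in-process lists, and the record fields (`compound`, `activity`) and the 0.5 threshold are invented for the example.

```python
from collections import defaultdict

def map_node(records):
    """Map step: runs on each data node, emits (key, value) hits."""
    return [(r["compound"], r["activity"])
            for r in records if r["activity"] > 0.5]

def reduce_results(node_outputs):
    """Reduce step: merge per-node hits, one entry per unique key."""
    merged = defaultdict(list)
    for pairs in node_outputs:
        for key, value in pairs:
            merged[key].append(value)
    # Final result set: keep the best value seen for each unique key.
    return {key: max(vals) for key, vals in merged.items()}

# Three "nodes", each holding a slice of the (redundant) data set.
node1 = [{"compound": "A", "activity": 0.9}, {"compound": "B", "activity": 0.2}]
node2 = [{"compound": "A", "activity": 0.7}, {"compound": "C", "activity": 0.8}]
node3 = [{"compound": "C", "activity": 0.6}]

result = reduce_results([map_node(n) for n in (node1, node2, node3)])
print(result)  # {'A': 0.9, 'C': 0.8}
```

In a real distributed setup each `map_node` call would run on the server holding that data slice, and only the small key/value lists would travel back for the reduce step.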
NoSQL (NRDS) Data Store Options

| Cache Type | Available Technologies | Pros | Cons |
|---|---|---|---|
| Wide-column or key-value NRDS | Cassandra | Fast and scalable; JDBC driver; SQL-like query language | Wide-column format is difficult to use with hierarchical data; must pre-define some schema |
| In-memory key-value stores | Memcached (native), Ehcache (Java), Redis | Simple & generic; fast | No dataset grouping; no filter; volatile; limited authorization capability |
| Document-centric NRDS | MongoDB, CouchDB | Fast; schema-less data model; stores hierarchical data by nature; filters and sorts; built-in map/reduce capability | Must implement own data authorization model |
| Data record caches | Pipeline Pilot Caches | Fast; flexible; stores hierarchical data; filters and sorts | No admin; no authorization without impersonation; have to retrieve all rows in PP; no locking/concurrency protection |
| RDBMS | JDBC, ODBC | Ubiquitous; standards-based; well trodden; usually the right decision | Generic schemas are complex and possibly slow; requires more coding; limited scalability options |
MongoDB Structure

[Diagram: a MongoDB instance contains databases; Database 1 holds Collection 1 and Collection 2, Database 2 holds Collection 1; each collection stores BSON documents.]
Example Schemas in MongoDB

[Diagram: a MongoDB instance holds n lists; each list carries ACLs, record-set metadata, the record set itself, and cached forms.]
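A document in the layout sketched above might look like the following. Plain Python dicts stand in for BSON documents here, and every field name and value is illustrative, invented for this example; this is not the actual product schema.

```python
import json

# Hypothetical record-set document; all field names are illustrative.
record_set_doc = {
    "_id": "rs-0001",
    "acls": [  # per-list access control entries
        {"principal": "jsmith", "rights": ["read", "write"]},
    ],
    "metadata": {  # record-set metadata
        "name": "NGS run 42",
        "record_count": 2,
    },
    "records": [  # the record set itself: hierarchical by nature
        {"id": 1, "sequence": "ACGT", "quality": [30, 31, 29, 33]},
        {"id": 2, "sequence": "TTGA", "quality": [28, 35, 30, 31]},
    ],
    "cached_forms": {"html_table": "<table>...</table>"},
}

# BSON is a binary superset of JSON, so a document like this
# round-trips cleanly through JSON serialization.
roundtrip = json.loads(json.dumps(record_set_doc))
print(roundtrip["metadata"]["record_count"])  # 2
```

Because the whole hierarchy lives in one document, a single lookup retrieves the ACLs, metadata, records, and cached forms together, which is the "stores hierarchical data by nature" advantage listed in the table above.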
Big Data and Algorithms
• Clustering
• Learning
• Statistics
• Reporting
Summary

• What we learned
– Parallel Processing & Grid Support for Big Compute & Big Data
– Performing Map-Reduce Operations in AEP
– Integrating NoSQL Data Sources in AEP
• Recommended Sessions
– (ATS2-07) Solving Large Computing Challenges with Pipeline Pilot
For more information on the Accelrys Tech Summits and other IT & Developer information, please visit: https://community.accelrys.com/groups/it-dev