(ATS3-PLAT06) Handling “Big Data” with Pipeline Pilot (MapReduce/NoSQL)
DESCRIPTION
Pipeline Pilot has wrangled large volumes of scientific data for many years. The emergence of "Big Data" challenges in other fields has brought many new tools and techniques to the table. This session will demonstrate various approaches to handling big data in Pipeline Pilot and show how Pipeline Pilot can integrate with "NoSQL" data stores such as Apache Cassandra and MongoDB. The second half of this session will focus on audience participation and open discussion around big data tools and techniques, to help inform our community and our future product road map.
TRANSCRIPT
(ATS3-PLAT06) Handling “Big Data” with Accelrys Enterprise Platform (MapReduce/NoSQL)
Jason Benedict
Sr. Architect, Platform R&D
[email protected]
The information on the roadmap and future software development efforts are intended to outline general product direction and should not be relied on in making a purchasing decision.
Agenda
• Intended Audience: Pipeline Pilot Protocol Authors and Software Developers
• Domains of Applicability
• Infrastructure
– Parallel Processing & Grid Support
– MapReduce-like Operations with Parallel Processing
– Hadoop?
– NoSQL (Reader, Writer, ETL, etc.)
• MongoDB (document-based)
• Cassandra (big table)
• Paradigm 4 (array-based)
• Survey of other systems
• Algorithms
Big Data & Big Compute in the Scientific Domains
• Next Generation Sequencing
– Mapping
– SNP Calling
• High Content Screening
– Image Analysis
• Modeling & Simulations in Life Science
– Exploring 3D conformation space with Pharmacophore and Virtual Docking
HPC & Big Data within Pipeline Pilot
Cluster
• Built into Pipeline Pilot
Grid
• Leverages an existing grid engine
– Sun Grid Engine
– PBS Pro
– LSF
– Custom Scripts
• Batches incoming records into processing units and dispatches each batch to one of the servers
• Leverages computing power
– Multiprocessors/Multicore
– Clusters
– Grids
– Loosely Coupled Distributed Systems (peer-to-peer)
• Processing data on different platforms or different bit sizes
– For instance: the main PP server is Win64, but an application you want to integrate only runs on Linux. Use a Parallel Sub-protocol to send that specific processing to a Linux PP server.
– Example 2: the main PP server is Win64, but an application only requires 32-bit. Use parallel processing to send it to a Win32 PP server.
• Improving performance for operations that block on I/O:
– For instance, making remote SOAP calls that take several seconds to complete.
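The batch-and-dispatch model described above can be sketched in plain Python. This is a conceptual illustration only, not Pipeline Pilot's implementation: `fetch_remote` is a hypothetical stand-in for a slow remote call (such as a SOAP request), and the batch size and worker count are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_remote(record):
    """Hypothetical stand-in for a slow remote call (e.g. a SOAP request)."""
    time.sleep(0.01)  # simulated network latency
    return {**record, "result": record["id"] * 2}

def process_batch(batch):
    """Process one batch of records on a worker."""
    return [fetch_remote(r) for r in batch]

def run_parallel(records, batch_size=4, workers=3):
    # Batch incoming records into processing units...
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    # ...and dispatch each batch to one of the workers.
    # Threads help here because the work blocks on I/O, not CPU.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = []
        for out in pool.map(process_batch, batches):  # order preserved
            results.extend(out)
    return results

records = [{"id": i} for i in range(10)]
print(len(run_parallel(records)))  # 10
```

Because `pool.map` preserves batch order, the output stream keeps the input record order, mirroring how batched records are reassembled after parallel dispatch.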
Parallel Sub-protocols
• When not to use it
– Operations with heavy file I/O. For instance: do NOT read a large image, pass it into a parallel subprotocol, process it, and pass it back out. Each data record is serialized into and out of the parallel subprotocol. Instead, pass a REFERENCE to the image as a property on the data record.
– Trying to speed up file reading.
– Operations that require the FULL data set (e.g. Merge Data, Sort Data, etc.)
• For multiprocessor, grid, and cluster deployments, use “Localhost” as the server.
• LOTS of information is contained in the User Help docs, including pitfalls and suggestions. The topic is “Parallel Processing Subprotocols” under the “User Help” section.
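The pass-a-reference advice above can be made concrete with a small sketch. This is a generic illustration, not Pipeline Pilot code: the file name and record fields are hypothetical, and a temporary directory stands in for shared storage such as NFS.

```python
import os
import tempfile

# Write the large payload once to "shared" storage (a temp dir here).
tmpdir = tempfile.mkdtemp()
image_path = os.path.join(tmpdir, "scan_001.dat")
with open(image_path, "wb") as f:
    f.write(b"\x00" * 1_000_000)  # stand-in for a large image

# Anti-pattern: the record carries the whole payload, so the payload
# is serialized into and out of every parallel subprotocol boundary.
with open(image_path, "rb") as f:
    heavy_record = {"id": 1, "image_bytes": f.read()}

# Preferred: the record carries only a reference (a path property);
# each worker reads the file from shared storage when it needs it.
light_record = {"id": 1, "image_path": image_path}

def process(record):
    """Worker-side processing: dereference the path, then work locally."""
    with open(record["image_path"], "rb") as f:
        data = f.read()
    return len(data)

print(process(light_record))  # 1000000
```

Only the small `light_record` dictionary crosses the serialization boundary; the megabyte payload moves over the shared file system once instead of being copied with every record.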
Public / Private Clusters & IP-based Load Balancing

[Diagram: clients log in to the primary Pipeline Pilot server, which executes jobs on secondary Pipeline Pilot servers; users and the file system are shared over NFS.]
Server Clustering Detail

[Diagram: the head node runs Apache with authentication & authorization (A&A), the XMLDB service (user data and job data), the locator service, the logging service, and protocol web services. Runner nodes 1..N each run Apache (A&A), a runner service, and file access; the runner service hosts a parent job and its protocol jobs.]
Server Grid Integration

[Diagram: clients log in to the primary Pipeline Pilot server, which executes jobs through a grid head node onto job execution servers; users and the file system are shared over NFS.]
Server Grid Integration Detail

[Diagram: the server head node runs Apache with authentication & authorization, the XMLDB service (user data and job data), the locator, logging, protocol web, runner, and file access services. Grid nodes 1..N each run protocol jobs, dispatched via a launch script and the queue manager.]

• Jobs (scisvr command) are launched through scripts (qsub).
• Each scisvr process messages its status back to the head node (stats file).
• Results are written to a shared job directory.
Performing Map-Reduce-Like Operations

• Map-Reduce uses distributed file systems
– Search across many commodity hardware nodes
– Data is redundant
– A data directory service finds the data nodes with the appropriate data sets
• Map finds hits on each node and returns key/value pairs
• Reduce filters unique keys to produce the final result set
• Demo
• What about Hadoop?

[Diagram: Map Set 1, Map Set 2, and Map Set 3 feed into a Reduced Set.]
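The map and reduce steps above can be sketched generically in Python. This is a toy illustration of the pattern, not an AEP component: the "nodes" are in-process lists, and the record fields (`compound`, `activity`) and the 0.5 threshold are invented for the example.

```python
from collections import defaultdict

def map_node(records):
    """Map step: runs on each data node, emits (key, value) hits."""
    return [(r["compound"], r["activity"])
            for r in records if r["activity"] > 0.5]

def reduce_results(node_outputs):
    """Reduce step: merge per-node hits, one entry per unique key."""
    merged = defaultdict(list)
    for pairs in node_outputs:
        for key, value in pairs:
            merged[key].append(value)
    # Final result set: keep the best value seen for each unique key.
    return {key: max(vals) for key, vals in merged.items()}

# Three "nodes", each holding a slice of the (redundant) data set.
node1 = [{"compound": "A", "activity": 0.9}, {"compound": "B", "activity": 0.2}]
node2 = [{"compound": "A", "activity": 0.7}, {"compound": "C", "activity": 0.8}]
node3 = [{"compound": "C", "activity": 0.6}]

result = reduce_results([map_node(n) for n in (node1, node2, node3)])
print(result)  # {'A': 0.9, 'C': 0.8}
```

In a real distributed setup each `map_node` call would run on the server holding that data slice, and only the small key/value lists would travel back for the reduce step.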
NoSQL (NRDS) Data Store Options

| Cache Type | Available Technologies | Pros | Cons |
|---|---|---|---|
| Wide-column or key-value NRDS | Cassandra | Fast and scalable; JDBC driver; SQL-like query language | Wide-column format is difficult to use with hierarchical data; must pre-define some schema |
| In-memory key-value stores | Memcached (native), Ehcache (Java), Redis | Simple & generic; fast | No dataset grouping; no filter; volatile; limited authorization capability |
| Document-centric NRDS | MongoDB, CouchDB | Fast; schema-less data model; stores hierarchical data by nature; filters and sorts; built-in map/reduce capability | Must implement own data authorization model |
| Data record caches | Pipeline Pilot Caches | Fast; flexible; stores hierarchical data; filters and sorts | No admin; no authorization without impersonation; have to retrieve all rows in PP; no locking/concurrency protection |
| RDBMS | JDBC, ODBC | Ubiquitous; standards-based; well trodden; usually the right decision | Generic schemas are complex and possibly slow; requires more coding; limited scalability options |
MongoDB Structure

[Diagram: a MongoDB instance contains databases; Database 1 holds Collection 1 and Collection 2, Database 2 holds Collection 1; each collection stores BSON documents.]
Example Schemas in MongoDB

[Diagram: a MongoDB instance holds n lists; each list carries ACLs, record-set metadata, the record set itself, and cached forms.]
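A document in the layout sketched above might look like the following. Plain Python dicts stand in for BSON documents here, and every field name and value is illustrative, invented for this example; this is not the actual product schema.

```python
import json

# Hypothetical record-set document; all field names are illustrative.
record_set_doc = {
    "_id": "rs-0001",
    "acls": [  # per-list access control entries
        {"principal": "jsmith", "rights": ["read", "write"]},
    ],
    "metadata": {  # record-set metadata
        "name": "NGS run 42",
        "record_count": 2,
    },
    "records": [  # the record set itself: hierarchical by nature
        {"id": 1, "sequence": "ACGT", "quality": [30, 31, 29, 33]},
        {"id": 2, "sequence": "TTGA", "quality": [28, 35, 30, 31]},
    ],
    "cached_forms": {"html_table": "<table>...</table>"},
}

# BSON is a binary superset of JSON, so a document like this
# round-trips cleanly through JSON serialization.
roundtrip = json.loads(json.dumps(record_set_doc))
print(roundtrip["metadata"]["record_count"])  # 2
```

Because the whole hierarchy lives in one document, a single lookup retrieves the ACLs, metadata, records, and cached forms together, which is the "stores hierarchical data by nature" advantage listed in the table above.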
Big Data and Algorithms
• Clustering
• Learning
• Statistics
• Reporting
Summary

• What we learned
– Parallel Processing & Grid Support for Big Compute & Big Data
– Performing Map-Reduce Operations in AEP
– Integrating NoSQL Data Sources in AEP
• Recommended Sessions
– (ATS2-07) Solving Large Computing Challenges with Pipeline Pilot
For more information on the Accelrys Tech Summits and other IT & Developer information, please visit: https://community.accelrys.com/groups/it-dev