(ATS3-PLAT06) Handling “Big Data” with Pipeline Pilot (MapReduce/NoSQL)

(ATS3-PLAT06) Handling “Big Data” with Accelrys Enterprise Platform (MapReduce/NoSQL) Jason Benedict Sr. Architect, Platform R&D [email protected]

Upload: accelrys

Post on 11-Jan-2015

DESCRIPTION

Pipeline Pilot has wrangled large volumes of scientific data for many years. The emergence of "Big Data" challenges in other fields has brought many new tools and techniques to the table. This session will demonstrate various approaches to handling big data in Pipeline Pilot and show how Pipeline Pilot can integrate with "NoSQL" data stores such as Apache Cassandra and MongoDB. The second half of this session will focus on audience participation and open discussion around big data tools and techniques to help inform our community and our future product road map.

TRANSCRIPT

Page 1: (ATS3-PLAT06) Handling “Big Data” with Pipeline Pilot (MapReduce/NoSQL)

(ATS3-PLAT06) Handling “Big Data” with Accelrys Enterprise Platform (MapReduce/NoSQL)

Jason Benedict
Sr. Architect, Platform R&D
[email protected]

Page 2

The information on the roadmap and future software development efforts is intended to outline general product direction and should not be relied on in making a purchasing decision.

Page 3

Agenda

• Intended Audience: Pipeline Pilot Protocol Authors and Software Developers
• Domains of Applicability
• Infrastructure
  – Parallel Processing & Grid Support
  – MapReduce-like Operations with Parallel Processing
  – Hadoop?
  – NoSQL (Reader, Writer, ETL, etc.)
    • MongoDB (document-based)
    • Cassandra (big table)
    • Paradigm4 (array-based)
    • Survey of other systems
• Algorithms

Page 4

Big Data & Big Compute in the Scientific Domains

• Next Generation Sequencing
  – Mapping
  – SNP Calling
• High Content Screening
  – Image Analysis
• Modeling & Simulations in Life Science
  – Exploring 3D conformation space with Pharmacophore and Virtual Docking

Page 5

HPC & Big Data within Pipeline Pilot

Cluster
• Built into Pipeline Pilot

Grid
• Leverages Existing Grid Engine
• Sun GridEngine
• PBS Pro
• LSF
• Custom Scripts

Page 6

Parallel Sub-protocols

• Batches incoming records into processing units and dispatches each batch to one of the servers
• Leverages computing power
  – Multiprocessors/Multicore
  – Clusters
  – Grids
  – Loosely Coupled Distributed Systems (peer to peer)
• Processing data on different platforms or different bit sizes
  – Example 1: the main PP server is Win64, and you want to integrate an application that runs only on Linux. Use a Parallel Sub-protocol to send this specific processing to a Linux PP server.
  – Example 2: the main PP server is Win64, and you have an application that runs only in 32-bit mode. Use parallel processing to send it to a Win32 PP server.
• Improving performance for operations that block on I/O
  – For instance, making remote SOAP calls that take several seconds to complete.
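
The batching-and-dispatch idea on this slide can be sketched outside Pipeline Pilot in plain Python. This is only an illustration under stated assumptions, not the Pipeline Pilot implementation: the names batches, slow_remote_call, process_batch and run_parallel are hypothetical, worker threads stand in for remote servers, and a short sleep stands in for a blocking remote SOAP call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
import time


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def slow_remote_call(record):
    """Hypothetical stand-in for a blocking call such as a SOAP request."""
    time.sleep(0.01)  # simulate network latency
    return {**record, "result": record["id"] * 2}


def process_batch(batch):
    """Process one batch; this plays the role of the sub-protocol body."""
    return [slow_remote_call(rec) for rec in batch]


def run_parallel(records, batch_size=10, workers=4):
    """Dispatch each batch to a worker and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_batch, batches(records, batch_size))
        return [rec for batch in results for rec in batch]


if __name__ == "__main__":
    out = run_parallel([{"id": i} for i in range(40)])
    print(len(out), "records processed")
```

Because the simulated call spends its time waiting rather than computing, several batches can be in flight at once, which is the I/O-blocking case the last bullet describes.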

Page 7

Parallel Sub-protocols

• When not to use it
  – Operations that have heavy file I/O. For instance: do NOT read a large image, pass it into a parallel subprotocol, process it, and pass it back out. Each data record is serialized into and out of the parallel subprotocol. Instead: pass a REFERENCE to the image as a property on the data record.
  – Trying to speed up file reading.
  – Operations that require the FULL data set (e.g. Merge Data, Sort Data, etc.)
• For Multiprocessor, Grid, and Cluster deployments, use “Localhost” as the server
• LOTS of information is contained within the User Help Docs, including pitfalls and suggestions. The topic is “Parallel Processing Subprotocols” under the “User Help” section.
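
The advice to pass a reference rather than the image itself can be made concrete with a small Python sketch. It is an illustration only: pickle stands in for whatever serialization the platform actually uses, the record layout and the names serialized_size and image_size_from_reference are hypothetical, and the approach assumes the worker can reach the same file system as the sender.

```python
import os
import pickle
import tempfile


def serialized_size(record):
    """Bytes needed to serialize one record (pickle is only a stand-in)."""
    return len(pickle.dumps(record))


def image_size_from_reference(record):
    """Worker side: open the referenced file and do the work locally."""
    with open(record["image_path"], "rb") as fh:
        return len(fh.read())


if __name__ == "__main__":
    pixels = bytes(5_000_000)  # stand-in for a large image
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "plate_001.img")
        with open(path, "wb") as fh:
            fh.write(pixels)
        by_value = {"id": 1, "image": pixels}        # payload travels with the record
        by_reference = {"id": 1, "image_path": path}  # only the path travels
        print(serialized_size(by_value), "bytes serialized by value")
        print(serialized_size(by_reference), "bytes serialized by reference")
        print(image_size_from_reference(by_reference), "bytes read by the worker")
```

The by-value record pays the full image size on the way in and again on the way out, while the by-reference record stays small no matter how large the image is.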

Page 8

Public / Private Clusters & IP-based Load Balancing

[Diagram. Labels: Users; Clients; Primary Pipeline Pilot Server; Secondary Pipeline Pilot Servers; Login; Execute; NFS]

Page 9

Server Clustering Detail

[Diagram. Head Node: Apache, A&A; XMLDB Service; Locator Service; Logging Service; Protocol Web Services. Runner Node 1 … Runner Node N: Apache, A&A; Runner Service; File Access. File System: XMLDB; User Data; Job Data. Jobs shown: Parent Job 1; Protocol Job 1; Protocol Job 2.]

Page 10

Server Grid Integration

[Diagram. Labels: Users; Clients; Primary Pipeline Pilot Server; Job Execution Servers; Grid Head Node; Login; Execute; NFS]

Page 11

Server Grid Integration Detail

[Diagram. Server Head Node w/ Apache Authentication & Authorization: XMLDB Service; Locator Service; Logging Service; Protocol Web Services; Runner Service; File Access Service. Launch script; Queue Manager. Node 1, Node 2 … Node N, each running Protocol Jobs 1 to 3. File System: XMLDB; User Data; Job Data.]

• Jobs (scisvr command) are launched through scripts (qsub).
• Each scisvr process messages its status back to the head node (stats file).
• Results are written to a shared job directory.

Page 12

Performing Map-Reduce Like Operations

• Map Reduce uses distributed file systems
  – Search across many commodity hardware nodes
  – Data is redundant
  – Have a data directory service to find data nodes with appropriate data sets
• Map finds hits on each node and returns key/value pairs
• Reduce filters unique keys to produce the final result set
• Demo
• What about Hadoop?

[Diagram: Map Set 1, Map Set 2, Map Set 3 → Reduced Set]
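
The map and reduce steps described on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not what the demo ran: each in-memory list stands in for the data held on one node, the record layout ("key", "value") and the function names are hypothetical, and "first occurrence wins" is just one possible way to filter down to unique keys.

```python
from concurrent.futures import ThreadPoolExecutor


def map_partition(partition, predicate):
    """Map step: scan one node's data and emit (key, value) pairs for hits."""
    return [(rec["key"], rec["value"]) for rec in partition if predicate(rec)]


def reduce_hits(mapped_sets):
    """Reduce step: keep one value per unique key across all mapped sets."""
    result = {}
    for mapped in mapped_sets:
        for key, value in mapped:
            result.setdefault(key, value)  # first occurrence wins
    return result


def map_reduce(partitions, predicate, workers=3):
    """Run the map step on every partition in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped_sets = list(
            pool.map(lambda part: map_partition(part, predicate), partitions)
        )
    return reduce_hits(mapped_sets)


if __name__ == "__main__":
    # Redundant data: keys "b" and "c" are each stored on two nodes.
    partitions = [
        [{"key": "a", "value": 5}, {"key": "b", "value": 12}],
        [{"key": "b", "value": 12}, {"key": "c", "value": 20}],
        [{"key": "c", "value": 20}, {"key": "d", "value": 3}],
    ]
    print(map_reduce(partitions, lambda rec: rec["value"] >= 10))
```

Because the data is redundant across nodes, the same hit can come back from more than one map set; the reduce step is where those duplicates collapse into the final result set.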

Page 13

NoSQL (NRDS) Data Store Options

Cache Type: Wide-column or Key-value NRDS
Available Technologies: Cassandra
Pros:
• Fast and Scalable
• JDBC Driver
• SQL-like query language
Cons:
• Wide-column format is difficult to use w/ hierarchical data
• Must pre-define some schema

Cache Type: In-memory key-value stores
Available Technologies: Memcached (native), Ehcache (Java), Redis
Pros:
• Simple & Generic
• Fast
Cons:
• No dataset grouping
• No filter
• Volatile
• Limited authorization capability

Cache Type: Document-centric NRDS
Available Technologies: MongoDB, CouchDB
Pros:
• Fast
• Schema-less Data Model
• Stores hierarchical data by nature
• Filters and sorts
• Built-in Map / Reduce capability
Cons:
• Must implement own data authorization model

Cache Type: Data Record Caches
Available Technologies: Pipeline Pilot Caches
Pros:
• Fast
• Flexible
• Stores hierarchical data
• Filters and sorts
Cons:
• No admin
• No authorization w/out impersonation
• Have to retrieve all rows in PP
• No locking / concurrency protection

Cache Type: RDBMS
Available Technologies: JDBC, ODBC
Pros:
• Ubiquitous
• Standards-based
• Well trodden
• Usually the right decision
Cons:
• Generic schemas are complex and possibly slow
• Requires more coding
• Limited scalability options
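
To illustrate why hierarchical data is listed as awkward for a wide-column layout but natural for a document store, here is a small Python sketch. It is an assumption-laden illustration, not Cassandra's or MongoDB's API: the flatten helper, the dotted column-naming convention, and the sample record are all hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts and lists into one level of dotted column names."""
    columns = {}
    items = record.items() if isinstance(record, dict) else enumerate(record)
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            columns.update(flatten(value, name))  # recurse into nested data
        else:
            columns[name] = value
    return columns


if __name__ == "__main__":
    # A document store could keep this nested record as a single document.
    record = {
        "id": "cmpd-1",
        "assay": {"name": "IC50", "readings": [0.5, 0.7]},
    }
    # A wide-column layout needs every leaf mapped to its own column name.
    for column, value in flatten(record).items():
        print(column, "=", value)
```

The set of columns produced depends on the shape of each record, which is one way to see why some schema has to be agreed on up front in the wide-column case.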

Page 14

MongoDB Structure

[Diagram. MongoDB: Database 1 with Collection 1 (BSON Document) and Collection 2 (BSON Document); Database 2 with Collection 1 (BSON Document).]

Example Schemas in MongoDB

[Diagram. MongoDB Instance: List (n); ACLs; Record Set Meta-data; Record Set; Cached Forms.]

Page 15

Big Data and Algorithms

• Clustering
• Learning
• Statistics
• Reporting

Page 16

Summary

• What we learned
  – Parallel Processing & Grid Support for Big Compute & Big Data
  – Performing Map-Reduce Operations in AEP
  – Integrating NoSQL Data Sources in AEP
• Recommended Sessions
  – (ATS2-07) Solving Large Computing Challenges with Pipeline Pilot

Page 17

For more information on the Accelrys Tech Summits and other IT & Developer information, please visit: https://community.accelrys.com/groups/it-dev