

BIG DATA APPLICATIONS & ANALYTICS

LOOKING AT INDIVIDUAL HPCABDS SOFTWARE LAYERS

1/26/2015

Geoffrey Fox, January 26, 2015

BigDat 2015: International Winter School on Big Data

Tarragona, Spain, January 26-30, 2015

http://www.infomall.org

School of Informatics and Computing

Digital Science Center

Indiana University Bloomington



Using the HPC-ABDS Software Stack

CLOUD COMPUTING SOFTWARE


There are a lot of Big Data and HPC software systems.

Challenge! Manage an environment offering these different components.


Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies, January 14 2015

Cross-Cutting Functions

1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, OpenStack Keystone, LDAP, Sentry, Sqrrl
4) Monitoring: Ambari, Ganglia, Nagios, Inca

Layers

17) Workflow-Orchestration: Oozie, ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, NiFi (NSA)
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, mlpy, scikit-learn, CompLearn, Caffe, R, Bioconductor, ImageJ, pbdR, Scalapack, PetSc, Azure Machine Learning, Google Prediction API, Google Translation API, Torch, Theano, H2O, Google Fusion Tables, Oracle PGX, GraphLab, GraphX, CINET, NWB, Elasticsearch, IBM System G, IBM Watson, GraphBuilder(Intel), TinkerPop
15A) High level Programming: Kite, Hive, HCatalog, Databee, Tajo, Pig, Phoenix, Shark, MRQL, Impala, Presto, Sawzall, Drill, Google BigQuery (Dremel), Google Cloud DataFlow, Summingbird, SAP HANA, IBM META, HadoopDB, PolyBase
15B) Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, AWS Elastic Beanstalk, IBM BlueMix, Ninefold, Aerobatic, Azure, Jelastic, Cloud Foundry, CloudBees, Engine Yard, CloudControl, appfog, dotCloud, Pivotal, OSGi, HUBzero, OODT
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus
14B) Streams: Storm, S4, Samza, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Scribe/ODS, Azure Stream Analytics
13) Inter process communication Collectives, point-to-point, publish-subscribe: Harp, MPI, Netty, ZeroMQ, ActiveMQ, RabbitMQ, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Azure Event Hubs, Amazon Lambda. Public Cloud: Amazon SNS, Google Pub Sub, Azure Queues
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, SciDB, Apache Derby, Google Cloud SQL, Azure SQL, Amazon RDS, rasdaman, BlinkDB, N1QL, Galera Cluster, Google F1, Amazon Redshift, IBM dashDB
11B) NoSQL: HBase, Accumulo, Cassandra, Solandra, MongoDB, CouchDB, Lucene, Solr, Berkeley DB, Riak, Voldemort, Neo4J, Yarcdata, Jena, Sesame, AllegroGraph, RYA, Espresso, Sqrrl, Facebook Tao, Google Megastore, Google Spanner, Titan:db. Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Google Omega, Facebook Corona
8) File systems: HDFS, Swift, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS, Haystack, f4. Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Whirr, JClouds, OCCI, CDMI, Libcloud, TOSCA, Libvirt
6) DevOps: Docker, Puppet, Chef, Ansible, Boto, Cobbler, Xcat, Razor, CloudMesh, Heat, Juju, Foreman, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, VMware ESXi, vSphere, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, VMware vCloud, Amazon, Azure, Google and other public Clouds. Networking: Google Cloud DNS, Amazon Route 53

21 layers, 289 software packages



USING HPC-ABDS LAYERS I

1) Message Protocols

This layer is unlikely to be directly visible in many applications, as it is used in the “underlying system”. Thrift and Protobuf have similar functionality and are used to build messaging protocols between the components (services) of a system.
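As a minimal sketch of what a message protocol does, the Python example below packs a record into a compact binary form against a fixed schema and unpacks it on the receiving side. It uses only the standard `struct` module, not the Thrift or Protobuf APIs; the `SensorReading` message, its fields, and the wire layout are hypothetical and chosen purely for illustration.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout (an assumption, not the Thrift/Protobuf format):
# network byte order, unsigned 32-bit sensor id, 64-bit float value,
# unsigned 16-bit length prefix, followed by a UTF-8 encoded unit string.
_HEADER = struct.Struct("!IdH")


@dataclass
class SensorReading:
    sensor_id: int
    value: float
    unit: str


def encode(msg: SensorReading) -> bytes:
    """Serialize a message to bytes following the fixed schema."""
    unit_bytes = msg.unit.encode("utf-8")
    return _HEADER.pack(msg.sensor_id, msg.value, len(unit_bytes)) + unit_bytes


def decode(data: bytes) -> SensorReading:
    """Rebuild the message from bytes produced by encode()."""
    sensor_id, value, unit_len = _HEADER.unpack_from(data, 0)
    start = _HEADER.size
    unit = data[start:start + unit_len].decode("utf-8")
    return SensorReading(sensor_id, value, unit)


if __name__ == "__main__":
    wire = encode(SensorReading(7, 21.5, "C"))
    print(len(wire), decode(wire))
```

Both sides must agree on the schema ahead of time; tools such as Thrift and Protobuf generate this kind of encode/decode code from a shared interface definition instead of having it written by hand.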

2) Distributed Coordination

Zookeeper is likely to be used in many applications, as it is the way one achieves consistency in distributed systems, especially in overall control logic and metadata. For example, Apache Storm uses it to coordinate distributed streaming data input, with multiple servers ingesting data from multiple sensors. JGroups is less commonly used and is very different: it builds secure multicast messaging over a variety of transport mechanisms.

3) Security & Privacy I

Security & Privacy is of course a huge area, present implicitly or explicitly in all applications. It covers authentication and authorization of users and the security of running systems. On the Internet there are many authentication systems, with sites often allowing you to use Facebook, Microsoft, Google, etc. credentials. InCommon, operated by Internet2, federates research and higher education institutions in the United States with identity management and related services.


USING HPC-ABDS LAYERS II

3) Security & Privacy II

LDAP is a simple (key-value) database forming a set of distributed directories that record properties of users and resources according to the X.500 standard. It allows secure management of systems. OpenStack Keystone is a role-based authorization and authentication environment for use in OpenStack private clouds.

4) Monitoring

Ambari is aimed at installing and monitoring Hadoop systems. Nagios and Ganglia are similar system monitors, able to gather metrics and produce alerts. Inca is a higher-level system allowing user reporting of the performance of any subsystem. Essentially all systems use monitoring, but most users do not add custom reporting.

5) IaaS Management from HPC to hypervisors

These technologies underlie all applications. The classic technology OpenStack manages virtual machines and associated capabilities such as storage and networking. The commercial clouds have their own solutions, and it is possible to move machine images between these different environments. As a special case there is “bare-metal”, i.e. the null hypervisor. The DevOps technology Docker is playing an increasing role as a Linux container.


USING HPC-ABDS LAYERS III

6) DevOps

This describes technologies and approaches that automate the deployment and installation of software systems and underlie “software-defined systems”. At IU, we integrate tools together in Cloudmesh: Libcloud, Cobbler, Chef, Docker, Slurm, Ansible, Puppet, Celery. We saw Docker earlier, in layer 5 on the previous slide.

7) Interoperability

This covers both standards and interoperability libraries for services (Whirr), compute (OCCI), virtualization and storage (CDMI).

8) File systems

One will use files in most applications, but the details may not be visible to the user. You may instead interact with data at the level of a data management system or an object store (OpenStack Swift or Amazon S3). Most science applications are organized around files; commercial systems work at a higher level.

9) Cluster Resource Management

You will certainly need cluster management in your application, although often this is provided by the system and not explicit to the user. Yarn from Hadoop is gaining in popularity, while Slurm is a basic HPC system, as are Moab, SGE, and OpenPBS; Condor is also well known for scheduling Grid applications. Mesos is similar to Yarn and is also becoming popular. Many systems are in fact collections of clusters, as in data centers or grids. These require management and scheduling across many clusters; the latter is termed meta-scheduling.

USING HPC-ABDS LAYERS IV

10) Data Transport

Globus Online (GridFTP) is the dominant system in the HPC community, but this area is often not highlighted, because the application often starts only after the data has made its way to the disk of the system to be used. Simple HTTP protocols are used for small data transfers, while the largest ones use the “Fedex/UPS” solution of transporting disks between sites.

11) A) File management, B) NoSQL, C) SQL

This is a critical area for nearly all applications, as it covers file, object, NoSQL and SQL data management. The many entries in this area testify to the variety of problems (graphs, tables, documents, objects) and the importance of efficient solutions. Just a little while ago, this area was dominated by SQL databases and file managers.

12) In-memory databases & caches / Object-relational mapping / Extraction Tools

This is another important area, addressing two points: first, conversion of data between formats, and second, caching to put as much processing as possible in memory. The latter is an important optimization, with Gartner highlighting this area in several recent hype charts under In-Memory DBMS and In-Memory Analytics.
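The caching idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the API of Memcached, Redis, or any other product in this layer: `read_from_store` is a hypothetical stand-in for a slow disk or database lookup, and the cache is the standard library's in-process `functools.lru_cache`.

```python
from functools import lru_cache

# Counts how often the (hypothetical) slow backing store is actually hit.
backend_reads = 0


def read_from_store(key: str) -> str:
    """Hypothetical stand-in for a slow disk or database lookup."""
    global backend_reads
    backend_reads += 1
    return key.upper()


@lru_cache(maxsize=1024)
def cached_read(key: str) -> str:
    """Serve repeated lookups from memory; only misses reach the store."""
    return read_from_store(key)


if __name__ == "__main__":
    for key in ["a", "b", "a", "a", "b"]:
        cached_read(key)
    print(backend_reads)              # reads that reached the backing store
    print(cached_read.cache_info())   # hit/miss statistics kept by lru_cache
```

Distributed caches apply the same pattern across machines, which adds the questions of eviction, invalidation, and consistency that a single-process cache mostly avoids.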


USING HPC-ABDS LAYERS V

13) Inter process communication: Collectives, point-to-point, publish-subscribe, MPI

This describes the different communication models used by the systems in layers 13 and 14. Results may be very sensitive to the choices made here, as there are big differences between disk-based and point-to-point (no disk) communication, as in Hadoop v. Harp (MPI), and in the latencies exhibited by different publish-subscribe systems. I always recommend Pub-Sub systems like ActiveMQ or RabbitMQ for messaging.
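To make the publish-subscribe model concrete, here is a toy in-process broker in Python. It is a sketch only: the `Broker` class and its methods are hypothetical names, not the ActiveMQ or RabbitMQ client API, and real brokers add networking, queuing, and persistence, which is where the latency differences mentioned above come from.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Toy in-process broker that routes messages by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a callback to run for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


if __name__ == "__main__":
    broker = Broker()
    received: List[str] = []
    broker.subscribe("sensors", received.append)
    broker.subscribe("sensors", lambda m: print("got", m))
    broker.publish("sensors", "temp=21.5")
    broker.publish("unused-topic", "dropped")  # no subscribers, nothing delivered
    print(received)
```

The publisher never names its receivers; it only names a topic. That decoupling is the contrast with point-to-point messaging, where the sender addresses a specific peer.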

14) A) Basic Programming model and runtime, SPMD, MapReduce, MPI; B) Streaming

A very important layer defining the cloud (HPC-ABDS) programming model. It includes Hadoop and the related tools Spark, Twister, Stratosphere, Hama (iterative MapReduce); Giraph, Pregel, Pegasus (graphs); Storm, S4, Samza (streaming); Tez (workflow); and Yarn integration. Most applications use something here!
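The MapReduce model at the heart of this layer can be sketched as a single-process word count in plain Python. This is not the Hadoop or Spark API; the function names are hypothetical, and a real runtime would distribute the map and reduce calls across machines and perform the shuffle over disk or the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big compute"]))
```

Iterative MapReduce systems repeat this cycle many times over the same data, which is why keeping intermediate results in memory rather than on disk matters so much for them.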

15) A) High level Programming

Components at this level are not required but are very interesting, and we can expect great progress both in improving them and in using them. Pig and Sawzall offer data-parallel programming models; Hive, HCatalog, Shark, MRQL, Impala, and Drill support SQL interfaces to MapReduce, HDFS and object stores.
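The appeal of a SQL interface is that a declarative query replaces hand-written grouping and aggregation code. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in engine; it is an assumption made for the sake of a runnable example and does not use Hive, Impala, or any of the systems named above. The `words` table and the `word_counts` function are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable


def word_counts(words: Iterable[str]) -> Dict[str, int]:
    """Count occurrences of each word with a single GROUP BY query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany(
            "INSERT INTO words (word) VALUES (?)", [(w,) for w in words]
        )
        rows = conn.execute(
            "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print(word_counts(["big", "data", "big"]))
```

In the systems of this layer, the engine rather than the user decides how such a query is turned into parallel jobs over the underlying storage.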


USING HPC-ABDS LAYERS VI

15) B) Frameworks

This is exemplified by Google App Engine and Azure (when it was called PaaS), but now there are many “integrated environments”.

16) Application and Analytics

This is the “business logic” of the application and where you find machine learning algorithms like clustering. Mahout, MLlib, and MLbase are in Apache for Hadoop and Spark processing; R is a central library from the statistics community. There are many other important libraries; here we mention those in deep learning (CompLearn, Caffe), image processing (ImageJ), bioinformatics (Bioconductor), and HPC (Scalapack and PetSc). You will nearly always need these or other software at this level.

17) Workflow-Orchestration

This layer implements orchestration and integration of the different parts of a job. These can be specified by a directed data-flow graph and often take a simple pipeline form, illustrated in “access pattern” 10 discussed later. This field was advanced significantly by the Grid community, and the systems are quite similar in functionality, although their maturity and ease of use can be quite different. The interface is either visual (link programs as bubbles with data flow) or an XML or program (Python) script.
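A directed data-flow graph of this kind can be sketched as a Python script using the standard library's `graphlib` module (Python 3.9 or later). The step names, the `run_workflow` helper, and the task functions are hypothetical; none of the workflow systems listed for this layer is being reproduced here.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set

# Hypothetical pipeline: each step maps to the set of steps it depends on.
GRAPH: Dict[str, Set[str]] = {
    "fetch": set(),
    "clean": {"fetch"},
    "analyze": {"clean"},
}

# Hypothetical tasks; each receives the outputs of its dependencies.
TASKS: Dict[str, Callable[..., Any]] = {
    "fetch": lambda: [3, 1, 2, None],
    "clean": lambda rows: [r for r in rows if r is not None],
    "analyze": lambda rows: sum(rows) / len(rows),
}


def run_workflow(
    graph: Dict[str, Set[str]], tasks: Dict[str, Callable[..., Any]]
) -> Dict[str, Any]:
    """Run every step after its dependencies, feeding their outputs forward."""
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = [results[dep] for dep in sorted(graph.get(name, ()))]
        results[name] = tasks[name](*inputs)
    return results


if __name__ == "__main__":
    print(run_workflow(GRAPH, TASKS)["analyze"])
```

The topological ordering guarantees that a step never runs before the steps it consumes data from. Production workflow systems add to this the scheduling of independent steps in parallel, retries, and provenance tracking.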
