2014.07.11 biginsights data2014

InfoSphere BigInsights Big Data in practice Wilfried Hoge IT Architect Big Data [email protected] @wilfriedhoge 11.07.2014 Data2014

Upload: wilfried-hoge

Post on 15-Jan-2015


DESCRIPTION

Overview of IBM's Hadoop distribution InfoSphere BigInsights from a session at Data2014 conference (http://leolo.com/data2014/)

TRANSCRIPT

Page 1: 2014.07.11 biginsights data2014

InfoSphere BigInsights Big Data in practice

Wilfried Hoge IT Architect Big Data [email protected] @wilfriedhoge 11.07.2014 Data2014

Page 2: 2014.07.11 biginsights data2014

© 2014 International Business Machines Corporation 2

Hadoop Observations

Technology
•  Rapid innovation
•  Two sources of innovation
   –  Open source community
   –  Integration of existing technologies
•  Tools and application vendors selecting partners and integrating

Customers
•  High degree of interest
•  Many experimental workstreams
•  ROI establishment varies by use case
•  Many customers want to offload data from EDW

Vendors
•  Multiple business models
•  OSS support vendors have mindshare lead
•  OSS support vendors business model viability unclear
•  SW Portfolio vendors integrating/adding

Page 3: 2014.07.11 biginsights data2014


InfoSphere BigInsights provides Enterprise Grade Hadoop analytics

•  Manages a wide variety and huge volume of data

•  Augments open source Hadoop with enterprise capabilities

   –  Visualization & Exploration
   –  Development tools
   –  Advanced Engines
   –  Connectors
   –  Workload Optimization
   –  Enterprise integration
   –  Analytic Accelerators
   –  Application and industry accelerators
   –  Administration & Security

Diagram labels (BIG DATA PLATFORM): Accelerators; Information Integration & Governance; Data Warehouse; Stream Computing; Hadoop System; Discovery; Application Development; Systems Management. Data types: Data, Media, Content, Machine, Social.

Page 4: 2014.07.11 biginsights data2014


Key Differentiators for BigInsights

Enterprise Performance & Integration
•  Workload / performance optimization
•  GPFS
•  Security
•  Key integrations & connectors with enterprise ecosystem

Analytics
•  Text analytics
•  Social Data Analytics Accelerators
•  Machine Data Analytics Accelerators
•  Execute R in an integrated application

Usability & Productivity
•  Big SQL
•  BigSheets
•  Development Tools
•  Web Console

Page 5: 2014.07.11 biginsights data2014


Integrated Web Console

•  Manage BigInsights
   –  Inspect / monitor system health
   –  Add / drop nodes
   –  Start / stop services
   –  Run / monitor jobs (applications)
   –  Explore / modify file system
   –  Create custom dashboards
•  Launch applications
   –  Spreadsheet-like analysis tool
   –  Pre-built applications (IBM supplied or user developed)
•  Publish applications
•  Monitor cluster, applications, data
   –  Create / view event alerts

Page 6: 2014.07.11 biginsights data2014


Distributed filesystem GPFS FPO gives additional flexibility, security and high availability

•  Optional file system alternative to HDFS
•  More than 10 years of experience with HPC
•  Key features
   –  No single point of failure
   –  Built-in high availability
   –  POSIX compliance
      •  Standard applications cannot use HDFS but they can use GPFS-FPO
   –  Enhanced security
   –  Higher performance
      •  Allows concurrent read and write by multiple programs
   –  Recovery capabilities
      •  Journaling filesystem
   –  Support for storage pools
   –  Snapshot capability

Diagram labels (software stack, top to bottom): Applications; High level languages (SQL, JAQL, PIG, …); Map/Reduce API; Hadoop DFS API; GPFS or HDFS (Distributed Filesystem).
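To make the POSIX compliance point concrete, here is a minimal sketch of the kind of ordinary file handling a standard application does: write, reopen and append, flush to disk, and rename. HDFS is normally reached through its own client API rather than through calls like these. The mount point is an assumption: a temporary directory stands in for wherever a GPFS-FPO file system would be mounted, and the function and file names are hypothetical.

```python
import os
import tempfile

def append_and_rename(mount_point: str) -> str:
    """Use only ordinary POSIX-style file calls under mount_point."""
    staging = os.path.join(mount_point, "events.tmp")
    final = os.path.join(mount_point, "events.log")

    # Plain buffered write, as any non-Hadoop program would do.
    with open(staging, "w", encoding="utf-8") as f:
        f.write("first record\n")

    # Reopen and append in place, then ask the OS to persist the data.
    with open(staging, "a", encoding="utf-8") as f:
        f.write("second record\n")
        f.flush()
        os.fsync(f.fileno())

    # Rename within one file system, a common POSIX idiom for publishing a file.
    os.replace(staging, final)
    return final

# A temporary directory stands in for the (hypothetical) GPFS mount point.
with tempfile.TemporaryDirectory() as mount:
    path = append_and_rename(mount)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
```

Nothing in the sketch is specific to GPFS; that is the point of the slide's claim, since an unmodified program only needs a path on a POSIX file system.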

Page 7: 2014.07.11 biginsights data2014


BigInsights has a simple but effective security system based on a gateway to Hadoop

•  All Hadoop servers are connected over a private network

•  Unrestricted communication between cluster servers on the private network

•  BigInsights Web Console acts as a gateway into the cluster

•  Authentication through PAM or LDAP
•  Role based authorization
•  Authorization will be enforced at 3 levels:
   –  UI level
   –  Data level
   –  Map-Reduce level
•  Authorization also respected by services (e.g. SQL)
•  Kerberos support

Diagram labels: Users; External Sources; Gateway / Web Console; Authentication Authority; Services; Data Nodes; Infrastr. Nodes; Distributed Filesystem.
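The idea of role based authorization checked at three levels can be sketched as follows. This is an illustration only, not the BigInsights implementation: the role names, permission strings, and the `submit_job` function are all hypothetical, and a real deployment would take users and roles from PAM or LDAP rather than from a hard-coded table.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission table (assumption, not a product default).
ROLE_PERMISSIONS = {
    "admin": {"ui:console", "data:read", "data:write", "job:submit"},
    "analyst": {"ui:console", "data:read", "job:submit"},
    "viewer": {"ui:console", "data:read"},
}

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)

def is_allowed(user: User, permission: str) -> bool:
    """True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(r, set()) for r in user.roles)

def submit_job(user: User, input_path: str) -> str:
    """Check the three levels in turn: UI, data, then job submission."""
    for permission in ("ui:console", "data:read", "job:submit"):
        if not is_allowed(user, permission):
            raise PermissionError(f"{user.name} lacks {permission}")
    return f"job accepted for {user.name} on {input_path}"
```

With this table, a user holding only the `viewer` role passes the UI and data checks but is rejected at the job level, while an `analyst` passes all three.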

Page 8: 2014.07.11 biginsights data2014


BigSheets to analyze and visualize

•  Model “big data” collected from various sources in spreadsheet-like structures
•  Filter and enrich content with built-in functions
•  Combine data in different workbooks
•  Visualize results through spreadsheets, charts
•  Export data into common formats (if desired)

No programming knowledge needed!

Page 9: 2014.07.11 biginsights data2014


Centralized dashboard & data flows

•  A centralized dashboard to visualize analytic results:
   –  BigSheets collections
   –  Analytic application results
   –  Monitoring metrics
•  Ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
•  Visualize inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections, application of analytics to BigSheets columns, etc.

Page 10: 2014.07.11 biginsights data2014


Tools for Developers

1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster

Editors
•  A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface
•  A Pig/Jaql editor with content assist and syntax highlighting that enables users to create and execute new applications using Pig or Jaql in local or cluster mode from the Eclipse IDE

Application development & deployment
•  Enablement of BigSheets macro and BigSheets reader development
•  Text Analytics development, including support for modular rule sets
•  Publish new application: BigSheets Macro, BigSheets Reader, AQL module, Jaql module

Page 11: 2014.07.11 biginsights data2014


Running Applications on Big Data

•  Browse available applications
•  Deploy published applications (administrators only)
•  Launch (or schedule for launch) a deployed application
•  Monitor job (application) execution status

•  Predefined applications
   –  Import & Export Data
      •  Database & Files
      •  Web and Social
   –  Analyze and Query
      •  Predictive Analytics
      •  Text Analytics
      •  SQL/Hive, Jaql, Pig, HBase
   –  Accelerators

Page 12: 2014.07.11 biginsights data2014


Application linking and interfaces to build new apps

•  Compose new applications from existing applications and BigSheets
•  Invoke analytics applications from the web console, including integration within BigSheets
•  REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
•  Sampling App that enables users to sample data for analysis
•  Subsetting App that enables users to subset data for data analysis

Page 13: 2014.07.11 biginsights data2014


Collaborative Big Data for many roles

•  Business users can get their hands on big data and use big data applications and BigSheets to get insights into their data
•  Data scientists can perform deeper analysis and get richer insights
•  Administrators are empowered to be more agile through better controls and views into key performance indicators
•  Developers can leverage unified tooling in a Big Data Application Development Lifecycle and are able to create and deploy new types of applications, with enhancements that simplify even complex workflows

Page 14: 2014.07.11 biginsights data2014


Big SQL – Architected for Performance

•  Leverage IBM's rich SQL heritage, expertise, and technology
   –  Modern SQL:2011 capabilities
   –  DB2 compatible SQL PL support
      •  SQL bodied functions and stored procedures
      •  Application logic/security encapsulation
•  Architected from the ground up for performance
   –  Low latency and high throughput
•  MapReduce replaced with a modern MPP architecture
   –  Compiler and runtime are native code (not Java)
   –  Big SQL worker daemons live directly on cluster
   –  Continuously running (no startup latency)
   –  Processing happens locally at the data
•  Operations occur in memory with the ability to spill to disk
   –  Supports aggregations and sorts larger than available RAM
•  Integration with BigSheets (source & target)

Diagram labels: SQL-based Application; IBM Data Server Client; InfoSphere BigInsights; Big SQL; SQL MPP Runtime; Data Sources (Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom).
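The general technique behind "sorts larger than available RAM" is an external merge sort: sort what fits in memory, spill each sorted run to disk, then merge the runs in a streaming fashion. The sketch below illustrates that textbook technique for integers only; it is not Big SQL's implementation, and the function name and `max_in_memory` parameter are assumptions made for the example.

```python
import heapq
import os
import tempfile

def external_sort(values, max_in_memory=1000):
    """Yield integers in sorted order while buffering at most
    max_in_memory of them in RAM during the run-building phase."""
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge streams the sorted runs instead of loading them whole.
        runs = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)

result = list(external_sort([5, 3, 9, 1, 7, 2], max_in_memory=2))
```

With `max_in_memory=2` the six values are spilled as three sorted runs and merged back into a single ordered stream, so memory use during the merge depends on the number of runs rather than the size of the input.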

Page 15: 2014.07.11 biginsights data2014


Big SQL features

Application Portability & Integration
•  Data shared with Hadoop ecosystem
•  Comprehensive file format support
•  Superior enablement of IBM software
•  Enhanced by third party software

Performance
•  Modern MPP runtime
•  Powerful SQL query rewriter
•  Cost based optimizer
•  Optimized for concurrent user throughput
•  Results not constrained by memory

Federation
•  Distributed requests to multiple data sources within a single SQL statement
•  Main data sources supported: DB2 LUW, DB2/z, Teradata, Oracle, Netezza

Enterprise Features
•  Advanced security/auditing
•  Resource and workload management
•  Self tuning memory management
•  Comprehensive monitoring

Rich SQL
•  Comprehensive SQL support
•  IBM SQL PL compatibility

Page 16: 2014.07.11 biginsights data2014


IBM BigInsights brings efficient integration of R with Big R

•  R as a big data query language
   –  Outside-in execution
•  R as a statistical language for deep computing
   –  Inside-out execution
   –  Partitioning of large data (“divide”)
   –  Parallel cluster execution of pushed down R code (“conquer”)
   –  Almost any R package can run in this environment
•  R as the gateway to scalable machine learning
   –  A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R

Diagram labels: R Clients; R Packages; Scalable ML Engine; Embedded R Execution; Data Sources. Annotations: "Pull data (summaries) to R client" and "Or, push R functions right on the data".
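The "divide" and "conquer" pattern described for Big R can be illustrated generically: split the data into partitions by a key, run the same function on each partition in parallel, and collect only the per-partition results. The sketch below is in Python rather than R and does not use the Big R API; the function names and the sample flight data are hypothetical, and a thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def partition_by(rows, key):
    """'Divide': group rows into partitions by a key function."""
    parts = {}
    for row in rows:
        parts.setdefault(key(row), []).append(row)
    return parts

def apply_per_partition(parts, func, workers=4):
    """'Conquer': run func on every partition in parallel, collect results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {k: pool.submit(func, part) for k, part in parts.items()}
        return {k: fut.result() for k, fut in futures.items()}

# Hypothetical sample data: (carrier, arrival delay in minutes).
flights = [("AA", 10), ("AA", 30), ("LH", 5), ("LH", 15), ("LH", 25)]
parts = partition_by(flights, key=lambda r: r[0])
avg_delay = apply_per_partition(parts, lambda part: mean(d for _, d in part))
```

Only the small per-carrier summaries travel back to the caller, which mirrors the slide's distinction between pulling summaries to the client and pushing functions to the data.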

Page 17: 2014.07.11 biginsights data2014


Text Analytics in BigInsights

Distill structured information from unstructured data
   –  Rich annotator library supports multiple languages
   –  Declarative Information Extraction (IE) system based on an algebraic framework
   –  Richer, cleaner rule semantics
   –  Better performance through optimization

How it works
•  Parses text and detects meaning with annotators
•  Understands the context in which the text is analyzed
•  Hundreds of pre-built annotators for names, addresses, phone numbers, among others

Accuracy
•  Highly accurate in deriving meaning from complex text

Performance
•  AQL language optimized for MapReduce

Example: unstructured text (document, email, etc.) as input, classification and insight as output

Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
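To show what "structured information from unstructured data" can look like for the football example, here is a rough sketch using plain regular expressions and small dictionaries. It is not AQL and does not use the BigInsights annotator library; the team and role dictionaries and the two-word person-name heuristic are assumptions that only hold for this sample text.

```python
import re

TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. Early in the second "
    "half, Netherlands’ striker, Arjen Robben, had a breakaway, but the "
    "keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta "
    "scored for Spain for the win."
)

# Hypothetical dictionaries standing in for pre-built annotators.
TEAMS = {"Netherlands", "Spain"}
ROLE_WORDS = {"Striker", "Keeper", "Winger"}

SCORE = re.compile(r"\b(\d+)-(\d+)\b")
CAPITALISED_RUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

def extract(text):
    """Return people, teams and score found by simple rules."""
    people = []
    for run in CAPITALISED_RUN.findall(text):
        # Drop leading role words such as "Winger", then keep two-word names.
        words = [w for w in run.split() if w not in ROLE_WORDS]
        if len(words) == 2 and not set(words) & TEAMS:
            people.append(" ".join(words))
    teams = sorted({w for w in re.findall(r"[A-Z][a-z]+", text) if w in TEAMS})
    m = SCORE.search(text)
    score = (int(m.group(1)), int(m.group(2))) if m else None
    return {"people": people, "teams": teams, "score": score}

facts = extract(TEXT)
```

Hand-written rules like these break quickly on other texts, which is the gap a declarative extraction language with a library of pre-built annotators is meant to close.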

Page 18: 2014.07.11 biginsights data2014


BigInsights offers value beyond Open Source

Diagram labels: Enterprise Capabilities (Administration & Security; Workload Optimization; Connectors; Advanced Engines; Visualization & Exploration; Development Tools); Open source components (IBM-certified Apache Hadoop).

Key differentiators
•  Built-in analytics
•  Enterprise software integration
•  Spreadsheet-style analysis
•  Integrated installation of supported open source and other components
•  Web Console for admin and application access
•  Platform enrichment: additional security, performance features, ...
•  World-class support
•  Full open source compatibility

Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software

Page 19: 2014.07.11 biginsights data2014


InfoSphere BigInsights for Hadoop includes the latest Open Source components, enhanced by enterprise components

Diagram: IBM InfoSphere BigInsights for Hadoop component stack (legend: IBM, Open Source).

Categories: Runtime; File System; Data Store; Resource Management & Administration; Security; Data Access; Advanced Analytics; Visualization & Ad Hoc Analytics; Applications & Development; Governance; Stream Computing; Search.

Components: MapReduce; Adaptive MapReduce; YARN*; HDFS; GPFS FPO; HBase; ZooKeeper; Oozie; Flexible Scheduler; Console; Monitoring; Kerberos; LDAP; Audit & History; Jaql; Pig; Hive; Big SQL; HCatalog; Flume; Sqoop; ETL; Text Analytics; R; Big R; BigSheets; Dashboard; Charting; Eclipse Tooling (MapReduce, Hive, Jaql, Pig, Big SQL, AQL); BigSheets Reader and Macro; Text Analytics Extractors; Data Security for Hadoop; Data Privacy for Hadoop; Data Matching; Data Masking; Streams; Enterprise Search; Solr/Lucene.

* In Beta

Page 20: 2014.07.11 biginsights data2014


From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

Diagram axes: Breadth of capabilities; Enterprise class. All editions build on Apache Hadoop.

Quick Start
•  Free, non-production
•  Same features as Standard Edition plus text analytics and Big R

Standard Edition
•  Spreadsheet-style tool
•  Web console
•  Dashboards
•  Pre-built applications
•  Eclipse tooling
•  RDBMS connectivity
•  Big SQL
•  Monitoring and alerts
•  Platform enhancements
•  ...

Enterprise Edition
•  Accelerators
•  GPFS-FPO
•  Adaptive MapReduce
•  Text analytics
•  Enterprise Integration
•  Big R
•  InfoSphere Streams*
•  Watson Explorer*
•  Cognos BI*
•  Data Click*
•  ...

* Limited use license
