Managing Big Data using New Innovations with HPCC Systems

Bob Foreman – Senior Software Engineer/ECL Instructor

Twitter: #HPCCMeetup

Welcome!

• HPCC Systems has been open source since June 2011

• Although the base technology has remained consistent, the last 6 years have seen many new support technologies unfold.

• These technologies have enhanced and extended the base technology, and HPCC Systems remains ahead of the curve with these new innovations.

• We will look at many of them in this presentation.

2 #HPCCSystems Managing Big Data using New Innovations with HPCC Systems

Agenda

• Quick intro on the platform and history before 2011
• Open Source in 2011
• Machine Learning February 2012 - 2017, many changes
• Continuous updates and improvements in speed and compiler power
• Changes in the ECL Watch (Version 5 and 6)
• ECL Playground
• New services, like WSSQL
• Plugin support (Ganglia, Nagios, Kafka, Security Manager, etc.)
• EMBED support - new feature for EMBED
• KEL lite
• STRIKE initiative
• Looking ahead to Version 7

History of HPCC Systems (High Performance Computing Cluster)

Open sourcing a long established big data strategy

Why Does HPCC Systems Exist?

• It was NOT developed with the idea of selling the technology to anybody else!

• It was all created only to solve some of the data-handling problems that we encountered as we were developing our products.

The Result Of All That Development?

HPCC Systems

A single, fully integrated platform supporting the entire life cycle of Big Data product development:

• Raw Data Ingest – Thor
• Data Transformation to Product – Thor
• End-user Query Development – Thor
• End-user Query Delivery – ROXIE

The Complete Big Data Value Chain

• Collection – collecting structured, unstructured and semi-structured data

• Ingestion – consuming vast amounts of data including extraction, transforming and loading

• Discovery & Cleansing - clean up, formatting and statistical analysis of the data

• Integration – linking, indexing and data fusion

• Analysis – statistics and machine learning

• Delivery – querying, visualization, redundancy, and enterprise-class availability

HPCC Systems Platform

There are two types of clusters in HPCC Systems:

• Data Refinery (THOR) – Processes every one of billions of records in order to create billions of "improved" records – runs one job at a time.

• Rapid Data Delivery Engine (ROXIE) – Searches quickly for a particular record or set of records – handles thousands of concurrent transactions per second.

• Both are tightly coupled to the infrastructure that supports their operation, and the ECL programming language that defines the work done on them.

HPCC Systems Hardware

• Clusters of commercial off-the-shelf (COTS) components. Components are ideally homogeneous (all processing and disk storage components are the same) and the system is tightly coupled.

• Nodes are managed en masse instead of individually, which allows coordinated processing like global sorts (unlike Grid systems).
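
To make the idea of a coordinated global sort concrete, here is a minimal Python sketch. It is not HPCC Systems code, and the slides do not describe how Thor implements its sort. The sketch assumes a simple range-partitioning scheme: sampled values pick split points, each record is routed to the node that owns its value range, and every node then sorts locally. The names `global_sort` and `nodes` are hypothetical.

```python
import bisect
import random


def global_sort(nodes, seed=0):
    """Sort records spread over several nodes so that every value on node i
    is <= every value on node i+1 (a range-partitioned global sort)."""
    n = len(nodes)
    all_values = [v for part in nodes for v in part]
    if not all_values:
        return [[] for _ in nodes]

    # 1. Sample the data to choose n-1 split points (assumed strategy).
    rng = random.Random(seed)
    sample = sorted(rng.sample(all_values, min(len(all_values), 10 * n)))
    splits = [sample[(i * len(sample)) // n] for i in range(1, n)]

    # 2. Route each record to the node that owns its value range.
    buckets = [[] for _ in range(n)]
    for v in all_values:
        buckets[bisect.bisect_right(splits, v)].append(v)

    # 3. Each node sorts its own bucket independently.
    return [sorted(b) for b in buckets]


nodes = [[42, 7, 19], [3, 88, 61], [25, 1, 54]]
sorted_nodes = global_sort(nodes)
flattened = [v for part in sorted_nodes for v in part]
```

Reading the parts in node order yields the fully sorted data, which is only possible because the nodes agree on the split points up front.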

Thor Cluster

• Brute force: Thor operates on massive amounts of data where datasets typically contain billions of records

• Open Data Model: The data model is defined by the user, not constrained by the limitations of a strict key-value paradigm

• Scalable: Horizontally linear scalability provides room to accommodate future data and performance growth

• Truly parallel: Datagraph Nodes can be processed in parallel as data seamlessly flows through them, effectively avoiding the well-known “long tail problem”, resulting in higher and predictable performance.

• Powerful optimizer: The HPCC Systems optimizer ensures submitted ECL code executes at the maximum possible speed for the underlying hardware. Advanced techniques such as lazy execution and code reordering are thoroughly utilized to maximize performance

ROXIE Cluster

• Low latency: Data queries typically complete in under a second

• Not a key-value store: ROXIE is not limited by the constraints of key-value data stores, allowing for complex queries, multi-key retrieval, fuzzy matching and more

• Highly available: ROXIE operates in critical environments under the most rigorous service level requirements

• Scalable: Horizontally linear scalability provides room to accommodate future data and performance growth

• Highly concurrent: In a typical environment, thousands of concurrent clients can be simultaneously executing transactions on the same ROXIE system

• Redundant: A shared-nothing architecture with no single point of failure provides extreme fault tolerance

HPCC Systems Platform

• Batteries included: All components create a consistent and homogeneous platform

• Over 15 years of experience: The HPCC Systems platform is the technology underpinning LexisNexis data offerings – its development began in 1999

• Few moving parts: HPCC Systems is an integrated solution extending across the entire data lifecycle, from data ingest and transformation to data delivery – no third party tools needed

• Multiple data formats: Supported out of the box, including fixed and variable length, delimited records, and XML

• ECL inside: One language describes both the data transformations in Thor and the data delivery strategies in ROXIE. Solutions to complex data problems are expressed easily and directly in terms of high level ECL primitives.

• Consistent tools: Thor and ROXIE share the same set of tools, which provides consistency across the platform.

Data on HPCC Systems

• Open Data Model: The data model is defined by the user, as standard files, records, and fields (tables, rows, and columns)

• Simple: Solutions to complex data problems can be expressed easily and directly in terms of high level ECL primitives

• Implicitly parallel: Data is always in distributed datasets whose parts are managed by the DFU, eliminating the need for programmers to manage the complexity of working with distributed datasets
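
As a rough illustration of a logical file held as distributed parts, the Python sketch below splits a list of records into contiguous parts, keeps every record whole on exactly one part, and then runs the same filter on each part before combining the results. This is an analogy only: the `Person` layout, `split_into_parts`, and `count_matching` are hypothetical names, and the slides do not describe how the DFU actually places data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record layout, for illustration only.
    person_id: int
    name: str


def split_into_parts(records, node_count):
    """Split a logical file into node_count contiguous parts; every record
    lands whole on exactly one part."""
    base, extra = divmod(len(records), node_count)
    parts, start = [], 0
    for node in range(node_count):
        size = base + (1 if node < extra else 0)
        parts.append(records[start:start + size])
        start += size
    return parts


def count_matching(parts, predicate):
    """Apply the same filter to each part and combine the per-part counts,
    so the caller sees one answer for the whole logical file."""
    return sum(sum(1 for rec in part if predicate(rec)) for part in parts)


people = [Person(i, f"name{i}") for i in range(10)]
parts = split_into_parts(people, node_count=3)
even_ids = count_matching(parts, lambda p: p.person_id % 2 == 0)
```

The caller of `count_matching` never touches an individual part, which is the point of the "implicitly parallel" bullet: the partitioning is the platform's concern, not the programmer's.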

Data on HPCC Systems

• Data is stored in ISAM Files
• Native support for:
  • Flat files, with fixed or variable-length records
  • CSV-type files (any delimiters may be used)
  • XML datasets
  • New JSON format support
• Each Record is always whole and complete on a single node
• A Record may have as many fields as needed
• Indexes are always LZW compressed and may contain “payload” fields in addition to search terms

What is ECL? (Enterprise Control Language)

• Declarative programming language: “Describes what needs to be done, not how to do it”

• Powerful: Unlike Java, high level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher level code means fewer programmers and shorter time to deliver complete projects

• Extensible: As new definitions are created, they become primitives that other programmers can use

• Implicitly parallel: Parallelism is built into the underlying platform. The programmer need not be concerned with it

• Maintainable: A high-level programming language with no side effects and encapsulated definitions makes for code that is more succinct, more reliable, and easier to troubleshoot

• Complete: Unlike Pig and Hive, ECL provides for a complete programming paradigm.

• Homogeneous: One language to express data algorithms across the entire HPCC Systems platform, including data ETL and delivery.
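
The "what, not how" idea can be illustrated with a small Python analogy. This is not ECL and not how the ECL compiler works. In this sketch, each definition is a named, side-effect-free expression built on other definitions. Nothing runs until a result is requested, and a definition that no result depends on never executes. The `Definition` class is a hypothetical helper written for this example.

```python
class Definition:
    """A named, side-effect-free expression that is evaluated only on demand."""

    def __init__(self, name, compute, *inputs):
        self.name = name
        self.compute = compute
        self.inputs = inputs
        self._cached = None
        self._done = False

    def value(self, trace):
        # Evaluate the inputs first, then this definition, at most once each.
        if not self._done:
            args = [d.value(trace) for d in self.inputs]
            self._cached = self.compute(*args)
            self._done = True
            trace.append(self.name)
        return self._cached


raw = Definition("raw", lambda: [5, 3, 8, 3, 1])
deduped = Definition("deduped", lambda rows: sorted(set(rows)), raw)
total = Definition("total", lambda rows: sum(rows), deduped)
unused = Definition("unused", lambda rows: max(rows), raw)  # never requested

trace = []
result = total.value(trace)
```

Requesting `total` pulls in only `raw` and `deduped`; `unused` is defined but never evaluated, because the program states relationships between definitions rather than a sequence of steps.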

Machine Learning

Machine Learning and HPCC Systems

• The HPCC Machine Learning Library contains an extensible collection of machine learning routines which are easy and efficient to use and are designed to execute in parallel across a cluster.

• In 2012 the first set of modules was released:
o Associations (ML.Associate)
o Classify (ML.Classify)
o Cluster (ML.Cluster)
o Correlations (ML.Correlate)
o Discretize (ML.Discretize)
o Distribution (ML.Distribution)
o Field Aggregates (ML.FieldAggregates)
o Regression (ML.Regression)
o Visualization (ML.VL)

• https://hpccsystems.com/download/free-modules/ecl-ml

Machine Learning and HPCC Systems

• In 2017, there are several new ML algorithms implemented and under development.
• These algorithms now use the ECL bundle technology.

o PBBlas (parallel block basic linear algebra subprograms)
o Time Series (TS)
o Neural Networks
o Deep Learning
o Ensemble
o NFold Cross Validation
o Population Estimate
o LDA (Linear Discriminant Analysis)
o LSA (Latent Semantic Analysis)
o StepwiseLogistic
o SVM (Support Vector Machine)

• https://github.com/hpcc-systems/ecl-ml
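
For readers unfamiliar with N-fold cross validation, one of the techniques listed above, the following Python sketch shows the general splitting idea. It is a generic illustration, not the API of the ecl-ml bundles; `n_fold_splits` is a hypothetical helper name.

```python
def n_fold_splits(n_records, n_folds):
    """Yield (train_indices, test_indices) pairs; each record index is used
    for testing in exactly one fold."""
    if not 1 < n_folds <= n_records:
        raise ValueError("n_folds must be between 2 and n_records")
    indices = list(range(n_records))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th record, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


splits = list(n_fold_splits(n_records=10, n_folds=3))
```

A model would be trained on each `train` set and scored on the matching `test` set, and the per-fold scores averaged to estimate how well it generalizes.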

Internal Updates and Improvements

6.0 and 6.2 Internal Improvements

• Virtual Slave THOR
• Parallel Activity Execution
• Affinity support in Thor
• Optimized merge sort for large number of cores
• LZ4 compression for temporary files
• Refresh Boolean option on persist
• Parallel child query execution in Thor
• Memory management improvements
• Lookup JOINS in child queries

References:
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-60x-feature-highlights-part-1
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-60x-feature-highlights-part-2
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-62x-here-whats-it-you

ECL Watch (Version 5 and 6)

ECL Watch

• Long awaited face lift in Version 5.0
• Upgrade was completed in Version 6.0 with even more features
• Ability to spray multiple files of same type with one click
• New File Uploader
• Hex Previewer
• Enhanced filtering throughout
• Improved Query Viewer, including Package Maps
• New Plug-in interface
• Improved Workunit Graphs
• Built in Visualization

References:

https://www.youtube.com/watch?v=fupH_to2i84#action=share

https://www.youtube.com/watch?v=wm4xtNsR4bA#action=share

ECL Playground

ECL Playground

• New ESP web service

References:

http://cdn.hpccsystems.com/releases/CE-Candidate-6.2.0/docs/ECL_Playground-6.2.0-1.pdf

http://cdn.hpccsystems.com/podcasts/2012_0904_v1_ECL_Playground.mp3

WSSQL

WSSQL

• Add-on service that provides an SQL interface into HPCC Systems
• Submit SQL queries directly to HPCC via SOAP
• Access HPCC data files and published queries
• Analyze HPCC data using familiar SQL syntax
• Supports SQL SELECT or CALL syntax
  • Access HPCC data files as DB Tables
  • Access published queries as DB Stored Procedures
• Supports SQL Create and Load Syntax
• Harnesses the full power of HPCC under the covers
  • Submitted SQL request generates ECL code which is submitted, compiled, and executed on your target cluster
  • Automatic Index fetching capabilities for quicker data fetches
• Creates entry-point for programmatic data access
• Leverage HPCC data without need to learn and write ECL!
  • Opens the door for non ECL programmers to access HPCC data.

References:
http://cdn.hpccsystems.com/releases/CE-Candidate-6.2.0/docs/WsSQL_ESP_Web_Service_Users_Guide-6.2.0-1.pdf

Plugins!

Library and Datastore PlugIns

• New Plugin interface with ECL Watch

• Built ins (Debug, File Services)

• Audit and Logging

• dMetaphone (double metaphone)

• Apache Kafka

• Security Manager

• Redis

• Memcached

References:
https://github.com/hpcc-systems/HPCC-Platform/tree/master/plugins

Embedded Language PlugIns

• C++

• R Integration

• Couchbase

• Java

• JavaScript

• MySQL

• Python

• SQLite3

• Cassandra

References:
https://hpccsystems.com/resources/blog/lchapman/using-your-favorite-language-or-data-source-hpcc-systems
https://hpccsystems.com/resources/blog/richardkchapman/projecting-fields-embeds
Use and abuse of the EMBED feature: https://hpccsystems.com/bb/viewtopic.php?f=41&t=1509

New Horizons Working with TensorFlow

Working with TensorFlow

• Wonderful blog written by Richard Chapman

• TensorFlow is a new open-source library from Google

• Performs linear algebra operations on tensors (multi-dimensional arrays that generalize matrices) and connects multiple operations together.

• Particularly suited for machine learning applications and large datasets

• Works with HPCC 6.2 and greater versions

• Implemented in ECL using Python EMBED

• Shows how a TensorFlow model could be used inside an ECL workflow!

• This test resulted in enhanced Python plug-in capabilities.

References:
https://hpccsystems.com/resources/blog/richardkchapman/embedding-tensorflow-operations-ecl
https://www.tensorflow.org/

KEL (Knowledge Engineering Language)

Knowledge Engineering Language

• Designed for Data Modeling

• KEL expands the ECL specification of data flows and algorithms.

• Presumes that the user wants control over:

• the logical data model

• the analytic logic

• the mathematics

• ENTITY, MODEL, and ASSOCIATION

• Data Mapping (USE)

• Logic (GLOBAL)

• OUTPUT or QUERY

References:
https://hpccsystems.com/download/free-modules/kel-lite

Sample KEL:

The STRIKE Initiative

STRIKE Is:

SALT
• Large Scale Data Integration
• Probabilistic Linking
• Entity Disambiguation and Resolution

THOR
• Batch Oriented Big Data Processing and Analytical Engine
• Machine Learning and Supervised Model Training
• Clustering

ROXIE
• Horizontally Scalable and fault tolerant real-time disk based retrieval
• Redundant Data Channels and Meta-key Search
• Horizontally Scalable In-Memory Analytics

INTERLOK
• Seamless Integration with hundreds of data stores
• Real time data ingest
• Flexible stream processing

KEL
• Graph/Network data models
• Complex Queries based on n-degree relationships and attributes
• Highly efficient

ECL
• Dataflow oriented declarative data programming language
• Compiles to C++ for optimal performance
• High expressivity and conciseness

Coming Soon…

Coming soon in 2017…

• Ease of use
• Reliability and Stability
• Machine Learning
• Security
• Interoperability
• Dali Replacement for DFS
• Opportunistic Improvements
• Text Search
• Multi-core support
• Cloud/Hive 360 Support

And contributions and suggestions from YOU !!!

References:
https://track.hpccsystems.com/secure/Dashboard.jspa
https://hpccsystems.com/community/how-to-contribute

Getting Started

Install:

1. Oracle’s VirtualBox:
https://www.virtualbox.org/wiki/Downloads

2. ECL IDE and Documentation:
https://hpccsystems.com/download/developer-tools/ecl-ide
https://hpccsystems.com/download/documentation

Getting Started

Run:

1. Launch your VM player.

2. Import the HPCC Virtual Machine .ova file:
http://hpccsystems.com/download/hpcc-vm-image

3. Note the IP address shown next to the “IP Address:” prompt at the top of the VM.

This IP address is the key to allowing the HPCC Systems client tools to access the environment.

That’s All Folks!

And there’s so much more to learn!!!

Thanks for Attending!
