Teradata Loom Introductory Presentation

Teradata Loom Realize the Promise of the Hadoop Data Lake Michael Lang October 2014

Upload: mlang222

Post on 16-Jul-2015


TRANSCRIPT

Teradata Loom: Realize the Promise of the Hadoop Data Lake

Michael Lang

October 2014

2

• Single source of raw, historical, operational data

• Cost effectively explore data sets

– Unknown, under-appreciated, or unrecognized value

• Consolidate data environments

– Reduces costs and analytical discrepancies

• Co-location of files enables light, on-the-fly integration

Data Lake Promise

Diagram labels: IDW, Web Logs, Sensors, Mobile, Files

A “Data Lake” is a massive repository enabled by low cost technologies that improves the capture, refinement, and exploration of raw data within an enterprise.

3

Let’s just build a Data Lake!

4

• Hadoop lacks many core features of existing data management platforms that a Data Lake still requires

– Does not provide rich metadata for all data, including lineage

– Does not provide data profiling

• Data is difficult to find, understand, and use

– The data remains in “knowledge” silos in the Data Lake – end users need to understand how to reconcile the data across sources

– Data quality is unknown: all data in the Data Lake is treated as being of equal quality, which can result in inconsistencies or errors

– There is no systematic method of understanding what is in the Data Lake or what knowledge has already been determined

The Hadoop Data Lake Challenge

Hadoop capability gaps prevent organizations from realizing the promise of the Data Lake

5

“Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp.”

6

• The CRISP-DM model for data mining was introduced in 2000 and has become widely adopted

• Loom helps with

– Data Discovery

– Data Understanding

– Data Preparation


How Does Loom Help?


7

Find and Understand Your Data

• Data Cataloging and Profiling with Activescan

– Discovering and introspecting new data in the cluster

– Event triggers

– Job detection and lineage creation

– Data profiling

• Data Exploration and Discovery

– Technical and business metadata

– Data sampling and previews

– Lineage relationships

– Search and browse through Workbench

Prepare Your Data

• Data Wrangling with Weaver

– Self-service, interactive data wrangling for Hadoop

– Leverages Loom metadata registry

• SQL-style Transforms with Hive

– Joins, unions, aggregations, UDFs

Loom Capabilities
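The "Data profiling" capability listed on this slide can be illustrated with a minimal Python sketch. It is not Loom's implementation: the function names `profile_column` and `profile_csv` are hypothetical, and an in-memory CSV string stands in for a file in the cluster. It computes per-column counts, nulls, distinct values, and numeric min/max/mean.

```python
import csv
import io
from typing import Dict, List


def profile_column(values: List[str]) -> Dict[str, object]:
    """Descriptive statistics for one column of raw string values."""
    non_empty = [v for v in values if v.strip() != ""]
    stats: Dict[str, object] = {
        "count": len(values),
        "nulls": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
    }
    # Treat the column as numeric only if every non-empty value parses.
    try:
        numbers = [float(v) for v in non_empty]
    except ValueError:
        numbers = []
    if numbers:
        stats["min"] = min(numbers)
        stats["max"] = max(numbers)
        stats["mean"] = sum(numbers) / len(numbers)
    return stats


def profile_csv(text: str) -> Dict[str, Dict[str, object]]:
    """Profile every column of a CSV document that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    columns: Dict[str, List[str]] = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            columns[name].append(row[name] or "")
    return {name: profile_column(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    sample = "id,city,temp\n1,Oslo,4.5\n2,,7.5\n3,Oslo,\n"
    for column, stats in profile_csv(sample).items():
        print(column, stats)
```

Statistics of this kind are what an analyst would browse before deciding whether a data set is fit for use.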

8

Analysts

– Simple way to find and understand data

– Self-service data preparation

– Good interfaces for accessing data from existing analytics tools

Data Stewards, Architects, and Big Data Specialists

– Centralized source of metadata

– Transparency into data lake

– Custom attributes for easy extensibility

– Activescan event triggers and API for automating tasks

User Benefits

9

Technical Metadata

– Data location, format, structure, schema

– Data profiling statistics

– Data sampling

– Lineage

Business Metadata

– Descriptive attributes

– Custom properties

– Business glossaries

Search and Discovery

– Search over metadata

– Navigate relationships between entities

Open API

– RESTful API that developers can use to integrate their own applications and use cases, and to extend metadata management beyond Hadoop to other big data systems

Loom Metadata Management
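As a rough illustration of "search over metadata" and "navigate relationships between entities", the following Python sketch keeps technical and business metadata in an in-memory registry. Everything here is an assumption made for illustration: the `Entry` and `Registry` classes, their fields, and the example data set names are hypothetical and do not reflect Loom's registry or its REST API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    """One catalogued data set (hypothetical structure)."""
    name: str
    technical: Dict[str, str] = field(default_factory=dict)  # location, format, ...
    business: Dict[str, str] = field(default_factory=dict)   # description, owner, ...
    derived_from: List[str] = field(default_factory=list)    # direct lineage parents


class Registry:
    def __init__(self) -> None:
        self.entries: Dict[str, Entry] = {}

    def register(self, entry: Entry) -> None:
        self.entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        """Case-insensitive substring search over names and metadata values."""
        needle = term.lower()
        hits = []
        for entry in self.entries.values():
            values = [entry.name, *entry.technical.values(), *entry.business.values()]
            if any(needle in value.lower() for value in values):
                hits.append(entry.name)
        return sorted(hits)

    def upstream(self, name: str) -> Set[str]:
        """All ancestors of a data set, found by walking lineage links."""
        seen: Set[str] = set()
        stack = list(self.entries[name].derived_from)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self.entries:
                stack.extend(self.entries[current].derived_from)
        return seen


if __name__ == "__main__":
    registry = Registry()
    registry.register(Entry("raw_weblogs", technical={"format": "csv"}))
    registry.register(Entry("clean_weblogs", derived_from=["raw_weblogs"]))
    print(registry.search("weblogs"))
    print(registry.upstream("clean_weblogs"))
```

Keeping lineage as explicit parent links makes "where did this data set come from" a simple graph walk.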

10

• Source Cataloging

– Scanning HDFS and Hive at scheduled intervals looking for new files/directories

– Scanning Sqoop and TDCH jobs to generate lineage

• Source Profiling

– Extracting metadata from file system, Hive metastore, and Sqoop/TDCH configs

– Introspecting files for embedded metadata and data structure

– Generating samples to support data previews

• Data Profiling

– Generating descriptive statistics by introspecting data

• Event Triggers

– Triggering downstream actions based on source cataloging and profiling

Loom Activescan
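The cataloging and event-trigger ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local directory stands in for HDFS, the `CatalogScanner` class and its methods are hypothetical names, and a real deployment would call `scan()` from a scheduler at fixed intervals.

```python
import os
import tempfile
from typing import Callable, Dict, List

Trigger = Callable[[str], None]


class CatalogScanner:
    """Remembers which files were already catalogued; fires triggers for new ones."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.catalog: Dict[str, int] = {}   # path -> size in bytes
        self.triggers: List[Trigger] = []

    def on_new_source(self, trigger: Trigger) -> None:
        """Register a downstream action to run for each newly found file."""
        self.triggers.append(trigger)

    def scan(self) -> List[str]:
        """Walk the tree once and return the paths not seen in earlier scans."""
        new_paths: List[str] = []
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if path not in self.catalog:
                    self.catalog[path] = os.path.getsize(path)
                    new_paths.append(path)
        for path in new_paths:
            for trigger in self.triggers:
                trigger(path)
        return new_paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        scanner = CatalogScanner(root)
        scanner.on_new_source(lambda p: print("new source:", os.path.basename(p)))
        scanner.scan()                      # empty directory: nothing reported
        with open(os.path.join(root, "weblogs.csv"), "w") as handle:
            handle.write("ts,url\n")
        scanner.scan()                      # reports weblogs.csv once
        scanner.scan()                      # already catalogued: nothing reported
```

A trigger here is just a callable, so profiling or notification steps can be chained without changing the scanner.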

11

• Data preparation consumes a large amount of an analyst’s time

– Modify and combine column values to create new columns

– Modify schemas – add/delete/rename columns, convert datatypes

• Self-service, interactive UI for working with large data sets

– Work with a sample of the data set for quick iteration

– Once the sample is in the desired form, Loom will apply all of the steps against the full data set via MapReduce

• Leverages the Loom Metadata Registry

– All data cleaning steps are tracked to provide a complete data lineage picture from the raw source data to the data sets used for analytics

– User benefits from context provided by metadata in Loom Registry

Loom Weaver
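The workflow described on this slide (iterate on a sample, replay the same steps on the full data set, keep the steps as lineage) can be sketched as follows. This is a simplified illustration, not Weaver itself: the `Recipe` class and helper functions are hypothetical, and a plain Python loop stands in for the MapReduce job that Loom is described as running.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Transform = Callable[[Row], Row]
Step = Tuple[str, Transform]  # (human-readable description, row transform)


class Recipe:
    """An ordered list of wrangling steps that can be replayed on any rows."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, description: str, fn: Transform) -> "Recipe":
        self.steps.append((description, fn))
        return self

    def apply(self, rows: List[Row]) -> List[Row]:
        """Run every step, in order, on a copy of each row."""
        result = []
        for row in rows:
            current = dict(row)
            for _description, fn in self.steps:
                current = fn(current)
            result.append(current)
        return result

    def lineage(self) -> List[str]:
        """The recorded steps, from raw data to prepared data."""
        return [description for description, _fn in self.steps]


def combine(new: str, left: str, right: str) -> Transform:
    def fn(row: Row) -> Row:
        out = dict(row)
        out[new] = f"{row[left]} {row[right]}"
        return out
    return fn


def rename(old: str, new: str) -> Transform:
    def fn(row: Row) -> Row:
        out = dict(row)
        out[new] = out.pop(old)
        return out
    return fn


def convert(column: str, caster: Callable[[object], object]) -> Transform:
    def fn(row: Row) -> Row:
        out = dict(row)
        out[column] = caster(out[column])
        return out
    return fn


def take_sample(rows: List[Row], size: int, seed: int = 0) -> List[Row]:
    """Reproducible random sample for quick iteration."""
    return random.Random(seed).sample(rows, min(size, len(rows)))


def example_recipe() -> Recipe:
    return (
        Recipe()
        .add("combine first + last into full_name", combine("full_name", "first", "last"))
        .add("rename age to years", rename("age", "years"))
        .add("convert years to int", convert("years", int))
    )
```

An analyst would first inspect `example_recipe().apply(take_sample(rows, 100))`, then call `apply(rows)` on the full data set once the sample looks right; `lineage()` returns the recorded steps.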

12

Loom in the Data Lake

Diagram labels (Hadoop): Import Landing Zone, Integration, Cleansing, Wrangling, Analysis, Fit-for-Use Data, External Access

Steps: 1. Catalog and Profile, 2. Enrich and Explore, 3. Prepare for Analysis

13 © 2014 Teradata

Loom Architecture

Diagram labels: Loom Server, Loom Interface, Loom Workbench, Loom API, Loom Activescan, Loom Services, Registry, Persistence, Hadoop Environment (HDFS, Hive, HCat)

14

Open source tools that extend Hadoop’s capabilities and are integrated with Loom

– Import/Export

- Sqoop/TDCH (relational)

– Storage & Metadata

- Hive-HCatalog

– Processing

- Cascading, HiveQL

– Data Interfaces

- REST APIs (HCatalog), JDBC/ODBC (Hive)

– Security

- LDAP, Kerberos

Integration with Hadoop Ecosystem

Loom is certified and supported on all major distributions of Hadoop

Loom is integrated with the Teradata Open Distribution for Hadoop (TDH)

15

Summary of Loom Strengths

Simplifies Hadoop Use and Management

Increase Analyst Productivity

Find and Understand Your Data

• Data Cataloging and Profiling with Activescan

• Data Exploration and Discovery

Prepare Your Data for Analysis

• Data Wrangling with Weaver

• SQL-style Transforms with Hive

16 © 2014 Teradata

Demo

17 © 2014 Teradata

18

Loom Weaver

19 © 2014 Teradata

Lineage

20

• Teradata Loom Community Edition

– Freely downloadable as an add-on for all Hadoop distributions: teradata.com/tryloom

– Availability:

- Sandbox released on 10/15/2014

- Production-ready release available mid-Nov

• Teradata Loom Edition

– Premium version of Loom, subscription-licensed on a per-node basis

– Fully featured & fully supported

– Availability mid-Nov

• Will support major Hadoop distributions (TDH, HDP, CDH, MapR)

• Globally available, English only

– North American locale

Teradata Loom Editions

21

Features (columns: Community, Loom)

Standard Features

– Open metadata repository & API

– Automatic discovery & profiling of new data

– Lineage tracking via Loom UI and Loom API

– Search

– Ambari monitoring (future)

Premium Features

– Data wrangling steps/operations: Up to 20 (Community), Unlimited (Loom)

– Security authentication using Kerberos/LDAP

– Execution of custom scripts during data discovery

– Automated lineage tracking for data movement outside Hadoop

– Automated lineage tracking of Hive & MapReduce processes

Support: Community (Community), Teradata (Loom)

Loom Community Feature Limitations