Teradata Loom Introductory Presentation
2
• Single source of raw, historical, operational data
• Cost effectively explore data sets
– Unknown, under-appreciated, or unrecognized value
• Consolidate data environments
– Reduces costs and analytical discrepancies
• Co-location of files enables light, on-the-fly integration
Data Lake Promise
IDW
Web Logs
Sensors
Mobile
Files
A “Data Lake” is a massive repository enabled by low cost technologies that
improves the capture, refinement, and exploration of raw data within an
enterprise.
4
• Hadoop lacks many core features of existing data management platforms, features that are still required in a Data Lake
– Does not provide rich metadata for all data, including lineage
– Does not provide data profiling
• Data is difficult to find, understand, and use
– The data remains in “knowledge” silos in the Data Lake – end users need to understand how to reconcile the data across sources
– Data quality is unknown: all data in the Data Lake is treated as being of equal quality, which can result in inconsistencies or errors
– There is no systematic method of understanding what is in the Data Lake or what knowledge has already been determined
The Hadoop Data Lake Challenge
Hadoop capability gaps prevent organizations from realizing the promise of the
Data Lake
5
“Without descriptive metadata and a
mechanism to maintain it, the data lake
risks turning into a data swamp.”
6
• The CRISP-DM model for data mining was introduced in 2000 and has become widely adopted
• Loom helps with
– Data Discovery
– Data Understanding
– Data Preparation
How Does Loom Help?
Data Discovery
7
Find and Understand Your Data
• Data Cataloging and Profiling with Activescan
– Discovering and introspecting new data in the cluster
– Event triggers
– Job detection and lineage creation
– Data profiling
• Data Exploration and Discovery
– Technical and business metadata
– Data sampling and previews
– Lineage relationships
– Search and browse through Workbench
Prepare Your Data
• Data Wrangling with Weaver
– Self-service, interactive data wrangling for Hadoop
– Leverages Loom metadata registry
• SQL-style Transforms with Hive
– Joins, unions, aggregations, UDFs
Loom Capabilities
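The "job detection and lineage creation" capability listed above can be pictured as building a graph from detected jobs and walking it. The following is a minimal sketch only: the job records, dataset names, and function names are invented for illustration and do not reflect Loom's actual data model or API.

```python
from collections import defaultdict

# Hypothetical job records: each detected job reads some inputs and writes one output.
# All dataset and job names here are invented for illustration.
JOBS = [
    {"job": "sqoop_import_orders", "inputs": ["idw.orders"], "output": "landing/orders"},
    {"job": "hive_clean_orders", "inputs": ["landing/orders"], "output": "clean/orders"},
    {"job": "hive_join_weblogs", "inputs": ["clean/orders", "landing/weblogs"],
     "output": "analytics/order_sessions"},
]


def build_lineage(jobs):
    """Map each output dataset to the set of datasets it was derived from."""
    parents = defaultdict(set)
    for job in jobs:
        parents[job["output"]].update(job["inputs"])
    return parents


def upstream(dataset, parents):
    """Return every dataset the given dataset depends on, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = build_lineage(JOBS)
print(sorted(upstream("analytics/order_sessions", lineage)))
```

Asking for the upstream set of an analytics table answers the question an analyst typically has about lineage: which raw sources fed this data set.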
8
Analysts
– Simple way to find and understand data
– Self-service data preparation
– Good interfaces for accessing data from existing analytics tools
Data Stewards, Architects, and Big Data Specialists
– Centralized source of metadata
– Transparency into data lake
– Custom attributes for easy extensibility
– Activescan event triggers and API for automating tasks
User Benefits
9
Technical Metadata
– Data location, format, structure, schema
– Data profiling statistics
– Data sampling
– Lineage
Business Metadata
– Descriptive attributes
– Custom properties
– Business glossaries
Search and Discovery
– Search over metadata
– Navigate relationships between entities
Open API
– RESTful API that developers can use to integrate their own applications and use cases, and to extend metadata management beyond Hadoop to other big data systems
Loom Metadata Management
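To make the split between technical metadata, business metadata, and search concrete, here is a small illustrative sketch. The `CatalogEntry` class, its field names, and the sample entries are assumptions made for this example; they are not Loom's registry schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical registry record; field names are illustrative, not Loom's."""
    # Technical metadata
    name: str
    location: str
    data_format: str
    schema: dict
    # Business metadata
    description: str = ""
    custom: dict = field(default_factory=dict)

    def searchable_text(self):
        """Flatten technical and business metadata into one lowercase string."""
        parts = [self.name, self.location, self.data_format, self.description]
        parts += list(self.schema)                       # column names
        parts += [str(value) for value in self.custom.values()]
        return " ".join(parts).lower()


def search(entries, term):
    """Return the names of entries whose metadata contains the search term."""
    term = term.lower()
    return [entry.name for entry in entries if term in entry.searchable_text()]


CATALOG = [
    CatalogEntry("web_logs", "/data/landing/web_logs", "text",
                 {"ip": "string", "url": "string"},
                 description="Raw clickstream from the public site",
                 custom={"owner": "marketing"}),
    CatalogEntry("orders", "/data/clean/orders", "orc",
                 {"order_id": "int", "amount": "double"},
                 description="Cleansed order history",
                 custom={"owner": "finance"}),
]

print(search(CATALOG, "finance"))  # matched through a custom property
print(search(CATALOG, "url"))      # matched through a column name
```

Because the search runs over both kinds of metadata, a user can find a data set by a column name or by a business attribute such as its owner.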
10
• Source Cataloging
– Scanning HDFS and Hive at scheduled intervals looking for new files/directories
– Scanning Sqoop and TDCH jobs to generate lineage
• Source Profiling
– Extracting metadata from file system, Hive metastore, and Sqoop/TDCH configs
– Introspecting files for embedded metadata and data structure
– Generating samples to support data previews
• Data Profiling
– Generating descriptive statistics by introspecting data
• Event Triggers
– Triggering downstream actions based on source cataloging and profiling
Loom Activescan
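The Activescan steps above (catalog new files, profile them, fire event triggers) can be sketched as a single scan pass. This is an illustration under stated assumptions, not Loom's implementation: a temporary local directory stands in for HDFS, files are assumed to be CSV with a header row, and every function name is hypothetical.

```python
import csv
import os
import statistics
import tempfile


def scan_for_new_files(root, known):
    """Return paths under root that are not yet in the known set, and record them."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if path not in known:
                known.add(path)
                found.append(path)
    return found


def profile_csv(path):
    """Compute simple per-column statistics for a CSV file with a header row."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    profile = {"row_count": len(rows), "columns": {}}
    if not rows:
        return profile
    for column in rows[0]:
        values = [row[column] for row in rows]
        stats = {"distinct": len(set(values)), "empty": sum(v == "" for v in values)}
        try:
            numbers = [float(v) for v in values if v != ""]
        except ValueError:
            numbers = []  # treat the column as non-numeric
        if numbers:
            stats.update(min=min(numbers), max=max(numbers),
                         mean=statistics.mean(numbers))
        profile["columns"][column] = stats
    return profile


def run_scan(root, known, triggers):
    """One scan pass: catalog new files, profile them, then fire every trigger."""
    results = {}
    for path in scan_for_new_files(root, known):
        results[path] = profile_csv(path)
        for trigger in triggers:
            trigger(path, results[path])
    return results


# Demo on a temporary local directory standing in for HDFS.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sensors.csv"), "w", newline="") as handle:
        handle.write("device,reading\na,1.5\nb,2.5\na,\n")
    known, events = set(), []
    first = run_scan(root, known,
                     [lambda path, prof: events.append(os.path.basename(path))])
    second = run_scan(root, known, [])  # nothing new on the second pass
    sensor_profile = next(iter(first.values()))
    print(sensor_profile, events, second)
```

In a scheduled setup, `run_scan` would be called at each interval with the same `known` set, so that only files that appeared since the previous pass are profiled and passed to the triggers.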
11
• Data preparation consumes a large amount of an analyst’s time
– Modify and combine column values to create new columns
– Modify schemas – add/delete/rename columns, convert datatypes
• Self-service, interactive UI for working with large data sets
– Work with a sample of the data set for quick iteration
– Once the sample is in the desired form, Loom will apply all of the steps against the full data set via MapReduce
• Leverages the Loom Metadata Registry
– All data cleaning steps are tracked to provide a complete data lineage picture from the raw source data to the data sets used for analytics
– User benefits from context provided by metadata in Loom Registry
Loom Weaver
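The Weaver workflow described above (iterate on a sample, then replay the same steps on the full data set while keeping a record of each step) can be sketched as a recorded recipe. This is only an illustration: the `Recipe` class and the step helpers are hypothetical, and a plain Python loop stands in for the MapReduce execution mentioned on the slide.

```python
class Recipe:
    """Records wrangling steps so they can be replayed on another data set."""

    def __init__(self):
        self.steps = []  # list of (description, function) pairs

    def add(self, description, func):
        self.steps.append((description, func))
        return self

    def apply(self, rows):
        """Run every recorded step, in order, over copies of the given rows."""
        for _description, func in self.steps:
            rows = [func(dict(row)) for row in rows]
        return rows

    def lineage(self):
        """Return the ordered step descriptions as a simple lineage record."""
        return [description for description, _func in self.steps]


def rename(old, new):
    def step(row):
        row[new] = row.pop(old)
        return row
    return step


def to_float(column):
    def step(row):
        row[column] = float(row[column])
        return row
    return step


def combine(new, left, right, sep=" "):
    def step(row):
        row[new] = f"{row[left]}{sep}{row[right]}"
        return row
    return step


FULL = [
    {"first": "Ada", "last": "Lovelace", "amt": "10.5"},
    {"first": "Alan", "last": "Turing", "amt": "3"},
    {"first": "Grace", "last": "Hopper", "amt": "7.25"},
]
SAMPLE = FULL[:1]  # iterate quickly on a small sample first

recipe = (Recipe()
          .add("rename amt -> amount", rename("amt", "amount"))
          .add("convert amount to float", to_float("amount"))
          .add("combine first + last -> full_name",
               combine("full_name", "first", "last")))

preview = recipe.apply(SAMPLE)  # check the result on the sample
result = recipe.apply(FULL)     # then replay the same steps on everything
print(preview[0])
print(recipe.lineage())
```

Keeping the steps as data rather than running them ad hoc is what makes both the replay and the lineage record possible: the same list drives execution and documents how the output was produced.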
12
Loom in the Data Lake
[Diagram: Hadoop data lake with an Import Landing Zone, Fit-for-Use Data, and External Access; activities shown are Integration, Cleansing, Wrangling, and Analysis]
1 Catalog and Profile
2 Enrich and Explore
3 Prepare for Analysis
13 © 2014 Teradata
Loom Architecture
[Diagram labels: Loom Server, Loom Services, Loom API, Loom Activescan, Registry, Persistence; Loom Interface, Loom Workbench; Hadoop Environment (HDFS, Hive, HCat)]
14
Open source tools that extend Hadoop’s capabilities and are integrated with Loom
– Import/Export
- Sqoop/TDCH (relational)
– Storage & Metadata
- Hive-HCatalog
– Processing
- Cascading, HiveQL
– Data Interfaces
- REST APIs (HCatalog), JDBC/ODBC (Hive)
– Security
- LDAP, Kerberos
Integration with Hadoop Ecosystem
Loom is certified and
supported on all major
distributions of Hadoop
Loom is integrated with the
Teradata Open Distribution
for Hadoop (TDH)
15
Summary of Loom Strengths
Simplifies Hadoop Use and Management
Increases Analyst Productivity
Find and Understand Your Data
• Data Cataloging and Profiling with Activescan
• Data Exploration and Discovery
Prepare Your Data for Analysis
• Data Wrangling with Weaver
• SQL-style Transforms with Hive
20
• Teradata Loom Community Edition
– Freely downloadable as an add-on for all Hadoop distributions: teradata.com/tryloom
– Availability:
- Sandbox released on 10/15/2014
- Production-ready release available in mid-November
• Teradata Loom Edition
– Premium version of Loom, subscription-licensed on a per-node basis
– Fully featured & fully supported
– Availability: mid-November
• Will support major Hadoop distributions (TDH, HDP, CDH, MapR)
• Globally available, English only
– North American locale
Teradata Loom Editions
21
Features by edition: Community vs. Loom
Standard Features
– Open metadata repository & API
– Automatic discovery & profiling of new data
– Lineage tracking via Loom UI and Loom API
– Search
– Ambari monitoring (future)
Premium Features
– Data wrangling steps/operations: up to 20 (Community), unlimited (Loom)
– Security authentication using Kerberos/LDAP
– Execution of custom scripts during data discovery
– Automated lineage tracking for data movement outside Hadoop
– Automated lineage tracking of Hive & MapReduce processes
Support
– Community (Community edition), Teradata (Loom edition)
Loom Community Feature Limitations