50571561-isas-etl-final

Post on 25-Nov-2014


BHAVANI.P, SUBHASHINI.V, PUNNIYAA

Introduction

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

• Extracting data from outside sources
• Transforming it to fit operational needs (which can include quality levels)
• Loading it into the end target (database or data warehouse)
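The three steps above can be sketched as a small Python program. The CSV layout, the column names, and the derived `amount_usd` rate are invented for illustration and are not part of the original text:

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a CSV flat file (a common source format)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: coerce types and derive a value to fit the target's needs."""
    out = []
    for row in rows:
        out.append({
            "name": row["name"].strip().title(),        # clean free-form text
            "amount": float(row["amount"]),             # translate text to a number
            "amount_usd": float(row["amount"]) * 1.1,   # hypothetical derived value
        })
    return out

def load(rows, conn):
    """Load: insert the cleaned rows into the end target (here, SQLite)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (:name, :amount, :amount_usd)", rows)
    conn.commit()
```

A real pipeline would replace SQLite with the actual warehouse target; the shape of the three functions stays the same.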

Extract

Common data source formats are relational databases and flat files, but they may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping.

An intrinsic part of the extraction involves parsing the extracted data, resulting in a check of whether the data meets an expected pattern or structure.
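Such a structure check is often a simple pattern match. The record layout below (customer id, ISO date, amount, separated by semicolons) is a hypothetical example, not a format named in the text:

```python
import re

# Hypothetical expected pattern for an extracted record:
# "<customer_id>;<iso_date>;<amount>", e.g. "C1042;2014-11-25;199.99"
RECORD_RE = re.compile(r"^C\d+;\d{4}-\d{2}-\d{2};\d+(\.\d+)?$")

def parse_record(line):
    """Return the parsed fields, or None if the line fails the structure check."""
    if not RECORD_RE.match(line):
        return None  # reject: data does not meet the expected pattern
    customer_id, date, amount = line.split(";")
    return {"customer_id": customer_id, "date": date, "amount": float(amount)}
```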

Transform

• Selecting only certain columns to load
• Translating coded values
• Encoding free-form values
• Deriving a new calculated value
• Filtering
• Sorting
• Joining data from multiple sources
• Aggregation
• Generating surrogate-key values
• Transposing or pivoting
• Splitting a column into multiple columns
• Disaggregating repeating columns into a separate detail table
• Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
• Applying any form of simple or complex data validation
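A few of the transformations listed above can be sketched on a single row. The column names, the status filter, and the gender code table are invented for illustration:

```python
# Illustrative lookup table for translating coded values (assumed codes).
GENDER_CODES = {"M": "Male", "F": "Female"}

_next_key = 0
def surrogate_key():
    """Generate a surrogate-key value independent of any source-system key."""
    global _next_key
    _next_key += 1
    return _next_key

def transform_row(row):
    """Apply several of the listed transformations to one source row."""
    if row["status"] != "active":          # filtering
        return None
    return {
        "sk": surrogate_key(),             # surrogate-key generation
        "name": row["name"],               # selecting only certain columns
        "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # translating codes
        "total": row["qty"] * row["price"],  # deriving a calculated value
    }
```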

Load

The load phase loads the data into the end target, usually the data warehouse (DW).

The timing and scope to replace or append are strategic design choices that depend on the time available and the business needs.

As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, also apply (for example, uniqueness, referential integrity, and mandatory fields).
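A minimal SQLite sketch of schema constraints surfacing during the load phase (the table and its columns are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Constraints defined in the target schema apply during the load phase.
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,   -- uniqueness enforced on load
        email TEXT NOT NULL          -- mandatory-field constraint
    )
""")

def load_row(row):
    """Attempt to load one row; the database rejects rows violating constraints."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO customers VALUES (:id, :email)", row)
        return True
    except sqlite3.IntegrityError:
        return False  # e.g. duplicate key or NULL in a NOT NULL column
```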

ETL Cycle

The typical real-life ETL cycle consists of the following execution steps:

• Cycle initiation
• Build reference data
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
• Stage (load into staging tables, if used)
• Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
• Publish (to target tables)
• Archive
• Clean up
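The cycle steps can be sketched as a driver that runs hypothetical stage functions in order (all data and rules below are invented placeholders):

```python
# Hypothetical step functions; each stands in for one stage of the cycle.
def build_reference_data():
    return {"country_codes": {"US": "United States"}}

def extract_sources():
    return [{"country": "US", "amount": "10"}]

def validate(rows):
    # keep only rows whose amount parses as a number
    return [r for r in rows if r["amount"].isdigit()]

def transform(rows, ref):
    return [{"country": ref["country_codes"][r["country"]],
             "amount": int(r["amount"])} for r in rows]

def run_cycle():
    """Execute the cycle steps in order; staging and audit are simplified."""
    ref = build_reference_data()           # build reference data
    rows = validate(extract_sources())     # extract, then validate
    rows = transform(rows, ref)            # clean / apply business rules
    staging = list(rows)                   # stage into a staging area
    assert all(r["amount"] >= 0 for r in staging)  # audit check
    return staging                         # publish to target tables
```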

Challenges

ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage.

The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time.

Performance

• Use the direct-path extract method or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting a high-speed extract.
• Do most of the transformation processing outside of the database.
• Use bulk load operations whenever possible. Still, even with bulk operations, database access is usually the bottleneck in the ETL process.
• Partition tables (and indices). Try to keep partitions similar in size (watch for null values, which can skew the partitioning).
• Do all validation in the ETL layer before the load. Disable integrity checking in the target database tables during the load.
• Disable triggers in the target database tables during the load. Simulate their effect as a separate step.
• Generate IDs in the ETL layer.
• Drop the indexes (on a table or partition) before the load, and recreate them after the load.
• Use parallel bulk load when possible.
• If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately.
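The last point, deciding in the ETL layer which rows are inserts, updates, or deletes, can be sketched as a simple key comparison (the `id` key is an assumed natural key for illustration):

```python
def partition_changes(source_rows, target_keys):
    """Decide in the ETL layer which rows to insert, update, or delete,
    so each operation can then be run against the database separately."""
    source_keys = {r["id"] for r in source_rows}
    inserts = [r for r in source_rows if r["id"] not in target_keys]
    updates = [r for r in source_rows if r["id"] in target_keys]
    deletes = sorted(target_keys - source_keys)  # keys gone from the source
    return inserts, updates, deletes
```

Each of the three lists can then feed a separate bulk statement rather than row-by-row "upsert" logic.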

Parallel Processing

(Diagram: sources feed a central ETL layer, which feeds the targets.)

ETL applications implement three main types of parallelism:

Data: By splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: Allowing the simultaneous running of several components on the same data stream. For example: looking up a value on record 1 at the same time as adding two fields on record 2.

Component: The simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file.
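Data parallelism, the first type above, can be sketched by splitting one input into chunks and running the same step on all chunks at once. The chunking scheme and the upper-casing step are invented placeholders; a real job would split files, not in-memory lists:

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, n):
    """Data parallelism: split one sequential input into n smaller chunks
    so the chunks can be processed concurrently."""
    return [rows[i::n] for i in range(n)]

def clean(chunk):
    """Per-chunk work: a stand-in transformation (upper-casing a field)."""
    return [r.upper() for r in chunk]

def parallel_clean(rows, n=3):
    """Run the same step on each chunk concurrently, then recombine."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        chunks = list(pool.map(clean, split(rows, n)))
    return [r for chunk in chunks for r in chunk]
```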

Rerunnability, recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece.
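The tagging described above can be sketched in a few lines (the row contents are placeholders; in practice the tags would be columns in the staging tables):

```python
import itertools

_row_counter = itertools.count(1)

def tag_rows(rows, run_id):
    """Tag each row with a unique row_id and the run_id of the piece of the
    process that produced it, so a failed piece can be identified later."""
    return [dict(r, row_id=next(_row_counter), run_id=run_id) for r in rows]

def rollback(rows, failed_run_id):
    """Roll back: discard everything produced by the failed run,
    so that piece can be rerun cleanly."""
    return [r for r in rows if r["run_id"] != failed_run_id]
```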

Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
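A minimal checkpoint sketch, persisting the last completed phase to disk so a restart can resume rather than begin from scratch (file format and field names are assumptions):

```python
import json
import os

def save_checkpoint(path, phase, state):
    """At a checkpoint, write the completed phase and its state to disk."""
    with open(path, "w") as f:
        json.dump({"phase": phase, "state": state}, f)

def resume_from(path):
    """On restart, pick up from the last completed phase, if any."""
    if not os.path.exists(path):
        return {"phase": None, "state": {}}
    with open(path) as f:
        return json.load(f)
```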

Best practices

Four-layered approach for ETL architecture design

Use file-based ETL processing where possible

Use data-driven methods and minimize custom ETL coding

Qualities of a good ETL architecture design

Tools

Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex.

ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities.

Open-source ETL frameworks

• Apatar
• CloverETL
• Flat File Checker
• Jitterbit 2.0
• Pentaho Data Integration (now included in OpenOffice Base)
• RapidMiner
• Scriptella
• Talend Open Studio

Proprietary ETL frameworks

• IBM InfoSphere DataStage
• Informatica PowerCenter
• Oracle Data Integrator (ODI)
• Ab Initio
• Altova MapForce
• HiT Software Allora
• Digital Fuel ServiceFlow
• Phocas ETL
• Microsoft SQL Server Integration Services (SSIS)

The Pentaho BI Project is open source application software providing enterprise reporting, analysis, dashboard, data mining, workflow, and ETL capabilities for business intelligence needs.

Business Model Pentaho uses a subscription model: its commercial open source business model eliminates software license fees, providing support, services, and product enhancements via an annual subscription. A commercial open source company, Pentaho "leads and sponsors" the open source projects that are core to its suite, giving it direct influence over software development.

Pentaho’s Board of Directors & Investors

The composition of the Board and investors is a strong, balanced blend of skills and experience, allowing them to offer guidance in core areas important to Pentaho.

Management and Technical Leads

The core project team at Pentaho has been together for many years and through success after success. It includes highly experienced industry leaders with a strong record of creating successful BI products for top-tier commercial vendors, including:

Business Objects, Cognos, Hyperion, IBM, Oracle, and SAS

COMPONENTS OF PENTAHO BI SUITE ENTERPRISE EDITION

• The Pentaho BI Suite provides a full spectrum of business intelligence (BI) capabilities including query and reporting, interactive analysis, dashboards, data integration/ETL, data mining, and a BI platform that has made it the world's most popular open source BI suite.

• Pentaho Enterprise Edition products provide comprehensive technical support, software maintenance, and enhanced functionality.

• Pentaho's technology was architected from the ground up as a modern, fully integrated BI platform built on open standards.

• That means it fits easily into any IT infrastructure, out of the box or embedded in a custom application.

Pentaho Reporting

• Flexible deployment from standalone desktop reporting to embedded reporting and enterprise business intelligence

• Broad data source support including relational, OLAP, or XML-based data sources

• Popular output options including Adobe PDF, HTML, Microsoft Excel, Rich Text Format, or plain text

• Web-based ad hoc query and reporting for business users

• Enterprise Edition provides enhanced software functionality, comprehensive professional technical support, product expertise, certified software and software maintenance.

Embedded reporting

Operational Reporting

Production Reporting

Pentaho Report Designer

• Design reports quickly with the streamlined report wizard that takes authors from a blank canvas to a highly polished report in four simple steps.

• Connect to diverse data sources including relational data, Pentaho Analysis, flat files, Java objects, or even stream data directly from Pentaho Data Integration transformations to design reports.

• Create and view user prompts, including dynamic cascading prompts.

• Publish directly to the BI server to give business users instant access to the information they need.

• Add rich data visualizations with over 15 customizable chart types, barcodes, sparklines, survey scales, and more.

• Localize reports easily to support multi-lingual deployment with a single report file.

• Embed HTML and JavaScript controls for dynamic and interactive online reports.

• Fine-tune reports using the built-in interactive preview mode.

Pentaho Analysis

• Freely explore business information by drilling into and cross-tabulating data
• Experience speed-of-thought response times to complex analytical queries
• View information multi-dimensionally, choosing specific metrics and attributes to analyze
• Deploy stand-alone or integrated with other products in the Pentaho BI Suite

Pentaho Analyzer

Pentaho Analyzer provides intuitive, interactive analytical reporting letting non-technical business users quickly understand business information. As part of the enhanced functionality in Pentaho Analysis Enterprise Edition, Analyzer features:

• Web-based, drag-and-drop report creation
• Advanced sorting and filtering
• Customized totals and user-defined calculations
• Chart visualizations
• And much more

Pentaho Dashboards

Pentaho Dashboards delivers the visibility by providing:

Rich, interactive displays including Adobe Flash-based visualizations so that business users can immediately see which business metrics are on track, and which need attention

Self-service dashboard designer that lets business users easily create personalized dashboards with zero training

Integration with Pentaho Reporting and Pentaho Analysis so that users can drill to underlying reports and analysis to understand what factors are contributing to good or bad performance

Portal integration to make it easy to deliver relevant business metrics to large numbers of users, seamlessly integrated into their application

Integrated alerting to continuously monitor for exceptions and notify users to take action

Pentaho Data Integration

• Powers instantaneous, iterative BI application development
• Enables seamless collaboration between developers and end users
• Merges complex BI development into a single process
• Dramatically reduces time and difficulty of building and deploying BI apps

With Pentaho Data Integration, Pentaho is redefining the way that BI applications are built and deployed. Utilizing Pentaho's Agile BI approach, Pentaho Data Integration unifies the ETL, modeling and visualization processes into a single, integrated environment that enables developers and end users to work seamlessly together. The end result is that BI developers and end users can build BI applications more quickly, easily and at a small fraction of the cost of traditional solutions. (Figure: Pentaho's Agile BI)

Pentaho Data Integration is a full-featured ETL solution including:

• Rich transformation library with over 100 out-of-the-box mapping objects
• Broad data source support including packaged applications, over 30 open source and proprietary database platforms, flat files, Excel documents and more
• Advanced data warehousing support for Slowly Changing and Junk Dimensions
• Proven enterprise-class performance and scalability
• Integration with the Pentaho BI Suite for Enterprise Information Integration (EII), advanced scheduling, and process integration
• Unified ETL, modeling and visualization development environment for design of BI applications

Pentaho Data Integration Transformation Screenshot

Pentaho Data Integration Job Screenshot

Common use cases for Pentaho Data Integration include:

• Data warehouse population
• Agile design of BI applications
• Information enrichment by integrating data from various sources
• Data migration between applications
• Imports of data into databases from text files, Excel spreadsheets, relational systems and more
• Data cleansing by applying complex conditions in data transformations
• Exploration of data in existing databases (tables, views, etc.)

Pentaho Data Mining

Data Mining is the process of running data through sophisticated algorithms to uncover meaningful patterns and correlations that may otherwise be hidden. These can be used to understand the business better and also exploited to improve future performance through predictive analytics.

Pentaho Data Mining is differentiated by its open, standards-compliant nature, use of Weka data mining technology, and tight integration with core business intelligence capabilities including reporting, analysis and dashboards. Other data mining offerings lack this level of sophistication and integration.

Pentaho Data Mining can be deployed as:

An out-of-the-box solution for immediate deployment to analysts. As far as end users are concerned, data mining operates entirely in the background; users see results and recommendations through e-mail or web pages, which can include Pentaho Dashboards.

A set of components that enable Java™ developers to quickly create custom reporting solutions using Java Objects or Java Server Pages (JSPs). These can be tightly integrated with other applications or portals.

Together with other components of the overall Pentaho BI Suite

Features and Benefits

Provides insight into hidden patterns and relationships in your data

Enables you to exploit these correlations to improve organizational performance

Provides indicators of future performance

Enables embedding of recommendations in your applications

Enables you to take full advantage of a range of data mining algorithms

Technology

Powerful Data Mining Engine

Provides a comprehensive set of machine learning algorithms from the Weka project including clustering, segmentation, decision trees, random forests, neural networks, and principal component analysis.

Pentaho has added integration with Pentaho Data Integration and automated the process of transforming data into the format the data mining engine needs.

Algorithms can either be applied directly to a dataset or called from Java code.

Output can be viewed graphically, interacted with programmatically, or used as a data source for reports, further analysis, and other processes.

Filters are provided for discretization, normalization, re-sampling, attribute selection, and transforming and combining attributes.

Classifiers provide models for predicting nominal or numeric quantities. Learning schemes include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, and other advanced techniques.
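Weka's actual engine is a Java library; as a language-neutral sketch of the instance-based family named above, here is a minimal nearest-neighbor classifier in plain Python (the training data is invented, and this is a toy illustration, not Weka's implementation):

```python
import math

def nearest_neighbor_classify(train, query):
    """Instance-based classification: predict the label of the training
    instance closest (Euclidean distance) to the query point."""
    features, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label
```

Real schemes add distance weighting, multiple neighbors (k-NN), and attribute normalization, but the predict-from-stored-instances idea is the same.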

The data mining engine is also well-suited for developing new machine learning schemes, enabling customers to incorporate their own models.

Inputs and outputs can be controlled programmatically, enabling developers to create completely custom solutions using the components provided.

Graphical Design Tools

Graphical user interfaces are provided for data pre-processing, classification, regression, clustering, association rules, and visualization.

Data Mining - Boundary Visualizer

Data Mining – Classify Panel

Data Mining- Knowledge Flow

Data Mining- Explorer

CUSTOMER SUCCESSES

Pentaho customers address a wide range of BI challenges using services and software from Pentaho. Many Pentaho customers use Pentaho for reporting, data integration, dashboards, and/or analysis. Some use multiple modules or the full Pentaho BI Suite. With subscription services and open source licensing from Pentaho, customers can get best-in-class BI capabilities with the peace of mind of professional support, software maintenance, training, consulting, and more.

The following is a small sample of the many organizations around the globe that depend on Pentaho for commercial open source business intelligence.

"Our only regret was that we didn't have Pentaho for data integration years ago. Immediately we were able to see the increased operational efficiency, reduced internal costs and greater customer value using Pentaho Data Integration."

Deployment Overview

Key Challenges
• Cumbersome, manual process for creation and distribution of reports
• Multiple data points, including Google Analytics, needed to be integrated and automated into one report

Pentaho Solution
• Pentaho Data Integration
• Business and implementation services by Pentaho Systems Integrator Partner, DEFTeam Solutions

Results
• Increased operational efficiency
• Reduced internal costs
• Greater customer value

Why Pentaho
• Low cost
• Flexibility
• Speed-to-market

"We needed to deliver a business intelligence solution that would show immediate benefit by increasing efficiencies, containing costs, and helping drive revenue. By using Pentaho BI Suite Enterprise Edition, we were able to do so in a fiscally responsible manner, and in today's economic climate that is of utmost importance."

Deployment Overview

Key Challenges
• Gaining better insight across the organization to help steer strategic decision-making
• Conducting deeper analysis on historical data across all facets of its service offerings

Pentaho Solution
• Pentaho BI Suite Enterprise Edition for data integration, reporting and analysis
• CentOS, PostgreSQL database

Results
• Company-wide performance gains through better visibility into customer, cost, and revenue trends
• Increased operational efficiency, reduced internal costs and greater customer value

Why Pentaho
• End-to-end BI capabilities
• Value vs. proprietary BI
• Enterprise Edition features

"The simplicity of the interface actually allows Lifetime Entertainment Services to give direct access to business analysts, allowing them to understand and manage the business rules governing the integration of information. That wasn't previously possible with complex hand-coded integration jobs."

Deployment Overview

Key Challenges
• Optimizing advertising processes to drive ad revenue growth
• Adapting data integration infrastructure to keep up with changing business rules

Pentaho Solution
• Pentaho Data Integration Enterprise Edition
• Selected over Informatica and BusinessObjects Data Integrator
• Continued use of Business Objects BI tools

Results
• Ability for business analysts to manage integration rules and adapt integration processes to company business rules

Why Pentaho
• Ease of use
• Cost of ownership
• Enterprise Edition features

"ActivePivot (tm) uniquely marries the concept of online analytical process with real-time position-keeping; something no other company currently offers. Thanks to Pentaho Spreadsheet Services we can now offer seamless MDX connectivity to Microsoft Excel."

Deployment Overview

Key Challenges
• Excel-based access to analytic application data
• Maximizing margins on analytic software solution for financial institutions

Pentaho Solution
• Pentaho Analysis
• Pentaho Spreadsheet Services

Results
• Competitive differentiation based on Excel-based access to centralized information

Why Pentaho
• Low costs delivered by commercial open source business model
• Standards-based offering allowing Excel-based connectivity to live OLAP data

"Pentaho's BI suite and top-notch professional support enabled us to deliver a successful, high-value BI solution at a much lower cost than would have been possible with the expensive, proprietary alternatives."

Deployment Overview

Key Challenges
• Understanding the effectiveness of its online marketing activities
• Outgrowing a Microsoft Excel-based reporting system
• Maintaining complex, hand-coded ETL scripts

Pentaho Solution
• Pentaho BI Suite Enterprise Edition
• IBM servers, SUSE Linux, 1.5 terabyte Microsoft SQL Server data warehouse
• Professional services from Pentaho partner OpenBI

Results
• Automated integration of clickstream data with Google Analytics and catalog sales activity data
• Greater visibility into website traffic, keyword performance and revenue attribution

Why Pentaho
• Standards-based, cross-platform support
• Quality of support and services

"With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution."

Deployment Overview

Key Challenges
• Measuring and optimizing agent performance, customer satisfaction, and marketing ROI
• Getting an integrated, strategic view across multiple operational systems

Pentaho Solution
• Pentaho Data Integration Enterprise Edition
• Red Hat Enterprise Linux, MySQL database
• Continued use of proprietary BI tools (MicroStrategy)
• Product expertise

Results
• Three-fold performance increase, 8-hour reduction in batch load times
• Simplified maintenance and reduced costs

Why Pentaho
• Functionality and flexibility
• Professional support

AWARDS AND RECOGNITION


Thank You!
