![Page 1: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/1.jpg)
WORKSHOP
Pentaho Data Integration: Extracting, Integrating, Normalizing and Preparing My Data
Projects. Big Data and Business Intelligence Program
Alex Rayón
[email protected]
November 2015
![Page 2: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/2.jpg)
Before starting…
Who has used a relational database?
Source: http://www.agiledata.org/essays/databaseTesting.html
![Page 3: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/3.jpg)
Before starting… (II)
Who has written scripts or Java code to move data from one source and load it into another?
Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code
![Page 4: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/4.jpg)
Before starting… (III)
What did you use?
1. Scripts
2. Custom Java code
3. ETL
![Page 5: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/5.jpg)
Table of Contents
Pentaho at a glance
In the academic field
ETL
Kettle
Big Data
Predictive Analytics
![Page 7: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/7.jpg)
Pentaho at a glance
Business Intelligence
![Page 8: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/8.jpg)
Pentaho at a glance (II)
![Page 9: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/9.jpg)
Pentaho at a glance (III)
Business Intelligence & Analytics
Open Core
GPL v2
Apache 2.0
Enterprise and OEM licenses
Java-based
Web front-ends
![Page 10: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/10.jpg)
Pentaho at a glance (IV)
The Pentaho Stack
Data Integration / ETL
Big Data / NoSQL
Data Modeling
Reporting
OLAP / Analysis
Data Visualization
Dashboarding
Data Mining / Predictive Analysis
Scheduling
Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/
![Page 11: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/11.jpg)
Pentaho at a glance (V)
Modules
Pentaho Data Integration (Kettle)
Pentaho Analysis (Mondrian)
Pentaho Reporting
Pentaho Dashboards
Pentaho Data Mining (WEKA)
![Page 12: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/12.jpg)
Pentaho at a glance (VI)
Figures
10,000+ deployments
185+ countries
1,200+ customers
In the Gartner Magic Quadrant for BI Platforms since 2012
1 download every 30 seconds
![Page 13: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/13.jpg)
Pentaho at a glance (VII)
Open Source Leader
![Page 14: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/14.jpg)
Pentaho at a glance (VIII)
Single Platform
![Page 16: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/16.jpg)
Academic field
![Page 17: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/17.jpg)
Academic field (II)
![Page 18: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/18.jpg)
Academic field (III)
![Page 19: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/19.jpg)
Academic field (IV)
![Page 20: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/20.jpg)
Academic field (V)
![Page 21: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/21.jpg)
Academic field (VI)
![Page 23: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/23.jpg)
ETL
Definition and characteristics
An ETL tool is a tool that:
Extracts data from various data sources (usually legacy data)
Transforms data
from being optimized for transactions
to being optimized for reporting and analysis
synchronizes the data coming from different databases
cleanses the data to remove errors
Loads data into a data warehouse
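The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular ETL tool works: the CSV content, the `sales` table and the cleansing rules are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export: inconsistent spacing and casing, a duplicate, a bad value.
LEGACY_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
2,BOB,3
3,carol,not-a-number
"""

def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse names, drop duplicates, discard non-numeric amounts."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing: skip rows whose amount is not numeric
        key = int(row["id"])
        if key in seen:
            continue  # remove duplicate records
        seen.add(key)
        clean.append((key, row["name"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleansed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract(LEGACY_CSV)), conn)
print(conn.execute("SELECT * FROM sales ORDER BY id").fetchall())
```

Keeping the three stages as separate functions mirrors the Extract / Transform / Load split; hand-written scripts like this are what an ETL tool is meant to replace.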
![Page 24: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/24.jpg)
ETL
Why do I need it?
ETL tools save time and money when developing a data warehouse by removing the need for hand-coding
Without an external tool, it is very difficult for database administrators to connect databases from different vendors
When databases are altered or new databases need to be integrated, a lot of hand-coded work has to be completely redone
![Page 25: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/25.jpg)
ETL
Business Intelligence
ETL is the heart and soul of business intelligence (BI)
ETL processes bring together and combine data from multiple source systems into a data warehouse
Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html
![Page 26: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/26.jpg)
ETL
Business Intelligence (II)
According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project
Source: http://www.dwuser.com/news/tag/optimization/
Source: The Data Warehousing Institute. www.dw-institute.com
![Page 27: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/27.jpg)
ETL
Processing framework
Source: The Data Warehousing Institute. www.dw-institute.com
![Page 28: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/28.jpg)
ETL
Tools
Source: http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01
![Page 29: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/29.jpg)
ETL
Open Source tools
CloverETL
KETL
Kettle
Talend
![Page 30: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/30.jpg)
ETL
CloverETL
Provides a basic library of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible
Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data processing
![Page 31: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/31.jpg)
ETL
CloverETL (II)
The graphical presentation simplifies even complex data transformations and supports drag-and-drop
Limited to approximately 40 different components to simplify graph creation
Each component can still be configured to meet specific needs
It also features extensive debugging capabilities to ensure all transformation graphs work precisely as intended
![Page 32: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/32.jpg)
ETL
KETL
Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers
The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution
![Page 33: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/33.jpg)
ETL
Kettle
Pentaho produces Kettle as an open-source alternative to commercial ETL software
No relation to Kinetic Networks' KETL
Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs
XML Input Stream handles huge XML files without a loss in performance or a spike in memory usage
Users can also upgrade from the free Kettle version to optional paid features and dedicated technical support
![Page 34: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/34.jpg)
ETL
Talend
Provides a graphical environment for data integration, migration and synchronization
Graphical components are dragged and dropped to generate the Java code required to execute the desired task, saving time and effort
Pre-built connectors enable compatibility with a wide range of business systems and databases
Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration
![Page 35: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/35.jpg)
ETL
Comparison
The criteria used to compare the ETL tools were divided into nine categories:
TCO
Risk
Ease of use
Support
Deployment
Speed
Data Quality
Monitoring
Connectivity
![Page 36: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/36.jpg)
ETL
Comparison (II)
![Page 37: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/37.jpg)
ETL
Comparison (III)
Total Cost of Ownership
The overall cost of a given product
This can include initial ordering, licensing, servicing, support, training, consulting, and any other payments that need to be made before the product is in full use
Commercial Open Source products are typically free to use; what companies pay for is support, training and consulting
![Page 38: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/38.jpg)
ETL
Comparison (IV)
Risk
There are always risks with projects, especially big projects.
Projects can fail by:
Going over budget
Going over schedule
Not meeting the requirements or expectations of the customers
Open Source products carry much lower risk than commercial ones, since they do not restrict the use of their products through pricey licenses
![Page 39: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/39.jpg)
ETL
Comparison (V)
Ease of use
All of the ETL tools, apart from Inaport, have a GUI to simplify the development process
A good GUI also reduces the time needed to learn and use the tools
Pentaho Kettle has the easiest-to-use GUI of all the tools
Training can also be found online or within the community
![Page 40: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/40.jpg)
ETL
Comparison (VI)
Support
Nowadays, all software products have support and all of the ETL tool providers offer support
Pentaho Kettle: offers support from the US and the UK, and has a partner consultant in Hong Kong
Deployment
Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run automatically.
It can be deployed on many different machines, which are then used as “slave servers” to help with transformation processing.
Recommended: one 1 GHz CPU and 512 MB of RAM
![Page 41: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/41.jpg)
ETL
Comparison (VII)
Speed
The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data.
Pentaho Kettle is faster than Talend, but the Java connector slows it down somewhat. Like Talend, it also requires manual tweaking. It can be clustered by deploying it on many machines to reduce network traffic
![Page 42: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/42.jpg)
ETL
Comparison (VIII)
Data Quality
Data Quality is fast becoming the most important feature in any data integration tool.
Pentaho: has data quality (DQ) features in its GUI and allows customized SQL statements, JavaScript and regular expressions. Additional modules are available with a subscription.
Monitoring
Pentaho Kettle: has practical monitoring tools and logging
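As an illustration of rule-based data quality checks with regular expressions, here is a minimal Python sketch. It does not use any Pentaho API; the field names and patterns in `RULES` are assumptions chosen for the example.

```python
import re

# Hypothetical validation rules: field name -> pattern the whole value must match.
RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "postcode": re.compile(r"\d{5}"),
}

def validate(row):
    """Return the names of the fields in `row` that violate their rule."""
    return [
        field for field, pattern in RULES.items()
        if not pattern.fullmatch(row.get(field, ""))
    ]

rows = [
    {"email": "ana@example.org", "postcode": "48007"},
    {"email": "broken-address", "postcode": "4800"},
]

# Route clean rows onward and keep rejected rows together with the failed fields.
valid = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
print(len(valid), "valid;", len(rejected), "rejected")
```

Keeping rejected rows together with the reason for rejection makes it possible to log or review them instead of silently dropping them.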
![Page 43: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/43.jpg)
ETL
Comparison (IX)
Connectivity
In most cases, ETL tools transfer data from legacy systems
Connectivity is therefore very important to the usefulness of an ETL tool.
Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.
![Page 45: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/45.jpg)
Kettle
Introduction
Project Kettle
Powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach
![Page 46: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/46.jpg)
Kettle
Introduction (II)
What is Kettle?
Batch data integration and processing tool written in Java
Exists to retrieve, process and load data
PDI (Pentaho Data Integration) is a synonym
Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230
![Page 47: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/47.jpg)
Kettle
Introduction (III)
It uses an innovative metadata-driven approach
It has a very easy-to-use GUI
Strong community of 13,500 registered users
It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files
![Page 48: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/48.jpg)
Kettle
Introduction (IV)
![Page 49: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/49.jpg)
Kettle
Data Integration Platform
Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
![Page 50: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/50.jpg)
Kettle
Architecture
Source: Pentaho Corporation
![Page 51: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/51.jpg)
Kettle
Most common uses
Data warehouse and data mart loads
Data integration
Data cleansing
Data migration
Data export
etc.
![Page 52: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/52.jpg)
Kettle
Data Integration
Changing input to desired output
Jobs
Synchronous workflow of job entries (tasks)
Transformations
Stepwise parallel & asynchronous processing of a record stream
Distributed
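The contrast between the two execution models can be sketched in Python. This is an illustration of the idea only, not Kettle's engine: `run_transformation`, `run_job` and the `DONE` sentinel are hypothetical names. In the transformation, each step runs in its own thread and rows flow between steps through queues, so all steps work on the record stream at the same time. In the job, entries run strictly one after another.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the record stream

def step(func, inbox, outbox):
    """Run one step: read rows from inbox, apply func, pass results downstream."""
    while True:
        row = inbox.get()
        if row is DONE:
            outbox.put(DONE)
            return
        result = func(row)
        if result is not None:  # a step returning None filters the row out
            outbox.put(result)

def run_transformation(rows, funcs):
    """Start every step in its own thread, connected by queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=step, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for row in rows:
        queues[0].put(row)
    queues[0].put(DONE)
    out = []
    while True:
        row = queues[-1].get()
        if row is DONE:
            break
        out.append(row)
    for t in threads:
        t.join()
    return out

def run_job(entries):
    """Run job entries sequentially; stop at the first entry that fails."""
    for entry in entries:
        if not entry():
            return False
    return True

result = run_transformation(
    range(6),
    [lambda x: x * 10, lambda x: x if x % 20 == 0 else None],
)
print(result)
```

Because each step here has a single thread and the queues are first-in, first-out, row order is preserved; a design with several copies of a step would not guarantee that.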
![Page 53: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/53.jpg)
Kettle
Data Integration challenges
Data is everywhere
Data is inconsistent
Records are different in each system
Performance issues
Running queries that summarize data over a long period ties up the operating system
This pushes the OS to maximum load
Data is never all in the data warehouse
Excel sheets, acquisitions, new applications
![Page 54: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/54.jpg)
Kettle
Transformations
String and Date Manipulation
Data Validation / Business Rules
Lookup / Join
Calculation, Statistics
Cryptography
Decisions, Flow control
Scripting
etc.
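A few of the transformation types listed above (string manipulation, date manipulation, lookup and calculation) can be illustrated with a small Python function applied to one row. The field names and the `COUNTRIES` lookup table are assumptions made for the example; in Kettle these operations would be configured as graphical steps rather than written by hand.

```python
from datetime import datetime

# Hypothetical lookup table: country code -> country name.
COUNTRIES = {"ES": "Spain", "NZ": "New Zealand"}

def transform_row(row):
    """Apply string, date, lookup and calculation operations to one input row."""
    # Date manipulation: parse day/month/year and re-emit as ISO 8601.
    order_date = datetime.strptime(row["date"], "%d/%m/%Y")
    return {
        # String manipulation: trim whitespace and normalize case.
        "customer": row["customer"].strip().upper(),
        "date": order_date.strftime("%Y-%m-%d"),
        # Lookup: resolve the code against the reference table.
        "country": COUNTRIES.get(row["country_code"], "Unknown"),
        # Calculation: derive a new field from two existing ones.
        "total": round(row["units"] * row["unit_price"], 2),
    }

row = {"customer": " deusto ", "date": "24/11/2015", "country_code": "ES",
       "units": 3, "unit_price": 9.99}
print(transform_row(row))
```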
![Page 55: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/55.jpg)
Kettle
What is it good for?
Mirroring data from master to slave
Syncing two data sources
Processing data retrieved from multiple sources and pushed to multiple destinations
Loading data into an RDBMS
Data mart / Data warehouse
Dimension lookup/update step
Graphical manipulation of data
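A dimension lookup/update is typically used to maintain slowly changing dimensions in a data warehouse. The sketch below shows the general idea for a "Type 2" dimension, where a change creates a new version of the row instead of overwriting it. It is plain Python under stated assumptions, not Kettle's implementation: the function name, the in-memory list used as the dimension table, and the column names (`surrogate_key`, `valid_from`, `valid_to`) are all hypothetical.

```python
from datetime import date

def dimension_lookup_update(dimension, natural_key, attributes, today):
    """Return the surrogate key for natural_key, adding a new version on change.

    `dimension` is a list of dicts standing in for the dimension table.
    """
    # Lookup: find the currently valid version of this business key, if any.
    current = next(
        (r for r in dimension
         if r["natural_key"] == natural_key and r["valid_to"] is None),
        None,
    )
    if current is not None and current["attributes"] == attributes:
        return current["surrogate_key"]  # nothing changed: plain lookup
    if current is not None:
        current["valid_to"] = today  # close the outdated version
    # Update: insert a new version with a fresh surrogate key.
    new_row = {
        "surrogate_key": len(dimension) + 1,
        "natural_key": natural_key,
        "attributes": attributes,
        "valid_from": today,
        "valid_to": None,
    }
    dimension.append(new_row)
    return new_row["surrogate_key"]

dim = []
k1 = dimension_lookup_update(dim, "C-1", {"city": "Bilbao"}, date(2015, 1, 1))
k2 = dimension_lookup_update(dim, "C-1", {"city": "Bilbao"}, date(2015, 6, 1))
k3 = dimension_lookup_update(dim, "C-1", {"city": "Madrid"}, date(2015, 11, 1))
print(k1, k2, k3, len(dim))
```

Fact rows loaded before the change keep pointing at the old surrogate key, so historical reports still show the attributes that were valid at the time.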
![Page 56: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/56.jpg)
Kettle
Alternatives
Code
Custom Java
Spring Batch
Scripts
Perl, Python, shell, etc.
Possibly combined with a database loader tool and cron
Commercial ETL tools
DataStage
Informatica
Oracle Warehouse Builder
SQL Server Integration Services
![Page 57: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/57.jpg)
Kettle
Extraction
![Page 58: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/58.jpg)
Kettle
Extraction (II)
Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
![Page 59: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/59.jpg)
Kettle
Extraction (III)
RDBMS (SQL Server, DB2, Oracle, MySQL, PostgreSQL, Sybase IQ, etc.)
NoSQL Data: HBase, Cassandra, MongoDB
OLAP (Mondrian, Palo, XML/A)
Web (REST, SOAP, XML, JSON)
Files (CSV, Fixed, Excel, etc.)
ERP (SAP, Salesforce, OpenERP)
Hadoop Data: HDFS, Hive
Web Data: Twitter, Facebook, Log Files, Web Logs
Others: LDAP/Active Directory, Google Analytics, etc.
![Page 60: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/60.jpg)
Kettle
Transportation
![Page 61: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/61.jpg)
Kettle
Transformation
![Page 62: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/62.jpg)
Kettle
Loading
![Page 63: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/63.jpg)
Kettle
Environment
![Page 64: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/64.jpg)
Kettle
Comparison of Data Integration tools
![Page 66: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/66.jpg)
Big Data
Business Intelligence
A brief (BI) history…
Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
![Page 67: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/67.jpg)
Big Data
WEKA
Project Weka
A comprehensive set of tools for Machine Learning and Data Mining
Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)
![Page 68: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/68.jpg)
Big Data
Among Pentaho’s products
Mondrian: OLAP server written in Java
Kettle: ETL tool
Weka: Machine Learning and Data Mining tool
![Page 69: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/69.jpg)
Big Data
WEKA platform
WEKA (Waikato Environment for Knowledge Analysis)
Funded by the New Zealand government (for more than 10 years)
Develop an open-source, state-of-the-art workbench of data mining tools
Explore fielded applications
Develop new fundamental methods
Became part of the Pentaho platform in 2006 (PDM, Pentaho Data Mining)
![Page 70: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/70.jpg)
Big Data
Data Mining with WEKA
Definition (one of many): extraction of implicit, previously unknown, and potentially useful information from data
Goal: improve marketing, sales, customer support operations, risk assessment, etc.
Who is likely to remain a loyal customer?
What products should be marketed to which prospects?
What determines whether a person will respond to a certain offer?
How can I detect potential fraud?
![Page 71: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/71.jpg)
Big Data
Data Mining with WEKA (II)
Central idea: historical data contains information that will be useful in the future (patterns → generalizations)
Data Mining employs a set of algorithms that automatically detect patterns and regularities in data
![Page 72: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/72.jpg)
Big Data
Data Mining with WEKA (III)
A bank’s case as an example
Problem: predict the probability score of a corporate customer's delinquency (or default) in the next year
Historical customer data used includes:
Customer footings behavior (assets & liabilities)
Customer delinquencies (rates and time data)
Business sector behavioral data
![Page 73: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/73.jpg)
Big Data
Data Mining with WEKA (IV)
Variable selection using the Information Value (IV) criterion
Automatic binning of continuous variables was used (Chi-merge). Manual corrections were made to address particularities in the data distribution of some variables (again using IV)
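The slides do not spell out the Information Value formula. The definition commonly used in credit scoring is IV = Σ (g_i − b_i) · ln(g_i / b_i), where g_i and b_i are the shares of non-delinquent ("good") and delinquent ("bad") customers that fall in bin i. The Python sketch below computes it under that assumption; the function name and the example counts are hypothetical, and it assumes every bin contains at least one good and one bad customer.

```python
import math

def information_value(bins):
    """Compute IV from a list of (good_count, bad_count) pairs, one per bin.

    Assumes every bin has non-zero good and bad counts.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        pct_good = good / total_good
        pct_bad = bad / total_bad
        woe = math.log(pct_good / pct_bad)  # weight of evidence of this bin
        iv += (pct_good - pct_bad) * woe
    return iv

# A variable whose bins separate good and bad customers well has a high IV...
print(information_value([(80, 20), (20, 80)]))
# ...while one whose bins look alike carries no information.
print(information_value([(50, 50), (50, 50)]))
```

Variables can then be ranked by IV, keeping those with the highest values as candidate predictors.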
![Page 74: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/74.jpg)
Big Data
Data Mining with WEKA (V)
![Page 75: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/75.jpg)
Big Data
Data Mining with WEKA (VI)
![Page 76: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/76.jpg)
Big Data
Data Mining with WEKA (VII)
Limitations
Traditional algorithms need to have all data in (main) memory
Big datasets are therefore an issue
Solution
Incremental schemes
Stream algorithms
MOA (Massive Online Analysis)
http://moa.cs.waikato.ac.nz/
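To illustrate what an incremental scheme means, the sketch below keeps a running mean and variance that are updated one record at a time (Welford's method), so the dataset never has to fit in memory. It is a generic Python example chosen for illustration, not code from WEKA or MOA; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Track count, mean and sample variance incrementally, without storing the data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 with fewer than 2 values)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
# In practice the values would arrive from a file or stream, one at a time.
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant regardless of how many records are processed, which is the property stream algorithms rely on.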
![Page 77: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/77.jpg)
Big Data
Be careful with Data Mining
![Page 79: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/79.jpg)
Predictive analytics
Unified solution for Big Data Analytics
![Page 80: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/80.jpg)
Predictive analytics
Unified solution for Big Data Analytics (II)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery for iPad
● Full analytical power on the go, unique to Pentaho
● Mobile-optimized user interface
![Page 81: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/81.jpg)
Predictive analytics
Unified solution for Big Data Analytics (III)
Current release: Pentaho Business Analytics Suite 4.8
Instant and interactive data discovery and development for big data
● Broadens big data access to data analysts
● Removes the need for separate big data visualization tools
● Further improves productivity for big data developers
![Page 82: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/82.jpg)
Predictive analytics
Unified solution for Big Data Analytics (IV)
Pentaho Instaview
● Instaview is simple
○ Created for data analysts
○ Dramatically simplifies ways to access Hadoop and NoSQL data stores
● Instaview is instant & interactive
○ Time accelerator: 3 quick steps from data to analytics
○ Interact with big data sources: group, sort, aggregate & visualize
● Instaview is big data analytics
○ Marketing analysis for weblog data in Hadoop
○ Application log analysis for data in MongoDB
![Page 83: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/83.jpg)
Predictive analytics
Comparison
Source: http://cdn.oreillystatic.com/en/assets/1/event/100/Using%20R%20and%20Hadoop%20for%20Statistical%20Computation%20at%20Scale%20Presentation.htm#/2
![Page 84: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/84.jpg)
References
http://cdn.oreillystatic.com/en/assets/1/event/100/Big%20Data%20Architectural%20Patterns%20Presentation.pdf
http://blog.pentaho.com/tag/strata/
http://www.slideshare.net/mattcasters/pentaho-data-integration-introduction?from_search=2
http://www.slideshare.net/infoaxon/open-source-bi-7640848
http://download.101com.com/tdwi/research_report/2003ETLReport.pdf
http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01
http://www.pentaho.com/Blend-of-the-Week?mkt_tok=3RkMMJWWfF9wsRonuKvNce%2FhmjTEU5z17%2BQoXaO2hokz2EFye%2BLIHETpodcMTcdgPbjYDBceEJhqyQJxPr3DJNAN1dt%2BRhDhCA%3D%3D#Analytics
![Page 85: Pentaho Data Integration: Extrayendo, integrando, normalizando y preparando mis datos](https://reader035.vdocument.in/reader035/viewer/2022062401/5877dd001a28abaa6c8b6963/html5/thumbnails/85.jpg)
Copyright (c) 2015 University of Deusto
This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/
Alex Rayón
November 2015