

What is the fundamental difference between “ETL” and “ELT” in the world of big data?

May 13, 2015 (updated May 26, 2015) | IBM Data Warehousing

Tags: analytics, big data, BigInsights, Cognos, Data Warehouse, ELT, ETL, Hadoop, IBM Cognos, IBM InfoSphere Information Server, IBM PureData, PureData System for Analytics

By Ralf Goetz

Initially, it seems like just a different sequence of the two characters “T” and “L”. But this difference often separates successful big data projects from failed ones. Why is that? And how can you avoid falling into the most common data management traps around mastering big data? Let’s examine this topic in more detail.

Why are big data projects different from traditional data warehouse projects?

Big data projects are usually characterized by one or a combination of these four (or five) data requirements:

- Volume: the volume of (raw) data
- Variety: the variety of data (e.g. structured, unstructured, semi-structured)
- Velocity: the speed of data processing, consumption or analysis of data
- Veracity: the level of trust in the data
- (Value): the value behind the data

For big data, each of these “V”s is larger by orders of magnitude. For example, a traditional data warehouse data volume is usually around several hundred gigabytes or a low number of terabytes, while big data projects typically handle data volumes of hundreds or even thousands of terabytes. Another example: traditional data warehouse systems manage and process only structured data, whereas typical big data projects need to manage and process both structured and unstructured data.

With this in mind, it is obvious that traditional technologies and methodologies for data warehousing may not be sufficient to handle these big data requirements.


Mastering the data and information supply chain using traditional ETL

This brings us to a widely adopted methodology for data integration called “Extraction, Transformation and Load” (ETL). ETL is a very common methodology in data warehousing and business analytics projects and can be performed by custom programming (e.g. scripts or custom ETL applications) or with the help of state-of-the-art ETL platforms such as IBM InfoSphere Information Server (http://www-01.ibm.com/software/data/integration/info_server/).

The fundamental concept behind most ETL implementations is the restriction of the data in the supply chain. Only data that is presumably important is identified, extracted and loaded into a staging area inside a database, and later into the data warehouse. “Presumably” is the weakness in this concept. Who really knows which data is required for which analytic insight and requirement, now and in the future? Who knows which legal or regulatory requirements must be followed in the months and years to come?
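To make this concrete, here is a minimal sketch of a hand-rolled ETL job (the file names, column names and source layout are all hypothetical). Note how the decision about which data is “important” is hard-coded at the very start of the pipeline:

```python
import csv

# Hypothetical hand-rolled ETL job: only the columns judged "important"
# survive the extraction step -- everything else is lost for good.
IMPORTANT_COLUMNS = ["order_id", "customer_id", "order_total", "order_date"]

def extract(path):
    """Extract: read the source, keeping only the pre-selected columns."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {col: row[col] for col in IMPORTANT_COLUMNS}

def transform(rows):
    """Transform: cleanse and convert in flight, before anything is stored."""
    for row in rows:
        row["order_total"] = float(row["order_total"])
        row["order_date"] = row["order_date"][:10]  # normalize to YYYY-MM-DD
        yield row

def load(rows, staging_path):
    """Load: write the already-reduced data into the staging area."""
    with open(staging_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=IMPORTANT_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("orders_export.csv")), "staging_orders.csv")
```

If a new analytic question later needs a column that was dropped during extraction, the whole pipeline – and often the source export as well – has to change.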

Each change in the definition and scope of the information and data supply chain requires a considerable amount of effort, time and budget, and is a risk for any production system. There must be a resolution for this dilemma – and here it comes.

A new “must-follow” paradigm for big data: ELT

Just a little change in the sequence of two letters will mean everything to the success of your big data project: ELT (Extraction, Load and Transform). This change seems small, but the difference lies in the overall concept of data management. Instead of restricting the data sources to only “presumably” important data (and all the steps this entails), what if we take all available data and put it into a flexible, powerful big data platform such as the Hadoop-based IBM InfoSphere BigInsights (http://www-01.ibm.com/software/data/infosphere/hadoop/enterprise.html) system?

Data storage in Hadoop is flexible, powerful, almost unlimited, and cost efficient – since it can use commodity hardware and scales across many computing nodes and local storage.

Hadoop is a schema-on-read system. It allows the storage of all kinds of data without knowing its format or definition (e.g. JSON, images, movies, text files, spreadsheets, log files and many more). Without the limitation on the amount of extracted data that the ETL methodology imposes, we can be sure that we have all the data we need today and may need in the future. This also reduces the required effort for the identification of “important” data – this step can literally be skipped: we take all we can get and keep it!
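Here is a small sketch of what schema-on-read looks like in practice. It assumes Apache Spark as the query engine on top of HDFS (an assumption for illustration, not something the BigInsights stack prescribes), and the path and field names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The landing zone holds raw JSON events exactly as they arrived; no schema
# was declared when they were written. (Path and field names are hypothetical.)
events = spark.read.json("hdfs:///landing/clickstream/2015/05/")

# The schema is discovered at read time, not enforced at write time.
events.printSchema()

# Only now, at query time, do we decide which fields matter for this question.
events.filter(events.event_type == "purchase") \
      .groupBy("product_id") \
      .count() \
      .show()
```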


Since Hadoop offers a scalable data storage and processing platform, we can use these features as a replacement for the traditional staging area inside a database. From here we can take only the data that is required today and analyze it either directly with a business intelligence platform such as IBM Cognos (http://www-01.ibm.com/software/analytics/cognos/) or IBM SPSS (http://www-03.ibm.com/software/products/de/spss-modeler), or use an intermediate layer with deep and powerful analytic capabilities such as IBM PureData System for Analytics (http://www-01.ibm.com/software/data/puredata/analytics/).
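Sketching the “T” of ELT in the same assumed Spark-on-Hadoop setup (the paths, table and column names are hypothetical): the transformation runs inside the big data platform, and only the curated result is handed on to the analytics layer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-transform-demo").getOrCreate()

# The full, untouched raw history sits in Hadoop; derive only the slice
# that today's analysis actually needs. (All names are hypothetical.)
spark.read.parquet("hdfs:///landing/orders/").createOrReplaceTempView("raw_orders")

curated = spark.sql("""
    SELECT   customer_id,
             SUM(order_total) AS lifetime_value,
             COUNT(*)         AS order_count
    FROM     raw_orders
    WHERE    order_date >= '2015-01-01'
    GROUP BY customer_id
""")

# Only this much smaller, analysis-ready result leaves the platform,
# e.g. for load into the warehouse or the analytics appliance.
curated.write.mode("overwrite").parquet("hdfs:///curated/customer_value/")
```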

Refining raw data and gaining valuable insights

Hadoop is great for storage and processing of raw data, but applying powerful, lightning-fast, complex analytic queries is not its strength, so another analytics layer makes sense. PureData System for Analytics (http://www-01.ibm.com/software/data/puredata/analytics/) is the perfect place for the subsequent in-database analytic processing of “valued” data because of its massively parallel processing (MPP) architecture and its rich set of analytic functions. PureData can resolve even the most complex analytic queries in only a fraction of the time compared to traditional relational databases. And it scales – from a big data starter project with only a couple of terabytes of data to a petabyte-sized PureData cluster.
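The in-database pattern can be sketched as follows. The connection string and schema are placeholders, and pyodbc is just one common way to reach a Netezza-based appliance via ODBC; the point is that the heavy aggregation executes inside the MPP engine and only the small result set travels back to the client:

```python
import pyodbc  # ODBC is a common way to connect to Netezza/PureData

# Placeholder DSN and credentials for a configured PureData system.
conn = pyodbc.connect("DSN=PUREDATA;UID=analyst;PWD=secret")
cursor = conn.cursor()

# The scan and aggregation of the large fact table run inside the appliance,
# parallelized across its MPP nodes. (Table and column names are hypothetical.)
cursor.execute("""
    SELECT   region,
             COUNT(*)         AS orders,
             SUM(order_total) AS revenue,
             AVG(order_total) AS avg_order
    FROM     sales_facts
    GROUP BY region
""")

# Only this small, finished result set crosses the network.
for region, orders, revenue, avg_order in cursor.fetchall():
    print(region, orders, revenue, avg_order)

conn.close()
```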


IBM offers everything you need to master your big data challenges. You can start very small and scale with your growing requirements. Big data projects can be fun with the right technology and services!


About Ralf Goetz

Ralf is an Expert Level Certified IT Specialist in the IBM Software Group. Ralf joined IBM through the Netezza acquisition in early 2011. For several years, he led the Informatica tech-sales team in the DACH region and the Mahindra Satyam BI competency team in Germany. He then became a technical pre-sales representative for Netezza and later for the PureData System for Analytics. Ralf is still focusing on PDA but also supports the technical sales of all IBM big data products. Ralf holds a Master’s degree in computer science.

Follow @striple66
