getting the data into the warehouse: extract, transform, load

17
Getting the data into the warehouse: extract, transform, load MIS2502 Data Analytics

Upload: cade-foley

Post on 02-Jan-2016

27 views

Category:

Documents


1 download

DESCRIPTION

Getting the data into the warehouse: extract, transform, load. MIS2502 Data Analytics. Getting the information into the data mart. Now let ’ s address this part…. Extract, Transform, Load (ETL). The process of copying data from the transactional database to the analytical database - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Getting the data into the warehouse: extract, transform, load

Getting the data into the warehouse: extract, transform, loadMIS2502

Data Analytics

Page 2: Getting the data into the warehouse: extract, transform, load

Getting the information into the data mart

Now let’s address this part…

Page 3: Getting the data into the warehouse: extract, transform, load

Extract, Transform, Load (ETL)

• The process of copying data from the transactional database to the analytical database

• Going from relational to dimensional

• Basically, it’s a matter of identifying where the data should come from to fill the data mart

Page 4: Getting the data into the warehouse: extract, transform, load

ETL Defined

Page 5: Getting the data into the warehouse: extract, transform, load

The Actual Process

Transactional Database 2

Transactional Database 1

Data Mart

Query

Query

Data conversion

Data conversion

Query

Query

Extract Transform Load

Page 6: Getting the data into the warehouse: extract, transform, load

Main ETL Issues: Conversion Stage

Page 7: Getting the data into the warehouse: extract, transform, load

Data Consistency: The Problem with Legacy Systems

• An IT infrastructure evolves over time• Systems are created and acquired by different people using different specifications

This can happen through:•Changes in management•Mergers & Acquisitions•Externally mandated standards•General poor planning

Page 8: Getting the data into the warehouse: extract, transform, load

This leads to many issues

• Redundant data across the organization• Customer record maintained by accounts

receivable and marketing

• The same data element stored in different formats• Social Security number (123-45-6789 versus

123456789)

• Different naming conventions• “Doritos” versus “Frito-Lay’s Doritos” verus

“Regular Doritos”

• Different unique identifiers used• Account_Number versus Customer_ID

What are the problems with each of these

?

Page 9: Getting the data into the warehouse: extract, transform, load

What’s the big deal?

• This is a fundamental problem for creating data cubes

• We often need to combine information from several transactional databases

• How do we know if we’re talking about the same customer or product?

Page 10: Getting the data into the warehouse: extract, transform, load

Now think about this scenarioHotel Reservation Database Café Database

CustomerCustomer_numberCustomer_nameCustomer_addressCustomer_cityCustomer_zipcode

OrderOrder_numberCustomer_numberHotel_idFood_item_idOrder_dateOrder_timeTable_number

Food itemOrder numberFood_item_idOrder_dateOrder_time

HotelsHotel_idCountry_codeHotel_nameHotel_addressHotel_cityHotel_zipcode

HotelsHotel_idCountry_codeHotel_nameHotel_addressHotel_cityHotel_zipcode

CountriesCountry_codeCountry_currencyCountry_name

Hotel roomsRoom_numberHotel_idRoom_typeRoom_floor

Room typesRoom_type_codeRoom_standard_rateRoom_descriptionSmoking_YN

Room BookingsBooking_idRoom_type_codeHotel_idCheckin_dateNumber_of_daysRoom_count

Guest BookingsBooking_idGuest_number

GuestsGuest_numberGuest_firstnameGuest_lastnameGuest_addressGuest_cityGuest_zipcodeGuest_email

Hotel Amenities LookupCharacteristic_idCharacteristic_description

Hotel AmenitiesCharacteristic_idHotel_id

Page 11: Getting the data into the warehouse: extract, transform, load

Solution: “Single view” of data

• The entire organization understands a unit of data in the same way

• It’s both a business goal and a technology goal

and really more this…

..than this

Page 12: Getting the data into the warehouse: extract, transform, load

Closer look at the Guest/CustomerGuests

Guest_numberGuest_firstnameGuest_lastnameGuest_addressGuest_cityGuest_zipcodeGuest_email

CustomerCustomer_numberCustomer_nameCustomer_addressCustomer_cityCustomer_zipcode

vs.vs.

Page 13: Getting the data into the warehouse: extract, transform, load

Organizational issues

• Why might there be resistance to data standardization?

• Is it an option to just “fix” the transactional databases?

• If two data elements conflict, who’s standard “wins?”

Page 14: Getting the data into the warehouse: extract, transform, load

Data Quality

• The degree to which the data reflects the actual environment

Page 15: Getting the data into the warehouse: extract, transform, load

Finding the right data

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Page 16: Getting the data into the warehouse: extract, transform, load

Ensuring accuracy

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Page 17: Getting the data into the warehouse: extract, transform, load

Reliability of the collection process

Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf