DWH Concepts Interview Q&A
TRANSCRIPT
8/22/2019 Dwh Concepts Interview Q&A
What is a Data-warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
What are data marts?
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
What is a star schema?
A star schema can be depicted as a simple star: a central table contains the fact data, and multiple dimension tables radiate out from it, connected by the primary and foreign keys of the database. In a star schema implementation, Warehouse Builder stores the data for all the dimension levels in a single table or view.
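The star layout described above can be sketched with an in-memory SQLite database. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are invented for illustration, not taken from any real schema:

```python
import sqlite3

# Toy star schema: one central fact table joined to two radiating dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product,
                          date_key    INTEGER REFERENCES dim_date,
                          amount      REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date    VALUES (20190101, 2019), (20200101, 2020);
INSERT INTO fact_sales  VALUES (1, 20190101, 100.0), (1, 20200101, 150.0),
                               (2, 20190101, 80.0);
""")

# A typical star query: join the central fact to each radiating dimension
# on the primary/foreign keys, then aggregate the measure.
rows = con.execute("""
SELECT p.product_name, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key    = f.date_key
GROUP BY p.product_name, d.year
ORDER BY p.product_name, d.year
""").fetchall()
print(rows)
```

Each dimension is reached in exactly one join from the fact table, which is the defining property of the star shape.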
What is Dimensional Modeling?
Dimensional modeling is the process of structuring and organizing data, typically into fact tables (measures) and dimension tables (descriptive context). These data structures are then implemented in a database management system. In addition to defining and organizing the data, the model may also impose constraints or limitations on the data placed within the structure.
What is a snow Flake Schema?
The snowflake schema is a dimensional model that is also composed of a central fact table and a set of constituent dimension tables, which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data: separate database tables or views store the data pertaining to each level in the dimension.
What are the different methods of loading dimension tables?
The data in dimension tables may change over time. Depending on how you want to treat the historical data in the dimension tables, there are three ways of loading these slowly changing dimensions:
Type 1 dimension: do not keep history. Update the record if it is found, else insert it.
Type 2 dimension: do not update the existing record. Create a new record of the dimension (with a version number or change date as part of the key) while retaining the old one.
Type 3 dimension: keep more than one column for each changing attribute. The new value of the attribute is recorded in the existing record, in a previously empty column.
Or, in terms of the physical load method:
Conventional load: before loading the data, all table constraints are checked against the data.
Direct (faster) load: all constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data is not indexed. The conventional and direct load methods are applicable only to Oracle.
What are aggregate tables?
Aggregate tables, also known as summary tables, are fact tables that contain data summarized up to a different (coarser) level of detail.
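The roll-up from a detail fact table to a summary table can be sketched in a few lines of Python; the grains (day x store, month x store) and the sample values are invented for illustration:

```python
from collections import defaultdict

# Detail-level fact rows: (day, store, amount).
detail = [
    ("2019-01-05", "S1", 10.0), ("2019-01-20", "S1", 15.0),
    ("2019-02-03", "S1", 20.0), ("2019-01-10", "S2", 5.0),
]

# An aggregate (summary) table holds the same facts rolled up to a
# coarser level of detail -- here, month x store instead of day x store.
summary = defaultdict(float)
for day, store, amount in detail:
    month = day[:7]                  # truncate the date to its month
    summary[(month, store)] += amount

print(dict(summary))
```

Queries at monthly grain can then read the smaller summary table instead of re-scanning every detail row.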
What is the difference between OLAP and OLTP?
Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP):
OLTP is application oriented; OLAP is used to analyze and forecast business needs.
OLTP data is up to date and consistent at all times; OLAP data is consistent only up to the last update.
OLTP holds detailed data; OLAP holds summarized data.
OLTP data is isolated; OLAP data is integrated.
OLTP queries touch small amounts of data; OLAP queries touch large amounts of data.
OLTP has fast response times; OLAP has slow response times.
OLTP updates are frequent; OLAP updates are less frequent.
In OLTP, concurrency is the biggest performance concern; in OLAP, each report or query requires a lot of resources.
OLTP serves clerical users; OLAP serves managerial/business users.
OLTP targets a specific process, such as ordering from an online store; OLAP integrates data from different processes (ordering, processing, inventory, sales, etc.).
OLTP is performance sensitive; OLAP is performance relaxed.
OLTP accesses few records at a time; OLAP accesses large volumes at a time.
OLTP has read/update access; OLAP is mostly read with occasional updates.
OLTP has no redundancy; in OLAP, redundancy cannot be avoided.
OLTP database sizes are usually around 100 MB to 100 GB; OLAP database sizes are usually around 100 GB to a few TB.
OR
Online transaction processing (OLTP) is designed to efficiently process high volumes of transactions, instantly recording business events (such as a sales invoice payment) and reflecting changes as they occur.
Online analytical processing (OLAP) is designed for analysis and decision support, allowing exploration of often-hidden relationships in large amounts of data by providing unlimited views of multiple relationships at any cross-section of defined business dimensions.
What is ETL?
Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
* Extracting data from outside sources
* Transforming it to fit operational needs (which can include quality levels)
* Loading it into the end target (database or data warehouse)
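The three steps can be sketched end-to-end with only the standard library. The CSV content, column names, and cleansing rules below are invented for illustration:

```python
import csv, io, sqlite3

# An "outside source": a small CSV with one bad row.
raw = "id,name,amount\n1, alice ,10\n2,BOB,twenty\n3,Carol,30\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize names, coerce types, reject rows that fail quality checks.
clean = []
for r in rows:
    try:
        clean.append((int(r["id"]), r["name"].strip().title(), float(r["amount"])))
    except ValueError:
        pass  # bad data ("twenty" is not a number) is rejected, not loaded

# Load: write the conformed rows into the end target.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, name TEXT, amount REAL)")
con.executemany("INSERT INTO target VALUES (?, ?, ?)", clean)
print(con.execute("SELECT * FROM target ORDER BY id").fetchall())
```

Real ETL tools add scheduling, logging, and restartability around exactly this extract/transform/load skeleton.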
What are the various ETL tools in the market?
Oracle Warehouse Builder (OWB) 11gR1 (Oracle)
Data Integrator & Data Services XI 3.0 (SAP Business Objects)
Information Server / DataStage 8.1 (IBM)
Data Integration Studio 4.2 (SAS Institute)
PowerCenter 8.5.1 (Informatica)
Repertoire 7.2.2 (Elixir)
Data Migrator 7.6 (Information Builders)
SQL Server Integration Services 10 (Microsoft)
Open Studio 3.1 (Talend)
DataFlow Manager 6.5 (Pitney Bowes Business Insight)
What are various reporting tools in the market?
SSRS (Microsoft), BusinessObjects, Pentaho Reporting, BIRT, Cognos, MicroStrategy, Actuate, QlikView, ProClarity, Excel, Crystal Reports. Related integration products include Data Integrator 8.12 (Pervasive), Transformation Server 5.4 (IBM DataMirror), Transformation Manager 5.2.2 (ETL Solutions Ltd.), Data Manager/Decision Stream 8.2 (IBM Cognos), Clover ETL 2.5.2 (Javlin), ETL4ALL 4.2 (IKAN), DB2 Warehouse Edition 9.1 (IBM), Pentaho Data Integration 3.0 (Pentaho), Adeptia Integration Server 4.9 (Adeptia).
What is a Fact table?
A fact table is a table, typically in a data warehouse, that contains the measures and facts (the primary data). A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.
What is a Dimension table?
Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are customers or products.
What is a look up table?
A lookup table is a referential table in which we pass a key column from the source table and get the required data once the key column matches.
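The pass-the-key-and-get-the-data idea can be sketched with a plain dictionary acting as the lookup table; the country codes and row ids are invented for illustration:

```python
# A lookup keyed by the source's code column.
country_lookup = {"US": "United States", "IN": "India", "DE": "Germany"}

source_rows = [("r1", "US"), ("r2", "DE"), ("r3", "XX")]

# Pass the key column through the lookup; unmatched keys get a default value.
enriched = [(rid, country_lookup.get(code, "Unknown")) for rid, code in source_rows]
print(enriched)
```

In a database the same pattern is usually an outer join from the source table to the lookup table on the key column.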
What are the modeling tools available in the market? Name some of them?
ERwin (Computer Associates)
Embarcadero (Embarcadero Technologies)
Rational Rose (IBM Corporation)
PowerDesigner (Sybase Corporation)
Oracle Designer (Oracle Corporation)
What is normalization? First normal form, second normal form, Third normal form?
Normalization is a series of steps followed to obtain a database design that allows for efficient access and storage of data. These steps reduce data redundancy and the chances of data becoming inconsistent.
First Normal Form
First Normal Form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship.
Two rules follow this definition:
Each table has a primary key made of one or several fields that uniquely identifies each record.
Each field is atomic; it does not contain more than one value.
Second Normal Form
Second Normal Form eliminates functional dependencies on a partial key by putting the fields in a separate table from
those that are dependent on the whole key.
In our example, "wagon_type", "empty_weight", "capacity"... depend only on "wagon_id" and not on the "timestamp" field of the primary key, so this table is not in 2NF. In order to reach 2NF, we have to split the table in two so that each field of each table depends on all the fields of its primary key:
Third Normal Form
Third Normal Form eliminates functional dependencies on non-key fields by putting them in a separate table. At this
stage, all non-key fields are dependent on the key, the whole key and nothing but the key.
In our example, in the first table it is most likely that "empty_weight", "capacity", "designer" and "design_date" depend on "wagon_type", so we have to split this table in two.
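The two splits can be sketched with plain Python tuples, using the wagon fields from the example above; the sample values are invented:

```python
# Unnormalized readings: wagon attributes repeat for every (wagon_id, timestamp) row.
# Columns: (wagon_id, timestamp, wagon_type, empty_weight, capacity, load)
readings = [
    ("W1", "t1", "hopper", 20.0, 100.0, 50.0),
    ("W1", "t2", "hopper", 20.0, 100.0, 60.0),
    ("W2", "t1", "tanker", 25.0, 120.0, 70.0),
]

# 2NF: fields that depend only on wagon_id (part of the key) move to their own table.
wagons = {}
measurements = []
for wagon_id, ts, wtype, empty_w, cap, load in readings:
    wagons[wagon_id] = (wtype, empty_w, cap)    # one row per wagon, no repetition
    measurements.append((wagon_id, ts, load))   # depends on the whole (wagon_id, ts) key

# 3NF: empty_weight and capacity depend on wagon_type, a non-key field,
# so they move again into a wagon_type table.
wagon_types = {wtype: (empty_w, cap) for (wtype, empty_w, cap) in wagons.values()}
wagons_3nf = {wid: wtype for wid, (wtype, _, _) in wagons.items()}

print(wagons_3nf, wagon_types, measurements)
```

After the split, each non-key field depends on the key, the whole key, and nothing but the key.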
What is ODS?
An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.
What type of indexing mechanism do we need to use for a typical data warehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes.
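The idea behind a bitmap index -- one bit-vector per distinct value of a low-cardinality column -- can be sketched with Python integers as bitsets; the column values are invented for illustration:

```python
# A toy bitmap index over a low-cardinality "region" column.
region = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]

bitmaps = {}
for rowid, value in enumerate(region):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << rowid          # set the bit for this row

# Predicates become cheap bitwise operations, e.g. region IN ('EAST', 'NORTH'):
hits = bitmaps["EAST"] | bitmaps["NORTH"]
matching_rows = [r for r in range(len(region)) if hits >> r & 1]
print(matching_rows)
```

This is why bitmaps suit warehouse fact tables: combining several such predicates is just AND/OR over compact bit-vectors.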
Which columns go to the fact table and which columns go to the dimension table?
Changing numeric fields (the measures) go to the fact table; textual, descriptive fields go to the dimension tables.
Or: the elements being measured, before they are broken down, go to the fact table; the elements they are broken down by go to the dimensions.
What is a level of granularity of a fact table? What does this signify?
Granularity is the level of detail at which measures and metrics are represented. The lowest level is called detailed data and the highest level is called summary data. The grain chosen for the fact table depends on the project.
How are the dimension tables designed?
They are typically de-normalized, wide, and short; they use surrogate keys and contain additional date fields and flags.
What are slowly changing dimensions?
Slowly changing dimensions are dimensions in which the data changes slowly, rather than changing regularly on a time basis.
What are non-additive facts? (Inventory, account balances in bank)
Facts are generally additive, but in some businesses a fact may be non-additive, such as inventory levels or bank balances.
What are conformed dimensions?
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions, and concepts in each implementation.
What are SCD1, SCD2, and SCD3?
There are three types of SCDs, and you can use Warehouse Builder to define, deploy, and load all three types of SCDs.
Type 1 SCDs - Overwriting
In a Type 1 SCD the new data overwrites the existing data. The existing data is thus lost, as it is not stored anywhere else. This is the default type of dimension you create; you do not need to specify any additional information to create a Type 1 SCD.
Type 2 SCDs - Creating another dimension record
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record. Each record contains an effective time and an expiration time to identify the time period during which the record was active.
Type 3 SCDs - Creating a current value field
A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.
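The close-and-insert mechanics of a Type 2 SCD can be sketched in a few lines; the dimension layout (natural key, a "city" attribute, effective/expiry dates) is invented for illustration:

```python
# Minimal SCD Type 2 sketch: close the current row, open a new current row.
def apply_scd2(dim_rows, natural_key, new_value, today):
    """Close the active row for natural_key and insert a new current row."""
    for row in dim_rows:
        if row["key"] == natural_key and row["expiry"] is None:
            row["expiry"] = today            # close the outgoing record
    dim_rows.append({"key": natural_key, "city": new_value,
                     "effective": today, "expiry": None})

dim = [{"key": "C1", "city": "Austin", "effective": "2018-01-01", "expiry": None}]
apply_scd2(dim, "C1", "Dallas", "2019-08-22")

current = [r for r in dim if r["expiry"] is None]
print(len(dim), current[0]["city"])
```

The old row survives with its expiry date set, so history is fully retained, exactly as the Type 2 description above requires.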
Discuss the advantages and disadvantages of star and snowflake schema?
Star schema advantages: fewer joins, faster query operation.
Star schema disadvantages: bigger table sizes, too many rows in the fact table.
Snowflake schema advantages: distributed data; easier to obtain factless data (e.g. orders shipped across one quarter).
Snowflake schema disadvantages: more joins, slower query operation.
In a star schema, every dimension has a primary key and a dimension table has no parent table; hierarchies for the dimensions are stored in the dimension table itself.
In a snowflake schema, a dimension table has one or more parent tables; hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.
What is a junk dimension?
A junk dimension is an abstract dimension with the decodes for a group of low-cardinality flags and indicators, thereby removing the flags from the fact table.
What are the differences between view and materialized view?
Views:
A view stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
The query result is not stored on disk or in the database.
When we create a view over a table, the rowids of the view are the same as those of the original table.
With a view we always get the latest data from the database.
A view is only a logical view of the table; no separate copy of the table is kept.
No extra refresh mechanism is required for views.

Materialized views:
A materialized view stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that, every time you run the query the stored result set is used; pros include quick query results.
A materialized view stores the query result on disk, in a table.
A materialized view has rowids different from those of the original table.
With a materialized view we need to refresh the view to get the latest data.
Performance of a materialized view is better than that of a view.
A materialized view is a physically separate copy of the table.
A materialized view needs an extra trigger or some automatic method to keep it refreshed.
Compare data warehousing top down and bottom-up approach?
Top-down approach:
In the top-down design approach, the data warehouse is built first; the data marts are then created from the data warehouse.
It provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
This approach is robust against business changes; creating a new data mart from the data warehouse is very easy.
However, this methodology is inflexible to changing departmental needs during the implementation phase, and it represents a very large project whose implementation cost is significant.

Bottom-up approach:
In the bottom-up design approach, the data marts are created first to provide reporting capability; these data marts are then integrated to build a complete data warehouse.
This model contains consistent data marts, and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units; it is just a matter of creating new data marts and integrating them with the other data marts.
The positions of the data warehouse and the data marts are reversed relative to the top-down design.
What is factless fact schema?
A factless fact table is a fact table without measures. It can be used to view the number of occurring events. Example: the number of accidents that occurred in a month.
Which kind of index is preferred in DWH?
The index type depends very much on the cardinality of the distinct values. High cardinality would call for a regular B-tree index, whereas very low cardinality would call for bitmap indexes. Small tables may not require indexes at all, since a full-table scan on such a table can be much faster than reading an index.
It actually depends on the nature of the column on which you are going to create the index: a bitmap index if the column is a flag containing 1 or 0; a B-tree index if the column contains many distinct numerical values. Partitions can also be created if the column contains only a small list of values.
what is the architecture of any data warehousing project? What is the flow?
1) The basic step of data warehousing starts with data modeling, i.e. the creation of dimensions and facts.
2) The data warehouse starts with the collection of data from source systems such as OLTP, CRM, ERP, etc.
3) The cleansing and transformation process is done with an ETL (Extraction, Transformation, Loading) tool.
4) By the end of the ETL process, the target databases (dimensions, facts) are ready with data that satisfies the business rules.
5) Finally, with the use of reporting (OLAP) tools, we can get the information which is used for decision support.
Explain Additive, semi-additive, non-additive facts?
A fact table can store different types of measures: additive, semi-additive, and non-additive.
Additive: as the name implies, additive measures are measures that can be added across all dimensions.
Non-additive: unlike additive measures, non-additive measures are measures that cannot be added across any dimension.
Semi-additive: semi-additive measures are measures that can be added across only some dimensions and not across others.
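The semi-additive case can be sketched with an inventory snapshot, which sums across warehouses but not across time; the warehouse names and quantities are invented for illustration:

```python
# (month, warehouse) -> quantity on hand at month end (a periodic snapshot).
snapshots = {
    ("2019-01", "W1"): 100, ("2019-01", "W2"): 50,
    ("2019-02", "W1"): 80,  ("2019-02", "W2"): 60,
}

# Valid: add across the geography dimension within one period.
total_jan = sum(q for (m, w), q in snapshots.items() if m == "2019-01")

# Invalid for a snapshot measure: adding across time would double-count stock.
# A period-end value (or average) is used instead of a sum over time.
latest_month = max(m for (m, _) in snapshots)
total_latest = sum(q for (m, w), q in snapshots.items() if m == latest_month)

print(total_jan, total_latest)
```

Summing the two monthly totals (150 + 140) would not be "290 units of stock"; the measure only supports addition along the warehouse dimension.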
Difference between DWH and ODS
ODS:
Transactions similar to those of an online transaction processing system.
Contains current and near-current data.
Typically detailed data only, often resulting in very large data volumes.
Real-time and near-real-time data loads.
Generally modeled to support rapid data updates; updated at the data-field level.
Used for detailed decision making and operational reporting.
Audience: knowledge workers (customer service representatives, line managers).
Data is volatile.

DWH:
Transactions similar to those of an online analytical system; queries process larger volumes of data.
Contains historical data.
Typically batch data loads.
Generally dimensionally modeled and tuned to optimize query performance; data is appended, not updated.
Used for long-term decision making and management reporting.
Audience: strategic (executives, business unit management).
Data is non-volatile.
what are the steps to build the data warehouse?
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organizing the attribute hierarchy and defining relationships
Assigning unique identifiers
Additional conventions: cardinality / adding ratios
In short: 1) business modeling, 2) data modeling, 3) data from the source databases, 4) Extraction, Transformation, Loading, 5) data warehouse (data marts).
Or:
Extracting the transactional data from the data sources into a staging area
Transforming the transactional data
Loading the transformed data into a dimensional database
Building pre-calculated summary values to speed up report generation
Building (or purchasing) a front-end reporting tool
How do you connect two fact tables? Is it possible?
This is possible through the conformed dimension methodology: a dimension table that is connected to more than one fact table is called a conformed dimension.
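The drill-across pattern behind this answer can be sketched as follows; the keys and measures are invented for illustration. Each fact is aggregated separately at the shared grain and the results are merged on the conformed key, rather than joining the two fact tables directly:

```python
dim_product = {1: "Widget", 2: "Gadget"}            # the conformed dimension
fact_sales = [(1, 500.0), (2, 300.0)]               # (product_key, revenue)
fact_inventory = [(1, 40), (2, 25)]                 # (product_key, on_hand)

# Aggregate each fact at the conformed grain...
sales_by_product = {k: v for k, v in fact_sales}
stock_by_product = {k: v for k, v in fact_inventory}

# ...then merge the two result sets through the shared dimension key.
report = {dim_product[k]: (sales_by_product.get(k, 0.0), stock_by_product.get(k, 0))
          for k in dim_product}
print(report)
```

Joining fact tables to each other directly risks fan-out at mismatched grains, which is why the conformed dimension is the bridge.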
what is the main difference between Inmon and Kimball philosophies of data warehousing?
Bill Inmon's paradigm: the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
What is meant by metadata in context of a data warehouse and how it is important?
Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems. Metadata synchronization joins these differing elements together in the data warehouse to allow for easier access.
What is the role of surrogate keys in a data warehouse and how will you generate them?
A surrogate key is a simple primary key which maps one-to-one with a natural compound primary key. The reason for using surrogate keys is to alleviate the need for the query writer to know the full compound key, and also to speed query processing by removing the need for the RDBMS to process the full compound key when considering a join.
The surrogate key links the dimension and fact tables, and it avoids smart keys and production keys.
How is data in the data warehouse stored after it has been extracted and transformed from heterogeneous sources?
Why is the fact table in normal form?
The foreign keys of a fact table are the primary keys of the dimension tables. It is clear that the fact table contains columns which are primary keys in other tables; that itself makes it a normal-form table.
Or
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures; whenever we have such keys in a table, that itself implies that the table is in normal form.
What is the difference between E-R modelling and dimensional modelling?
The basic difference is that E-R modelling has a logical and a physical model, while a dimensional model has only a physical model. E-R modelling is used for normalizing the OLTP database design; dimensional modelling is used for de-normalizing the ROLAP/MOLAP design.
Can a dimension table contain numeric values?
Yes, a dimension can have numeric values; for example, the surrogate key holds a numeric value for the unique identification of records in the dimension. The descriptive attributes, however, are usually of character type (their values may be numeric or character).
what are the methodologies of data warehousing?
There are mainly two methodologies in data warehousing:
1. Ralph Kimball model: the Kimball model is always structured as a denormalized structure.
2. Inmon model: the Inmon model is structured as a normalized structure.
Depending on the requirements of the company, its DWH will follow one of the above models.
Or:
Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used. Other methodologies are AMM, the World Class methodology, and many more.
what is a surrogate key? Where we use it explain with examples?
A surrogate key is a unique identifier in the database, either for an entity in the modeled world or for an object in the database. Application data is not used to derive a surrogate key; it is generated internally by the current system and is invisible to the user. In the warehouse, the surrogate key typically serves as the dimension's primary key in place of the natural key from the source system. For example, a sequential number can be a surrogate key.
Tell me, what would be the size of your warehouse project?
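The sequential-number approach to generating surrogate keys can be sketched as follows; the natural-key format ("CUST-…") and function name are invented for illustration:

```python
import itertools

# Warehouse-internal sequential keys, independent of the source's natural key.
next_key = itertools.count(1)
key_map = {}                      # natural key -> surrogate key

def surrogate_for(natural_key):
    """Return the existing surrogate, or generate the next sequential one."""
    if natural_key not in key_map:
        key_map[natural_key] = next(next_key)
    return key_map[natural_key]

print(surrogate_for("CUST-0042"), surrogate_for("CUST-0099"), surrogate_for("CUST-0042"))
```

In a database the counter is usually a sequence or identity column, and the key_map is the lookup performed against the dimension during the load.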
What is semi additive and fully additive measures?
Semi-additive: a semi-additive measure can be aggregated along some, but not all, of the dimensions that are included in the measure group that contains the measure. For example, a measure that represents the quantity available for inventory can be aggregated along a geography dimension to produce a total quantity available for all warehouses, but the measure cannot be aggregated along a time dimension because it represents a periodic snapshot of quantities available; aggregating such a measure along a time dimension would produce incorrect results.
Non-additive: a non-additive measure cannot be aggregated along any dimension in the measure group that contains the measure. Instead, the measure must be individually calculated for each cell in the cube that represents the measure. For example, a calculated measure that returns a percentage, such as profit margin, cannot be aggregated from the percentage values of child members in any dimension.
What are the differences between star schema and snow-flake schema?
The star schema is highly denormalized; the snowflake schema is normalized.
Data access latency is less in a star schema; it is more in a snowflake schema when compared to star.
The size of the DWH is larger with a star schema, as it is denormalized; the size is less with a snowflake schema.
The star schema is good for performance; the snowflake schema is better when memory utilization is a major concern.
The star schema reduces the number of joins between tables but requires more storage space; the snowflake schema has minimum storage space and minimum data redundancy, but requires more joins to get information from the lookup tables, hence slower performance.
Where we use star schema & where snow flake?
If performance is the priority, go for the star schema, since there the dimension tables are denormalized.
If memory space is the priority, go for the snowflake schema, since there the dimension tables are normalized.
What is ODS? What data is loaded from it? What is the DW architecture?
ODS: Operational Data Store, normally in 3NF form; data is stored with the least redundancy.
The general architecture of a DWH: OLTP systems -> ODS -> DWH (denormalized star or snowflake, varying case by case).