DWH Concepts Interview Q&A
TRANSCRIPT
8/22/2019 Dwh Concepts Interview Q&A
What is a Data-warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.
What are data marts?
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
What is a star schema?
A star schema can be depicted as a simple star: a central table contains the fact data, and multiple dimension tables radiate out from it, connected by the primary and foreign keys of the database. In a star schema implementation, Warehouse Builder stores the data for all the dimension levels in a single table or view.
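The star layout described above can be sketched with an in-memory SQLite database. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are invented for illustration, not taken from any real schema:

```python
import sqlite3

# Toy star schema: one central fact table joined to two radiating dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_key INTEGER REFERENCES dim_product,
                          date_key    INTEGER REFERENCES dim_date,
                          amount      REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date    VALUES (20190101, 2019), (20200101, 2020);
INSERT INTO fact_sales  VALUES (1, 20190101, 100.0), (1, 20200101, 150.0),
                               (2, 20190101, 80.0);
""")

# A typical star query: join the central fact to each radiating dimension
# on the primary/foreign keys, then aggregate the measure.
rows = con.execute("""
SELECT p.product_name, d.year, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d    ON d.date_key    = f.date_key
GROUP BY p.product_name, d.year
ORDER BY p.product_name, d.year
""").fetchall()
print(rows)
```

Each dimension is reached in exactly one join from the fact table, which is the defining property of the star shape.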
What is Dimensional Modeling?
Dimensional modeling is the process of structuring and organizing data, typically into fact tables (measures) and dimension tables (descriptive context). These data structures are then implemented in a database management system. In addition to defining and organizing the data, the model may also impose constraints or limitations on the data placed within the structure.
What is a snow Flake Schema?
The snowflake schema is a dimensional model that is also composed of a central fact table and a set of constituent dimension tables, which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data: separate database tables or views store the data pertaining to each level in the dimension.
What are the different methods of loading dimension tables?
The data in dimension tables may change over time. Depending on how you want to treat the historical data in the dimension tables, there are three ways of loading these slowly changing dimensions:
Type 1 dimension: do not keep history. Update the record if it is found, else insert it.
Type 2 dimension: do not update the existing record. Create a new record of the dimension (with a version number or change date as part of the key) while retaining the old one.
Type 3 dimension: keep more than one column for each changing attribute. The new value of the attribute is recorded in the existing record, in a previously empty column.
Or, in terms of the physical load method:
Conventional load: before loading the data, all table constraints are checked against the data.
Direct (faster) load: all constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data is not indexed. The conventional and direct load methods are applicable only to Oracle.
What are aggregate tables?
Aggregate tables, also known as summary tables, are fact tables that contain data summarized up to a different (coarser) level of detail.
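The roll-up from a detail fact table to a summary table can be sketched in a few lines of Python; the grains (day x store, month x store) and the sample values are invented for illustration:

```python
from collections import defaultdict

# Detail-level fact rows: (day, store, amount).
detail = [
    ("2019-01-05", "S1", 10.0), ("2019-01-20", "S1", 15.0),
    ("2019-02-03", "S1", 20.0), ("2019-01-10", "S2", 5.0),
]

# An aggregate (summary) table holds the same facts rolled up to a
# coarser level of detail -- here, month x store instead of day x store.
summary = defaultdict(float)
for day, store, amount in detail:
    month = day[:7]                  # truncate the date to its month
    summary[(month, store)] += amount

print(dict(summary))
```

Queries at monthly grain can then read the smaller summary table instead of re-scanning every detail row.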
What is the difference between OLAP and OLTP?
Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP):
OLTP is application oriented; OLAP is used to analyze and forecast business needs.
OLTP data is up to date and consistent at all times; OLAP data is consistent only up to the last update.
OLTP holds detailed data; OLAP holds summarized data.
OLTP data is isolated; OLAP data is integrated.
OLTP queries touch small amounts of data; OLAP queries touch large amounts of data.
OLTP has fast response times; OLAP has slow response times.
OLTP updates are frequent; OLAP updates are less frequent.
In OLTP, concurrency is the biggest performance concern; in OLAP, each report or query requires a lot of resources.
OLTP serves clerical users; OLAP serves managerial/business users.
OLTP targets a specific process, such as ordering from an online store; OLAP integrates data from different processes (ordering, processing, inventory, sales, etc.).
OLTP is performance sensitive; OLAP is performance relaxed.
OLTP accesses few records at a time; OLAP accesses large volumes at a time.
OLTP has read/update access; OLAP is mostly read with occasional updates.
OLTP has no redundancy; in OLAP, redundancy cannot be avoided.
OLTP database sizes are usually around 100 MB to 100 GB; OLAP database sizes are usually around 100 GB to a few TB.
OR
Online transaction processing (OLTP) is designed to efficiently process high volumes of transactions, instantly recording business events (such as a sales invoice payment) and reflecting changes as they occur.
Online analytical processing (OLAP) is designed for analysis and decision support, allowing exploration of often-hidden relationships in large amounts of data by providing unlimited views of multiple relationships at any cross-section of defined business dimensions.
What is ETL?
Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:
* Extracting data from outside sources
* Transforming it to fit operational needs (which can include quality levels)
* Loading it into the end target (database or data warehouse)
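The three steps can be sketched end-to-end with only the standard library. The CSV content, column names, and cleansing rules below are invented for illustration:

```python
import csv, io, sqlite3

# An "outside source": a small CSV with one bad row.
raw = "id,name,amount\n1, alice ,10\n2,BOB,twenty\n3,Carol,30\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize names, coerce types, reject rows that fail quality checks.
clean = []
for r in rows:
    try:
        clean.append((int(r["id"]), r["name"].strip().title(), float(r["amount"])))
    except ValueError:
        pass  # bad data ("twenty" is not a number) is rejected, not loaded

# Load: write the conformed rows into the end target.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER, name TEXT, amount REAL)")
con.executemany("INSERT INTO target VALUES (?, ?, ?)", clean)
print(con.execute("SELECT * FROM target ORDER BY id").fetchall())
```

Real ETL tools add scheduling, logging, and restartability around exactly this extract/transform/load skeleton.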
What are the various ETL tools in the market?
Oracle Warehouse Builder (OWB) 11gR1 (Oracle)
Data Integrator & Data Services XI 3.0 (SAP Business Objects)
Information Server / DataStage 8.1 (IBM)
Data Integration Studio 4.2 (SAS Institute)
PowerCenter 8.5.1 (Informatica)
Repertoire 7.2.2 (Elixir)
Data Migrator 7.6 (Information Builders)
SQL Server Integration Services 10 (Microsoft)
Open Studio 3.1 (Talend)
DataFlow Manager 6.5 (Pitney Bowes Business Insight)
What are various reporting tools in the market?
SSRS (Microsoft), BusinessObjects, Pentaho Reporting, BIRT, Cognos, MicroStrategy, Actuate, QlikView, ProClarity, Excel, Crystal Reports. Related integration products include Data Integrator 8.12 (Pervasive), Transformation Server 5.4 (IBM DataMirror), Transformation Manager 5.2.2 (ETL Solutions Ltd.), Data Manager/Decision Stream 8.2 (IBM Cognos), Clover ETL 2.5.2 (Javlin), ETL4ALL 4.2 (IKAN), DB2 Warehouse Edition 9.1 (IBM), Pentaho Data Integration 3.0 (Pentaho), Adeptia Integration Server 4.9 (Adeptia).
What is a Fact table?
A fact table is a table, typically in a data warehouse, that contains the measures and facts (the primary data). A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.
What is a Dimension table?
Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are customers or products.
What is a look up table?
A lookup table is a referential table in which we pass a key column from the source table and get the required data once the key column matches.
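The pass-the-key-and-get-the-data idea can be sketched with a plain dictionary acting as the lookup table; the country codes and row ids are invented for illustration:

```python
# A lookup keyed by the source's code column.
country_lookup = {"US": "United States", "IN": "India", "DE": "Germany"}

source_rows = [("r1", "US"), ("r2", "DE"), ("r3", "XX")]

# Pass the key column through the lookup; unmatched keys get a default value.
enriched = [(rid, country_lookup.get(code, "Unknown")) for rid, code in source_rows]
print(enriched)
```

In a database the same pattern is usually an outer join from the source table to the lookup table on the key column.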
What are the modeling tools available in the market? Name some of them?
ERwin (Computer Associates)
Embarcadero (Embarcadero Technologies)
Rational Rose (IBM Corporation)
PowerDesigner (Sybase Corporation)
Oracle Designer (Oracle Corporation)
What is normalization? First normal form, second normal form, Third normal form?
Normalization is a series of steps followed to obtain a database design that allows for efficient access and storage of data. These steps reduce data redundancy and the chances of data becoming inconsistent.
First Normal Form
First Normal Form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship.
Two rules follow this definition:
Each table has a primary key made of one or several fields that uniquely identifies each record.
Each field is atomic; it does not contain more than one value.
Second Normal Form
Second Normal Form eliminates functional dependencies on a partial key by putting the fields in a separate table from
those that are dependent on the whole key.
In our example, "wagon_type", "empty_weight", "capacity"... depend only on "wagon_id" and not on the "timestamp" field of the primary key, so this table is not in 2NF. In order to reach 2NF, we have to split the table in two so that each field of each table depends on all the fields of its primary key:
Third Normal Form
Third Normal Form eliminates functional dependencies on non-key fields by putting them in a separate table. At this
stage, all non-key fields are dependent on the key, the whole key and nothing but the key.
In our example, in the first table it is most likely that "empty_weight", "capacity", "designer" and "design_date" depend on "wagon_type", so we have to split this table in two.
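The two splits can be sketched with plain Python tuples, using the wagon fields from the example above; the sample values are invented:

```python
# Unnormalized readings: wagon attributes repeat for every (wagon_id, timestamp) row.
# Columns: (wagon_id, timestamp, wagon_type, empty_weight, capacity, load)
readings = [
    ("W1", "t1", "hopper", 20.0, 100.0, 50.0),
    ("W1", "t2", "hopper", 20.0, 100.0, 60.0),
    ("W2", "t1", "tanker", 25.0, 120.0, 70.0),
]

# 2NF: fields that depend only on wagon_id (part of the key) move to their own table.
wagons = {}
measurements = []
for wagon_id, ts, wtype, empty_w, cap, load in readings:
    wagons[wagon_id] = (wtype, empty_w, cap)    # one row per wagon, no repetition
    measurements.append((wagon_id, ts, load))   # depends on the whole (wagon_id, ts) key

# 3NF: empty_weight and capacity depend on wagon_type, a non-key field,
# so they move again into a wagon_type table.
wagon_types = {wtype: (empty_w, cap) for (wtype, empty_w, cap) in wagons.values()}
wagons_3nf = {wid: wtype for wid, (wtype, _, _) in wagons.items()}

print(wagons_3nf, wagon_types, measurements)
```

After the split, each non-key field depends on the key, the whole key, and nothing but the key.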
What is ODS?
An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.
What type of indexing mechanism do we need to use for a typical data warehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes.
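The idea behind a bitmap index -- one bit-vector per distinct value of a low-cardinality column -- can be sketched with Python integers as bitsets; the column values are invented for illustration:

```python
# A toy bitmap index over a low-cardinality "region" column.
region = ["EAST", "WEST", "EAST", "NORTH", "WEST", "EAST"]

bitmaps = {}
for rowid, value in enumerate(region):
    bitmaps.setdefault(value, 0)
    bitmaps[value] |= 1 << rowid          # set the bit for this row

# Predicates become cheap bitwise operations, e.g. region IN ('EAST', 'NORTH'):
hits = bitmaps["EAST"] | bitmaps["NORTH"]
matching_rows = [r for r in range(len(region)) if hits >> r & 1]
print(matching_rows)
```

This is why bitmaps suit warehouse fact tables: combining several such predicates is just AND/OR over compact bit-vectors.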
Which columns go to the fact table and which columns go to the dimension table?
Changing numeric fields (the measures) go to the fact table; textual, descriptive fields go to the dimension tables.
Or: the elements being measured, before they are broken down, go to the fact table; the elements they are broken down by go to the dimensions.
What is a level of granularity of a fact table? What does this signify?
Granularity is the level of detail at which measures and metrics are represented. The lowest level is called detailed data and the highest level is called summary data. The grain chosen for the fact table depends on the project.
How are the dimension tables designed?
They are typically de-normalized, wide, and short; they use surrogate keys and contain additional date fields and flags.
What are slowly changing dimensions?
Slowly changing dimensions are dimensions in which the data changes slowly, rather than changing regularly on a time basis.
What are non-additive facts? (Inventory, account balances in bank)
Facts are generally additive, but in some businesses a fact may be non-additive, such as inventory levels or bank balances.
What are conformed dimensions?
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions, and concepts in each implementation.
What are SCD1, SCD2, and SCD3?
There are three types of SCDs, and you can use Warehouse Builder to define, deploy, and load all three types of SCDs.
Type 1 SCDs - Overwriting
In a Type 1 SCD the new data overwrites the existing data. The existing data is thus lost, as it is not stored anywhere else. This is the default type of dimension you create; you do not need to specify any additional information to create a Type 1 SCD.
Type 2 SCDs - Creating another dimension record
A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record. Each record contains an effective time and an expiration time to identify the time period during which the record was active.
Type 3 SCDs - Creating a current value field
A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.
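The close-and-insert mechanics of a Type 2 SCD can be sketched in a few lines; the dimension layout (natural key, a "city" attribute, effective/expiry dates) is invented for illustration:

```python
# Minimal SCD Type 2 sketch: close the current row, open a new current row.
def apply_scd2(dim_rows, natural_key, new_value, today):
    """Close the active row for natural_key and insert a new current row."""
    for row in dim_rows:
        if row["key"] == natural_key and row["expiry"] is None:
            row["expiry"] = today            # close the outgoing record
    dim_rows.append({"key": natural_key, "city": new_value,
                     "effective": today, "expiry": None})

dim = [{"key": "C1", "city": "Austin", "effective": "2018-01-01", "expiry": None}]
apply_scd2(dim, "C1", "Dallas", "2019-08-22")

current = [r for r in dim if r["expiry"] is None]
print(len(dim), current[0]["city"])
```

The old row survives with its expiry date set, so history is fully retained, exactly as the Type 2 description above requires.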
Discuss the advantages and disadvantages of star and snowflake schema?
Star schema advantages: fewer joins, faster query operation.
Star schema disadvantages: bigger table sizes, too many rows in the fact table.
Snowflake schema advantages: distributed data; easier to obtain factless data (e.g. orders shipped across one quarter).
Snowflake schema disadvantages: more joins, slower query operation.
In a star schema, every dimension has a primary key and a dimension table has no parent table; hierarchies for the dimensions are stored in the dimension table itself.
In a snowflake schema, a dimension table has one or more parent tables; hierarchies are broken into separate tables. These hierarchies help to drill down the data from the topmost level to the lowermost level.
What is a junk dimension?
A junk dimension is an abstract dimension with the decodes for a group of low-cardinality flags and indicators, thereby removing the flags from the fact table.
What are the differences between view and materialized view?
Views:
A view stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
The query result is not stored on disk or in the database.
When we create a view over a table, the rowids of the view are the same as those of the original table.
With a view we always get the latest data from the database.
A view is only a logical view of the table; no separate copy of the table is kept.
No extra refresh mechanism is required for views.

Materialized views:
A materialized view stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that, every time you run the query the stored result set is used; pros include quick query results.
A materialized view stores the query result on disk, in a table.
A materialized view has rowids different from those of the original table.
With a materialized view we need to refresh the view to get the latest data.
Performance of a materialized view is better than that of a view.
A materialized view is a physically separate copy of the table.
A materialized view needs an extra trigger or some automatic method to keep it refreshed.
Compare data warehousing top down and bottom-up approach?
Top-down approach:
In the top-down design approach, the data warehouse is built first; the data marts are then created from the data warehouse.
It provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
This approach is robust against business changes; creating a new data mart from the data warehouse is very easy.
However, this methodology is inflexible to changing departmental needs during the implementation phase, and it represents a very large project whose implementation cost is significant.

Bottom-up approach:
In the bottom-up design approach, the data marts are created first to provide reporting capability; these data marts are then integrated to build a complete data warehouse.
This model contains consistent data marts, and these data marts can be delivered quickly.
As the data marts are created first, reports can be generated quickly.
The data warehouse can be extended easily to accommodate new business units; it is just a matter of creating new data marts and integrating them with the other data marts.
The positions of the data warehouse and the data marts are reversed relative to the top-down design.
What is factless fact schema?
A factless fact table is a fact table without measures. It can be used to view the number of occurring events. Example: the number of accidents that occurred in a month.
Which kind of index is preferred in DWH?
The index type depends very much on the cardinality of the distinct values. High cardinality would call for a regular B-tree index, whereas very low cardinality would call for bitmap indexes. Small tables may not require indexes at all, since a full-table scan on such a table can be much faster than reading an index.
It actually depends on the nature of the column on which you are going to create the index: a bitmap index if the column is a flag containing 1 or 0; a B-tree index if the column contains many distinct numerical values. Partitions can also be created if the column contains only a small list of values.
what is the architecture of any data warehousing project? What is the flow?
1) The basic step of data warehousing starts with data modeling, i.e. the creation of dimensions and facts.
2) The data warehouse starts with the collection of data from source systems such as OLTP, CRM, ERP, etc.
3) The cleansing and transformation process is done with an ETL (Extraction, Transformation, Loading) tool.
4) By the end of the ETL process, the target databases (dimensions, facts) are ready with data that satisfies the business rules.
5) Finally, with the use of reporting (OLAP) tools, we can get the information which is used for decision support.
Explain Additive, semi-additive, non-additive facts?
A fact table can store different types of measures: additive, semi-additive, and non-additive.
Additive: as the name implies, additive measures are measures that can be added across all dimensions.
Non-additive: unlike additive measures, non-additive measures are measures that cannot be added across any dimension.
Semi-additive: semi-additive measures are measures that can be added across only some dimensions and not across others.
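The semi-additive case can be sketched with an inventory snapshot, which sums across warehouses but not across time; the warehouse names and quantities are invented for illustration:

```python
# (month, warehouse) -> quantity on hand at month end (a periodic snapshot).
snapshots = {
    ("2019-01", "W1"): 100, ("2019-01", "W2"): 50,
    ("2019-02", "W1"): 80,  ("2019-02", "W2"): 60,
}

# Valid: add across the geography dimension within one period.
total_jan = sum(q for (m, w), q in snapshots.items() if m == "2019-01")

# Invalid for a snapshot measure: adding across time would double-count stock.
# A period-end value (or average) is used instead of a sum over time.
latest_month = max(m for (m, _) in snapshots)
total_latest = sum(q for (m, w), q in snapshots.items() if m == latest_month)

print(total_jan, total_latest)
```

Summing the two monthly totals (150 + 140) would not be "290 units of stock"; the measure only supports addition along the warehouse dimension.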
Difference between DWH and ODS
ODS:
Transactions similar to those of an online transaction processing system.
Contains current and near-current data.
Typically detailed data only, often resulting in very large data volumes.
Real-time and near-real-time data loads.
Generally modeled to support rapid data updates; updated at the data-field level.
Used for detailed decision making and operational reporting.
Audience: knowledge workers (customer service representatives, line managers).
Data is volatile.

DWH:
Transactions similar to those of an online analytical system; queries process larger volumes of data.
Contains historical data.
Typically batch data loads.
Generally dimensionally modeled and tuned to optimize query performance; data is appended, not updated.
Used for long-term decision making and management reporting.
Audience: strategic (executives, business unit management).
Data is non-volatile.
what are the steps to build the data warehouse?
Identifying sources
Identifying facts
Defining dimensions
Defining attributes
Redefining dimensions and attributes
Organizing the attribute hierarchy and defining relationships
Assigning unique identifiers
Additional conventions: cardinality / adding ratios
In short: 1) business modeling, 2) data modeling, 3) data from the source databases, 4) Extraction, Transformation, Loading, 5) data warehouse (data marts).
Or:
Extracting the transactional data from the data sources into a staging area
Transforming the transactional data
Loading the transformed data into a dimensional database
Building pre-calculated summary values to speed up report generation
Building (or purchasing) a front-end reporting tool
How do you connect two fact tables? Is it possible?
This is possible through the conformed dimension methodology: a dimension table that is connected to more than one fact table is called a conformed dimension.
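The drill-across pattern behind this answer can be sketched as follows; the keys and measures are invented for illustration. Each fact is aggregated separately at the shared grain and the results are merged on the conformed key, rather than joining the two fact tables directly:

```python
dim_product = {1: "Widget", 2: "Gadget"}            # the conformed dimension
fact_sales = [(1, 500.0), (2, 300.0)]               # (product_key, revenue)
fact_inventory = [(1, 40), (2, 25)]                 # (product_key, on_hand)

# Aggregate each fact at the conformed grain...
sales_by_product = {k: v for k, v in fact_sales}
stock_by_product = {k: v for k, v in fact_inventory}

# ...then merge the two result sets through the shared dimension key.
report = {dim_product[k]: (sales_by_product.get(k, 0.0), stock_by_product.get(k, 0))
          for k in dim_product}
print(report)
```

Joining fact tables to each other directly risks fan-out at mismatched grains, which is why the conformed dimension is the bridge.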
what is the main difference between Inmon and Kimball philosophies of data warehousing?
Bill Inmon's paradigm: the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
What is meant by metadata in context of a data warehouse and how it is important?
Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
Metadata synchronization is the process of consolidating, relating, and synchronizing data elements with the same or similar meaning from different systems. Metadata synchronization joins these differing elements together in the data warehouse to allow for easier access.
What is the role of surrogate keys in a data warehouse and how will you generate them?
A surrogate key is a simple primary key which maps one-to-one with a natural compound primary key. The reason for using surrogate keys is to alleviate the need for the query writer to know the full compound key, and also to speed query processing by removing the need for the RDBMS to process the full compound key when considering a join.
The surrogate key links the dimension and fact tables, and it avoids smart keys and production keys.
How is data in the data warehouse stored after it has been extracted and transformed from heterogeneous sources?
Why is the fact table in normal form?
The foreign keys of a fact table are the primary keys of the dimension tables. It is clear that the fact table contains columns which are primary keys in other tables; that itself makes it a normal-form table.
Or
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures; whenever we have such keys in a table, that itself implies that the table is in normal form.
What is the difference between E-R modelling and dimensional modelling?
The basic difference is that E-R modelling has a logical and a physical model, while a dimensional model has only a physical model. E-R modelling is used for normalizing the OLTP database design; dimensional modelling is used for de-normalizing the ROLAP/MOLAP design.
Can a dimension table contain numeric values?
Yes, a dimension can have numeric values; for example, the surrogate key holds a numeric value for the unique identification of records in the dimension. The descriptive attributes, however, are usually of character type (their values may be numeric or character).
what are the methodologies of data warehousing?
There are mainly two methodologies in data warehousing:
1. Ralph Kimball model: the Kimball model is always structured as a denormalized structure.
2. Inmon model: the Inmon model is structured as a normalized structure.
Depending on the requirements of the company, its DWH will follow one of the above models.
Or:
Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used. Other methodologies are AMM, the World Class methodology, and many more.
what is a surrogate key? Where we use it explain with examples?
A surrogate key is a unique identifier in the database, either for an entity in the modeled world or for an object in the database. Application data is not used to derive a surrogate key; it is generated internally by the current system and is invisible to the user. In the warehouse, the surrogate key typically serves as the dimension's primary key in place of the natural key from the source system. For example, a sequential number can be a surrogate key.
Tell me, what would be the size of your warehouse project?
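The sequential-number approach to generating surrogate keys can be sketched as follows; the natural-key format ("CUST-…") and function name are invented for illustration:

```python
import itertools

# Warehouse-internal sequential keys, independent of the source's natural key.
next_key = itertools.count(1)
key_map = {}                      # natural key -> surrogate key

def surrogate_for(natural_key):
    """Return the existing surrogate, or generate the next sequential one."""
    if natural_key not in key_map:
        key_map[natural_key] = next(next_key)
    return key_map[natural_key]

print(surrogate_for("CUST-0042"), surrogate_for("CUST-0099"), surrogate_for("CUST-0042"))
```

In a database the counter is usually a sequence or identity column, and the key_map is the lookup performed against the dimension during the load.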
What is semi additive and fully additive measures?
Semi-additive: a semi-additive measure can be aggregated along some, but not all, of the dimensions that are included in the measure group that contains the measure. For example, a measure that represents the quantity available for inventory can be aggregated along a geography dimension to produce a total quantity available for all warehouses, but the measure cannot be aggregated along a time dimension because it represents a periodic snapshot of quantities available; aggregating such a measure along a time dimension would produce incorrect results.
Non-additive: a non-additive measure cannot be aggregated along any dimension in the measure group that contains the measure. Instead, the measure must be individually calculated for each cell in the cube that represents the measure. For example, a calculated measure that returns a percentage, such as profit margin, cannot be aggregated from the percentage values of child members in any dimension.
What are the differences between star schema and snow-flake schema?
The star schema is highly denormalized; the snowflake schema is normalized.
Data access latency is less in a star schema; it is more in a snowflake schema when compared to star.
The size of the DWH is larger with a star schema, as it is denormalized; the size is less with a snowflake schema.
The star schema is good for performance; the snowflake schema is better when memory utilization is a major concern.
The star schema reduces the number of joins between tables but requires more storage space; the snowflake schema has minimum storage space and minimum data redundancy, but requires more joins to get information from the lookup tables, hence slower performance.
Where we use star schema & where snow flake?
If performance is the priority, go for the star schema, since there the dimension tables are denormalized.
If memory space is the priority, go for the snowflake schema, since there the dimension tables are normalized.
What is ODS? What data is loaded from it? What is the DW architecture?
ODS: Operational Data Store, normally in 3NF form; data is stored with the least redundancy.
The general architecture of a DWH: OLTP systems -> ODS -> DWH (denormalized star or snowflake, varying case by case).