DWH Concepts Interview Q&A



What is a data warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

    What are data marts?

A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

    What is a star schema?

A star schema model can be depicted as a simple star: a central table contains fact data and multiple tables radiate out from it, connected by the primary and foreign keys of the database. In a star schema implementation, Warehouse Builder stores the dimension data in a single table or view for all the dimension levels.
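As a minimal sketch (the table and column names below are illustrative, not taken from any particular system), a star schema for sales could look like this in Oracle-style SQL:

    -- Dimension tables: descriptive attributes keyed by a surrogate key
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20190822
        calendar_date DATE,
        month_name    VARCHAR2(20),
        year_number   INTEGER
    );

    CREATE TABLE dim_product (
        product_key   INTEGER PRIMARY KEY,   -- surrogate key
        product_name  VARCHAR2(100),
        category_name VARCHAR2(50)
    );

    -- Fact table: numeric measures plus foreign keys pointing at the dimensions
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date (date_key),
        product_key  INTEGER REFERENCES dim_product (product_key),
        quantity     INTEGER,
        sales_amount NUMBER(12,2)
    );

Every dimension radiates directly off the fact table through these foreign keys, which is what gives the model its star shape.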

    What is Dimensional Modeling?

Dimensional modeling is the process of structuring and organizing data, typically into fact and dimension tables, and these data structures are then implemented in a database management system. In addition to defining and organizing the data, dimensional modeling may also impose constraints or limitations on the data placed within the structure.

What is a snowflake schema?

The snowflake schema represents a dimensional model which is also composed of a central fact table and a set of constituent dimension tables, which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data. Separate database tables or views store the data pertaining to each level in the dimension.
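Continuing the illustrative example above, snowflaking the product dimension simply normalizes one of its levels into a separate table:

    -- Higher dimension level stored in its own table
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR2(50)
    );

    -- Lower level references the higher level instead of repeating its attributes
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name VARCHAR2(100),
        category_key INTEGER REFERENCES dim_category (category_key)
    );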

    What are the different methods of loading dimension tables?

The data in the dimension tables may change over a period of time. Depending upon how you want to treat the historical data in the dimension tables, there are three different ways of loading the (slowly) changing dimensions:

Type one dimension: do not keep history. Update the record if it is found, else insert the data.
Type two dimension: do not update the existing record. Create a new record of the dimension (with a version number or change date as part of the key) while retaining the old one.
Type three dimension: keep more than one column for each changing attribute. The new value of the attribute is recorded in the existing record, but in a separate (previously empty) column.

Or

Conventional load: before loading the data, all the table constraints are checked against the data.
Direct (faster) load: all the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints and the bad data is not indexed. The conventional and direct load methods are applicable only to Oracle.

    What are aggregate tables?

Aggregate tables, also known as summary tables, are fact tables that contain data which has been summarized up to a different level of detail.
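As a hedged sketch (reusing the illustrative fact_sales and dim_date tables from the star schema example above), a monthly summary table could be built like this:

    -- Roll detailed sales up to month level once, so reports avoid scanning the fact table
    CREATE TABLE agg_sales_monthly AS
    SELECT d.year_number,
           d.month_name,
           f.product_key,
           SUM(f.quantity)     AS total_quantity,
           SUM(f.sales_amount) AS total_sales_amount
    FROM   fact_sales f
    JOIN   dim_date   d ON d.date_key = f.date_key
    GROUP  BY d.year_number, d.month_name, f.product_key;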


    What is the difference between OLAP and OLTP?

Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP):

- OLTP is application oriented; OLAP is used to analyze and forecast business needs.
- OLTP data is up to date and consistent at all times; OLAP data is consistent only up to the last update.
- OLTP holds detailed data; OLAP holds summarized data.
- OLTP data is isolated; OLAP data is integrated.
- OLTP queries touch small amounts of data; OLAP queries touch large amounts of data.
- OLTP has fast response times; OLAP has slow response times.
- OLTP updates are frequent; OLAP updates are less frequent.
- In OLTP, concurrency is the biggest performance concern; in OLAP, each report or query requires a lot of resources.
- OLTP serves clerical users; OLAP serves managerial/business users.
- OLTP targets a specific process, like ordering from an online store; OLAP integrates data from different processes (ordering, processing, inventory, sales, etc.).
- OLTP is performance sensitive; OLAP is performance relaxed.
- OLTP accesses few records at a time; OLAP accesses large volumes at a time.
- OLTP access is read/update; OLAP access is mostly read with occasional updates.
- OLTP has no redundancy; in OLAP redundancy cannot be avoided.
- OLTP databases are usually around 100 MB to 100 GB in size; OLAP databases are usually around 100 GB to a few TB.

    OR

Online transaction processing (OLTP) is designed to efficiently process high volumes of transactions, instantly recording business events (such as a sales invoice payment) and reflecting changes as they occur.

Online analytical processing (OLAP) is designed for analysis and decision support, allowing exploration of often hidden relationships in large amounts of data by providing unlimited views of multiple relationships at any cross-section of defined business dimensions.


    What is ETL?

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves: extracting data from outside sources; transforming it to fit operational needs (which can include quality levels); and loading it into the end target (database or data warehouse).
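As a rough illustration only (the staging table stg_sales and its columns are assumptions, the target tables are the illustrative ones from the star schema example above, and the syntax is Oracle-flavoured), the transform-and-load step often boils down to SQL of this shape:

    -- Extract: rows have already landed in a staging table from the source system
    -- Transform + Load: cleanse, look up surrogate keys, and insert into the target
    INSERT INTO fact_sales (date_key, product_key, quantity, sales_amount)
    SELECT TO_NUMBER(TO_CHAR(s.sale_date, 'YYYYMMDD')),    -- derive the date surrogate key
           p.product_key,                                   -- look up the product surrogate key
           s.qty,
           ROUND(s.amount, 2)                               -- simple cleansing rule
    FROM   stg_sales   s
    JOIN   dim_product p ON p.product_name = TRIM(s.product_name)
    WHERE  s.amount IS NOT NULL;                            -- reject obviously bad rows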

    What are the various ETL tools in the market?

Oracle Warehouse Builder (OWB) 11gR1 (Oracle); Data Integrator & Services XI 3.0 (Business Objects, SAP); Information Server (DataStage) 8.1 (IBM); SAS Data Integration Studio 4.2 (SAS Institute); PowerCenter 8.5.1 (Informatica); Elixir Repertoire 7.2.2 (Elixir); Data Migrator 7.6 (Information Builders); SQL Server Integration Services 10 (Microsoft); Talend Open Studio 3.1 (Talend); DataFlow Manager 6.5 (Pitney Bowes Business Insight).

    What are various reporting tools in the market?

SSRS (Microsoft), BusinessObjects, Pentaho Reporting, BIRT, Cognos, MicroStrategy, Actuate, QlikView, ProClarity, Excel, Crystal Reports, Data Integrator 8.12 (Pervasive), Transformation Server 5.4 (IBM DataMirror), Transformation Manager 5.2.2 (ETL Solutions Ltd.), Data Manager/Decision Stream 8.2 (IBM Cognos), Clover ETL 2.5.2 (Javlin), ETL4ALL 4.2 (IKAN), DB2 Warehouse Edition 9.1 (IBM), Pentaho Data Integration 3.0 (Pentaho), Adeptia Integration Server 4.9 (Adeptia).

    What is a Fact table?

A fact table is a table, typically in a data warehouse, that contains the measures and facts (the primary data). A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation.

What is a Dimension table?

Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. They are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are customers or products.

    What is a look up table?

A lookup table is a referential table in which we pass a key column from the source table and we get the required data once the key column matches.
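For example (illustrative table and column names), the lookup is usually implemented as a join that passes the key column from the source and returns the descriptive attributes:

    -- Return lookup attributes for every source row whose key matches
    SELECT s.order_id,
           c.customer_name,       -- values fetched from the lookup table
           c.customer_segment
    FROM   stg_orders s
    LEFT JOIN lkp_customer c
           ON c.customer_id = s.customer_id;   -- key column passed from the source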

    What are the modeling tools available in the market? Name some of them?

Erwin (Computer Associates); Embarcadero (Embarcadero Technologies); Rational Rose (IBM Corporation); PowerDesigner (Sybase Corporation); Oracle Designer (Oracle Corporation).


    What is normalization? First normal form, second normal form, Third normal form?

Normalization is a series of steps followed to obtain a database design that allows for efficient access and storage of data. These steps reduce data redundancy and the chances of data becoming inconsistent.

    First Normal Form

    First Normal Form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-

    many relationship.

    Two rules follow this definition:

Each table has a primary key, made of one or several fields, that uniquely identifies each record.
Each field is atomic: it does not contain more than one value.

Second Normal Form

    Second Normal Form eliminates functional dependencies on a partial key by putting the fields in a separate table from

    those that are dependent on the whole key.

In our example, "wagon_type", "empty_weight", "capacity"... depend only on "wagon_id" and not on the "timestamp"

    field of the primary key, so this table is not in 2NF. In order to reach 2NF, we have to split the table in two in the way

    that each field of each table depends on all the fields of each primary key:

    Third Normal Form

    Third Normal Form eliminates functional dependencies on non-key fields by putting them in a separate table. At this

    stage, all non-key fields are dependent on the key, the whole key and nothing but the key.

In our example, in the first table it is most likely that "empty_weight", "capacity", "designer" and "design_date" depend on "wagon_type", so we have to split this table in two.
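Using the column names from the wagon example, the 2NF and 3NF splits described above could be sketched roughly as follows (the exact tables are assumptions, since the original example tables are not reproduced here):

    -- 3NF: attributes that depend on wagon_type live in their own table
    CREATE TABLE wagon_type (
        wagon_type   VARCHAR2(30) PRIMARY KEY,
        empty_weight NUMBER,
        capacity     NUMBER,
        designer     VARCHAR2(50),
        design_date  DATE
    );

    -- Attributes that depend only on wagon_id
    CREATE TABLE wagon (
        wagon_id   INTEGER PRIMARY KEY,
        wagon_type VARCHAR2(30) REFERENCES wagon_type (wagon_type)
    );

    -- 2NF: measurements that depend on the whole (wagon_id, timestamp) key stay together
    CREATE TABLE wagon_measurement (
        wagon_id     INTEGER REFERENCES wagon (wagon_id),
        measured_at  TIMESTAMP,          -- the "timestamp" field of the original key
        gross_weight NUMBER,             -- illustrative measurement on the full key
        PRIMARY KEY (wagon_id, measured_at)
    );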

    What is ODS?

An operational data store (ODS) is a database designed to integrate data from multiple sources for additional operations on the data. The data is then passed back to operational systems for further operations and to the data warehouse for reporting.

    What type of indexing mechanism do we need to use for a typical data warehouse?

On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap indexes and/or other types of clustered/non-clustered, unique/non-unique indexes.
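For instance, in Oracle-style syntax (the table and column names are the illustrative ones used earlier):

    -- Bitmap index on a low-cardinality foreign key column of the fact table
    CREATE BITMAP INDEX ix_fact_sales_product
        ON fact_sales (product_key);

    -- Regular B-tree index on a higher-cardinality dimension attribute
    CREATE INDEX ix_dim_product_name
        ON dim_product (product_name);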


Which columns go to the fact table and which columns go to the dimension table?

Changing numeric fields (the measures) go to the fact table; textual, descriptive fields go to the dimension tables.

Or: the measures that get broken down across dimensions go to the fact table, and the attributes they are broken down by go to the dimension tables.

    What is a level of granularity of a fact table? What does this signify?

Granularity is simply the level of detail at which measures and metrics are represented in the fact table. The lowest level is called detailed data and the highest level is called summary data. The grain chosen when extracting the fact table depends on the project.

How are the dimension tables designed?

Dimension tables are typically de-normalized, wide and short, use surrogate keys, and contain additional date fields and flags.

    What are slowly changing dimensions?

Slowly changing dimensions are the dimensions in which the data changes slowly, rather than changing regularly on a time basis.

    What are non-additive facts? (Inventory, Account balances in bank)

Facts are generally additive, but in some businesses a fact may be non-additive, such as inventory levels or bank balances.

What are conformed dimensions?

A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation.

    What are SCD1, SCD2, and SCD3?

There are three types of SCDs, and you can use Warehouse Builder to define, deploy, and load all three types.

Type 1 SCDs - Overwriting

In a Type 1 SCD the new data overwrites the existing data. Thus the existing data is lost, as it is not stored anywhere else. This is the default type of dimension you create; you do not need to specify any additional information to create a Type 1 SCD.


Type 2 SCDs - Creating another dimension record

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values, and this new record becomes the current record. Each record contains an effective time and an expiration time to identify the time period during which the record was active.

Type 3 SCDs - Creating a current value field

A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.
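A hedged sketch of handling a Type 2 change for an illustrative customer dimension (the table, columns, and sequence are assumptions, and the syntax is Oracle-flavoured):

    -- Close the current record for the customer whose attribute changed ...
    UPDATE dim_customer
    SET    expiration_date = SYSDATE,
           current_flag    = 'N'
    WHERE  customer_id  = 1001          -- natural key of the changed row
    AND    current_flag = 'Y';

    -- ... and insert a new current record carrying the changed attribute value
    INSERT INTO dim_customer
        (customer_key, customer_id, city, effective_date, expiration_date, current_flag)
    VALUES
        (customer_key_seq.NEXTVAL, 1001, 'Austin', SYSDATE, DATE '9999-12-31', 'Y');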

    Discuss the advantages and disadvantages of star and snowflake schema?

Star schema vs. snowflake schema:

- Star advantages: fewer joins, faster query operation.
- Snowflake advantages: distributed data; easier to obtain factless data (e.g. orders shipped across one quarter).
- Star disadvantages: bigger table sizes, too many rows in the fact table.
- Snowflake disadvantages: more joins, slower query operation.
- In a star schema every dimension has a primary key and a dimension table does not have any parent table; in a snowflake schema a dimension table has one or more parent tables.
- Hierarchies for the dimensions are stored in the dimension table itself in a star schema, whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the data from the topmost level to the lowermost level.

    What is a junk dimension?

An abstract dimension with the decodes for a group of low-cardinality flags and indicators, thereby removing the flags from the fact table, is known as a junk dimension.
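For example (purely illustrative names), several yes/no flags and short codes can be collapsed into one junk dimension, and the fact table then carries a single junk_key instead of all the individual flags:

    -- One row per observed combination of the low-cardinality flags
    CREATE TABLE dim_order_junk (
        junk_key        INTEGER PRIMARY KEY,
        is_gift_wrapped CHAR(1),         -- 'Y' / 'N'
        is_expedited    CHAR(1),         -- 'Y' / 'N'
        payment_type    VARCHAR2(10)     -- 'CARD', 'CASH', ...
    );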


    What are the differences between view and materialized view?

Views vs. materialized views:

- A view stores the SQL statement in the database and lets you use it as a table; every time you access the view, the SQL statement executes. A materialized view stores the result of the SQL in table form in the database; the SQL statement executes only once, and after that every time you run the query the stored result set is used, so query results come back quickly.
- A view's query result is not stored on disk or in the database; a materialized view allows the query result to be stored on disk, in a table.
- When we create a view on a table, the rowids of the view are the same as those of the original table; in a materialized view the rowids are different.
- With a view we always get the latest data from the database; with a materialized view we need to refresh it to get the latest data.
- The performance of a view is lower than that of a materialized view.
- A view is only a logical view of the table, with no separate copy of the data; a materialized view is a physically separate copy of the data.
- A view requires no refresh mechanism; a materialized view needs a trigger or some automatic method to keep it refreshed.
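In Oracle-style SQL the contrast looks roughly like this (the underlying query reuses the illustrative monthly-sales tables from earlier):

    -- A view: only the SQL text is stored; it runs every time it is queried
    CREATE VIEW v_monthly_sales AS
    SELECT d.year_number, d.month_name, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP  BY d.year_number, d.month_name;

    -- A materialized view: the result set is stored and refreshed periodically
    CREATE MATERIALIZED VIEW mv_monthly_sales
    REFRESH COMPLETE ON DEMAND AS
    SELECT d.year_number, d.month_name, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f JOIN dim_date d ON d.date_key = f.date_key
    GROUP  BY d.year_number, d.month_name;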


    Compare data warehousing top down and bottom-up approach?

Top-down approach vs. bottom-up approach:

- Top-down: the data warehouse is built first, and the data marts are then created from the data warehouse. Bottom-up: the data marts are created first to provide reporting capability, and these data marts are then integrated to build a complete data warehouse.
- Top-down: provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse. Bottom-up: this model contains consistent data marts, and these data marts can be delivered quickly.
- Top-down: this approach is robust against business changes; creating a new data mart from the data warehouse is very easy. Bottom-up: as the data marts are created first, reports can be generated quickly.
- Top-down: this methodology is inflexible to changing departmental needs during the implementation phase. Bottom-up: the data warehouse can be extended easily to accommodate new business units; it is just a matter of creating new data marts and integrating them with the other data marts.
- Top-down: it represents a very large project, and the cost of implementing it is significant. Bottom-up: the positions of the data warehouse and the data marts are reversed in the bottom-up design approach.

    What is factless fact schema?

A factless fact table is a fact table without measures; it lets you view the number of occurring events. Example: the number of accidents that occurred in a month.
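A minimal sketch (table and column names assumed, reusing the illustrative dim_date from earlier): the table records only that an event happened, so counting rows answers the business question.

    -- Factless fact: only foreign keys to dimensions, no numeric measures
    CREATE TABLE fact_accident (
        date_key     INTEGER REFERENCES dim_date (date_key),
        location_key INTEGER
    );

    -- Number of accidents in a month is just a row count
    SELECT d.year_number, d.month_name, COUNT(*) AS accident_count
    FROM   fact_accident f
    JOIN   dim_date d ON d.date_key = f.date_key
    GROUP  BY d.year_number, d.month_name;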

    Which kind of index is preferred in DWH?

Index types depend very much on the cardinality of the distinct values. High cardinality would require a regular B-tree index, whereas very low cardinality would call for bitmap indexes. Small tables may not require indexes at all, since a full-table scan on such a table can be much faster than reading an index.

It actually depends on the nature of the column on which you are going to create the index. Bitmap: if the column is a flag, i.e. contains only values such as 1 or 0. B-tree: if the column contains many distinct numerical values. Partitions can also be created if the column contains only a short list of values.


What is the architecture of any data warehousing project? What is the flow?

1) The basic step of data warehousing starts with data modelling, i.e. the creation of dimensions and facts.
2) The data warehouse starts with the collection of data from source systems such as OLTP, CRM, ERP, etc.
3) The cleansing and transformation process is done with an ETL (Extraction, Transformation, Loading) tool.
4) By the end of the ETL process the target databases (dimensions, facts) are ready with data that satisfies the business rules.
5) Finally, with the use of reporting (OLAP) tools, we get the information that is used for decision support.

    Explain Additive, semi-additive, non-additive facts?

A fact table can store different types of measures: additive, non-additive, and semi-additive.

Additive: as the name implies, additive measures are measures which can be added across all dimensions.
Non-additive: different from additive measures, non-additive measures are measures that cannot be added across any dimension.
Semi-additive: semi-additive measures are measures that can be added across only some dimensions and not across others.

    Difference between DWH and ODS

ODS vs. DWH:

- ODS: transactions similar to those of an online transaction processing system. DWH: transactions similar to those of an online analytical processing system.
- ODS: contains current and near-current data. DWH: queries process larger volumes of data.
- ODS: typically detailed data only, often resulting in very large data volumes. DWH: contains historical data.
- ODS: real-time and near-real-time data loads. DWH: typically batch data loads.
- ODS: generally modeled to support rapid data updates. DWH: generally dimensionally modeled and tuned to optimise query performance.
- ODS: updated at the data field level. DWH: data is appended, not updated.
- ODS: used for detailed decision making and operational reporting. DWH: used for long-term decision making and management reporting.
- ODS: serves knowledge workers (customer service representatives, line managers). DWH: serves a strategic audience (executives, business unit management).
- ODS: data is volatile. DWH: data is non-volatile.


What are the steps to build the data warehouse?

Identifying sources; identifying facts; defining dimensions; defining attributes; redefining dimensions and attributes; organising the attribute hierarchy and defining relationships; assigning unique identifiers; additional conventions (cardinality, adding ratios). At a higher level: 1) business modeling, 2) data modeling, 3) data from the source databases, 4) extraction, transformation, loading, 5) the data warehouse (data marts).

Or

Extracting the transactional data from the data sources into a staging area; transforming the transactional data; loading the transformed data into a dimensional database; building pre-calculated summary values to speed up report generation; building (or purchasing) a front-end reporting tool.

    How do you connect two fact tables? Is it possible?

This is possible through the conformed dimension methodology: if a dimension table is connected to more than one fact table, it is called a conformed dimension.

What is the main difference between the Inmon and Kimball philosophies of data warehousing?

Bill Inmon's paradigm: the data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in third normal form.

Ralph Kimball's paradigm: the data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

What is meant by metadata in the context of a data warehouse, and why is it important?

Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

Metadata synchronization is the process of consolidating, relating and synchronizing data elements with the same or similar meaning from different systems. Metadata synchronization joins these differing elements together in the data warehouse to allow for easier access.


What is the role of surrogate keys in a data warehouse, and how will you generate them?

A surrogate key is a simple primary key which maps one-to-one with a natural compound primary key. The reason for using them is to alleviate the need for the query writer to know the full compound key, and also to speed up query processing by removing the need for the RDBMS to process the full compound key when considering a join.
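In Oracle, surrogate keys are commonly generated with a sequence (the sequence and table names below are illustrative):

    -- Sequence that hands out surrogate key values
    CREATE SEQUENCE customer_key_seq START WITH 1 INCREMENT BY 1;

    -- Used while loading the dimension; the natural key is kept as an ordinary column
    INSERT INTO dim_customer (customer_key, customer_id, customer_name)
    VALUES (customer_key_seq.NEXTVAL, 1001, 'Acme Ltd');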

The surrogate key's role is that it links the dimension and fact tables, and it avoids relying on smart keys and production (source-system) keys.

How is data stored in the data warehouse after it has been extracted and transformed from heterogeneous sources?

Why is the fact table in normal form?

The foreign keys of the fact table are the primary keys of the dimension tables. Since the fact table contains columns that are primary keys of other tables, that in itself makes it a normal-form table.

Or

Basically the fact table consists of the index keys of the dimension/lookup tables and the measures; whenever a table contains only keys and measures, that itself implies that the table is in normal form.

What is the difference between E-R modelling and dimensional modelling?

The basic difference is that E-R modelling has both a logical and a physical model, while the dimensional model has only a physical model. E-R modelling is used for normalizing the OLTP database design; dimensional modelling is used for de-normalizing the ROLAP/MOLAP design.

    Can a dimension table contain numeric values?

Yes, a dimension table can have numeric values; for example, the surrogate key holds a numeric value for unique identification of the records in the dimension. Other dimension attributes are usually stored with character datatypes, even when the values themselves look numeric.

What are the methodologies of data warehousing?

There are mainly two methodologies in data warehousing: 1. the Ralph Kimball model, which is always structured as a denormalized (dimensional) structure, and 2. the Inmon model, which is structured as a normalized structure. Depending on the company's requirements, its DWH will follow one of the above models.

Or

Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are commonly used. Other methodologies are AMM, the World Class methodology, and many more.

What is a surrogate key? Where do we use it? Explain with examples.

A surrogate key is a unique identifier in the database, either for an entity in the modeled world or for an object in the database. Application data is not used to derive the surrogate key; it is an internally generated key, produced by the current system, and is invisible to the user. As several objects in the database may correspond to the surrogate, the surrogate key cannot be utilized as the natural primary key. For example, a sequential number can be a surrogate key.

Tell me, what would be the size of your warehouse project?


What are semi-additive and fully additive measures?

Semi-additive: a semi-additive measure can be aggregated along some, but not all, dimensions that are included in the measure group that contains the measure. For example, a measure that represents the quantity available for inventory can be aggregated along a geography dimension to produce a total quantity available for all warehouses, but the measure cannot be aggregated along a time dimension because it represents a periodic snapshot of quantities available. Aggregating such a measure along a time dimension would produce incorrect results.

Non-additive: a non-additive measure cannot be aggregated along any dimension in the measure group that contains the measure. Instead, the measure must be individually calculated for each cell in the cube that represents the measure. For example, a calculated measure that returns a percentage, such as profit margin, cannot be aggregated from the percentage values of child members in any dimension.

    What are the differences between star schema and snow-flake schema?

Star schema vs. snowflake schema:

- Star schema is highly denormalized; snowflake schema is normalized.
- In a star schema data access latency is lower; in a snowflake schema data access latency is higher.
- The size of the DWH is larger with a star schema, as it is denormalized; with a snowflake schema the size is smaller.
- Star is better for performance; snowflake is better when memory utilization is a major concern.
- Star reduces the number of joins between tables but requires more storage space; snowflake has minimum storage space and minimum data redundancy but requires more joins to get information from the lookup tables, hence slower performance.

Where do we use a star schema and where a snowflake schema?

If PERFORMANCE is the priority, then go for the star schema, since there the dimension tables are DENORMALIZED.

If MEMORY SPACE is the priority, then go for the snowflake schema, since there the dimension tables are NORMALIZED.

What is ODS? What data is loaded from it? What is the DW architecture?

ODS: Operational Data Store, normally in 3NF form; data is stored with the least redundancy. The general architecture of a DWH is: OLTP systems -> ODS -> DWH (denormalized star or snowflake, varying case by case).