
DATA ANALYTICS

TWO MARKS QUESTIONS AND ANSWERS

Unit I

1. Define data analytics.

Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and organizations to make better business decisions, and in the sciences to verify or disprove existing models or theories.

2.Give some alternative terms for data mining.

• Knowledge mining

• Knowledge extraction

• Data/pattern analysis.

• Data Archaeology

• Data dredging

3. What is KDD?

KDD stands for Knowledge Discovery in Databases.

4. What are the steps involved in the KDD process?

The steps, in the order in which they are carried out, are:

• Data cleaning
• Data integration
• Data selection
• Data transformation
• Data mining
• Pattern evaluation
• Knowledge presentation

5. What is the use of the knowledge base?

The knowledge base is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, which are used to organize attributes or attribute values into different levels of abstraction.

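6. Architecture of a typical data mining system.

(Figure not reproduced. Surviving labels: GUI, pattern evaluation, knowledge base, database or data warehouse server, DB, DW.)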

7. Mention some of the data mining techniques.

• Statistics

• Machine learning

• Decision Tree

• Hidden Markov models

• Artificial Intelligence

• Genetic Algorithm

• Meta learning

8. Give a few statistical techniques.

• Point Estimation

• Data Summarization

• Bayesian Techniques

• Testing Hypothesis

• Correlation

• Regression

9. What is meta learning?

Meta learning is the concept of combining the predictions made by multiple data mining models and analyzing those predictions to formulate a new and previously unknown prediction.


10. Define genetic algorithm.

• A genetic algorithm is a search algorithm.
• It enables us to locate an optimal binary string by processing an initial random population of binary strings, performing operations such as artificial mutation, crossover, and selection.
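The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `genetic_search`, the tournament selection scheme, the single-point crossover, and all default parameter values are hypothetical choices, not something the notes prescribe. The fitness function is supplied by the caller; passing `sum` rewards binary strings with many 1 bits.

```python
import random

def genetic_search(fitness, length=12, population_size=20,
                   generations=40, mutation_rate=0.02, seed=0):
    """Search for a high-fitness binary string (a list of 0/1 values)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable

    # Initial random population of binary strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)

    def tournament(pop):
        # Selection: pick two individuals at random, keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_population = []
        while len(next_population) < population_size:
            parent1, parent2 = tournament(population), tournament(population)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randint(1, length - 1)
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        # Remember the best string seen so far across all generations.
        best = max(population + [best], key=fitness)
    return best

best_string = genetic_search(sum)
print(best_string, sum(best_string))
```

Because the best string seen so far is always kept, the returned string is never worse than the best member of the initial population.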

11. What is the purpose of a data mining technique?

It provides a way to carry out the various data mining tasks.

12. Define predictive model.

A predictive model is used to predict the values of data by making use of known results from a different set of sample data.

13. Data mining tasks that belong to the predictive model:

• Classification

• Regression

• Time series analysis

14. Define descriptive model.

A descriptive model is used to determine the patterns and relationships in sample data.

Data mining tasks that belong to the descriptive model:

• Clustering

• Summarization

• Association rules

• Sequence discovery

15. Define the term summarization.

Summarization condenses a large chunk of data, such as that contained in a web page or a document, into a compact description.

Summarization = characterization = generalization

16. List out the advanced database systems.

• Extended-relational databases

• Object-oriented databases

• Deductive databases

• Spatial databases

• Temporal databases

• Multimedia databases

• Active databases

• Scientific databases

• Knowledge databases

17. Define cluster analysis.

Cluster analysis examines data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

18. Classifications of data mining systems.

• Based on the kinds of databases mined
  o According to data model
    _ Relational mining system
    _ Transactional mining system
    _ Object-oriented mining system
    _ Object-relational mining system
    _ Data warehouse mining system
  o According to types of data
    _ Spatial data mining system
    _ Time series data mining system
    _ Text data mining system
    _ Multimedia data mining system
• Based on the kinds of knowledge mined
  o According to functionalities
    _ Characterization
    _ Discrimination
    _ Association
    _ Classification
    _ Clustering
    _ Outlier analysis
    _ Evolution analysis
  o According to the level of abstraction of the knowledge mined
    _ Generalized knowledge (high level of abstraction)
    _ Primitive-level knowledge (raw data level)
  o According to whether data regularities or data irregularities are mined
• Based on the kinds of techniques utilized
  o According to user interaction
    _ Autonomous systems
    _ Interactive exploratory systems
    _ Query-driven systems
  o According to methods of data analysis
    _ Database-oriented
    _ Data warehouse-oriented
    _ Machine learning
    _ Statistics
    _ Visualization
    _ Pattern recognition
    _ Neural networks
• Based on the applications adopted
  o Finance
  o Telecommunication
  o DNA
  o Stock markets
  o E-mail, and so on

19. Describe challenges to data mining regarding data mining methodology and user interaction issues.

• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation

20. Describe challenges to data mining regarding performance issues.

• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms

21. Describe issues relating to the diversity of database types.

• Handling of relational and complex types of data
• Mining information from heterogeneous databases and global information systems

22. What is meant by pattern?

A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.

23. How is a data warehouse different from a database?

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision-making. A database consists of a collection of interrelated data.

Unit II

1. Define association rule mining.

Association rule mining searches for interesting relationships among items in a given data set.

2. When can we say that association rules are interesting?

Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

3. Explain association rules in mathematical notation.

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let $D$, the task-relevant data, be a set of database transactions, where each transaction $T$ is a set of items such that $T \subseteq I$.

An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$. The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$ if $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$.

4. Define support and confidence in association rule mining.

Support $s$ is the percentage of transactions in $D$ that contain $A \cup B$, that is, every item of $A$ and of $B$.

Confidence $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$.

Support$(A \Rightarrow B) = P(A \cup B)$

Confidence$(A \Rightarrow B) = P(B \mid A)$
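As a small worked illustration of the two definitions, the sketch below computes support and confidence over a toy transaction set. The transactions and the function names `support` and `confidence` are made up for this example.

```python
# Toy transaction database; the items are invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(B | A) as support(A and B together) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# bread and milk occur together in 2 of the 4 transactions.
print(support({"bread", "milk"}, transactions))       # 0.5
# 2 of the 3 transactions containing bread also contain milk.
print(confidence({"bread"}, {"milk"}, transactions))
```

Both values are returned as fractions between 0 and 1; multiply by 100 to obtain the percentages used in the definitions.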

5. How are association rules mined from large databases?

• Step I: Find all frequent itemsets.
• Step II: Generate strong association rules from the frequent itemsets.

6. Describe the different classifications of association rule mining.

• Based on the types of values handled in the rule
  i. Boolean association rule
  ii. Quantitative association rule
• Based on the dimensions of data involved
  i. Single-dimensional association rule
  ii. Multidimensional association rule
• Based on the levels of abstraction involved
  i. Multilevel association rule
  ii. Single-level association rule
• Based on various extensions
  i. Correlation analysis
  ii. Mining max patterns

7. What is the purpose of the Apriori algorithm?

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

8. Define the anti-monotone property.

If a set cannot pass a test, all of its supersets will fail the same test as well.
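The anti-monotone property is what lets Apriori discard candidates early: an itemset with an infrequent subset cannot itself be frequent. The following is a minimal level-wise sketch, not an optimized implementation; the function name `apriori` and the use of an absolute support count as the threshold are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data to count each candidate itemset.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (anti-monotone property): drop any candidate that has
        # an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items()
                   if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent

demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, n in sorted(apriori(demo, 3).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), n)
```

In the demo data every single item and every pair is frequent at a count of 3, while the triple {a, b, c} occurs only twice and is rejected.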

9. How are association rules generated from frequent itemsets?

Association rules can be generated as follows:

• For each frequent itemset $l$, generate all nonempty subsets of $l$.
• For every nonempty subset $s$ of $l$, output the rule "$s \Rightarrow (l - s)$" if support_count($l$) / support_count($s$) $\ge$ min_conf,

where min_conf is the minimum confidence threshold.
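The rule-generation step can be sketched as follows. The function name `generate_rules` and the sample support counts are hypothetical; the sketch assumes the support count of every nonempty subset is available in the input dictionary, as would be the case for the output of a frequent-itemset miner. Only proper subsets are used, so that the right-hand side of a rule is never empty.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return a list of (s, l - s, confidence) for rules meeting min_conf.

    `support_counts` maps frozenset itemsets to their support counts and
    must contain every nonempty subset of each itemset.
    """
    rules = []
    for l, count_l in support_counts.items():
        if len(l) < 2:
            continue  # a rule needs a nonempty left and right side
        for size in range(1, len(l)):
            for subset in combinations(sorted(l), size):
                s = frozenset(subset)
                conf = count_l / support_counts[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Hypothetical support counts for a tiny data set.
counts = {frozenset("a"): 4, frozenset("b"): 3, frozenset("ab"): 3}
for lhs, rhs, conf in generate_rules(counts, min_conf=0.8):
    print(sorted(lhs), "=>", sorted(rhs), conf)
```

With these counts, the rule b => a has confidence 3/3 and is kept, while a => b has confidence 3/4 and falls below the 0.8 threshold.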

10. Give a few techniques to improve the efficiency of the Apriori algorithm.

• Hash-based technique
• Transaction reduction
• Partitioning
• Sampling
• Dynamic itemset counting

11. What factors hurt the performance of the Apriori candidate generation technique?

• The need to generate a huge number of candidate sets
• The need to repeatedly scan the database and check a large set of candidates by pattern matching

12. Describe the method of generating frequent itemsets without candidate generation.

Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.

Steps:

• Compress the database representing frequent items into a frequent-pattern tree, or FP-tree.
• Divide the compressed database into a set of conditional databases.
• Mine each conditional database separately.

13. Define iceberg query.

An iceberg query computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold.

Given a relation R with attributes a1, a2, ..., an and b, and an aggregate function agg_f, an iceberg query has the form:

    SELECT R.a1, R.a2, ..., R.an, agg_f(R.b)
    FROM relation R
    GROUP BY R.a1, R.a2, ..., R.an
    HAVING agg_f(R.b) >= threshold
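To make the query form concrete, here is a small runnable sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `sales`, its columns, the sample rows, and the threshold of 5 are all invented for the example; `SUM` plays the role of agg_f.

```python
import sqlite3

# In-memory database with a hypothetical sales relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("ann", "pen", 3),
        ("ann", "pen", 4),
        ("bob", "pen", 1),
        ("bob", "ink", 2),
        ("ann", "ink", 1),
    ],
)

threshold = 5
# Iceberg query: group by the attributes, keep only groups whose
# aggregate value reaches the threshold.
rows = conn.execute(
    """
    SELECT customer, item, SUM(qty) AS total
    FROM sales
    GROUP BY customer, item
    HAVING SUM(qty) >= ?
    """,
    (threshold,),
).fetchall()
conn.close()

print(rows)  # [('ann', 'pen', 7)]
```

Only the (ann, pen) group reaches the threshold; the other groups form the part of the "iceberg" that stays below the surface.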

14. Mention a few approaches to mining multilevel association rules.

• Uniform minimum support for all levels (or uniform support)
• Using reduced minimum support at lower levels (or reduced support)
• Level-by-level independent
• Level-cross filtering by single item
• Level-cross filtering by k-itemset

15. What are multidimensional association rules?

Multidimensional association rules are association rules that involve two or more dimensions or predicates.

• Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension.
• Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.

16. Define constraint-based association mining.

Mining is performed under the guidance of various kinds of constraints provided by the user.

The constraints include the following:

• Knowledge type constraints

• Data constraints

• Dimension/level constraints

• Interestingness constraints

• Rule constraints.

17. Define the concept of classification.

Classification is a two-step process:

• A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
• The model is used for classification.

18. What is a decision tree?

A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

19. What is an attribute selection measure?

The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.
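Information gain can be illustrated with a short sketch. It uses the usual entropy-based definition (entropy of the class labels before the split minus the weighted entropy after the split); the function names, the row format (a list of dictionaries), and the sample data are assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training tuples.
rows = [
    {"windy": "yes", "humid": "yes", "play": "no"},
    {"windy": "yes", "humid": "no", "play": "no"},
    {"windy": "no", "humid": "yes", "play": "yes"},
    {"windy": "no", "humid": "no", "play": "yes"},
]
print(information_gain(rows, "windy", "play"))  # windy separates the classes
print(information_gain(rows, "humid", "play"))  # humid tells us nothing
```

At each node, the attribute with the highest gain would be chosen as the test attribute; here that is `windy`.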

20. Describe tree pruning methods.

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.

Approaches:

• Pre-pruning
• Post-pruning

21. Define pre-pruning.

A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.

22. Define post-pruning.

Post-pruning removes branches from a "fully grown" tree. A tree node is pruned by removing its branches. E.g., the cost complexity algorithm.

23. What is meant by pattern?

A pattern represents knowledge.

24. Define the concept of prediction.

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have.

Unit III

1. Define clustering.

Clustering is the process of grouping physical or conceptual data objects into clusters.

2. What do you mean by cluster analysis?

Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.

3. What are the fields in which clustering techniques are used?

• Clustering is used in biology to develop new plant and animal taxonomies.
• Clustering is used in business to enable marketers to develop new distinct groups of their customers and to characterize the customer groups on the basis of purchasing.
• Clustering is used in the identification of groups of automobile insurance policy customers.
• Clustering is used in the identification of groups of houses in a city on the basis of house type, cost, and geographical location.
• Clustering is used to classify documents on the web for information discovery.

4.What are the requirements of cluster analysis?

The basic requirements of cluster analysis are

• Dealing with different types of attributes.

• Dealing with noisy data.

• Constraints on clustering.

• Dealing with arbitrary shapes.

• High dimensionality

• Ordering of input data

• Interpretability and usability

• Determining input parameter and

• Scalability

5. What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are interval-scaled, binary, nominal, ordinal, and ratio-scaled data.

6. What are interval-scaled variables?

Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or the coordinates of a cluster. The dissimilarity between objects described by such measurements can be calculated using the Euclidean distance or the Minkowski distance.
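The two distance measures named above are closely related: the Minkowski distance with exponent p = 2 is the Euclidean distance. A minimal sketch, with function names chosen for this example:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """Euclidean distance, the special case p = 2."""
    return minkowski(x, y, 2)

print(euclidean((0, 0), (3, 4)))     # 5.0
print(minkowski((0, 0), (3, 4), 1))  # 7.0 (the Manhattan distance)
```

In practice, interval-scaled variables are often standardized first so that a variable measured in large units does not dominate the distance.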

7. Define binary variables. What are the two types of binary variables?

Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. A binary variable is symmetric if both of its states are equally valuable and carry the same weight. It is asymmetric if its two states are not equally important and do not carry the same weight.

8. Define nominal, ordinal and ratio-scaled variables.

A nominal variable is a generalization of the binary variable. A nominal variable has more than two states. For example, a nominal variable, color, may consist of four states: red, green, yellow, or black. In nominal variables the total number of states is N, and the states are denoted by letters, symbols, or integers.

An ordinal variable also has more than two states, but all these states are ordered in a meaningful sequence.

A ratio-scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, using the formula

$Ae^{Bt}$ or $Ae^{-Bt}$

where $A$ and $B$ are constants.

9. What do you mean by partitioning method?

In a partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Here each partition represents a cluster. The two types of partitioning method are k-means and k-medoids.
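As an illustration of a partitioning method, here is a plain k-means sketch. The function name `k_means`, the choice to pass the starting centroids explicitly (rather than picking them at random), and the sample points are assumptions made to keep the example short and repeatable.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c))
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # [(1.0, 1.5), (8.5, 8.0)]
print(clusters)
```

k-medoids follows the same assign-and-update pattern but restricts each cluster representative to be an actual object of the data set, which makes it less sensitive to outliers.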

10. Define CLARA and CLARANS.

Clustering LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work well if a selected representative data set does not contain the best k-medoids.

To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized Search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is carried out after selecting the representative data sets.

11. What is a hierarchical method?

A hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on a bottom-up or a top-down approach.

12. Differentiate agglomerative and divisive hierarchical clustering.

The agglomerative hierarchical clustering method works on the bottom-up approach. In the agglomerative hierarchical method, each object starts in its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.

The divisive hierarchical clustering method works on the top-down approach. In this method all the objects start within one big cluster, and the large cluster is repeatedly divided into smaller clusters until each cluster has a single object.

13. What is CURE?

Clustering Using REpresentatives is called CURE. Clustering algorithms generally work well on spherical clusters of similar size. CURE overcomes this restriction to spherical and similar-size clusters and is more robust with respect to outliers.

14. Define the Chameleon method.

Chameleon is another hierarchical clustering method, one that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within a cluster.

15. Define density-based method.

A density-based method deals with arbitrarily shaped clusters. In a density-based method, clusters are formed on the basis of the regions where the density of the objects is high.

16. What is DBSCAN?

Density-Based Spatial Clustering of Applications with Noise is called DBSCAN. DBSCAN is a density-based clustering method that converts high-density regions of objects into clusters with arbitrary shapes and sizes. DBSCAN defines a cluster as a maximal set of density-connected points.

17. What do you mean by grid-based method?

In this method objects are represented by a multiresolution grid data structure. All the objects are quantized into a finite number of cells, and the collection of cells builds the grid structure of objects. The clustering operations are performed on that grid structure. This method is widely used because its processing time is very fast and is independent of the number of objects.

18. What is STING?

STatistical INformation Grid is called STING; it is a grid-based multiresolution clustering method. In the STING method, all the objects are contained in rectangular cells. These cells are kept at various levels of resolution, and these levels are arranged in a hierarchical structure.

19. Define WaveCluster.

It is a grid-based multiresolution clustering method. In this method all the objects are represented by a multidimensional grid structure, and a wavelet transformation is applied for finding the dense regions. Each grid cell contains the information of the group of objects that map into the cell. A wavelet transformation is a signal processing technique that decomposes a signal into different frequency sub-bands.

20. What is a model-based method?

Model-based methods are used for optimizing the fit between a given data set and a mathematical model. This method uses the assumption that the data are generated by underlying probability distributions. There are two basic approaches in this method:

1. Statistical Approach

2. Neural Network Approach.

21. What is the use of regression?

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in essence, regression takes a set of data and fits the data to a formula.

22. What are the reasons for not using the linear regression model to estimate the output data?

There are many reasons for that. One is that the data do not fit a linear model. It is possible, however, that the data generally do represent a linear model, but the linear model generated is poor because noise or outliers exist in the data.

Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

23. What are the two approaches used by regression to perform classification?

Regression can be used to perform classification using the following approaches:

1. Division: The data are divided into regions based on class.

2. Prediction: Formulas are generated to predict the output class value.

24. What do you mean by logistic regression?

Instead of fitting the data to a straight line, logistic regression uses a logistic curve. The formula for the univariate logistic curve is

$$p = \frac{e^{(c_0 + c_1 x_1)}}{1 + e^{(c_0 + c_1 x_1)}}$$

The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership.
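The univariate logistic curve is easy to evaluate directly. In the sketch below the function name `logistic` and the coefficient values passed to it are arbitrary choices for illustration; in practice c0 and c1 would be estimated from training data.

```python
from math import exp

def logistic(x1, c0, c1):
    """Value of the univariate logistic curve at x1."""
    z = c0 + c1 * x1
    return exp(z) / (1 + exp(z))

print(logistic(0.0, 0.0, 1.0))  # 0.5, the midpoint of the curve
for x in (-4.0, -1.0, 1.0, 4.0):
    print(x, logistic(x, 0.0, 1.0))
```

A sample would typically be assigned to the class when the returned value exceeds a cut-off such as 0.5.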

25. What is time series analysis?

A time series is a set of attribute values over a period of time. Time series analysis may be viewed as finding patterns in the data and predicting future values.

26. What are the various detected patterns?

Detected patterns may include:

• Trends: Systematic non-repetitive changes to the values over time.
• Cycles: The observed behavior is cyclic.
• Seasonal: The detected patterns may be based on the time of year, month, or day.
• Outliers: To assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.

27. What is smoothing?

Smoothing is an approach that is used to remove the nonsystematic behaviors found in a time series. It usually takes the form of finding moving averages of attribute values. It is used to filter out noise and outliers.
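A simple moving average shows the idea. The function name `moving_average`, the window size, and the sample series are assumptions for this sketch; each output value is the mean of `window` consecutive input values.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

print(moving_average([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]

# A spike (an outlier) is damped in the smoothed series.
print(moving_average([10, 11, 30, 11, 10], 3))
```

A larger window smooths more aggressively but also blurs genuine short-term changes.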

28. Give the formula for Pearson's r.

One standard formula to measure correlation is the correlation coefficient $r$, sometimes called Pearson's $r$. Given two time series, $X$ and $Y$, with means $\bar{X}$ and $\bar{Y}$, each with $n$ elements, the formula for $r$ is

$$r = \frac{\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum (x_i - \bar{X})^2 \sum (y_i - \bar{Y})^2}}$$
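The formula translates directly into code. The function name `pearson_r` and the sample series are chosen for this sketch, which assumes both series have the same length and that neither is constant (otherwise the denominator is zero).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - mean_x) ** 2 for x in xs)
                       * sum((y - mean_y) ** 2 for y in ys))
    return numerator / denominator

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly correlated
print(pearson_r([1, 2, 3], [6, 4, 2]))  # perfectly anti-correlated
```

Values of r close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.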

29. What is autoregression?

Autoregression is a method of predicting a future time series value by looking at previous values. Given a time series $X = (x_1, x_2, \ldots, x_n)$, a future value, $x_{n+1}$, can be found using

$$x_{n+1} = \xi + \varphi_n x_n + \varphi_{n-1} x_{n-1} + \cdots + \epsilon_{n+1}$$

Here $\epsilon_{n+1}$ represents a random error at time $n+1$. In addition, each element in the time series can be viewed as a combination of a random error and a linear combination of previous values.
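A point forecast from the autoregression formula can be sketched as below. The sketch assumes that the constant term and the coefficients are already known (in practice they are estimated from the data) and that the random error is replaced by its expected value of zero. The function name `predict_next` and the numbers used are illustrative only.

```python
def predict_next(series, xi, phis):
    """Forecast x_{n+1} = xi + phis[0]*x_n + phis[1]*x_{n-1} + ...

    phis[0] multiplies the most recent value, phis[1] the one before it,
    and so on. The random error term is taken to be zero.
    """
    recent_first = series[::-1][:len(phis)]
    return xi + sum(phi * x for phi, x in zip(phis, recent_first))

# 0.5 + 0.5 * 3.0 + 0.25 * 2.0
print(predict_next([1.0, 2.0, 3.0], xi=0.5, phis=[0.5, 0.25]))  # 2.5
```

Using only two coefficients here means the forecast looks back two time steps; longer coefficient lists look further back.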

Unit IV

1. Define data warehouse.

A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision-making.

(or)

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

2. What are operational databases?

Large databases that organizations maintain and that are updated by daily transactions are called operational databases.

3. Define OLTP.

If an online operational database system is used for efficient retrieval, efficient storage, and management of large amounts of data, then the system is said to perform online transaction processing (OLTP).

4. Define OLAP.

Data warehouse systems serve users or knowledge workers in the role of data analysis and decision-making. Such systems can organize and present data in various formats. These systems are known as online analytical processing (OLAP) systems.

5. How is a database design represented in OLTP systems?

Entity-relationship model

6. How is a database design represented in OLAP systems?

• Star schema
• Snowflake schema
• Fact constellation schema

7. Write short notes on the multidimensional data model.

Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and departmental data marts. This model can take the form of a star schema, a snowflake schema, or a fact constellation schema. The core of the multidimensional model is the data cube.

8. Define data cube.

It consists of a large set of facts (or measures) and a number of dimensions.

9. What are facts?

Facts are numerical measures. Facts can also be considered as quantities by which we can analyze the relationships between dimensions.

10. What are dimensions?

Dimensions are the entities or perspectives with respect to which an organization keeps records; they are hierarchical in nature.

11. Define dimension table.

A dimension table is used for describing a dimension. For example, a dimension table for item may contain the attributes item_name, brand, and type.

12. Define fact table.

A fact table contains the names of the facts (or measures) as well as keys to each of the related dimension tables.

13. What is a lattice of cuboids?

In the data warehousing research literature, a cube is also called a cuboid. For different sets of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization. The lattice of cuboids is also referred to as a data cube.

14. What is the apex cuboid?

The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by all.

15. List out the components of a star schema.

_ A large central table (fact table) containing the bulk of the data with no redundancy.

_ A set of smaller attendant tables (dimension tables), one for each dimension.

16. What is a snowflake schema?

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables.

17. List out the components of a fact constellation schema.

This requires multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence it is known as a galaxy schema or fact constellation schema.

18. Point out the major difference between the star schema and the snowflake schema.

The dimension tables of the snowflake schema model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space.

19. Which is more popular in data warehouse design, the star schema model or the snowflake schema model?

The star schema model, because the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.

20. Define concept hierarchy.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.

21. Define total order.

If the attributes of a dimension form a concept hierarchy such as "street < city < province_or_state < country", then the hierarchy is said to be a total order.

Fig: Total order for location (country, province or state, city, street)

22. Define partial order.

If the attributes of a dimension form a lattice such as "day < {month < quarter; week} < year", then the hierarchy is said to be a partial order.

23. Define schema hierarchy.

A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.

24. List out the OLAP operations in the multidimensional data model.

_ Roll-up

_ Drill-down

_ Slice and dice

_ Pivot (or rotate)

25. What is the roll-up operation?

The roll-up operation, also called the drill-up operation, performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.

26. What is the drill-down operation?

Drill-down is the reverse of the roll-up operation. It navigates from less detailed data to more detailed data. Drill-down can be carried out by stepping down a concept hierarchy for a dimension.

27. What is the slice operation?

The slice operation performs a selection on one dimension of the cube, resulting in a subcube.

28. What is the dice operation?

The dice operation defines a subcube by performing a selection on two or more dimensions.

29. What is the pivot operation?

This is a visualization operation that rotates the data axes to give an alternative presentation of the data.
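Roll-up, slice, and dice can be imitated on a small list of fact records. This is only a sketch of the ideas, not how an OLAP server stores a cube; the record layout, the sample data, and the function names are all invented for the example.

```python
from collections import defaultdict

# Hypothetical facts: (city, country, quarter, item, units_sold).
facts = [
    ("Chennai", "India", "Q1", "phone", 10),
    ("Chennai", "India", "Q2", "phone", 12),
    ("Delhi", "India", "Q1", "laptop", 5),
    ("Delhi", "India", "Q1", "phone", 4),
    ("Toronto", "Canada", "Q1", "phone", 7),
    ("Toronto", "Canada", "Q2", "laptop", 3),
]

def roll_up_city_to_country(facts):
    """Roll-up: climb the location hierarchy from city to country."""
    totals = defaultdict(int)
    for city, country, quarter, item, units in facts:
        totals[(country, quarter, item)] += units
    return dict(totals)

def slice_by_quarter(facts, quarter):
    """Slice: select on a single dimension (time)."""
    return [f for f in facts if f[2] == quarter]

def dice(facts, quarters, items):
    """Dice: select on two dimensions (time and item) at once."""
    return [f for f in facts if f[2] in quarters and f[3] in items]

print(roll_up_city_to_country(facts)[("India", "Q1", "phone")])  # 14
print(slice_by_quarter(facts, "Q1"))
print(dice(facts, {"Q1"}, {"phone"}))
```

Drill-down is the reverse of the roll-up shown here: it would return from the country-level totals to the city-level records.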

30. List out the views in the design of a data warehouse.

_ Top-down view

_ Data source view

_ Data warehouse view

_ Business query view

31.What are the methods for developing large software systems?

_ Waterfall method

_ Spiral method

32. How is the operation performed in the waterfall method?

The waterfall method performs a structured and systematic analysis at each step before proceeding to the next, which is like a waterfall falling from one step to the next.

33. How is the operation performed in the spiral method?

The spiral method involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

34. List out the steps of the data warehouse design process.

_ Choose a business process to model.

_ Choose the grain of the business process.

_ Choose the dimensions that will apply to each fact table record.

_ Choose the measures that will populate each fact table record.

35. Define ROLAP.

The ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

36. Define MOLAP.

The MOLAP model is a special-purpose server that directly implements multidimensional data and operations.

37. Define HOLAP.

The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.

38. What is an enterprise warehouse?

An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

39. What is a data mart?

A data mart is a database that contains a subset of the data present in a data warehouse. Data marts are created to structure the data in a data warehouse according to issues such as hardware platforms and access control strategies. We can divide a data warehouse into data marts after the data warehouse has been created. Data marts are usually implemented on low-cost departmental servers that are UNIX or Windows/NT based.

40. What are dependent and independent data marts?

Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.

41. What is a virtual warehouse?

A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

42. Define indexing.

Indexing is a technique used for efficient data retrieval, that is, for accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.

43. What are the types of indexing?

_ B-tree indexing

_ Bitmap indexing

_ Join indexing
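Of the three, bitmap indexing is the easiest to illustrate in a few lines: for each distinct value of a column, the index keeps one bit per row saying whether that row holds the value. The sketch below represents each bit vector as a Python list; the function name `build_bitmap_index` and the sample rows are assumptions for this example.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a list of 0/1 flags, one per row."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index

# Hypothetical item table.
rows = [
    {"type": "phone", "brand": "acme"},
    {"type": "laptop", "brand": "acme"},
    {"type": "phone", "brand": "zeta"},
]
type_index = build_bitmap_index(rows, "type")
brand_index = build_bitmap_index(rows, "brand")
print(type_index)  # {'phone': [1, 0, 1], 'laptop': [0, 1, 0]}

# A query on two conditions becomes a bitwise AND of two bit vectors.
matches = [a & b for a, b in zip(type_index["phone"], brand_index["acme"])]
print(matches)  # [1, 0, 0]
```

Bitmap indexes suit columns with few distinct values, because the number of bit vectors equals the number of distinct values.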

44. Define metadata.

Metadata in a data warehouse is data about data; that is, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

45. Define VLDB.

VLDB stands for Very Large Database. If the size of a database is greater than 100 GB, then the database is said to be a very large database.

Unit V

1. What are the classifications of tools for data mining?

• Commercial Tools

• Public domain Tools

• Research prototypes

2. What are commercial tools?

Commercial tools can be defined as the following products and are usually associated with the consulting activity of the same company:

1. 'Intelligent Miner' from IBM
2. 'SAS' System from SAS Institute
3. 'Thought' from Right Information Systems, etc.

3. What are public domain tools?

Public domain tools are largely freeware with just registration fees: 'Brute' from the University of Washington, and 'MC++' from Stanford University, Stanford, California.

4. What are research prototypes?

Some of the research products may find their way into the commercial market: 'DBMiner' from Simon Fraser University, British Columbia, and 'Mining Kernel System' from the University of Ulster, Northern Ireland.

5. What is the difference between generic single-task tools and generic multi-task tools?

Generic single-task tools generally use neural networks or decision trees. They cover only the data mining part and require extensive pre-processing and post-processing steps.

Generic multi-task tools offer modules for the pre-processing and post-processing steps and also offer a broad selection of popular data mining algorithms, such as clustering.

6. What are the areas in which data warehouses are used at present and in the future?

The potential subject areas in which data warehouses may be developed at present and in the future are:

1. Census data:

The Registrar General and Census Commissioner of India decennially compiles information on all individuals, villages, population groups, etc. This information is wide-ranging, such as the individual slip, a compilation of information on individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database, upon which OLAP techniques can be applied. Data mining can also be performed for analysis and knowledge discovery.

2. Prices of essential commodities:

The Ministry of Food and Civil Supplies, Government of India, compiles daily data for about 300 observation centers across the country on the prices of essential commodities such as rice, edible oil, etc. A data warehouse can be built for this data, and OLAP techniques can be applied for its analysis.

7. What are the other areas for data warehousing and data mining?

• Agriculture

• Rural development

• Health

• Planning

• Education

• Commerce and Trade

8. Specify some of the sectors in which data warehousing and data mining are used?

• Tourism

• Program Implementation

• Revenue

• Economic Affairs

• Audit and Accounts

9. Describe the use of DBMiner.

It is used to perform data mining functions, including characterization, association, classification, prediction and clustering.

10. Applications of DBMiner.

The DBMiner system can be used as a general-purpose online analytical mining system for both OLAP and data mining in relational databases and data warehouses.

It is used in medium to large relational databases with fast response time.

11. Give some data mining tools.

• DBMiner

• GeoMiner

• Multimedia Miner

• WeblogMiner

12. Mention some of the application areas of data mining.

• DNA analysis

• Financial data analysis

• Retail industry

• Telecommunication industry

• Market analysis

• Banking industry

• Health care analysis

13. Differentiate data query and knowledge query.

A data query finds concrete data stored in a database and corresponds to a basic retrieval statement in a database system.

A knowledge query finds rules, patterns and other kinds of knowledge in a database and corresponds to querying database knowledge, including deduction rules, integrity constraints, generalized rules, frequent patterns and other regularities.

14. Differentiate direct query answering and intelligent query answering.

Direct query answering means that a query is answered by returning exactly what is being asked.

Intelligent query answering consists of analyzing the intent of the query and providing generalized, neighborhood, or associated information relevant to the query.

15. Define visual data mining.

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. It is the integration of data visualization and data mining.

16. What does audio data mining mean?

Audio data mining uses audio signals to indicate patterns of data or the features of data mining results. Patterns are transformed into sound and music, so that interesting or unusual patterns can be identified by listening to pitches, rhythms, tune and melody.

Steps involved in DNA analysis

• Semantic integration of heterogeneous, distributed genome databases

• Similarity search and comparison among DNA sequences

• Association analysis: identification of co-occurring gene sequences

• Path analysis: linking genes to different stages of disease development

• Visualization tools and genetic data analysis

17. What are the factors involved while choosing a data mining system?

• Data types

• System issues

• Data sources

• Data mining functions and methodologies

• Coupling data mining with database and/or data warehouse systems

• Scalability

• Visualization tools

• Data mining query language and graphical user interface

18. Define DMQL.

DMQL stands for Data Mining Query Language. It specifies clauses and syntax for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. It also uses SQL-like syntax to mine databases.

19. Define text mining.

Text mining is the extraction of meaningful information from large amounts of free-format textual data. It is useful in artificial intelligence and pattern matching. It is also known as text data mining, knowledge discovery from text, or content analysis.

20. What does web mining mean?

Web mining is a technique to process information available on the web and search for useful data. It is used to discover web pages, text documents, multimedia files, images, and other types of resources from the web. It is used in several fields such as e-commerce, information filtering, fraud detection, and education and research.

21. Define spatial data mining.

Spatial data mining is the extraction of undiscovered and implied spatial information. Spatial data is data that is associated with a location. It is used in several fields such as geography, geology, medical imaging, etc.

22. Explain multimedia data mining.

Multimedia data mining mines large databases. It does not retrieve any specific information from multimedia databases; instead it derives new relationships, trends, and patterns from stored multimedia data. It is used in medical diagnosis, stock markets, the animation industry, the airline industry, traffic management systems, surveillance systems, etc.

16 MARKS QUESTIONS AND ANSWERS

UNIT-I

1. Explain the evolution of database technology?

_ Data collection and database creation

_ Database management systems

_ Advanced database systems

_ Data warehousing and data mining

_ Web-based database systems

_ New generation of integrated information systems

2. Explain the steps of knowledge discovery in databases?

_ Data cleaning

_ Data integration

_ Data selection

_ Data transformation

_ Data mining

_ Pattern evaluation

_ Knowledge presentation

3. Explain the architecture of a data mining system?

_ Database, data warehouse, or other information repository

_ Database or data warehouse server

_ Knowledge base

_ Data mining engine

_ Pattern evaluation module

_ Graphical user interface

4. Explain various tasks in data mining?

(Or)

Explain the taxonomy of data mining tasks?

_ Predictive modeling

• Classification

• Regression

• Time series analysis

_ Descriptive modeling

• Clustering

• Summarization

• Association rules

• Sequence discovery

5. Explain various techniques in data mining?

_ Statistics (or) statistical perspectives

• Point estimation

• Data summarization

• Bayesian techniques

• Hypothesis testing

• Correlation

_ Regression

_ Machine learning

_ Decision trees

_ Hidden Markov models

_ Artificial neural networks

_ Genetic algorithms

_ Meta learning

UNIT-II

6. Explain the issues regarding classification and prediction?

_ Preparing the data for classification and prediction

o Data cleaning

o Relevance analysis

o Data transformation

_ Comparing classification methods

o Predictive accuracy

o Speed

o Robustness

o Scalability

o Interpretability

7. Explain classification by decision tree induction?

_ Decision tree induction

_ Attribute selection measure

_ Tree pruning

_ Extracting classification rules from decision trees
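
One common attribute selection measure is information gain. The sketch below is a minimal illustration under assumed data: the toy `rows` and the attribute names are hypothetical, and other measures (gain ratio, Gini index) would be computed differently.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical training tuples, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# The attribute with the highest gain would be chosen as the split at this node.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

In this toy data `outlook` separates the classes perfectly while `windy` tells nothing, so the tree would split on `outlook` first.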

8. Write short notes on patterns?

_ Pattern definition

_ Objective measures

_ Subjective measures

_ Can a data mining system generate all of the interesting patterns?

_ Can a data mining system generate only interesting patterns?

9. Explain mining single-dimensional Boolean association rules from transactional databases?

_ The Apriori algorithm: finding frequent itemsets using candidate generation

_ Mining frequent itemsets without candidate generation

10. Explain the Apriori algorithm?

_ Apriori property

_ Join step

_ Prune step

_ Example

_ Algorithm
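
The join and prune steps can be sketched as follows. This is a compact illustration, not an optimized implementation: `min_support` is taken as an absolute transaction count, and the sample transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        """Count each candidate's support and keep only the frequent ones."""
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: by the Apriori property, every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(transactions, min_support=2)
```

Here the candidate {bread, milk, butter} is pruned without scanning the data, because its subset {milk, butter} is not frequent.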

11. Explain how the efficiency of Apriori is improved?

_ Hash-based technique (hashing itemset counts)

_ Transaction reduction (reducing the number of transactions scanned in future iterations)

_ Partitioning (partitioning the data to find candidate itemsets)

_ Sampling (mining on a subset of the given data)

_ Dynamic itemset counting (adding candidate itemsets at different points during a scan)

12. Explain mining frequent itemsets without candidate generation?

_ Frequent pattern growth (or) FP-growth

_ Frequent pattern tree (or) FP-tree

_ Algorithm

13. Explain mining multi-dimensional Boolean association rules from transaction databases?

_ Multi-dimensional (or) multilevel association rules

_ Approaches to mining multilevel association rules

• Using uniform minimum support for all levels

• Using reduced minimum support at lower levels

o Level-by-level independent

o Level-cross filtering by single item

o Level-cross filtering by k-itemset

_ Checking for redundant multilevel association rules

14. Explain constraint-based association mining?

_ Knowledge type constraints

_ Data constraints

_ Dimension/level constraints

_ Interestingness constraints

_ Rule constraints

_ Metarule-guided mining of association rules

_ Mining guided by additionalrule constraints

UNIT-III

15. Explain regression in predictive modeling?

_ Regression definition

_ Linear regression

_ Multiple regression

_ Non-linear regression

_ Other regression models
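
For the linear regression item, a least-squares fit of a straight line y = a + b*x can be sketched in a few lines. The function name `fit_line` and the sample points are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
```

Multiple regression extends the same idea to several predictor variables, which is usually solved with matrix methods rather than these scalar formulas.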

16. Explain the statistical perspective in data mining?

_ Point estimation

_ Data summarization

_ Bayesian techniques

_ Hypothesis testing

_ Regression

_ Correlation

17. Explain Bayesian classification.

_ Bayes' theorem

_ Naïve Bayesian classification

_ Bayesian belief networks

_ Bayesian learning
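
Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X), underlies all of the items above. The sketch below evaluates it for a two-hypothesis case; the function name `posterior` and the probabilities are hypothetical values chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(H|X) from P(H), P(X|H) and P(X|not H).

    P(X) is expanded with the law of total probability.
    """
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prior, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
```

With these assumed numbers the posterior is 2/13, about 0.154, which shows how a low prior keeps the posterior small even when the evidence is fairly reliable.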

18. Discuss the requirements of clustering in data mining.

_ Scalability

_ Ability to deal with different types of attributes

_ Discovery of clusters with arbitrary shape

_ Minimal requirements for domain knowledge to determine input parameters

_ Ability to deal with noisy data

_ Insensitivity to the order of input records

_ High dimensionality

_ Interpretability and usability

19. Explain the types of data in cluster analysis.

_ Interval-scaled variables

_ Binary variables

o Symmetric binary variables

o Asymmetric binary variables

_ Nominal variables

_ Ordinal variables

_ Ratio-scaled variables

20. Explain the partitioning method of clustering.

_ K-means clustering

_ K-medoids clustering
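
The k-means idea of alternating an assignment step and an update step can be sketched on one-dimensional data. The function name `k_means_1d`, the points and the starting centroids below are assumptions for illustration; real implementations work on multi-dimensional data and choose initial centroids more carefully.

```python
def k_means_1d(points, centroids, iterations=10):
    """Basic k-means on numbers; returns (centroids, clusters).

    `iterations` must be at least 1.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            break  # converged: centroids no longer move
        centroids = new
    return centroids, clusters


# Hypothetical data with two obvious groups.
points = [1, 2, 3, 10, 11, 12]
final_centroids, final_clusters = k_means_1d(points, centroids=[1, 12])
```

K-medoids follows the same loop but represents each cluster by one of its actual points instead of the mean, which makes it less sensitive to outliers.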

21. Explain visualization in data mining.

Various forms of visualizing the discovered patterns:

_ Rules

_ Table

_ Crosstab

_ Pie chart

_ Bar chart

_ Decision tree

_ Data cube

_ Histogram

_ Quantile plots

_ q-q plots

_ Scatterplots

_ Loess curves

UNIT IV

22. Discuss the components of a data warehouse.

_ Subject-oriented

_ Integrated

_ Time-Variant

_ Non-volatile

23. List out the differences between OLTP and OLAP.

_ Users and system orientation

_ Data contents

_ Database design

_ View

_ Access patterns

24. Discuss the various schematic representations in the multidimensional model.

_ Star schema

_ Snowflake schema

_ Fact constellation schema

25. Explain the OLAP operations in the multidimensional model.

_ Roll-up

_ Drill-down

_ Slice and dice

_ Pivot or rotate
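
Roll-up and slice can be sketched on a tiny fact table held in plain Python. The fact rows, the dimension positions and the helper names `roll_up` and `slice_` are assumptions for illustration; an OLAP server would perform the same operations on a data cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, sales).
facts = [
    (2023, "Q1", "north", 10),
    (2023, "Q2", "north", 20),
    (2023, "Q1", "south", 5),
    (2024, "Q1", "north", 30),
]


def roll_up(rows, keep):
    """Sum sales grouped by the dimension positions listed in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in keep)] += row[3]
    return dict(totals)


def slice_(rows, position, value):
    """Select the sub-cube where one dimension equals `value`."""
    return [row for row in rows if row[position] == value]


by_year = roll_up(facts, keep=[0])             # roll-up: quarter level up to year
by_quarter = roll_up(facts, keep=[0, 1])       # drill-down: back to year and quarter
north_only = slice_(facts, position=2, value="north")
by_year_north = roll_up(north_only, keep=[0])  # slice on region, then roll up
```

Dice applies the same selection on two or more dimensions at once, and pivot only changes which dimensions are shown as rows and columns without changing the totals.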

26. Explain the design and construction of a data warehouse.

_ Design of a data warehouse

• Top-down view

• Data source view

• Data warehouse view

• Business query view

_ Process of data warehouse design

27. Explain the three-tier data warehouse architecture.

_ Warehouse database server (bottom tier)

_ OLAP server (middle tier)

_ Client (top tier)

28. Explain indexing.

_ Definition

_ B-Tree indexing

_ Bitmap indexing

_ Join indexing

29. Write notes on metadata repository.

_ Definition

_ Structure of the data warehouse

_ Operational metadata

_ Algorithms used for summarization

_ Mapping from operational environment to data warehouse

_ Data related to system performance

_ Business metadata

30. Write short notes on VLDB.

_ Definition

_ Challenge related to database technologies

_ Issues in VLDB

UNIT V

31. Explain data mining applications for biomedical and DNA data analysis.

_ Semantic integration of heterogeneous, distributed genome databases

_ Similarity search and comparison among DNA sequences

_ Association analysis

_ Path analysis

_ Visualization tools and genetic data analysis

32. Explain data mining applications for financial data analysis.

_ Loan payment prediction and customer credit policy analysis

_ Classification and clustering of customers for targeted marketing

_ Detection of money laundering and other financial crimes

33. Explain data mining applications for the retail industry.

_ Multidimensional analysis of sales, customers, products, time and region.

_ Analysis of the effectiveness of sales campaigns.

_ Customer retention: analysis of customer loyalty.

_ Purchase recommendation and cross-reference of items.

34. Explain data mining applications for the telecommunication industry.

_ Multidimensional analysis of telecommunication data.

_ Fraudulent pattern analysis and the identification of unusual patterns.

_ Multidimensional association and sequential pattern analysis.

_ Use of visualization tools in telecommunication data analysis.

35. Explain the DBMiner tool in data mining.

_ System architecture

_ Input and output

_ Data mining tasks supported by the system

_ Support of task and method selection

_ Support of the KDD process

_ Main applications

_ Current status

36. Explain how data mining is used in health care analysis.

_ Health care data mining and its aims

_ Health care data mining technique

_ Segmenting patients into groups

_ Identifying patients with recurring health problems

_ Relation between disease and symptoms

_ Curbing the treatment costs

_ Predicting medical diagnosis

_ Medical research

_ Hospital administration

_ Applications of data mining in health care

_ Conclusion

37. Explain how data mining is used in the banking industry.

_ Data collected by data mining in banking

_ Banking data mining tools

_ Mining customer data of bank

_ Mining for prediction and forecasting

_ Mining for fraud detection

_ Mining for cross-selling bank services

_ Mining for identifying customer preferences

_ Applications of data mining in banking

_ Conclusion

38. Explain the types of data mining.

_ Audio data mining

_ Video data mining

_ Image data mining

_ Scientific and statistical data mining
