
Page 1: [Fall 2011] CS-402 Data Mining - Final Exam-SUB_v03

GIFT UNIVERSITY, GUJRANWALA

(Chartered by the Govt. of the Punjab, Recognized by HEC)

Final Term Examination Fall 2011

Data Mining (CS-402)

Lecturer: Mr. Nadeem Qaisar Mehmood

Total Marks: 100 Duration: 3 hours

Candidate Name: Candidate Roll No:

Instructions to Candidates:

Candidates are required to sit on the seats assigned to them by the invigilators.
Do not open this question paper until you have been told to do so by the invigilator.
Please fill in exam-specific details in the space provided (on both the Question Paper and the Answer Sheet).
This is a Closed Book Exam. "Closed book examinations" are examinations into which the candidate may not take any study materials (including textbooks, study guides, lecture notes, printed notes from web pages, handwritten notes, and any audio/visual aids).
There are 4 sections, A to D. The total marks available are 100.
o Section A: Three questions; attempt only 2. Each question carries 20 marks.
o Section B: One question of 25 marks.
o Section C: One question of 20 marks.
o Section D: One question of 15 marks.
o Each question is subdivided into sub-parts.
o Formulae are given at the end of the question paper.
Do not write anything on the question paper except your Name and Roll Number.
The question paper consists of 06 pages in total, including the title page.
Calculators are allowed, except programmable ones.
Each student must attempt the paper on the answer sheet and return the answer sheet to the invigilator. Students may keep the question paper.

EXA-114

Department of Examinations, GIFT University, Pakistan

Section A
Attempt only two questions from Section A.

Question 01: (Total Marks 20)
Question 01(a): (09 Marks)
What are the different types of data sets? Provide some relevant examples.

Question 01(b): (06 Marks)
Find the following proximity measures for the two binary vectors p and q:
p = (0 1 1 0 0 0 0 0 0 1), q = (0 1 0 0 0 0 1 0 0 1)
Jaccard, Cosine, Hamming
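The three measures for these two vectors can be worked out from the match counts alone. A minimal Python sketch (standard library only; "Cosin" in the paper is read as cosine similarity):

```python
# Proximity measures for the two binary vectors of Question 01(b).
import math

p = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
q = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]

# Counts of matching/mismatching positions.
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)  # both 1
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)  # 1 in p only
m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)  # 1 in q only

jaccard = m11 / (m11 + m10 + m01)          # ignores 0-0 matches
cosine = m11 / math.sqrt(sum(p) * sum(q))  # dot product over vector norms
hamming = m10 + m01                        # number of differing positions

print(jaccard, cosine, hamming)  # 0.5, ~0.667, 2
```

With these vectors m11 = 2, m10 = 1, m01 = 1, giving Jaccard = 0.5, cosine = 2/3, and Hamming distance = 2.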

Question 01(c): (05 Marks)
Which proximity measure would you prefer to use if you are finding distances between class instances but the record data matrix to be mined for clusters is very sparse?

Question 02: (Total 20 Marks)
Question 02(a): (15 Marks)

The data in Table 1 contains information about eye patients who were prescribed lenses based on their age, spectacle prescription, and astigmatism reports. The age varies between young, pre-presbyopic, and presbyopic; the spectacle prescription can be myope or hypermetrope; and a patient either has astigmatism or not. Use this data and apply decision trees so that the data is classified to decide which kind of lenses to choose, hard or soft. The formulae and measurements given at the end of this question paper will help you perform the classification.

Table 1: Data set for Question 02(a)

   AGE             ASTIGMA  SPECTACLE     CONTACT LENSES
1  Young           No       Myope         Soft
2  Young           Yes      Myope         Hard
3  Young           No       Hypermetrope  Soft
4  Young           Yes      Hypermetrope  Hard
5  Pre-presbyopic  No       Myope         Soft
6  Pre-presbyopic  Yes      Myope         Hard
7  Pre-presbyopic  No       Hypermetrope  Soft
8  Presbyopic      Yes      Myope         Hard
9  Presbyopic      No       Hypermetrope  Soft

Some Entropy Measurements:

Question 02(b): (05 Marks)
Discuss the hypothesis space of ID3-based decision tree learning.

Question 03: (Total 20 Marks)
Question 03(a): (14 Marks)
Consider the training examples shown in Table 2 for a binary classification problem.

a) Compute the Gini index for the overall collection of training examples.
b) Compute the Gini index for the Customer ID attribute.
c) Compute the Gini index for the Gender attribute.
d) Compute the Gini index for the Car Type attribute using a multiway split.
e) Compute the Gini index for the Shirt Size attribute using a multiway split.
f) Which attribute is better: Gender, Car Type, or Shirt Size?

[2+,1-] = 0.9183   [1+,2-] = 0.9183   [2+,2-] = 1   [2+,0-] = 0   [2+,3-] = 0.971
[3+,2-] = 0.971    [3+,1-] = 0.8113   [3+,2-] = 0.971   [5+,1-] = 0.650
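The precomputed values above follow from the binary-entropy formula Entropy(S) = -P+ log2 P+ - P- log2 P-. A short Python sketch to reproduce any of them:

```python
# Entropy of a node from its positive/negative example counts.
import math

def entropy(pos, neg):
    """Entropy = -sum of p*log2(p) over the two classes (0*log 0 taken as 0)."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            frac = count / total
            h -= frac * math.log2(frac)
    return h

print(round(entropy(2, 1), 4))  # 0.9183
print(round(entropy(3, 1), 4))  # 0.8113
print(round(entropy(5, 1), 4))  # 0.65
```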


Table 2: Data set for Question 03(a)

Customer ID  Gender  Car Type  Shirt Size  Class

1 M Family Small C0

2 M Sports Medium C0

3 M Sports Medium C0

4 M Sports Large C0

5 M Sports Extra Large C0

6 M Sports Extra Large C0

7 F Sports Small C0

8 F Sports Small C0

9 F Sports Medium C0

10 F Luxury Large C0

11 M Family Large C1

12 M Family Extra Large C1

13 M Family Medium C1

14 M Luxury Extra Large C1

15 F Luxury Small C1

16 F Luxury Small C1

17 F Luxury Medium C1

18 F Luxury Medium C1

19 F Luxury Medium C1

20 F Luxury Large C1

g) Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini.
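The Gini computations in parts a)–f) can be checked mechanically. A Python sketch over the rows of Table 2 (transcribed as gender, car type, shirt size, class):

```python
# Gini index for the overall collection and for multiway attribute splits.
rows = [
    ("M","Family","Small","C0"), ("M","Sports","Medium","C0"),
    ("M","Sports","Medium","C0"), ("M","Sports","Large","C0"),
    ("M","Sports","Extra Large","C0"), ("M","Sports","Extra Large","C0"),
    ("F","Sports","Small","C0"), ("F","Sports","Small","C0"),
    ("F","Sports","Medium","C0"), ("F","Luxury","Large","C0"),
    ("M","Family","Large","C1"), ("M","Family","Extra Large","C1"),
    ("M","Family","Medium","C1"), ("M","Luxury","Extra Large","C1"),
    ("F","Luxury","Small","C1"), ("F","Luxury","Small","C1"),
    ("F","Luxury","Medium","C1"), ("F","Luxury","Medium","C1"),
    ("F","Luxury","Medium","C1"), ("F","Luxury","Large","C1"),
]

def gini(labels):
    """Gini(t) = 1 - sum over classes of p(i|t)^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(rows, attr_index):
    """Weighted Gini of a multiway split on the given attribute column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

print(gini([r[-1] for r in rows]))  # overall: 0.5
print(split_gini(rows, 0))          # Gender: 0.48
print(split_gini(rows, 1))          # Car Type: 0.1625
```

Car Type gives the lowest weighted Gini of the three usable attributes, which is the expected answer to part f).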

Question 03(b): (06 Marks)
Describe the following terms:

a) Model overfitting
b) Occam's razor principle

Section B: "Alternate Classification"
Question 04: (Total 25 Marks)
Question 04(a): (08 Marks)
Consider a training set that contains 100 positive examples and 400 negative examples, and the following candidate rules:

R1: A → + (covers 4 positive and 1 negative examples)
R2: B → + (covers 30 positive and 10 negative examples)
R3: C → + (covers 100 positive and 90 negative examples)

Determine which is the best and which is the worst candidate rule according to:

a) Rule accuracy
b) FOIL information gain
c) The likelihood ratio statistic
d) The Laplace measure
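The four measures can be evaluated directly from the coverage counts. A Python sketch, assuming k = 2 classes for the Laplace measure and log base 2 for the likelihood ratio statistic (the base used in your notes may differ):

```python
# Rule-evaluation measures for R1-R3, with P positive and N negative
# training examples in total.
import math

P, N, k = 100, 400, 2
rules = {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}  # (pos, neg) covered

accuracy, laplace, foil, likelihood = {}, {}, {}, {}
for name, (p, n) in rules.items():
    accuracy[name] = p / (p + n)
    laplace[name] = (p + 1) / (p + n + k)
    # FOIL gain relative to the initial rule, which covers all P and N.
    foil[name] = p * (math.log2(p / (p + n)) - math.log2(P / (P + N)))
    # Expected counts if the rule covered examples at the base rates.
    e_pos = (p + n) * P / (P + N)
    e_neg = (p + n) * N / (P + N)
    likelihood[name] = 2 * (p * math.log2(p / e_pos) + n * math.log2(n / e_neg))

for name in rules:
    print(name, accuracy[name], laplace[name], foil[name], likelihood[name])
```

For R1, for example, accuracy = 4/5 = 0.8, Laplace = 5/7, and FOIL gain = 4 × (log2 0.8 − log2 0.2) = 8.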

Page 4: [Fall 2011] CS-402 Data Mining - Final Exam-SUB_v03

Department of Examinations GIFT University, Pakistan

Page 4 of 6

Question 04(b): (04 Marks)
Answer the following questions.

a) Suppose you are using a direct method of generating classification rules and the generated rules are not mutually exclusive. What appropriate measures will you take?
b) Suppose you are using a direct method of generating classification rules and the generated rules are not exhaustive. What appropriate measures will you take?

Question 04(c): (08 Marks)
Consider the one-dimensional data set shown in Table 3.

Table 3: Data set for Question 04(c)
x  0.5  3.0  4.5  4.6  4.9  5.2  5.3  5.6  7.0  8.0  9.2  9.5
y  no   no   yes  yes  yes  no   no   yes  no   no   no   no

Classify the data point x = 5.5 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote).
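The classification amounts to sorting the points by distance from 5.5 and taking a majority vote over the k closest. A Python sketch (note that for even label splits, the tie-breaking below is arbitrary; here no k produces a tie):

```python
# Majority-vote k-NN classification of x = 5.5 over the points in Table 3.
data = [(0.5, "no"), (3.0, "no"), (4.5, "yes"), (4.6, "yes"), (4.9, "yes"),
        (5.2, "no"), (5.3, "no"), (5.6, "yes"), (7.0, "no"), (8.0, "no"),
        (9.2, "no"), (9.5, "no")]

def knn_classify(x, k):
    """Label of x by majority vote among its k nearest neighbors."""
    neighbors = sorted(data, key=lambda pt: abs(pt[0] - x))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

for k in (1, 3, 5, 9):
    print(k, knn_classify(5.5, k))  # 1: yes, 3: no, 5: yes, 9: no
```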

Question 04(d): (05 Marks)
What types of problems can be solved with Artificial Neural Networks? How will you train your ANN model?

Section C: "Association Rules"
Question 05: (Total 20 Marks)
Question 05(a): (10 Marks)
Consider the market basket transactions shown in Table 4.

Table 4: Data set for Question 05(a)

Transaction ID  Items Bought

1 {Milk, Beer, Diapers}

2 {Bread, Butter, Milk}

3 {Milk, Diapers, Cookies}

4 {Bread, Butter, Cookies}

5 {Beer, Cookies, Diapers}

6 {Milk, Diapers, Bread, Butter}

7 {Bread, Butter, Diapers}

8 {Beer, Diapers}

9 {Milk, Diapers, Bread, Butter}

10 {Beer, Cookies, Biscotti}

a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
d) Suppose a and b are a pair of items such that the rules {a} → {b} and {b} → {a} have the same confidence. Prove this for the itemset {Bread, Beer}.
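Parts a) and d) can be checked computationally: with d distinct items, the standard bound on the number of rules is 3^d − 2^(d+1) + 1, and the confidences in d) follow from the support counts. A Python sketch over the transactions of Table 4:

```python
# Support counts and rule confidences for the Table 4 transactions.
transactions = [
    {"Milk", "Beer", "Diapers"}, {"Bread", "Butter", "Milk"},
    {"Milk", "Diapers", "Cookies"}, {"Bread", "Butter", "Cookies"},
    {"Beer", "Cookies", "Diapers"}, {"Milk", "Diapers", "Bread", "Butter"},
    {"Bread", "Butter", "Diapers"}, {"Beer", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"}, {"Beer", "Cookies", "Biscotti"},
]

def support_count(itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

d = len(set().union(*transactions))    # number of distinct items
max_rules = 3**d - 2**(d + 1) + 1      # all rules, including zero-support ones

# Part (d): conf({Bread} -> {Beer}) vs conf({Beer} -> {Bread}).
joint = support_count({"Bread", "Beer"})
conf_bread_beer = joint / support_count({"Bread"})
conf_beer_bread = joint / support_count({"Beer"})
print(d, max_rules, conf_bread_beer, conf_beer_bread)
```

Since {Bread, Beer} never occurs together in these transactions, both confidences are 0 and hence equal.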

Question 05(b): (10 Marks)
The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 1.


Figure 1

a) Given a transaction that contains items {1, 3, 4, 5, 9}, which of the hash tree leaf nodes will be visited when finding the candidates of the transaction?
b) Use the visited leaf nodes to determine the candidate itemsets that are contained in the transaction {1, 3, 4, 5, 9}.

Section D: "Clustering"
Question 06: (Total 15 Marks)
Question 06(a): (10 Marks)
Use the similarity matrix in Table 5 to perform single-link and complete-link hierarchical clustering for the next level of cluster generation. The coordinates of the points are shown in Table 6 and are depicted graphically in Euclidean space in Figure 2. The first two clusters, formed using both techniques, are shown in Figure 2. Your task is to find the next cluster for both linkage types.

Show your results by drawing a dendrogram.

Clusters so far are: {3, 6}, {2, 5}, {1}, {4}
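Table 5 is not reproduced in this transcript, so the distances below are a hypothetical stand-in; what the question exercises is the merge logic, which is the same either way. Single link merges the pair of clusters with the smallest minimum pairwise distance, complete link the pair with the smallest maximum pairwise distance. A Python sketch:

```python
# Next merge under single-link vs complete-link, starting from the
# clusters {3,6}, {2,5}, {1}, {4}. The distance matrix is HYPOTHETICAL,
# standing in for the missing Table 5.
import itertools

dist = {frozenset(pair): d for pair, d in {
    (1, 2): 0.9, (1, 3): 0.6, (1, 4): 0.8, (1, 5): 0.7, (1, 6): 0.5,
    (2, 3): 0.7, (2, 4): 0.6, (2, 5): 0.2, (2, 6): 0.8,
    (3, 4): 0.9, (3, 5): 0.6, (3, 6): 0.1,
    (4, 5): 0.65, (4, 6): 0.7,
    (5, 6): 0.4,
}.items()}

clusters = [{3, 6}, {2, 5}, {1}, {4}]  # clusters formed so far

def next_merge(clusters, linkage):
    """Pair of clusters the given linkage would merge next."""
    agg = min if linkage == "single" else max  # single=min dist, complete=max
    return min(
        itertools.combinations(clusters, 2),
        key=lambda pair: agg(dist[frozenset({a, b})]
                             for a in pair[0] for b in pair[1]),
    )

print(next_merge(clusters, "single"))
print(next_merge(clusters, "complete"))
```

With these made-up distances the two linkages choose different merges ({3, 6} with {2, 5} for single link, {3, 6} with {1} for complete link), which is exactly the kind of divergence the question asks you to show in the dendrogram.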


Question 06(b): (05 Marks)

A data set was analyzed to make clusters. K-means clustering was applied to the data set a first time, and the result was three clusters, as shown in Figure 3(A). However, when K-means was applied a second time, the three clusters were quite different from the first result, as shown in Figure 3(B).

Figure 3

Discuss in detail, based on these observations:

a. Centroid selection (and the choice of K)

b. Sum of squared error (SSE)
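The behavior in Figure 3 comes from K-means minimizing SSE only locally: different initial centroids can converge to different clusterings with different SSE. A minimal one-dimensional Python sketch on hypothetical data:

```python
# K-means sensitivity to initial centroids, illustrated with HYPOTHETICAL
# 1-D data: well-spread seeds vs seeds crowded into one region.
def kmeans(points, centroids, iters=20):
    """Plain K-means; returns final centroids and the SSE of the result."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum((p - centroids[i]) ** 2
              for i, c in enumerate(clusters) for p in c)
    return centroids, sse

points = [1.0, 1.2, 1.4, 5.0, 5.2, 9.0, 9.4]       # hypothetical data
_, sse_good = kmeans(points, [1.0, 5.0, 9.0])      # well-spread seeds
_, sse_bad = kmeans(points, [1.0, 1.2, 1.4])       # seeds in one region
print(sse_good, sse_bad)  # the crowded seeding yields a much larger SSE
```

Comparing the SSE of several runs (or using smarter seeding) is the usual remedy, which is the point the discussion in a. and b. should make.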

Formulae

Entropy(S) = -P_{+} \log_2 P_{+} - P_{-} \log_2 P_{-}

Gini(t) = 1 - \sum_i [p(i \mid t)]^2

Classification\ error(t) = 1 - \max_i [p(i \mid t)]

GAIN = Entropy(S) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

Likelihood\ Ratio = 2 \sum_{i=1}^{k} f_i \log(f_i / e_i)

Laplace = \frac{f_{+} + 1}{n + k}

m\text{-}estimate = \frac{f_{+} + k\,p_{+}}{n + k}

FOIL\ Information\ Gain = p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)

-End of Question Paper-