[Fall 2012] CS-402 DM - Midterm Exam
BSCS Honors Program CS-402 GIFT University Gujranwala
Course: Data Mining 14-December-2012 (Fall 2012)
Resource Person: Nadeem Qaisar Mehmood MIDTERM EXAMINATION
Max. Marks: 50 Time Allowed: 70 Minutes
Question 01: Short Questions (Marks 5x5)
1. In the context of rule-based classification, discuss the direct and indirect methods of building rules. Which method gives you robust and exhaustive rules, and why?
2. Discuss the hypothesis space of ID3 algorithm based Decision Tree learning.
3. Which proximity measure would you prefer to use if you are finding distances between class instances, but the record data matrix to be mined for clusters is very sparse?
4. Find the following proximity measures for the two binary vectors p = (0 1 1 0 0 0 0 0 0 1) and q = (0 1 0 0 0 0 1 0 0 1): Jaccard, Cosine, Hamming.
5. In rule-based classification, what are the three different methods to evaluate a rule?
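The three measures asked for in Question 1(4) can be checked numerically. A minimal sketch (the variable names are mine, not part of the exam):

```python
import math

# Binary vectors from Question 1(4)
p = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
q = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]

# Jaccard coefficient: 1-1 matches over positions where at least one vector is 1
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
mismatches = sum(1 for a, b in zip(p, q) if a != b)
jaccard = m11 / (m11 + mismatches)

# Cosine similarity: dot product over the product of Euclidean norms
dot = sum(a * b for a, b in zip(p, q))
cosine = dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

# Hamming distance: number of positions where the vectors differ
hamming = sum(1 for a, b in zip(p, q) if a != b)

print(jaccard, cosine, hamming)  # 0.5, ~0.667, 2
```

With these vectors, m11 = 2 and there are 2 mismatched positions, so Jaccard = 2/4 = 0.5, cosine = 2/3, and the Hamming distance is 2.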
Question 02: (Marks 15)
AutoRash Traders has a broad automobile business with millions of customers. To market their new product in Texas City, they plan to target only a limited number of likely customers and advertise to them directly. To achieve this goal, they collected their previous advertising data about different customers, shown in the dataset below. Suggest a predictive modeling data mining technique that AutoRash Traders could use to predict whether a new customer will purchase a car or not.
Serial No | Marital Status | Home Owner | Annual Income | Bought Car
----------|----------------|------------|---------------|-----------
1         | Single         | Y          | Normal        | Yes
2         | Married        | Y          | Normal        | Yes
3         | Divorced       | N          | Low           | No
4         | Divorced       | Y          | High          | Yes
5         | Single         | N          | Low           | No
6         | Single         | N          | Normal        | No
Some Entropy Measurements:
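The entropy figures for a dataset like the one above can be reproduced directly. A sketch (the function and variable names are mine; the choice of Home Owner as the candidate split is my own illustration) computing the class entropy of the Bought Car column and the information gain of one split:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (Home Owner, Bought Car) pairs from the 6-row dataset
rows = [("Y", "Yes"), ("Y", "Yes"), ("N", "No"),
        ("Y", "Yes"), ("N", "No"), ("N", "No")]
labels = [bought for _, bought in rows]

# Class entropy: 3 Yes / 3 No -> 1 bit
e_parent = entropy(labels)

# Information gain of splitting on Home Owner:
# GAIN(S) = Entropy(S) - sum over partitions i of (n_i / n) * Entropy(i)
gain = e_parent
for value in ("Y", "N"):
    subset = [bought for owner, bought in rows if owner == value]
    gain -= (len(subset) / len(rows)) * entropy(subset)

print(e_parent, gain)  # 1.0, 1.0 (Home Owner separates the classes perfectly)
```

Here the Y partition is all Yes and the N partition is all No, so both child entropies are 0 and the gain equals the full parent entropy of 1 bit.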
Question 03: (Marks 15)
Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules:
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),
the rule accuracy measures are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore R1 is the best candidate and R3 is the worst candidate according to rule accuracy. However, this is not always the obvious conclusion. Why? Explain the situation with the help of FOIL's information gain.
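FOIL's information gain for the three rules can be computed from the formula p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0))), where p0, n0 are the positive and negative counts in the full training set and p1, n1 are the counts covered by the rule. A sketch (the helper name is mine):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain: p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Training set from Question 3: 100 positive, 400 negative examples
p0, n0 = 100, 400

rules = {
    "R1": (4, 1),     # accuracy 4/5    = 80.0%
    "R2": (30, 10),   # accuracy 30/40  = 75.0%
    "R3": (100, 90),  # accuracy 100/190 = 52.6%
}

for name, (p1, n1) in rules.items():
    print(name, round(foil_gain(p0, n0, p1, n1), 1))
# R1   8.0  -> best by accuracy, worst by FOIL's gain
# R2  57.2
# R3 139.6  -> worst by accuracy, best by FOIL's gain
```

Because FOIL's gain multiplies the precision improvement by the rule's positive coverage, R3 (which covers all 100 positives) scores highest even though its accuracy is lowest, which is exactly the tension the question asks you to explain.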
END OF EXAM
Formula sheet (provided with the exam):

FOIL's Information Gain = p1 * ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) )

GAIN(S) = Entropy(S) - SUM over i = 1..k of (n_i / n) * Entropy(i)

Some entropy values: Entropy[2+, 2-] = 1, Entropy[2+, 0-] = 0, Entropy[2+, 3-] = 0.971, Entropy[3+, 2-] = 0.971, Entropy[3+, 1-] = 0.811, Entropy[5+, 1-] = 0.650

Gini(t) = 1 - SUM over i = 0..c-1 of [ p(i | t) ]^2

Entropy = E(S) = - P(+) log2 P(+) - P(-) log2 P(-)