homework 2 - infolab


Overview of Data Informatics for Big Data

Summer 2017

HOMEWORK 2 Due Date: July 12, 2017. 1:00PM

1. A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = Advertising dollars and S = Sales in thousands $.

   1) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.
   2) Plot the scatter diagram and the regression line.
   3) Explain how to interpret the slope of the line in this problem.
   4) Find r² and interpret it in the words of the problem.
   5) Use the line to predict the Sales if Advertising dollars = $50 K.

ANSWER: Let Sales be represented by y_i and Advertising Money be represented by x_i. The quantities needed for the linear regression and the coefficient of determination are:

         AD -> x  S -> y  x-avg(x)  y-avg(y)  (x-avg(x))^2  (x-avg(x))*(y-avg(y))
1        22       64      -22.2     -56.2     492.84        1247.64
2        25       74      -19.2     -46.2     368.64        887.04
3        29       82      -15.2     -38.2     231.04        580.64
4        35       90      -9.2      -30.2     84.64         277.84
5        38       100     -6.2      -20.2     38.44         125.24
6        42       120     -2.2      -0.2      4.84          0.44
7        46       120     1.8       -0.2      3.24          -0.36
8        52       142     7.8       21.8      60.84         170.04
9        65       180     20.8      59.8      432.64        1243.84
10       88       230     43.8      109.8     1918.44       4809.24
Sum      442      1202                        3635.6        9341.6
Average  44.2     120.2

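Part (1): The regression line is y = 2.5695x + 6.629, where slope = 9341.6 / 3635.6 = 2.5695 and intercept = avg(y) - slope * avg(x) = 6.629. The same equation is shown on the chart.

AD   22   25   29   35   38   42   46   52   65   88
S    64   74   82   90   100  120  120  142  180  230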
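The sums in the table can be checked with a short script. This is a minimal sketch using the sample data as given; the function name `fit_line` and the variable names are illustrative, not part of the assignment. Because it keeps the unrounded slope, the prediction at x = 50 can differ from a hand calculation in the last printed digit.

```python
# Least-squares fit for the advertising/sales sample (values copied from the table).
ad = [22, 25, 29, 35, 38, 42, 46, 52, 65, 88]            # x: advertising, thousand $
sales = [64, 74, 82, 90, 100, 120, 120, 142, 180, 230]   # y: sales, thousand $


def fit_line(x, y):
    """Return (slope, intercept, r_squared) for the line y = slope * x + intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products of the deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_line(ad, sales)
prediction_at_50 = slope * 50 + intercept

print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r_squared:.4f}")
print(f"predicted sales at x = 50: {prediction_at_50:.3f} (thousand $)")
```

Computing R² as s_xy² / (s_xx * s_yy) is valid for simple linear regression with one predictor, which is the case here.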


Part (3): The slope of the line is positive, which indicates that sales increase as advertising expenditure increases: each additional $1,000 of advertising is associated with about $2,570 more in sales.

Part (4): R² = 0.9927. This indicates that the linear regression equation fits the data very well: about 99.3% of the variation in sales is explained by advertising dollars. If we use the equation to predict sales from the amount spent on advertising, the prediction will be close to the actual sales amount.

Part (5): Using the equation from Part (1), if the amount spent on advertising is $50k, then the predicted sales are $135.104k.

[Chart: "Fitting Data on a line". Sales (in thousand $) plotted against Advertising Money (in thousand $), with the fitted line y = 2.5695x + 6.629, R² = 0.9927.]

2. Hierarchical Clustering:

Assume we are trying to cluster the following points using hierarchical clustering. If we are using Euclidean distance, draw a sketch of the hierarchical clustering tree (dendrogram) we would obtain for each of the linkage methods (single and complete, respectively).


Answer: Distance Table

     1      2      3      4      5      6      7
1    0      0.722  0.664  0.962  0.750  1.033  0.250
2           0      0.122  0.711  0.824  0.491  0.670
3                  0      0.605  0.702  0.440  0.577
4                         0      0.372  0.386  0.734
5                                0      0.700  0.500
6                                       0      0.866
7                                              0

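1) Single Linkage (13 pts)

After merging 2 and 3 (distance 0.122):

       1      (2,3)  4      5      6      7
1      0      0.664  0.962  0.750  1.033  0.250
(2,3)         0      0.605  0.702  0.440  0.577
4                    0      0.372  0.386  0.734
5                           0      0.700  0.500
6                                  0      0.866
7                                         0

After merging 1 and 7 (distance 0.250):

       (1,7)  (2,3)  4      5      6
(1,7)  0      0.577  0.734  0.500  0.866
(2,3)         0      0.605  0.702  0.440
4                    0      0.372  0.386
5                           0      0.700
6                                  0

After merging 4 and 5 (distance 0.372). The single-linkage distance from (1,7) to (4,5) is min(0.734, 0.500) = 0.500:

       (1,7)  (2,3)  (4,5)  6
(1,7)  0      0.577  0.500  0.866
(2,3)         0      0.605  0.440
(4,5)                0      0.386
6                           0

After merging (4,5) and 6 (distance 0.386):

         (1,7)    (2,3)    (4,5,6)
(1,7)    0        0.577    0.500
(2,3)             0        0.440
(4,5,6)                    0

After merging (2,3) and (4,5,6) (distance 0.440):

             (1,7)        (2,3,4,5,6)
(1,7)        0            0.500
(2,3,4,5,6)               0

The last merge joins (1,7) and (2,3,4,5,6) at distance 0.500.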
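The merge sequence for either linkage method can be reproduced mechanically from the distance table. The following is a minimal sketch of naive agglomerative clustering, not a library routine; the names `D`, `point_distance`, and `agglomerate` are illustrative, and passing `min` or `max` selects single or complete linkage respectively.

```python
# Upper-triangular distance table from the answer, keyed by (smaller, larger) point label.
D = {
    (1, 2): 0.722, (1, 3): 0.664, (1, 4): 0.962, (1, 5): 0.750, (1, 6): 1.033, (1, 7): 0.250,
    (2, 3): 0.122, (2, 4): 0.711, (2, 5): 0.824, (2, 6): 0.491, (2, 7): 0.670,
    (3, 4): 0.605, (3, 5): 0.702, (3, 6): 0.440, (3, 7): 0.577,
    (4, 5): 0.372, (4, 6): 0.386, (4, 7): 0.734,
    (5, 6): 0.700, (5, 7): 0.500,
    (6, 7): 0.866,
}


def point_distance(a, b):
    """Look up the distance between two points regardless of argument order."""
    return D[(a, b)] if a < b else D[(b, a)]


def agglomerate(points, linkage):
    """Repeatedly merge the two closest clusters until one cluster remains.

    linkage: min for single linkage, max for complete linkage.
    Returns a list of (cluster_a, cluster_b, merge_distance) in merge order.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance = min or max over all cross-cluster point pairs.
                d = linkage(point_distance(a, b)
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


single = agglomerate(range(1, 8), min)
complete = agglomerate(range(1, 8), max)

for name, merges in (("single", single), ("complete", complete)):
    print(name)
    for a, b, d in merges:
        print(f"  merge {a} + {b} at {d:.3f}")
```

The merge distances are the heights at which the branches join in the dendrogram, so the printed list is enough to sketch the tree for each method.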


2) Complete Linkage (12 pts)

After merging 2 and 3 (distance 0.122):

       1      (2,3)  4      5      6      7
1      0      0.722  0.962  0.750  1.033  0.250
(2,3)         0      0.711  0.824  0.491  0.670
4                    0      0.372  0.386  0.734
5                           0      0.700  0.500
6                                  0      0.866
7                                         0

After merging 1 and 7 (distance 0.250):

       (1,7)  (2,3)  4      5      6
(1,7)  0      0.722  0.962  0.750  1.033
(2,3)         0      0.711  0.824  0.491
4                    0      0.372  0.386
5                           0      0.700
6                                  0

After merging 4 and 5 (distance 0.372):

       (1,7)  (2,3)  (4,5)  6
(1,7)  0      0.722  0.962  1.033
(2,3)         0      0.824  0.491
(4,5)                0      0.700
6                           0

After merging (2,3) and 6 (distance 0.491):

         (1,7)    (2,3,6)  (4,5)
(1,7)    0        1.033    0.962
(2,3,6)           0        0.824
(4,5)                      0

After merging (2,3,6) and (4,5) (distance 0.824):

             (1,7)        (2,3,4,5,6)
(1,7)        0            1.033
(2,3,4,5,6)               0

The last merge joins (1,7) and (2,3,4,5,6) at distance 1.033.

3. (25 pts) Consider a two-dimensional database D with the records: R1 (2, 2), R2 (2, 4), R3 (4, 2), R4 (4, 4), R5 (3, 6), R6 (7, 6), R7 (9, 6), R8 (5, 10), R9 (8, 10), R10 (10, 10). The distance function is the L1 distance (Manhattan distance). Show the results of the k-means algorithm at each step, assuming that you start with two clusters (k = 2) with centers C1 = (6, 6) and C2 = (9, 7).

L1 distance: d((x1, x2), (x1', x2')) = |x1 - x1'| + |x2 - x2'|
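The procedure this problem describes can be traced step by step with a small script. This is a minimal sketch under stated assumptions: records are assigned by L1 distance, each center is updated to the coordinate-wise mean of its records, ties go to the lower-numbered center, and no cluster ever becomes empty. The names `l1`, `mean`, and `kmeans_l1` are illustrative.

```python
records = {
    "R1": (2, 2), "R2": (2, 4), "R3": (4, 2), "R4": (4, 4), "R5": (3, 6),
    "R6": (7, 6), "R7": (9, 6), "R8": (5, 10), "R9": (8, 10), "R10": (10, 10),
}


def l1(p, q):
    """Manhattan distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def mean(points):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def kmeans_l1(data, centers):
    """Yield (assignment, new_centers) for each step until assignments stop changing."""
    previous = None
    while True:
        # Assign each record to the nearest center; min() keeps the first index on ties.
        assignment = {
            name: min(range(len(centers)), key=lambda c: l1(p, centers[c]))
            for name, p in data.items()
        }
        if assignment == previous:
            return
        # Recompute each center as the mean of the records assigned to it.
        centers = [
            mean([data[n] for n, c in assignment.items() if c == k])
            for k in range(len(centers))
        ]
        yield assignment, centers
        previous = assignment


history = list(kmeans_l1(records, [(6, 6), (9, 7)]))

for step, (assignment, centers) in enumerate(history, start=1):
    c1 = [n for n, c in assignment.items() if c == 0]
    c2 = [n for n, c in assignment.items() if c == 1]
    rounded = [(round(x, 2), round(y, 2)) for x, y in centers]
    print(f"step {step}: C1 = {c1}, C2 = {c2}, new centers = {rounded}")
```

Each yielded step corresponds to one assign-and-update pass; the generator stops as soon as a pass leaves every record in the cluster it already had.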


Answer: The first step assigns points 1, 2, 3, 4, 5, 6, and 8 to C1 and the other points to C2. The new centers are (3.86, 4.86) and (9, 8.67). (15 pts)
In the next step, point 8 moves from C1 to C2. The new centers are (3.67, 4) and (8, 9). (5 pts)
In the next step, point 6 moves from C1 to C2. After that move the algorithm stops. The final clusters are points (1, 2, 3, 4, 5) and (6, 7, 8, 9, 10). (5 pts)

4. (25 pts) k-Means Clustering: For the following six points,

      X      Y
A1    1.00   2.00
A2    1.00   4.00
A3    3.00   1.00
A4    3.00   5.00
A5    5.00   2.00
A6    5.00   4.00

1) (10 pts) Use the k-means algorithm to show the final clustering result assuming initially we assign A1, A6 as the center of each cluster, respectively.

2) (10 pts) Use the k-means algorithm to show the final clustering result assuming initially we assign A3, A4 as the center of each cluster, respectively.

3) (5 pts) Compute the quality of the k-means clustering using the Sum of Squared Error (SSE), a cohesion measure that shows how near the data points in a cluster are to the cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares.

SSE = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2

where µi is the mean of points in Si.

Based on SSE of 1) and 2), which clustering would be better?

Answer: 1) (10 pts)

Seed 1: initial centers A1 (1, 2) and A6 (5, 4)

      X      Y      d to A1(1,2)  d to A6(5,4)
A1    1.00   2.00   0.000         4.472
A2    1.00   4.00   2.000         4.000
A3    3.00   1.00   2.236         3.606
A4    3.00   5.00   3.606         2.236
A5    5.00   2.00   4.000         2.000
A6    5.00   4.00   4.472         0.000

Cluster              New X   New Y
C1 = {A1, A2, A3}    1.667   2.333
C2 = {A4, A5, A6}    4.333   3.667

Next iteration: centers C1 (1.667, 2.333) and C2 (4.333, 3.667)

      X      Y      d to C1       d to C2
A1    1.00   2.00   0.746         3.727
A2    1.00   4.00   1.795         3.350
A3    3.00   1.00   1.885         2.982
A4    3.00   5.00   2.982         1.885
A5    5.00   2.00   3.350         1.795
A6    5.00   4.00   3.727         0.746

Cluster              New X   New Y
C1 = {A1, A2, A3}    1.667   2.333
C2 = {A4, A5, A6}    4.333   3.667

SSE1=14.666668

2) (10 pts)

Seed 2: initial centers A3 (3, 1) and A4 (3, 5)

      X      Y      d to A3(3,1)  d to A4(3,5)
A1    1.00   2.00   2.236         3.606
A2    1.00   4.00   3.606         2.236
A3    3.00   1.00   0.000         4.000
A4    3.00   5.00   4.000         0.000
A5    5.00   2.00   2.236         3.606
A6    5.00   4.00   3.606         2.236

Cluster              New X   New Y
C1 = {A1, A3, A5}    3.000   1.667
C2 = {A2, A4, A6}    3.000   4.333

Next iteration: centers C1 (3, 1.667) and C2 (3, 4.333)

      X      Y      d to C1       d to C2
A1    1.00   2.00   2.028         3.073
A2    1.00   4.00   3.073         2.028
A3    3.00   1.00   0.667         3.333
A4    3.00   5.00   3.333         0.667
A5    5.00   2.00   2.028         3.073
A6    5.00   4.00   3.073         2.028

Cluster              New X   New Y
C1 = {A1, A3, A5}    3.000   1.667
C2 = {A2, A4, A6}    3.000   4.333

SSE2=17.333334

3) (5 pts) SSE 1 (14.67) < SSE 2 (17.33). So the first one is better.
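Both clusterings and their SSE values can be recomputed with a few lines of code. This is a minimal sketch of plain k-means with Euclidean distance, assuming ties go to the lower-numbered center and no cluster becomes empty; the names `centroid`, `kmeans`, and `sse` are illustrative.

```python
import math

points = {"A1": (1.0, 2.0), "A2": (1.0, 4.0), "A3": (3.0, 1.0),
          "A4": (3.0, 5.0), "A5": (5.0, 2.0), "A6": (5.0, 4.0)}


def centroid(members):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    n = len(members)
    return (sum(p[0] for p in members) / n, sum(p[1] for p in members) / n)


def kmeans(data, centers):
    """Run k-means from the given centers; return (clusters, centers) at convergence."""
    assignment = None
    while True:
        # Assign every point to its nearest center (Euclidean distance).
        new_assignment = {
            name: min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for name, p in data.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the centroid of its assigned points.
        centers = [centroid([data[n] for n, c in assignment.items() if c == k])
                   for k in range(len(centers))]
    clusters = [sorted(n for n, c in assignment.items() if c == k)
                for k in range(len(centers))]
    return clusters, centers


def sse(data, clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(math.dist(data[n], mu) ** 2
               for members, mu in zip(clusters, centers) for n in members)


clusters1, centers1 = kmeans(points, [points["A1"], points["A6"]])
clusters2, centers2 = kmeans(points, [points["A3"], points["A4"]])
sse1 = sse(points, clusters1, centers1)
sse2 = sse(points, clusters2, centers2)

print(clusters1, round(sse1, 3))
print(clusters2, round(sse2, 3))
```

A lower SSE means the points sit closer to their centroids, which is why the two seedings can be compared on this single number.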