Database Management Systems: Data Mining
Posted on 01-Feb-2016
1
Jerry Post, Copyright © 2003
Database Management Systems: Data Mining
Attribute Evaluation
2
Data Mining
Multiple Regression
Y = b0 + b1X1 + b2X2 + … + bkXk
Regression estimates the b coefficients.
If a b coefficient is zero, the corresponding X attribute does not influence the Y variable.
The coefficient also indicates the strength of the relationship: dY/dXi = bi, so a one-unit increase in Xi results in a bi change in Y.
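As a sketch of how the b coefficients are estimated, the following uses NumPy's least-squares solver on made-up data (the data and the true coefficient values 2.0, 3.0, -1.5 are my own illustration, not from the slides); the fit recovers the coefficients that generated Y:

```python
import numpy as np

# Hypothetical data: Y is generated from known coefficients plus noise,
# so we can check that least squares recovers b0, b1, b2.
rng = np.random.default_rng(0)
n = 500
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 5, n)
Y = 2.0 + 3.0 * X1 - 1.5 * X2 + rng.normal(0, 0.1, n)

# Design matrix: the column of ones estimates the intercept b0.
A = np.column_stack([np.ones(n), X1, X2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b)  # close to [2.0, 3.0, -1.5]
```

A coefficient near zero here would mean the corresponding X column carries no influence on Y, matching the interpretation above.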
3
Regression Example
Query: Sales by Year by City Population:
SELECT Format([orderdate],"yyyy") AS SaleYear, City.Population1990, Sum(Bicycle.SalePrice) AS SumOfSalePrice
FROM City RIGHT JOIN (Customer INNER JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID) ON City.CityID = Customer.CityID
GROUP BY Format([orderdate],"yyyy"), City.Population1990
HAVING (((City.Population1990)>0));
Paste the data into Excel, then run Tools / Data Analysis / Regression.
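The Access query above can be approximated in any SQL engine. A minimal sketch using Python's built-in sqlite3, with tiny made-up rows and SQLite's strftime standing in for Access's Format function:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical miniature versions of the City, Customer, and Bicycle tables.
cur.executescript("""
CREATE TABLE City(CityID INTEGER PRIMARY KEY, Population1990 INTEGER);
CREATE TABLE Customer(CustomerID INTEGER PRIMARY KEY, CityID INTEGER);
CREATE TABLE Bicycle(SerialNumber INTEGER PRIMARY KEY, CustomerID INTEGER,
                     OrderDate TEXT, SalePrice REAL);
INSERT INTO City VALUES (1, 50000), (2, 120000);
INSERT INTO Customer VALUES (10, 1), (11, 2);
INSERT INTO Bicycle VALUES
  (100, 10, '1998-03-15', 1200),
  (101, 10, '1998-07-02', 800),
  (102, 11, '1999-05-20', 2500);
""")
rows = cur.execute("""
SELECT strftime('%Y', OrderDate) AS SaleYear,
       City.Population1990,
       SUM(Bicycle.SalePrice) AS SumOfSalePrice
FROM City
     JOIN Customer ON City.CityID = Customer.CityID
     JOIN Bicycle ON Customer.CustomerID = Bicycle.CustomerID
GROUP BY SaleYear, City.Population1990
HAVING City.Population1990 > 0
ORDER BY SaleYear
""").fetchall()
print(rows)  # [('1998', 50000, 2000.0), ('1999', 120000, 2500.0)]
```

Each output row (year, population, total sales) is one observation for the regression that follows.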
4
Regression Results
About 75% of the variation is explained (R Square = 0.7476).
The P-values are less than 0.05, so the coefficients are significantly different from zero.
Regression Statistics
Multiple R          0.8647
R Square            0.7476
Adjusted R Square   0.7476
Standard Error      7464.1009
Observations        12081
ANOVA
            df     SS           MS           F
Regression  2      1.9936E+12   9.96799E+11  17891.74218
Residual    12078  6.72899E+11  55712802.45
Total       12080  2.6665E+12

                Coefficients  Standard Error  t Stat    P-value
Intercept       -708867.855   46760.007       -15.160   0.000
SaleYear        355.889       23.384          15.219    0.000
Population1990  0.033         0.000           188.872   0.000
Each year, sales increase by about $356.
For every 1,000 additional people, sales increase by about $33.
5
Information Gain: Partitioning
In 1948, Shannon defined information (I) as:
I = -Σ pi log2(pi)
[Figure: plot of -p log2(p) for p from 0 to 1. The curve is 0 at p = 0 and p = 1 and peaks at about 0.53 near p = 0.37.]
If pi is zero or one, there is no information—since you always know what will happen.
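A quick sketch of this definition in plain Python (the function name is my own), checking the endpoints and the peak of the curve:

```python
import math

def info(p):
    """Information -p*log2(p) contributed by an outcome of probability p."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p)

# No information at the extremes: you always know what will happen.
print(info(0.0), info(1.0))  # 0.0 0.0
# The curve peaks at p = 1/e, where -p*log2(p) = 1/(e*ln 2) ≈ 0.53.
print(round(info(1 / math.e), 2))  # 0.53
```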
6
Information Example
Types of shoppers (m=2): status is high roller or tourist.
S is a set of data (rows).
The dataset contains attributes (A), such as: Income, Age_range, Region, and Gender.
Each attribute has many (v) possible values. For example, the Income categories are: low, medium, high, and wealthy.
The subset Sij contains the rows of customers in category i who possess attribute value j. The count of the number of rows is sij.
The entropy of attribute A defined from this partitioning is:
E(A) = Σ(j=1..v) [(s1j + … + smj) / s] · I(s1j, …, smj)
The information gain from the partitioning is:
Gain(A) = I(s1, s2, …, sm) - E(A)
Find the attribute with the highest gain.
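These formulas translate directly into code. A minimal sketch (function and variable names are my own) using a tiny made-up partition, where an attribute that separates the classes perfectly has E(A) = 0 and so gains all of the information:

```python
import math

def expected_info(counts):
    """I(s1, ..., sm): expected information for a list of class counts."""
    s = sum(counts)
    return sum(-c / s * math.log2(c / s) for c in counts if c > 0)

def gain(class_counts, partition):
    """Gain(A) = I(s1,...,sm) - E(A). partition maps each value j of
    attribute A to its per-class counts (s1j, ..., smj)."""
    s = sum(class_counts)
    e_a = sum(sum(cj) / s * expected_info(cj) for cj in partition.values())
    return expected_info(class_counts) - e_a

# Perfect split: value "a" holds only class 1 and value "b" only class 2,
# so E(A) = 0 and the gain equals I(2, 2) = 1 bit.
print(gain([2, 2], {"a": [2, 0], "b": [0, 2]}))  # 1.0
```

An attribute that splits the rows without separating the classes at all would gain nothing: gain([2, 2], {"a": [1, 1], "b": [1, 1]}) is 0.0.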
7
Data for Information Example

Class 1: High roller
Gender  Income   Age_range  Region     Count
M       High     Middle     Northeast  12
M       Wealthy  Old        West       8
M       Medium   Young      West       21
F       High     Middle     South      32
M       Low      Young      Northeast  17
M       High     Old        Midwest    14
Total: 104

Class 2: Tourist
Gender  Income   Age_range  Region     Count
M       Low      Young      West       25
F       Low      Young      West       10
M       Medium   Middle     Midwest    32
M       High     Young      Northeast  5
F       Medium   Young      West       8
M       Low      Old        Northeast  27
Total: 107
s1 = 104, s2 = 107, s = 211
I(s1, s2) = -(104/211) log2(104/211) - (107/211) log2(107/211) = 0.9999
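Checking the arithmetic for I(s1, s2) directly, in plain Python:

```python
import math

s1, s2 = 104, 107            # high rollers and tourists
s = s1 + s2                  # 211 rows in total
i_s = (-(s1 / s) * math.log2(s1 / s)
       - (s2 / s) * math.log2(s2 / s))
print(round(i_s, 4))  # 0.9999
```

The value is close to 1 bit because the two classes are nearly equally likely.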
Expected information for income categories:
Value    High roller  Tourist  Sum  I(s1j,s2j)  Weighted
Low      17           62       79   0.2262      0.0847
Medium   21           40       61   0.2796      0.0808
High     58           5        63   0.1204      0.0359
Wealthy  8            0        8    0.0000      0.0000
Total    104          107      211              0.2015

Each weighted value is (sum/s) × I(s1j,s2j); for example, the Low row is 79/211 × I(…) = 0.0847.
E(income) = 0.2015
Gain(income) = 0.9999 - 0.2015 = 0.7984
8
Results for Information
Attribute Gain
Income 0.7984
Gender 0.7048
Age_range 0.7025
Region 0.7549
All values are relatively high, so all attributes are important.
9
Dimensionality
Notice the issue of dimensionality in the example: we had to set up groups within the attributes. If there are too many groupings/values:
The system will take a long time to run.
Many subgroups will have no observations.
How do you establish the groupings/values?
Natural hierarchies (e.g., dates)
Cluster analysis
Prior knowledge
Level of detail required for analysis
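Two of these approaches can be sketched in a few lines of pandas (the column names and cut points here are illustrative, not from the slides): dates roll up to years through a natural hierarchy, and chosen cut points turn a continuous attribute into a few categories.

```python
import pandas as pd

# Hypothetical rows.
df = pd.DataFrame({
    "orderdate": pd.to_datetime(["1998-03-15", "1998-07-02", "1999-05-20"]),
    "income": [18000, 52000, 250000],
})

# Natural hierarchy: dates roll up to years.
df["sale_year"] = df["orderdate"].dt.year

# Illustrative cut points mapping income to a few categories.
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 30000, 80000, 150000, float("inf")],
                           labels=["low", "medium", "high", "wealthy"])
print(df["income_band"].tolist())  # ['low', 'medium', 'wealthy']
```

Fewer, well-chosen bands keep the subgroup counts large enough to carry information.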
10
Non-Linear Estimation
Regression:
Polynomial: Y = b0 + b1X + b2X^2 + b3X^3 + b4X^4 + … + u
Exponential: Y = b0 X^b1 e^u, so ln(Y) = ln(b0) + b1 ln(X) + u
Log-Linear: ln(Y) = b0 + b1 ln(X) + u
Other: log-log and more
Other Methods:
Neural networks
Search
11
Example: PolyAnalyst: Find Law for MPG

mpg = (2.59183e+009*power*age + 176465*power*age*weight + 2.41554e+009*power*age*age - 3.54349e+009*power + 7.27281e+007*age*weight - 2.55635e+010) / (power*age*weight + 52028.3*power*age*age*weight)

Best exact rule found:
mpg = (4.71047e+008*power*age*weight - 38783.5*power*age*weight*weight + 2.5987e+009*power*age*age*weight - 7.65205e+009*power*weight + 1.5658e+008*age*weight*weight + 1.15859e+011*power*power - 3.0532e+013*age*age) / (power*age*weight*weight + 52028.3*power*age*age*weight*weight)
12
MPG Versus Weight
13
Problems with Non-Linear Models
They can be harder to estimate.
They are substantially more difficult to optimize.
They are often unstable, particularly at the ends.
[Figure: plot of the quartic below for X from -25 to 23; Y swings from about -60,000 to 140,000, illustrating the instability at the ends.]
Y = 15000 – 850 X – 435 X2 + 2 X3 + X4
Note: the roots are approximately -20, -7, 5, and 20, so Y ≈ (X + 7)(X - 5)(X + 20)(X - 20)
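Evaluating the quartic in plain Python (no assumptions beyond the formula on the slide) confirms both the root structure and the explosive growth at the ends of the plotted range:

```python
def f(x):
    # Y = 15000 - 850X - 435X^2 + 2X^3 + X^4, from the slide.
    return 15000 - 850 * x - 435 * x**2 + 2 * x**3 + x**4

print(f(20))          # 0: X = 20 is a root
print(f(-25), f(23))  # 123750 and 69510: huge swings at the ends
```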