technique presentation
My goal is to learn the
Contraceptive Method Choice dataset
◈ This proved to be fairly difficult, as I initially struggled to correctly classify more than ~55% of the instances
◈ After an effort to better prepare the data, I found that the best performing classifier is actually a combination of classifiers - called an ensemble method
Ensemble Method
◈ A learning algorithm that splits a set of data into a number of subsets, classifies instances in each subset independently, then takes a vote (average) of those predictions to classify new instances
◈ Because it can be trained and then used to make new predictions, an ensemble is a single hypothesis - but this hypothesis is not necessarily contained in the hypothesis space of the models from which it is built, which lends the algorithm great flexibility
“Discrete Aggregation”
◈ Dagging is a meta-algorithm based on bagging (“bootstrap aggregation”).
◈ Like bagging, it uses a majority vote to combine the classification outputs from a number of subsets, but it forms those subsets differently than bagging does.
Meta-Algorithms
◈ Higher-level procedures for selecting lower-level procedures. In other words, an algorithm that manipulates other algorithms.
◈ Can provide an adequate solution to an optimization problem (selecting the best solution from all possible solutions).
Let’s take a closer look at bagging...
“When the president of the US needs to make policy decisions, he relies on the expertise of his cabinet members to make the correct decision with respect to the policy. The expertise of the cabinet members complements each other as opposed to being redundant and duplicative. Using this conceptual example of bagging, we can apply it to machine learning concepts familiar to us.”*
Given a set of training data, we can create n subsets of this data by sampling with replacement.
We then train a classifier on each subset independently, resulting in n classifiers - these are analogous to the cabinet members!
Finally, we classify each testing instance using a majority vote from the n classifiers.
What does this mean in the context of dagging?
Dagging takes bagging a step further by improving the representativeness of each subset - creating better “cabinet members”.
No Resampling
Dagging creates disjoint, stratified folds in a dataset, rather than allowing instances to belong to more than one subset as in bagging.
Stratified Folds
Stratified sampling is a method in statistics for sampling from a population. It is well-suited to data with an appreciable level of variation in its subpopulations.
Homogeneous Subsets
Instances are grouped into relatively homogeneous subsets so that they can be trained on without being skewed by the characteristics of instances in other subsets.
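The disjoint, stratified folds described above can be produced with scikit-learn's `StratifiedKFold`. A small sketch, using made-up imbalanced labels, showing that the folds partition the data while preserving class proportions:

```python
# Disjoint, stratified folds: each instance lands in exactly one fold,
# and each fold preserves the overall class proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 60 + [1] * 30 + [2] * 10)  # imbalanced labels (60/30/10)
X = np.arange(len(y)).reshape(-1, 1)          # dummy features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in skf.split(X, y)]

# Unlike bagging's bootstrap samples, the folds partition the data
all_idx = np.concatenate(folds)
print(len(set(all_idx)) == len(y))  # disjoint and exhaustive

for fold in folds:
    print(np.bincount(y[fold]))  # each fold keeps the 60/30/10 ratio
```

Since 100 instances split evenly into 5 folds, every fold here carries exactly 12, 6, and 2 instances of the three classes - the same proportions as the full dataset.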
Heuristic (?)
◈ Training:
○ Choose a number of folds, and divide the total number of instances by it to determine the distance metric
○ Find the k-nearest neighbors for a randomly chosen training instance, using the selected distance metric to form a fold (subset)
○ Classify instances in the fold using the base classifier
◈ Testing:
○ Find the fold that contains instances most similar to a new instance
○ Classify that instance by taking a majority vote of the classifications of the instances in that fold
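The core of the procedure - one weak base classifier per disjoint stratified fold, combined by majority vote - can be sketched by hand. This is a simplified version (it uses `StratifiedKFold` to form the folds rather than the k-nearest-neighbor heuristic above, and votes all per-fold models on every test instance):

```python
# A minimal dagging sketch: train one copy of a weak base classifier
# per disjoint stratified fold, then majority-vote their predictions.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=9, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = DecisionTreeClassifier(max_depth=1)  # a weak learner (a stump)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Train one classifier per fold - the folds are disjoint, unlike bagging
models = [clone(base).fit(X_tr[idx], y_tr[idx])
          for _, idx in skf.split(X_tr, y_tr)]

# Majority vote over the 10 per-fold predictions
votes = np.array([m.predict(X_te) for m in models])
pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print((pred == y_te).mean())
```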
The base classifier is used to train each fold independently. Each fold will be classified rather than each instance.
Weak classifiers have better predictive power than random guessing, but still perform relatively poorly. A classic example is a decision tree.
Dagging (and ensemble methods in general) work best in conjunction with a weak base classifier. We seek to improve upon its predictive power.
simple logistic regression
Misnomer
It’s actually a technique for classification, rather than for regression.
Methodology
Fit a linear model to the data, then use the logistic (squashing) function to make a prediction.
◈ Fitting the data to a linear model gives a real number result for each instance.
◈ We turn this number into a classification by putting it into the logistic (sigmoid) function.
◈ This normalizes the value to lie between 0 and 1.
◈ We interpret this normalized number as a probability, which can be used to classify the instance against a defined threshold.
σ(x) = 1 / (1 + e^(-x))
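The four steps above can be written out in a few lines. The weights and features here are made up purely for illustration:

```python
# Linear score -> sigmoid -> probability -> thresholded class label
import math

def sigmoid(x):
    """The logistic (squashing) function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, features, threshold=0.5):
    # Step 1: the linear model gives a real-number score
    score = sum(w * f for w, f in zip(weights, features)) + bias
    # Steps 2-3: squash the score into (0, 1)
    p = sigmoid(score)
    # Step 4: interpret p as a probability and threshold it
    return p, int(p >= threshold)

# Illustrative (made-up) weights: score = 0.8*1.0 - 0.5*2.0 - 0.2 = -0.4
p, label = predict([0.8, -0.5], -0.2, [1.0, 2.0])
print(p, label)  # p ≈ 0.401, label 0
```

Note that sigmoid(0) = 0.5, so with the default threshold a positive linear score maps to class 1 and a negative score to class 0.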
Robustness
Combining the predictions of several estimators can reduce variance and improve the robustness (generalizability) of a learning algorithm, lending it greater capacity.
Benefits
Reduced Over-Fitting
Taking a majority vote or average can reduce over-fitting, and thus subsequent misclassification, caused by training on instances with considerable variation, by abstracting away irrelevant instances.
Sources
◈ http://dl.acm.org/citation.cfm?id=743935
◈ http://scikit-learn.org/stable/modules/ensemble.html
◈ *http://cse-wiki.unl.edu/wiki/index.php/Bagging_and_Boosting
◈ http://www3.cs.stonybrook.edu/~cse352/L11testing.pdf
◈ http://rapid-i.com/wiki/index.php?title=Weka:W-Dagging
◈ http://advantech.gr/med07/papers/T05-001-719.pdf
◈ http://courses.washington.edu/css490/2012.Winter/lecture_slides/05b_logistic_regression.pdf
◈ http://jvns.ca/blog/2014/11/17/fun-with-machine-learning-logistic-regression/