Machine Learning and Data Mining 2006. Statistical Machine Learning. Qing Tao (陶卿), Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所).



Outline
Statistical learning algorithms for one-class problems.
Challenging problems
Data Mining
Challenges in the areas of data storage, organization and searching have led to the new field of data mining.
Vast amounts of data are being generated in many fields, and our job is to extract the important patterns and trends and to understand what the data says. This is called learning from data.
Machine Learning
The learning problems can be roughly categorized as supervised and unsupervised.
Supervised: classification, regression and ranking;
Unsupervised: one-class, clustering and PCA.
Application in PR
Pattern recognition system:
Difference
Statistical machine learning: inductive inference from finite samples.
Biometrics
Biometrics refers to the automatic identification of a person based on his/her physiological or behavioral characteristics.
Bioinformatics
In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of genomes. Popular sequence databases have been growing at exponential rates.
This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.
ISI
Intelligence and Security Informatics (ISI) is an emerging field of study aimed at developing advanced information technologies, systems, algorithms, and databases for national- and homeland-security-related applications.
Confusion
Many researchers claim that they are studying statistical machine learning methods.
Often, however, their interest is only in applying statistical machine learning algorithms.

Machine learning community
To analyze different algorithms theoretically;
To develop new theory and learning algorithms for new problems.


Performance
Not only prediction accuracy, but also implementation, speed, understandability, etc.
Theoretical Analysis
Model Selection: estimating the performance of different models in order to choose the best one.
Model Assessment: having chosen a final model, estimating its prediction error on new data.
Ian Hacking
The quiet statisticians have changed our world:
not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions…
Statistical learning
Not limited to the statistical learning theory of Vapnik.
Andreas Buja
There is no true interpretation of anything:
Interpretation is a vehicle in the service of human comprehension. The value of an interpretation is in enabling others to fruitfully think about an idea.
Interpretation of Algorithms
Almost all the learning algorithms can be illustrated theoretically and intuitively;
The probability and geometric explanations not only help us to understand the algorithms theoretically and intuitively, but also motivate us to develop elegant and practical new algorithms.
Main references
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
Main kinds of theory
Definition of classification
Definition of regression
Several well-known algorithms
Framework of algorithms
Linear to nonlinear (neural networks, kernel); single to ensemble (boosting); pointwise to continuous (one-class problems); local to global (KNN and LMS).
Design of algorithms
Usually, the algorithm for a more complex hypothesis space should reduce, as a special case, to the algorithm for a simpler hypothesis space.
The algorithm for the simple hypothesis space then serves as the starting point of the complete framework.
Bayesian classification
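As a standard point of reference (textbook material, supplied here rather than recovered from the slide), the Bayes classifier under 0-1 loss assigns

\hat{G}(x) = \arg\max_k P(G = k \mid X = x) = \arg\max_k \, \pi_k \, p(x \mid G = k),

where \pi_k is the prior probability of class k and p(x \mid G = k) is the class-conditional density.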
Bayesian: regression
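Likewise, under squared-error loss the Bayes-optimal regression function is the conditional expectation (again a standard textbook formula, supplied for reference):

f(x) = E[Y \mid X = x] = \arg\min_c E\big[(Y - c)^2 \mid X = x\big].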
Estimating densities
Knowledge of the density functions would allow us to solve whatever problems can be solved on the basis of the available data;
Vapnik's principle: never solve a problem that is more general than the one you actually need to solve.
KNN
Interpretation: KNN
Assuming that the classifier is well approximated by a locally constant function, conditioning at a point is relaxed to conditioning on some region around the target point.
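A standard way of writing this relaxation (following Hastie et al., 2001; the notation below is theirs, not the slide's) is the k-nearest-neighbor estimate

\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big),

where N_k(x) is the neighborhood containing the k training points closest to x.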
LMS
Interpretation: LMS
Assuming that the classifier is well approximated by a globally linear function, the expectation is approximated by averages over the training data.
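Concretely (standard least-squares formulas, given here for reference), the LMS fit minimizes the residual sum of squares over a globally linear model,

\mathrm{RSS}(\beta) = \sum_{i=1}^{l} (y_i - x_i^{\top}\beta)^2, \qquad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,

so the conditional expectation is replaced by an average over the training data.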
Fisher Discriminant Analysis
To seek a direction for which the projected samples are well separated.
(Figure: samples from two classes ω1 and ω2 in the (x1, x2) plane, projected onto a direction w; the projected values y1 and y2 of the two classes are well separated.)
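In the usual notation (a standard textbook formulation, supplied for reference), FDA chooses the direction w maximizing the Rayleigh quotient

J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}, \qquad w^{*} \propto S_W^{-1}(m_1 - m_2),

where S_B and S_W are the between-class and within-class scatter matrices and m_1, m_2 are the class means.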
Interpretation: FDA
Generally speaking, it is not optimal.
FDA is the Bayes optimal solution if the two classes are distributed according to a normal distribution with equal covariance.
FDA and LMS
Least squares approaches for classification.
The solution to the least squares problem is in the same direction as the solution of Fisher’s discriminant.
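A standard statement of this equivalence (textbook material, not quoted from the slide): if the class labels are coded as l/l_1 for one class and -l/l_2 for the other, the least-squares weight vector satisfies w \propto S_W^{-1}(m_1 - m_2), i.e. up to scale (and sign) it is Fisher's direction.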
FDA: a novel interpretation
T. Centeno, N. Lawrence. Optimizing kernel parameters and regularization coefficients for non-linear discriminant analysis. JMLR, 7 (2006).
A novel Bayesian interpretation of FDA relating Rayleigh’s coefficient to a noise model that minimizes a cost based on the most probable class centers and that abandons the ‘regression to the labels’ assumption used by other algorithms.
FDA: parameters
Going further, with the use of a Gaussian process prior, they show the equivalence of their model to a regularized kernel FDA.
A key advantage of their approach is the facility to determine kernel parameters and the regularization coefficient through the optimization of the marginal log-likelihood of the data.
FDA: framework of algorithms
Qing Tao, et al. The Theoretical Analysis of FDA and Applications. Pattern Recognition. 39(6):1199-1204.
Similar in spirit to the maximal margin algorithm, FDA with zero within-class variance is proved to serve as the starting point of the complete FDA framework.
Disadvantage
Motivation;
Bias and variance analysis
The bias-variance decomposition is a very powerful and widely-used tool for understanding machine-learning algorithms;
It was originally developed for squared loss.
Bias-Variance Decomposition
\mathrm{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat{f}(x_0)\big) + \mathrm{Var}\big(\hat{f}(x_0)\big).
Bias-Variance Tradeoff
Often, the variance can be significantly reduced by deliberately introducing a small amount of bias.
Blue area: the noise error (sigma), centered at the truth.
Large yellow circle: the variance of the least-squares fit, centered at the closest fit in the whole population.
Smaller circle: a shrunken fit, with smaller variance but higher bias.
Model space: the set of all possible predictions from the model.
Question: what is the definition of a model? Could we say "linear model", but then what is the role of least squares; is it also a model itself?
Interpretation: KNN
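For a k-nearest-neighbor fit the decomposition takes the following standard form (Hastie et al., 2001; given here for reference):

\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k},

so increasing k reduces the variance term while typically increasing the bias term.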
Ridge regression
LMS: ill-posed problem;
Compared with LMS under certain assumptions, it introduces a small amount of bias.
Interpretation: ridge regression
Analytic solution
The technique can be viewed as a way to simultaneously reduce the risk and increase the numerical stability of LMS.
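Explicitly (the standard ridge formula, assumed here rather than taken from the slide):

\hat{\beta}^{\mathrm{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y, \qquad \lambda > 0.

Adding \lambda I makes the matrix to be inverted well conditioned, which is the source of the improved numerical stability.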
Interpretation: parameter
Effective degrees of freedom: experimental analysis;
The key result is a dramatic reduction of parameter variance.
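The usual definition of the effective degrees of freedom (standard, e.g. Hastie et al., 2001) is

\mathrm{df}(\lambda) = \mathrm{tr}\big[X (X^{\top}X + \lambda I)^{-1} X^{\top}\big],

which equals the number of parameters at \lambda = 0 and decreases toward 0 as \lambda \to \infty.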
A note
A new class of generalized Bayes minimax ridge regression estimators. The Annals of Statistics. 2005, 33(4).
The risk reduction aspect of ridge regression was often observed in simulations but was not theoretically justified. Almost all theoretical results on ridge regression in the literature depend on normality.
Other loss functions
P. Domingos. A Unified Bias-Variance Decomposition and its Applications. ICML, 2000.
The resulting decomposition specializes to the standard one for the squared-loss case, and to a close relative of Kong and Dietterich’s 1995 one for the zero-one case.
Interpretation: boosting
Both Bagging and Boosting reduce error by reducing the variance term (Breiman, 1996);
Bauer and Kohavi (1999) demonstrated that Boosting does indeed seem to reduce bias for certain real-world problems.
Interpretation: margin
Domingos 2000:
Schapire’s (1997) notion of “margin” can be expressed as a function of the zero-one bias and variance, making it possible to formally relate a classifier ensemble’s generalization error to the base learner’s bias and variance on training examples.
Interpretation: SVM
G. Valentini, T. Dietterich. Bias-variance analysis of SVM for the development of SVM-based ensemble methods. JMLR, 5, 2004.
They present an extended experimental analysis of the bias-variance decomposition of the error in SVMs, considering Gaussian, polynomial and dot-product kernels.
SVM: experimental analysis
A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters.
The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially with Gaussian and polynomial kernels.
Interpretation: base learners
The effectiveness of ensemble methods depends on the specific characteristics of the base learners;
The bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base learners.
Disadvantage
To estimate the bias and variance we need to know the actual function being learned, which is unavailable for real-world problems;
Bias and variance can nevertheless be estimated by using the bootstrap to replicate the data, as sketched below.
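A minimal sketch of this idea, assuming squared loss, a synthetic data set, and an ordinary least-squares base learner (all of which are my own assumptions, not the slide's):

# Bootstrap estimate of variance (and a crude bias proxy) for a least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 200                                              # sample size, bootstrap replicates
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)      # noisy targets

X1 = np.hstack([np.ones((n, 1)), X])                         # add intercept column
preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, size=n)                         # resample with replacement
    beta, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)  # least-squares fit on the replicate
    preds[b] = X1 @ beta                                     # predict on the original points

mean_pred = preds.mean(axis=0)
variance = preds.var(axis=0).mean()                          # bootstrap variance, averaged over x
bias2 = np.mean((mean_pred - y) ** 2)                        # crude bias^2 proxy: targets stand in for the unknown truth
print(f"bootstrap variance ~ {variance:.3f}, bias^2 proxy ~ {bias2:.3f}")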
Generalization bound
PAC Framework;
VC Theory;
PAC Framework
Probably Approximately Correct: a concept class F is PAC-learnable if, for every target f \in F and all 0 < \varepsilon, \delta < 1, the learner outputs a hypothesis h with P(h(x) \ne f(x)) \le \varepsilon with probability at least 1 - \delta, using a number of examples polynomial in 1/\varepsilon and 1/\delta.
VC Theory and PAC Bounds
Landmark paper by Blumer et al. (1989);
Greatly influenced the field of machine learning;
VC theory and PAC bounds have been used to analyze the performance of learning systems as diverse as decision trees, neural networks, and others.
PAC Bounds for Classification
\mathrm{err}_D(h_S) \le \varepsilon(l, H, \delta) holds with probability at least 1 - \delta.
\varepsilon(l, H, \delta): a bound depending on the sample size l, the hypothesis class H, and the confidence \delta.
VC Dimension
The VC dimension of a hypothesis class is the size of the largest set that it can shatter.
A consistency problem
In spite of
Remarks on PAC+VC Bounds
The size of the training set required to ensure good generalization scales linearly with the VC dimension.
Being distribution-free, such bounds are unable to take advantage of benign distributions.
SVM: linearly separable
The linearly separable SVM: maximizing the margin.
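In the linearly separable case the maximal-margin problem is usually written as (standard formulation, supplied for reference)

\min_{w, b} \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \dots, l,

and the geometric margin of the resulting hyperplane is 1/\|w\|, so minimizing \|w\| maximizes the margin.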
SVM: soft margin
C-SVM
The impasse of the NP-hardness of minimizing the training error has been avoided.
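The soft-margin (C-SVM) problem introduces slack variables \xi_i (again the standard formulation, supplied for reference):

\min_{w, b, \xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,

replacing the combinatorial problem of minimizing the number of training errors with a convex one.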
SVM: algorithms
Hypothesis spaces: linearly separable; linearly inseparable; kernel.
The maximal margin algorithm serves as the starting point of the complete SVM framework (Shawe-Taylor et al., 1998).

Bound: VC Dimension
provided
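A commonly quoted form of this bound (as given in Cristianini and Shawe-Taylor, 2000; the exact statement on the slide is not known): for a hypothesis h consistent with the l training examples, with probability at least 1 - \delta,

\mathrm{err}_D(h) \le \frac{2}{l}\Big(d \log\frac{2el}{d} + \log\frac{2}{\delta}\Big),

provided d \le l, where d is the VC dimension of the hypothesis class.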
Bound: VC dimension+errors
Structural risk minimization principle;
Occam’s razor: a simple function that explains most of the data is preferable to a complex one.
Disadvantages of SRM
According to the SRM principle, the structure has to be defined a priori, before the training data appear (Shawe-Taylor et al., 1998).
The maximum margin algorithm violates this principle in that the hierarchy defined depends on the data.
Disadvantage: PAC+VC bound
Not data-dependent: bounds should rely on an effective complexity measure rather than the a priori VC dimension;
It cannot be applied in high-dimensional feature spaces.
Several concepts
fat-shattering dimension;
Covering number.
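For reference (standard definitions, not recovered from the slide): a set \{x_1, \dots, x_n\} is \gamma-shattered by a real-valued function class F if there are levels r_1, \dots, r_n such that every labelling b \in \{-1,+1\}^n is realized by some f \in F with f(x_i) \ge r_i + \gamma when b_i = +1 and f(x_i) \le r_i - \gamma when b_i = -1; \mathrm{fat}_F(\gamma) is the size of the largest \gamma-shattered set. The covering number N(\gamma, F, l) is the smallest number of functions needed to approximate every f \in F to within \gamma on any sample of l points.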
Generalization Bound: margin
The first paper about margin results: J. Shawe-Taylor and P. L. Bartlett, 1998.
provided
Importance of Margin
Over the last decade, both theory and practice have pointed out that the concept of margin is central to the success of SVM and boosting.
Generally, a large margin implies good generalization performance.
Vapnik’s three periods
1970–1990: Development of the Basics of Statistical Learning Theory (the VC theory).
1992–2004: Development of Large Margin Technology (SVMs).
2005–…: Development of Non-Inductive Methods of Inference.
Neural networks
MSE criterion and steepest descent method.
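A minimal sketch of this combination (my own illustration with assumed synthetic data and a single tanh hidden layer, not code from the lecture):

# One-hidden-layer network trained by steepest (gradient) descent on the MSE criterion.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)                                   # target function, shape (200, 1)

W1 = rng.normal(scale=0.5, size=(1, 10)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(10, 1)); b2 = np.zeros(1)
lr = 0.1                                            # step size for steepest descent

for step in range(2000):
    H = np.tanh(X @ W1 + b1)                        # hidden layer activations
    out = H @ W2 + b2                               # network output
    err = out - y
    mse = np.mean(err ** 2)                         # MSE criterion
    # back-propagate the gradient of the MSE
    g_out = 2 * err / len(X)
    gW2 = H.T @ g_out; gb2 = g_out.sum(axis=0)
    g_hidden = (g_out @ W2.T) * (1 - H ** 2)        # tanh'(a) = 1 - tanh(a)^2
    gW1 = X.T @ g_hidden; gb1 = g_hidden.sum(axis=0)
    # steepest-descent update
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(f"final MSE: {mse:.4f}")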
Interpretation: neural networks
An important feature of the fat-shattering dimension for these classes is that it does not depend on the number of parameters (e.g., weights in a neural network), but rather on their sizes.
These measures, therefore, motivate a form of weight decay. Indeed, one consequence of the above result is a justification of the standard error function used in back-propagation optimization incorporating weight decay.
BP Algorithms
In the case of neural networks, the question naturally arises as to whether there might exist a polynomial-time algorithm for optimizing the soft margin bound.
The analysis has also placed the optimization of the quadratic loss used in the back-propagation algorithm on a firm footing, though, in this case, no polynomial time algorithm is known.
Disadvantage
A nice and direct proof of this bound is systematically described in N. Cristianini and J. Shawe-Taylor (2001).
There are now many such bounds.
One-class references
D. Tax and R. Duin. Support vector data description. Machine Learning, 2004.
B. Schölkopf, et al. New support vector algorithms. Neural Computation, 2000.
I. Steinwart, et al. A Classification Framework for Anomaly Detection. JMLR, 2005.