Machine Learning and Data Mining 2006. Statistical Machine Learning. Qing Tao (陶卿), Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所).



Outline
Statistical learning algorithms for one-class problems.
Challenging problems
Data Mining
Challenges in the areas of data storage, organization and searching have led to the new field of data mining.
Vast amounts of data are being generated in many fields, and our job is to extract the important patterns and trends and to understand what the data says. This is called learning from data.
Machine Learning
The learning problems can be roughly categorized as supervised and unsupervised.
Supervised: classification, regression and ranking;
Unsupervised: one-class, clustering and PCA.
Application in PR
Pattern recognition system:
Difference
Statistical machine learning: inductive inference from finite samples.
Biometrics
Biometrics refers to the automatic identification of a person based on his/her physiological or behavioral characteristics.
Bioinformatics
In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of genomes. Popular sequence databases have been growing at exponential rates.
This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.
ISI
Intelligence and Security Informatics (ISI) is an emerging field of study aimed at developing advanced information technologies, systems, algorithms, and databases for national- and homeland-security-related applications.
Confusion
Many researchers claim that they are studying statistical machine learning methods.
Often, however, their interest is only in applying statistical machine learning algorithms.

Machine learning community
To analyze different algorithms theoretically;
To develop new theory and learning algorithms for new problems.


Performance
Not only prediction accuracy, but also implementation, speed, understandability, etc.
Theoretical Analysis
Model Selection: estimating the performance of different models in order to choose the best one.
Model Assessment: having chosen a final model, estimating its prediction error on new data.
Ian Hacking
The quiet statisticians have changed our world:
not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions…
Statistical learning
Not limited to the statistical learning theory of Vapnik.
Andreas Buja
There is no true interpretation of anything:
Interpretation is a vehicle in the service of human comprehension. The value of an interpretation is in enabling others to fruitfully think about an idea.
Interpretation of Algorithms
Almost all the learning algorithms can be illustrated theoretically and intuitively;
The probability and geometric explanations not only help us to understand the algorithms theoretically and intuitively, but also motivate us to develop elegant and practical new algorithms.
Main references
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
Main kinds of theory
Definition of classification
Definition of regression
Several well-known algorithms
Framework of algorithms
Linear to nonlinear (neural networks, kernel); single to ensemble (boosting); pointwise to continuous (one-class problems); local to global (KNN and LMS).
Design of algorithms
Usually, the algorithm for a more complex hypothesis space should reduce, as a special case, to the algorithm for a simpler hypothesis space.
The algorithm for the simple hypothesis space then serves as the starting point of the complete framework.
Bayesian classification
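As a standard point of reference (textbook material, supplied here rather than recovered from the slide), the Bayes classifier under 0-1 loss assigns

\hat{G}(x) = \arg\max_k P(G = k \mid X = x) = \arg\max_k \, \pi_k \, p(x \mid G = k),

where \pi_k is the prior probability of class k and p(x \mid G = k) is the class-conditional density.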
Bayesian: regression
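Likewise, under squared-error loss the Bayes-optimal regression function is the conditional expectation (again a standard textbook formula, supplied for reference):

f(x) = E[Y \mid X = x] = \arg\min_c E\big[(Y - c)^2 \mid X = x\big].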
Estimating densities
Knowledge of the density functions would allow us to solve whatever problems can be solved on the basis of the available data;
Vapnik's principle: never solve a problem that is more general than the one you actually need to solve.
KNN
Interpretation: KNN
Assuming that the classifier is well approximated by a locally constant function, conditioning at a point is relaxed to conditioning on some region around the target point.
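A standard way of writing this relaxation (following Hastie et al., 2001; the notation below is theirs, not the slide's) is the k-nearest-neighbor estimate

\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big),

where N_k(x) is the neighborhood containing the k training points closest to x.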
LMS
Interpretation: LMS
Assuming that the classifier is well approximated by a globally linear function, the expectation is approximated by averages over the training data.
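Concretely (standard least-squares formulas, given here for reference), the LMS fit minimizes the residual sum of squares over a globally linear model,

\mathrm{RSS}(\beta) = \sum_{i=1}^{l} (y_i - x_i^{\top}\beta)^2, \qquad \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y,

so the conditional expectation is replaced by an average over the training data.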
Fisher Discriminant Analysis
To seek a direction for which the projected samples are well separated.
(Figure: samples from two classes ω1 and ω2 in the (x1, x2) plane, projected onto a direction w; the projected values y1 and y2 of the two classes are well separated.)
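In the usual notation (a standard textbook formulation, supplied for reference), FDA chooses the direction w maximizing the Rayleigh quotient

J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}, \qquad w^{*} \propto S_W^{-1}(m_1 - m_2),

where S_B and S_W are the between-class and within-class scatter matrices and m_1, m_2 are the class means.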
Interpretation: FDA
Generally speaking, it is not optimal.
FDA is the Bayes optimal solution if the two classes are distributed according to a normal distribution with equal covariance.
FDA and LMS
Least squares approaches for classification.
The solution to the least squares problem is in the same direction as the solution of Fisher’s discriminant.
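A standard statement of this equivalence (textbook material, not quoted from the slide): if the class labels are coded as l/l_1 for one class and -l/l_2 for the other, the least-squares weight vector satisfies w \propto S_W^{-1}(m_1 - m_2), i.e. up to scale (and sign) it is Fisher's direction.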
FDA: a novel interpretation
T. Centeno, N. Lawrence. Optimizing kernel parameters and regularization coefficients for non-linear discriminant analysis. JMLR, 7 (2006).
A novel Bayesian interpretation of FDA relating Rayleigh’s coefficient to a noise model that minimizes a cost based on the most probable class centers and that abandons the ‘regression to the labels’ assumption used by other algorithms.
FDA: parameters
Going further, with the use of a Gaussian process prior, they show the equivalence of their model to a regularized kernel FDA.
A key advantage of their approach is the facility to determine kernel parameters and the regularization coefficient through the optimization of the marginal log-likelihood of the data.
FDA: framework of algorithms
Qing Tao, et al. The Theoretical Analysis of FDA and Applications. Pattern Recognition. 39(6):1199-1204.
Similar in spirit to the maximal margin algorithm, FDA with zero within-class variance is proved to serve as the starting point of the complete FDA framework.
Disadvantage
Motivation;
Bias and variance analysis
The bias-variance decomposition is a very powerful and widely-used tool for understanding machine-learning algorithms;
It was originally developed for squared loss.
Bias-Variance Decomposition
\mathrm{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat{f}(x_0)\big) + \mathrm{Var}\big(\hat{f}(x_0)\big).
Bias-Variance Tradeoff
Often, the variance can be significantly reduced by deliberately introducing a small amount of bias.
Blue area: the noise error (sigma), centered at the truth.
Large yellow circle: the variance of the least-squares fit, centered at the closest fit in the whole population.
Smaller circle: a shrunken fit, with smaller variance but higher bias.
Model space: the set of all possible predictions from the model.
Question: what is the definition of a model? Could we say "linear model", but then what is the role of least squares; is it also a model itself?
Interpretation: KNN
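For a k-nearest-neighbor fit the decomposition takes the following standard form (Hastie et al., 2001; given here for reference):

\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k},

so increasing k reduces the variance term while typically increasing the bias term.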
Ridge regression
LMS: ill-posed problem;
Compared with LMS under certain assumptions, it introduces a small amount of bias.
Interpretation: ridge regression
Analytic solution
The technique can be viewed as a way to simultaneously reduce the risk and increase the numerical stability of LMS.
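Explicitly (the standard ridge formula, assumed here rather than taken from the slide):

\hat{\beta}^{\mathrm{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y, \qquad \lambda > 0.

Adding \lambda I makes the matrix to be inverted well conditioned, which is the source of the improved numerical stability.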
Interpretation: parameter
Effective degrees of freedom: experimental analysis;
The key result is a dramatic reduction of parameter variance.
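The usual definition of the effective degrees of freedom (standard, e.g. Hastie et al., 2001) is

\mathrm{df}(\lambda) = \mathrm{tr}\big[X (X^{\top}X + \lambda I)^{-1} X^{\top}\big],

which equals the number of parameters at \lambda = 0 and decreases toward 0 as \lambda \to \infty.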
A note
A new class of generalized Bayes minimax ridge regression estimators. The Annals of Statistics. 2005, 33(4).
The risk reduction aspect of ridge regression was often observed in simulations but was not theoretically justified. Almost all theoretical results on ridge regression in the literature depend on normality.
Other loss functions
P. Domingos. A Unified Bias-Variance Decomposition and its Applications. ICML, 2000.
The resulting decomposition specializes to the standard one for the squared-loss case, and to a close relative of Kong and Dietterich’s 1995 one for the zero-one case.
Interpretation: boosting
Both Bagging and Boosting reduce error by reducing the variance term (Breiman, 1996);
Bauer and Kohavi (1999) demonstrated that Boosting does indeed seem to reduce bias for certain real-world problems.
Interpretation: margin
Domingos 2000:
Schapire’s (1997) notion of “margin” can be expressed as a function of the zero-one bias and variance, making it possible to formally relate a classifier ensemble’s generalization error to the base learner’s bias and variance on training examples.
Interpretation: SVM
G. Valentini, T. Dietterich. Bias-variance analysis of SVM for the development of SVM-based ensemble methods. JMLR, 5, 2004.
They present an extended experimental analysis of the bias-variance decomposition of the error in SVMs, considering Gaussian, polynomial and dot-product kernels.
SVM: experimental analysis
A characterization of the error decomposition is provided, by means of the analysis of the relationships between bias, variance, kernel type and its parameters.
The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially with Gaussian and polynomial kernels.
Interpretation: base learners
The effectiveness of ensemble methods depends on the specific characteristics of the base learners;
The bias-variance decomposition offers a rationale to develop ensemble methods using SVMs as base learners.
Disadvantage
To estimate the bias and variance we need to know the actual function being learned, which is unavailable for real-world problems;
Bias and variance can nevertheless be estimated by using the bootstrap to replicate the data, as sketched below.
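A minimal sketch of this idea, assuming squared loss, a synthetic data set, and an ordinary least-squares base learner (all of which are my own assumptions, not the slide's):

# Bootstrap estimate of variance (and a crude bias proxy) for a least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 200                                              # sample size, bootstrap replicates
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)      # noisy targets

X1 = np.hstack([np.ones((n, 1)), X])                         # add intercept column
preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, size=n)                         # resample with replacement
    beta, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)  # least-squares fit on the replicate
    preds[b] = X1 @ beta                                     # predict on the original points

mean_pred = preds.mean(axis=0)
variance = preds.var(axis=0).mean()                          # bootstrap variance, averaged over x
bias2 = np.mean((mean_pred - y) ** 2)                        # crude bias^2 proxy: targets stand in for the unknown truth
print(f"bootstrap variance ~ {variance:.3f}, bias^2 proxy ~ {bias2:.3f}")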
Generalization bound
PAC Framework;
VC Theory;
PAC Framework
Probably Approximately Correct: a concept class F is PAC-learnable if, for every target f \in F and all 0 < \varepsilon, \delta < 1, the learner outputs a hypothesis h with P(h(x) \ne f(x)) \le \varepsilon with probability at least 1 - \delta, using a number of examples polynomial in 1/\varepsilon and 1/\delta.
VC Theory and PAC Bounds
Landmark paper by Blumer et al. (1989);
Greatly influenced the field of machine learning;
VC theory and PAC bounds have been used to analyze the performance of learning systems as diverse as decision trees, neural networks, and others.
PAC Bounds for Classification
\mathrm{err}_D(h_S) \le \varepsilon(l, H, \delta) holds with probability at least 1 - \delta.
\varepsilon(l, H, \delta): a bound depending on the sample size l, the hypothesis class H, and the confidence \delta.
VC Dimension
The VC dimension of a hypothesis class is the size of the largest set that it can shatter.
A consistency problem
In spite of
Remarks on PAC+VC Bounds
The size of the training set required to ensure good generalization scales linearly with the VC dimension.
Being distribution-free, such bounds are unable to take advantage of benign distributions.
SVM: linearly separable
The linearly separable SVM: maximizing the margin.
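In the linearly separable case the maximal-margin problem is usually written as (standard formulation, supplied for reference)

\min_{w, b} \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, \dots, l,

and the geometric margin of the resulting hyperplane is 1/\|w\|, so minimizing \|w\| maximizes the margin.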
SVM: soft margin
C-SVM
The impasse of the NP-hardness of minimizing the training error has been avoided.
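The soft-margin (C-SVM) problem introduces slack variables \xi_i (again the standard formulation, supplied for reference):

\min_{w, b, \xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,

replacing the combinatorial problem of minimizing the number of training errors with a convex one.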
SVM: algorithms
Hypothesis spaces: linearly separable; linearly inseparable; kernel.
The maximal margin algorithm serves as the starting point of the complete SVM framework (Shawe-Taylor et al., 1998).

Bound: VC Dimension
provided
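A commonly quoted form of this bound (as given in Cristianini and Shawe-Taylor, 2000; the exact statement on the slide is not known): for a hypothesis h consistent with the l training examples, with probability at least 1 - \delta,

\mathrm{err}_D(h) \le \frac{2}{l}\Big(d \log\frac{2el}{d} + \log\frac{2}{\delta}\Big),

provided d \le l, where d is the VC dimension of the hypothesis class.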
Bound: VC dimension+errors
Structural risk minimization principle;
Occam’s razor: a simple function that explains most of the data is preferable to a complex one.
Disadvantages of SRM
According to the SRM principle, the structure has to be defined a priori, before the training data appear (Shawe-Taylor et al., 1998).
The maximum margin algorithm violates this principle in that the hierarchy defined depends on the data.
Disadvantage: PAC+VC bound
Not data-dependent: bounds should rely on an effective complexity measure rather than the a priori VC dimension;
It cannot be applied in high-dimensional feature spaces.
Several concepts
fat-shattering dimension;
Covering number.
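For reference (standard definitions, not recovered from the slide): a set \{x_1, \dots, x_n\} is \gamma-shattered by a real-valued function class F if there are levels r_1, \dots, r_n such that every labelling b \in \{-1,+1\}^n is realized by some f \in F with f(x_i) \ge r_i + \gamma when b_i = +1 and f(x_i) \le r_i - \gamma when b_i = -1; \mathrm{fat}_F(\gamma) is the size of the largest \gamma-shattered set. The covering number N(\gamma, F, l) is the smallest number of functions needed to approximate every f \in F to within \gamma on any sample of l points.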
Generalization Bound: margin
The first paper about margin results: J. Shawe-Taylor and P. L. Bartlett, 1998.
provided
Importance of Margin
Over the last decade, both theory and practice have pointed out that the concept of margin is central to the success of SVM and boosting.
Generally, a large margin implies good generalization performance.
Vapnik’s three periods
1970–1990: Development of the Basics of Statistical Learning Theory (the VC theory).
1992–2004: Development of Large Margin Technology (SVMs).
2005–…: Development of Non-Inductive Methods of Inference.
Neural networks
MSE criterion and steepest descent method.
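A minimal sketch of this combination (my own illustration with assumed synthetic data and a single tanh hidden layer, not code from the lecture):

# One-hidden-layer network trained by steepest (gradient) descent on the MSE criterion.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)                                   # target function, shape (200, 1)

W1 = rng.normal(scale=0.5, size=(1, 10)); b1 = np.zeros(10)
W2 = rng.normal(scale=0.5, size=(10, 1)); b2 = np.zeros(1)
lr = 0.1                                            # step size for steepest descent

for step in range(2000):
    H = np.tanh(X @ W1 + b1)                        # hidden layer activations
    out = H @ W2 + b2                               # network output
    err = out - y
    mse = np.mean(err ** 2)                         # MSE criterion
    # back-propagate the gradient of the MSE
    g_out = 2 * err / len(X)
    gW2 = H.T @ g_out; gb2 = g_out.sum(axis=0)
    g_hidden = (g_out @ W2.T) * (1 - H ** 2)        # tanh'(a) = 1 - tanh(a)^2
    gW1 = X.T @ g_hidden; gb1 = g_hidden.sum(axis=0)
    # steepest-descent update
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(f"final MSE: {mse:.4f}")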
Interpretation: neural networks
An important feature of the fat-shattering dimension for these classes is that it does not depend on the number of parameters (e.g., weights in a neural network), but rather on their sizes.
These measures, therefore, motivate a form of weight decay. Indeed, one consequence of the above result is a justification of the standard error function used in back-propagation optimization incorporating weight decay.
BP Algorithms
In the case of neural networks, the question naturally arises as to whether there might exist a polynomial-time algorithm for optimizing the soft margin bound.
The analysis has also placed the optimization of the quadratic loss used in the back-propagation algorithm on a firm footing, though, in this case, no polynomial time algorithm is known.
Disadvantage
A nice and direct proof of this bound is systematically described in N. Cristianini and J. Shawe-Taylor (2001).
There are now many such bounds.
One-class references
D. Tax and R. Duin. Support vector data description. Machine Learning, 2004.
B. Schölkopf, et al. New support vector algorithms. Neural Computation, 2000.
I. Steinwart, et al. A Classification Framework for Anomaly Detection. JMLR, 2005.