Daniel Roggen
2011
Wearable Computing
Part IV
Ensemble classifiers
Insight into ongoing research
© Daniel Roggen www.danielroggen.net [email protected]
[Figure: activity recognition chain. Runtime (recognition phase): sensor sampling (S0 to S4) → preprocessing (P0 to P4) → segmentation → feature extraction (F0 to F3) → classification (C0 to C2) → decision fusion → null class rejection (R) → reasoning → context/activity (A1, p1, t1 … A4, p4, t4 over time t) → activity-aware application. Subsymbolic processing precedes symbolic processing. Design-time (training phase): sensor data and annotations are used to train low-level activity models (primitives) and high-level activity models.]
Many classifiers: Ensemble classifiers
• What is it?
• How to generate ensembles?
• What are they useful for in wearable computing?
What are ensemble classifiers?
{(X1,y1),(X2,y2)…(Xn,yn)}
Decision fusion
Why?
• Intuitively: increasing the confidence in the decision taken
– Seek additional opinion before making a decision
– Read multiple product reviews
– Request reference before hiring someone
Background
• 1785: Condorcet’s Jury Theorem
– Probability of a group of individuals arriving at a correct decision
– Each individual votes correctly (probability p) or incorrectly (probability 1-p)
– With p>0.5, the more voters the higher the probability that the majority decision is correct
– « Theoretical basis for democracy »
http://en.wikipedia.org/wiki/Condorcet_jury_theorem
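The theorem can be checked numerically. The following is a minimal sketch, assuming an odd number of independent voters who are each correct with the same probability p; the function name is illustrative and does not come from the cited sources.

```python
from math import comb


def majority_correct_probability(n_voters: int, p: float) -> float:
    """Probability that a strict majority of independent voters is correct.

    Assumes every voter is correct with the same probability p and that
    votes are independent (the setting of Condorcet's jury theorem).
    """
    if n_voters < 1 or n_voters % 2 == 0:
        raise ValueError("n_voters must be a positive odd number")
    smallest_majority = n_voters // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(smallest_majority, n_voters + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 11, 51):
        print(n, round(majority_correct_probability(n, 0.6), 3))
```

With p = 0.6 the printed probability grows with the number of voters; with p below 0.5 the same function shows the opposite trend, which is why weak but better-than-chance classifiers are the interesting case.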
Also known as…
• Combination of multiple classifiers
• Classifier fusion
• Classifier ensembles
• Mixture of experts
• Consensus aggregation
• Composite classifier systems
• Dynamic classifier selection
• …
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Why are classifier ensembles interesting?
• Ruta: Another approach [to progress in decision support systems] suggests that as the limits of the existing individual method are approached and it is hard to develop a better one, the solution of the problem might be just to combine existing well performing methods, hoping that better results will be achieved.
• Dietterich: The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up.
• Polikar: If we had access to a classifier with perfect generalization performance, there would be no need to resort to ensemble techniques. The realities of noise, outliers and overlapping data distributions, however, make such a classifier an impossible proposition. At best, we can hope for classifiers that correctly classify the field data most of the time. The strategy in ensemble systems is therefore to create many classifiers, and combine their outputs such that the combination improves upon the performance of a single classifier.
Ruta et al., An overview of classifier fusion methods, Computing and Information Systems, 2000
Dietterich, Ensemble methods in machine learning, Proc. Multiple Classifier Systems, 2000
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Motivation
Dietterich, Ensemble methods in machine learning, Proc. Multiple Classifier Systems, 2000
• Statistical: insufficient data
– Many classifiers give the same accuracy on the training data
– An ensemble of ‘accurate’ classifiers reduces the risk of choosing the wrong classifier
• Computational: enough training data, but computationally difficult to find the best classifier
– Local optima
– An ensemble constructed from different starting points better approximates f
• Representational: the ‘true f’ cannot be represented by any of the classifiers in H
– A combination of multiple classifiers expands the set of representable functions
Dietterich: “These three fundamental issues are the three most important ways in which existing learning algorithms fail. Hence, ensemble methods have the promise of reducing (and perhaps even eliminating) these three key shortcomings of standard learning algorithms.”
Motivation
• Statistical reasons
– Good performance on the training set does not guarantee generalization
– Combining classifiers reduces the risk of selecting a poorly performing one
• Large volume of data
– Training classifiers with large amounts of data can be impractical
– Partition the data into smaller subsets, then train and combine specific classifiers
• Too little data
– Resampling techniques and training of different classifiers on (random) subsets
• Data fusion
– Multiple/multimodal sensors
– A specific classifier is trained for each modality, and the classifiers are then combined
• Divide and conquer
– Decision boundary too complex for a single classifier
– Approximate the complex decision boundary with multiple classifiers
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Divide and conquer
Classifier selection / Classifier fusion
• Classifier selection: Use an expert in a local area of the feature space
• Classifier fusion: merge individual (weaker) learners to obtain a single (stronger) learner
The diversity problem
• Classifiers must (in a fused sense) agree on the right decision
• When classifiers disagree, they must disagree differently
5 classifiers, majority voting
Classifier Decision
h0: 0
h1: 1
h2: 0
h3: 2
h4: 3
• Classifiers are diverse if they make different errors on data points
• A strategy for ensemble generation must find diverse classifiers
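The five-classifier example above can be reproduced with a few lines. This is a minimal sketch of unweighted plurality voting; the function name is illustrative, and reading class 0 as the correct class is an assumption about the example.

```python
from collections import Counter


def plurality_vote(decisions):
    """Return (winning_label, votes_for_winner) for a list of class labels."""
    label, votes = Counter(decisions).most_common(1)[0]
    return label, votes


# Decisions of h0..h4 from the example above.
decisions = [0, 1, 0, 2, 3]
winner, votes = plurality_vote(decisions)
print(winner, votes)  # prints "0 2"
```

Class 0 wins with only two of five votes because the three other classifiers each pick a different class. Had they all made the same error, that error would have won the vote.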
Measuring diversity
• A good diversity measure should relate to the ensemble accuracy
• No strict definition of ‘diversity’ – active area of research
• For two classifiers: statistical literature
• For three+ classifiers: no consensus
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Kuncheva, Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, Machine Learning, 2003
Measuring diversity: pair-wise measures
• Average of all pair-wise diversity measures
• Q-Statistics
• Correlation
• Disagreement, double fault
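The pair-wise measures listed above can be computed from two vectors that record, per test sample, whether each classifier was correct. The sketch below follows the usual definitions found in Kuncheva's work (N11 both correct, N00 both wrong, N10 and N01 exactly one correct); the function name and the returned dictionary layout are assumptions made for illustration.

```python
from math import sqrt


def pairwise_diversity(correct_a, correct_b):
    """Pair-wise diversity measures from two equal-length 0/1 correctness vectors."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty vectors of equal length")
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1  # both classifiers correct
        elif a:
            n10 += 1  # only the first classifier correct
        elif b:
            n01 += 1  # only the second classifier correct
        else:
            n00 += 1  # both classifiers wrong
    n = len(correct_a)
    cross = n11 * n00 - n01 * n10
    q_den = n11 * n00 + n01 * n10
    rho_den = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return {
        "Q": cross / q_den if q_den else float("nan"),
        "correlation": cross / rho_den if rho_den else float("nan"),
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
    }


if __name__ == "__main__":
    print(pairwise_diversity([1, 1, 0, 0], [1, 0, 1, 0]))
```

For an ensemble of more than two classifiers, the measure is averaged over all classifier pairs.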
Measuring diversity: summary
• No diversity measure consistently correlates with higher accuracy
• “although a rough tendency was confirmed. . . no prominent links appeared between the diversity of the ensemble and its accuracy. Diversity alone is a poor predictor of the ensemble accuracy” [1]
• “Although there are proven connections between diversity and accuracy in some special cases, our results raise some doubts about the usefulness of diversity measures in building classifier ensembles in real-life pattern recognition problems.” [2]
[1] Kuncheva, That Elusive Diversity in Classifier Ensembles, IbPRIA, 2003
[2] Kuncheva, Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, Machine Learning, 2003
Measuring diversity: summary
• In the absence of additional information Q may be recommended
– Simple implementation
– Limits: [-1;1]
– Independence value: 0
Kuncheva, Is Independence Good for Combining Classifiers?, Proc. Int. Conf. Pattern Recognition, 2000
How to obtain diversity
Strategies for ensemble generation
1. Enumerating the hypotheses
2. Manipulating the training examples
3. Manipulating the input features
4. Manipulating the output targets
5. Injecting randomness
Dietterich, Ensemble methods in machine learning, Proc. Multiple Classifier Systems, 2000
Brown, Yao, Diversity creation methods: a survey and categorisation, Information Fusion, 2005
Strategy for ensemble generation (1)
Manipulating the training examples
• The learning algorithm is run multiple times on different training subsets
• Suited for unstable classifiers
– Decision trees, neural networks, …
– (Stable: linear regression, nearest neighbor, linear threshold)
• Methods:
– Bagging: randomly draw samples (with replacement) from the training set
– Cross-validation: leave out disjoint subsets from training
– Boosting: draw samples with higher likelihood for difficult samples
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
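Bagging can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure of any cited paper: the base learner is a hypothetical 1-D threshold classifier ("stump"), labels are 0/1, and all function names are made up for this sketch.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a 1-D threshold classifier that predicts 1 if x > threshold, else 0."""
    xs = sorted({x for x, _ in samples})
    # Candidate thresholds: midpoints between distinct values (or the single value).
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or xs

    def n_correct(threshold):
        return sum(int(x > threshold) == y for x, y in samples)

    return max(candidates, key=n_correct)


def bagging(samples, n_models, seed=0):
    """Train n_models stumps, each on a bootstrap replicate of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(samples, k=len(samples))  # draw with replacement
        models.append(train_stump(bootstrap))
    return models


def predict(models, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [(0.1, 0), (0.4, 0), (0.5, 0), (0.9, 1), (1.2, 1), (1.5, 1)]
    ensemble = bagging(data, n_models=11, seed=1)
    print(predict(ensemble, 0.0), predict(ensemble, 2.0))
```

An odd number of models avoids ties in the two-class vote. Boosting would differ in the sampling step only: difficult samples would be drawn with higher probability instead of uniformly.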
Strategy for ensemble generation (1)
Strategy for ensemble generation (2)
Manipulating the input features
• Change the set of input features available to the learning algorithm
• E.g. select/group features according to identical sensors
• Input features need to be redundant
• Input decimated ensembles [1]
[1] Tumer,Oza, Input decimated ensembles, Pattern Anal Applic, 2003
Ho, The Random Subspace Method for Constructing Decision Forests, IEEE PAMI, 1998
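One simple way to manipulate the input features is to give each ensemble member a random subset of them, in the spirit of the random subspace method. The sketch below only shows the feature bookkeeping; the function names are illustrative and the choice of base classifier is left open.

```python
import random


def random_feature_subsets(n_features, subset_size, n_classifiers, seed=0):
    """Draw one random feature-index subset per ensemble member."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_classifiers)
    ]


def project(feature_vector, subset):
    """Keep only the features a given ensemble member is allowed to see."""
    return [feature_vector[i] for i in subset]


if __name__ == "__main__":
    subsets = random_feature_subsets(n_features=6, subset_size=3, n_classifiers=4)
    sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    for subset in subsets:
        print(subset, project(sample, subset))
```

In a wearable setting the subsets need not be random: grouping the features by sensor, as suggested above, gives one classifier per sensor.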
Strategy for ensemble generation (3)
Manipulating the output targets
• Classification: {(X1,y1),(X2,y2)…(Xn,yn)}
• Change the classification problem by changing y
• Error correcting codes
– Change from 1 classifier with K classes to 2-class classifiers, one per code bit (at least log2(K), rounded up)
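Decoding with error-correcting output codes can be sketched as a nearest-codeword search. The codebook below is a hypothetical example for K = 4 classes with 5 bits and a minimum Hamming distance of 3, so a single wrong binary classifier can be corrected; it is not taken from the cited literature.

```python
# Hypothetical codebook: one 5-bit codeword per class, one binary classifier per bit.
CODEBOOK = {
    0: (0, 0, 0, 0, 0),
    1: (0, 1, 0, 1, 1),
    2: (1, 0, 1, 0, 1),
    3: (1, 1, 1, 1, 0),
}


def hamming(a, b):
    """Number of positions in which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is closest to the binary classifier outputs."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


if __name__ == "__main__":
    print(decode((0, 1, 0, 1, 1)))  # exact codeword of class 1
    print(decode((1, 1, 0, 1, 1)))  # first bit flipped, still decoded as class 1
```

Two bits would be enough to label four classes; the three extra bits are what buys the tolerance to one misclassifying binary learner.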
Strategy for ensemble generation (4)
Injecting randomness
• Randomness in the learning algorithm
• E.g.
– Initial weights of a neural network
– Initial parameters of an HMM
– C4.5: random selection among the N best decision tree splits
How to combine the classifiers?
Ruta et al., An overview of classifier fusion methods, Computing and Information Systems, 2000
How to combine the classifiers?
• (Weighted) majority voting
– Class label output
– Select the class most voted for
• Mean rule
– Continuous output
– Support for class wj is the average of the classifier outputs
• Product rule
– Continuous output
– Support for class wj is the product of the classifier outputs
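The three combination rules can be compared on a toy example. This is a minimal sketch: supports[i][j] is assumed to be the continuous support that classifier i gives to class j, and the numbers and weights are invented for illustration.

```python
from math import prod


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def mean_rule(supports):
    """Class with the highest average support across classifiers."""
    n_classes = len(supports[0])
    return argmax([sum(s[j] for s in supports) / len(supports) for j in range(n_classes)])


def product_rule(supports):
    """Class with the highest product of supports across classifiers."""
    n_classes = len(supports[0])
    return argmax([prod(s[j] for s in supports) for j in range(n_classes)])


def weighted_majority_vote(labels, weights=None):
    """Class with the largest total weight; unit weights give plain majority voting."""
    weights = weights or [1.0] * len(labels)
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    supports = [[0.60, 0.40], [0.55, 0.45], [0.05, 0.95]]
    labels = [argmax(s) for s in supports]  # [0, 0, 1]
    print(mean_rule(supports), product_rule(supports))
    print(weighted_majority_vote(labels), weighted_majority_vote(labels, [1, 1, 3]))
```

In this example two hesitant classifiers outvote one confident classifier under plain majority voting (class 0), whereas the mean rule, the product rule, and a vote that trusts the third classifier more all select class 1.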
Which method is better?
• No free lunch: problem dependent
• Ensemble generation
– Boosting vs. bagging: boosting usually achieves better generalization but is more sensitive to noise and outliers
• Ensemble combination
– General case: mean rule, with consistent performance on a broad range of problems
– Reliable estimates of classifier accuracy available: weighted average, weighted majority
– Classifiers output posterior probabilities: product rule
Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
Which method is better?
• Ensemble combination
– No information on the distribution of classifier errors: median
• Always leads to Pe → 0 (as the ensemble grows), even with heavy-tailed distributions
– Error distribution less heavy tailed: mean
– For technical reasons (e.g. communication in WSN) majority vote may be the only one that can be implemented
• Performance of the majority vote strategy coincides with the performance of the median strategy
Cabrera, On the impact of fusion strategies on classification errors for large ensembles of classifiers , Pattern recognition, 2006
In wearable computing
Classifier fusion
• Multimodal sensors & NULL class rejection
• Sound
• Acceleration
• Null class when the sound and acceleration classifications disagree
Ward, Gesture Spotting Using Wrist Worn Microphone and 3-Axis Accelerometer, Proc. Joint Conf on Smart objects and ambient intelligence, 2005
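The agreement-based rejection described above can be sketched as follows. This is a deliberately simplified illustration: the label names are hypothetical, and the comparison scheme actually used by Ward et al. may differ in detail.

```python
NULL_CLASS = "NULL"


def fuse_with_null_rejection(sound_label, acceleration_label):
    """Accept a gesture only if both modality-specific classifiers agree on it."""
    if sound_label == acceleration_label:
        return sound_label
    return NULL_CLASS


if __name__ == "__main__":
    print(fuse_with_null_rejection("sawing", "sawing"))
    print(fuse_with_null_rejection("sawing", "hammering"))
```

Requiring agreement trades recall for precision: segments on which the two modalities conflict are treated as not belonging to any gesture of interest.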
Zappi, Roggen et al. Activity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection. EWSN, 2008.
Stiefmeier et al., Wearable activity tracking in car manufacturing, Pervasive Computing Magazine, 2008
In wearable computing
In wearable computing
Classifier fusion
Sensor Scalability [2]
• Application defined performance
• Clustering
Robustness to faults [1]
• Graceful degradation
• Implicit fault-tolerance
[1] Zappi, Stiefmeier, Farella, Roggen, Benini, Tröster, Activity Recognition from On-Body Sensors by Classifier Fusion: Sensor Scalability and Robustness. ISSNIP 07
[2] Zappi, Lombriser, Stiefmeier, Farella, Roggen, Benini, Tröster, Activity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection, EWSN 08
In wearable computing
Classifier fusion
Power-performance management[1]
[1] Zappi, Roggen et al., Network-level power-performance trade-off in wearable activity recognition: a dynamic sensor selection approach, submitted to ACM Trans. Embedded Computing Systems
In wearable computing
Classifier selection
Stiefmeier, Combining Motion Sensors and Ultrasonic Hands Tracking for Continuous Activity Recognition in a Maintenance Scenario,
Location Class 1 (μ1, σ1): select 'expert' classifier for location class 1
Location Class 2 (μ2, σ2): select 'expert' classifier for location class 2
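Classifier selection by location can be sketched as a two-stage decision. The example assumes a 1-D location measurement and one Gaussian (μ, σ) per location class; the expert classifiers are hypothetical stand-ins, and none of the names come from the cited work.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def select_location_class(location, location_classes):
    """Pick the location class whose Gaussian model best explains the measurement."""
    return max(
        location_classes,
        key=lambda name: gaussian_pdf(location, *location_classes[name][:2]),
    )


def classify(location, features, location_classes):
    """Route the feature vector to the expert of the most likely location class."""
    name = select_location_class(location, location_classes)
    expert = location_classes[name][2]
    return name, expert(features)


if __name__ == "__main__":
    # name -> (mu, sigma, expert classifier); the experts are placeholders.
    classes = {
        "class 1": (0.0, 1.0, lambda features: "gesture A"),
        "class 2": (5.0, 1.0, lambda features: "gesture B"),
    }
    print(classify(0.4, [0.1, 0.2], classes))
    print(classify(4.0, [0.1, 0.2], classes))
```

Unlike fusion, only one expert runs per decision, so each expert only has to be good in its own region of the input space.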
Further applications
• Classification despite missing features
– "A bootstrap-based method can provide an alternative approach to the missing data problem by generating an ensemble of classifiers, each trained with a random subset of the features." [1]
– "Strikingly the reduced-models approach, seldom mentioned or used, consistently outperforms the other two [imputation] methods, sometimes by a large margin." [2]
• E.g.:
– Long term multimodal activity recognition
– Physiological signal assessment
– Opportunistic activity recognition
[1] Polikar, Bootstrap-inspired techniques in computational intelligence, IEEE Signal Processing Magazine, 2007
[2] Provost, Handling Missing Values when Applying Classification Models, Machine Learning Research, 2007
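The idea of classifying despite missing features can be sketched as follows: each ensemble member is tied to the feature subset it was trained on, and at run time only the members whose features are all present get to vote. The feature names and threshold classifiers below are hypothetical and only serve the illustration.

```python
from collections import Counter


def classify_with_missing(sample, ensemble):
    """Majority vote over the ensemble members whose features are all available.

    sample maps feature names to values, with None marking a missing feature.
    ensemble is a list of (feature_names, classifier) pairs.
    Returns None when no member can be evaluated.
    """
    votes = Counter()
    for names, classifier in ensemble:
        if all(sample.get(name) is not None for name in names):
            votes[classifier([sample[name] for name in names])] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = [
        (("acc_wrist", "sound"), lambda v: "drill" if v[0] > 1 and v[1] > 1 else "idle"),
        (("acc_arm",), lambda v: "drill" if v[0] > 1 else "idle"),
        (("acc_wrist",), lambda v: "drill" if v[0] > 1 else "idle"),
    ]
    sample = {"acc_wrist": 2.0, "acc_arm": 1.5, "sound": None}  # microphone missing
    print(classify_with_missing(sample, ensemble))
```

No imputation is needed: the member that depends on the missing sound feature is simply skipped, and the remaining members decide.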
Further applications
• Enhanced robustness in activity recognition
– Typically small datasets: are we using the optimal decision boundary for field deployment?
– Ensembles of classifiers trained with resampling
– Ensembles have different field generalization performance
• Confidence estimation/QoC
– Continuous valued output of ensemble classifiers can estimate posterior probability [1]
• WSN
– "classifiers using data from different sensors are usually uncorrelated to a far greater degree than classifiers which use data from the same sensor" [2]
– Distributed activity recognition (Tiny Task Network): only the classification result is required, lower bandwidth
[1] Muhlbaier, Polikar, Ensemble confidence estimates posterior probability, Int. Workshop on Multiple Classifier Systems, 2005
[2] Fumera, Roli, A theoretical and experimental analysis of linear combiners for multiple classifier systems , IEEE Trans. Pattern Anal. Mach. Intell., 2005
Reasons not to use ensembles
• Classifier with (perfect|good) generalization performance available
• Decreased comprehensibility
• Limited storage and computational resources
• Correlated errors or uncorrelated errors at rate higher than chance
Summary
• Large body of research showing benefits of ensembles
• Some ensemble classifiers already in use in Wearable Computing
• Potentials: missing features, confidence/QoC, improved robustness, WSN
• Active field of research
Further reading
Reviews, books
• Ruta et al., An overview of classifier fusion methods, Computing and Information Systems, 2000
• Dietterich, Ensemble methods in machine learning, Proc. Multiple Classifier Systems, 2000
• Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems magazine, 2006
• Polikar, Bootstrap-inspired techniques in computational intelligence, IEEE Signal Processing Magazine, 2007
• Kuncheva, Combining Pattern Classifiers, Methods and Algorithms, Wiley, 2005
Diversity
• Kuncheva, Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, Machine Learning, 2003
• Brown, Yao, Diversity creation methods: a survey and categorisation, Information Fusion, 2005
Decimation
• Tumer, Oza, Input decimated ensembles, Pattern Anal Applic, 2003
• Ho, The Random Subspace Method for Constructing Decision Forests, IEEE PAMI, 1998
Confidence
• Muhlbaier, Polikar, Ensemble confidence estimates posterior probability, Int. Workshop on Multiple Classifier Systems, 2005
• Tourassi, Reliability Assessment of Ensemble Classifiers-Application in Mammography
Missing features
• Provost, Handling Missing Values when Applying Classification Models, Machine Learning Research, 2007
Conferences
• Proc. Workshop Multiple Classifier Systems (Springer)
Various
• Cabrera, On the impact of fusion strategies on classification errors for large ensembles of classifiers, Pattern recognition, 2006
• Fumera, A theoretical and experimental analysis of linear combiners for multiple classifier systems, IEEE Trans. Pattern Anal. Mach. Intell., 2005
Multiplication of sensors in real-world use
http://www.opportunity-project.eu
Activity recognition with sensors that just happen to be available
Opportunistic activity recognition
Designing a pattern recognition system without knowing the input space!
The OPPORTUNITY activity recognition chain
WP4 Ad-hoc cooperative sensing
OPPORTUNITY Architecture, Recognition goal, Self-* principles
• Specify what should be recognized but not how
– E.g.: « Detect grasping manipulative activities with wearable sensors »
• Self-organization in a coordinated sensing mission
– E.g.: « Recognition of manipulative activities » calls for sensors capable of providing movement information, and placed on body to network
• Sensor self-description (statically known characteristics)
WP1 Sensor and features
Filter variations
• Conditioning: re-define features to make them less sensitive to variations
– E.g. use the magnitude of the acceleration signal rather than the X, Y, Z vector
• Abstraction: different modalities map to the same feature space
– E.g. hand coordinates from inertial sensors or a localization system
• Self-characterization: run-time characteristics
– E.g. location, orientation
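The conditioning example above can be made concrete. The magnitude of a 3-axis acceleration sample does not change when the sensor is rotated, which is what makes it less sensitive to orientation changes than the raw X, Y, Z values. A minimal sketch, with an illustrative function name:

```python
from math import sqrt


def acceleration_magnitude(x, y, z):
    """Euclidean norm of one 3-axis acceleration sample."""
    return sqrt(x * x + y * y + z * z)


if __name__ == "__main__":
    # The same physical acceleration seen by two differently oriented sensors.
    print(acceleration_magnitude(3.0, 4.0, 0.0))
    print(acceleration_magnitude(0.0, -4.0, 3.0))
```

The price of this invariance is that direction information is discarded, so gestures that differ only in direction become harder to separate.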
WP2: Opportunistic classifiers
Robust classification & allow for adaptation
• Dynamic « Ensemble classifier » architecture
• Dynamic selection of most informative information channel
• Allow for multimodal data, changing sensor numbers
• Allow for adaptation
[Figure: sensor0, sensor1, … sensorn → classifier0, classifier1, … classifiern → decisions c0, c1, … cn → Fusion → class (user gesture)]
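A fusion stage that tolerates a changing number of sensors can be sketched as a vote over whichever per-sensor decisions are currently available. The sensor identifiers and gesture labels below are hypothetical, and plain majority voting is only one possible fusion rule.

```python
from collections import Counter


def fuse_available(decisions):
    """Majority vote over the sensors that currently deliver a decision.

    decisions maps a sensor id to its classifier output, or to None when the
    sensor is absent, switched off, or has failed.
    """
    votes = Counter(label for label in decisions.values() if label is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(fuse_available({"sensor0": "drink", "sensor1": "drink", "sensor2": "wave"}))
    # sensor1 drops out; the remaining sensors still produce a decision.
    print(fuse_available({"sensor0": "drink", "sensor1": None,
                          "sensor2": "drink", "sensor3": "wave"}))
```

Because the fusion rule does not depend on a fixed set of inputs, sensors can join or leave at run time without retraining the per-sensor classifiers.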
WP3 Dynamic adaptation and autonomous evolution
Run-time monitoring and adaptation of the system
• Adaptation to slow changes, long-term, concept drift
– Sensor degradation, change in user action-motor strategies
• Use new sensors
– Sensing infrastructure changes with upgrades
• Opportunistic user feedback
– Explicit: e.g. feedback through keyboard
– Implicit: e.g. from EEG signals
Dynamic adaptation: power-performance management
• Dynamic ensemble classifiers
• Passively: ensemble classifiers allow for changes in the environment
• Actively: benefit of dynamic adaptation
Zappi et al. Network-level power-performance trade-off in wearable activity recognition: a dynamic sensor selection approach, To appear ACM TECS
Adaptation: Classifier self-calibration to sensor displacement
Förster, Roggen, Tröster, Unsupervised classifier self-calibration through repeated context occurences: is there robustness against sensor displacement to gain?, Proc. Int. Symposium Wearable Computers, 2009
Calibration dynamics: class centers follow cluster displacement in feature space
Self-calibration to displaced sensors increases accuracy:
• by 33.3% in HCI dataset
• by 13.4% in fitness dataset
Principle: upon activity detection, classifiers are re-trained to better model the last classified activity
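The principle can be sketched with a nearest class center classifier whose winning center is nudged toward each newly classified sample. This is an illustration under assumptions: the update rule, the learning rate alpha, and the function names are not taken from the cited paper.

```python
def nearest_center(centers, x):
    """Label of the class center closest to x (squared Euclidean distance)."""
    return min(
        centers,
        key=lambda label: sum((c - xi) ** 2 for c, xi in zip(centers[label], x)),
    )


def classify_and_calibrate(centers, x, alpha=0.1):
    """Classify x, then move the winning class center a fraction alpha toward x."""
    label = nearest_center(centers, x)
    centers[label] = [(1 - alpha) * c + alpha * xi for c, xi in zip(centers[label], x)]
    return label


if __name__ == "__main__":
    centers = {"a": [0.0, 0.0], "b": [10.0, 10.0]}
    # Samples of class "a" drift after a (simulated) sensor displacement.
    for sample in ([1.0, 0.0], [2.0, 0.5], [3.0, 1.0]):
        print(classify_and_calibrate(centers, sample), centers["a"])
```

Since no labels are used, the update follows the classifier's own decision. This tracks gradual displacement, but a misclassified sample pulls the wrong center, so a small alpha is the cautious choice.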
Adaptation: minimally user-supervised adaptation
[Figure: acceleration data → recognized gesture; error button]
Förster et al., Incremental kNN classifier exploiting correct - error teacher for activity recognition, Submitted to ICMLA 2010
Adaptation: minimally user-supervised adaptation
• Adaptation leads to:
– Higher accuracy in the adaptive case vs. control
– Higher input rate
– More "personalized" gestures
Förster et al., Online user adaptation in gesture and activity recognition - what’s the benefit? Tech Rep.
Förster et al., Incremental kNN classifier exploiting correct - error teacher for activity recognition, Submitted to ICMLA 2010
Förster et al., On the use of brain decoded signals for online user adaptive gesture recognition systems, Pervasive 2010
Adaptation: with brain-signal feedback
• ~9% accuracy increase with perfect brain signal recognition
• ~3% accuracy increase with effective brain signal recognition accuracy
• Adaptation guided by the user’s own perception of the system
• User in the loop
Using new sensors without supervision…
• New sensors may be discovered
– Infrastructure upgrades
– Entering a new environment
• Problem: How to use the sensor without self-*?
– Typical in open-ended environments
– Hard to predict what future sensors will be deployed
• Unsupervised approaches to use new sensors!
Using new sensors without supervision… … using behavioral assumptions
• Can a reed switch recognize different gestures and modes of locomotion?
• Extract maximum information content from simple sensors
– Use behavioral assumptions
Using new sensors without supervision… … using behavioral assumptions
Application to Opportunity Dataset
• Functionality of wearable sensor is learned incrementally
• Autonomous training of wearable systems
• Only needed: sporadic interactions with the environment
• Applicable in WSN/AmI as demonstrated by hardware implementation
Calatroni et al. Context Cells: Towards Lifelong Learning in Activity Recognition Systems, EuroSSC 2009
Transfer of recognition capabilities
• System designed for domain 1 should work in domain 2
• Changes of sensors between setup 1 and 2
Roggen et al., Wearable Computing: Designing and Sharing Activity-Recognition Systems Across Platforms, IEEE Robotics&Automation Magazine, 2011
Summary
• Improving wearability & user-acceptance
• Addressing real-world deployment issues
• Enabling large-scale Ambient Intelligence environments
www.opportunity-project.eu
EC grant n° 225938