TRANSCRIPT
Machine Learning and Common Sense
Dr. Lothar Richter
Structure of the Talk
● general ideas of machine learning
● some vocabulary and basics
● machine learning and bioinformatics
● hidden assumptions
Ideas
● a machine learning device can generalize from real-world observations into a "formal" model
● each model reflects only a few aspects of reality
● no model can completely represent reality, i.e. a photograph of a dog remains a photograph and not a real dog
● the model should reflect a concept or commonalities, not individual characteristics
Inductive Bias
● every learning scheme discards some aspects of reality to construct a model
● which aspects are discarded may differ between learning schemes
● this may already happen at the level of feature extraction, i.e. when choosing the types and values used to represent an observation
● this is not to be mistaken for the predictive bias
Some vocabulary
● learning scheme: a specific learning algorithm producing a model, such as decision trees, rule-based systems, SVMs, Bayesian networks, etc.
● attribute/feature: a variable describing a specific aspect of real-world observations, like body weight, color, or whether a certain property was found (yes/no)
● instance: a single observation describing an observed event by assigning a value to each feature used to represent this observation
Some vocabulary II
● training: the phase of analyzing real-world observations in a formalized representation to derive parameters and/or internal structure
● test phase: the phase of model application that determines the reliability of statements (predictions) on instances not used for training
● label: an attribute selected to be predicted
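To make the vocabulary concrete, here is a minimal sketch of how an instance, its features, and its label relate; the attribute names and values are made-up illustrations, not from any real data set:

```python
# Vocabulary sketch: an instance assigns a value to each feature;
# one attribute is singled out as the label to be predicted.
features = ["body_weight", "color", "has_property"]
label = "species"  # the attribute selected to be predicted

# one instance = one observed event, described by feature values
instance = {"body_weight": 4.2, "color": "grey", "has_property": True,
            "species": "cat"}

x = [instance[f] for f in features]  # feature vector for the learner
y = instance[label]                  # label value
print(x, y)
```

In this representation, a data set is simply a list of such instances; training consumes the `(x, y)` pairs, while the test phase withholds `y` and compares it to the prediction.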
Types of Learning
● depending on the presence of a label, we distinguish between supervised and unsupervised learning
● unsupervised learning: concept learning, frequent item sets, clustering
● supervised learning: everything with labeled data that allows making a prediction
Synonyms
● Data Mining and Machine Learning are widely used as synonyms
● if there is a difference at all, the two terms emphasize different aspects
● also commonly used: statistical learning, KDD (knowledge discovery in databases)
● sometimes one also distinguishes descriptive and inductive data mining, or descriptive and predictive modelling
Categorical Types
● nominal: a set of distinct values an attribute can be assigned, like red, blue, man, woman, and so forth
● ordinal: like a nominal type, plus an order relationship for the values, like newborn, infant, pupil, adult, old man
● the boolean type is a two-valued nominal type with true and false
● arithmetic operations are not defined
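The distinction shows up directly when encoding data. A small sketch, with made-up value sets: ordinal values admit order comparisons, while nominal values do not, and neither admits arithmetic:

```python
# Nominal: distinct values with no order and no arithmetic.
nominal_colors = {"red", "blue", "green"}

# Ordinal: distinct values PLUS an order relationship.
ordinal_order = ["newborn", "infant", "pupil", "adult"]
rank = {v: i for i, v in enumerate(ordinal_order)}

# comparisons are meaningful for ordinal values ...
print(rank["infant"] < rank["adult"])   # order relation holds

# ... but arithmetic on the ranks (e.g. averaging them) is NOT
# defined for a purely ordinal type; only the order relation is.
```

Mapping ordinal values to integer ranks is only a device for comparison; treating those ranks as numbers to add or average would silently promote the attribute to a continuous type.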
Continuous Types
● integer: very similar to the ordinal type, but arithmetic makes sense, like number of children
● interval-scaled: contains numerical values measured in equal intervals from an arbitrarily set origin, e.g. temperature in °C; it is ordered, and arithmetic makes sense
● ratio-scaled: similar to interval-scaled, with the extension that a value of 0 indicates the absence of the property
Data Preprocessing
Before you can start to build a model, in most cases the data are subjected to various preprocessing steps:
● Feature Extraction
● Discretization
● Feature Selection
Feature Extraction/Construction
● conversion of observation records into a formalized, computer-readable representation
● definition of an attribute type
● assignment of appropriate attribute values
● this implies a strong involvement of the analyst
● important: common sense, background knowledge from expert domains
Discretization
● conversion of a numeric attribute into a categorical one
● the value range is split into a set of discrete intervals, and the interval label is used as the value
● this can help avoid overfitting
● unsupervised: domain dependent, equal width, equal frequency
● supervised: construct bins with respect to the class label
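The two unsupervised strategies can be sketched in a few lines; the data here are made up purely for illustration. Equal width splits the value range into intervals of the same size, equal frequency puts the same number of instances into each bin:

```python
# Unsupervised discretization: equal-width vs equal-frequency binning.
values = [1.0, 1.2, 1.5, 2.0, 4.0, 8.0, 9.0, 9.5]

def equal_width(values, n_bins):
    # intervals of identical width spanning the observed range
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency(values, n_bins):
    # each bin receives (roughly) the same number of instances
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) / n_bins
    bins = [0] * len(values)
    for pos, i in enumerate(order):
        bins[i] = min(int(pos / size), n_bins - 1)
    return bins

print(equal_width(values, 2))      # split at the midpoint of the range
print(equal_frequency(values, 2))  # half of the instances in each bin
```

Note how the two schemes disagree on the value 4.0: equal width assigns it to the lower interval, while equal frequency would keep the bin sizes balanced instead.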
Feature Selection
● remove values from instances, i.e. discard some features of a data set because they are:
  - irrelevant
  - redundant
  - noisy/faulty
● possible benefits:
  - improve efficiency and accuracy
  - prevent overfitting
  - save space
Feature Selection Strategies
● unsupervised: based on domain knowledge, random sampling
● supervised:
  - measures that consider the class (filtering): Gini index, information gain, Relief, ...
  - use a learning scheme's performance (wrapping):
    ● select the set of attributes which leads to the best performance
    ● forward selection: increase the set of attributes one by one
    ● backward elimination: decrease the set of attributes one by one
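As a sketch of the filtering idea, here is information gain computed on a tiny made-up data set: the gain of a feature is the entropy of the class label minus the weighted entropy after splitting on that feature. A feature with gain 0 tells us nothing about the class; a gain equal to the label entropy identifies the class perfectly:

```python
# Filter measure sketch: information gain of a feature w.r.t. the label.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # entropy of the label minus the weighted entropy within each
    # subset induced by the feature's values
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels  = ["yes", "yes", "no", "no"]
weather = ["rainy", "dry", "rainy", "dry"]   # uninformative here
buddy   = ["yes", "yes", "no", "no"]         # perfectly informative

print(information_gain(weather, labels))  # 0.0
print(information_gain(buddy, labels))    # 1.0
```

A filter ranks all features by such a score and keeps the best ones; a wrapper would instead train the actual learning scheme on candidate feature sets and compare performance.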
Machine Learning and Bioinformatics
● today biology has to span two extremes:
  - statements on the nucleotide level (one level below genes) on the one hand
  - statements on the individual/population level on the other hand
● the speed at which sequence data (nucleotide sequences) are generated has clearly outpaced the speed of analysis and knowledge discovery
● current lab technology cannot even fill the gap between sequence and structure
Primary Data Growth: Nucleotides
taken from http://www.nlm.nih.gov/about/image/2014CJ_fig_5.png

Primary Data Growth: Nucleotides
taken from http://www.ebi.ac.uk/ena/about/statistics

Automatically Derived Proteins
taken from http://www.ebi.ac.uk/uniprot/TrEMBLstats

Proteins with Some Evidence
taken from http://web.expasy.org/docs/relnotes/relstat.html
All Known Structures
[Chart: "Yearly Growth of Total Structures in PDB" — yearly and total structure counts, 1972–2014, y-axis 0 to 120,000]
Role of DM / ML
Data Mining helps to:
● structure and compress the data
● filter out mistakes and outliers caused by experimental errors and other noise
● reduce redundancy
● replace wet-lab analyses with predictions
● detect interesting relationships and models, and direct manpower towards the points where it is needed
Overview of the Steps in KDD
KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).
Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of patterns is often infinite, and the enumeration of patterns involves some form of search in this space.
[Figure: the KDD process — Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]
Figure 1. An Overview of the Steps That Compose the KDD Process.
taken from U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery in Databases" (1996), AI Magazine, 17, 37–54
ML Tools Employed in Bioinformatics
taken from "The rise and fall of supervised machine learning techniques"
BIOINFORMATICS EDITORIAL, Vol. 27 no. 24 2011, pages 3331–3332, doi:10.1093/bioinformatics/btr585
Editorial
The rise and fall of supervised machine learning techniques
Lars Juhl Jensen (1) and Alex Bateman (2)
(1) Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and (2) Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking 'unfashionable'.

We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.
Fig. 1. The growth of supervised machine learning methods in PubMed.
To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.

As for the question of 'what is hot?', one can see that random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.

We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.

We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (see Fig. 2). For all pairs of methods from the Supervised […]
Fig. 2. Heatmap showing the co-occurrence of machine learning techniqueswithin articles.
© The Author(s) 2011. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
L. J. Jensen & A. Bateman, 2011. Bioinformatics, 27(24), 3331-2
Possible Explanations for the Prevalence of ANNs and SVMs
● they are capable of handling a huge number of attributes
● they are quite robust against uninformative features
● they implicitly adjust feature weights during the training phase
● they work sufficiently well
How is Machine Learning Influenced by Underlying Assumptions?
● there are a number of assumptions in the various processing steps
● the performance depends on these assumptions holding
● very often we cannot really check or prove whether this is true
Background Distribution
● we assume that the background distribution is uniform
● i.e. the underlying source emits instances with constant probabilities
● possible solutions:
  - use many features to represent complex scenarios
  - use stream mining algorithms which update parameters
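One way such a stream-mining update can look is a sketch of exponential forgetting: instead of treating the source as constant, the model keeps a running estimate of a parameter (here simply the mean) that recent instances influence more strongly. The forgetting factor `alpha` is an illustrative choice, not a prescribed value:

```python
# Streaming update sketch: a running mean with exponential forgetting,
# so the estimate can follow a drifting (non-constant) source.
def update_mean(mean, x, alpha=0.1):
    # recent instances weigh more; older instances fade out
    return (1 - alpha) * mean + alpha * x

mean = 0.0
for x in [1.0] * 50:      # the source drifts to emit values near 1
    mean = update_mean(mean, x)
print(round(mean, 2))     # the estimate has tracked the drift
```

A batch-trained model with a fixed mean of 0 would keep misrepresenting this source forever; the streaming estimate converges towards the new regime within a few dozen instances.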
Unfair Sampling
● due to "experimental" reasons, the sample represents only a special subset of the entities
● this is especially difficult for lazy learning methods like k-nearest-neighbors
● possible solutions:
  - remove redundancy
  - use stratification
  - check variance and identify difficult instances
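Stratification can be sketched as sampling within each class separately, so the class proportions of the full data set are preserved in the sample; the data and proportions below are made up for illustration:

```python
# Stratified sampling sketch: sample each class separately so the
# sample keeps the class proportions of the full data set.
import random
from collections import Counter

def stratified_sample(instances, labels, fraction, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    sample, sample_labels = [], []
    for y, xs in by_class.items():
        k = round(len(xs) * fraction)        # per-class sample size
        sample.extend(rng.sample(xs, k))
        sample_labels.extend([y] * k)
    return sample, sample_labels

instances = list(range(100))
labels = ["a"] * 80 + ["b"] * 20             # imbalanced classes, 4:1
_, ys = stratified_sample(instances, labels, 0.5)
print(Counter(ys))                           # the 4:1 ratio is preserved
```

A plain random sample of the same size could, by chance, contain very few "b" instances, which is exactly the unfair-sampling problem the slide describes.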
Suitable Representation
● we assume that the selected feature set can represent the concept to learn
● since we often do not know the causal relationships, we might mistake the employed features for the real causing ones
● e.g. number of storks and births in Germany
● e.g. opening umbrellas and rain
● remedy: use background knowledge, careful interpretation
Representation of the Concept
Weather  Temperature  Wind  playTennis
rainy    high         no    yes
dry      medium       yes   yes
rainy    medium       yes   no
dry      medium       no    no
● with these attributes it is really hard to learn favorable conditions for playing tennis
● we unconsciously assume that these attributes are sufficient to describe the scenario
Representation of the Concept
Weather  Temperature  Wind  Buddy  playTennis
rainy    high         no    yes    yes
dry      medium       yes   yes    yes
rainy    medium       yes   no     no
dry      medium       no    no     no
● actually, there is no proof to show whether important attributes are missing or not
● for this aspect, background knowledge and experience are most important
Performance Estimation
● training a model means discarding some information from the observed instances and keeping the rest
● depending on the learning scheme, instance-specific information is stored too
● instance-specific information leads to overfitting
● overfitting: a prediction model is biased towards the training examples, i.e.:
  - better performance on the training examples
  - worse performance on new instances
More Realistic Estimations
● most optimistic estimation: resubstitution error (determine the performance on the very set that was used for training)
● if you have a lot of data: determine the error on an independent test set
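Why the resubstitution error is the most optimistic estimate can be seen with a toy 1-nearest-neighbour classifier on made-up data: every training instance is its own nearest neighbour, so the error on the training set is 0 by construction, regardless of how well the model would do on new instances:

```python
# Resubstitution error sketch with a 1-NN classifier on a 1-d feature.
def predict_1nn(train_x, train_y, q):
    # label of the training instance closest to the query q
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - q))
    return train_y[i]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]

# evaluate on the SAME instances that were "trained" on
resub_errors = sum(predict_1nn(xs, ys, x) != y for x, y in zip(xs, ys))
print(resub_errors)   # deceptively perfect on the training set
```

This is an extreme case of storing instance-specific information: the "model" is the training set itself, so resubstitution tells us nothing about generalization.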
More Realistic Estimations II
● LOOCV: leave-one-out cross-validation
  - one example at a time is held out for testing, the remaining ones are used for training
  - n iterations with n instances; the final result is the average
  - still quite biased
  - use it to check the influence of individual instances
  - or if you only have a small number of instances
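The LOOCV loop itself is short; here is a sketch using a toy 1-nearest-neighbour classifier on made-up data, holding out each instance once and averaging over the n hold-outs:

```python
# Leave-one-out cross-validation sketch: n iterations for n instances.
def predict_1nn(train_x, train_y, q):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - q))
    return train_y[i]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]

errors = 0
for i in range(len(xs)):
    # hold out instance i, train on all the others
    train_x = xs[:i] + xs[i + 1:]
    train_y = ys[:i] + ys[i + 1:]
    if predict_1nn(train_x, train_y, xs[i]) != ys[i]:
        errors += 1
print(errors / len(xs))   # averaged error over all hold-outs
```

Because each model is trained on n−1 of n instances, the estimate reflects almost the full training-set size, but the n training sets overlap heavily, which is one source of the remaining bias.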
More Realistic Estimations III
● n-fold cross-validation, typically n=10
  - partition the data into n partitions
  - use n−1 partitions for training
  - use 1 partition for performance assessment
  - repeat with a different hold-out partition
  - average the performance
● every step where class information is considered has to be included in the loop! (i.e. done after partitioning)
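The steps above can be sketched as follows, again with a toy 1-nearest-neighbour classifier and made-up data (n=3 to keep the example small). The comment inside the loop marks where any class-aware preprocessing (supervised discretization, information-gain filtering, ...) must happen, on the training folds only:

```python
# n-fold cross-validation sketch (n=3 here for brevity).
import random

def predict_1nn(train_x, train_y, q):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - q))
    return train_y[i]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]

n = 3
idx = list(range(len(xs)))
random.Random(0).shuffle(idx)
folds = [idx[i::n] for i in range(n)]          # n disjoint partitions

accuracies = []
for test_fold in folds:
    train = [i for i in idx if i not in test_fold]
    # <- class-aware preprocessing of the TRAINING folds would go here,
    #    so no label information leaks from the held-out fold
    correct = sum(predict_1nn([xs[i] for i in train],
                              [ys[i] for i in train],
                              xs[i]) == ys[i]
                  for i in test_fold)
    accuracies.append(correct / len(test_fold))
print(sum(accuracies) / n)                     # averaged performance
```

Doing a supervised step once on the full data set before partitioning would let the test folds influence the model, which is exactly the leakage the slide's warning is about.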
Text Books
● Machine Learning, Tom M. Mitchell, 1997, ISBN 0-07-115467-1, The McGraw-Hill Companies Inc.
● Introduction to Data Mining for the Life Sciences, Rob Sullivan, 2012, ISBN 978-1-58829-942-0, Springer New York
● Data Mining, 3rd edition, Ian Witten, Eibe Frank, Mark Hall, 2011, ISBN 978-0-12-374856-0, Morgan Kaufmann