Machine Learning and Common Sense
Dr. Lothar Richter

TRANSCRIPT

Page 1: Machine Learning and Common Sense · Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007)

Machine Learning and Common Sense

Dr. Lothar Richter

Page 2:

Structure of the Talk

● general ideas of machine learning

● some vocabulary and basics

● machine learning and bioinformatics

● hidden assumptions

Page 3:

Ideas

● a machine learning device can generalize from real-world observations into a “formal” model

● each model reflects only a few aspects of reality

● no model can completely represent reality, e.g. a photograph of a dog remains a photograph and not a real dog

● the model should reflect a concept or commonalities, not individual characteristics

Page 4:

Inductive Bias

● every learning scheme discards some aspects of reality to construct a model

● which aspects are discarded may differ between learning schemes

● this may already happen at the level of feature extraction, i.e. when choosing the types and values that represent an observation

● this is not to be confused with predictive bias

Page 5:

Some vocabulary

● learning scheme: a specific learning algorithm producing a model, such as decision trees, rule-based systems, SVMs, Bayesian networks, etc.

● attribute/feature: a variable describing a specific aspect of real-world observations, like body weight, color, or whether a certain property is present (yes/no)

● instance: a single observation describing an observed event by assigning a value to each feature used to represent it

Page 6:

Some vocabulary II

● training: the phase of analyzing real-world observations in a formalized representation to derive the model's parameters and/or internal structure

● test phase: the phase of applying the model to instances not used for training, to determine the reliability of its statements (predictions)

● label: an attribute selected to be predicted

Page 7:

Types of Learning

● depending on the presence of a label, we distinguish between supervised and unsupervised learning

● unsupervised learning: concept learning, frequent item sets, clustering

● supervised learning: everything with labeled data that allows making predictions

Page 8:

Synonyms

● Data Mining and Machine Learning are widely used as synonyms

● if there is a difference at all, the two terms emphasize different aspects

● also commonly used: statistical learning, KDD (knowledge discovery in databases)

● sometimes a distinction is also drawn between descriptive and inductive data mining, or descriptive and predictive modelling

Page 9:

Categorical Types

● nominal: a set of distinct values an attribute can be assigned, like red, blue, man, woman, and so forth

● ordinal: like a nominal type plus an order relationship on the values, like newborn, infant, pupil, adult, old man

● the boolean type is a two-valued nominal type with true and false

● arithmetic operations are not defined on categorical types

Page 10:

Continuous Types

● integer: very similar to the ordinal type, but arithmetic makes sense, like the number of children

● interval-scaled: numerical values measured in equal intervals from an arbitrarily set origin, e.g. temperature in °C; it is ordered and arithmetic makes sense

● ratio-scaled: like interval-scaled, with the extension that a value of 0 indicates the absence of the property

Page 11:

Data Preprocessing

Before you can start to build a model, in most cases the data are subjected to various preprocessing steps:

● Feature Extraction

● Discretization

● Feature Selection

Page 12:

Feature Extraction/Construction

● conversion of observation records into a formalized, computer-readable representation

● definition of an attribute type

● assignment of appropriate attribute values

● this implies strong involvement of the analyst

● important: common sense and background knowledge from the expert domain

Page 13:

Discretization

● conversion of a numeric attribute into a categorical one

● the value range is split into a set of discrete intervals, and the interval label is used as the value

● this can help avoid overfitting

● unsupervised: domain-dependent, equal-width, or equal-frequency binning

● supervised: construct bins with respect to the class label
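The two unsupervised strategies can be sketched in a few lines of Python; function names and the toy temperature values are illustrative, not from the slides:

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width and
    return each value's bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # the maximum value is clamped into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so that each bin holds (roughly) the same
    number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / n_bins
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins

temperatures = [3.1, 18.0, 25.5, 7.2, 30.0, 12.9]
print(equal_width_bins(temperatures, 3))
print(equal_frequency_bins(temperatures, 3))
```

A supervised variant would instead place the bin boundaries so that each bin is as pure as possible with respect to the class label.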

Page 14:

Feature Selection

● remove values from instances, i.e. discard some features of a data set because they are:
- irrelevant
- redundant
- noisy/faulty

● possible benefits:
- improved efficiency and accuracy
- prevention of overfitting
- saved space

Page 15:

Feature Selection Strategies

● unsupervised: based on domain knowledge, random sampling

● supervised:
- measures that consider the class (filtering): Gini index, information gain, Relief, ...
- use a learning scheme's performance (wrapping):
● select the set of attributes that leads to the best performance
● forward selection: grow the set of attributes one by one
● backward elimination: shrink the set of attributes one by one
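As a sketch of the filtering approach, one of the measures named above, information gain, can be computed in pure Python (a minimal illustration; the toy feature and label values are made up for this example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    on the feature; higher means more informative."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

labels = ['+', '+', '-', '-']
informative = ['x', 'x', 'y', 'y']   # mirrors the label exactly
noise = ['p', 'q', 'p', 'q']         # independent of the label
print(information_gain(informative, labels))  # 1.0
print(information_gain(noise, labels))        # 0.0
```

Filtering would rank all features by such a score and keep only the top ones; wrapping instead retrains the learning scheme for each candidate feature set.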

Page 16:

Machine Learning and Bioinformatics

● today biology has to span two extremes:
- statements at the nucleotide level (one level below genes)
- statements at the individual/population level on the other hand

● the gain in speed in generating sequence data (nucleotide sequences) has clearly outpaced the speed of analysis and knowledge discovery

● current lab technology cannot even fill the gap between sequence and structure

Page 17:

Primary Data Growth: Nucleotides

taken from http://www.nlm.nih.gov/about/image/2014CJ_fig_5.png

Page 18:

Primary Data Growth: Nucleotides

taken from http://www.ebi.ac.uk/ena/about/statistics

Page 19:

Automatically Derived Proteins

taken from http://www.ebi.ac.uk/uniprot/TrEMBLstats

Page 20:

Proteins with Some Evidence

taken from http://web.expasy.org/docs/relnotes/relstat.html

Page 21:

All Known Structures

[Figure: “Yearly Growth of Total Structures in PDB”, yearly and total structure counts per year, 1972 to 2014, y-axis from 0 to 120,000]

Page 22:

Role of DM / ML

Data Mining helps to:

● structure and compress the data

● filter out mistakes and outliers caused by experimental errors and other noise

● reduce redundancy

● replace wet-lab analyses with predictions

● detect interesting relationships and models, and direct manpower towards the points where it is needed

Page 23:

Overview of the Steps in KDD

[Figure 1: An Overview of the Steps That Compose the KDD Process. Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]

taken from U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From Data Mining to Knowledge Discovery in Databases” (1996), AI Magazine, 17, 37-54

Page 24:

ML Tools Employed in Bioinformatics (taken from “The rise and fall of supervised machine learning techniques”)


BIOINFORMATICS EDITORIAL, Vol. 27 no. 24, 2011, pages 3331–3332, doi:10.1093/bioinformatics/btr585

Editorial

The rise and fall of supervised machine learning techniques
Lars Juhl Jensen1,* and Alex Bateman2

1Department of Disease Systems Biology, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen N, Denmark and 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK

Machine learning is of immense importance in bioinformatics and biomedical science more generally (Larrañaga et al., 2006; Tarca et al., 2007). In particular, supervised machine learning has been used to great effect in numerous bioinformatics prediction methods. Through many years of editing and reviewing manuscripts, we noticed that some supervised machine learning techniques seem to be gaining in popularity while others seemed, at least to our eyes, to be looking ‘unfashionable’.

We were motivated to create a league table of machine learning techniques to learn what is hot and what is not in the machine learning field. In this editorial, we only include those that we considered major league and leave analysis of the minor league methods as an exercise for the interested reader. To create our league table, we created a list of supervised machine learning techniques commonly used in bioinformatics and their common synonyms, plural forms and abbreviations. We then searched this list against the PubMed titles and abstracts to identify the number of papers published per year for each machine learning technique. To match as many papers as possible, searches were case insensitive and allowed for variation in hyphenation.

Fig. 1. The growth of supervised machine learning methods in PubMed.

∗To whom correspondence should be addressed

To our surprise, the artificial neural network (ANN) is not only the dominant league leader in 2011 but has been in this position since at least the 1970s (see Fig. 1). However, in recent years the usage of support vector machines (SVMs) grew tremendously, and we predict that SVMs will challenge ANNs for the dominant position in the coming decade. Since 2007 the number of publications using ANNs has decreased by 21%, which we hypothesize may be directly attributed to researchers increasingly using SVMs in place of ANNs. SVMs caught up with and overtook Markov models in 2004 to gain second spot in our machine learning league.

As for the question of ‘what is hot?’, one can see that random forests are a rapidly growing method with not a single mention of them before 2003 and now a total of 407 papers published to date.

We were hoping to find techniques that were not so hot and perhaps going out of fashion. The results show that none of the major league methods has gone out of fashion, but we do see moderate decreases in the use of both ANNs and Markov models in the literature.

We were also curious to find out if certain machine learning techniques were used in combination with each other. To investigate this, we looked at what machine learning methods are co-mentioned in articles (see Fig. 2). For all pairs of methods from the Supervised

Fig. 2. Heatmap showing the co-occurrence of machine learning techniques within articles.



L. J. Jensen & A. Bateman (2011). Bioinformatics, 27(24), 3331–3332.

Page 25:

Possible Explanations for the Prevalence of ANNs and SVMs

● they are capable of handling a huge number of attributes

● they are quite robust against uninformative features

● they implicitly adjust feature weights during the training phase

● they work sufficiently well

Page 26:

How is Machine Learning Influenced by Underlying Assumptions?

● there are a number of assumptions in the various processing steps

● the performance depends on these assumptions holding

● very often we cannot really check or prove whether they do

Page 27:

Background distribution

● we assume that the background distribution is uniform

● i.e. the underlying source emits instances with constant probabilities

● possible solutions:
- use many features to represent complex scenarios
- use stream mining algorithms, which update their parameters
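The second remedy can be illustrated with the simplest possible streaming update, an exponentially weighted running mean (a minimal sketch; the smoothing factor alpha and the toy data are arbitrary choices for this example):

```python
def ewma_update(mean, x, alpha=0.1):
    """Exponentially weighted moving average: recent instances weigh
    more, so the estimate can track a drifting source distribution."""
    return (1 - alpha) * mean + alpha * x

mean = 0.0
for x in [1.0] * 50:   # the source has drifted to emit values around 1.0
    mean = ewma_update(mean, x)
print(round(mean, 2))  # 0.99, the estimate has followed the drift
```

A batch-trained model would keep the stale estimate 0.0 forever; a streaming update lets the parameter follow the source.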

Page 28:

[Figure: scatter plot of instances illustrating the assumed background distribution]

Page 29:

Unfair Sampling

● due to “experimental” reasons, the sample represents only a special subset of the entities

● this is especially difficult for lazy learning methods like k-nearest-neighbors

● possible solutions:
- remove redundancy
- use stratification
- check the variance and identify difficult instances
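Stratification can be sketched in a few lines: draw from each class separately, so the sample keeps the original class shares (the function name and the 80/20 toy data are invented for illustration):

```python
import random
from collections import Counter

def stratified_sample(instances, labels, fraction, seed=0):
    """Draw a sample that preserves each class's share of the data,
    a simple remedy against unfair sampling."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    sample = []
    for y, xs in by_class.items():
        k = max(1, round(len(xs) * fraction))
        sample.extend((x, y) for x in rng.sample(xs, k))
    return sample

instances = list(range(100))
labels = ['a'] * 80 + ['b'] * 20     # an 80/20 class imbalance
sample = stratified_sample(instances, labels, 0.25)
print(Counter(y for _, y in sample))  # 20 of 'a', 5 of 'b': shares preserved
```

A plain random sample of 25 instances could easily contain 2 or 8 'b' instances; the stratified draw fixes the proportion by construction.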

Page 30:

[Figure: scatter plot of instances illustrating an unfairly sampled subset]

Page 31:

Suitable Representation

● we assume that the selected feature set can represent the concept to learn

● since we often do not know the causal relationships, we might mistake the employed features for the truly causal ones

● e.g. the number of storks and births in Germany

● e.g. opening umbrellas and rain

● remedy: use background knowledge and interpret results carefully

Page 32:

Representation of the Concept

Weather  Temperature  Wind  playTennis
rainy    high         no    yes
dry      medium       yes   yes
rainy    medium       yes   no
dry      medium       no    no

•  with these attributes it is really hard to learn favorable conditions for playing tennis

•  we unconsciously assume that these attributes are sufficient to describe the scenario

Page 33:

Representation of the Concept

Weather  Temperature  Wind  Buddy  playTennis
rainy    high         no    yes    yes
dry      medium       yes   yes    yes
rainy    medium       yes   no     no
dry      medium       no    no     no

•  actually there is no proof to show whether important attributes are missing or not

•  for this aspect, background knowledge and experience are most important
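The effect of the added Buddy column can be checked mechanically: on this table, only Buddy's value determines playTennis on its own (a small illustrative script, not part of the original slides):

```python
def determines(rows, attr, label):
    """True if every value of `attr` maps to exactly one label value,
    i.e. the attribute alone suffices to predict the label."""
    seen = {}
    for row in rows:
        value, y = row[attr], row[label]
        if seen.setdefault(value, y) != y:
            return False
    return True

# the extended play-tennis table from the slide
rows = [
    {'Weather': 'rainy', 'Temperature': 'high',   'Wind': 'no',  'Buddy': 'yes', 'playTennis': 'yes'},
    {'Weather': 'dry',   'Temperature': 'medium', 'Wind': 'yes', 'Buddy': 'yes', 'playTennis': 'yes'},
    {'Weather': 'rainy', 'Temperature': 'medium', 'Wind': 'yes', 'Buddy': 'no',  'playTennis': 'no'},
    {'Weather': 'dry',   'Temperature': 'medium', 'Wind': 'no',  'Buddy': 'no',  'playTennis': 'no'},
]
for attr in ('Weather', 'Temperature', 'Wind', 'Buddy'):
    print(attr, determines(rows, attr, 'playTennis'))  # only Buddy prints True
```

Whether such a decisive attribute even exists in the data is exactly what background knowledge has to answer; no check on the given table can prove an attribute is missing.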

Page 34:

Performance Estimation

● training a model means discarding some information from the observed instances and keeping the rest

● depending on the learning scheme, instance-specific information is stored too

● instance-specific information leads to overfitting

● overfitting: a prediction model is biased towards the training examples, i.e.:
- better performance on the training examples
- worse performance on new instances

Page 35:

More Realistic Estimations

● most optimistic estimation: resubstitution error (determine the performance on the very same set used for training)

● if you have a lot of data: determine the error on an independent test set
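The contrast between the two estimates can be demonstrated with a deliberately memorizing learner, here a toy 1-nearest-neighbour model on synthetic data (all names and data are invented for illustration):

```python
import random

def error_rate(predict, data):
    """Fraction of instances where the prediction misses the label."""
    return sum(predict(x) != y for x, y in data) / len(data)

def one_nn(train):
    """A toy 1-nearest-neighbour model that memorises its training set."""
    def predict(x):
        return min(train, key=lambda t: abs(t[0] - x))[1]
    return predict

rng = random.Random(0)
# two overlapping classes on the real line
data = [(rng.gauss(mu, 1.0), label)
        for mu, label in [(0.0, 'a'), (2.0, 'b')] for _ in range(100)]
rng.shuffle(data)
train, test = data[:150], data[150:]
predict = one_nn(train)
print('resubstitution error:', error_rate(predict, train))  # 0.0, deceptively optimistic
print('independent test error:', error_rate(predict, test))  # typically much larger
```

Each training point is its own nearest neighbour, so the resubstitution error is zero regardless of how well the model generalizes; only the held-out instances reveal the true error.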

Page 36:

More Realistic Estimations II

● LOOCV: leave-one-out cross-validation
- one example at a time is held out for testing, the remaining ones are used for training
- n iterations with n instances; the final result is the average
- still quite biased
- useful to check the influence of individual instances
- and when you have a small number of instances
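The LOOCV loop can be sketched directly, with a trivial majority-class learner standing in for any learning scheme (all names and the toy data are illustrative):

```python
from collections import Counter

def majority(train):
    """Trivial learning scheme: always predict the most common class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def loocv_error(fit, data):
    """Leave-one-out cross-validation: n iterations with n instances,
    each instance held out once; the final result is the average."""
    errors = 0
    for i in range(len(data)):
        x, y = data[i]
        predict = fit(data[:i] + data[i + 1:])  # train on all but one
        errors += predict(x) != y
    return errors / len(data)

data = [(x, 'a') for x in range(6)] + [(x, 'b') for x in range(4)]
print(loocv_error(data=data, fit=majority))  # 0.4
```

Every 'b' instance is misclassified (with it held out, 'a' is always the majority), which also hints at the bias of LOOCV on small, imbalanced data.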

Page 37:

More Realistic Estimations III

● n-fold cross-validation, typically n=10
- partition the data into n partitions
- use n−1 partitions for training
- use 1 partition for performance assessment
- repeat with a different hold-out partition
- average the performance

● every step where class information is considered has to be included in the loop! (done after partitioning)
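The steps above can be sketched as follows; note the comment marking where any class-aware preprocessing has to happen (function names and toy data are illustrative):

```python
import random
from collections import Counter

def cross_validation(fit, data, n=10, seed=0):
    """n-fold cross-validation. Every step that uses class information
    (supervised discretization, feature selection, ...) must happen
    inside this loop, on the training folds only."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]
    accuracies = []
    for i in range(n):
        test = folds[i]
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        predict = fit(train)  # class-aware preprocessing belongs inside fit()
        accuracies.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(accuracies) / n

def majority(train):
    """Trivial learning scheme: always predict the most common class."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

data = [(i, 'a') for i in range(70)] + [(i, 'b') for i in range(30)]
acc = cross_validation(majority, data)
print(round(acc, 2))  # 0.7, the share of the majority class
```

Running a class-aware step once on the full data set before partitioning would leak label information from the test folds into training and inflate the estimate.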

Page 38:

Text Books

● Machine Learning, Tom M. Mitchell, 1997, ISBN 0-07-115467-1, The McGraw-Hill Companies Inc.

● Introduction to Data Mining for the Life Sciences, Rob Sullivan, 2012, ISBN 978-1-58829-942-0, Springer New York

● Data Mining, 3rd edition, Ian Witten, Eibe Frank, Mark Hall, 2011, ISBN 978-0-12-374856-0, Morgan Kaufmann