pattern recognition in clinical data
INTRODUCTION SIGNIFICANT MUTATIONS VIRAL GENOME DETECTION REPRODUCIBILITY CONCLUSIONS
Pattern Recognition In Clinical Data
Saket Choudhary
Dual Degree Project
Guide: Prof. Santosh Noronha
C G C A T C G A G C T
C G C G T C G A G C T
October 30, 2013
INTRODUCTION
- Objective
SIGNIFICANT MUTATIONS
- Next Generation Sequencing
- Motivation
- Computational Methods for Driver Detection
VIRAL GENOME DETECTION
- Workflow
REPRODUCIBILITY
- Reproducibility
CONCLUSIONS
- Wrapping up
OBJECTIVE
Figure: project mind map. Node labels, in reading order:
- Next Generation Sequencing & Cancer Research
- Tools for Biologists
- Reproducible analysis
- Driver & Passenger Mutation Detection
- Galaxy Tools
- Viral Genome Integration
- Galaxy Workflow
- Better/New Algorithms
- Benchmarking
- Alignment tools
- BWA v/s BWA-PSSM
- Driver & Passenger Mutation Detection
- Ensembl Method
- Lit. Survey
NEXT GENERATION SEQUENCING
C G C G T C G A G C T A G C A
G C G T∗ C G A G∗ C T A G∗
A G C G C G T C G A G C T A G C A C A
NGS: WHY?
- Molecular approach: study of variations at the 'base' level
- Low cost: the $1000 genome
- Faster: quicker than traditional sequencing techniques
NGS: WHERE?
- Study variations, genotype-phenotype association
- Look for 'markers of diseases'
- Prognosis
NGS: MUTATIONS
- 3 × 10^9 base pairs
- We are all 99.9% similar at the DNA level
- More than 2 million SNPs
- No particular pattern of SNPs
- If a certain mutation causes a change in an amino acid, it is referred to as non-synonymous (nsSNV)
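The synonymous/non-synonymous distinction can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a variant that falls inside a single known codon on the coding strand; the function name `classify_snv` and the example codons are illustrative and not taken from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(codon: str, offset: int, alt: str) -> str:
    """Classify a single-base change inside one codon.

    codon  -- reference codon on the coding strand, e.g. "GAG"
    offset -- 0-based position of the changed base within the codon
    alt    -- the alternate base
    """
    mutated = codon[:offset] + alt + codon[offset + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


print(classify_snv("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu): amino acid unchanged
print(classify_snv("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val): amino acid changed
```

A real annotation pipeline would also need the transcript, strand and reading frame of each variant, which this sketch takes as given.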
DRIVERS AND PASSENGERS I
Cancer is known to arise due to mutations.
Not all mutations are equally important!
Somatic Mutations: the set of mutations acquired after zygote formation, over and above the germline mutations
Driver Mutations: mutations that confer growth advantages to the cell, being selected positively in the tumor tissue
DRIVERS AND PASSENGERS
Drivers are NOT simply loss of function mutations, but more than that:
- Loss of function: inactivate tumor suppressor proteins
- Gain of function: activate normal genes, transforming them into oncogenes
- Drug resistance mutations: mutations that have evolved to overcome the inhibitory effect of drugs
DRIVER MUTATIONS: WHY?
Identify driver mutations → better therapeutic targets
But how does one zero in on the exact set? → experiments are too costly, probably infeasible for 2 million+ SNPs → leverage computational analysis
- The low cost of NGS comes with a heavier roadblock of data analysis
- Searching among 2 million+ SNPs is a non-trivial and computationally intensive problem
- Software tools show low consensus among themselves ↔ defining a driver computationally is non-trivial
- However, there is no tool that allows one to visualise the results for an input across the cohort of tools
MACHINE LEARNING I
Two datasets:
- Training: labelled dataset, containing a table of features with mutations labelled as "drivers/passengers"
- Test: having 'learned' from the training dataset, test the prediction model
Table: Training Dataset
| Chromosome | Position | Ref | Alt | Type |
|---|---|---|---|---|
| 1 | 27822 | A | G | Driver |
| 1 | 27832 | T | G | Driver |
| 2 | 47842 | G | C | Passenger |
| ... | ... | ... | ... | ... |
MACHINE LEARNING II
Table: Test Dataset
| Chromosome | Position | Ref | Alt | Type |
|---|---|---|---|---|
| 1 | 27824 | A | G | ? |
| 1 | 47832 | T | G | ? |
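The train/test split above can be illustrated with a deliberately naive baseline. This sketch assumes that the only feature is the (ref, alt) substitution and predicts the label seen most often for it in training; it is a toy stand-in to show the workflow, not CHASM or any method from the slides, and the names `train` and `predict` are illustrative.

```python
from collections import Counter, defaultdict

# Rows copied from the training table: (chromosome, position, ref, alt, label).
TRAINING = [
    ("1", 27822, "A", "G", "Driver"),
    ("1", 27832, "T", "G", "Driver"),
    ("2", 47842, "G", "C", "Passenger"),
]
# Rows from the test table; the label is what we want to predict.
TEST = [
    ("1", 27824, "A", "G"),
    ("1", 47832, "T", "G"),
]


def train(rows):
    """Count labels per (ref, alt) substitution, plus an overall fallback."""
    per_substitution = defaultdict(Counter)
    overall = Counter()
    for _chrom, _pos, ref, alt, label in rows:
        per_substitution[(ref, alt)][label] += 1
        overall[label] += 1
    return per_substitution, overall


def predict(model, row):
    """Predict the most frequent training label for this substitution.

    Substitutions never seen in training fall back to the overall majority.
    """
    per_substitution, overall = model
    _chrom, _pos, ref, alt = row
    counts = per_substitution.get((ref, alt), overall)
    return counts.most_common(1)[0][0]


model = train(TRAINING)
predictions = [predict(model, row) for row in TEST]
print(predictions)
```

With three training rows the predictions carry no biological meaning; the point is only the shape of the process: fit on labelled rows, then fill in the "?" column.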
MACHINE LEARNING: FEATURE SELECTION I
Machine learning relies on a set of features for training.
Redundant features should be avoided.
CHASM [1] makes use of mutual information to select features.
p(X_i) represents the probability of occurrence of an event X_i.
Considering a series of events X_1, X_2, X_3, ..., X_n, analogous to a 'series of packets' in communication theory, the information received at each step can be quantified on a log scale by:
$$\log_2 \frac{1}{p(X_i)} = -\log_2 p(X_i) \tag{1}$$
The expected value of information from a series of events is called the Shannon entropy, H(X):
$$H(X) = -\sum_i p(X_i) \log_2 p(X_i) \tag{2}$$
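Equations (1) and (2) translate directly into code. The sketch below assumes the distribution is given as a list of probabilities summing to 1; the function names are illustrative.

```python
import math


def self_information(p: float) -> float:
    """Information (in bits) received from an event of probability p, eq. (1)."""
    return -math.log2(p)


def shannon_entropy(probabilities) -> float:
    """Expected information (in bits) over a distribution, eq. (2).

    Zero-probability events are skipped, following the convention 0*log(0) = 0.
    """
    return sum(p * self_information(p) for p in probabilities if p > 0)


print(shannon_entropy([0.5, 0.5]))  # fair coin: maximally uncertain
print(shannon_entropy([0.9, 0.1]))  # biased coin: more predictable, lower entropy
```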
MACHINE LEARNING: FEATURE SELECTION II
Mutual information between two random variables X, Y is defined as the amount of information gained about random variable X due to additional information gained from the second, Y:
$$I(X,Y) = H(X) - H(X|Y) \tag{3}$$
Here:
X: Class label [Driver/Passenger]
Y: Predictive feature
and hence I(X,Y) represents how much information was gained about the class label X from knowledge of a feature Y.
Simplifying:
$$I(X,Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \tag{4}$$
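A small sketch of equation (4), estimating the probabilities from counts of paired observations. The labels and the two feature vectors are hypothetical, chosen so that one feature determines the label exactly and the other is independent of it.

```python
import math
from collections import Counter


def mutual_information(labels, feature_values) -> float:
    """Estimate I(X, Y) in bits from paired observations using eq. (4)."""
    n = len(labels)
    joint = Counter(zip(labels, feature_values))
    count_x = Counter(labels)
    count_y = Counter(feature_values)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        total += p_xy * math.log2(p_xy / (p_x * p_y))
    return total


labels = ["Driver", "Driver", "Passenger", "Passenger"]
informative = [1, 1, 0, 0]    # hypothetical feature that tracks the label exactly
uninformative = [1, 0, 1, 0]  # hypothetical feature independent of the label

print(mutual_information(labels, informative))
print(mutual_information(labels, uninformative))
```

Ranking features by this quantity and discarding those near zero is one way to avoid uninformative features; redundancy between features would need a separate check.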
FUNCTIONAL IMPACT I
- If a certain mutation confers an advantage to the cell in terms of replication rate, it is probably going to be selected, while mutations that reduce its fitness have a higher chance of being eliminated from the population.
- Certain residues in a multiple sequence alignment (MSA) of homologous sequences are more conserved than others. A highly conserved residue, if mutated, is likely to be costly, since what had 'evolved' is disturbed!
- Scores can be assigned based on this "conservation" parameter.
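One very simple way to turn "conservation" into a score is the fraction of aligned sequences that share the most common residue in a column. The sketch below assumes a gap-free toy alignment invented for illustration; it is not the scoring used by SIFT or any other tool named in these slides.

```python
from collections import Counter

# Hypothetical toy alignment: four homologous sequences of equal length.
MSA = [
    "MKVLA",
    "MKILA",
    "MRVLG",
    "MKALS",
]


def column_conservation(msa):
    """Fraction of sequences carrying the most common residue, per column."""
    scores = []
    for column in zip(*msa):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


scores = column_conservation(MSA)
print(scores)
```

Under this scheme a mutation at a column scoring 1.0 would be flagged as more likely to be damaging than one at a column scoring 0.5.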
FUNCTIONAL IMPACT II
Figure: SIFT algorithm
Some of the common tools/algorithms used for driver mutation prediction:
- SIFT
- PolyPhen
- Mutation Assessor
- TransFIC
- Condel
FRAMEWORK FOR COMPARING VARIOUS TOOLS I
- Different tools use different formats and give different outputs for similar input
- Running analysis on multiple tools → keep shifting data formats
- Concordance?
Polyphen2 Input
chr1:888659 T/C
chr1:1120431 G/A
chr1:1387764 G/A
chr1:1421991 G/A
chr1:1599812 C/T
chr1:1888193 C/A
chr1:1900186 T/C
FRAMEWORK FOR COMPARING VARIOUS TOOLS II
SIFT Input
1,888659,T,C
1,1120431,G,A
1,1387764,G,A
1,1421991,G,A
1,1599812,C,T
1,1888193,C,A
1,1900186,T,C
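The "keep shifting data formats" step can be automated. This sketch converts between the two line layouts shown on these slides and nothing more; both tools may accept other layouts that are not handled here, and the function names are illustrative.

```python
import re

# Matches the PolyPhen-2 style line shown on the slide, e.g. "chr1:888659 T/C".
POLYPHEN_LINE = re.compile(r"^chr(\w+):(\d+)\s+([ACGT])/([ACGT])$")


def polyphen_to_sift(line: str) -> str:
    """Convert 'chr1:888659 T/C' into '1,888659,T,C'."""
    match = POLYPHEN_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised variant line: {line!r}")
    chrom, pos, ref, alt = match.groups()
    return f"{chrom},{pos},{ref},{alt}"


def sift_to_polyphen(line: str) -> str:
    """Convert '1,888659,T,C' into 'chr1:888659 T/C'."""
    chrom, pos, ref, alt = line.strip().split(",")
    return f"chr{chrom}:{pos} {ref}/{alt}"


print(polyphen_to_sift("chr1:888659 T/C"))
print(sift_to_polyphen("1,1120431,G,A"))
```

Wrapping converters like these as tools is what lets one input file feed every predictor without manual editing.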
DRIVER MUTATIONS: TOOLS DON’T AGREE
Figure: X axis: Condel score; Y axis: MA score
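Disagreement between tools can be quantified as pairwise concordance: the fraction of variants on which two tools make the same call. The calls below are hypothetical values made up for illustration, not results from this project.

```python
from itertools import combinations

# Hypothetical calls (True = predicted damaging) for the same five variants.
CALLS = {
    "SIFT":             [True, True, False, False, True],
    "PolyPhen":         [True, False, False, True, True],
    "MutationAssessor": [False, True, False, False, True],
}


def concordance(a, b) -> float:
    """Fraction of variants on which two tools make the same call."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


pairwise = {
    (t1, t2): concordance(CALLS[t1], CALLS[t2])
    for t1, t2 in combinations(CALLS, 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```

Tools that output continuous scores would first need a threshold per tool, which is itself a parameter worth recording.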
Solution?: Galaxy, an open source web-based platform for bioinformatics, makes it possible to represent the entire data analysis pipeline in an intuitive graphical interface
Figure: Galaxy workflow for the polyphen2 algorithm
Run all tools in one go:
Figure: Run all tools
Compare all tools:
Figure: Compare all tools
VIRAL GENOME DETECTION
Cervical cancers have been proven to be associated with Human Papillomavirus (HPV).
Cervical cancer datasets from Indian women were put through an analysis to detect:
1. Any possible HPV integration
2. Sites of HPV integration
Who Cares?
- Replacing whole genome sequencing by targeted sequencing at the sites where these viruses have been detected in a cohort of samples, thus speeding up the whole process.
1. Raw data
2. Align data to human genome reference
3. Extract unmapped regions
4. Align unmapped regions to virus genome
5. Extract mapped regions
6. BLAST
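The filtering logic of this workflow can be sketched on toy data. In the sketch below exact substring search stands in for a real aligner, the sequences are invented, and the final BLAST step is omitted; it only shows how reads that fail to map to the human reference are passed on to the viral reference.

```python
def map_read(read, reference):
    """Return the 0-based position of an exact match, or None if unmapped."""
    position = reference.find(read)
    return position if position >= 0 else None


def detect_viral_reads(reads, human_ref, virus_ref):
    """Keep reads that do not map to human_ref but do map to virus_ref."""
    unmapped = [r for r in reads if map_read(r, human_ref) is None]
    hits = {}
    for read in unmapped:
        position = map_read(read, virus_ref)
        if position is not None:
            hits[read] = position
    return hits


# Invented sequences, for illustration only.
HUMAN_REF = "ACGTACGTTAGCCGATAGCTAGGCTA"
VIRUS_REF = "TTTTGGGGCCCCAAAATTGGCCAA"
READS = ["CGTTAGCC", "GGGGCCCC", "ACACACAC"]

hits = detect_viral_reads(READS, HUMAN_REF, VIRUS_REF)
print(hits)
```

Real reads contain sequencing errors and variants, so an aligner that tolerates mismatches is needed in practice; reads that span a human/virus junction are the ones that would reveal integration sites.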
GALAXY WORKFLOW
Figure: Aligned HPV genomes
REPRODUCIBILITY
- In pursuit of novel 'discovery', standardizing the data analysis pipeline is often ignored, leading to dubious conclusions
- Analysis should be reproducible and, above all, correct
- Parameter values can change the results by a large factor; they need to be documented/logged
- Garbage in, garbage out
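Documenting parameters can be as simple as writing one structured record per analysis step. The sketch below is one possible shape for such a record, assuming JSON output and a SHA-256 checksum of the input; the tool name, version and parameters are hypothetical placeholders, and this is not how Galaxy itself stores its histories.

```python
import hashlib
import json


def run_record(tool: str, version: str, params: dict, input_bytes: bytes) -> str:
    """Serialise what is needed to repeat one analysis step as JSON."""
    record = {
        "tool": tool,
        "version": version,
        "params": params,
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }
    # sort_keys makes the output byte-identical for identical settings.
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical tool name, version and parameters, for illustration only.
print(run_record("aligner", "0.0.0", {"seed_length": 19, "threads": 4}, b"ACGT\n"))
```

Two runs whose records differ, even in a single parameter, should not be expected to give identical results.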
CONCLUSIONS
With the Galaxy toolbox for identification of significant mutations in place, and the science behind the methods studied, the next steps would be to:
- Open source the toolbox to the community: a tool makes little sense if it is not in a usable form; community feedback will be used to add more tools and improve the existing ones
- Develop a new method for driver mutation prediction: all the methods have a low level of concordance. A new method that takes into account the available data at all levels (mutations, transcriptome and microarray data) is possible. With the Galaxy toolbox in place, it would be possible to integrate information at various levels
FUTURE WORK
- Develop an algorithm that integrates the machine learning approach with the functional approach by zeroing in on only those attributes that are known to have an impact
- The algorithm would also account for information at other levels: RNA expression, clinical data
- Integrating information at all levels would provide a deeper insight
- The developed Galaxy toolbox will be used as the basic framework for integrating information
REFERENCES I
[1] Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E Velculescu, Kenneth W Kinzler, Bert Vogelstein, and Rachel Karchin. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Research, 69(16):6660–6667, 2009.