Building a clinical genome interpretation services company
Post on 11-May-2015
1/28Reece Hart — Locus Development
Reece Hart, Ph.D.
reece@locusdev.net
Locus Development Inc.
http://locusdevelopmentinc.com/
Building a clinical genome interpretation services company
Opportunity
Clinical Genome Interpretation
Photos: Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com
Patient presents with symptoms
If genomic interpretation might influence diagnosis or treatment, doctor refers patient to genetic counselor
GC takes history; sample is sent to an internal lab or to one of hundreds of labs that provide specific genomic tests
Sequencing and other lab data are processed into a preliminary interpretation
Report is returned to GC and/or physician, who verify the interpretation and consult with the patient
100s of laboratory diagnostic testing labs
Common variants are hard to interpret
Some variants are informative
The Significance of “Variants of Uncertain Significance”
“VUS – Variant of uncertain significance. A variation in a genetic sequence whose association with disease risk is unknown. Also called variant of uncertain significance, variant of unknown significance, and unclassified variant.”
http://www.cancer.gov/cancertopics/genetics-terms-alphalist
The long tail of rare diseases.
“A rare disease typically affects a patient population estimated at fewer than 200,000 in the U.S. There are more than 6,000 rare diseases known today and they affect an estimated 25 million persons in the U.S.”
NIH Office of Rare Diseases Research
http://rarediseases.info.nih.gov/
The Problems to Solve
➢ Develop a reliable database of genotypes and phenotypes.
➢ Develop methods to interpret all types of variants, not just common SNVs.
➢ Provide meaningful, reliable interpretations based on genomic data.
➢ Do it better than everyone else.
Plan
Company Overview
[Diagram: Genomic Sequence and Variants → Locus → Clinical Interpretation]
Curating Genotypes, Phenotypes, and Risk
[Diagram labels:]
Genotypes/Variants
Phenotypes/Conditions
Genotype-Phenotype Database
Risk Models
dbSNP, LSDBs, PharmGKB, …
GO, OMIM, ICD-9/10, …
workflow and tracking
Locus Overview
[Diagram: hospitals/clinics, physicians, insurers; sequences → variants/attributes → condition predictions → interpretation]
Implementation
Curation Content
➢ Many sources
  ● automated and manual tools
  ● databases and literature
➢ Most kinds of variants
  ● SNV, del, ins, delins, repeat, conv, CNV, haplotypes
➢ Many kinds of conditions
  ● inherited, spontaneous, dominant, recessive, x-linked, preventative, cancer, metabolic, pharmacogenomic, cardio
➢ Examples:
  ● Cystic Fibrosis (w/modifiers)
  ● CMT (~21 subclasses)
  ● Long and Short QT
  ● TPMT, warfarin, CYP2D6
The pipeline
[Diagram: reads (fastq) → variant calling → calls (vcf) → selection → attributes (xml) → interpretation → report (xml). Other labels: execution framework; req'n and sample info; cond'n var.; risk models; curation.]
@G88NFDU01AI6Z3 rank=0000170 x=101.0 y=1953.5 length=56
AGTGTAGTAGTGAGAAAAACTTTGTGGGGATATGGATACAATTATTTACCCAAATC
+
IIIIIIIIIIIGC>////-....826666<EIIIIIIIIIIIHI6644/..222==
@G88NFDU01AKOQI rank=0000178 x=118.0 y=1960.0 length=59
agtgtagtagtaaggaagattgagtgcctgaccttCCGGGTGGCGGTAGCGTTGGCCCC
+
BHBEEIIIIIEEEEEBGBECCCDEIIIIIIIIIIIEEICC===988ED>?>>>88...-
@G88NFDU01AL6H7 rank=0000323 x=135.0 y=2013.5 length=95
agtgtagtagtgtgagctggtgaagaaggtctccGATGTCATATGGAACAGCCTCAGCCGCTCCTACTTCAAGGATCGGGCCCACATCCAGTCCC
+
=>BBBBB==;;B>454@EA@>>===>>BBIIE@ACIGIEIFFDD66665@@:::>AA777A<;;>A>?>>4433;>>;660000.9=85533,,,
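The reads above follow the four-line FASTQ layout (header, sequence, `+`, quality). As a minimal sketch of how such records could be read, the Python below parses them and converts quality characters to scores. The names `FastqRecord`, `parse_fastq`, and `phred33` are hypothetical, the example read is made up, and the Phred+33 quality encoding is an assumption; the slides do not say which encoding these reads use.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (header, sequence, '+', quality)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record {record_number}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in record {record_number}")
        # The read name is the first whitespace-delimited token after '@'.
        yield FastqRecord(header[1:].split()[0], sequence, quality)


def phred33(quality: str) -> List[int]:
    """Convert a quality string to integer scores, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


# Made-up read for illustration only (not taken from the slides).
EXAMPLE = """@read1 length=8
ACGTACGT
+
IIII####
"""

if __name__ == "__main__":
    for record in parse_fastq(EXAMPLE):
        print(record.name, len(record.sequence), phred33(record.quality))
```

The length check is the useful part for a pipeline like this one: a record whose sequence and quality strings differ in length usually means the file was truncated or mis-split.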
##fileformat=VCFv4.1
…
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic 
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in 
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##VariantFiltration="analysis_type=VariantFiltration input_file=[] sample_metadata=[] read_buffer_size=null phone_home=NO_ET read_filter=[] intervals=null excl
##contig=<ID=GL000240.1,length=41933,assembly=b37>
##reference=file:///locus/data/references/genomes/human_g1k_v37/sequences/human_g1k_v37.fasta
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LS99
1 145414740 . A N . PASS AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:137:99:0,412,5401:137,0
1 145414741 . G N . PASS AC=0;AF=0.00;AN=2;DP=138;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:138:99:0,415,5406:138,0
1 145414742 . C N . PASS AC=0;AF=0.00;AN=2;DP=139;MQ=212.39;MQ0=0 GT:DP:GQ:PL:AD 0/0:139:99:0,418,5289:139,0
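To make the VCF layout above concrete, here is a minimal sketch that splits one data line into its fixed columns, INFO pairs, and per-sample values. The function name `parse_vcf_line` is hypothetical and is not part of the Locus pipeline. VCF columns are tab-separated in real files; the sketch splits on any whitespace because the slide text lost its tabs, which is an assumption that holds only when no field contains a space.

```python
from typing import Dict, List, Union


def parse_vcf_line(line: str, sample_names: List[str]) -> dict:
    """Split one VCF data line into fixed fields, INFO pairs, and per-sample values."""
    fields = line.split()  # assumption: whitespace-separated, as on the slide
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_map: Dict[str, Union[str, bool]] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_map[key] = value if sep else True

    # FORMAT names the ':'-separated keys used by each sample column.
    samples: Dict[str, Dict[str, str]] = {}
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(keys, column.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": info_map,
        "samples": samples,
    }


# First data line from the slide; the sample column is named LS99 in the header.
LINE = ("1 145414740 . A N . PASS AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0 "
        "GT:DP:GQ:PL:AD 0/0:137:99:0,412,5401:137,0")

if __name__ == "__main__":
    record = parse_vcf_line(LINE, ["LS99"])
    print(record["chrom"], record["pos"], record["samples"]["LS99"]["GT"])
```

Values are left as strings on purpose; a production parser would use the `Type` and `Number` declared in the `##INFO` and `##FORMAT` header lines to convert them.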
<?xml version="1.0" encoding="UTF-8"?>
<sample-attributes>
  <sample-info id="LS125" gender="Unknown" … >
  <reference>
    <organism>homo sapiens</organism>
    <build-id>human_g1k_v37</build-id>
  </reference>
  <loci>
    <locus chr="3" start="10183685" end="10183685" sequence="G" read-coverage="0">
      <alt read-coverage="" quality-score="" sequence="" locus-cvid-code="CVID1003741" locus-cvid="A|G" locus-cvid-start="10183685"/>
    </locus>
<locus-report format="1.0">
  <requisition>
    <!-- I'm not giving this section too much thought. Good enough for now Can update later, when we commercialize -->
    <client id="uuid"></client>
    <patient name="LS99" ethnicity="" gender="Male" dob="" id="LS99"></patient>
    …
    <conditions>
      <condition code="VonHL">
        <associated-conditions></associated-conditions>
      </condition>
    </conditions>
  </requisition>
  <coverage>
    <sequence minimum-depth="100" sensitivity="98.9" specificity="99.0">
      <region genome-build="GRCh37" chrom="3" end="440" start="400"></region>
      <region genome-build="GRCh37" chrom="3" end="700" start="600"></region>
variants_and_refagree.vcf filtered_on_callable.vcf: lake.mk reads.fastq
	(set -e; \
	source /locus/opt/lake/bin/lakeSetupEnv; \
	$(MAKE) -f $< $@; \
	) 2>$@.err

calls.vcf: variants_and_refagree.vcf #filtered_on_callable.vcf
	ln -s $< $@

attr.xml: calls.vcf req.xml sample.xml ${ATTR_FILE}
	generate_attributes_file.py …

report.xml: attr.xml req.xml
	sampleconditionreport $^ -o $@

report.html: report.xml
	reportrenderer $< -o $@
LIMS
<?xml version="1.0"?>
<requisition>
  <conditions>
    <condition>VonHL</condition>
  </conditions>
</requisition>
<?xml version="1.0" encoding="UTF-8"?>
<samples>
  <sample-info id="LS99" gender="Male" birth-date="/Date(1320120000000-0700)/" type="GenomicDNA" status="New" ordering-clinician="JMajor" nanodrop-concentration="300" original-barcode="NA06994" use-type="RD" origin="Coriell" code="NA06994" concentration="300" description="" accession-date="" accession-user=""/>
</samples>
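The two LIMS documents above are small enough to read with the Python standard library. The sketch below pulls the requested condition codes from a requisition and the attributes of each sample from a samples file. The function names `requested_conditions` and `sample_attributes` are hypothetical, and the sample XML is abbreviated to a few of the attributes shown on the slide; how the pipeline's own tools read these files is not stated.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

REQUISITION_XML = b"""<?xml version="1.0"?>
<requisition>
  <conditions>
    <condition>VonHL</condition>
  </conditions>
</requisition>"""

# Abbreviated: only a few of the attributes from the slide are kept here.
SAMPLES_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<samples>
  <sample-info id="LS99" gender="Male" type="GenomicDNA" status="New"/>
</samples>"""


def requested_conditions(xml_bytes: bytes) -> List[str]:
    """Return the condition codes listed in a requisition document."""
    root = ET.fromstring(xml_bytes)
    return [el.text.strip() for el in root.iter("condition") if el.text]


def sample_attributes(xml_bytes: bytes) -> Dict[str, Dict[str, str]]:
    """Map each sample id to the attributes of its <sample-info> element."""
    root = ET.fromstring(xml_bytes)
    return {el.attrib["id"]: dict(el.attrib) for el in root.iter("sample-info")}


if __name__ == "__main__":
    print(requested_conditions(REQUISITION_XML))
    print(sample_attributes(SAMPLES_XML)["LS99"]["gender"])
```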
The pipeline in action
$ ls reads.fastq.gz req.xml sample.* Makefile
Makefile reads.fastq.gz req.xml sample.info sample.xml
$ time make report.html report.pdf
gzip -cdq <reads.fastq.gz >reads.fastq
lake --recipe reads_to_variants >lake.mk
ln -s variants_and_refagree.vcf calls.vcf
generate_attributes_file.py ...
sampleconditionreport attr.xml req.xml -o report.xml
reportrenderer report.xml -o report.html
wkhtmltopdf report.html report.pdf
real 7m14.804s
user 7m16.490s
sys 2m0.150s
Locus Interpretation
The big lesson…
Transcripts are much messier than expected.
Problem statement
There is no single source of transcripts that is stable (archived), mapped, in agreement with the reference genome, and identified by RefSeq accessions.
➢ Issues:
  ● Poor access / programmability
  ● No archived mappings
  ● RefSeq != reference genome due to origin, ambiguity, error
  ● Patches are difficult to use
When RefSeq != Genome Reference
[Diagram labels:]
NM_0123.4:c.45C>T
NC_000006.11:g.31030103C>T
variant published relative to RefSeq
NM_0123.4:c.832T>G
NC_000006.11:g.31038124T>G
discovered variant reported relative to RefSeq
mismatch; ins/del (downstream coordinates shifted)
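The coordinate shift named on this slide can be illustrated with a toy alignment. The sketch below maps 1-based transcript positions to genomic positions through a list of ungapped alignment blocks and compares that with a naive offset calculation. Everything in it is hypothetical: the `Block` type, the function names, and the coordinates are invented for illustration, and the model ignores introns, strand, and the CDS start offset that real c. coordinates depend on.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    tx_start: int  # 1-based transcript position of the first aligned base
    g_start: int   # 1-based genomic position of the first aligned base
    length: int    # number of bases aligned without gaps


def tx_to_genome(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a transcript position to the genome using gap-aware alignment blocks."""
    for block in blocks:
        if block.tx_start <= pos < block.tx_start + block.length:
            return block.g_start + (pos - block.tx_start)
    return None  # the base has no genomic counterpart (transcript-only insertion)


def naive_tx_to_genome(pos: int, g_start: int) -> int:
    """Map by a fixed offset, as if transcript and genome matched base for base."""
    return g_start + pos - 1


# Hypothetical alignment: the genome carries one extra base after transcript
# position 50, so the second block starts at 1051 instead of 1050.
BLOCKS = [Block(tx_start=1, g_start=1000, length=50),
          Block(tx_start=51, g_start=1051, length=100)]

if __name__ == "__main__":
    for pos in (45, 60):
        print(pos, naive_tx_to_genome(pos, 1000), tx_to_genome(pos, BLOCKS))
```

Upstream of the inserted base the two mappings agree; downstream they differ by the size of the insertion, which is why a variant reported against the RefSeq can land on the wrong genomic base when the alignment is ignored.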
17.8% of RefSeq transcripts differ from GRCh37
Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011). MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658
5.4% have coordinate-changing differences
Sources of transcript information
➢ NCBI:
  ● maps current transcripts to current genome only
  ● maps with splign
  ● doesn't agree with ref genome for ~18% of transcripts
  ● no local database option
➢ UCSC:
  ● current transcripts only
  ● maps using blat
➢ Ensembl:
  ● aligns using in-house gene building process
  ● cross-linked to RefSeqs
  ● incorporates NCBI transcripts ad hoc
  ● well-maintained; good API; broad data; VEP
PTEN: insertion/deletion in 5' UTR
NEFL: genome insertion leads to frameshift/stop
RefSeq Handling
---------- Forwarded message ----------
Date: Wed, Jan 25, 2012 at 1:59 PM
Subject: [Genome] How does UCSC hg19 gene model add exons to RefSeqs?
To: genome@soe.ucsc.edu
Hi, when using the human reference hg19 gene model… where the hg19 model has an exon that does not exist in the RefSeq accession (or any historical version of the RefSeq accession).
How/why does the alignment introduce an intron in this case? Does it ensure there are plausible flanking splice junctions before inserting an intron to a RefSeq sequence that lacks it but it maps to?
338 genes so far
➢ We should encourage LRG and adopt it when ready (and we'll still have to deal with legacy transcripts)
Not pictured: Jon Sorenson