Building a clinical genome interpretation services company

Post on 11-May-2015


DESCRIPTION

Talk given to the Berkeley sequencing supergroup in January 2012.

TRANSCRIPT

1/28Reece Hart — Locus Development

Reece Hart, Ph.D.
reece@locusdev.net

Locus Development Inc.
http://locusdevelopmentinc.com/

Building a clinical genome interpretation services company

2/28

Opportunity

3/28

Clinical Genome Interpretation

Photos: Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com

Patient presents with symptoms

If genomic interpretation might influence diagnosis or treatment, doctor refers patient to genetic counselor

GC takes history; sample is sent to an internal lab or to one of hundreds of labs that provide specific genomic tests

Sequencing and other lab data are processed into a preliminary interpretation

Report is returned to GC and/or physician, who verify the interpretation and consult with the patient

4/28

100s of laboratory diagnostic testing labs

5/28

Common variants are hard to interpret

6/28

Some variants are informative

7/28

The Significance of “Variants of Uncertain Significance”

“VUS – Variant of uncertain significance. A variation in a genetic sequence whose association with disease risk is unknown. Also called variant of uncertain significance, variant of unknown significance, and unclassified variant.”
http://www.cancer.gov/cancertopics/genetics-terms-alphalist

8/28

The long tail of rare diseases.

“A rare disease typically affects a patient population estimated at fewer than 200,000 in the U.S. There are more than 6,000 rare diseases known today and they affect an estimated 25 million persons in the U.S.”

NIH Office of Rare Diseases Research
http://rarediseases.info.nih.gov/

9/28

The Problems to Solve

➢ Develop a reliable database of genotypes and phenotypes.

➢ Develop methods to interpret all types of variants, not just common SNVs.

➢ Provide meaningful, reliable interpretations based on genomic data.

➢ Do it better than everyone else.

10/28

Plan

11/28

Company Overview

Genomic Sequence and Variants

Clinical Interpretation

Locus

12/28

Curating Genotypes, Phenotypes, and Risk

Genotypes/Variants

Phenotypes/Conditions

Genotype-Phenotype Database

Risk Models

dbSNP, LSDBs, PharmGKB, …

GO, OMIM, ICD-9/10, …

workflow and tracking

Locus Overview

hospitals/clinics, physicians, insurers

Diagram labels: sequences, variants/attributes, condition predictions, interpretation

14/28

Implementation

15/28

Curation Content

➢ Many sources
● automated and manual tools
● databases and literature

➢ Most kinds of variants
● SNV, del, ins, delins, repeat, conv, CNV, haplotypes

➢ Many kinds of conditions
● inherited, spontaneous, dominant, recessive, x-linked, preventative, cancer, metabolic, pharmacogenomic, cardio

➢ Examples:
● Cystic Fibrosis (w/modifiers)
● CMT (~21 subclasses)
● Long and Short QT
● TPMT, warfarin, CYP2D6

The pipeline

Diagram labels: execution framework; req'n and sample info; reads (fastq); variant calling; calls (vcf); selection; attributes (xml); interpretation; report (xml); curation; cond'n var.; risk models

@G88NFDU01AI6Z3 rank=0000170 x=101.0 y=1953.5 length=56
AGTGTAGTAGTGAGAAAAACTTTGTGGGGATATGGATACAATTATTTACCCAAATC
+
IIIIIIIIIIIGC>////-....826666<EIIIIIIIIIIIHI6644/..222==
@G88NFDU01AKOQI rank=0000178 x=118.0 y=1960.0 length=59
agtgtagtagtaaggaagattgagtgcctgaccttCCGGGTGGCGGTAGCGTTGGCCCC
+
BHBEEIIIIIEEEEEBGBECCCDEIIIIIIIIIIIEEICC===988ED>?>>>88...-
@G88NFDU01AL6H7 rank=0000323 x=135.0 y=2013.5 length=95
agtgtagtagtgtgagctggtgaagaaggtctccGATGTCATATGGAACAGCCTCAGCCGCTCCTACTTCAAGGATCGGGCCCACATCCAGTCCC
+
=>BBBBB==;;B>454@EA@>>===>>BBIIE@ACIGIEIFFDD66665@@:::>AA777A<;;>A>?>>4433;>>;660000.9=85533,,,
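For readers unfamiliar with the reads format, the following is a minimal sketch of reading four-line FASTQ records such as the ones on this slide. It is not the pipeline's own reader: the function names `read_fastq` and `phred33` are illustrative, and the sketch assumes unwrapped four-line records with Phred+33 quality encoding.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records.

    Assumption: sequence and quality are not wrapped across lines.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header)
        # The read name is the first whitespace-delimited token after '@'.
        yield header[1:].split()[0], sequence, quality


def phred33(quality):
    """Decode a quality string to integer scores, assuming Phred+33."""
    return [ord(char) - 33 for char in quality]


# First record from the slide.
EXAMPLE = (
    "@G88NFDU01AI6Z3 rank=0000170 x=101.0 y=1953.5 length=56\n"
    "AGTGTAGTAGTGAGAAAAACTTTGTGGGGATATGGATACAATTATTTACCCAAATC\n"
    "+\n"
    "IIIIIIIIIIIGC>////-....826666<EIIIIIIIIIIIHI6644/..222==\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
name, sequence, quality = records[0]
scores = phred33(quality)
print(name, len(sequence), min(scores), max(scores))
```

The length check catches truncated records early, which matters because a quality string may itself begin with `@`.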

##fileformat=VCFv4.1
…
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##VariantFiltration="analysis_type=VariantFiltration input_file=[] sample_metadata=[] read_buffer_size=null phone_home=NO_ET read_filter=[] intervals=null excl
##contig=<ID=GL000240.1,length=41933,assembly=b37>
##reference=file:///locus/data/references/genomes/human_g1k_v37/sequences/human_g1k_v37.fasta
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LS99
1 145414740 . A N . PASS AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:137:99:0,412,5401:137,0
1 145414741 . G N . PASS AC=0;AF=0.00;AN=2;DP=138;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:138:99:0,415,5406:138,0
1 145414742 . C N . PASS AC=0;AF=0.00;AN=2;DP=139;MQ=212.39;MQ0=0 GT:DP:GQ:PL:AD 0/0:139:99:0,418,5289:139,0
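As a rough illustration of how a call line in this file breaks down, here is a small sketch that splits one VCF data line into its fixed columns, INFO pairs, and per-sample fields. The helper names `parse_info` and `parse_vcf_record` are hypothetical, values are left as strings apart from POS and QUAL, and real files are tab-separated even though the slide shows spaces.

```python
def parse_info(info):
    """Split 'AC=0;AF=0.00' into a dict; bare keys (flags) map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf_record(line, sample_names):
    """Parse one tab-separated VCF data line into a plain dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    record = {
        "chrom": fields[0],
        "pos": int(fields[1]),  # 1-based position
        "id": fields[2],
        "ref": fields[3],
        "alt": fields[4].split(","),
        "qual": None if fields[5] == "." else float(fields[5]),
        "filter": fields[6],
        "info": parse_info(fields[7]),
        "samples": {},
    }
    if len(fields) > 9:
        # FORMAT names the per-sample keys, in order.
        keys = fields[8].split(":")
        for sample, column in zip(sample_names, fields[9:]):
            record["samples"][sample] = dict(zip(keys, column.split(":")))
    return record


# First data line from the slide, for sample LS99.
LINE = "\t".join([
    "1", "145414740", ".", "A", "N", ".", "PASS",
    "AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0",
    "GT:DP:GQ:PL:AD",
    "0/0:137:99:0,412,5401:137,0",
])

record = parse_vcf_record(LINE, ["LS99"])
print(record["chrom"], record["pos"], record["samples"]["LS99"]["GT"])
```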

<?xml version="1.0" encoding="UTF-8"?>
<sample-attributes>
  <sample-info id="LS125" gender="Unknown" … >
  <reference>
    <organism>homo sapiens</organism>
    <build-id>human_g1k_v37</build-id>
  </reference>
  <loci>
    <locus chr="3" start="10183685" end="10183685" sequence="G" read-coverage="0">
      <alt read-coverage="" quality-score="" sequence="" locus-cvid-code="CVID1003741" locus-cvid="A|G" locus-cvid-start="10183685"/>
    </locus>

<locus-report format="1.0">
  <requisition>
    <!-- I'm not giving this section too much thought. Good enough for now
         Can update later, when we commercialize -->
    <client id="uuid"></client>
    <patient name="LS99" ethnicity="" gender="Male" dob="" id="LS99"></patient>
    …
    <conditions>
      <condition code="VonHL">
        <associated-conditions></associated-conditions>
      </condition>
    </conditions>
  </requisition>
  <coverage>
    <sequence minimum-depth="100" sensitivity="98.9" specificity="99.0">
      <region genome-build="GRCh37" chrom="3" end="440" start="400"></region>
      <region genome-build="GRCh37" chrom="3" end="700" start="600"></region>

variants_and_refagree.vcf filtered_on_callable.vcf: lake.mk reads.fastq
	(set -e; \
	source /locus/opt/lake/bin/lakeSetupEnv; \
	$(MAKE) -f $< $@; \
	) 2>$@.err

calls.vcf: variants_and_refagree.vcf #filtered_on_callable.vcf
	ln -s $< $@

attr.xml: calls.vcf req.xml sample.xml ${ATTR_FILE}
	generate_attributes_file.py …

report.xml: attr.xml req.xml
	sampleconditionreport $^ -o $@

report.html: report.xml
	reportrenderer $< -o $@

LIMS

<?xml version="1.0"?>
<requisition>
  <conditions>
    <condition>VonHL</condition>
  </conditions>
</requisition>

<?xml version="1.0" encoding="UTF-8"?>
<samples>
  <sample-info id="LS99" gender="Male" birth-date="/Date(1320120000000-0700)/" type="GenomicDNA" status="New" ordering-clinician="JMajor" nanodrop-concentration="300" original-barcode="NA06994" use-type="RD" origin="Coriell" code="NA06994" concentration="300" description="" accession-date="" accession-user=""/>
</samples>
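The requisition document is small enough that a downstream step could read the requested conditions with a few lines of standard-library code. The sketch below is only an illustration of that idea using Python's `xml.etree.ElementTree`; the function name `requested_conditions` is hypothetical and nothing here is taken from the actual Locus tools.

```python
import xml.etree.ElementTree as ET

# Requisition document as shown on the slide.
REQUISITION = """<?xml version="1.0"?>
<requisition>
  <conditions>
    <condition>VonHL</condition>
  </conditions>
</requisition>"""


def requested_conditions(xml_text):
    """Return the condition codes listed in a requisition document."""
    root = ET.fromstring(xml_text)
    if root.tag != "requisition":
        raise ValueError("unexpected root element: " + root.tag)
    return [
        node.text.strip()
        for node in root.findall("./conditions/condition")
        if node.text and node.text.strip()
    ]


conditions = requested_conditions(REQUISITION)
print(conditions)
```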

The pipeline in action

$ ls reads.fastq.gz req.xml sample.* Makefile
Makefile  reads.fastq.gz  req.xml  sample.info  sample.xml

$ time make report.html report.pdf
gzip -cdq <reads.fastq.gz >reads.fastq
lake --recipe reads_to_variants >lake.mk
ln -s variants_and_refagree.vcf calls.vcf
generate_attributes_file.py ...
sampleconditionreport attr.xml req.xml -o report.xml
reportrenderer report.xml -o report.html
wkhtmltopdf report.html report.pdf

real 7m14.804s
user 7m16.490s
sys 2m0.150s

18/28

Locus Interpretation

19/28

The big lesson…

Transcripts are much messier than expected.

20/28

Problem statement

There is no single source of transcripts that is all of: stable (archived), mapped, in agreement with the reference genome, and carrying RefSeq accessions.

➢ Issues:
● Poor access / programmability
● No archived mappings
● RefSeq != reference genome due to origin, ambiguity, error
● Patches are difficult to use

21/28

When RefSeq != Genome Reference

NM_0123.4:c.45C>T

NC_000006.11:g.31030103C>T

variant published relative to RefSeq

NM_0123.4:c.832T>G

NC_000006.11:g.31038124T>G

discovered variant reported relative to RefSeq

Diagram labels: mismatch (A); ins/del (-); downstream coordinates shifted
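To make the coordinate shift concrete, here is a toy sketch that maps a transcript position to a genome offset through a CIGAR-like alignment. Everything in it is illustrative: the function `transcript_to_genome`, the operation codes, and the alignments are made up for the example, and it ignores introns, strand, and the c./g. numbering conventions of real HGVS names.

```python
def transcript_to_genome(pos, alignment):
    """Map a 0-based transcript position to a 0-based genome offset.

    alignment is a list of (op, length) pairs:
      'M'  aligned bases (match or mismatch)
      'I'  bases present in the transcript but absent from the genome
      'D'  bases present in the genome but absent from the transcript
    Returns None when the transcript base has no genomic counterpart.
    """
    t = g = 0  # bases consumed so far in transcript and genome
    for op, length in alignment:
        if op == "M":
            if pos < t + length:
                return g + (pos - t)
            t += length
            g += length
        elif op == "I":
            if pos < t + length:
                return None
            t += length
        elif op == "D":
            g += length
        else:
            raise ValueError("unknown alignment operation: " + op)
    raise IndexError("position lies beyond the aligned transcript")


# Hypothetical alignments of a 100-base transcript.
identical = [("M", 100)]
genome_has_2_extra = [("M", 40), ("D", 2), ("M", 58)]
transcript_has_3_extra = [("M", 40), ("I", 3), ("M", 57)]

for label, alignment in [
    ("identical", identical),
    ("genome has 2 extra bases", genome_has_2_extra),
    ("transcript has 3 extra bases", transcript_has_3_extra),
]:
    print(label, transcript_to_genome(30, alignment), transcript_to_genome(60, alignment))
```

Positions upstream of the difference map identically in all three cases, while everything downstream is offset by the size of the insertion or deletion. A mismatch changes the base but not the coordinates.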

22/28

17.8% of RefSeq transcripts differ from GRCh37

Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011). MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658

5.4% have coordinate-changing differences

23/28

Sources of transcript information

➢ NCBI:
● maps current transcripts to current genome only
● maps with splign
● doesn't agree with ref genome for ~18% of transcripts
● no local database option

➢ UCSC:
● current transcripts only
● maps using blat

➢ Ensembl:
● aligns using in-house gene building process
● cross-linked to RefSeqs
● incorporates NCBI transcripts ad hoc
● well-maintained; good API; broad data; VEP

24/28

PTEN: insertion/deletion in 5' UTR

25/28

NEFL: genome insertion leads to frameshift/stop

26/28

RefSeq Handling

---------- Forwarded message ----------
Date: Wed, Jan 25, 2012 at 1:59 PM
Subject: [Genome] How does UCSC hg19 gene model add exons to RefSeqs?
To: genome@soe.ucsc.edu

Hi, when using the human reference hg19 gene model… where the hg19 model has an exon that does not exist in the RefSeq accession (or any historical version of the RefSeq accession).

How/why does the alignment introduce an intron in this case? Does it ensure there are plausible flanking splice junctions before inserting an intron to a RefSeq sequence that lacks it but it maps to?

27/28

338 genes so far

➢ We should encourage LRG and adopt it when ready (and we'll still have to deal with legacy transcripts)

28/28

Not pictured: Jon Sorenson
