Building a clinical genome interpretation services company

Reece Hart, Ph.D.
reece@locusdev.net

Locus Development Inc.
http://locusdevelopmentinc.com/

Opportunity

Clinical Genome Interpretation

Photos: Baylor College of Medicine, Univ. Utah, learningradiology.com, sciencephotos.com

Patient presents with symptoms

If genomic interpretation might influence diagnosis or treatment, doctor refers patient to genetic counselor

GC takes history; sample is sent to an internal lab or to one of hundreds of labs that provide specific genomic tests

Sequencing and other lab data are processed into a preliminary interpretation

Report is returned to GC and/or physician, who verify the interpretation and consult with the patient

100s of laboratory diagnostic testing labs

Common variants are hard to interpret

Some variants are informative

The Significance of “Variants of Uncertain Significance”

“VUS – Variant of uncertain significance. A variation in a genetic sequence whose association with disease risk is unknown. Also called variant of uncertain significance, variant of unknown significance, and unclassified variant.”
http://www.cancer.gov/cancertopics/genetics-terms-alphalist

The long tail of rare diseases.

“A rare disease typically affects a patient population estimated at fewer than 200,000 in the U.S. There are more than 6,000 rare diseases known today and they affect an estimated 25 million persons in the U.S.”

NIH Office of Rare Diseases Research
http://rarediseases.info.nih.gov/

The Problems to Solve

➢ Develop a reliable database of genotypes and phenotypes.

➢ Develop methods to interpret all types of variants, not just common SNVs.

➢ Provide meaningful, reliable interpretations based on genomic data.

➢ Do it better than everyone else.

Plan

Company Overview

Genomic Sequence and Variants

Clinical Interpretation

Curating Genotypes, Phenotypes, and Risk

Genotypes/Variants

Phenotypes/Conditions

Genotype-Phenotype Database

Risk Models

dbSNP, LSDBs, PharmGKB, …

GO, OMIM, ICD-9/10, …

workflow and tracking

Locus Overview

[Diagram labels: hospitals/clinics, physicians, insurers; sequences; variants/attributes; condition predictions; interpretation]

Implementation

Curation Content

➢ Many sources
    ● automated and manual tools
    ● databases and literature

➢ Most kinds of variants
    ● SNV, del, ins, delins, repeat, conv, CNV, haplotypes

➢ Many kinds of conditions
    ● inherited, spontaneous, dominant, recessive, x-linked, preventative, cancer, metabolic, pharmacogenomic, cardio

➢ Examples:
    ● Cystic Fibrosis (w/modifiers)
    ● CMT (~21 subclasses)
    ● Long and Short QT
    ● TPMT, warfarin, CYP2D6
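The simplest of the variant kinds listed above (SNV, del, ins, delins) can be told apart from the reference and alternate alleles alone. The following is a minimal sketch of that idea, not the curation tooling described in the talk; the function name `classify_variant` is hypothetical, and repeats, conversions, CNVs, and haplotypes need more context than a single allele pair and are not handled.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a REF/ALT allele pair as SNV, del, ins, or delins.

    Hypothetical helper for illustration only. Shared leading and
    trailing bases are trimmed first, so padded alleles such as
    ATG -> A are still recognised as a deletion.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "ref"  # no change relative to the reference
    # Trim the common prefix, then the common suffix.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if ref and not alt:
        return "del"
    if alt and not ref:
        return "ins"
    return "delins"


for ref, alt in [("A", "G"), ("ATG", "A"), ("A", "ATG"), ("ATG", "ACC")]:
    print(ref, alt, classify_variant(ref, alt))
```

Trimming before classifying keeps the result independent of how much flanking sequence the caller included in the alleles.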

execution framework

The pipeline

[Diagram labels: reads (fastq); variant calling; calls (vcf); selection; attributes (xml); interpretation; report (xml); req'n and sample info; cond'n var.; risk models; curation]

@G88NFDU01AI6Z3 rank=0000170 x=101.0 y=1953.5 length=56
AGTGTAGTAGTGAGAAAAACTTTGTGGGGATATGGATACAATTATTTACCCAAATC
+
IIIIIIIIIIIGC>////-....826666<EIIIIIIIIIIIHI6644/..222==
@G88NFDU01AKOQI rank=0000178 x=118.0 y=1960.0 length=59
agtgtagtagtaaggaagattgagtgcctgaccttCCGGGTGGCGGTAGCGTTGGCCCC
+
BHBEEIIIIIEEEEEBGBECCCDEIIIIIIIIIIIEEICC===988ED>?>>>88...-
@G88NFDU01AL6H7 rank=0000323 x=135.0 y=2013.5 length=95
agtgtagtagtgtgagctggtgaagaaggtctccGATGTCATATGGAACAGCCTCAGCCGCTCCTACTTCAAGGATCGGGCCCACATCCAGTCCC
+
=>BBBBB==;;B>454@EA@>>===>>BBIIE@ACIGIEIFFDD66665@@:::>AA777A<;;>A>?>>4433;>>;660000.9=85533,,,
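The reads above are FASTQ records: a header, the bases, a `+` separator, and one quality character per base. As a rough sketch of how such records can be read, the parser below assumes four lines per record and Phred+33 quality encoding; both are assumptions, since the slides do not state the encoding. `FastqRecord` and `parse_fastq` are hypothetical names, and the example record is made up rather than taken from the slide.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (no wrapped sequences).

    `offset` is the assumed Phred encoding offset (33 by default).
    """
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if not header.startswith("@") or separator is None or quality is None:
            raise ValueError(f"malformed record near {header!r}")
        if not separator.startswith("+") or len(sequence) != len(quality):
            raise ValueError(f"malformed record near {header!r}")
        name = header[1:].split()[0]  # read id is the first token
        yield FastqRecord(name, sequence, [ord(c) - offset for c in quality])


# Illustrative record, not taken from the slides.
example = "@read1 length=4\nACGT\n+\nII5!\n"
records = list(parse_fastq(example.splitlines()))
print(records[0].name, records[0].qualities)
```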

##fileformat=VCFv4.1
…
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic 
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in 
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##VariantFiltration="analysis_type=VariantFiltration input_file=[] sample_metadata=[] read_buffer_size=null phone_home=NO_ET read_filter=[] intervals=null excl
##contig=<ID=GL000240.1,length=41933,assembly=b37>
##reference=file:///locus/data/references/genomes/human_g1k_v37/sequences/human_g1k_v37.fasta
##source=SelectVariants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT LS99
1 145414740 . A N . PASS AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:137:99:0,412,5401:137,0
1 145414741 . G N . PASS AC=0;AF=0.00;AN=2;DP=138;MQ=213.16;MQ0=0 GT:DP:GQ:PL:AD 0/0:138:99:0,415,5406:138,0
1 145414742 . C N . PASS AC=0;AF=0.00;AN=2;DP=139;MQ=212.39;MQ0=0 GT:DP:GQ:PL:AD 0/0:139:99:0,418,5289:139,0
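Each data line in the VCF excerpt above carries eight fixed columns, a FORMAT column, and one column per sample. The sketch below shows one way to split such a line into its INFO key/value pairs and per-sample fields. It assumes tab-delimited input, as the VCF format specifies (the slide text shows spaces only because of how it was rendered), and `parse_vcf_record` is a hypothetical helper, not part of the pipeline described here.

```python
from typing import Dict, List


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Split one tab-delimited VCF data line into a dictionary.

    Hypothetical helper for illustration; values are left as strings
    except POS, which is converted to an integer.
    """
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info, fmt = cols[:9]

    info_dict: Dict[str, object] = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # bare keys are flags

    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": filt,
        "info": info_dict, "samples": samples,
    }


# First data line of the excerpt, re-joined with tabs.
line = "\t".join([
    "1", "145414740", ".", "A", "N", ".", "PASS",
    "AC=0;AF=0.00;AN=2;DP=137;MQ=213.16;MQ0=0",
    "GT:DP:GQ:PL:AD", "0/0:137:99:0,412,5401:137,0",
])
record = parse_vcf_record(line, ["LS99"])
print(record["pos"], record["info"]["DP"], record["samples"]["LS99"]["GT"])
```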

<?xml version="1.0" encoding="UTF-8"?>
<sample-attributes>
  <sample-info id="LS125" gender="Unknown" … >
  <reference>
    <organism>homo sapiens</organism>
    <build-id>human_g1k_v37</build-id>
  </reference>
  <loci>
    <locus chr="3" start="10183685" end="10183685" sequence="G" read-coverage="0">
      <alt read-coverage="" quality-score="" sequence="" locus-cvid-code="CVID1003741" locus-cvid="A|G" locus-cvid-start="10183685"/>
    </locus>

<locus-report format="1.0">
  <requisition>
    <client id="uuid"></client>
    <patient name="LS99" ethnicity="" gender="Male" dob="" id="LS99"></patient>
    …
    <conditions>
      <condition code="VonHL">
        <associated-conditions></associated-conditions>
      </condition>
    </conditions>
  </requisition>
  <coverage>
    <sequence minimum-depth="100" sensitivity="98.9" specificity="99.0">
      <region genome-build="GRCh37" chrom="3" end="440" start="400"></region>
      <region genome-build="GRCh37" chrom="3" end="700" start="600"></region>

variants_and_refagree.vcf filtered_on_callable.vcf: lake.mk reads.fastq
	(set -e; \
	source /locus/opt/lake/bin/lakeSetupEnv; \
	$(MAKE) -f $< $@; \
	) 2>$@.err

calls.vcf: variants_and_refagree.vcf #filtered_on_callable.vcf
	ln -s $< $@

attr.xml: calls.vcf req.xml sample.xml ${ATTR_FILE}
	generate_attributes_file.py …

report.xml: attr.xml req.xml
	sampleconditionreport $^ -o $@

report.html: report.xml
	reportrenderer $< -o $@
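The rules above form a dependency graph, and make builds each target only after its prerequisites. As an illustration of that ordering (not of how make itself is implemented), the sketch below feeds the same prerequisites to Python's `graphlib.TopologicalSorter`, which requires Python 3.9 or later. The `${ATTR_FILE}` prerequisite is left out because its value is not shown on the slide.

```python
from graphlib import TopologicalSorter

# target -> prerequisites, transcribed from the Makefile rules above.
# ${ATTR_FILE} is omitted because its value is not shown.
DEPENDENCIES = {
    "variants_and_refagree.vcf": {"lake.mk", "reads.fastq"},
    "calls.vcf": {"variants_and_refagree.vcf"},
    "attr.xml": {"calls.vcf", "req.xml", "sample.xml"},
    "report.xml": {"attr.xml", "req.xml"},
    "report.html": {"report.xml"},
}

# static_order() yields every node after all of its prerequisites.
build_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(build_order)
```

The exact position of the input files in the printed order may vary, but every target appears after the files it depends on, and `report.html` comes last because everything else feeds into it.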

LIMS

<?xml version="1.0"?>
<requisition>
  <conditions>
    <condition>VonHL</condition>
  </conditions>
</requisition>

<?xml version="1.0" encoding="UTF-8"?>
<samples>
  <sample-info id="LS99" gender="Male" birth-date="/Date(1320120000000-0700)/" type="GenomicDNA" status="New" ordering-clinician="JMajor" nanodrop-concentration="300" original-barcode="NA06994" use-type="RD" origin="Coriell" code="NA06994" concentration="300" description="" accession-date="" accession-user=""/>
</samples>
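The two LIMS documents above are small enough to read with Python's standard `xml.etree.ElementTree`. The sketch below pulls the requested condition codes and a few sample attributes. It also decodes `birth-date` on the assumption that it follows the `/Date(milliseconds±offset)/` convention used by some JSON serializers; the slides do not say so, and the helper name `decode_lims_date` is hypothetical. Only a subset of the sample attributes is reproduced.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

REQUISITION_XML = (
    "<requisition><conditions><condition>VonHL</condition>"
    "</conditions></requisition>"
)
# Subset of the sample-info attributes shown above.
SAMPLES_XML = (
    '<samples><sample-info id="LS99" gender="Male" '
    'birth-date="/Date(1320120000000-0700)/" type="GenomicDNA" '
    'origin="Coriell"/></samples>'
)


def decode_lims_date(text: str) -> datetime:
    """Decode '/Date(ms+hhmm)/' into a UTC datetime (assumed format)."""
    match = re.fullmatch(r"/Date\((-?\d+)([+-]\d{4})?\)/", text)
    if match is None:
        raise ValueError(f"unrecognised date: {text!r}")
    # The leading number is taken as milliseconds since the Unix epoch.
    return datetime.fromtimestamp(int(match.group(1)) / 1000, tz=timezone.utc)


conditions = [c.text for c in ET.fromstring(REQUISITION_XML).iter("condition")]
sample = ET.fromstring(SAMPLES_XML).find("sample-info")
sample_id = sample.get("id")
birth_date = decode_lims_date(sample.get("birth-date"))
print(conditions, sample_id, birth_date.isoformat())
```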

The pipeline in action

$ ls reads.fastq.gz req.xml sample.* Makefile
Makefile reads.fastq.gz req.xml sample.info sample.xml

$ time make report.html report.pdf
gzip -cdq <reads.fastq.gz >reads.fastq
lake --recipe reads_to_variants >lake.mk
ln -s variants_and_refagree.vcf calls.vcf
generate_attributes_file.py ...
sampleconditionreport attr.xml req.xml -o report.xml
reportrenderer report.xml -o report.html
wkhtmltopdf report.html report.pdf

real 7m14.804s
user 7m16.490s
sys 2m0.150s

Locus Interpretation

The big lesson…

Transcripts are much messier than expected.

Problem statement

No single source of transcripts is all of: stable (archived), mapped, in agreement with the reference genome, and identified by RefSeq accessions.

➢ Issues:
    ● Poor access / programmability
    ● No archived mappings
    ● RefSeq != reference genome due to origin, ambiguity, error
    ● Patches are difficult to use

When RefSeq != Genome Reference

Variant published relative to RefSeq:
NM_0123.4:c.45C>T
NC_000006.11:g.31030103C>T

Discovered variant reported relative to RefSeq:
NM_0123.4:c.832T>G
NC_000006.11:g.31038124T>G

[Figure labels: mismatch; ins/del; downstream coordinates shifted]
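The variant names on this slide follow the pattern accession.version, a coordinate type (`c.` for coding, `g.` for genomic), a position, and a base change. A minimal sketch of pulling those parts out with a regular expression follows. It handles simple substitutions only, not the full HGVS nomenclature, and `SimpleSubstitution` and `parse_substitution` are hypothetical names.

```python
import re
from typing import NamedTuple


class SimpleSubstitution(NamedTuple):
    accession: str
    version: int
    kind: str      # "c" (coding) or "g" (genomic)
    position: int
    ref: str
    alt: str


# Covers only names shaped like ACCESSION.VERSION:c.123A>G or ...:g.123A>G.
_PATTERN = re.compile(
    r"(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"(?P<kind>[cg])\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)


def parse_substitution(name: str) -> SimpleSubstitution:
    """Parse a simple substitution name; raise ValueError otherwise."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a simple substitution: {name!r}")
    return SimpleSubstitution(
        match["accession"], int(match["version"]), match["kind"],
        int(match["position"]), match["ref"], match["alt"],
    )


transcript_variant = parse_substitution("NM_0123.4:c.45C>T")
genomic_variant = parse_substitution("NC_000006.11:g.31030103C>T")
print(transcript_variant)
print(genomic_variant)
```

Keeping the version as a separate field matters here, because a position is only meaningful against one specific version of the transcript or genome sequence.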

17.8% of RefSeq transcripts differ from GRCh37

Garla, V., Kong, Y., Szpakowski, S., & Krauthammer, M. (2011). MU2A--reconciling the genome and transcriptome to determine the effects of base substitutions. Bioinformatics (Oxford, England), 27(3), 416-8. doi:10.1093/bioinformatics/btq658

5.4% have coordinate-changing differences
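A coordinate-changing difference is an insertion or deletion between the transcript and the genome: every position downstream of it maps to a shifted genomic coordinate, while a plain mismatch leaves coordinates alone. The sketch below illustrates this with ungapped alignment blocks. It is a simplified model (plus strand, 0-based, made-up coordinates) and not the mapping code used by any of the sources discussed here; `transcript_to_genome` is a hypothetical name.

```python
from typing import List, Optional, Tuple

# One ungapped alignment block: (transcript_start, genome_start, length).
Block = Tuple[int, int, int]


def transcript_to_genome(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based transcript position to a genome position.

    Returns None when the position lies in transcript sequence that
    has no aligned counterpart in the genome.
    """
    for t_start, g_start, length in blocks:
        if t_start <= pos < t_start + length:
            return g_start + (pos - t_start)
    return None


# Hypothetical 100-base transcript aligned at genome position 1000.
identical = [(0, 1000, 100)]
# Same transcript, but the genome carries one extra base after
# transcript position 29, so the second block starts one base later.
genome_insertion = [(0, 1000, 30), (30, 1031, 70)]

for pos in (10, 50):
    print(pos,
          transcript_to_genome(pos, identical),
          transcript_to_genome(pos, genome_insertion))
```

Position 10 lies upstream of the inserted base and maps to the same place in both alignments; position 50 lies downstream and lands one base further along, which is the shift the slide describes.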

Sources of transcript information

➢ NCBI:
    ● maps current transcripts to current genome only
    ● maps with splign
    ● ~18% of transcripts don't agree with ref genome
    ● no local database option

➢ UCSC:
    ● current transcripts only
    ● maps using blat

➢ Ensembl:
    ● aligns using in-house gene building process
    ● cross-linked to RefSeqs
    ● incorporates NCBI transcripts ad hoc
    ● well-maintained; good API; broad data; VEP

PTEN: insertion/deletion in 5' UTR

NEFL: genome insertion leads to frameshift/stop

RefSeq Handling

---------- Forwarded message ----------
Date: Wed, Jan 25, 2012 at 1:59 PM
Subject: [Genome] How does UCSC hg19 gene model add exons to RefSeqs?
To: genome@soe.ucsc.edu

Hi, when using the human reference hg19 gene model… where the hg19 model has an exon that does not exist in the RefSeq accession (or any historical version of the RefSeq accession).

How/why does the alignment introduce an intron in this case? Does it ensure there are plausible flanking splice junctions before inserting an intron to a RefSeq sequence that lacks it but it maps to?

338 genes so far

➢ We should encourage LRG and adopt it when ready (and we'll still have to deal with legacy transcripts)

Not pictured: Jon Sorenson
