A Distributed Annotation Pipeline for MSSNG

Posted on 15-Apr-2017


A Distributed Annotation Pipeline for MSSNG

Building infrastructure to annotate 10,000 Autism Spectrum Disorder genomes using Google’s Cloud.

Simon Twigger, Ph.D.
BioTeam

About MSSNG

MSSNG is a collaboration between Google and Autism Speaks to create the world’s largest genomic database on autism. By sequencing the DNA of over 10,000 families affected by autism, MSSNG will answer the many questions we still have about the disorder.

Upcoming Release: 1711 individuals (681 affected, 1030 unaffected)

https://mss.ng

General Process

Consented Families → DNA Samples → DNA Sequencer

Ref: ACGTGCGATCCTAGCTACG
Sub: ACGTGCGAACCTAGCTACG
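The Ref/Sub pair on this slide differs at a single base. As a minimal sketch of how such a substitution could be located (the function name is hypothetical, and real variant callers work from aligned reads rather than two equal-length strings), a position-by-position comparison looks like this:

```python
def find_substitutions(ref: str, sample: str):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(ref) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(ref, sample))
        if r != s
    ]


# The two sequences shown on the slide.
ref = "ACGTGCGATCCTAGCTACG"
sub = "ACGTGCGAACCTAGCTACG"

print(find_substitutions(ref, sub))  # [(9, 'T', 'A')]
```

Here the only difference is a T to A change at the ninth base.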

Total Variants in BigQuery

Find the Unique Variants

SELECT
  CONCAT(reference_name, '-', CAST(start AS STRING), '-', CAST(end AS STRING), '-', reference_bases, '-', alternate_bases) AS id,
  reference_name,
  start + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 1
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF(reference_bases CONTAINS alternate_bases, LENGTH(alternate_bases) + 1, 1)
    ELSE 1
  END AS start_base_one,
  end + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 0
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN 0
    ELSE 0
  END AS end_base_one,
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN IF(alternate_bases CONTAINS reference_bases, '-', reference_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF(reference_bases CONTAINS alternate_bases, SUBSTR(reference_bases, 1 + LENGTH(alternate_bases)), reference_bases)
    ELSE reference_bases
  END AS reference_bases_one,
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN IF(alternate_bases CONTAINS reference_bases, SUBSTR(alternate_bases, 1 + LENGTH(reference_bases)), alternate_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN IF(reference_bases CONTAINS alternate_bases, '-', alternate_bases)
    ELSE alternate_bases
  END AS alternate_bases_one
FROM FLATTEN([mssng_20150303.variants], call)
WHERE call.FILTER = 'PASS'
  OR call.FILTER = 'VQSRTrancheSNP99.90to100.00'
  OR call.FILTER = 'VQSRTrancheINDEL99.90to100.00'
OMIT RECORD IF EVERY(alternate_bases IS NULL)
  OR EVERY(alternate_bases = '<NON_REF>')
  OR reference_bases = ''
GROUP EACH BY id, reference_name, start_base_one, end_base_one, reference_bases_one, alternate_bases_one;
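The CASE expressions in the query convert each variant's coordinates to a 1-based start and rewrite insertions and deletions with '-' standing in for the missing side. The sketch below mirrors that per-row logic in Python so it can be tried on single variants. The function name and the returned dictionary are illustrative only, and Python's `in` operator is used as an approximation of the legacy SQL CONTAINS substring test.

```python
def normalize_variant(reference_name, start, end, ref, alt):
    """Mirror the CASE logic of the query for one (ref, alt) pair."""
    variant_id = f"{reference_name}-{start}-{end}-{ref}-{alt}"

    if len(ref) < len(alt):
        # Insertion: start always shifts by 1.
        start_one = start + 1
        if ref in alt:
            ref_one, alt_one = "-", alt[len(ref):]
        else:
            ref_one, alt_one = ref, alt
    elif len(ref) > len(alt):
        # Deletion: skip the shared bases when alt is contained in ref.
        if alt in ref:
            start_one = start + len(alt) + 1
            ref_one, alt_one = ref[len(alt):], "-"
        else:
            start_one = start + 1
            ref_one, alt_one = ref, alt
    else:
        # Same length (e.g. a SNV): only the start shifts.
        start_one = start + 1
        ref_one, alt_one = ref, alt

    # Every branch of the query's end_base_one CASE adds 0.
    end_one = end

    return {
        "id": variant_id,
        "reference_name": reference_name,
        "start_base_one": start_one,
        "end_base_one": end_one,
        "reference_bases_one": ref_one,
        "alternate_bases_one": alt_one,
    }


# The SNV from the example annotation slide.
snv = normalize_variant("X", 148037416, 148037417, "G", "T")
print(snv["start_base_one"], snv["end_base_one"])  # 148037417 148037417

# A made-up deletion, for illustration only.
deletion = normalize_variant("1", 100, 103, "ATG", "A")
print(deletion["start_base_one"], deletion["reference_bases_one"], deletion["alternate_bases_one"])  # 102 TG -
```

The filtering (WHERE, OMIT RECORD IF) and de-duplication (GROUP EACH BY) steps of the query are not reproduced here.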

Unique Variants in BigQuery

38.5M Variants - Which one(s) are important?

Many different ways to prioritize, e.g. variants which are…

Only found in affected patients?

Only found in specific family trees?

In genes believed to be associated with ASD?

Associated with biological pathways implicated in ASD?

Damaging to a protein involved in a neurological pathway?

In a relevant gene’s regulatory region?

In a gene known to be pathogenic in other relevant diseases?

Not seen in patients with some other disease?

Next step is to annotate the variants with known data.
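Two of the criteria on this slide (found only in affected patients, located in a gene believed to be associated with ASD) could be combined into a simple filter once the variants carry annotations. The sketch below is an assumption about how that might look, not the project's method: the field names, the gene names, and the toy records are all invented for illustration.

```python
def prioritize(variants, candidate_genes):
    """Keep variants carried only by affected individuals and located in a candidate gene."""
    return [
        v
        for v in variants
        if v["affected_carriers"] > 0
        and v["unaffected_carriers"] == 0
        and v["symbol"] in candidate_genes
    ]


# Toy records, invented purely for illustration.
toy_variants = [
    {"id": "var-1", "symbol": "GENE_A", "affected_carriers": 3, "unaffected_carriers": 0},
    {"id": "var-2", "symbol": "GENE_A", "affected_carriers": 2, "unaffected_carriers": 5},
    {"id": "var-3", "symbol": "GENE_B", "affected_carriers": 1, "unaffected_carriers": 0},
]

kept = prioritize(toy_variants, candidate_genes={"GENE_A"})
print([v["id"] for v in kept])  # ['var-1']
```

The other criteria (pathways, regulatory regions, other diseases) would need additional annotation fields of the same kind.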

Variant Annotation

• Existing annotation pipeline written in Perl

• ANNOVAR used to associate variants with existing biological knowledge

• Moved the entire infrastructure to Google’s Cloud so that it can use the Google Genomics API and run entirely on the Google platform

• Provide management tools for the annotation databases and the annotation process

Annotation Jobs

38.5M variants in ~36 hrs
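To put the job figures in perspective, the aggregate throughput can be estimated directly from the two numbers on the slide. This is a back-of-the-envelope sketch: the function name is hypothetical, "~36 hrs" is treated as exactly 36 hours, and the number of workers is not given in the slides, so only the overall rate is computed.

```python
def variants_per_second(total_variants: float, hours: float) -> float:
    """Average aggregate annotation rate over the whole run."""
    return total_variants / (hours * 3600)


rate = variants_per_second(38_500_000, 36)
print(f"{rate:.0f} variants/s")  # 297 variants/s
```

That is roughly 300 variants annotated per second across the whole pipeline.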

Example Variant Annotations

Field    Value                         Description
ID       X-148037416-148037417-G-T     Location and DNA change
Effect   nonsynonymous SNV             Predicted effect (Leu->Phe)
Symbol   AFF2                          AF4/FMR2 family, member 2
OMIM     Mental retardation, X-linked  Known disease phenotype
HPO      HP:0000152                    Human Phenotype Ontology IDs
dbSNP    rs371160275                   dbSNP identifier
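The ID in the example follows the pattern built by the CONCAT in the unique-variants query: reference name, start, end, reference bases, and alternate bases joined by hyphens. A minimal sketch of splitting such an ID back into its parts follows. The class and function names are hypothetical, and it assumes the reference name itself contains no hyphen.

```python
from typing import NamedTuple


class VariantId(NamedTuple):
    reference_name: str
    start: int
    end: int
    reference_bases: str
    alternate_bases: str


def parse_variant_id(variant_id: str) -> VariantId:
    """Split 'name-start-end-ref-alt' into typed fields."""
    # maxsplit=4 keeps anything after the fourth hyphen in the last field.
    name, start, end, ref, alt = variant_id.split("-", 4)
    return VariantId(name, int(start), int(end), ref, alt)


parsed = parse_variant_id("X-148037416-148037417-G-T")
print(parsed.reference_name, parsed.start, parsed.end)  # X 148037416 148037417
```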

10B observed variants

38M unique variants

?? causative variants

https://www.mss.ng/researchers

MSSNG’s philosophy is to promote and enable ‘open science’ research that leads to a better understanding of autism. We welcome you to join us.

https://www.mss.ng/poster
