A Distributed Annotation Pipeline for MSSNG

Building infrastructure to annotate 10,000 Autism Spectrum Disorder genomes using Google’s Cloud. Simon Twigger, Ph.D., BioTeam

Upload: simon-twigger

Post on 15-Apr-2017


TRANSCRIPT

Page 1: A Distributed Annotation Pipeline for MSSNG

A Distributed Annotation Pipeline for MSSNG

Building infrastructure to annotate 10,000 Autism Spectrum Disorder genomes using Google’s Cloud.

Simon Twigger, Ph.D., BioTeam

Page 2: A Distributed Annotation Pipeline for MSSNG

ABOUT MSSNG

MSSNG is a collaboration between Google and Autism Speaks to create the world’s largest genomic database on autism. By sequencing the DNA of over 10,000 families affected by autism, MSSNG will answer the many questions we still have about the disorder.

Upcoming release: 1,711 individuals (681 affected, 1,030 unaffected)

https://mss.ng

Page 3: A Distributed Annotation Pipeline for MSSNG

General Process

Consented Families

DNA Samples

DNA Sequencer

Ref: ACGTGCGATCCTAGCTACG
Sub: ACGTGCGAACCTAGCTACG
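The Ref/Sub pair on this slide differs at a single base. As a minimal sketch of what "a variant" means here, the snippet below compares two sequences position by position. The function name `find_substitutions` is illustrative and not part of the MSSNG pipeline, and it assumes the two sequences are already aligned and of equal length; real variant calling from sequencing reads is far more involved.

```python
def find_substitutions(ref: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(ref) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(ref, sample))
        if r != s
    ]


# The two sequences shown on the slide.
ref = "ACGTGCGATCCTAGCTACG"
sub = "ACGTGCGAACCTAGCTACG"
print(find_substitutions(ref, sub))  # [(9, 'T', 'A')]
```

For the slide's example this reports one substitution, T to A at position 9 of the fragment.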

Page 4: A Distributed Annotation Pipeline for MSSNG

Total Variants in Big Query

Page 5: A Distributed Annotation Pipeline for MSSNG

Find the Unique Variants

-- Legacy BigQuery SQL: one output row per distinct variant.
SELECT
  CONCAT(reference_name, '-', CAST(start AS STRING), '-', CAST(end AS STRING), '-', reference_bases, '-', alternate_bases) AS id,
  reference_name,
  -- Convert the start coordinate to one-based; for deletions, skip the shared leading bases.
  start + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 1
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN
      IF (reference_bases CONTAINS alternate_bases, LENGTH(alternate_bases) + 1, 1)
    ELSE 1
  END AS start_base_one,
  -- The end coordinate is unchanged in every case.
  end + CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN 0
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN 0
    ELSE 0
  END AS end_base_one,
  -- Reference allele: '-' for insertions, only the deleted bases for deletions.
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN
      IF (alternate_bases CONTAINS reference_bases, '-', reference_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN
      IF (reference_bases CONTAINS alternate_bases, SUBSTR(reference_bases, 1 + LENGTH(alternate_bases)), reference_bases)
    ELSE reference_bases
  END AS reference_bases_one,
  -- Alternate allele: only the inserted bases for insertions, '-' for deletions.
  CASE
    WHEN LENGTH(reference_bases) < LENGTH(alternate_bases) THEN
      IF (alternate_bases CONTAINS reference_bases, SUBSTR(alternate_bases, 1 + LENGTH(reference_bases)), alternate_bases)
    WHEN LENGTH(reference_bases) > LENGTH(alternate_bases) THEN
      IF (reference_bases CONTAINS alternate_bases, '-', alternate_bases)
    ELSE alternate_bases
  END AS alternate_bases_one
FROM FLATTEN ([mssng_20150303.variants], call)
WHERE call.FILTER = 'PASS'
  OR call.FILTER = 'VQSRTrancheSNP99.90to100.00'
  OR call.FILTER = 'VQSRTrancheINDEL99.90to100.00'
OMIT RECORD IF EVERY(alternate_bases IS NULL)
  OR EVERY(alternate_bases = '<NON_REF>')
  OR reference_bases = ''
GROUP EACH BY id, reference_name, start_base_one, end_base_one, reference_bases_one, alternate_bases_one;
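The CASE expressions in the query are easier to follow on concrete inputs. The sketch below mirrors them in Python for a single (reference, alternate) pair and uses a set to stand in for the GROUP BY de-duplication. The function name `to_one_based` and the chromosome 1 records are made up for illustration; only the X-chromosome record is taken from the example ID shown later in the deck. Python's `in` stands in for the query's CONTAINS (a substring test), and slicing stands in for the one-based SUBSTR.

```python
def to_one_based(reference_name, start, end, ref, alt):
    """Mirror the query's CASE logic for one variant record."""
    variant_id = f"{reference_name}-{start}-{end}-{ref}-{alt}"
    if len(ref) < len(alt):            # insertion-like
        start_one = start + 1
        if ref in alt:
            ref_one, alt_one = "-", alt[len(ref):]
        else:
            ref_one, alt_one = ref, alt
    elif len(ref) > len(alt):          # deletion-like
        if alt in ref:
            start_one = start + len(alt) + 1
            ref_one, alt_one = ref[len(alt):], "-"
        else:
            start_one = start + 1
            ref_one, alt_one = ref, alt
    else:                              # same length, e.g. a SNV
        start_one = start + 1
        ref_one, alt_one = ref, alt
    end_one = end                      # the query adds 0 in every branch
    return (variant_id, reference_name, start_one, end_one, ref_one, alt_one)


calls = [
    ("X", 148037416, 148037417, "G", "T"),
    ("X", 148037416, 148037417, "G", "T"),  # same variant seen in a second sample
    ("1", 100, 103, "ACG", "A"),            # hypothetical deletion
    ("1", 100, 101, "A", "ACG"),            # hypothetical insertion
]

# A set collapses identical rows, much as the GROUP BY does in the query.
unique = sorted({to_one_based(*c) for c in calls})
for row in unique:
    print(row)
```

Four input calls collapse to three unique variants: the SNV keeps its bases and gets a one-based start, the deletion becomes `CG` to `-` starting at 102, and the insertion becomes `-` to `CG` at 101.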

Page 6: A Distributed Annotation Pipeline for MSSNG

Unique Variants in BigQuery

Page 7: A Distributed Annotation Pipeline for MSSNG

38.5M Variants - Which one(s) are important?

Many different ways to prioritize, e.g. variants which are…

• Only found in specific family trees?

• In genes believed to be associated with ASD?

• Associated with biological pathways implicated in ASD?

• Damaging to a protein involved in a neurological pathway?

• In a relevant gene’s regulatory region?

• In a gene known to be pathogenic in other relevant diseases?

• Not seen in patients with some other disease?

• Only found in affected patients?

Next step is to annotate the variants with known data.
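Each of the prioritization questions on this slide amounts to a filter over the unique variants. As one hedged illustration, the sketch below implements "only found in affected patients". The data layout (a mapping from variant ID to the set of sample IDs carrying it) and all sample and variant IDs are invented for the example; the deck does not describe how the real pipeline stores this.

```python
def only_in_affected(carriers_by_variant, affected_samples):
    """Return variant IDs whose carriers are all affected individuals."""
    return sorted(
        variant_id
        for variant_id, carriers in carriers_by_variant.items()
        if carriers and carriers <= affected_samples  # non-empty subset check
    )


# Hypothetical inputs for illustration only.
affected = {"S1", "S2"}
carriers = {
    "1-100-101-A-G": {"S1"},         # carried by one affected sample
    "2-200-201-C-T": {"S1", "S3"},   # also carried by an unaffected sample
    "3-300-301-G-A": {"S1", "S2"},   # carried by affected samples only
}
print(only_in_affected(carriers, affected))
# ['1-100-101-A-G', '3-300-301-G-A']
```

The other questions would need annotation data (genes, pathways, known disease associations) joined to each variant before a similar filter could be applied.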

Page 8: A Distributed Annotation Pipeline for MSSNG

Variant Annotation

• Existing annotation pipeline written in Perl

• ANNOVAR used to associate variants with existing biological knowledge

• Moved entire infrastructure over to Google’s Cloud to allow it to use the Google Genomics API and run entirely on the Google platform

• Provide tools to manage the annotation databases and the annotation process

Page 9: A Distributed Annotation Pipeline for MSSNG
Page 10: A Distributed Annotation Pipeline for MSSNG

Annotation Jobs

Page 11: A Distributed Annotation Pipeline for MSSNG

38.5M Variants in ~36 hrs
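The headline figure implies an aggregate annotation rate, which the short calculation below works out. It uses only the two numbers on the slide and treats "~36 hrs" as exactly 36 hours; the deck does not say how many jobs or machines ran in parallel, so this is the combined rate across all of them, not a per-job rate.

```python
UNIQUE_VARIANTS = 38_500_000  # from the slide
ELAPSED_HOURS = 36            # from the slide, treated as exact here

per_hour = UNIQUE_VARIANTS / ELAPSED_HOURS
per_second = per_hour / 3600

print(f"{per_hour:,.0f} variants/hour")      # 1,069,444 variants/hour
print(f"{per_second:,.0f} variants/second")  # 297 variants/second
```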

Page 12: A Distributed Annotation Pipeline for MSSNG

Example Variant Annotations

Field     Value                           Description

ID        X-148037416-148037417-G-T       Location and DNA change

Effect    nonsynonymous SNV               Predicted effect (Leu->Phe)

Symbol    AFF2                            AF4/FMR2 family, member 2

OMIM      Mental retardation, X-linked    Known disease phenotype

HPO       HP:0000152                      Human Phenotype Ontology IDs

dbSNP     rs371160275                     dbSNP identifier
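The ID value in the table follows the pattern built by the CONCAT in the unique-variants query: reference name, start, end, reference bases and alternate bases joined by hyphens. A minimal sketch of splitting such an ID back into its parts is shown below. The `VariantId` type and `parse_variant_id` function are illustrative names, not part of the MSSNG tooling, and the approach assumes the allele fields themselves contain no hyphens.

```python
from typing import NamedTuple


class VariantId(NamedTuple):
    reference_name: str
    start: int
    end: int
    reference_bases: str
    alternate_bases: str


def parse_variant_id(variant_id: str) -> VariantId:
    """Split an ID such as 'X-148037416-148037417-G-T' into its five fields.

    Splitting from the right keeps any hyphen in the reference name intact.
    """
    name, start, end, ref, alt = variant_id.rsplit("-", 4)
    return VariantId(name, int(start), int(end), ref, alt)


v = parse_variant_id("X-148037416-148037417-G-T")
print(v.reference_name, v.start, v.end, v.reference_bases, v.alternate_bases)
# X 148037416 148037417 G T
```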

Page 13: A Distributed Annotation Pipeline for MSSNG

10B observed variants

38M unique variants

?? causative variants

https://www.mss.ng/researchers

MSSNG’s philosophy is to promote and enable ‘open science’ research to lead to a better understanding of autism. We welcome you to join us.

Page 14: A Distributed Annotation Pipeline for MSSNG

https://www.mss.ng/poster