
Lecture 3. Topics in High-Throughput Sequencing (Identification of Genetic Variations)

The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Types of genetic variation
2. Single nucleotide variants and small insertions/deletions
3. Large insertions/deletions and translocations
4. Repeats and copy number variations
5. Inversions

Last update: 20-Sep-2015

TYPES OF GENETIC VARIATION

Part 1

Genetic “variation”
• Two main definitions:
1. Differences in DNA among different individuals in a population
2. Differences in DNA between an individual and a reference (focus of this lecture)
• Sometimes, it is easy to define the reference
– The human reference sequence (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/ https://en.wikipedia.org/wiki/Human_Genome_Project)
– “Normal” genome (e.g., blood from the same cancer patient)
• Sometimes, it is not easy to define the reference
– A’s insertion with respect to B is B’s deletion with respect to A
– Which one is more “normal”?

Types of genetic variation
• Single nucleotide variants (SNVs)
– Single nucleotide polymorphisms (SNPs) if found in >1% of individuals in a population
• Small insertions/deletions (indels)
– Several nucleotides long
• Structural variations (SVs)
– Larger variations

Some proposed definitions

Term | Definition
Structural variant (SV) | A genomic alteration (e.g., a CNV, an inversion) that involves segments of DNA >1kb
Copy number variant (CNV) | A duplication or deletion event involving >1kb of DNA
Duplicon | A duplicated genomic segment >1kb in length with >90% similarity between copies
Indel | Variation from insertion or deletion event involving <1kb of DNA
Intermediate-sized structural variant (ISV) | A structural variant that is ~8kb to 40kb in size. This can refer to a CNV or a balanced structural rearrangement (e.g., an inversion)
Low copy repeat (LCR) | Similar to segmental duplication
Multisite variant (MSV) | Complex polymorphic variation that is neither a PSV nor a SNP
Paralogous sequence variant (PSV) | Sequence difference between duplicated copies (paralogs)
Segmental duplication | Duplicated region ranging from 1kb upward with a sequence identity of >90%
– Interchromosomal | Duplications distributed among nonhomologous chromosomes
– Intrachromosomal | Duplications restricted to a single chromosome
Single nucleotide polymorphism (SNP) | Base substitution involving only a single nucleotide; ~10 million are thought to be present in the human genome at >1% frequency, leading to an average of one SNP difference per 1250 bases between randomly chosen individuals

Table source: Freeman et al., Genome Research 16(8):949-961, (2006)
Legend: More commonly used

Origin of genetic variations
• SNVs: Errors during DNA replication that survive the proof-reading and mismatch-repair mechanisms

Image credit: Wikipedia; Martin and Scharff, Nature Reviews Immunology 2(8):605-614, (2002)

Origin of genetic variations
• SVs: Various mechanisms
– FoSTeS: Fork stalling and template switching
– MEI: Mobile element insertion
– NAHR: Non-allelic homologous recombination
– NHEJ: Non-homologous end-joining

Image credit: Bickhart and Liu, Frontiers in Genetics 10.3389, (2014); Xing et al., Genome Research 19(9):1516-1526, (2009)

Origin of genetic variations
• SVs: More figures

Image credit: Gu et al., PathoGenetics 1(1):4, (2008)

Consequence of genetic variants
• Hitting genes:

Image source: http://www.nbs.csudh.edu/chemistry/faculty/nsturm/CHEMXL153/DNAMutationRepair.htm

Consequence of genetic variants
• Hitting genes:
– Synonymous (silent) mutation (no change in protein sequence)
  • May still affect translational efficiency
– Nonsense mutation (premature stop codon)
– Read-through (removal of the stop codon)
– Missense mutation (change of one/a few amino acids)
– Frameshift (shifting the reading frame)
– Affecting splicing (removal/new acceptor site or donor site)
– Deletion of whole exon/gene
– Changing gene copy number
– Gene fusion
– ...
• Others (more difficult to determine):
– Disrupting protein binding sites
– Affecting gene regulation
– Affecting DNA 3D structure
– ...
• See “Effect prediction details” section of SnpEff manual (http://snpeff.sourceforge.net/SnpEff_manual.html) for more details
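The first few gene-level consequences above can be told apart by translating the affected codon before and after the change. The following is a minimal sketch, assuming the standard genetic code and a single in-frame codon substitution; the function name `classify_codon_change` is hypothetical and the sketch does not reproduce how SnpEff or any other annotator works.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify an in-frame codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"        # premature stop codon
    if ref_aa == "*":
        return "read-through"    # stop codon removed
    return "missense"            # different amino acid


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("TGG", "TGA"), ("TAA", "CAA"), ("GAG", "GTG")]:
        print(ref, "->", alt, classify_codon_change(ref, alt))
```

Frameshifts, splicing effects and regulatory effects need transcript and genome context and are outside the scope of this sketch.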

Using NGS to identify genetic variations
• General steps:
1. Align sequencing reads to reference
   OR
   Construct sequence assembly from sequencing reads
2. Look for differences
• The alignment strategy only works when accurate and efficient read alignment is possible.
– Cannot determine parts that are completely absent from the reference
• The assembly strategy only works for genomic regions that can be accurately assembled.
• In both strategies, it is also required to distinguish between sequencing errors/biases and variants.

DNA-seq vs. RNA-seq in calling variants
• Using DNA to identify genetic variants could identify variants that are not functionally significant
– Example: Fused gene due to translocation not actually expressed
• Using RNA to identify genetic variants could falsely treat post-transcriptional modifications as genetic variants
– Example: RNA editing
• In general, good to have support from both DNA and RNA data

SINGLE NUCLEOTIDE VARIANTS AND SMALL INSERTIONS/DELETIONS

Part 2

A typical pipeline
• The Genome Analysis Toolkit (GATK) workflow for calling variants in RNA-seq data (similar for DNA-seq)

Image credit: Broad Institute, https://www.broadinstitute.org/gatk/guide/tagged?tag=rnaseq

More details of pipeline
• The Genome Analysis Toolkit (GATK) workflow for calling variants in RNA-seq data

Image credit: Broad Institute, https://www.broadinstitute.org/gatk/guide/tagged?tag=rnaseq

Read re-alignment
• In standard sequence alignment, each read is aligned to reference independently.
• To discover indels accurately, re-alignment by combining information from multiple reads is recommended.
– Usually fix mis-alignments at read ends
• Example:
Reference: CGACCGT
Read 1: ACCAGT (more likely to be one insertion than two SNVs)
Read 2: CGACCA (not sure whether it is an insertion or an SNV by itself; more likely to be an insertion after considering read 1)
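The example above can be made concrete by asking which candidate haplotype explains all reads with the fewest mismatches. This is only an illustrative sketch of the idea of pooling evidence across reads, not the GATK re-alignment algorithm: it assumes ungapped placement of each read, and the candidate haplotype (the reference with an A inserted after CGACC) is supplied by hand.

```python
def min_mismatches(read, haplotype):
    """Fewest mismatches over all ungapped placements of the read on the haplotype."""
    best = len(read)
    for offset in range(len(haplotype) - len(read) + 1):
        window = haplotype[offset:offset + len(read)]
        mismatches = sum(1 for r, h in zip(read, window) if r != h)
        best = min(best, mismatches)
    return best


def total_mismatches(reads, haplotype):
    """Sum of per-read best mismatch counts: lower means a better joint explanation."""
    return sum(min_mismatches(read, haplotype) for read in reads)


REFERENCE = "CGACCGT"
INSERTION_HAPLOTYPE = "CGACCAGT"  # assumption: reference with an A inserted after CGACC
READS = ["ACCAGT", "CGACCA"]

if __name__ == "__main__":
    print("reference:", total_mismatches(READS, REFERENCE))
    print("insertion haplotype:", total_mismatches(READS, INSERTION_HAPLOTYPE))
```

Under these assumptions both reads match the insertion haplotype exactly, while each needs at least one mismatch against the unmodified reference, which is why read 2 is better interpreted as supporting the insertion.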

Read re-alignment
[Figure: read alignments before re-alignment and after re-alignment]
Image credit: DePristo et al., Nature Genetics 43(5):491-498, (2011)

Re-calibration of base quality scores
1. Assuming the observed quality score is affected by:
– Actual quality score
– Machine cycle (i.e., base position on the read)
– Di-nucleotide context (the base itself and the one before)
2. Estimating the weight of each factor using mismatches at loci not known to vary in the dbSNP database of genetic variants
– All these mismatches are assumed to be due to sequencing errors
3. Adjusting the quality scores accordingly
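The three steps above can be sketched with a much simpler scheme than fitting a weight per factor: group bases by their covariates, count mismatches at sites not known to vary, and convert the empirical error rate back to a Phred score, Q = -10 log10(p). The joint binning, the pseudocounts and the function names below are assumptions made for illustration, not the procedure used by GATK.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def recalibrate(observations):
    """Empirical quality per covariate bin.

    observations: iterable of (reported_quality, cycle, dinucleotide, is_mismatch)
    taken only from loci not known to vary, so mismatches are treated as errors.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, cycle, dinucleotide, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle, dinucleotide)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    # Pseudocounts (+1, +2) avoid log(0) for bins with no mismatches.
    return {
        key: error_prob_to_phred((mismatches + 1) / (total + 2))
        for key, (mismatches, total) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical bin: bases reported as Q30 at cycle 1 in context "AC",
    # of which 10 out of 1000 mismatch the reference.
    obs = [(30, 1, "AC", i < 10) for i in range(1000)]
    print(recalibrate(obs))
```

In this hypothetical bin the bases were reported at Q30 (error rate 0.001) but mismatch about 1% of the time, so the recalibrated score drops to roughly Q20.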

Re-calibration of base quality scores

Image credit: DePristo et al., Nature Genetics 43(5):491-498, (2011)

Calling SNVs
• Notations:
– D: data (all bases aligned to a position)
– D_i: the i-th aligned base (i.e., the base aligned to the position on the i-th read)
– G_j, G_k: genotypes
– H_{j1}, H_{j2}: alleles (haplotypes) of G_j
– ε_i: base calling error rate of the i-th aligned base
• Bayesian formulation:

\[ \Pr(G_j \mid D) = \frac{\Pr(G_j)\,\Pr(D \mid G_j)}{\sum_{G_k \in \{AA,\,AB,\,BB\}} \Pr(G_k)\,\Pr(D \mid G_k)} \]

\[ \Pr(D \mid G_j) = \prod_i \left[ \frac{\Pr(D_i \mid H_{j1})}{2} + \frac{\Pr(D_i \mid H_{j2})}{2} \right] \]

\[ \Pr(D_i \mid H_{j1}) = \begin{cases} 1 - \varepsilon_i & \text{if } D_i = H_{j1} \\ \varepsilon_i & \text{if } D_i \neq H_{j1} \end{cases} \]
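The Bayesian formulation above translates almost line by line into code. The following is a minimal sketch that follows the slide's definitions: the per-base likelihood is 1 - ε_i on a match and ε_i on a mismatch, each read is drawn from either allele with probability 1/2, and posteriors are normalized over the genotypes AA, AB and BB. The uniform prior in the usage example is an assumption, not part of the lecture.

```python
def base_likelihood(base, allele, error_rate):
    """Pr(D_i | H): 1 - eps_i if the base matches the allele, eps_i otherwise."""
    return 1.0 - error_rate if base == allele else error_rate


def genotype_posteriors(bases, error_rates, priors):
    """Pr(G_j | D) for each genotype in priors (e.g. {"AA": ..., "AB": ..., "BB": ...})."""
    unnormalized = {}
    for genotype, prior in priors.items():
        h1, h2 = genotype  # the two alleles of the genotype
        likelihood = 1.0
        for base, eps in zip(bases, error_rates):
            likelihood *= 0.5 * base_likelihood(base, h1, eps) \
                        + 0.5 * base_likelihood(base, h2, eps)
        unnormalized[genotype] = prior * likelihood
    total = sum(unnormalized.values())
    return {genotype: value / total for genotype, value in unnormalized.items()}


if __name__ == "__main__":
    uniform = {"AA": 1 / 3, "AB": 1 / 3, "BB": 1 / 3}  # assumed prior
    print(genotype_posteriors("AAAAAAAAAA", [0.01] * 10, uniform))
    print(genotype_posteriors("AAAAABBBBB", [0.01] * 10, uniform))
```

With many reads the product of likelihoods underflows, so a practical implementation would work with log-probabilities; the direct product is kept here to stay close to the formulas.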

Calling indels
• For indels, Pr(D_i|H_{j1}) is computed based on a hidden Markov model:
– I_x, I_y: The two indel haplotypes
– Gap opening penalty
– Gap extension penalty
– p_{xi,yj}: likelihood of aligning x_i and y_j
– q_{xi}: likelihood of aligning x_i and a gap

Image source: https://www.broadinstitute.org/gatk/events/slides/1307/GATKwh1-BP-5-Variant_calling.pdf
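To make the role of the gap and alignment parameters concrete, here is a sketch of the forward algorithm for a textbook three-state pair HMM (match, gap in one sequence, gap in the other). It is not the model shown on the slide or implemented in GATK: the parameter names `delta` (gap opening), `epsilon` (gap extension) and `tau` (termination), their default values, and the uniform emission probabilities are all assumptions chosen for illustration.

```python
def pair_hmm_likelihood(x, y, delta=0.05, epsilon=0.3, tau=0.05, match=0.9):
    """Forward probability that sequences x and y are related under a pair HMM.

    Assumed parameters: delta = gap opening, epsilon = gap extension,
    tau = termination, match = total emission mass on identical base pairs.
    """
    n, m = len(x), len(y)

    def p(a, b):
        # Joint emission of an aligned pair; sums to 1 over all 16 pairs.
        return match / 4 if a == b else (1 - match) / 12

    q = 0.25  # emission of a base aligned to a gap

    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]  # base of x against a gap
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]  # base of y against a gap
    f_m[0][0] = 1.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                f_m[i][j] = p(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta - tau) * f_m[i - 1][j - 1]
                    + (1 - epsilon - tau) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return tau * (f_m[n][m] + f_x[n][m] + f_y[n][m])


if __name__ == "__main__":
    read = "ACGGT"
    for haplotype in ["ACGGT", "ACGT"]:
        print(haplotype, pair_hmm_likelihood(read, haplotype))
```

Because the forward algorithm sums over all alignments, a read scores highest against the haplotype that explains it without gaps, which is the quantity an indel caller compares across candidate haplotypes.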

LARGE INSERTIONS/DELETIONS AND TRANSLOCATIONS

Part 3

Useful types of information
• Split reads: One single read aligned to two different locations on reference
– Precisely define break points
– Could be difficult to align
– Relatively rare
• Paired-end reads: The two reads in a mate pair aligned to the reference with an unexpected distance, or one read cannot be aligned
– Occur more often than split reads
– Reads easier to align
– Cannot determine precise break points
– Could be hard to judge if it is an SV due to inexact insert size
• Read depth/alignment quality: Drop of read depth/alignment quality around break points due to difficulty of alignment, and lack of aligned reads in deleted regions
– Can be observed even in standard alignment pipelines
– The drop is not always clear
– Some drops could be due to other reasons
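The paired-end signal can be sketched as a comparison between the expected insert size and the distance of the two aligned reads on the reference. The function below is a hypothetical illustration: positions are taken as the outer coordinates of the two reads, and the tolerance stands in for the spread of the insert size distribution, which a real tool would estimate from the data.

```python
def classify_pair(left_pos, right_pos, expected_insert, tolerance):
    """Label a read pair by its mapped distance on the reference.

    left_pos / right_pos: outer reference coordinates of the two reads,
    or None if the read could not be aligned.
    """
    if left_pos is None or right_pos is None:
        return "one-end-unaligned"
    observed = right_pos - left_pos
    if observed > expected_insert + tolerance:
        # Reads map farther apart than the fragment length:
        # the sample may lack sequence that the reference has.
        return "possible deletion"
    if observed < expected_insert - tolerance:
        # Reads map closer together than the fragment length:
        # the sample may carry sequence that the reference lacks.
        return "possible insertion"
    return "concordant"


if __name__ == "__main__":
    pairs = [(1000, 1300), (1000, 2500), (1000, 1100), (1000, None)]
    for left, right in pairs:
        print(left, right, classify_pair(left, right, expected_insert=300, tolerance=50))
```

The tolerance captures the point made above: because the insert size is inexact, pairs only slightly outside the expected range are weak evidence on their own.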

Useful types of information
[Figure: paired-end read signatures, comparing the expected insert size with the distance of the aligned locations on the reference]
Image credit: Keane et al., Frontiers in Genetics 10.3389, (2014)

Alignment strategies
• Split mapping
– Need to try different possible ways to split a read, or use specialized alignment algorithms
– If the split is too imbalanced, the shorter part may not be aligned (uniquely)
• Constructing junction library (also used in aligning RNA-seq reads), then aligning reads onto the putative junction sequences
– Need to first have a rough idea of the break points
– Need to try different possible junctions
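Split mapping by brute force can be sketched as follows: try every split point, require both parts to match the reference exactly and uniquely, and require the left part to come before the right part. Exact matching, the `min_anchor` parameter and the toy sequences are assumptions for illustration; real split-read aligners tolerate mismatches and use indexed search.

```python
def find_all(text, pattern):
    """All start positions (including overlapping ones) of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def split_map(read, reference, min_anchor=4):
    """Return (split_point, left_position, right_position) for every valid split."""
    hits = []
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left_hits = find_all(reference, read[:k])
        right_hits = find_all(reference, read[k:])
        # Both parts must align uniquely, with the left part upstream of the right part.
        if len(left_hits) == 1 and len(right_hits) == 1 and left_hits[0] + k <= right_hits[0]:
            hits.append((k, left_hits[0], right_hits[0]))
    return hits


REFERENCE = "ACGAGATACTGACAGATTACTGATGCAGTA"
READ = "GATACTTGATGC"  # hypothetical read spanning a deletion of reference[10:20]

if __name__ == "__main__":
    print(split_map(READ, REFERENCE))
    print(find_all(REFERENCE, "GAT"))  # a short part can match in several places
```

The second call shows why imbalanced splits are a problem: a three-base part such as GAT occurs at several positions in this reference, so it cannot be placed uniquely.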

Junction library
• Suppose we have the following rough estimate of the break points of a deletion (e.g., based on alignment of paired-end reads):

A C G A G A T A C T G A C A G A T T A C T G A T G C A G T A

• Possible junctions:

A C G A G A T G A T G C A G T A
A C G A G A T A T G C A G T A
A C G A G A T T G C A G T A
A C G A G A T G C A G T A
A C G A G A G A T G C A G T A
A C G A G A A T G C A G T A
A C G A G A T G C A G T A
A C G A G A G C A G T A
A C G A G G A T G C A G T A
A C G A G A T G C A G T A
A C G A G T G C A G T A
A C G A G G C A G T A
A C G A G A T G C A G T A
A C G A A T G C A G T A
A C G A T G C A G T A
A C G A G C A G T A
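The list of possible junctions above can be generated by joining every candidate left end with every candidate right start. In the sketch below, the candidate ranges (left parts of 7 down to 4 bases, right parts starting at 0-based positions 21 to 24) are inferred from the listed junctions; the function name is hypothetical.

```python
REFERENCE = "ACGAGATACTGACAGATTACTGATGCAGTA"


def junction_library(reference, left_ends, right_starts):
    """Join reference[:left_end] with reference[right_start:] for every candidate pair."""
    return [
        reference[:left_end] + reference[right_start:]
        for left_end in left_ends
        for right_start in right_starts
        if left_end <= right_start
    ]


# Candidate break points inferred from the slide's list of junctions.
JUNCTIONS = junction_library(REFERENCE, range(7, 3, -1), range(21, 25))

if __name__ == "__main__":
    for junction in JUNCTIONS:
        print(" ".join(junction))
    print(len(JUNCTIONS), "junctions,", len(set(JUNCTIONS)), "distinct")
```

Different break point pairs can produce the same junction sequence, so de-duplicating the library before aligning reads to it avoids redundant work.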

Real SVs vs. sequencing/alignment errors
• Real SVs are usually indicated by:
– Even coverage around break points (ladder)
– Good base quality and alignment scores

Real SVs vs. sequencing/alignment errors
• A good case:
[Figure labels: putative junction sequence, break points, paired-end reads, split reads]

Last update: 22-Sep-2015

Real SVs vs. sequencing/alignment errors
• A bad case (gray portions of reads are aligned perfectly; colored portions are mismatches; reads marked in dark red have unexpected insert sizes):
[Figure labels: break points, putative junction sequence]

Break point confusion
• SVs could be due to micro-homology at the break points:

Paternal: A C G A G A T A C T G A C A G A T T A C T G A T G C A G T A
Maternal: A C G A G A T A C T G A C A G A T T A C T G A T G C A G T A
Resulting sequence: A C G A G A T G C A G T A

– Does the GAT come from the paternal or maternal copy?
  • Does it matter?
  • It matters more if we want to know what happens to the other ends of the breaks
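One practical consequence of micro-homology is that the observed junction does not pin down a single pair of break points. The sketch below enumerates every deletion of the reference that yields the observed sequence; it treats the problem on a single reference copy, which is a simplification of the paternal/maternal question above, and the function name is hypothetical.

```python
REFERENCE = "ACGAGATACTGACAGATTACTGATGCAGTA"
OBSERVED = "ACGAGATGCAGTA"


def consistent_deletions(reference, junction):
    """All (left_length, right_start) pairs whose deletion reproduces the junction."""
    deleted = len(reference) - len(junction)
    results = []
    for left_length in range(len(junction) + 1):
        right_start = left_length + deleted
        if reference[:left_length] + reference[right_start:] == junction:
            results.append((left_length, right_start))
    return results


if __name__ == "__main__":
    for left_length, right_start in consistent_deletions(REFERENCE, OBSERVED):
        print(REFERENCE[:left_length], "|", REFERENCE[right_start:])
```

For this example there are four equally valid placements, one more than the length of the shared GAT, so the break points are only determined up to the extent of the micro-homology.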

REPEATS AND COPY NUMBER VARIATIONS

Part 4

Copy number variation
• For a diploid organism, each cell contains two copies of each chromosome.
– If a gene is unique, there are exactly two copies of it.
• Sometimes, the copy number is not 2:
– Paralogs (gene duplication – various mechanisms)
– Retro-transcription
– Aneuploidy (not exactly 2 copies of each chromosome)
  • Whole genome
  • Whole chromosome

Copy number variation
• In general, DNA regions can have a copy number other than 2 for many reasons
• Copy numbers can have significant consequences. For example,
– Haploinsufficiency (having only one copy cannot maintain function)
– Gene dosage (amount of transcripts/proteins)
– Complex phenotypic consequences (e.g., copy number of DUF1220 domain related to human brain size and diseases)

Smaller-scale repeats
• Genomes contain many types of repeats
– By size
  • Tandem repeats: one immediately after another
    – E.g., TTAGGG at telomeres: related to protection
  • Short interspersed nuclear elements (SINEs)
    – E.g., Alu elements: ~280bp, GC rich
  • Long interspersed nuclear elements (LINEs)
    – E.g., L1 elements: ~6-8kbp, AT rich
– By number of occurrences
– By mechanism: transposable elements (TEs)
  • Retrotransposons (transcription → reverse transcription): copy and paste
  • DNA transposons: cut and paste
• Some regions are defined as low complexity regions (LCRs) – regions with low information content

Identifying CNVs
• Useful information:
– For determining boundaries:
  • Split reads
  • Paired-end reads
  • Loss of heterozygosity (LOH)
– For determining both boundaries and copy number:
  • Read depth, relative to “normal”
    – Could be hard to define the “normal” line
  • B-allele frequency (BAF)
  • Long reads, if long enough
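Read depth relative to "normal" gives both a copy number estimate and approximate boundaries. The sketch below assumes depth has already been summarized per window and that the depth of a normal two-copy region is known; the window values, the simple rounding and the function names are illustrative assumptions, and real callers also correct for GC content, mappability and noise.

```python
from itertools import groupby


def estimate_copy_number(window_depths, normal_depth, normal_copies=2):
    """Copy number per window, scaling depth by the depth of a normal region."""
    return [round(normal_copies * depth / normal_depth) for depth in window_depths]


def segments(copy_numbers):
    """Merge consecutive windows with equal copy number into (start, end, copy_number)."""
    result = []
    index = 0
    for copy_number, run in groupby(copy_numbers):
        length = len(list(run))
        result.append((index, index + length - 1, copy_number))
        index += length
    return result


if __name__ == "__main__":
    depths = [30, 29, 16, 14, 31, 46]  # hypothetical mean depth per window
    copies = estimate_copy_number(depths, normal_depth=30)
    print(copies)
    print(segments(copies))
```

The result depends directly on `normal_depth`, which is the difficulty noted above: if the "normal" line is set wrongly, every window's copy number shifts with it.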

LOH
• Typically, heterozygous variants are spread throughout the genome
• A large region without heterozygous variants may indicate occurrence of CNV
– Note: Having only one copy leads to LOH, but LOH can also happen in regions with other copy numbers

BAF
• LOH only indicates regions where one allele has completely disappeared
• B-allele frequency is a more general concept: the number of reads that support the B allele (defined arbitrarily) as a ratio of the total number of reads aligned to the location that support either the A or B allele
– The concept was originally defined for microarray data

BAF, LOH and LRR
• LRR: log2(observed signal / expected signal)

An illustration of log R Ratio (LRR) and B Allele Freq (BAF) values for the chromosome 15 q-arm of an individual. A normal chromosome region has three BAF genotype clusters, as represented as AA, AB, and BB genotypes in boxes, and with LRR values centered around zero. The copy-neutral LOH region has normal LRR values, but without the AB genotype cluster. The increased copy number for a CNV region can be detected based on an increased number of peaks in the BAF distribution, as well as increased LRR values. The patterns of LRR and BAF for different CNV regions, normal regions, and copy-neutral LOH regions are distinct from each other, thus the combination of LRR and BAF can be used to generate CNV calls.

Image credit: Wang et al., Genome Research 17(11):1665-1674, (2007)
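Both quantities are simple ratios and can be computed directly from counts. The sketch below applies the definitions given in the lecture to sequencing data, using read counts per allele for BAF and read depth as the "signal" for LRR; treating read depth as the signal is an assumption, since the figure above is based on microarray intensities.

```python
import math


def b_allele_frequency(a_reads, b_reads):
    """Reads supporting the B allele as a fraction of reads supporting A or B."""
    total = a_reads + b_reads
    if total == 0:
        return None  # no informative reads at this location
    return b_reads / total


def log_r_ratio(observed_signal, expected_signal):
    """LRR = log2(observed signal / expected signal)."""
    return math.log2(observed_signal / expected_signal)


if __name__ == "__main__":
    # Hypothetical heterozygous site in a normal region and in a duplicated region.
    print(b_allele_frequency(10, 10), log_r_ratio(30, 30))
    print(b_allele_frequency(20, 10), log_r_ratio(45, 30))
```

In the second case one allele is present in two copies and the other in one, so BAF moves away from 0.5 towards 1/3 (or 2/3) while LRR rises above zero, matching the pattern described in the figure caption for increased copy number.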

INVERSIONS

Part 5

Balanced mutation
• Insertion, deletion and CNV result in copy number changes
• In contrast, translocations and inversions usually do not
– They are called “balanced mutations”
• Balanced mutations cannot be detected by checking read depth

Inversions: A closer look
• Suppose we have the following sequence:
– ACGCAT
• What would it look like if the CGCA part is inverted?
– AACGCT?
– AGCGTT?
– ATGCGT?
• Even when both strands are sequenced and inversions are present, we do not try to align a 3’-5’ sequence with a 5’-3’ sequence
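The three candidates above correspond to reversing the segment, complementing it, and reverse-complementing it. The sketch below assumes the usual convention that an inverted segment, read 5'-3' on the same strand, appears as its reverse complement; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement each base, then reverse, so the result reads 5'-3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def invert_segment(sequence, start, end):
    """Replace sequence[start:end] with its reverse complement (0-based, end exclusive)."""
    return sequence[:start] + reverse_complement(sequence[start:end]) + sequence[end:]


if __name__ == "__main__":
    original = "ACGCAT"
    print("reverse only:      ", original[:1] + original[1:5][::-1] + original[5:])
    print("complement only:   ", original[:1] + original[1:5].translate(COMPLEMENT) + original[5:])
    print("reverse complement:", invert_segment(original, 1, 5))
```

Under this convention only the reverse-complement candidate keeps every strand running 5'-3', which is consistent with the remark that a 3'-5' sequence is never aligned against a 5'-3' sequence.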

Inversion: Strand and read orientation
[Figure: reference and sequenced DNA, with four sequenced fragments and the alignments of their reads, assuming perfect alignments]

Fragment | Sequence | Alignments
1 | AACTTG | AAC TTG
2 | AACGTT | AAC TTG
3 | CTTTTG | TTC TTG
4 | CTTGTT | TTC TTG

Image credit: Okamura et al., BMC Genomics 8:160, (2007)

More on read orientations
• Some SVs are complex

Image credit: Medvedev et al., Nature Methods 6(11S):S13-S20, (2009); Pevzner, PNAS 100(13):7672-7677, (2003)

Even more on read orientations
• If a fragment is too long, one can circularize it, segment the circularized DNA again, and sequence the segment with the junction

Image source: Illumina Nextera technical note, http://www.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf

VCF files
• There is a file format defined for genetic variants called VCF (Variant Call Format).
– Specification available at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
– Two main sections: header and content
– Header provides basic information of the file, and defines content attributes and filters
– Each line in the content section represents one variant in one or more samples

An example

Example source: http://samtools.github.io/hts-specs/VCFv4.2.pdf

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
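A content line of the example can be parsed with plain string operations. The sketch below is a minimal illustration for one tab-separated record, using the first variant of the example; the function name is hypothetical, values are kept as strings apart from POS, and real projects would normally use an established VCF library instead.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_record(line, sample_names):
    """Parse one tab-separated VCF content line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # one entry per alternate allele

    # INFO is a ';'-separated list of key=value pairs; keys without a value are flags.
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info

    # FORMAT names the ':'-separated fields given for each sample.
    keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(keys, sample.split(":")))
        for name, sample in zip(sample_names, fields[9:])
    }
    return record


SAMPLES = ["NA00001", "NA00002", "NA00003"]
EXAMPLE_LINE = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)

if __name__ == "__main__":
    record = parse_vcf_record(EXAMPLE_LINE, SAMPLES)
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
    print(record["INFO"])
    print(record["samples"]["NA00003"])
```

In the genotype field, `|` separates phased alleles and `/` separates unphased ones, with 0 denoting the reference allele and 1, 2, ... the alternate alleles in order.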

Final remarks
• Some types of genetic variation take more time and need more complex methods to detect → Detect the easy ones first
1. Use standard alignment results to:
  • Detect SNVs and small indels
  • Get rough information of large indels, translocations, CNVs and inversions
2. Use unaligned reads and additional procedures to determine detailed information of the SVs

Final remarks
• Some methods call genetic variants by combining the information from multiple samples.
– Consistency among samples
– Contrast among samples (e.g., tumor vs. non-tumor from the same patient – somatic variants) [lecture]
• To study the relationships among multiple variants, one may further construct haplotypes [project] or identify epistatic interactions among variants [project].

Summary
• Genetic variations
– SNVs
– Small indels
– SVs: Large indels, translocations, CNVs, inversions
• Methods for detecting genetic variants
– Split read
– Paired-end read
  • Orientations
– Depth of coverage
– Allele ratios and frequencies