Barcode Sequence Alignment and Statistical Analysis (Barcas) tool 2016.10.05 Mun, Jihyeob and Kim, Seon-Young Korea Research Institute of Bioscience and Biotechnology


Page 1: Barcode Sequence Alignment and Statistical Analysis (Barcas) tooladmis.fudan.edu.cn/giw2016/slides/session-14/4-GIW2016.Barcas.pdf · Barcode Sequence Alignment and Statistical Analysis

Barcode Sequence Alignment and Statistical Analysis (Barcas) tool

2016.10.05
Mun, Jihyeob and Kim, Seon-Young

Korea Research Institute of Bioscience and Biotechnology

Barcode-Sequencing

- Genome-wide screening method that sequences and counts tens of thousands of individual tags (barcodes), one or more per gene, under a given condition

- Originally developed for yeast deletion libraries in species such as Saccharomyces cerevisiae and Schizosaccharomyces pombe

- Now applied to genome-wide siRNA or shRNA screening to measure the effects of gene knock-down

- With CRISPR-Cas9, also applied to genome-wide sgRNA screening to measure the effects of gene knock-out

Examples of genome-wide barcode-sequencing libraries

| Contents | Organism | # of genes | # of barcodes | References |
|---|---|---|---|---|
| Yeast deletion consortium | S. cerevisiae | 6,343 | 2 (UP and DN) | www-sequence.Stanford.edu/group/ |
| Bioneer pombe collection | S. pombe | 4,836 | 2 (UP and DN) | http://us.bioneer.com/ |
| MISSION shRNA (human) | H. sapiens | 20,018 | 129,696 shRNA | http://sigmaaldrich.com |
| MISSION shRNA (mouse) | M. musculus | 21,171 | 118,072 shRNA | http://sigmaaldrich.com |
| TRC1 (human) shRNA | H. sapiens | 16,019 | 80,717 shRNA | https://portals.broadinstitute.org/gpp/trc1/ |
| TRC1 (mouse) shRNA | M. musculus | 15,960 | 77,819 shRNA | https://portals.broadinstitute.org/gpp/trc1/ |
| Human DECIPHER (shRNA) | H. sapiens | 15,377 | 5+ shRNAs | https://www.cellecta.com |
| Mouse DECIPHER (shRNA) | M. musculus | 9,145 | 5+ shRNAs | https://www.cellecta.com |
| Cellecta Genome-wide shRNA | H. sapiens | 19,276 | 8 shRNAs | https://www.cellecta.com |
| Cellecta Genome-wide CRISPR | H. sapiens | 19,001 | 8 sgRNAs | https://www.cellecta.com |
| Human GeCKO v2 | H. sapiens | 19,050 | 123,411 sgRNA | https://www.addgene.org/ |
| Mouse GeCKO v2 | M. musculus | 20,611 | 130,209 sgRNA | https://www.addgene.org/ |
| Mouse genome-wide v1 (yusa) | M. musculus | 19,150 | 87,897 sgRNA | https://www.addgene.org/ |
| Oxford fly | Drosophila | 13,501 | 40,279 sgRNA | https://www.addgene.org/ |
| CRISPRa | H. sapiens | 15,977 | 198,810 sgRNA | https://www.addgene.org/ |
| CRISPRi | H. sapiens | 11,219 | 206,421 sgRNA | https://www.addgene.org/ |

Workflow: barcoded yeast deletion strains

Workflow: genome-wide shRNA screening

Basic format of barcode-seq data

- Universal Primer (20-25 bp)
- Barcode (20-30 bp)
- MID (Multiplexing Index, 4-6 bp)
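The read layout can be illustrated with a minimal Python sketch. It assumes the order MID, universal primer, barcode with fixed lengths and an exact primer match; the function name `split_read`, the primer sequence, and the MID are hypothetical and are not taken from Barcas.

```python
from typing import NamedTuple, Optional


class ParsedRead(NamedTuple):
    mid: str
    primer: str
    barcode: str


def split_read(read: str, mid_len: int, primer: str, barcode_len: int) -> Optional[ParsedRead]:
    """Split a read laid out as MID + universal primer + barcode.

    Assumes fixed lengths and an exact primer match (no mismatches or indels).
    Returns None when the read is too short or the primer is not where expected.
    """
    primer_end = mid_len + len(primer)
    barcode_end = primer_end + barcode_len
    if len(read) < barcode_end:
        return None
    if read[mid_len:primer_end] != primer:
        return None
    return ParsedRead(read[:mid_len], primer, read[primer_end:barcode_end])


# Hypothetical sequences, for illustration only.
MID = "ACGT"                          # 4-bp multiplexing index
PRIMER = "GATGTCCACGAGGTCTCTAA"       # 20-bp universal primer
BARCODE = "ACTGACTGACTGACTGACTG"      # 20-bp barcode

parsed = split_read(MID + PRIMER + BARCODE, mid_len=4, primer=PRIMER, barcode_len=20)
```

A read that fails either check is simply dropped here, which is the behavior that tolerant mapping is meant to avoid.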

Steps of barcode-seq data analysis

Read components: Multiplex Index (4-6 bp), Universal Primer (20 bp), Barcode (20-30 bp)

Pre-processing and QC
- Trim index
- Trim primer
- Map and count each tag

| | sample1 | sample2 | sample3 |
|---|---|---|---|
| tag1 | 3400 | 2500 | 2983 |
| tag2 | 120 | 199 | 739 |
| tag3 | 29920 | 3544 | 2232 |
| tag4 | 4300 | 3433 | 3344 |
| ... | ... | ... | ... |

Normalization and statistical analyses

Visualization
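The counting and normalization steps can be sketched in Python as follows. The slides do not say which normalization Barcas applies, so total-count scaling (counts per million) is an assumption here, and `count_tags` and `normalize_to_total` are hypothetical helper names. The count table is the one shown on the slide.

```python
from collections import Counter


def count_tags(trimmed_barcodes, library):
    """Count reads whose trimmed barcode exactly matches a library tag.

    library maps tag name -> barcode sequence.
    """
    tag_by_sequence = {seq: tag for tag, seq in library.items()}
    return Counter(tag_by_sequence[b] for b in trimmed_barcodes if b in tag_by_sequence)


def normalize_to_total(count_table, scale=1_000_000):
    """Scale each sample (column) so that its counts sum to `scale`.

    Assumed normalization for illustration; other schemes are possible.
    """
    n_samples = len(next(iter(count_table.values())))
    totals = [sum(row[j] for row in count_table.values()) for j in range(n_samples)]
    return {
        tag: [row[j] * scale / totals[j] for j in range(n_samples)]
        for tag, row in count_table.items()
    }


# Count table from the slide (columns: sample1, sample2, sample3).
count_table = {
    "tag1": [3400, 2500, 2983],
    "tag2": [120, 199, 739],
    "tag3": [29920, 3544, 2232],
    "tag4": [4300, 3433, 3344],
}
normalized = normalize_to_total(count_table)
```

After scaling, the samples become comparable even though sample1 has roughly four times as many mapped reads as the other two.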

Current tools and methods for barcode-seq data analysis

(O = supported, X = not supported)

| Tool (or method) | Pre-processing | QC | Normalization | Statistical Analysis | Visualization | Software format | Ref. |
|---|---|---|---|---|---|---|---|
| Barcas | O | O | O | O | O | Java GUI | Mun 2016 BMC Bioinfo |
| BarcodeDeconvoluter | O | X | X | X | X | Windows or Mac GUI | www.decipherproject.net/software |
| BiNGS!LS-seq & edgeR | O | O | O | O | X | R package | Kim 2012 Methods Mol Biol |
| edgeR | O | X | O | O | X | R package | Dai 2014 F1000 Res |
| HiTSelect | X | X | X | Multi-objective ranking | O | Matlab runtime | Diaz 2015 Nuc Acids Res |
| MAGeCK | O | O | O | O | X | Python, C source code | Li 2014 Genome Bio |
| MAGeCK-VISPR | O | O | O | Robust rank aggregation | O | Python script | Li 2015 Genome Bio |
| RIGER | X | X | X | RNAi Gene Enrichment Ranking | O | GENE-E (=> Morpheus), Java GUI | Luo 2008 PNAS |
| RSA | X | X | X | Iterative hypergeometric P-value | X | Windows GUI (C#), R, Perl | Konig 2007 Nat Methods |
| ScreenBEAM | X | X | X | Pooled scoring | X | R package | Yu 2015 Bioinformatics |
| shALIGN & shRNAseq | O | O | O | O | X | Perl and R script | Sims 2011 Genome Bio |

Barcas (Barcode sequence Alignment and Statistical Analysis)

- Barcas is an all-in-one program for the analysis of multiplexed barcode sequencing (barcode-seq) data

- Available at http://medical-genome.kribb.re.kr/barseq/

Input: Barcode-seq data
• Genome-wide shRNAs (Cellecta, TRC, Sigma-Aldrich, etc.)
• Genome-wide sgRNAs (Addgene, Cellecta, etc.)
• Barcoded yeast deletion strains: S. cerevisiae or S. pombe

- Preprocessing & Mapping
• Filtering, trimming, and mapping with mismatches and indels

- Quality Control (of barcodes and samples)

- Normalization

- Statistical Analysis
• Two-condition comparison, multiple time points

- Visualization
• Various graphs and heatmaps

All-in-one package with user-friendly GUI

- Step 1: Pre-processing & Mapping
- Step 2: QC of data quality
- Step 3: Design experiment
- Step 4: Statistical analysis

Step 1: Data preprocessing and mapping

- De-multiplexing and trimming (universal primers)

- Mapping with imperfect matches (mismatches and indels)

- Searching for individual tag sequences

Step 2: Data quality evaluation

- Sequence level: overall sequence quality
- Sample level: mapping counts and percentage, etc.
- Barcode (or tag) level: mapping counts and percentage, etc.

Step 3: Experimental design

- Comparison of two conditions

- Across several different time points

Step 4: Statistical analysis and visualization

- Calculates z-score and p-value for each barcode
- Ranks each barcode by z-score
- Plots z-score graph
- Plots time-dependent intensity heat map
- Allows searching for individual target genes
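A per-barcode z-score and p-value can be sketched in Python as follows. The slides do not give the formula Barcas uses, so this is an assumed approach: log2 fold change between two conditions with a pseudocount, standardized across barcodes, with a two-sided p-value from the standard normal distribution. The name `rank_by_zscore` and the pseudocount value are hypothetical.

```python
import math
from statistics import mean, stdev


def rank_by_zscore(control, treatment, pseudocount=1.0):
    """Return (tag, z, p) tuples sorted by z-score, most depleted first.

    control and treatment map tag -> (normalized) count.
    Assumed method: z-score of the log2 fold change across all tags.
    """
    tags = sorted(control)
    lfc = [
        math.log2((treatment[t] + pseudocount) / (control[t] + pseudocount))
        for t in tags
    ]
    mu, sd = mean(lfc), stdev(lfc)
    if sd == 0:
        raise ValueError("all log fold changes are identical; z-scores are undefined")
    results = []
    for tag, value in zip(tags, lfc):
        z = (value - mu) / sd
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        results.append((tag, z, p))
    return sorted(results, key=lambda item: item[1])


# Counts for sample1 and sample2 from the example count table.
control = {"tag1": 3400, "tag2": 120, "tag3": 29920, "tag4": 4300}
treatment = {"tag1": 2500, "tag2": 199, "tag3": 3544, "tag4": 3433}
ranked = rank_by_zscore(control, treatment)
```

In practice the counts would be normalized first and replicates would inform the variance; the sketch only shows the ranking idea.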

Novel functions of Barcas for data pre-processing and QC

- Flexible mapping with support for both substitutions and indels

- Detection of erroneous barcodes in the library

- Checking similarity among barcodes in the library collection

Existing tools for data preprocessing

| Name | Mismatches | Shifts of the position | Indel | Backend tool | Ref. |
|---|---|---|---|---|---|
| BiNGS!LS-seq | O | X | X | bowtie | Kim (2012) Methods Mol Bio |
| shALIGN | O | X | X | Perl script (or bowtie) | Sims (2011) Genome Bio |
| edgeR | O | O | X | edgeR | Dai (2014) F1000Res |
| Barcas | O | O | O | Trie data structure | Mun (2016) BMC Bioinfo |

Read layout: MID, Universal Primer, Barcode (shRNA)

Original barcode: TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Perfect match:    TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Mismatches:       TCAAAGATAGTCACGCGACCTCATCGACGAGCTACC
Position shift:   TCAAAGATAGTCACGCGACC-ATCGACGAGCTACC
Indel:            TCAAAGATAGTCACGCGACCTCATCGA--AGCTACC

Trie data structure

1:1 sequence matching processing
- Algorithm: list based
- Maximum time: N * M (N: read count, M: reference count)

1:M sequence matching processing
- Algorithm: tree based
- Maximum time: N (N: read count)

[Figure: a read (AGCT) compared one by one against a list of library references, versus the same read walked base by base from the root of a trie (branches A, T, G, C) built from the library references]

- Data structure based on a prefix tree
- Useful for storing a dynamic set or associative array in which the keys are usually strings
- More efficient than a hash table (or dictionary) or lists in terms of look-up speed and memory
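A minimal Python sketch of the idea: the library references are inserted into a trie once, and each read is then matched by a single walk from the root, so the cost per read depends on the read length rather than on the number of references. The class name `BarcodeTrie` and the short sequences are hypothetical; this is not the Barcas implementation.

```python
class BarcodeTrie:
    """Prefix tree over barcode sequences, stored as nested dictionaries."""

    END = "$"  # marker key that holds the barcode ID at the end of a sequence

    def __init__(self):
        self.root = {}

    def insert(self, sequence, barcode_id):
        node = self.root
        for base in sequence:
            node = node.setdefault(base, {})
        node[self.END] = barcode_id

    def lookup(self, read):
        """Return the barcode ID for an exact match, or None."""
        node = self.root
        for base in read:
            node = node.get(base)
            if node is None:
                return None
        return node.get(self.END)


# Hypothetical three-entry library.
trie = BarcodeTrie()
for barcode_id, sequence in {"ref1": "AGCT", "ref2": "AGGA", "ref3": "TCAG"}.items():
    trie.insert(sequence, barcode_id)

hit = trie.lookup("AGCT")    # "ref1"
miss = trie.lookup("AGCA")   # None
```

A list-based search would compare the read with every reference in turn; here references that share a prefix (ref1 and ref2) share the first nodes of the walk.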

1. Data structure of Barcas for mapping

- Based on the trie data structure, Barcas supports imperfect matching that allows mismatches, base shifting, and indels

- Dynamic sequence lengths
- Dynamic start positions
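One common way to get imperfect matching out of a trie is to walk it depth-first while carrying one row of the edit-distance table per node, pruning branches that already exceed the allowed number of edits. The Python sketch below shows that general technique; whether Barcas works this way is an assumption, and `build_trie` and `approximate_lookup` are hypothetical names. The two barcodes and the two reads are taken from the TRC examples in these slides (PBX2 and SKI).

```python
END = "$"  # marker key that holds the barcode ID at the end of a sequence


def build_trie(library):
    """Build a nested-dict trie from a mapping of barcode ID -> sequence."""
    root = {}
    for barcode_id, sequence in library.items():
        node = root
        for base in sequence:
            node = node.setdefault(base, {})
        node[END] = barcode_id
    return root


def approximate_lookup(root, read, max_edits=1):
    """Return {barcode_id: edits} for barcodes within max_edits of the read.

    Edits are substitutions, insertions, and deletions (Levenshtein distance).
    """
    hits = {}

    def visit(node, prev_row):
        # prev_row[j] = edit distance between this node's prefix and read[:j]
        for key, child in node.items():
            if key == END:
                if prev_row[-1] <= max_edits:
                    hits[child] = prev_row[-1]
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(read) + 1):
                substitution = prev_row[j - 1] + (read[j - 1] != key)
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, substitution))
            if min(row) <= max_edits:  # otherwise no deeper node can match
                visit(child, row)

    visit(root, list(range(len(read) + 1)))
    return hits


library = {
    "PBX2": "ATACTCCCACTTGCAACTATT",
    "SKI": "GAATCTGCCACTCTCAGAATA",
}
root = build_trie(library)
mismatch_hit = approximate_lookup(root, "ATACTCCCACTTGTAACTATT")  # one substitution
deletion_hit = approximate_lookup(root, "AATCTGCCACTCTCAGAATA")   # first base missing
```

Because the distance is computed over whole sequences, reads shorter or longer than the barcode are handled without fixing the length in advance.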

Comparison of speed and mapping rate of Barcas with bowtie and the edgeR package of R

Data
• 215 million reads were mapped to 4,832 heterozygous diploid deletion strains of S. pombe.
• 45-bp sequences were used as the barcode library.

Result
Barcas was 1.7 times faster than bowtie and 13 times faster than edgeR. Owing to indel mapping, Barcas mapped 8-12% more reads than the other two programs.

2. Detection of erroneous barcodes from the genome-wide barcode library

- It is tempting to assume that the barcode sequences in a library match the original design exactly

- However, errors can creep into the barcodes at many steps, including
• barcode synthesis,
• random mutations during library maintenance,
• erroneous incorporation of barcodes into the genome in the case of yeast strains.

Erroneous barcodes in the yeast library

Eason et al (2004) Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains PNAS 101(30):11046-51

Smith et al (2009) Quantitative phenotyping via deep barcode sequencing Genome Res 19:1836-42

| | U1 | UpTag | U2 | D2 | DnTag | D1 |
|---|---|---|---|---|---|---|
| # correct by Smith | 4,242 | 4,369 | 4,045 | 4,207 | 4,320 | 3,867 |
| % correct by Smith | 80.1% | 82.5% | 82.9% | 80.9% | 83.1% | 83.7% |
| # correct by Eason | 4,185 | 3,764 | 4,057 | 4,343 | 3,807 | 4,095 |
| % correct by Eason | 79.1% | 71.1% | 83.2% | 83.5% | 73.2% | 88.7% |
| % Agreed | 86% | 84.4% | 89.2% | 92.6% | 85.1% | 92% |

A simple method to detect erroneous barcodes

Measure the gain in counts between perfect match only (PM) and perfect match plus mismatches (PM + MM).

Case 1: dominant perfect match with minor mismatches

| Type | Sequence | Counts |
|---|---|---|
| Original design | ACTGACTGACTGACTGACTG | |
| Perfect | ACTGACTGACTGACTGACTG | 50,000 |
| Mismatch 1 | ACTGACTGACTGACTGCCTG | 10 |
| Mismatch 2 | ACTCACTGACTGACTGACTG | 9 |
| Mismatch 3 | ACTGACAGACTGACTGACTG | 20 |
| Mismatch 4 | ACTGACTGACTTACTGACTG | 3 |
| Mismatch 5 | AGTGACTGACTGACTGACTG | 7 |
| Mismatch 6 | ACTGACTGACTGACTGTCTG | 12 |
| Mismatch 7 | ACTGACTGACTAACTGACTG | 5 |
| PM only | | 50,000 |
| PM + MM | | 50,066 |
| Gain | 50,066/50,000 ≈ 1.0013 | 0.13% gain |

Case 2: one dominant mismatch with minor perfect match and other mismatches

| Type | Sequence | Counts |
|---|---|---|
| Original design | ACTGACTGACTGACTGACTG | |
| Perfect | ACTGACTGACTGACTGACTG | 200 |
| Mismatch 1 | ACTGACTGACTGACTGCCTG | 40,000 |
| Mismatch 2 | ACTCACTGACTGACTGACTG | 11 |
| Mismatch 3 | ACTGACAGACTGACTGACTG | 12 |
| Mismatch 4 | ACTGACTGACTTACTGACTG | 3 |
| Mismatch 5 | AGTGACTGACTGACTGACTG | 12 |
| Mismatch 6 | ACTGACTGACTGACTGTCTG | 9 |
| Mismatch 7 | ACTGACTGACTAACTGACTG | 5 |
| PM only | | 200 |
| PM + MM | | 40,252 |
| Gain | 40,252/200 ≈ 201.3 | about a 200-fold gain |
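The gain measure can be written out as a short Python sketch, using the per-sequence counts listed for the two cases. The function names and the flagging threshold `min_gain=2.0` are hypothetical; the slides do not state the cutoff Barcas applies.

```python
import math


def mismatch_gain(pm_count, mm_counts):
    """Ratio of (perfect match + mismatch) counts to perfect-match counts."""
    total = pm_count + sum(mm_counts)
    if pm_count == 0:
        # No perfect matches at all: the gain is unbounded if anything mapped.
        return math.inf if total > 0 else 1.0
    return total / pm_count


def looks_erroneous(pm_count, mm_counts, min_gain=2.0):
    """Flag a barcode whose counts come mostly from mismatched reads.

    min_gain is an assumed threshold chosen for illustration.
    """
    return mismatch_gain(pm_count, mm_counts) >= min_gain


# Case 1: dominant perfect match with minor mismatches.
gain_case1 = mismatch_gain(50_000, [10, 9, 20, 3, 7, 12, 5])
# Case 2: one dominant mismatch with a minor perfect match.
gain_case2 = mismatch_gain(200, [40_000, 11, 12, 3, 12, 9, 5])

flag_case1 = looks_erroneous(50_000, [10, 9, 20, 3, 7, 12, 5])
flag_case2 = looks_erroneous(200, [40_000, 11, 12, 3, 12, 9, 5])
```

A gain close to 1 means the designed sequence accounts for nearly all reads; a very large gain suggests that the sequence actually present in the library differs from the design.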

Detection of erroneous barcodes

- Library: 1,230 shRNA sequences of the TRC library.
- Data: control samples in neuroepithelial (NE), early radial glial (ERG), and mid radial glial (MRG) cells.
- We found 25 erroneous barcodes (2.03%).

Ziller, M.J. et al., Nature 2015, 518, 355-9.

Detection of erroneous barcodes (TRC)

| Gene | ID | Original sequence | Major mapped (two mismatches/indels) | PM count | MM count |
|---|---|---|---|---|---|
| PBX2 | TRCN0000285144 | ATACTCCCACTTGCAACTATT | ATACTCCCACTTGTAACTATT | 10,785 | 34,084 |
| SKI | TRCN0000010439 | GAATCTGCCACTCTCAGAATA | -AATCTGCCACTCTCAGAATA | 14 | 5,935 |
| TERF2IP | TRCN0000010356 | GAGAGTTCTTGCATTGGAACT | -AGAGTTCTTGCATTGGAACT | 4 | 1,244 |
| SKI | TRCN0000010437 | GATCGAAGACCTGCAGGTGAA | -ATCGAAGACCTGCAGGTGAA | 5 | 625 |
| MYC | TRCN0000010390 | GAATGTCAAGAGGCGAACACA | -AATGTCAAGAGGCGAACACA | 3 | 393 |
| JDP2 | TRCN0000019000 | CGGGAGAAGAACAAAGTCGCA | CGGGAGAAGAACAAAAACGCA | 46 | 508 |
| TFAP2B | TRCN0000019659 | CGGTTCTTTCGAGTTTAGTAA | CGGTTCTTTTGAGTTTTGTAA | 87 | 522 |
| NFFKB | TRCN0000014868 | CAGGGAGGTTGCATCATTGTT | CAGGGAGGGTGCATCATTGTT | 98 | 571 |
| KLF13 | TRCN0000016925 | CGGGCGAGAAGAAGTTCAGCT | CGGGCGAGAAGAAGTTCATGGT | 0 | 124 |

3. Check for sequence similarity among barcodes in a reference

- Erroneous barcodes can be generated during the production of a large barcode collection.

- If two barcodes were designed to be similar (i.e., only 1 bp difference) and mutations or sequencing errors occur, it becomes hard to distinguish errors from true differences.

- Thus, barcodes originally designed to be similar should be identified (and flagged) in advance.

- For this purpose, Barcas allows checking of sequence similarity among barcode sequences.
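A library-level similarity check can be sketched in Python as below. The slides do not describe how Barcas performs the comparison, so the sketch assumes two modes: a fixed-length comparison using Hamming distance, and a comparison that also permits indels using Levenshtein distance. The function names and the four-entry library are hypothetical, and the all-pairs loop is only suitable for small examples.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance allowing substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similar_pairs(library, max_distance=1, allow_indels=False):
    """List (id_a, id_b, distance) for barcode pairs within max_distance."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(sorted(library.items()), 2):
        distance = levenshtein(a, b) if allow_indels else hamming(a, b)
        if distance is not None and distance <= max_distance:
            pairs.append((id_a, id_b, distance))
    return pairs


# Hypothetical library: bc2 differs from bc1 by one base, bc3 lacks bc1's first base.
library = {
    "bc1": "ACTGACTGACTGACTGACTG",
    "bc2": "ACTGACTGACTGACTGCCTG",
    "bc3": "CTGACTGACTGACTGACTG",
    "bc4": "GGGGGGGGGGTTTTTTTTTT",
}
fixed_length_pairs = similar_pairs(library)
indel_aware_pairs = similar_pairs(library, allow_indels=True)
```

The indel-aware mode reports the bc1/bc3 pair that the fixed-length mode cannot see, which is the kind of difference the two comparison modes are meant to expose.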

Library reference QC

Tested public library sets (11)

| Screen | Library | Date | Species | Module | Barcode length | Barcode count | Gene count |
|---|---|---|---|---|---|---|---|
| shRNA | TRC | 05/Apr/11 | Human | | 21-bp | 61,621 | 15,435 |
| shRNA | Cellecta | 15/Feb/12 | Human | Module1 | 18-bp | 27,500 | 5,046 |
| shRNA | Cellecta | 15/Feb/12 | Human | Module2 | 18-bp | 27,500 | 5,421 |
| shRNA | Cellecta | 15/Feb/12 | Human | Module3 | 18-bp | 27,500 | 4,923 |
| sgRNA | yusa | | Mouse | | 19-bp | 87,437 | 19,149 |
| sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library A | 20-bp | 63,950 | 21,669 |
| sgRNA | GeCKOv2 | 09/Mar/15 | Human | Library B | 20-bp | 56,869 | 19,834 |
| sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library A | 20-bp | 65,959 | 22,486 |
| sgRNA | GeCKOv2 | 09/Mar/15 | Mouse | Library B | 20-bp | 61,139 | 21,263 |
| Deletion mutant strains | Heterozygous diploid | | Saccharomyces cerevisiae | | 20-bp | 6,318/UP, 6,126/DN | 6,131 |
| Deletion mutant strains | Heterozygous diploid | | Schizosaccharomyces pombe | | 20-bp | 4,832/UP, 4,832/DN | 4,832 |

Library reference QC

Barcode counts having similar pairs within one base

| Library | Static sequence length comparison | Dynamic sequence length comparison (indels) |
|---|---|---|
| GeCKOv2.Human.A | 517 (0.81%) | 538 (0.84%) |
| GeCKOv2.Human.B | 437 (0.77%) | 441 (0.78%) |
| GeCKOv2.Mouse.A | 736 (1.12%) | 755 (1.14%) |
| GeCKOv2.Mouse.B | 850 (1.39%) | 860 (1.41%) |
| yusa | 517 (0.59%) | 3,944 (4.51%) |
| Cellecta.Human.M1 | 0 (0%) | 412 (1.5%) |
| Cellecta.Human.M2 | 0 (0%) | 398 (1.45%) |
| Cellecta.Human.M3 | 0 (0%) | 410 (1.49%) |
| TRC | 790 (1.28%) | 1,909 (3.10%) |
| S. cerevisiae | 0 (0%) | 0 (0%) |
| S. pombe | 0 (0%) | 0 (0%) |

Conclusions

- Barcas is an all-in-one software package for barcode-seq data analysis, with a user-friendly interface and several new functions for data pre-processing and quality control of barcode libraries

- Future improvements: support for diverse statistical analyses
• Sophisticated gene-level summary statistics for shRNA and sgRNA (RSA, RIGER, MAGeCK, HiTSelect, ScreenBEAM, etc.)
• Multiple-condition comparison (MAGeCK-VISPR)
• Utilization of metadata and gene-set level analysis (HiTSelect)

- We hope Barcas will be useful for barcode-seq data analysis by the many researchers who have minimal bioinformatics skills


Thank you for your attention


Limits of the mapping of the edgeR package

1. Indels in the barcode reads are not supported
2. Only shifts of the barcode positions are allowed
3. Mismatches in the MID and universal primers are not allowed
4. Indels in the MID and universal primers are not allowed

Read format: MID, Universal Primer, Barcode (shRNA)

Example 1: TRC library
Different lengths of universal primers: forward 37 bp, reverse 42 bp
- Universal Primer (sense), Barcode (shRNA)
- Universal Primer (anti-sense), Barcode (shRNA)

Example 2: Cellecta library
Different MID lengths: from 9 to 17 bp
[Figure: three reads with MIDs of different lengths, each followed by the Universal Primer and the Barcode (shRNA)]

Consequences
- Loss of sequences with indels in any of the MID, primers, and barcodes
- Loss of sequences with mismatches in the MID and primers
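When the MID length varies, the barcode cannot be cut at a fixed offset. A minimal Python sketch of one way around this is to locate the universal primer first and take the barcode that follows it. The function name `extract_after_primer`, the primer, and the MIDs are hypothetical, and the primer is matched exactly here, so reads with primer mismatches or indels would still be lost.

```python
from typing import Optional, Tuple


def extract_after_primer(
    read: str, primer: str, barcode_len: int, max_mid_len: int
) -> Optional[Tuple[str, str]]:
    """Return (mid, barcode) by locating the primer at a variable offset.

    The primer must start within the first max_mid_len bases and must match
    exactly; returns None otherwise or when the read ends before the barcode.
    """
    start = read.find(primer, 0, max_mid_len + len(primer))
    if start < 0:
        return None
    barcode_start = start + len(primer)
    barcode = read[barcode_start:barcode_start + barcode_len]
    if len(barcode) < barcode_len:
        return None
    return read[:start], barcode


# Hypothetical sequences: an 18-bp barcode behind MIDs of 9 and 17 bp.
PRIMER = "GATGTCCACGAGGTCTCTAA"
BARCODE = "ACTGACTGACTGACTGAC"
short_mid, long_mid = "ACGTACGTA", "ACGTACGTACGTACGTA"

from_short = extract_after_primer(short_mid + PRIMER + BARCODE, PRIMER, 18, max_mid_len=17)
from_long = extract_after_primer(long_mid + PRIMER + BARCODE, PRIMER, 18, max_mid_len=17)
```

Both reads yield the same barcode even though it starts at different positions.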