mars - matrix of rna-seq · 2016-11-30 · rna-seq collection • rna-seq libraries are collected...

14
MaRS - Matrix of RNA-Seq ACOBIOM CINES SHAPE 12èmes Journées Fabien PIERRAT Bertrand CIROU November, the 25th 2016

Upload: others

Post on 05-Aug-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

MaRS - Matrix of RNA-Seq

1

ACOBIOM – CINES – SHAPE

12èmes Journées

Fabien PIERRAT – Bertrand CIROU

November, the 25th 2016

Page 2: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG

GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC

TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA

AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG

ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG

CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG

ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC

GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC

CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG

TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG

AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG

CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA

GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG

GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG

TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 2

The MaRS Project

• A large set of gene expression data (RNA-Seq) is published & available

• Represents a big source of information for: - the discovery of new Biomarkers

- the validation of Biomarkers

• But, due to

- Huge amount of data, huge size of files - various methods of analysis used

• These data are: - challenging to compare - unexploited all together

MaRS - Matrix of RNA-Seq

Page 3: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

000100101010011101010

010101010010001000010

010111010101101001100

101010010100101010010

010100001001001000000

011110101101110100001

010010010101001010010

101001001011110100100

101010010101010100101

010001001010100111010

100101010100100010000

100101110101011010011

001010100101001010100

100101000010010010000

000111101011011101000

010100100101010010100

101000100101111010010

010101001010101010010

101000100101010011101

010010101010010001000

AACCCGCGCCCGGGCGCTGAC

GTCATCGGCCGAGCCCGACTC

GGAACCACCGCCCGAAGTCAC

GCGCGCCTCCCGGCGGGACGT

GCCCTCTGCAGAGATCCCTCC

GCCGCCGCCTAATTTTCCAGA

TGTCCTGCCGGCCCTTCTTTC

GGGCCTCGGCTGGGCGTGGTG

ACCTCAGCGACCCCGGCGGCT

CCCGTCGCCCACTCAAGAGGG

CGCCCCCGCCCGCCACGCCCC

CAGCCCCTCTCGGGGAAGCGC

GTCCCCCGCCCGACCTCAACA

TGGCACATCCGGGAAGAAAAG

GGGTCTTCTTCCTCATGGCAT

TCGCCTCCCAAATGTCACCTT

GAGCGCTGTCCGCGATCCCAA

AAAGCACTGTTAAAGAGAAGC

GGGGATACACTGGGGTCCTTG

CGTGGGGGCCGCCTGGGAGAC

Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 3

Lib01 Lib02 Lib03 Lib04 Lib05

Gene001 0 5 76 10 0

Gene002 23 0 2 6 9

Gene003 2 95 2 0 2

Gene004 11 4 0 33 1

Gene005 0 5 5 0 53

Lib03

.fastq

Lib05

.fastq

Lib04

.fastq

Lib02

.fastq

Lib01

.fastq

MaRS (only 1 matrix)

Gene

Expression

profiles (plenty of files)

HPC - High Performance

Computing (only 1 method)

The MaRS Project

Page 4: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG

GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC

TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA

AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG

ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG

CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG

ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC

GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC

CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG

TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG

AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG

CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA

GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG

GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG

TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 4

RNA-Seq collection

• RNA-Seq libraries are collected from public databases

(NCBI, EBI)

• We selected libraries:

- with comparable data :

same type of data

same technology of sequencing

- from multiple sources (laboratories, countries)

~27000 Human RNA-Seq selected

(from multiple organs / pathologies / ethnic groups…)

Page 5: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG

GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC

TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA

AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG

ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG

CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG

ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC

GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC

CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG

TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG

AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG

CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA

GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG

GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG

TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 5

RNA-Seq diversity

T I S S U E S

Page 6: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 6

HPC Toolchain

1

2

3

4

5

compressed FASTQ

Quality & Filters

Organisation & Aligment

HTSeq

MaRS

Downloading

Cleaning

Mapping Counting

Matrix building

Page 7: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 7

Why HPC?

- Download

- Cleaning

- Mapping

- Counting

1

2

3

4

1 RNA-Seq (average)

MaRS

+ Matrix Building 5

X

27000

6 Go & 0.5 hour

4 hours

on 12 CPU =

48 hours/core

120 To &

416 Days

1.2 million

hours/core

Page 8: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 8

Many Challenges

• Very Large amount of data:

- Downloading

- Processing

• Software’s Installation, Settings on the cluster

• Optimization

- compiler choice

- packages choice

• Jobs duration & Limitations management

- Adaptation of software’s parameters for all libraries

HPC on cluster (Occigen)

Page 9: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 9

Performance & Optimization

Best with

Intel

Tests & Benchmarks

Best with

GCC

Compiler GCC vs compiler Intel

Processing Steps

Performance GCC / Intel ratio

Libraries

Page 10: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 10

Performance

Vtune analysis highlights the high consumption of

compression & decompression steps

32.2% deflate_slow

Performance & Optimization

trimmomatic

Picard SortSam

Samtools view

Page 11: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

10010010111101001001010100101010101001010

10001001010100111010100101010100100010000

10010111010001011110100001011100101001100

10101001011010011001010100101001010100100

10100001001001000000011110101101110100001

01001001010100101001010100100101111010010

01010100101010101001010100010010101001110

10100101010100100010000100101110101011010

01100101010010100101010010010100001001001

00000001111010110111010000101001001010100

10100101000100101111010010010101001010101

01001010100010010101001110101001010101001

00010000100101110101011010011001010100101

00101010010010100001001001000000011110101

10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 11

zlib decompression performance ~750 MB compressed lib

Perf

orm

an

ce

Tim

e (

s)

defaults intel

zlib flavor

cloudflare

(Smaller value is better)

Performance & Optimization

Page 12: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

000100101010011101010

010101010010001000010

010111010101101001100

101010010100101010010

010100001001001000000

011110101101110100001

010010010101001010010

101001001011110100100

101010010101010100101

010001001010100111010

100101010100100010000

100101110101011010011

001010100101001010100

100101000010010010000

000111101011011101000

010100100101010010100

101000100101111010010

010101001010101010010

101000100101010011101

010010101010010001000

AACCCGCGCCCGGGCGCTGAC

GTCATCGGCCGAGCCCGACTC

GGAACCACCGCCCGAAGTCAC

GCGCGCCTCCCGGCGGGACGT

GCCCTCTGCAGAGATCCCTCC

GCCGCCGCCTAATTTTCCAGA

TGTCCTGCCGGCCCTTCTTTC

GGGCCTCGGCTGGGCGTGGTG

ACCTCAGCGACCCCGGCGGCT

CCCGTCGCCCACTCAAGAGGG

CGCCCCCGCCCGCCACGCCCC

CAGCCCCTCTCGGGGAAGCGC

GTCCCCCGCCCGACCTCAACA

TGGCACATCCGGGAAGAAAAG

GGGTCTTCTTCCTCATGGCAT

TCGCCTCCCAAATGTCACCTT

GAGCGCTGTCCGCGATCCCAA

AAAGCACTGTTAAAGAGAAGC

GGGGATACACTGGGGTCCTTG

CGTGGGGGCCGCCTGGGAGAC

Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 12

Conclusion & Perspectives

• NOW : 95% of the Human MaRS matrix is built

• NEXT : MaRS Exploration Matrix hosting & Queries allowing:

- Web interface / Database - integrated analysis tools

RNA-Seq Data production increases continuously

- Integration of new profiles in MaRS Similar work on other species: Mouse

• BIOMARKERS discovery

Page 13: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

000100101010011101010

010101010010001000010

010111010101101001100

101010010100101010010

010100001001001000000

011110101101110100001

010010010101001010010

101001001011110100100

101010010101010100101

010001001010100111010

100101010100100010000

100101110101011010011

001010100101001010100

100101000010010010000

000111101011011101000

010100100101010010100

101000100101111010010

010101001010101010010

101000100101010011101

010010101010010001000

AACCCGCGCCCGGGCGCTGAC

GTCATCGGCCGAGCCCGACTC

GGAACCACCGCCCGAAGTCAC

GCGCGCCTCCCGGCGGGACGT

GCCCTCTGCAGAGATCCCTCC

GCCGCCGCCTAATTTTCCAGA

TGTCCTGCCGGCCCTTCTTTC

GGGCCTCGGCTGGGCGTGGTG

ACCTCAGCGACCCCGGCGGCT

CCCGTCGCCCACTCAAGAGGG

CGCCCCCGCCCGCCACGCCCC

CAGCCCCTCTCGGGGAAGCGC

GTCCCCCGCCCGACCTCAACA

TGGCACATCCGGGAAGAAAAG

GGGTCTTCTTCCTCATGGCAT

TCGCCTCCCAAATGTCACCTT

GAGCGCTGTCCGCGATCCCAA

AAAGCACTGTTAAAGAGAAGC

GGGGATACACTGGGGTCCTTG

CGTGGGGGCCGCCTGGGAGAC

Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 13

- French SME

- Discovery & validation of

new biomarkers

Partners

- Multiple missions:

HPC, digital archiving, hosting

- Occigen cluster: 2.1 Pflops (Top 100 of the most powerful computers in

the world)

PRACE - SHAPE SME HPC Adoption Programme in Europe

Page 14: MaRS - Matrix of RNA-Seq · 2016-11-30 · RNA-Seq collection • RNA-Seq libraries are collected from public databases (NCBI, EBI) • We selected libraries: -with comparable data

000100101010011101010

010101010010001000010

010111010101101001100

101010010100101010010

010100001001001000000

011110101101110100001

010010010101001010010

101001001011110100100

101010010101010100101

010001001010100111010

100101010100100010000

100101110101011010011

001010100101001010100

100101000010010010000

000111101011011101000

010100100101010010100

101000100101111010010

010101001010101010010

101000100101010011101

010010101010010001000

AACCCGCGCCCGGGCGCTGAC

GTCATCGGCCGAGCCCGACTC

GGAACCACCGCCCGAAGTCAC

GCGCGCCTCCCGGCGGGACGT

GCCCTCTGCAGAGATCCCTCC

GCCGCCGCCTAATTTTCCAGA

TGTCCTGCCGGCCCTTCTTTC

GGGCCTCGGCTGGGCGTGGTG

ACCTCAGCGACCCCGGCGGCT

CCCGTCGCCCACTCAAGAGGG

CGCCCCCGCCCGCCACGCCCC

CAGCCCCTCTCGGGGAAGCGC

GTCCCCCGCCCGACCTCAACA

TGGCACATCCGGGAAGAAAAG

GGGTCTTCTTCCTCATGGCAT

TCGCCTCCCAAATGTCACCTT

GAGCGCTGTCCGCGATCCCAA

AAAGCACTGTTAAAGAGAAGC

GGGGATACACTGGGGTCCTTG

CGTGGGGGCCGCCTGGGAGAC

Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 14

• Fabien PIERRAT

• Laurent MANCHON

• Roman BRUNO

• David PIQUEMAL

Acknowledgments

• Bertrand CIROU

• Victor CAMEO PONZ

• Francis DAUMAS*

THANK YOU for your attention

THANKS to all collaborators: