mars - matrix of rna-seq · 2016-11-30 · rna-seq collection • rna-seq libraries are collected...
TRANSCRIPT
MaRS - Matrix of RNA-Seq
1
ACOBIOM – CINES – SHAPE
12èmes Journées
Fabien PIERRAT – Bertrand CIROU
November, the 25th 2016
GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG
GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC
TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA
AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG
ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG
CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG
ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC
GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC
CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG
TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG
AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG
CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA
GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG
GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG
TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 2
The MaRS Project
• A large set of gene expression data (RNA-Seq) is published & available
• Represents a big source of information for: - the discovery of new Biomarkers
- the validation of Biomarkers
• But, due to
- Huge amount of data, huge size of files - various methods of analysis used
• These data are: - challenging to compare - unexploited all together
MaRS - Matrix of RNA-Seq
000100101010011101010
010101010010001000010
010111010101101001100
101010010100101010010
010100001001001000000
011110101101110100001
010010010101001010010
101001001011110100100
101010010101010100101
010001001010100111010
100101010100100010000
100101110101011010011
001010100101001010100
100101000010010010000
000111101011011101000
010100100101010010100
101000100101111010010
010101001010101010010
101000100101010011101
010010101010010001000
AACCCGCGCCCGGGCGCTGAC
GTCATCGGCCGAGCCCGACTC
GGAACCACCGCCCGAAGTCAC
GCGCGCCTCCCGGCGGGACGT
GCCCTCTGCAGAGATCCCTCC
GCCGCCGCCTAATTTTCCAGA
TGTCCTGCCGGCCCTTCTTTC
GGGCCTCGGCTGGGCGTGGTG
ACCTCAGCGACCCCGGCGGCT
CCCGTCGCCCACTCAAGAGGG
CGCCCCCGCCCGCCACGCCCC
CAGCCCCTCTCGGGGAAGCGC
GTCCCCCGCCCGACCTCAACA
TGGCACATCCGGGAAGAAAAG
GGGTCTTCTTCCTCATGGCAT
TCGCCTCCCAAATGTCACCTT
GAGCGCTGTCCGCGATCCCAA
AAAGCACTGTTAAAGAGAAGC
GGGGATACACTGGGGTCCTTG
CGTGGGGGCCGCCTGGGAGAC
Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 3
Lib01 Lib02 Lib03 Lib04 Lib05
Gene001 0 5 76 10 0
Gene002 23 0 2 6 9
Gene003 2 95 2 0 2
Gene004 11 4 0 33 1
Gene005 0 5 5 0 53
Lib03
.fastq
Lib05
.fastq
Lib04
.fastq
Lib02
.fastq
Lib01
.fastq
MaRS (only 1 matrix)
Gene
Expression
profiles (plenty of files)
HPC - High Performance
Computing (only 1 method)
The MaRS Project
GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG
GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC
TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA
AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG
ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG
CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG
ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC
GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC
CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG
TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG
AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG
CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA
GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG
GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG
TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 4
RNA-Seq collection
• RNA-Seq libraries are collected from public databases
(NCBI, EBI)
• We selected libraries:
- with comparable data :
same type of data
same technology of sequencing
- from multiple sources (laboratories, countries)
~27000 Human RNA-Seq selected
(from multiple organs / pathologies / ethnic groups…)
GGGTGCGGAGCGTTCCACGTCCCCTCGCAGCACCCCGCTG
GCCTTCTGGGCCGCCCCGCCCCCAACCCGCGCCCGGGCGC
TGACGTCATCGGCCGAGCCCGACTCGGAACCACCGCCCGA
AGTCACGCGCGCCTCCCGGCGGGACGTGCCCTCTGCAGAG
ATCCCTCCGCCGCCGCCTAATTTTCCAGATGTCCTGCCGG
CCCTTCTTTCGGGCCTCGGCTGGGCGTGGTGACCTCAGCG
ACCCCGGCGGCTCCCGTCGCCCACTCAAGAGGGCGCCCCC
GCCCGCCACGCCCCCAGCCCCTCTCGGGGAAGCGCGTCCC
CCGCCCGACCTCAACATGGCACATCCGGGAAGAAAAGGGG
TCTTCTTCCTCATGGCATTCGCCTCCCAAATGTCACCTTG
AGCGCTGTCCGCGATCCCAAAAAGCACTGTTAAAGAGAAG
CGGGGATACACTGGGGTCCTTGCGTGGGGGCCGCCTGGGA
GACCATGCACCGCCCGGGAGTCTCTCCGGGCAGGGTCGGG
GAGGGACTTTTTTGCCGGTGGCCGTGGAGGAATCGTCCCG
TTGAGCAATGACCCCGGTGACGCGGCTGAGGGCTATAAAA Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 5
RNA-Seq diversity
T I S S U E S
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 6
HPC Toolchain
1
2
3
4
5
compressed FASTQ
Quality & Filters
Organisation & Aligment
HTSeq
MaRS
Downloading
Cleaning
Mapping Counting
Matrix building
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 7
Why HPC?
- Download
- Cleaning
- Mapping
- Counting
1
2
3
4
1 RNA-Seq (average)
MaRS
+ Matrix Building 5
X
27000
6 Go & 0.5 hour
4 hours
on 12 CPU =
48 hours/core
120 To &
416 Days
1.2 million
hours/core
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 8
Many Challenges
• Very Large amount of data:
- Downloading
- Processing
• Software’s Installation, Settings on the cluster
• Optimization
- compiler choice
- packages choice
• Jobs duration & Limitations management
- Adaptation of software’s parameters for all libraries
HPC on cluster (Occigen)
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 9
Performance & Optimization
Best with
Intel
Tests & Benchmarks
Best with
GCC
Compiler GCC vs compiler Intel
Processing Steps
Performance GCC / Intel ratio
Libraries
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 10
Performance
Vtune analysis highlights the high consumption of
compression & decompression steps
32.2% deflate_slow
Performance & Optimization
trimmomatic
Picard SortSam
Samtools view
10010010111101001001010100101010101001010
10001001010100111010100101010100100010000
10010111010001011110100001011100101001100
10101001011010011001010100101001010100100
10100001001001000000011110101101110100001
01001001010100101001010100100101111010010
01010100101010101001010100010010101001110
10100101010100100010000100101110101011010
01100101010010100101010010010100001001001
00000001111010110111010000101001001010100
10100101000100101111010010010101001010101
01001010100010010101001110101001010101001
00010000100101110101011010011001010100101
00101010010010100001001001000000011110101
10111010000101001001010100101001010 Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 11
zlib decompression performance ~750 MB compressed lib
Perf
orm
an
ce
Tim
e (
s)
defaults intel
zlib flavor
cloudflare
(Smaller value is better)
Performance & Optimization
000100101010011101010
010101010010001000010
010111010101101001100
101010010100101010010
010100001001001000000
011110101101110100001
010010010101001010010
101001001011110100100
101010010101010100101
010001001010100111010
100101010100100010000
100101110101011010011
001010100101001010100
100101000010010010000
000111101011011101000
010100100101010010100
101000100101111010010
010101001010101010010
101000100101010011101
010010101010010001000
AACCCGCGCCCGGGCGCTGAC
GTCATCGGCCGAGCCCGACTC
GGAACCACCGCCCGAAGTCAC
GCGCGCCTCCCGGCGGGACGT
GCCCTCTGCAGAGATCCCTCC
GCCGCCGCCTAATTTTCCAGA
TGTCCTGCCGGCCCTTCTTTC
GGGCCTCGGCTGGGCGTGGTG
ACCTCAGCGACCCCGGCGGCT
CCCGTCGCCCACTCAAGAGGG
CGCCCCCGCCCGCCACGCCCC
CAGCCCCTCTCGGGGAAGCGC
GTCCCCCGCCCGACCTCAACA
TGGCACATCCGGGAAGAAAAG
GGGTCTTCTTCCTCATGGCAT
TCGCCTCCCAAATGTCACCTT
GAGCGCTGTCCGCGATCCCAA
AAAGCACTGTTAAAGAGAAGC
GGGGATACACTGGGGTCCTTG
CGTGGGGGCCGCCTGGGAGAC
Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 12
Conclusion & Perspectives
• NOW : 95% of the Human MaRS matrix is built
• NEXT : MaRS Exploration Matrix hosting & Queries allowing:
- Web interface / Database - integrated analysis tools
RNA-Seq Data production increases continuously
- Integration of new profiles in MaRS Similar work on other species: Mouse
• BIOMARKERS discovery
000100101010011101010
010101010010001000010
010111010101101001100
101010010100101010010
010100001001001000000
011110101101110100001
010010010101001010010
101001001011110100100
101010010101010100101
010001001010100111010
100101010100100010000
100101110101011010011
001010100101001010100
100101000010010010000
000111101011011101000
010100100101010010100
101000100101111010010
010101001010101010010
101000100101010011101
010010101010010001000
AACCCGCGCCCGGGCGCTGAC
GTCATCGGCCGAGCCCGACTC
GGAACCACCGCCCGAAGTCAC
GCGCGCCTCCCGGCGGGACGT
GCCCTCTGCAGAGATCCCTCC
GCCGCCGCCTAATTTTCCAGA
TGTCCTGCCGGCCCTTCTTTC
GGGCCTCGGCTGGGCGTGGTG
ACCTCAGCGACCCCGGCGGCT
CCCGTCGCCCACTCAAGAGGG
CGCCCCCGCCCGCCACGCCCC
CAGCCCCTCTCGGGGAAGCGC
GTCCCCCGCCCGACCTCAACA
TGGCACATCCGGGAAGAAAAG
GGGTCTTCTTCCTCATGGCAT
TCGCCTCCCAAATGTCACCTT
GAGCGCTGTCCGCGATCCCAA
AAAGCACTGTTAAAGAGAAGC
GGGGATACACTGGGGTCCTTG
CGTGGGGGCCGCCTGGGAGAC
Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 13
- French SME
- Discovery & validation of
new biomarkers
Partners
- Multiple missions:
HPC, digital archiving, hosting
- Occigen cluster: 2.1 Pflops (Top 100 of the most powerful computers in
the world)
PRACE - SHAPE SME HPC Adoption Programme in Europe
000100101010011101010
010101010010001000010
010111010101101001100
101010010100101010010
010100001001001000000
011110101101110100001
010010010101001010010
101001001011110100100
101010010101010100101
010001001010100111010
100101010100100010000
100101110101011010011
001010100101001010100
100101000010010010000
000111101011011101000
010100100101010010100
101000100101111010010
010101001010101010010
101000100101010011101
010010101010010001000
AACCCGCGCCCGGGCGCTGAC
GTCATCGGCCGAGCCCGACTC
GGAACCACCGCCCGAAGTCAC
GCGCGCCTCCCGGCGGGACGT
GCCCTCTGCAGAGATCCCTCC
GCCGCCGCCTAATTTTCCAGA
TGTCCTGCCGGCCCTTCTTTC
GGGCCTCGGCTGGGCGTGGTG
ACCTCAGCGACCCCGGCGGCT
CCCGTCGCCCACTCAAGAGGG
CGCCCCCGCCCGCCACGCCCC
CAGCCCCTCTCGGGGAAGCGC
GTCCCCCGCCCGACCTCAACA
TGGCACATCCGGGAAGAAAAG
GGGTCTTCTTCCTCATGGCAT
TCGCCTCCCAAATGTCACCTT
GAGCGCTGTCCGCGATCCCAA
AAAGCACTGTTAAAGAGAAGC
GGGGATACACTGGGGTCCTTG
CGTGGGGGCCGCCTGGGAGAC
Journées de la Cancéropôle Grand Sud-Ouest – November, 23-25th 2016 14
• Fabien PIERRAT
• Laurent MANCHON
• Roman BRUNO
• David PIQUEMAL
Acknowledgments
• Bertrand CIROU
• Victor CAMEO PONZ
• Francis DAUMAS*
THANK YOU for your attention
THANKS to all collaborators: