Introduction
Barbera van Schaik
Welcome
Scale of sequence data
DNA sequencing
Genome projects
Bioinformatics databases and tools
Databases
Sequence analysis
Handling sequence data
Computing
Application areas
Graduate School Bioinformatics Sequence Analysis
Introduction
Barbera van Schaik
Bioinformatics Laboratory, KEBB
Academic Medical Center
b.d.vanschaik@amsterdamumc.nl
March 9, 2020
Related Graduate School courses
• DNA technology
• Unix
• Computing in R
• Practical biostatistics
• Advanced biostatistics
• Bioinformatics
• Bioinformatics Sequence Analysis
• Research Data Management
https://www.amc.nl/web/leren/graduate-school.htm
In this course
Bioinformatics Sequence Analysis
You will learn what is behind commonly used methods for sequence analysis, how to analyze datasets with (reasonably) user-friendly interfaces, and get introduced to command-line tools for next generation sequencing (NGS)
Not in this course
1 Sequence assembly
2 Bisulphite sequencing
3 Protein sequence analysis
4 Metagenomics
Bioinformatics Sequence Analysis
1 Introduction to sequence analysis
2 Sequencing techniques
3 Brief introduction to Linux and R (self-study)
4 NGS pre-processing
5 (Multiple) sequence alignment
6 Case: Neuroblastoma
7 Introduction to R2
8 Exome sequence analysis
9 RNAseq
The focus is on human data, but many techniques are also applicable to other organisms
Practical things
Certificate
• Attend all sessions (one day can be skipped; ask about the possibility of self-study)
• Active participation
Other things
• Lunch is not included
• Coffee is available at the machines with your AMC card
• Slides and exercises are published on https://bioinformatics.amc.nl/
In this hour
Introduction
You will get an indication of the scale of sequence data, how to handle the data, where to find publicly available data and tools, and what can be done with NGS
Overview
1 Welcome
2 Scale of sequence data
• DNA sequencing
• Genome projects
3 Bioinformatics databases and tools
• Databases
• Sequence analysis
4 Handling sequence data
• Computing
• Application areas
Sanger
Automated sequencing
Sequencing centers
Next generation sequencing
Genome projects
• Human Genome Project (HGP)
• 1000 genomes project (1000g)
• UK10K and the 100K genomes project
• Personal genomes
Human Genome Project
http://web.ornl.gov/sci/techresources/Human_Genome/index.shtml
1000 genomes project
http://www.1000genomes.org/
UK10K
4000 genomes
6000 exomes
http://www.uk10k.org/
The 100K genomes project
The project will focus on patients with a rare disease and their families and patients with cancer. The first samples for sequencing are being taken from patients living in England with discussions taking place with Scotland, Wales and Northern Ireland about potential future involvement.
http://www.genomicsengland.co.uk/
Personal genomes
100,000 genomes plus medical records
http://www.personalgenomes.org/
Sequencers around the world
http://omicsmaps.com/
Sequencers around the world 2015
http://omicsmaps.com/
Big data
DNA sequencing rate
Stephens et al. (2015) PLoS Biology
GenBank, EMBL and DDBJ
International Nucleotide Sequence Database Collaboration
Daily exchange of sequence data
https://www.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/
http://www.ddbj.nig.ac.jp/
Nucleotide sequence databases
From: http://www.davelunt.net/
GenBank
Release 236 (Feb 2020) has 399,376,854,872 base pairs from 216,214,215 sequences. In addition, there are 1,206,720,688 WGS records containing 6,968,991,265,752 base pairs of sequence data.
https://www.ncbi.nlm.nih.gov/genbank/statistics/
GenBank has doubled approximately every 18 months
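The doubling statement can be turned into a rough back-of-the-envelope projection. The sketch below assumes pure exponential growth with an 18-month doubling time and starts from the Release 236 base-pair count quoted above. The function name `projected_bases` is only illustrative, and actual growth need not follow this curve.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Project a base-pair count forward, assuming steady exponential growth."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Base pairs in the traditional GenBank divisions, Release 236 (Feb 2020).
release_236 = 399_376_854_872

# Assumption: the 18-month doubling time stays constant over the whole period.
for months in (18, 36, 54):
    print(f"+{months} months: {projected_bases(release_236, months):.3e} bp")
```

Under this assumption the database grows eightfold in four and a half years, which is why storage and computing are treated as separate topics later in the lecture.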
Core databases and derivatives
Nucleotide sequence databases
Core: RNA, DNA
• GenBank
• EMBL
• DDBJ
RNA grouped per gene
• UniGene
Genome assemblies
• Human
• Model organisms
• Bacteria
• Plants
• Etc.
Genome comparisons
• Conserved regions
DNA motifs
• Protein binding sites
• Conserved regions
• DNA structure
• Restriction sites
Gene expression
• Expressed Sequence Tags (ESTs)
• RNAseq
Variants
• SNPs, insertions and deletions
• Structural variants
• Allele databases
Specialized databases
• Gene specific
• Disease specific
• Genome projects
Metagenomics
• Microbiome
• Environment samples
Protein translations
Where to start?
https://www.oxfordjournals.org/nar/database/c/
Sequence analysis
Sequence alignment
• Needleman-Wunsch
• Smith-Waterman
• BLAST
• BLAT
• ClustalW
• BWA, BFAST, Bowtie, TopHat, etc.
Sequence suites/packages
• EMBOSS package
• CLCbio workbench
• Galaxy
• R/Bioconductor
Hundreds of tools to analyse sequence data...
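To give an idea of what sits behind the first method in the list, here is a minimal sketch of Needleman-Wunsch global alignment. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) and the function name `needleman_wunsch` are assumptions chosen for illustration. Production tools use substitution matrices, affine gap penalties and heavily optimized code.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,   # (mis)match
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a

    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return (score[len(a)][len(b)],
            "".join(reversed(out_a)),
            "".join(reversed(out_b)))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the sequence lengths. That is one reason heuristic tools such as BLAST and the short-read mappers in the list exist.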
Tools
https://academic.oup.com/nar/article/47/W1/W1/5524725
Tools
Most tools are only available via the command line (on Linux systems)
Open source
Free as in freedom
You can use, change, integrate, and review the code
Open source allows sharing and promotes collaboration
No vendor lock-in
Open source
• Software
• Databases
• Journals
• Standards
• Hardware
• Art
• Money
• Drinks
• Medicine
• Fashion
• Education
https://en.wikipedia.org/wiki/Open_source
Handling sequence data
Buy a bigger cluster (centralized model)
Dutch life science grid
http://surfsara.nl/
Cloud computing
HPC cloud at SurfSara
You will use a Linux environment that runs on the HPC cloud to get acquainted with command-line tools
NGS application areas
Whole genomes
• De novo sequencing
• Re-sequencing
• Copy number variations
• Rearrangements
• New insertions/deletions/mutations
Structural variation
The Human Genome Structural Variation Working Group, Nature 2007
SNP / haplotype analysis
Linkage studies
Forensic research
Gene expression
https://en.wikipedia.org/wiki/Regulation_of_gene_expression
• Full-length transcripts
• EST sequencing
• 5’ transcript ends (5’-RATE, CAGE)
• SAGE ditag sequencing
• SAGE-like 3’ end sequencing
• Nebulized fragments
• ncRNA sequencing
Epigenetics
Treatment with sodium bisulfite
Unmethylated cytosines change into uracil
Methylated cytosines are unchanged
Compare sequences with reference sequence
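The principle can be illustrated with a small in-silico sketch. It assumes a single read from the same strand as the reference, already aligned without gaps, and that uracil is read as T after amplification. The function names `bisulfite_convert` and `call_methylation` and the toy sequence are hypothetical, and real bisulfite analysis needs dedicated aligners and handling of both strands.

```python
def bisulfite_convert(seq: str, methylated: set) -> str:
    """Simulate bisulfite treatment and sequencing of one strand.

    Unmethylated C becomes uracil, which is read as T after amplification;
    C at a position listed in `methylated` stays C.
    """
    return "".join(
        "T" if base == "C" and pos not in methylated else base
        for pos, base in enumerate(seq)
    )


def call_methylation(reference: str, read: str) -> dict:
    """Compare a converted read with the reference at every reference C."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[pos] = True       # protected from conversion: methylated
        elif read_base == "T":
            calls[pos] = False      # converted: unmethylated
        # any other base is a mismatch or error and is left uncalled
    return calls


reference = "ACGTCCGA"                      # toy reference, 0-based positions
read = bisulfite_convert(reference, methylated={1})
print(read)
print(call_methylation(reference, read))
```

A T in the read at a reference C can also be a genuine C-to-T variant, so in practice methylation levels are estimated from many reads per position.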
Metagenomics and microbial diversity
Study genomic content in a complex mixture of microorganisms (bacteria or viruses in some environment)
Identify new species
Paleogenomics
Sequencing of ancient DNA
• Mummies
• Sabretooth
• Mammoth
• Neanderthal
Gene regulation
Sequence analysis
Usually starts with sequence alignment or sequence assembly
Depending on the application, other tools/methods are used or developed
With a click of a button...
... or perhaps not. You will find out during this course.
Computer exercises in sequence analysis:
1 Via web tools
2 Creating pipelines online
3 With command-line tools in a Linux environment
Bioinformatics Sequence Analysis
1 Introduction to sequence analysis
2 Sequencing techniques
3 Brief introduction to Linux and R (self-study)
4 NGS pre-processing
5 (Multiple) sequence alignment
6 Case: Neuroblastoma
7 Introduction to R2
8 Exome sequence analysis
9 RNAseq