Nadia Atallah Purdue University Center for Cancer Research


Post on 18-Jul-2020


Page 1: Nadia Atallah, Purdue University Center for Cancer Research

Page 2: Services provided

• Consulting
• Project work and data analysis
• Method development
• Study design
• Integrating data with public-domain data
• Training
• Aid in grant writing
• Aid in manuscript preparation

Page 3: Project Management

• Ideally a bioinformatician working 40 hours a week will have 2-3 projects to work on at once.
• Difficult to manage workload
  • When is a project really finished?
  • Often a project will be “dormant” for months and then the PI will contact you upon beginning the publication
  • Take excellent notes; this will make revisiting old projects far easier
  • Back up data
  • Manage expectations: very important!
    • Do not make empty promises
    • Do not blindly say something can be done without knowing
    • Do not pretend to know everything
    • Give realistic timelines
• Communication is incredibly important for maintaining positive relationships

Page 4: Project Management

After I get a request:
1. Initial consultation
2. Upon receiving data, put project in queue
3. Begin analysis. Update PI after each major step is completed.
4. Email results with brief description
5. Type formal report
6. Meet PI to go over report, address questions
7. Perform any additional analyses
8. Aid in writing of the manuscript
9. Data deposition

Page 5: Challenges of the Job

• Communicating across disciplines
• Giving accurate timelines
• Communicating results in a manner which is accurate and understandable
• Explaining limitations of the technology
• Dealing with poor experimental design
• Fast moving field - must keep learning
  • Read
  • Classes
  • Book study groups
  • Conferences – meet as many people in the field as possible
  • Try new software for fun
• Software is often difficult to use

Page 6: What is reasonable to ask from a Bioinformatician

• Clear methods description
• Specific types of analyses
• Whether a project can be completed by a certain date (ex: for a grant submission)
• Specific types of graphs, figures
• For all data, including intermediate files
• Help for you/your student/your postdoc – depends on the bioinformatician
• Scripts – depends on the bioinformatician

Page 7: What not to ask me for...

• To move things around on your Excel spreadsheets for you
• To be your graphic designer
• To use an improper statistical design in a current analysis because it will be consistent with your old dataset
• To get something done with one day's notice...
• To try to “do something to find significant results”
• To calculate significance without replicates
• To fix poor experimental design

Page 8: Various levels of engagement

• Standard Analysis
  • Often involves the use of standardized pipelines
  • Examples: bulk RNA-Seq analysis, ChIP-seq analysis
  • Even when running a pipeline, I have to modify code to fit the project at hand
• Semi-standard Analysis
  • No high-quality standardized pipelines
  • Often have to try multiple software packages, edit existing software, write small scripts
  • Examples: lncRNA identification, single-cell RNA-Seq
• Novel Analysis
  • Much greater time commitment
  • Writing code, figuring out best algorithms to use, optimizing running time
  • Can take months (or longer)

Page 9: Various levels of engagement

Standard Analysis – all steps:
• Check data quality (FastQC)
• Trim & filter reads, remove adapters (Trimmomatic)
• Align reads to reference genome (Tophat)
• Count reads aligning to each gene (HTSeq)
• Unsupervised clustering
• Differential expression analysis (edgeR)
• GO enrichment analysis (DAVID)
• Pathway analysis (DAVID/IPA)

Novel Analysis – 3 steps out of 7:
• Check data quality (R, unix)
• Break reads into multiple segments (custom perl script)
• Map using shell script
• Try using BBMap
• Write custom script using Needleman-Wunsch algorithm
• Write a quasi-global alignment package
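The novel-analysis track above mentions a custom script built on the Needleman-Wunsch algorithm. The following is only a minimal sketch of that technique, not the script used in the project: the function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration. It fills the standard dynamic-programming table and returns the optimal global alignment score.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not the project's settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

A production aligner would also keep traceback pointers to recover the alignment itself and would typically use affine gap penalties; this sketch only shows the recurrence.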

Page 10: Service vs Collaboration

Collaboration
• Provide a quality analysis, often involving the development of novel methods
• Clear goal in mind, sometimes with the understanding that the goals may change
• Requires significant time, commitment (personal engagement), and interest
• Authorship is generally talked about up-front

Service
• Provide a quality, standard analysis
• Clear deliverables
• Most projects fall under this category
• Authorship is not guaranteed – up to the PI and also dependent on the bioinformatician’s contribution

Page 11: Using Supercomputers is Necessary for a Timely Analysis

• We have 4 nodes on two supercomputers (Conte, Snyder)
• Gives us the ability to run programs that would crash even the best personal computer.
• Speeds up data processing:
  • To map all sequence reads from one cell to the human genome: ~4.4 min on Conte versus 2 h 28 min on a MacBook Pro.
  • One project had ~550 cells to process. On Conte: ~1.7 days. On a MacBook Pro: 56 days.
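The project-level figures above follow from the per-cell timings if the cells are mapped one after another (an assumption; the slide does not say how the jobs were scheduled). A short check of the arithmetic:

```python
MINUTES_PER_DAY = 24 * 60


def total_days(n_cells, minutes_per_cell):
    """Days needed to map n_cells sequentially at a fixed per-cell time."""
    return n_cells * minutes_per_cell / MINUTES_PER_DAY


N_CELLS = 550                 # approximate cell count from the slide
CONTE_MIN = 4.4               # ~4.4 min per cell on Conte
MACBOOK_MIN = 2 * 60 + 28     # 2 h 28 min per cell on a MacBook Pro

conte_days = total_days(N_CELLS, CONTE_MIN)
macbook_days = total_days(N_CELLS, MACBOOK_MIN)
speedup = MACBOOK_MIN / CONTE_MIN

print(f"Conte: {conte_days:.1f} days")
print(f"MacBook Pro: {macbook_days:.1f} days")
print(f"Speed-up: about {speedup:.0f}x")
```

This comes out to roughly 1.7 days versus 56.5 days, consistent with the numbers on the slide.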

Page 12: Storing Data

• Data should be stored in multiple places
• Get the raw data and analysis files! It is risky to not have a copy of your data!!!
• Storing your data takes space and therefore money
  • $150 per TB/year
• Cost of service is not just time. We pay for storage, nodes, software, and IT support.
• For how long should a bioinformatics core/sequencing center store data?
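To make the storage rate above concrete, here is a small cost estimate. Only the $150 per TB per year rate comes from the slide; the dataset size, retention period, and number of copies in the example are hypothetical.

```python
COST_PER_TB_YEAR = 150  # dollars per TB per year, from the slide


def storage_cost(size_tb, years, copies=1):
    """Estimated cost in dollars of keeping `copies` copies of a dataset."""
    return size_tb * years * copies * COST_PER_TB_YEAR


# Hypothetical example: a 2 TB project kept for 3 years in two places
example_cost = storage_cost(size_tb=2, years=3, copies=2)
print(f"Estimated storage cost: ${example_cost}")
```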

Page 13: What should I tell the sequencing center I want?

• Depth, number of lanes
• Multiplexing
• Single-end versus paired-end
• Which RNA species am I interested in sequencing?
• Paired-end or single-end?
• Strand-specific?
• Length of reads
• Poly A selection or ribodepletion

Page 14: What to expect from your sequencing center

• Quality control: quality measurements before sequencing, sometimes quality information from after sequencing
• Differs depending on where you get your data sequenced – make sure you know what they are giving you!
  • Trimmed? Adapters removed? Are reads aligned?
• Ask what kit was used to prepare the libraries
  • What instrument was used, what kind of selection was used to exclude unwanted data (example: polyA selection for RNA-seq)
  • This information is necessary for publication!
• Commands/parameters/software used in processing the data
• Always get the raw data!

Page 15: RNA extraction, purification, and quality assessment

• RIN = RNA integrity number
• Generally, RIN scores >8 are good, depending on the organism
• Important to use high RIN score samples, particularly when sequencing small RNAs, to be sure you aren’t simply selecting degraded RNAs

[Figure: RNA quality trace with the 18S and 28S peaks labeled]

Page 16: Data Cleaning: a Multistep Process

1. Remove adapters: removes adapter sequences
2. Remove contamination: removes contamination from FASTQ files
3. Trim reads: trim reads based on quality
4. Separate reads: separate reads into paired and unpaired

Make sure you know where you are in the pipeline and what you have been given by your sequencing center!
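As an illustration of the "trim reads based on quality" step, the sketch below removes low-quality bases from the 3' end of a read. It is a deliberately simple stand-in, not the algorithm used by Trimmomatic or any other tool; the function name, the Phred+33 encoding, and the quality threshold of 20 are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until one has Phred quality >= min_q.

    seq and qual are the sequence and ASCII quality strings of one read;
    offset=33 assumes Phred+33 encoding.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # Hypothetical read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

Real trimmers usually add a sliding-window average and a minimum read length, which is also why a paired-end run ends up with some reads whose mate was discarded (the "unpaired" reads in step 4).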

Page 17: Quality Control – Per Base Sequence Quality

[Figure: per-base sequence quality plots, before trimming and after trimming]

Page 18: File formats - FASTQ files – what we get back from the sequencing center

• This is usually the format your data is in when sequencing is complete
• Text files
• Contains both sequence and base quality information
  • Phred score: $Q = -10 \log_{10} P$
  • $P$ is the base-calling error probability
  • Integer scores converted to ASCII characters
• Example:

@ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
+
1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?
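The Phred relationship and the four-line record layout described above can be sketched in a few lines of Python. This assumes Phred+33 encoding (the low quality characters in the slide's example, such as "(", are consistent with that offset) and unwrapped four-line records; the function names and the tiny record used in the demonstration are made up for illustration.

```python
def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the base-calling error probability."""
    return 10 ** (-q / 10)


def decode_qualities(qual, offset=33):
    """Convert an ASCII quality string to integer Phred scores."""
    return [ord(c) - offset for c in qual]


def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from unwrapped 4-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, decode_qualities(qual)


if __name__ == "__main__":
    # Hypothetical record, not taken from the slide
    demo = ["@read1", "ACGT", "+", "I5+!"]
    for read_id, seq, quals in parse_fastq(demo):
        print(read_id, seq, quals)
        print([phred_to_error_prob(q) for q in quals])
```

A score of Q20 corresponds to a 1 in 100 chance that the base call is wrong, and Q30 to 1 in 1000.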

Page 19: File formats: FASTA files

• Text file with sequences (amino acid or nucleotides)
• First line per sequence begins with > and information about sequence
• Example:

>comp2_c0_seq1
GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAACCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGAGAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAGGTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGTAAGAGC

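Reading the FASTA layout described above takes only a few lines. The sketch below is a minimal reader under the usual conventions (a ">" header line followed by one or more sequence lines); the function name and the example records are hypothetical.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical records, not taken from the slide
    demo = [">seq1 demo", "ACGT", "TTGA", ">seq2", "GGCC"]
    for name, sequence in parse_fasta(demo):
        print(name, len(sequence), sequence)
```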
Page 20: File formats: BAM and SAM files

• A SAM file is a tab-delimited text file that contains sequence alignment information
• This is what you get after aligning reads to the genome
• BAM files are simply the compressed binary version of SAM files (and can be indexed), so they are smaller
• Example:

[Figure: example SAM file, with the header lines (which begin with “@”) and the alignment section labeled]
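The two parts of a SAM file named above (header lines beginning with "@", then tab-delimited alignment lines) can be separated with plain string handling. This is a sketch for illustration only: the record type, function names, and example lines are hypothetical, and only the 11 mandatory SAM columns are parsed. For real work a dedicated library or samtools is the safer choice.

```python
from collections import namedtuple

# The 11 mandatory columns of a SAM alignment line
SamRecord = namedtuple(
    "SamRecord", "qname flag rname pos mapq cigar rnext pnext tlen seq qual"
)


def split_sam(lines):
    """Return (header_lines, alignment_records) from the lines of a SAM file."""
    headers, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
        else:
            f = line.split("\t")
            records.append(SamRecord(
                f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                f[5], f[6], int(f[7]), int(f[8]), f[9], f[10],
            ))
    return headers, records


def is_reverse_strand(record):
    """Bit 0x10 of the FLAG field marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


if __name__ == "__main__":
    # Hypothetical SAM content, not taken from the slide
    demo = [
        "@HD\tVN:1.6\tSO:coordinate",
        "@SQ\tSN:chr1\tLN:1000",
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    ]
    headers, records = split_sam(demo)
    print(len(headers), "header lines;", len(records), "alignments")
    for rec in records:
        print(rec.qname, rec.rname, rec.pos, is_reverse_strand(rec))
```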

Page 21: What do you need to tell your bioinformatician?

• Background information about your system
• Timeline
• What are your goals? Have a clear idea of your goals/hypothesis
• Is there a specific genome version you want us to use?
• Is there old data you want to compare the current data with?
• What comparisons do you want to make?
• Experimental design:
  • How many replicates? How were replicates treated/grown?
  • Is there any potential for batch effects?

Page 22: Common Issues I see Amongst Users

• Understanding significance of results (or lack of)
• Understanding the analysis
• Communication
• Understanding what bioinformaticians/statisticians do
• Knowing what is/is not possible
• What experiments to do/technology to use
• Limitations of the technology used
• Experimental design

Page 23: Common Experimental Design Problems

• To pool or not to pool samples...
• No replication
• How many reads (lanes) to sequence
• Paired-end versus single-end
• What is a biological vs technical replicate? Can you have a true biological replicate from a cell line?
• ChIP protocol – be careful to perfect and optimize protocol for your system and conditions
  • Common issues: not enough sample used, poor antibody specificity, sonication time, improper controls

Page 24: Pooling Samples in RNA-seq

• Can be beneficial if tissue is scarce or enough RNA is tough to obtain
• Utilizes more samples, could increase power due to reduced biological variability
• Danger is of a pooling bias (a difference between the value measured in the pool and the mean of the values measured in the corresponding individual replicates)
  • Can get a positive result due to only one sample in the pool
  • Might miss small alterations that disappear when only 1 sample has a different transcriptome profile than others in the pool
• Generally it is better to use one biological replicate per sample
• If you must pool, try to use the same amount of material per sample in the pool, use stringent FDR cutoffs, and use many biological reps per pool

A study evaluating the validity of two pooling strategies (3 or 8 biological replicates per pool; two pools per group) found pooling bias and low positive predictive value of DE analysis in pooled samples.

Page 25: How many reads/lanes should I sequence?

• Ask the sequencing center how many reads/lane they get per run
• Reads needed depends on experimental objectives
  • Differential gene expression? Get enough counts of each transcript such that accurate statistical inferences can be made
  • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
  • Annotation?
  • Alternative splicing analysis?

References:
1) Liu Y. et al. RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014)
2) Liu Y. et al. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One 8(6):e66883 (2013)
3) Bentley D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59 (2008)
4) Rozowsky J. et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27:65-75 (2009)
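Once the sequencing center has given a reads-per-lane figure and a target depth has been chosen, the number of lanes is a simple ceiling division. All numbers in this sketch are hypothetical placeholders; the right depth depends on the experimental objective, as the slide notes.

```python
def lanes_needed(n_samples, reads_per_sample, reads_per_lane):
    """Smallest whole number of lanes that yields the requested total reads."""
    total_reads = n_samples * reads_per_sample
    # Integer ceiling division avoids floating-point rounding surprises
    return (total_reads + reads_per_lane - 1) // reads_per_lane


# Hypothetical example: 12 multiplexed samples at 30 million reads each,
# on a lane that the sequencing center says yields 250 million reads
example_lanes = lanes_needed(12, 30_000_000, 250_000_000)
print(f"Lanes needed: {example_lanes}")
```

In practice it is worth leaving some headroom, since reads are lost to trimming, filtering, and uneven multiplexing.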

Page 26: What “counts” as a replicate in cell culture experiments?

[Figure: three cell culture designs, captioned as follows]
• Poor design, n=1
• A little better... Likely a bit more variable but still highly correlated
• Better still. Paired observations, n=3

With cell lines, there are no true biological replicates. The ideal design would have biological replicates (cells from multiple people/animals), though the ideal experiment is of course often impossible to perform.

Page 27: ChIP-seq controls

• Needed for calling peaks
• IgG: nonspecific immunoglobulin G antibodies used to estimate noise (“mock” ChIP reaction)
  • Most are not obtained from true preimmune serum from the same animal in which the specific antibody was raised
  • Immunoprecipitate much less DNA than specific antibodies do, leading to overamplification of regions from the control and less coverage of the genome
• Input: DNA isolated from cells that have been cross-linked and fragmented under the same conditions as the IP DNA
• Perform a separate control experiment for each cell line, developmental stage, and condition/treatment
• Use identical conditions to build ChIP and control sequencing libraries
  • PCR amplification cycles, fragment size, etc.

Page 28: RNA-seq Terminology

• Counts ($X_i$) = the number of reads that align to a particular feature $i$ (gene, isoform, miRNA…)
• Library size ($N$) = number of reads sequenced
• Units:
  • FPKM = Fragments Per Kilobase of exon per Million mapped reads
    • Takes length of gene ($l_i$) into account
    • $\mathrm{FPKM}_i = \dfrac{X_i}{l_i N} \times 10^9$
  • CPM = Counts Per Million mapped reads
    • $\mathrm{CPM}_i = \dfrac{X_i}{N} \times 10^6$
  • There are other units as well (TPM, RPKM, effective counts…)
  • Not all of these units fit into all the different statistical packages
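The two unit definitions above translate directly into code. In this sketch the gene length is taken in bases, which is what makes the factor of 10^9 in FPKM work out (10^3 for kilobases times 10^6 for millions of reads); the function names and example values are hypothetical.

```python
def cpm(count, library_size):
    """Counts per million: X_i / N * 10^6."""
    return count / library_size * 1e6


def fpkm(count, gene_length, library_size):
    """Fragments per kilobase per million: X_i / (l_i * N) * 10^9.

    gene_length is in bases.
    """
    return count / (gene_length * library_size) * 1e9


# Hypothetical example: 500 fragments on a 2,000-base gene
# in a library of 20 million mapped reads
example_cpm = cpm(500, 20_000_000)
example_fpkm = fpkm(500, 2_000, 20_000_000)
print(f"CPM is about {example_cpm:.1f}, FPKM is about {example_fpkm:.1f}")
```

For the same feature, FPKM equals CPM divided by the gene length in kilobases, which is why CPM alone cannot be used to compare expression between genes of different lengths.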

Page 29: Caveats of RNA-seq

• If you have zero counts, it does not necessarily mean that a gene is not expressed at all
  • Especially in single-cell RNA-seq
• RNA and protein expression profiles do not always correlate well
  • Correlations between RNA and protein expression vary widely
  • Depends on category of gene
  • Correlation coefficient distributions between gene expression and protein data were found to be bimodal (one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28) (Shankavaram et al., 2007)

Page 30: Data Deposition

• Necessary for publishing in many journals
• Necessary for NIH-supported studies
• Good scientific practice!
• Most commonly used databases to submit to:
  • Gene Expression Omnibus (GEO) – any large-scale gene expression dataset
  • Sequence Read Archive (SRA) – high-throughput sequencing data
  • dbGaP – microarray data from clinical studies; requires controlled access
• These databases are very useful both for depositing data and for data mining
• You can submit your data, obtain an accession number, and still delay making the data publicly available until publication of your manuscript

Page 31: GEO dataset

The GEO Accession display of a project generally gives multiple types of information:
• Status (when data became public)
• Title
• Organism
• Experimental type
• Summary
  • Background
  • Methods
  • Results
  • Conclusions
• Overall design
• Contributors
• Citation
• Downloads

Page 32: Data Download from GEO

• MINiML files are the same as SOFT, but in XML format
• Series matrix TXT files are tab-delimited value-matrix files. Can be imported into Excel

Page 33: To submit to GEO

• https://www.ncbi.nlm.nih.gov/geo/info/submission.html
• Fill out a metadata spreadsheet (format will be dependent on the type of data you plan to submit), then submit raw and processed data files using an FTP server
  • I use FileZilla
  • Instructions online

Page 34: Questions?