
Nadia Atallah, Purdue University Center for Cancer Research

Services provided

• Consulting
• Project work and data analysis
• Method development
• Study design
• Integrating data with public-domain data
• Training
• Aid in grant writing
• Aid in manuscript preparation

Project Management

• Ideally, a bioinformatician working 40 hours a week will have 2-3 projects to work on at once.
• It is difficult to manage the workload.
• When is a project really finished?
• Often a project will be "dormant" for months, and then the PI will contact you upon beginning the publication.
• Take excellent notes; this will make revisiting old projects far easier.
• Back up data.
• Manage expectations (very important!):
  • Do not make empty promises.
  • Do not blindly say something can be done without knowing.
  • Do not pretend to know everything.
  • Give realistic timelines.
• Communication is incredibly important for maintaining positive relationships.

Project Management

• After I get a request:
  1. Initial consultation
  2. Upon receiving data, put the project in the queue
  3. Begin analysis; update the PI after each major step is completed
  4. Email results with a brief description
  5. Type a formal report
  6. Meet with the PI to go over the report and address questions
  7. Perform any additional analyses
  8. Aid in writing the manuscript
  9. Data deposition

Challenges of the Job

• Communicating across disciplines
• Giving accurate timelines
• Communicating results in a manner which is accurate and understandable
• Explaining limitations of the technology
• Dealing with poor experimental design
• Fast-moving field: must keep learning
  • Read
  • Classes
  • Book study groups
  • Conferences: meet as many people in the field as possible
  • Try new software for fun
• Software is often difficult to use

What is reasonable to ask from a Bioinformatician

• Clear methods description
• Specific types of analyses
• Whether a project can be completed by a certain date (e.g., for a grant submission)
• Specific types of graphs and figures
• All data, including intermediate files
• Help for you, your student, or your postdoc (depends on the bioinformatician)
• Scripts (depends on the bioinformatician)

What not to ask me for

• To move things around on your Excel spreadsheets for you
• To be your graphic designer
• To use an improper statistical design in a current analysis because it will be consistent with your old dataset
• To get something done with one day's notice
• To try to "do something to find significant results"
• To calculate significance without replicates
• To fix poor experimental design

Various levels of engagement

• Standard analysis
  • Often involves the use of standardized pipelines
  • Examples: bulk RNA-Seq analysis, ChIP-seq analysis
  • Even when running a pipeline, I have to modify code to fit the project at hand
• Semi-standard analysis
  • No high-quality standardized pipelines
  • Often have to try multiple software packages, edit existing software, and write small scripts
  • Examples: lncRNA identification, single-cell RNA-Seq
• Novel analysis
  • Much greater time commitment
  • Writing code, figuring out the best algorithms to use, optimizing running time
  • Can take months (or longer)

Various levels of engagement

Novel analysis (3 steps out of 7):
• Check data quality (R, Unix)
• Break reads into multiple segments (custom Perl script)
• Map using a shell script
• Try using BBMap
• Write a custom script using the Needleman-Wunsch algorithm
• Write a quasi-global alignment package

Standard analysis (all steps):
1. Check data quality (FastQC)
2. Trim and filter reads, remove adapters (Trimmomatic)
3. Align reads to the reference genome (TopHat)
4. Count reads aligning to each gene (HTSeq)
5. Unsupervised clustering
6. Differential expression analysis (edgeR)
7. GO enrichment analysis (DAVID)
8. Pathway analysis (DAVID/IPA)
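To give a feel for what "write a custom script using the Needleman-Wunsch algorithm" involves, here is a minimal Python sketch of Needleman-Wunsch global alignment scoring. It is not the script from the project described above: the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and it returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring scheme (match/mismatch/gap) is an illustrative assumption.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + substitution  # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap                     # gap in b
            left = score[i][j - 1] + gap                   # gap in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "ACT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two sequence lengths. That is one reason novel analyses built on custom alignment code need running-time optimization.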

Service vs Collaboration

Service
• Provide a quality, standard analysis
• Clear deliverables
• Most projects fall under this category
• Authorship is not guaranteed: it is up to the PI and also depends on the bioinformatician's contribution

Collaboration
• Provide a quality analysis, often involving the development of novel methods
• Clear goal in mind, sometimes with the understanding that the goals may change
• Requires significant time, commitment (personal engagement), and interest
• Authorship is generally discussed up front

Using Supercomputers is Necessary for a Timely Analysis

• We have 4 nodes on two supercomputers (Conte, Snyder).
• This gives us the ability to run programs that would crash even the best personal computer.
• It speeds up data processing:
  • Mapping all sequence reads from one cell to the human genome takes ~4.4 min on Conte versus 2 h 28 min on a MacBook Pro.
  • One project had ~550 cells to process. On Conte: ~1.7 days. On a MacBook Pro: 56 days.
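The project totals follow directly from the per-cell mapping times. The short Python sketch below reproduces the arithmetic, under the assumption that cells are mapped one after another at the quoted per-cell rate; the variable and function names are illustrative.

```python
CELLS = 550                        # approximate number of cells in the project
CONTE_MIN_PER_CELL = 4.4           # minutes per cell on Conte
MACBOOK_MIN_PER_CELL = 2 * 60 + 28 # 2 h 28 min per cell on a MacBook Pro

MINUTES_PER_DAY = 60 * 24


def total_days(cells, minutes_per_cell):
    """Total wall-clock days if cells are processed sequentially."""
    return cells * minutes_per_cell / MINUTES_PER_DAY


conte_days = total_days(CELLS, CONTE_MIN_PER_CELL)
macbook_days = total_days(CELLS, MACBOOK_MIN_PER_CELL)
speedup = MACBOOK_MIN_PER_CELL / CONTE_MIN_PER_CELL

print(f"Conte:   {conte_days:.1f} days")
print(f"MacBook: {macbook_days:.1f} days")
print(f"Speedup: about {speedup:.0f}x")
```

The results agree with the figures on the slide: about 1.7 days on Conte and a little over 56 days on the laptop, roughly a 34-fold difference.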

Storing Data

• Data should be stored in multiple places.
• Get the raw data and analysis files! It is risky not to have a copy of your data!
• Storing your data takes space and therefore money:
  • $150 per TB per year
• The cost of service is not just time. We pay for storage, nodes, software, and IT support.
• For how long should a bioinformatics core or sequencing center store data?
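As a rough illustration of how the $150 per TB per year rate adds up, here is a small Python sketch. The dataset size and retention period in the example are hypothetical, and the function assumes a flat rate with no discounts or price changes over time.

```python
RATE_USD_PER_TB_YEAR = 150  # rate quoted on the slide


def storage_cost(terabytes, years, rate=RATE_USD_PER_TB_YEAR):
    """Cost in USD of keeping `terabytes` of data for `years` at a flat rate."""
    return terabytes * years * rate


if __name__ == "__main__":
    # Hypothetical example: 2.5 TB of raw and intermediate files kept for 5 years.
    print(f"${storage_cost(2.5, 5):,.2f}")
```

Even a modest dataset becomes a recurring cost once it is kept for several years, which is why the retention question above matters.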

What should I tell the sequencing center I want?

• Depth, number of lanes
• Multiplexing
• Single-end versus paired-end
• Which RNA species am I interested in sequencing?
• Strand-specific?
• Length of reads
• Poly(A) selection or ribodepletion

What to expect from your sequencing center

• Quality control: quality measurements before sequencing, sometimes quality information from after sequencing
• What you receive differs depending on where you get your data sequenced, so make sure you know what they are giving you!
  • Trimmed? Adapters removed? Are reads aligned?
• Ask what kit was used to prepare the libraries
  • What instrument was used, and what kind of selection was used to exclude unwanted data (for example, poly(A) selection for RNA-seq)
  • This information is necessary for publication!
• Commands, parameters, and software used in processing the data
• Always get the raw data!

RNA extraction, purification, and quality assessment

• RIN = RNA integrity number
• Generally, RIN scores >8 are good, depending on the organism
• It is important to use samples with high RIN scores, particularly when sequencing small RNAs, to be sure you aren't simply selecting degraded RNAs

[Figure labels: 18S, 28S]

Data Cleaning: a Multistep Process

1. Remove adapters: removes adapter sequences
2. Remove contamination: removes contamination from FASTQ files
3. Trim reads: trims reads based on quality
4. Separate reads: separates reads into paired and unpaired

Make sure you know where you are in the pipeline and what you have been given by your sequencing center!

Quality Control – Per Base Sequence Quality

[Figure: per-base sequence quality plots, before trimming and after trimming]

File formats: FASTQ files (what we get back from the sequencing center)

• This is usually the format your data is in when sequencing is complete
• Text files
• Contain both sequence and base quality information
  • Phred score: $Q = -10 \log_{10} P$
  • $P$ is the base-calling error probability
  • Integer scores are converted to ASCII characters
• Example:

@ILLUMINA:188:C03MYACXX:4:1101:3001:1999 1:N:0:CGATGT
TACTTGTTACAGGCAATACGAGCAGCTTCCAAAGCTTCACTAGAGACATTTTCTTTCTCCCAACTCACAAGATGAACACAAAATGGAAACT
+
1=DDFFFHHHHHJJDGHHHIJIJIIJJIJIIIGIIGJIIIJCHEIIJGIJJIJIIJIJIFGGGGGIJIFFBEFDC>@@BB?A9@3;@(553>@>C(59:?
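The Phred relationship and the ASCII encoding can be illustrated with a few lines of Python. This is a sketch, not part of any particular tool: the function names are made up for illustration, and the offset of 33 (the Sanger / Illumina 1.8+ convention) is an assumption, since the slide does not state which encoding the file uses.

```python
def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P): return the base-calling error probability P."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    # First ten quality characters of the example read above.
    for q in decode_quality("1=DDFFFHHH"):
        print(q, f"{phred_to_error_prob(q):.5f}")
```

Under this encoding, a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.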

File formats: FASTA files

• Text file with sequences (amino acid or nucleotide)
• The first line of each sequence begins with > followed by information about the sequence
• Example:

>comp2_c0_seq1
GCGAGATGATTCTCCGGTTGAATCAGATCCAGAGGCATGTATATATCGTCTGCAAAATGCTAGAAACCCTCATGTGTGTAATGCAGTGCATTCATGAAAACCTTGTAAGCTCACGTGTCGCTGACTGTCTGAGAACCGACTCGCTAATGTTCCATGGAGTGGCTGCATACATCACAGATTGTGATTCCAGGTTGCGAGACTATTTGCAGGATGCATGCGAGCTGATTGCCTATTCCTTCTACTTCTTAAATAAAGTAAGAGC
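Because the format is so simple, a FASTA file can be read with a few lines of Python. The sketch below is a minimal illustration with a hypothetical function name; it uses the whole header line as the record name and joins sequences that wrap across several lines. The second record in the example (`seq2`) is invented for the demonstration. For real work an established parser such as Biopython's `SeqIO` is usually the better choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: everything after '>' describes the sequence.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence line: may be one of several wrapped lines.
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


if __name__ == "__main__":
    example = ">comp2_c0_seq1\nGCGAGATGATTC\nTCCGG\n>seq2\nACGT\n"
    for header, sequence in parse_fasta(example).items():
        print(header, len(sequence))
```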

File formats: BAM and SAM files

• A SAM file is a tab-delimited text file that contains sequence alignment information
• This is what you get after aligning reads to the genome
• BAM files are simply the binary version (compressed and indexable) of SAM files, so they are smaller
• Example:

[Figure: example SAM file showing header lines (which begin with "@") followed by the alignment section]

What do you need to tell your bioinformatician?

• Background information about your system
• Timeline
• What are your goals? Have a clear idea of your goals/hypothesis
• Is there a specific genome version you want us to use?
• Is there old data you want to compare the current data with?
• What comparisons do you want to make?
• Experimental design:
  • How many replicates? How were replicates treated/grown?
  • Is there any potential for batch effects?

Common Issues I See Among Users

• Understanding the significance of results (or lack thereof)
• Understanding the analysis
• Communication
• Understanding what bioinformaticians/statisticians do
• Knowing what is and is not possible
• What experiments to do and which technology to use
• Limitations of the technology used
• Experimental design

Common Experimental Design Problems

• To pool or not to pool samples
• No replication
• How many reads (lanes) to sequence
• Paired-end versus single-end
• What is a biological versus a technical replicate? Can you have a true biological replicate from a cell line?
• ChIP protocol: be careful to perfect and optimize the protocol for your system and conditions
  • Common issues: not enough sample used, poor antibody specificity, sonication time, improper controls

Pooling Samples in RNA-seq

• Can be beneficial if tissue is scarce or enough RNA is tough to obtain
• Utilizes more samples; could increase power due to reduced biological variability
• The danger is a pooling bias (a difference between the value measured in the pool and the mean of the values measured in the corresponding individual replicates)
• Can get a positive result due to only one sample in the pool
• Might miss small alterations, which can disappear when only 1 sample has a different transcriptome profile than the others in the pool
• Generally it is better to use one biological replicate per sample
• If you must pool, try to use the same amount of material per sample in the pool, use stringent FDR cutoffs, and use many biological replicates per pool

One study evaluated the validity of two pooling strategies (3 or 8 biological replicates per pool; two pools per group). It found pooling bias and low positive predictive value of DE analysis in pooled samples.

How many reads/lanes should I sequence?

• Ask the sequencing center how many reads per lane they get per run
• The number of reads needed depends on the experimental objectives:
  • Differential gene expression? Get enough counts of each transcript that accurate statistical inferences can be made
  • De novo transcriptome assembly? Maximize coverage of rare transcripts and transcriptional isoforms
  • Annotation?
  • Alternative splicing analysis?

1) Liu Y., et al., RNA-seq differential expression studies: more sequence or more replication? Bioinformatics 30(3):301-304 (2014)
2) Liu Y., et al., Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS ONE 8(6):e66883 (2013)
3) Bentley D. R., et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59 (2008)
4) Rozowsky J., et al., PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotech. 27:65-75 (2009)

What “counts” as a replicate in cell culture experiments?

[Figure panels]
• Poor design, n=1
• A little better: likely a bit more variable but still highly correlated
• Better still: paired observations, n=3

With cell lines, there are no true biological replicates. The ideal design would have biological replicates (cells from multiple people or animals), though the ideal experiment is of course often impossible to perform.

ChIP-seq controls

• Needed for calling peaks
• IgG: nonspecific immunoglobulin G antibodies used to estimate noise ("mock" ChIP reaction)
  • Most are not obtained from true preimmune serum from the same animal in which the specific antibody was raised
  • They immunoprecipitate much less DNA than specific antibodies do, leading to overamplification of regions from the control and less coverage of the genome
• Input: DNA isolated from cells that have been cross-linked and fragmented under the same conditions as the IP DNA
• Perform a separate control experiment for each cell line, developmental stage, and condition/treatment
• Use identical conditions to build the ChIP and control sequencing libraries
  • PCR amplification cycles, fragment size, etc.

RNA-seq Terminology: Units

• Counts ($X_i$) = the number of reads that align to a particular feature $i$ (gene, isoform, miRNA, ...)
• Library size ($N$) = number of reads sequenced
• FPKM = fragments per kilobase of exon per million mapped reads
  • Takes the length of the gene ($l_i$) into account
  • $\mathrm{FPKM}_i = \dfrac{X_i}{l_i \, N} \times 10^9$
• CPM = counts per million mapped reads
  • $\mathrm{CPM}_i = \dfrac{X_i}{N} \times 10^6$
• There are other units as well (TPM, RPKM, effective counts, ...)
• Not all of these units fit into all the different statistical packages
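The CPM and FPKM definitions above translate directly into code. The following Python sketch applies them to a single feature; the function names and the example numbers (500 reads, a 2,000-base gene, 20 million reads sequenced) are hypothetical, and gene length is assumed to be given in bases.

```python
def cpm(count, library_size):
    """Counts per million: X_i / N * 10^6."""
    return count / library_size * 1e6


def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase per million: X_i / (l_i * N) * 10^9."""
    return count / (gene_length_bp * library_size) * 1e9


if __name__ == "__main__":
    # Hypothetical gene: 500 reads, 2,000 bases long, 20 million reads sequenced.
    print(f"CPM:  {cpm(500, 20_000_000):.2f}")
    print(f"FPKM: {fpkm(500, 2_000, 20_000_000):.2f}")
```

FPKM is simply CPM divided by the gene length in kilobases, which is why a long gene and a short gene with the same read count get different FPKM values but the same CPM.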

Caveats of RNA-seq

• If you have zero counts, it does not necessarily mean that a gene is not expressed at all
  • Especially in single-cell RNA-seq
• RNA and protein expression profiles do not always correlate well
  • Correlations between RNA and protein expression vary widely
  • Depends on the category of gene
  • Correlation coefficient distributions between gene expression and protein data were found to be bimodal: one group of gene products had a mean correlation of 0.71; the other had a mean correlation of 0.28 (Shankavaram et al., 2007)

Data Deposition

• Necessary for publishing in many journals
• Necessary for NIH-supported studies
• Good scientific practice!
• Most commonly used databases to submit to:
  • Gene Expression Omnibus (GEO): any large-scale gene expression dataset
  • Sequence Read Archive (SRA, formerly the Short Read Archive): high-throughput sequencing data
  • dbGaP: microarray data from clinical studies; requires controlled access
• These databases are very useful both for submitting data and for data mining
• You can submit your data, obtain an accession number, and still delay making the data publicly available until your manuscript is published

GEO dataset

• The GEO Accession display of a project generally gives multiple types of information:
  • Status (when data became public)
  • Title
  • Organism
  • Experiment type
  • Summary
    • Background
    • Methods
    • Results
    • Conclusions
  • Overall design
  • Contributors
  • Citation
  • Downloads

Data Download from GEO

• MINiML files are the same as SOFT, but in XML format
• Series matrix TXT files are tab-delimited value-matrix files; they can be imported into Excel

To submit to GEO

• https://www.ncbi.nlm.nih.gov/geo/info/submission.html
• Fill out a metadata spreadsheet (the format depends on the type of data you plan to submit), then submit raw and processed data files using an FTP server
  • I use FileZilla
  • Instructions are online

Questions?
