Download - MCB3895-004 Lecture #10 Sept 25/14
![Page 1: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/1.jpg)
MCB3895-004 Lecture #10Sept 25/14
SRA, Illumina data QC
![Page 2: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/2.jpg)
Underwstanding the BBC cluster• What is the cluster? Many individual computers
controlled by a "head node"• The head node is what you log onto by default
using SSH• It is bad etiquette to run things off the head
node• Can slow down the entire system
![Page 3: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/3.jpg)
Using the cluster - method 1
• When you SSH in, use the "qlogin" command to take you to a subordinate note
• Running programs here will not disrupt the head node
• You need to stay connected to the network until all of your programs are completed
![Page 4: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/4.jpg)
Checking a qlogin job
• Use the terminal command "top"
• Shows all processes running on your node
• kill a process by pressing "k" and then entering its PID when prompted
![Page 5: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/5.jpg)
Using the cluster - method 2
• Use the command "qsub" combined with a shell script
• e.g., qsub script.sh• shell is a programming language commonly
used for controlling actual processes• The BBC has example scripts for you to modify:
http://bioinformatics.uconn.edu/understanding-the-bbc-cluster-and-sge/
• This method allows you to walk away once your script is running
![Page 6: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/6.jpg)
qsub bash script#!/bin/bash
################################################################## TEMPLATE SGE SCRIPT - BLAST EXAMPLE ######################## /common/sge_templates/template_single.sh ###########################################################################
# Specify the name of the data file to be usedINPUTFILENAME="test.fasta"
# Name the directory (assumed to be a direct subdir of $HOME) from which the file is listedPROJECT_SUBDIR="test"
# Specify name to be used to identify this run#$ -N blastp_job
# Email address (change to yours)#$ -M [email protected]
# Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=abortion#$ -m besa
![Page 7: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/7.jpg)
qsub bash script# Specify that bash shell should be used to process this script#$ -S /bin/bash
# Specify working directory (created on compute node used to do the work)WORKING_DIR="/scratch/$USER/$PROJECT_SUBDIR-$JOB_ID"
# If working directory does not exist, create it# The -p means "create parent directories as needed"if [ ! -d "$WORKING_DIR" ]; thenmkdir -p $WORKING_DIRfi
# Specify destination directory (this will be subdirectory of your home directory)DESTINATION_DIR="$HOME/$PROJECT_SUBDIR/$JOB_ID-$INPUTFILENAME"
# If destination directory does not exist, create it# The -p in mkdir means "create parent directories as needed"if [ ! -d "$DESTINATION_DIR" ]; thenmkdir -p $DESTINATION_DIRfi
![Page 8: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/8.jpg)
qsub bash script# navigate to the working directory
cd $WORKING_DIR
# Get script and input data from home directory and copy to the working directory
cp $HOME/$PROJECT_SUBDIR/$INPUTFILENAME ./test.fasta
cp $HOME/template_single.sh .
# Specify the output file
#$ -o $JOB_ID.out
# Specify the error file
#$ -e $JOB_ID.err
# Run the program
blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 -num_descriptions 5 -out my-results
# copy output files back to your home directory
cp * $DESTINATION_DIR
# clear scratch directory
rm -rf $WORKING_DIR
![Page 9: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/9.jpg)
Checking a qsub job
• Use "qstat" to understand the status of your job
• Shows jobs waiting to be executed• Monitor a running job's status using qstat -j <job_number>
• Retrieve information about a completed job using qaact -j <job_number>
![Page 10: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/10.jpg)
Cluster etiquette
• Never run something on the head node!• Always check that your processes will run
correctly before starting a large task! • Best strategy: run commands individually using a
reduced input dataset• If using a loop to execute multiple commands, only
go through a single iteration (e.g., use die)
![Page 11: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/11.jpg)
The first part of Assignment #4
• Write a perl script that subsamples the first ~10000 reads of your input fasta file(s)
• Allows you to do quick troubleshooting• Can be modified later to examine the effect of
sampling depth
![Page 12: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/12.jpg)
SRA
• "Sequence Read Archive"• http://www.ncbi.nlm.nih.gov/sra• The part of NCBI that holds raw sequencing
data• Generally, this is where you need to put your
raw data when you publish genomic research
![Page 13: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/13.jpg)
SRA
![Page 14: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/14.jpg)
A SRA record
![Page 15: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/15.jpg)
SRA run browser
![Page 16: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/16.jpg)
For kicks…
• Go to http://www.ncbi.nlm.nih.gov/sra• Search "Escherichia coli MG1655"• Note different results
• Different sequencing platforms• Note mutant strains!
• Try "Escherichia coli MG1655 Pacbio"
![Page 17: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/17.jpg)
Downloading SRA data
• Possible to do from web browser, but transferring large files is cumbersome
• Better: use NCBI's SRA Toolkit on the BBC server to perform file conversion while downloading
/opt/bioinformatics2/sratoolkit.2.3.5-2-centos_linux64/bin/fastq-dump --split-files SRR826450
• Decompresses files, splits paired ends into separate files
![Page 18: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/18.jpg)
The fastq file format
• 4 lines per sequence:• Line 1: begins with "@", followed by sequence ID• Line 2: raw sequence data• Line 3: begins with "+", may have sequence ID• Line 4: Phred quality score for each position, in ASCII
@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
ASCII from low (left) to high (right):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
http://en.wikipedia.org/wiki/FASTQ_format
![Page 19: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/19.jpg)
Phred scores
• Developed by old program "Phred" during human genome project, adopted as standard throughout field
• Phred score = -10log(P(base call error))• e.g.,
Phred score of 10 = 90% base call accuracyPhred score of 20 = 99% base call accuracyPhred score of 30 = 99.9% base call accuracyPhred score of 40 = 99.99% base call accuracyetc.
![Page 20: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/20.jpg)
FastQC - QC for raw reads
• FastQC is the most common software to understand the quality of raw sequencing reads
• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Runs using a java applet• Using the server, we have to run via command
line
![Page 21: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/21.jpg)
FastQC screenshot
Starts into specific figures
Summary stats
What it thinks of yourrun quality - NOT HARDAND FAST RULES!!
![Page 22: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/22.jpg)
FastQC - Per base quality
• Blue line: mean
• Red line: median
• Boxes: 25-75% range
• Whiskers:• 10-90%
range
Phre
d sc
ore
![Page 23: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/23.jpg)
FastQC - Per read quality• Highlights
systematic problems
• e.g., a region of flowcell is problematic
![Page 24: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/24.jpg)
FastQC - Per base sequence content
• Unbiased sequences should have the same content across all bases
• Will show biases if some sequence is hugely overrepresented
• e.g., adapter contamination
• e.g., biased fragmentation
![Page 25: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/25.jpg)
FastQC - Per Sequence GC content
• Unbiased sequencing should have a normally distributed %GC content
• Deviations may indicated contamination
• e.g., adapter• e.g., two
species with different %GC contents
![Page 26: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/26.jpg)
FastQC - Per base N count
• Ns indicate that the base caller could not determine a base at that position
• Global N abundance generally correlates with sequence quality
![Page 27: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/27.jpg)
FastQC - Sequence length
• Some methods yield non-uniform read lengths
• e.g., Pacbio (shown)
• Illumina will only show one uniform value here
![Page 28: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/28.jpg)
FastQC - Duplicate sequences
• An unbiased library should have few duplicates
• A few duplicates may indicate saturated template sequencing
• High duplication may indicate adapter contamination or enrichment bias
![Page 29: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/29.jpg)
FastQC - Kmer content
• Tests for kmers enriched as a certain read position
• Graphs 6 worst, tabulates the rest
• Can indicate sequencing/library bias
• Can indicate contamination by one sequence, e.g., primers, adapters
![Page 30: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/30.jpg)
FastQC - Overrepresented sequences• May indicate how read diversity is limited, e.g.,
adapter/primer contamination• May be biological, e.g., repeats
![Page 31: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/31.jpg)
FastQC - Adapter content
• Specifically looks for known adapter/primer contamination
• Indicates reads are longer than insert size
![Page 32: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/32.jpg)
FastQC - Per tile sequence quality
• Shows flowcell tiles that are particularly error-prone
• Illumina data only, and only if positional metadata is included with reads
![Page 33: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/33.jpg)
Running FastQC on the server
• Very simple: $ fastqc <input_file>• Produces a .html file as output• Transfer the html to your computer and open it
using your favorite web browser
![Page 34: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/34.jpg)
Getting rid of adapters using Trimmomatic• Trimmomatic is a standard method to remove
adapter contamination• http://www.usadellab.org/cms/?
page=trimmomatic• Bolger et al. 2014 Bioinformatics btu170
![Page 35: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/35.jpg)
Running Trimmomatic• Admittedly, a complex syntaxjava -jar <path to trimmomatic.jar> PE [-threads <threads] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...
java -jar /opt/bioinformatics/Trimmomatic-0.32/trimmomatic-0.32.jar PE SRR826450_1.fastq SRR826450_2.fastq output_forward_paired.fq output_forward_unpaired.fq output_reverse_paired.fq output_reverse_unpaired.fq ILLUMINACLIP:/opt/bioinforatics/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10
![Page 36: MCB3895-004 Lecture #10 Sept 25/14](https://reader034.vdocument.in/reader034/viewer/2022042704/56813a83550346895da280ee/html5/thumbnails/36.jpg)
Assignment #4
• Download these two E. coli K-12 MG1655 genome sequencing reads from NCBI SRA: SRR826450, SRR956947
• What are the differences?• Write script to subsample fastq files• Analyze your input data using fastq• If appropriate, adapter trim using Trimmomatic