mcb 3421 class 25. student evaluations please go to husky ct and complete student evaluations !

27
MCB 3421 class 25

Upload: bennett-owens

Post on 18-Jan-2018

219 views

Category:

Documents


0 download

DESCRIPTION

PSI (position-specific iterated) BLAST The NCBI page described PSI blast as follows: “Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.”

TRANSCRIPT

Page 1: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

MCB 3421 class 25

Page 2: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

student evaluations

Please go to husky CT and complete student evaluations !

Page 3: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

PSI (position-specific iterated) BLAST

The NCBI page described PSI blast as follows:

“Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues.

The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching.

PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases.” 

Page 4: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

The Psi-Blast Approach

1. Use results of BlastP query to construct a multiple sequence alignment2. Construct a position-specific scoring matrix from the alignment3. Search database with alignment instead of query sequence4. Add matches to alignment and repeat

Psi-Blast can use existing multiple alignment, or use RPS-Blast to search a database of PSSMs

Page 5: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

PSI BLAST scheme

Page 6: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Position-specific Matrix

M Gribskov, A D McLachlan, and D Eisenberg (1987) Profile analysis: detection of distantly related proteins. PNAS 84:4355-8.

by B

ob F

riedm

an

Page 7: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Psi-Blast is for finding matches among divergent sequences (position-specific information) WARNING: For the nth iteration of a PSI BLAST search, the E-value gives the number of matches to the profile NOT to the initial query sequence! The danger is that the profile was corrupted in an earlier iteration.

PSI BLAST and E-values!

Page 8: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Often you want to run a PSIBLAST search with two different databanks - one to create the PSSM, the other to get sequences:To create the PSSM: blastpgp -d nr -i subI -j 5 -C subI.ckp -a 2 -o subI.out -h 0.00001 -F f

blastpgp -d swissprot -i gamma -j 5 -C gamma.ckp -a 2 -o gamma.out -h 0.00001 -F f

Runs 4 iterations of a PSIblastthe -h option tells the program to use matches with E <10^-5 for the next iteration, (the default is 10-3 )-C creates a checkpoint (called subI.ckp),-o writes the output to subI.out,-i option specifies input as using subI as input (a fasta formated aa sequence). The nr databank used is stored in /common/data/-a 2 use two processors -h e-value threshold for inclusion in multipass model [Real] default = 0.002 THIS IS A RATHER HIGH NUMBER!!!

(It might help to use the node with more memory (017) (command is ssh node017)

PSI Blast from the command line

Page 9: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Use of a PSSM:

blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i subI -a 2 -R subI.ckp -o subI.out3 -F f

blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i gamma -a 2 -R gamma.ckp -o gamma.out3 -F f

Runs another iteration of the same blast search, but uses the databank /Users/jpgogarten/genomes/msb8.faa

-R tells the program where to resume-d specifies a different databank-i input file - same sequence as before -o output_filename-a 2 use two processors-h e-value threshold for inclusion in multipass model [Real] default = 0.002. This is a rather high number, but might be ok for the last iteration.

Page 10: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

PSI Blast and finding gene families within genomes

use PSSM to search a genome:

A) Use protein sequences encoded in genome as target:

blastpgp -d target_genome.faa -i query.name -a 2 -R query.ckp -o query.out3 -F f

B) Use nucleotide sequence and tblastn. This is an advantage if you are also interested in pseudogenes, and/or if you don’t trust the genome annotation:

blastall -i query.name -d target_genome_nucl.ffn -p psitblastn -R query.ckp

Build PSSM from query sequence and a large database (nr is a good choice – if you know the annotation of the query sequences, you don’t need to worry about the annotations in the database)

Page 11: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

man wc

Page 12: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

>wc -l blastp.out PSIblastP.out psitblastn.out 34 blastp.out 44 PSIblastP.out 56 psitblastn.out

Comparison of blastp, PSIblastP, and psitblastn

Page 13: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

1 2 3

4

5 6 7 8

1

8

36

5

7

2

4

Page 14: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

ori

1

2

3

5

6

7

8

4

Page 15: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Finding transferred genes

Screening in the wet-lab and in the computer

Page 16: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Finding transferred genes

Page 17: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Taxplot at NCBI

Page 18: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Taxplot at NCBI

Page 19: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Other approaches to find transferred genes

• Gene presence absence data for closely related genomes (for additional genes)

• Phylogenetic conflict (for homologous replacement (e.g. quartet decompositon spectra see Figs. 1 and 2)

• Composition based analyses (for very recent transfers).

Page 20: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Discussion of HGT from Bacteria to Tardigrades

We estimate that approximately one-sixth of tardigrade genes entered by HGT, nearly double the fraction found in the most extreme cases of HGT into animals known to date. Foreign genes have supplemented, expanded, and even replaced some metazoan gene families within the tardigrade genome. Our results demonstrate that an unexpectedly large fraction of an animal genome can be derived from foreign sources.

Page 21: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Source of genes in the H. dujardini genome as determined by HGT index calculations

Page 22: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Discussion of HGT from Bacteria to Tardigrades

Page 23: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

BIOARCHIVES

doi: http://dx.doi.org/10.1101/033464

http://biorxiv.org/content/early/2015/12/01/033464

Page 24: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

• “While the raw data indicated extensive contamination with bacteria, presumably from the gut or surface of the animals, careful cleaning generated a clean tardigrade dataset for assembly.”

Page 25: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Our assembly, and inferences from it, conflict with a recently published draft genome (UNC) 6 for what is essentially the same strain of H. dujardini. Our assembly, despite having superior assembly statistics, is ~120 Mb shorter than the UNC assembly. Our genome size estimate from sequence assembly is congruent with the values we obtained by direct measurement. We find 15,000 fewer protein-coding genes, and a hugely reduced impact of predicted HGT on gene content in H. dujardini. These HGT candidates await detailed validation. While resolution of the conflict between these assemblies awaits detailed examination based on close scrutiny of the raw UNC data, our analyses suggest that the UNC assembly is compromised by sequences that derive from bacterial contaminants, and that the expanded genome span, additional genes, and HGT candidates are likely to be artefactual.

Page 26: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

Figure 4: Mapping of read data to UNC assembly identifies non-shared contaminants and no expression from bacterial scaffolds A Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the UNC raw genomic sequence data (data file TG-300). Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum. B Blobplot showing the UNC assembly contigs distributed by GC proportion and coverage derived from the Edinburgh raw genomic sequence data. Scaffold points are scaled by length, and coloured based on taxonomic assignment of the sum of the best BLAST and Diamond matches for all the genes on the scaffold. Taxonomic assignments are summed by phylum.

Page 27: MCB 3421 class 25. student evaluations Please go to husky CT and complete student evaluations !

UNC reads Edinburgh reads

both mapped on the UNC assembly