lab7 twinscan, hmmer, pfam. twinscan twinscan twinscan finds genes in a "target" genomic...

19
Lab7 Twinscan, HMMER, PFAM

Upload: aleesha-mckinney

Post on 17-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Lab7

Twinscan, HMMER, PFAM

Page 2: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

TWINSCAN

Page 3: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

TwinScan• TwinScan finds genes in a "target" genomic sequence by simultaneously

maximizing the probability of the gene structure in the target and the evolutionary conservation derived from "informant" genomic sequences.

• The target sequence (i.e. the sequence to be annotated) should generally be of draft or finished quality.

• The informant can range from a single sequence to a whole genome in any condition from raw shotgun reads to finished assembly.

• References– http://mblab.wustl.edu/media/publications/Hu-Brent-2003-Proof.pdf– http://mrw.interscience.wiley.com/emrw/9780471250951/cp/cpbi/article/

bi0408/current/abstract– http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/suppl_1/

S140

Page 4: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Requirements• TwinScan/ Nscan 4.0 executable

• DNA sequence in FASTA format

• Twinscan parameter file (/parameters/twinscan_parameters) – Each filename contains the name of the target organism: eg maize_twinscan.zhmm.

• Conservation sequence (see examples/example.conseq)– A symbolic representation of the the best alignments between the target and informant sequences. – To create this conservation sequence, WU-BLAST (http://blast.wustl.edu) is used. – For NCBI BLAST, the input parameters need to be changed. (see parameters/examples/example_blast_parameters.txt)– xdformat (WU-BLAST package) formats the informant sequences to create a blast database. – After running BLAST, the output must be formatted with conseq, which is included in this package.– Example (using WU-BLAST)

• xdformat -n informant.fa• Blast M=1 N=-1 Q=5 R=1 B=10000 V=100 -cpus=1 -warnings -lcfilter filter=seg filter=dust topcomboN=1 informant.fa target.fa >

blast.out• conseq target.fa blast.out > conseq.fa

– Note: The runTwinscan2 script will run these steps without user intervention (see below).

• EST sequence (optional)– EST sequence is a symbolic representation of evidence from ESTs that align to the target sequence. – The estseq script included in the distribution creates EST sequence when given a DNA sequence and a (set of) BLAT reports of

the the ESTs aligned to the target. – For downloading BLAT, go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3

Page 5: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Web service and Installation• Web service– http://mblab.wustl.edu/nscan/submit/

• Local installation– http://mblab.wustl.edu/software/twinscan/

• Installation step$ tar xvzf iscan-4.1.2.tar.gz$ cd N-SCAN$ make linux$ ./test-executable

Page 6: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

How to run

• Usagetwinscan <parameter file> <masked sequence file> -c=<conseq file> [-

e=estseq_file] > <outputfile>

• Example:./bin/iscan ./parameters/twinscan_parameters/human_twinscan.zhmm

./examples/example.fa.masked -c=./examples/example.conseq > mySequence.gtf

Page 7: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Running Twinscan using the runTwinscan2 script

• In summary, there are 5 steps required to run Twinscan:– Step 1: Mask target sequence with RepeatMasker– Step 2: Create informant BLAST database– Step 3: Run BLAST– Step 4: Create conservation sequence– (Step 4b: Create EST sequence)– Step 5: Run Twinscan

• These five steps are all contained in the example script runTwinscan2 (see ./bin)

• The default BLAST parameters used by runTwinscan2 are those for C.elegans (see parameters/blast_params/Celegans.blast.param). – This can and should be changed for any other species with the -B option to the runTwinscan2

script.

• The file example.output in the /examples directory contains the output from runTwincan2 using the BLAST parameters found in the script.

Page 8: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Local environment setting for runTwinscan2

• Example– ../bin/runTwinscan2 -r ../parameters/twinscan_parameters/human_twinscan.zhmm -d

output -B ../parameters/blast_params/Hsapiens.blast.param example.fa.masked informant.fa

• After running you can find output files in the newly created /output directory.

• Several programs must be installed on your system to run runTwinscan.

• You may need to change runTwinscan to point it to these programs. – my $REPEATMASKER = "RepeatMasker"; # Format for local environment– my $BLASTN = "blastn"; # Format for local environment– my $BLAT = "blat"; # Format for local environment– my $XDFORMAT = "xdformat"; # Format for local environment– my $PRESSDB = "pressdb"; # Format for local environment

Page 9: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

HMMER

Page 10: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Sean Eddy’s Lab

• http://selab.janelia.org/software.html

Page 11: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Introduction

• HMMER2 is an implementation in UNIX (Linux, MacOS) platform of profile hidden Markov model, whose source code, executables, and user guide can be downloaded from http://hmmer.janelia.org/

• The experiment of HMMER is to look for known domains in a query sequence by searching a single sequence again a library of HMMs.

• One such library is PFAM, and you can also create your own library using HMMER

Page 12: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

HMMER executables1. hmmalign ‐ Align sequences to an existing model.

2. hmmbuild Build a model from a multiple sequence alignment.‐

3. hmmcalibrate Takes an HMM and empirically determines parameters that are used to make ‐searches more sensitive, by calculating more accurate expectation value scores (E values).‐

4. hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary ‐format, and “best effort” emulation of GCG profiles.

5. hmmemit Emit sequences probabilistically from a profile HMM.‐

6. hmmfetch Get a single model from an HMM database.‐

7. hmmindex Index an HMM database‐ .

8. hmmpfam Search an HMM database for matches to a query ‐ sequence.

9. hmmsearch Search a sequence database for matches to an HMM.‐

Page 13: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Installation• Simple installation

– Download the current version of HMMER “hmmer 2.3.2.bin.intel linux.tar.gz” ‐ ‐ from http://hmmer.janelia.org/#download;

– Unpack the software by typing “tar –xvf hmmer 2.3.2.bin.intel linux.tar.gz” in ‐ ‐ the command line. You will see a new directory “hmmer 2.3.2.bin.intel linux”.‐ ‐

– Enter the directory of hmmer 2.3.2.bin.intel linux. You will see NINE executables ready in the subdirectory ‐ ‐“/binaries”, and also nine files in the subdirectory “/tutorial”;

• Installation from source code– Download the current HMMER source code version “hmmer 2.3.2.tar.gz” from ‐

http://hmmer.janelia.org/#download;– Create a new directory in your Linux account and upload or move the software package to the directory;– Unpack the software by typing “tar –xvf hmmer 2.3.2.tar.gz” in the command ‐ line;– Type “cd hmmer 2.3.2” to enter the software directory;‐– Type “./configure” to configure for your system and build the programs;– Type “make” to generate the executables;– Type “make check” to run the automated test suite; (This is optional but recommended, and all these tests

should pass);– Please note that by default programs are in “/usr/local/bin/” and man pages are in “/usr/local/man/man1”;– Type “make install” to install all executables;

Page 14: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Hmmbuild : build a profile HMM from an aignment

• hmmbuild [options] hmmfile alignfile • hmmbuild test.hmm test.aln• hmmbuild -h

• hmmbuild reads a multiple sequence alignment file alignfile , builds a new profile HMM, and saves the HMM in hmmfile.

• alignfile may be in ClustalW, GCG MSF, or SELEX alignment format.

• By default, the model is configured to find one or more non-overlapping alignments to the complete model.

• To configure the model for a single global alignment, use the -g option;

• To configure the model for multiple local alignments, use the -f option;

• To configure the model for a single local alignment (standard Smith/Waterman), use the -s option.

Page 15: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Hmmcalibrate: calibrate HMM search statistics

• hmmcalibrate [options] hmmfile• hmmcalibrate test.hmm• Hmmcalibrate -h

• hmmcalibrate reads an HMM file from hmmfile, scores a large number of synthesized random sequences with it, fits an extreme value distribution (EVD) to the histogram of those scores, and re-saves hmmfile now including the EVD parameters.

• This step is optional, but it will increase the sensitivity of your database search

• hmmcalibrate may take several minutes (or longer) to run. While it is running, a temporary file called hmmfile.xxx is generated in your working directory.

• If you abort hmmcalibrate prematurely (ctrl-C, for instance), your original hmmfile will be untouched, and you should delete the hmmfile.xxx temporary file.

Page 16: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Hmmsearch: - search a sequence database with a profile HMM

• hmmsearch [options] hmmfile seqfile • hmmsearch test.hmm query.faa > query.faa.domain• hmmsearch -h

• hmmsearch reads an HMM from hmmfile and searches seqfile for significantly similar sequence matches.

• hmmsearch may take minutes or even hours to run, depending on the size of the sequence database. It is a good idea to redirect the output to a file.

• The output consists of four sections: • a ranked list of the best scoring sequences, • a ranked list of the best scoring domains, • alignments for all the best scoring domains, and • a histogram of the scores.

• A sequence score may be higher than a domain score for the same sequence if there is more than one domain in the sequence; the sequence score takes into account all the domains. All sequences scoring above the -E and -T cutoffs are shown in the first list, then every domain found in this list is shown in the second list of domain hits. If desired, E-value and bit score thresholds may also be applied to the domain list using the -domE and -domT options.

Page 17: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

PFAM

Page 18: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Pfam 23.0 (July 2008, 10340 families)• The Pfam database is a large collection of protein families, each represented by multiple sequence

alignments and hidden Markov models (HMMs).

• Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature.

• The identification of domains that occur within proteins can therefore provide insights into their function.

• There are two components to Pfam: Pfam-A and Pfam-B. – Pfam-A entries are high quality, manually curated families. – Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence

database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. – Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A

entries are found.

• Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM. (see Pfam-C)

Page 19: Lab7 Twinscan, HMMER, PFAM. TWINSCAN TwinScan TwinScan finds genes in a "target" genomic sequence by simultaneously maximizing the probability of the

Sequence analysis with HMM• ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam23.0/ to

download files “Pfam_fs.gz” and “Pfam_ls.gz”– Pfam_ls - All global (ls mode) Pfam-A HMMs in an HMM library searchable

with the hmmpfam program. – Pfam_fs - All local (fs mode) Pfam-A HMMs in an HMM library searchable with

the hmmpfam program.

• Data location– http://darwin.informatics.indiana.edu/col/courses/I529-12/Lab/Lab7/

Data/PFAM_data/

– Copy to your working director

• To search for domains in “test.faa” in the global sequence database, type – hmmpfam Pfam_fs test.faa > test.faa.pfam”

• The results is logged into an output file “test.faa.pfam”;