Upload: mervyn-johns

Post on 30-Dec-2015

Tweaking BLAST

Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the right command and options, e.g.

> blastall -p blastn -i my_sequence.fasta -d refseq

This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:

-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>

There are many other parameters; any that are not listed explicitly take a default value appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11, whereas blastx uses -W 3.
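As a rough illustration of how the options fit together, here is a minimal Python sketch that assembles the same command as an argument list. The function name build_blastall_command and the DEFAULT_WORD_SIZE table are hypothetical helpers made up for this example; the flags are the ones described above (legacy blastall takes the query file with a lowercase -i), and the sketch only builds the command, it does not run BLAST.

```python
# Default word sizes quoted in the text: 11 for blastn, 3 for blastx.
DEFAULT_WORD_SIZE = {"blastn": 11, "blastx": 3}


def build_blastall_command(program, query_file, database, word_size=None):
    """Return the blastall argument list for a basic search (hypothetical helper)."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Only pass -W when overriding; otherwise blastall applies its own default.
    if word_size is not None:
        command += ["-W", str(word_size)]
    return command


basic = build_blastall_command("blastn", "my_sequence.fasta", "refseq")
print(" ".join(basic))

# Same search, but overriding the blastn default word size of 11.
short_words = build_blastall_command(
    "blastn", "my_sequence.fasta", "refseq", word_size=7
)
print(" ".join(short_words))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without worrying about shell quoting.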

There are also some options that appear on the web pages which are not really parameters but manage the job in some way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST

There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.

Here are some of the ones that I have used:

-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
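To see why -z "removes the effect of actual database size", it may help to look at the usual relation between bit score and expected value, E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The sketch below is a simplification under stated assumptions: it uses raw lengths rather than the edge-corrected effective lengths that BLAST actually computes, and the function name and the numbers are made up for illustration.

```python
def expected_value(bit_score, query_length, database_length):
    """Approximate E = m * n * 2**(-S') for bit score S'.

    Simplified: real BLAST uses effective (edge-corrected) lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# The same 40-bit hit with a 500-letter query, against databases of two sizes.
small = expected_value(40.0, 500, 1_000_000)
large = expected_value(40.0, 500, 100_000_000)
print(f"small database: E = {small:.2e}")
print(f"large database: E = {large:.2e}")

# Fixing the database size (the role played by -z) gives the same E in both
# searches, so hits from databases of different sizes can be compared.
fixed_size = 10_000_000
print(f"fixed size:     E = {expected_value(40.0, 500, fixed_size):.2e}")
```

The hit itself has not changed between the two searches; only the search space has, which is why the E-value grows in step with the database.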

BLAST Parameters Exercises
1. BLASTn vs BLASTp

Go to informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open blast-parameter-sequences.html
Copy the sequence >blastn-vs-blastp and go to the NCBI BLAST Home Page.
This is a Xenopus tropicalis cDNA sequence.

Go to the NUCLEOTIDE BLAST section.
Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] to wait for the results in a new page.

Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.

How might the different results help us view the presence of this gene in other vertebrates?

BLAST Parameters Exercises
2. Low complexity filtering

Go to informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open blast-parameter-sequences.html
Copy the sequence >low-complexity-filtering-A and go to the NCBI BLAST Home Page.

Go to the TRANSLATED BLAST section, BLASTx.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.

What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?

Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
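For a feel of what a low-complexity filter does, here is a toy Python sketch that masks windows of a protein sequence whose Shannon entropy falls below a threshold. This is a deliberately simple stand-in and not the algorithm the BLAST filter actually uses; the window size, the threshold and the example sequence are all arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residues in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue in any low-entropy window with 'X' (toy filter)."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Made-up protein with a poly-glutamine run in the middle.
protein = "MKTAYIAKQR" + "Q" * 14 + "LSDEVRGHWT"
masked = mask_low_complexity(protein)
print(protein)
print(masked)
```

Masked residues cannot seed or extend a match, so hits that rest only on the repetitive run disappear when the filter is on.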

BLAST Parameters Exercises
3. BLASTn vs tBLASTx and nucleotide mismatch penalties

Go to informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open blast-parameter-sequences.html
Also open the NCBI BLAST Home Page and go to the SPECIAL - Align two sequences section.

There are several Xenopus tropicalis cyclins.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 window.
Run the default comparison (should be BLASTn). Note the alignment.

Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?

Revert to BLASTn and try varying the values for mismatch penalties and gapping; start by reducing the mismatch penalty to -1.
Can we learn anything from this?

Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
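The effect of the mismatch penalty can be sketched with a tiny ungapped scoring function in Python. The two sequences below are invented for the example (they are not the cyclin sequences), and the default values assumed here are a match reward of 1 (-r) and a mismatch penalty of -3 (-q).

```python
def ungapped_score(seq_a, seq_b, reward=1, penalty=-3):
    """Score two equal-length nucleotide sequences position by position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(
        reward if a == b else penalty
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


# Invented 15-mers that differ at two positions (13 matches, 2 mismatches).
seq_1 = "ATGGCGTACGTTAGC"
seq_2 = "ATGGCTTACGATAGC"

strict = ungapped_score(seq_1, seq_2)               # -r 1 -q -3
relaxed = ungapped_score(seq_1, seq_2, penalty=-1)  # -r 1 -q -1
print(strict, relaxed)  # prints "7 11"
```

With the milder penalty each mismatch costs less, so alignments can extend through more divergent stretches before their score drops off, which tends to produce longer alignments between more distantly related sequences.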

BLAST Parameters Exercises
4. Limit Entrez query

Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matches in fruit fly proteins, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
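Combining terms can be sketched with a few string helpers in Python. The helper names are hypothetical and exist only for this example, and Danio rerio is used purely as a second illustrative organism; the resulting strings are what you would paste into the Limit by entrez query box.

```python
def organism_term(name):
    """Build an organism restriction such as 'Drosophila melanogaster[ORGN]'."""
    return f"{name}[ORGN]"


def combine(operator, *terms):
    """Join terms with AND or OR, parenthesised so they nest safely."""
    if operator not in {"AND", "OR"}:
        raise ValueError("operator must be AND or OR")
    return "(" + f" {operator} ".join(terms) + ")"


def exclude(base, term):
    """Remove matches for 'term' from the results selected by 'base'."""
    return f"{base} NOT {term}"


fly = organism_term("Drosophila melanogaster")
fish = organism_term("Danio rerio")

either = combine("OR", fly, fish)
print(either)

fish_only = exclude(either, fly)
print(fish_only)
```

Parenthesising each combined group keeps the meaning unambiguous when AND, OR and NOT are mixed in one query.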

Go to informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open blast-parameter-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page.
Go to the TRANSLATED BLAST section, BLASTx, and paste the sequence.

Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...

BLAST Parameters Exercises
5. Word Size

Go to informatics.gurdon.cam.ac.uk/online/workshops/useful-web-sites.html
Open blast-parameter-sequences.html
Copy the sequence >morpholino and go to the NCBI BLAST Home Page.
Go to the NUCLEOTIDE BLAST section, BLASTn, and paste the sequence.

Check OFF the low complexity filter and then run the search.

Now re-run the search, setting the following parameters:

Low complexity: OFF
Expect: 100
Word Size: 7
Other advanced: -q -1 (mismatch penalty -1 instead of default -3)

What difference does this make?
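One way to think about word size with a short query is that a hit is only reported if the query and target share at least one exact run of W letters to start from. The Python sketch below checks this for a fixed ungapped pairing of two 25-mers; both sequences are invented for illustration, and real BLAST looks for words anywhere in the database rather than in a single pre-aligned pair.

```python
def longest_exact_run(query, target):
    """Longest run of identical positions in a same-length ungapped pairing."""
    longest = current = 0
    for q, t in zip(query, target):
        current = current + 1 if q == t else 0
        longest = max(longest, current)
    return longest


def can_seed(query, target, word_size):
    """True if the pairing contains an exact match of at least word_size."""
    return longest_exact_run(query, target) >= word_size


# Invented 25-mer and a near-match with mismatches at positions 9 and 18.
query = "GCTAGCATCGATCGTACGATCGTAC"
target = "GCTAGCATAGATCGTACTATCGTAC"

print(longest_exact_run(query, target))        # longest exact stretch
print(can_seed(query, target, word_size=11))   # default blastn word size
print(can_seed(query, target, word_size=7))    # reduced word size
```

Here 23 of 25 positions match, yet no exact stretch reaches 11 letters, so the near-match would never be seeded at the default word size; it is found once the word size drops to 7.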
