Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy 1 Practical exercises Answers…

- Nucleic acid database in Japan: DDBJ: http://www.ddbj.nig.ac.jp/

- Microarray data: ArrayExpress: http://www.ebi.ac.uk/microarray/

- Mass spectrometry data: PRIDE: http://www.ebi.ac.uk/pride/, OPD http://www.ebi.ac.uk/pride/

- Protein-protein interaction: IntAct: http://www.ebi.ac.uk/intact/site/, DIP: http://dip.doe-mbi.ucla.edu/, JCB: http://www.imb-jena.de/jcb/ppi/

- Rat enamel 2D gel electrophoresis: http://biocadmin.otago.ac.nz/fmi/xsl/toothprint/home.xsl (last revision August 2006)

- CFTR mutations: http://www.genet.sickkids.on.ca/cftr/ (this web site was last updated March 2007)


Exercise 2

E.coli K12 recombinase A (recA) in different protein sequence databases

Find, if it exists, the entry corresponding to the E.coli (strain K12) recA protein sequence in the following protein sequence databases:
- EMBL http://www.ebi.ac.uk/embl/
- RefSeq http://www.ncbi.nlm.nih.gov/RefSeq/
- UniProtKB http://www.expasy.org/sprot/ or http://beta.uniprot.org/ (find sequence(s) in UniProtKB/Swiss-Prot and sequence(s) in UniProtKB/TrEMBL)
- PIR-PSD http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml
- PDB http://www.rcsb.org/pdb/home/home.do
- UniParc (use SRS or the UniParc query tool)
- EnsEMBL http://www.ensembl.org/index.html

Find the UniProtKB/Swiss-Prot entry corresponding to the RefSeq entry NP_036231.

Hints:
- You can use the query tool provided by each database.
- You can use SRS.
- You can use the crosslinks (if they exist) to go from one database to another...
- You can use the mapping tool on the new UniProt web site http://beta.uniprot.org/


EMBL: U00096
RefSeq: NC_000913
UniProtKB: P0A7G6, Swiss-Prot only (there are 2 fragments in TrEMBL, but they are not from K12)
PIR-PSD: G65049; RQECA. Retrieved from UniProt
UniParc: UPI0000112C1C
PDB: 1AA3, 1N03, 1REA, 1U94 etc., easily retrieved from UniProt.

EnsEMBL: Not possible, bacteria are not in EnsEMBL!
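When moving between databases it helps to recognise which resource an identifier probably belongs to. The sketch below guesses the source from the shape of the identifier alone. Only the UniProtKB expression follows UniProt's published accession format; the other expressions are simplified assumptions made for this illustration, and the function name `guess_database` is hypothetical. Shape alone is ambiguous in places: a PIR identifier such as G65049 has the same shape as a one-letter, five-digit EMBL accession.

```python
import re

# Identifier shapes. Only the UniProtKB expression follows UniProt's
# documented accession format; the rest are simplified assumptions.
PATTERNS = [
    ("UniParc", re.compile(r"UPI[0-9A-F]{10}")),
    ("RefSeq", re.compile(r"[A-Z]{2}_[0-9]+(\.[0-9]+)?")),
    ("UniProtKB", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("EMBL/GenBank/DDBJ", re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}")),
    ("PDB", re.compile(r"[0-9][A-Z0-9]{3}")),
]


def guess_database(identifier):
    """Return the first database whose identifier shape matches, else None."""
    ident = identifier.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(ident):
            return name
    return None


if __name__ == "__main__":
    # Identifiers taken from the answers to this exercise.
    for ident in ["U00096", "NC_000913", "P0A7G6", "UPI0000112C1C", "1AA3"]:
        print(ident, "->", guess_database(ident))
```

A guess of this kind only tells you where to look first; the cross-references in the entry itself remain the reliable route.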


Exercise 3

- Find the human erythropoietin protein sequence in UniProt.
- BLASTp it at ExPASy (http://www.expasy.org/tools/blast/); restrict the BLAST to human sequences (Homo sapiens).
- Look at the BLAST results and guess from which database(s) the protein sequences are derived. How many distinct human erythropoietin protein sequences do you get?
- Do the same, but at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/). How many distinct human erythropoietin protein sequences do you get?


BLASTp at ExPASy against UniProtKB

Only 2 entries; one annotated in Swiss-Prot, the other unannotated in TrEMBL.

Looking at the Swiss-Prot entry you see a lot of rich information.


BLASTp at NCBI against nr

At least 9 entries: RefSeq (tag ref, 1), GenPept (tags emb and gb, 6) and PDB (tag pdb, 2). The Swiss-Prot entry has most of these cross-references, and more besides.
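Guessing the source database from an NCBI hit list amounts to reading the short tag in each definition line. The sketch below tallies hits by that tag, assuming the classic `gi|number|tag|accession|` layout. The example definition lines are invented placeholders, not real BLAST output, and `source_of` is a hypothetical helper name.

```python
from collections import Counter

# Source tags used in classic NCBI "gi|...|tag|accession|" definition lines.
TAG_TO_SOURCE = {
    "ref": "RefSeq",
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "Swiss-Prot",
    "pir": "PIR",
    "pdb": "PDB",
}


def source_of(defline):
    """Return the source database named by the first known tag, else 'unknown'."""
    for field in defline.lstrip(">").split("|"):
        if field in TAG_TO_SOURCE:
            return TAG_TO_SOURCE[field]
    return "unknown"


if __name__ == "__main__":
    # Invented placeholder lines; the numbers and accessions are not real.
    deflines = [
        ">gi|0000001|ref|XX_000001.1| example protein",
        ">gi|0000002|gb|XXX00001.1| example protein",
        ">gi|0000003|emb|XXX00002.1| example protein",
        ">gi|0000004|pdb|0XXX|A example protein, chain A",
    ]
    print(Counter(source_of(d) for d in deflines))
```

Several definition lines can describe the same protein sequence, so a tally by source is not the same as a count of distinct sequences.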


Exercise 4:

Understanding BLAST output

Compare the results of BLASTp for entry O05891:
- against UniProtKB (http://www.expasy.org/tools/blast/)
- against NCBI-nr (http://www.ncbi.nlm.nih.gov/BLAST/)

Look for the same best hits and compare the scores. Why are they different?

Keep the UniProtKB output; we will use it again in a minute.


BLASTp at NCBI against nr

BLASTp at ExPASy against UniProtKB


NCBI BLAST FAQ: http://www.ncbi.nlm.nih.gov/blast/blast_FAQs.shtml

Q: What is the Expect (E) value?

The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is.


http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

The Statistics of Sequence Similarity Scores

The E-value of equation (1) applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database. The E value depends on the size of the database.
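The dependence on database size can be made concrete with a small calculation. The sketch below uses the standard relation between a normalised (bit) score S' and the pairwise E-value, E = m * n * 2^(-S'), and then scales it by the number of database sequences as described above. The query length, subject length and database sizes are invented for illustration, and real BLAST applies further corrections (effective lengths, edge effects) that are ignored here.

```python
import math


def pairwise_evalue(bit_score, query_len, subject_len):
    """Pairwise E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * subject_len * math.pow(2.0, -bit_score)


def database_evalue(bit_score, query_len, subject_len, n_sequences):
    """Scale the pairwise E-value by the number of sequences searched."""
    return pairwise_evalue(bit_score, query_len, subject_len) * n_sequences


if __name__ == "__main__":
    # Same alignment, same bit score, two hypothetical database sizes.
    for n_sequences in (1_000_000, 10_000_000):
        e = database_evalue(60.0, 300, 300, n_sequences)
        print(f"{n_sequences:>10} sequences: E = {e:.3g}")
```

The same hit therefore gets a larger (less significant) E-value in the larger database, which is one reason the values differ between a UniProtKB search and an nr search.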


Exercise 5:

Start site issues: bacteria

Take the UniProtKB BLASTp output for O05891. Align the first 9 sequences using ClustalW (tool on the BLAST output page). What do you see? What is one possible interpretation?

Look at the entry in UniProt. What can you see to strengthen this interpretation?


MYCTF = Mycobacterium tuberculosis strain F11. It is not clear if it is a WGS or a fully finished genome…

In either case there has probably been an error in the start codon prediction. In bacteria, codons other than ATG can also start a protein, notably GTG (Val) and TTG (Leu). That is probably what happened here: another potential start a few residues upstream, which corresponds to the predictions for other Mycobacteria, was not noticed…

O05891 has the keyword (KW) Direct protein sequencing, and one reference has the descriptor “Protein sequence of N-terminus…”
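The search for an overlooked upstream start can be sketched in a few lines. Starting from the annotated start codon, the function below walks upstream one codon at a time in the same reading frame and reports any ATG, GTG or TTG it meets before an in-frame stop codon. The DNA string is a toy example, not the M. tuberculosis sequence, and `upstream_starts` is a hypothetical name.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(dna, annotated_start):
    """List (position, codon) for in-frame start codons upstream of the
    annotated start (0-based index of its first base), nearest first.
    The walk stops at the first in-frame stop codon."""
    dna = dna.upper()
    candidates = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append((pos, codon))
        pos -= 3
    return candidates


if __name__ == "__main__":
    # Toy sequence: stop, GTG, two codons, annotated ATG, one codon, stop.
    toy = "TAA" "GTG" "GCC" "AAA" "ATG" "CCC" "TGA"
    print(upstream_starts(toy, annotated_start=12))
```

A candidate found this way is only a hypothesis; evidence such as the directly sequenced N-terminus is what settles which start is used.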


Exercise 6:

BLASTp and UniRef

Compare the results of BLASTing P04150 against UniProtKB, UniRef100, UniRef90 and UniRef50 (use BLAST at ExPASy). In which cluster(s) do you find the alternatively spliced sequences (how many are there)?


The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences (including some from UniParc) into a single record to speed searches.

One UniRef100 entry -> all identical sequences (including fragments): reduction of 12% of DB.
One UniRef90 entry -> sequences that have at least 90% identity: reduction of 45% of DB.
One UniRef50 entry -> sequences that are at least 50% identical: reduction of 69% of DB.

Species independent!!

UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.
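The effect of the identity threshold on cluster count can be illustrated with a toy greedy clustering. This is not the procedure used to build UniRef: the identity measure below is a crude ungapped, position-by-position comparison over the shorter sequence, and the four sequences are invented. It only shows how lowering the threshold merges more sequences under one representative.

```python
def identity(a, b):
    """Fraction of identical positions, ungapped, over the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_cluster(seqs, threshold):
    """Longest sequences first; each sequence joins the first cluster whose
    representative it matches at or above the threshold, else starts its own."""
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLLLL", "MKTAY"]
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_cluster(toy, threshold)), "cluster(s)")
```

With these toy sequences the fragment joins its full-length parent at 100%, the single-residue variant joins at 90%, and everything collapses into one cluster at 50%.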


First BLAST against UniProtKB


UniRef100: + more further down the output

They are not all in the same cluster (remember 12% reduction in DB size)


UniRef90: Still not all in the same cluster

(remember 45% reduction in DB size)


UniRef50: All in the same cluster

(remember 69% reduction in DB size)


Exercise 7

Different looks and tools for the same entry depending on the server...

Starting with the new UniProt server (http://beta.uniprot.org/):

a. Look for the amino acid sequence of human carbonic anhydrase 2.

b. Get the corresponding nucleic acid entries in EMBL and GenBank: try to find a nucleic acid sequence derived from genomic DNA sequencing and another one derived from cDNA sequencing.

c. From the UniProtKB/Swiss-Prot entry, look at the data available for the variant Pro-92 and in particular its position in the 3D structure (Use the “Astex viewer”).

Starting with the NCBI server (http://www.ncbi.nlm.nih.gov/):

a. Look for the amino acid sequence of human carbonic anhydrase 2 using ENTREZ protein at the NCBI server.

b. Find the UniProtKB/Swiss-Prot entry and, as above:
- Get the corresponding nucleic acid entries in EMBL and GenBank.
- Find the data available for the variant Pro-92.


UniProt: P00918, follow links


NCBI, Entrez protein, can also just type in P00918


Note differences in UniProt cross-reference presentation, and in information present about a given cross-reference


Feature table ordering is very different here, numerical only


Exercise 8

Environmental sequences: how to check the quality of a protein sequence...

a. Look at DQ284920 at EMBL (http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+top+-newId): where does the sequence come from? How reliable is the translated coding sequence (CDS)?

b. How many environmental sequences are found in the nucleic acid databases (use SRS (ENV))?

c. Look at DQ380558: can you find the protein sequence in UniProtKB? Where does the annotation come from (from which type of analysis)?


a.


b.

(March 11, 22:00)


c.


Exercise 9

Genomic databases (I)

a. Look for the Swiss-Prot entry of the E.coli gene gutQ (http://beta.uniprot.org/).
b. Follow the link to EcoGene (EcoGene Database of Escherichia coli sequence and function) and find the chromosomal location.
c. Get the next E.coli gene on the same strand.
d. Follow the link to Swiss-Prot.
e. Find the subcellular localisation of the protein.
f. What regions and domains does the protein contain? Visualize them.
g. Have a look at the domain structure in the different domain databases. In PROSITE, get the list of proteins with at least one common domain.


EcoGene page


Or in this pull down list


Note: Currently the NiceProt view


Flavodoxin-like

Zinc metallo-hydrolase

Rubredoxin-like


From InterPro

or


From PROSITE


Exercise 10

Protein domain / family databases

a. How many different databases are used by InterPro?
b. Do an InterPro scan with the sequence on the next page.
c. How many different domains does the protein contain?
d. How many phosphopantetheine-binding domains does the protein contain?
e. How many different protein domain databases have a discriminator for the phosphopantetheine-binding domain? Are they using patterns, profiles or HMMs?

What are the most frequent domains found in Mycobacterium tuberculosis H37Rv? (Go to the integr8 site complete proteome: http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=30). What percentage of proteins in M.tuberculosis have a phosphopantetheine-binding domain?


MVHATACSEI IRAEVAELLG VRADALHPGA NLVGQGLDSI RMMSLVGRWR RKGIAVDFATLAATPTIEAW SQLVSAGTGV APTAVAAPGD AGLSQEGEPF PLAPMQHAMW VGRHDHQQLGGVAGHLYVEF DGARVDPDRL RAAATRLALR HPMLRVQFLP DGTQRIPPAA GSRDFPISVADLRHVAPDVV DQRLAGIRDA KSHQQLDGAV FELALTLLPG ERTRLHVDLD MQAADAMSYRILLADLAALY DGREPPALGY TYREYRQAIE AEETLPQPVR DADRDWWAQR IPQLPDPPALPTRAGGERDR RRSTRRWHWL DPQTRDALFA RARARGITPA MTLAAAFANV LARWSASSRFLLNLPLFSRQ ALHPDVDLLV GDFTSSLLLD VDLTGARTAA ARAQAVQEAL RSAAGHSAYPGLSVLRDLSR HRGTQVLAPV VFTSALGLGD LFCPDVTEQF GTPGWIISQG PQVLLDAQVTEFDGGVLVNW DVREGVFAPG VIDAMFTHQV DELLRLAAGD DAWDAPSPSA LPAAQRAVRAALNGRTAAPS TEALHDGFFR QAQQQPDAPA VFASSGDLSY AQLRDQASAV AAALRAAGLRVGDTVAVLGP KTGEQVAAVL GILAAGGVYL PIGVDQPRDR AERILATGSV NLALVCGPPCQVRVPVPTLL LADVLAAAPA EFVPGPSDPT ALAYVLFTSG STGEPKGVEV AHDAAMNTVETFIRHFELGA ADRWLALATL ECDMSVLDIF AALRSGGAIV VVDEAQRRDP DAWARLIDTYEVTALNFMPG WLDMLLEVGG GRLSSLRAVA VGGDWVRPDL ARRLQVQAPS ARFAGLGGATETAVHATIFE VQDAANLPPD WASVPYGVPF PNNACRVVAD SGDDCPDWVA GELWVSGRGIARGYRGRPEL TAERFVEHDG RTWYRTGDLA RYWHDGTLEF VGRADHRVKI SGYRVELGEIEAALQRLPGV HAAAATVLPG GSDVLAAAVC VDDAGVTAES IRQQLADLVP AHMIPRHVTLLDRIPFTDSG KIDRAEVGAL LAAEVERSGD RSAPYAAPRT VLQRALRRIV ADILGRANDAVGVHDDFFAL GGDSVLATQV VAGIRRWLDS PSLMVADMFA ARTIAALAQL LTGREANADRLELVAEVYLE IANMTSADVM AALDPIEQPA QPAFKPWVKR FTGTDKPGAV LVFPHAGGAAAAYRWLAKSL VANDVDTFVV QYPQRADRRS HPAADSIEAL ALELFEAGDW HLTAPLTLFGHCMGAIVAFE FARLAERNGV PVRALWASSG QAPSTVAASG PLPTADRDVL ADMVDLGGTDPVLLEDEEFV ELLVPAVKAD YRALSGYSCP PDVRIRANIH AVGGNRDHRI SREMLTSWETHTSGRFTLSH FDGGHFYLND HLDAVARMVS ADVR


c. 6 domains, 1 PTM, 1 family detected

d. 2 phosphopantetheine-binding domains


Pfam: HMM; PROSITE: Profile


Exercise 11

Use of UniProtKB/Swiss-Prot for creating datasets and prediction tools.

- Find proteins with the following EC numbers: 3.5.1.1, 3.5.1.38
- Look for proteins which have been experimentally proven to have an active site.
- Align the sequences.
- From the alignment suggest a pattern based around the active threonine (do this manually).
- Scan your pattern against UniProtKB/Swiss-Prot (http://expasy.org/tools/scanprosite/). How many matches do you find?
- Compare your pattern with that found in the PROSITE database, PS00144 (http://www.expasy.org/cgi-bin/prosite-search-ac?PDOC00132). How many matches in UniProtKB/Swiss-Prot are there with PS00144?
- Can you do the same with the NCBI nr data?


PS00144, have to give the EC numbers

ATGGTIAG

Scan against SP, get 13 hits

PROSITE pattern gives 517 hits against UniProt, 45 against SP
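The gap between 13 and 45 hits reflects how much freedom a pattern allows at each position. The sketch below converts a PROSITE-style pattern into a regular expression and scans a set of sequences with it. It handles only the basic syntax (`x`, `[..]`, `{..}`, repetition counts, and the `<` and `>` anchors). The looser pattern and the toy sequences are invented for illustration and are not quoted from PROSITE or UniProtKB, and the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repetition suffix such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        # [ABC] and single residues are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    regex = "".join(parts)
    return ("^" if anchored_start else "") + regex + ("$" if anchored_end else "")


def scan(pattern, sequences):
    """Return {name: [1-based match starts]} for sequences that match."""
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in sequences.items():
        starts = [m.start() + 1 for m in regex.finditer(seq)]
        if starts:
            hits[name] = starts
    return hits


if __name__ == "__main__":
    toy = {"toy1": "MKLAATGGTIAGKK", "toy2": "MKLVSSTGGTVSGG", "toy3": "MKKKKK"}
    print(scan("A-T-G-G-T-I-A-G", toy))                 # strict, manual pattern
    print(scan("[LIVM]-x(2)-T-G-G-T-[IV]-[AGS]", toy))  # looser, illustrative
```

On the toy data the strict pattern matches one sequence and the looser one matches two, which is the same trade-off between specificity and coverage seen in the real scan.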


Done March 2007, ANA
