

Page 1: 1 Introduction to Bioinformatics Fall 2008. 2 Administration  Adi Doron doronadi@post.tau.ac.il  Nimrod Rubinstein rubi@post.tau.ac.il  Dudu Burstein

Introduction to Bioinformatics

Fall 2008

Administration

Adi Doron doronadi@post.tau.ac.il
Nimrod Rubinstein rubi@post.tau.ac.il
Dudu Burstein [email protected]

Reception hours: by appointment
Britania 405, 6409245

Course Website

http://bioinfo.tau.ac.il/~intro_bioinfo/

Exercises

Each student participates once every 2 weeks:
Sunday 16:00-18:00
Monday 12:00-14:00
Monday 14:00-16:00

Computer classroom Sherman 03

Requirements

Exam – 80% of final grade
Assignments – 20% of final grade (compulsory)

Assignments include classwork and homework:

• Classwork is planned to be completed during the exercise. It should be mailed to the TA. It will be checked but not graded.

• Homework should be handed in at the following exercise (2 weeks after the hand-out date). It will be checked and graded.

Goals

To familiarize the students with research topics in bioinformatics, and with bioinformatic tools

The emphasis will be on tools and their use

Prerequisites

Familiarity with topics in molecular biology (cell biology and genetics)

Basic familiarity with computers & internet

BIOINFORMATIC DATABASES

What’s in a database?

Sequences – genes, proteins, etc.

Full genomes

Annotation – information about the gene/protein:
- function
- cellular location
- chromosomal location
- introns/exons
- protein structure
- phenotypes, diseases

Publications

NCBI and Entrez

One of the largest and most comprehensive databases, belonging to the NIH – National Institutes of Health (USA)

Entrez is the search engine of NCBI

Search for: genes, proteins, genomes, structures, diseases, publications and more.

http://www.ncbi.nlm.nih.gov/

Search for published papers

Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

Use fields!

Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]

For the full list of field tags: go to help -> Search Field Descriptions and Tags
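The fielded query above can also be assembled programmatically. The following is a minimal sketch, not part of the original slides: the helper names `build_query` and `esearch_url` are hypothetical, and the use of NCBI's E-utilities `esearch` endpoint is an assumption about how one might submit the query outside the web interface. Nothing is fetched; the code only builds the query string and a URL.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities search endpoint (an assumption; the slides
# only describe the Entrez web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms):
    """Join (value, field_tag) pairs into one AND-ed Entrez query string."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def esearch_url(query, db="pubmed", retmax=20):
    """Return an esearch URL for the query; no network request is made."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# The same four fields used on the slide: author, title word, date, journal.
query = build_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
print(esearch_url(query))
```

Keeping the terms as (value, tag) pairs makes it easy to add or drop a field without retyping the whole query.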

Exercise

Retrieve all publications in which the first author is: Pe'er I and the last author is: Shamir R

Using Limits

Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years
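The same restrictions can also be typed directly into the query string instead of being set in the Limits tab. A rough sketch under stated assumptions: `J Comput Biol` is assumed to be the journal abbreviation for the Journal of Computational Biology, 2008 is taken as the current year, and the helper name `last_n_years` is hypothetical. The `start:end[DP]` form is PubMed's date-range syntax.

```python
def last_n_years(current_year, n):
    """Return a PubMed publication-date range covering the last n years."""
    return f"{current_year - n}:{current_year}[DP]"


# Author AND (either journal) AND date range.
# The journal abbreviation and the year 2008 are assumptions for illustration.
query = (
    'Friedman N[AU] AND (Bioinformatics[TA] OR "J Comput Biol"[TA]) AND '
    + last_n_years(2008, 5)
)
print(query)
```

The parentheses matter: without them the OR would not be limited to the two journal names.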

Google Scholar

http://scholar.google.com/

NCBI gene & protein databases: GenBank

GenBank is an annotated collection of all publicly available DNA sequences.

Holds 65 billion bases (Oct. 2007)

GenPept is a database of translated coding sequences from GenBank

Searching for CD4 human using Entrez

Search demonstration

Using Field Descriptions, Qualifiers, and Boolean Operators

Cd4[GENE] AND human[ORGN]
or
Cd4[gene name] AND human[organism]

List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

Boolean Operators:
AND
OR
NOT

Note: do not use the field Protein name [PROT], only GENE!
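One way to picture what the three Boolean operators do is as set operations on the records matched by each single-field search. The sketch below is illustrative only: the record IDs are made up, and it does not query Entrez.

```python
# Hypothetical record IDs returned by each single-field search.
cd4_gene = {"rec1", "rec2", "rec3"}   # Cd4[GENE]
human = {"rec2", "rec3", "rec4"}      # human[ORGN]
mouse = {"rec1", "rec5"}              # mouse[ORGN]

# AND keeps only records found by both searches (set intersection).
cd4_and_human = cd4_gene & human

# OR keeps records found by either search (set union).
human_or_mouse = human | mouse

# NOT removes records found by the second search (set difference).
cd4_not_mouse = cd4_gene - mouse

print(sorted(cd4_and_human))
print(sorted(human_or_mouse))
print(sorted(cd4_not_mouse))
```

AND narrows a search, OR widens it, and NOT excludes; this is why Cd4[GENE] AND human[ORGN] returns fewer records than either term alone.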

RefSeq

RefSeq: a sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

An explanation of GenBank records

Accession Numbers

GenBank, EMBL:
Two letters followed by six digits, e.g.: AY123456
One letter followed by five digits, e.g.: U12345

GenPept (a.a. translations of GenBank):
Three letters and five digits, e.g.: AAA12345

RefSeq:
RefSeq accession numbers can be distinguished from GenBank accessions by their distinct prefix format of [2 characters+underscore], e.g.: NP_015325.
NM_: nucleotide, NP_: protein

SWISS-PROT (another protein database):
All are six characters. Character/Format:
1 [O,P,Q] 2 [0-9] 3 [A-Z,0-9] 4 [A-Z,0-9] 5 [A-Z,0-9] 6 [0-9]
e.g.: P12345 and Q9JJS7

PDB (Protein Data Bank – structure database):
One digit followed by three letters or digits, e.g.: 1hxw
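The formats listed above translate naturally into regular expressions. The following is a rough sketch that encodes only the patterns given on this slide, not the databases' full current rules; the function name `classify` is hypothetical. It returns every database whose pattern matches, because some formats overlap (an accession such as P12345 fits both the one-letter GenBank/EMBL pattern and the SWISS-PROT pattern).

```python
import re

# Patterns transcribed from the slide; real-world accession rules are broader.
PATTERNS = {
    "GenBank/EMBL": re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d+"),
    "Swiss-Prot": re.compile(r"[OPQ]\d[A-Z0-9]{3}\d"),
    # Letters or digits are allowed after the first digit, since the PDB ID
    # 1G9M used later in these slides has a digit in the third position.
    "PDB": re.compile(r"\d[A-Z0-9]{3}"),
}


def classify(accession):
    """Return the names of all databases whose format matches the accession."""
    acc = accession.strip().upper()
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(acc)]


for example in ["AY123456", "U12345", "AAA12345", "NP_015325",
                "P12345", "Q9JJS7", "1hxw"]:
    print(example, classify(example))
```

Because the format alone cannot always identify the source database, it helps to record the database name together with each saved accession number.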

Swissprot

A protein sequence database which strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants

One entry for each protein

GenBank vs. Swiss-Prot

GenBank results Swiss-Prot results

Downloading & Fasta format

Fasta format

>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

Save Accession Numbers for future use (makes searching quicker):
Refseq: NP_000607
Swissprot: P01730
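A Fasta record is a header line starting with ">" followed by one or more lines of sequence. The sketch below shows one possible way to read such records and to pull the accession number out of a Swissprot-style header; the function names `parse_fasta` and `split_uniprot_header` are hypothetical, and the example sequence is only the first 100 residues of the CD4 entry above, wrapped over two lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_uniprot_header(header):
    """Split 'db|accession|entry_name description' into its four parts."""
    ids, _, description = header.partition(" ")
    db, accession, entry_name = ids.split("|")
    return db, accession, entry_name, description


# Truncated example: only the first 100 residues of the slide's sequence.
example = """>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLK
"""

header, sequence = parse_fasta(example)[0]
db, accession, entry_name, description = split_uniprot_header(header)
print(accession, entry_name, len(sequence))
```

For real work an established library parser is usually preferable; this hand-written version is only meant to make the file layout explicit.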

PDB: Protein Data Bank

Main database of 3D structures.

Includes ~47,000 entries (proteins, nucleic acids, others).

Proteins organized in groups, families etc.

Is highly redundant.

http://www.rcsb.org

CD4 in complex with gp120

gp120

CD4

PDB ID 1G9M

Organism specific

Model organisms have independent databases.

HIV database http://hiv-web.lanl.gov/content/index

Genecards

All-in-one database of human genes (a project by the Weizmann Institute)

Attempts to integrate as many databases and publications, and as much of the available knowledge, as possible

http://www.genecards.org

Summary

General and comprehensive databases: NCBI, EMBL, DDBJ

Genome specific databases: ENSEMBL, UCSC genome browser

Highly annotated databases:

Human genes:
• Genecards

Proteins:
• Swissprot, Refseq

Structures:
• PDB

The MOST important of all

1. Google (or any search engine)

And always remember:

2. RT(F)M – Read the manual!!

Help!

Read the Help section
Read the FAQ section
Google the question!