
Essential Bioinformatics and Biocomputing (LSM2104: Section I)

Biological Databases and Bioinformatics Software

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]

http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

January 2003

Essential Bioinformatics and Biocomputing (LSM2104: Section I)

Four lectures

Part 1: Biological databases:

Lecture 2. Biological information and databases
Lecture 3. More databases, retrieval systems, and database searching

Part 2: Software:

Lecture 4. Examples of the applications of bioinformatics software and basic principles
Lecture 5. Overview of bioinformatics software

Essential Bioinformatics and Biocomputing (LSM2104), NUS

Part 1: Biological databases

Part 1 outline:

1. Biological information and databases
– Overview and definition, types of biological databases

2. Popular databases, records, data format
– GenBank, SWISS-PROT, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed

3. Accessing biological databases, retrieval systems
– Entrez, SRS

4. Searching biological databases
– Data quality, coverage, redundancy, errors

Textbook: T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics. Biological databases: chapters 3 and 4

Biological Information

Cancer as an example:

Genes: growth genes, tumor suppressor genes

Proteins: growth factors, enzymes, receptors

Pathways: cell death

Systems: immune system, blood supply

Function: role of proteins, molecular interactions

Biological Information

Nucleic acids:
• DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic map, genetic disorder
• RNA sequence, secondary structure, 3D structure, interactions

Proteins:
• Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions
• Proteomics: expression profile, proteins in disease processes, etc.
• Ligands and drugs (inhibitors, activators, substrates, metabolites)

Biological Information

Pathways:
• Molecular networks, biological chain events, regulation, feedback, kinetic data

Function:
• Binding sites, interactions, molecular action (binding, chemical reaction, etc.)
• Biological effect (signaling, transport, feedback, regulation, modification, etc.)
• Functional relationship, protein families, motifs, and homologs

Biological databases

Purpose

1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data

At minimum, a database needs a dedicated tool for searching and data extraction.

– Web pages, books, journal articles, tables, text files, and spreadsheet files cannot be considered databases

• Reading materials:
– Baxevanis AD. The Molecular Biology Database Collection: 2002 update. Nucleic Acids Res. 2002 Jan 1;30(1):1-12.
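
The requirement that a database pairs stored data with a tool for searching and extracting it can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, the column names, and the three records are hypothetical placeholders, not entries from any real biological database.

```python
import sqlite3

# Hypothetical records (invented for illustration, not real database entries).
RECORDS = [
    ("X0001", "alpha-lactalbumin gene", "Homo sapiens"),
    ("X0002", "growth factor receptor mRNA", "Mus musculus"),
    ("X0003", "tumor suppressor gene", "Homo sapiens"),
]

def build_database():
    """Load the records into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (entry_id TEXT PRIMARY KEY, description TEXT, organism TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", RECORDS)
    return conn

def search(conn, keyword):
    """Return the IDs of entries whose description or organism contains the keyword."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT entry_id FROM entries "
        "WHERE description LIKE ? OR organism LIKE ? ORDER BY entry_id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]

conn = build_database()
print(search(conn, "gene"))
print(search(conn, "musculus"))
```

The same three records saved as a plain text file would hold identical data but would lack the search and extraction step, which is the distinction the slide draws.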


Biological databases

Lists of biological databases

• INFOBIOGEN Catalog of Databases http://www.infobiogen.fr/services/dbcat/

• Nucleic Acids Research Database Listing http://nar.oupjournals.org/cgi/content/full/30/1/1/DC1

– These serve as starting points for finding biological databases.
– More than 500 databases have been catalogued to date, and those in the two listings satisfy minimal criteria for content, access, and quality.
– Other sites can also serve as starting points.

Biological databases

• INFOBIOGEN Catalog of Databases

Type of database      No. of databases
DNA                   87
RNA                   29
Protein               94
Genomic               58
Mapping               29
Protein structure     18
Literature            43
Miscellaneous         153
Total                 511

Biological databases in Nucleic Acids Research

Type of database                               No. of databases
Major Sequence Repositories                    7
Comparative Genomics                           7
Gene Expression                                20
Gene Identification and Structure              30
Genetic and Physical Maps                      10
Genomic Databases                              48
Intermolecular Interactions                    5
Metabolic Pathways and Cellular Regulation     12
Mutation Databases                             33
Pathology                                      8
Protein Databases                              50
Protein Sequence Motifs                        18
Proteome Resources                             7
RNA Sequences                                  26
Retrieval Systems and Database Structure       3
Structure                                      32
Transgenics                                    2
Varied Biomedical Content                      18
TOTAL                                          336

Literature databases – PubMed (MEDLINE)

1. It contains entries for more than 11 million abstracts of scientific publications.

2. It lets users run keyword searches, provides links to a selection of full articles, and has text-mining capabilities, e.g. links to related articles and GenBank entries, among others.

3. Searching PubMed efficiently requires some skill. For example, searching with the keyword “interleukin” returns 108,366 matches.
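
A keyword search like the “interleukin” example can also be issued from a program. The sketch below assumes NCBI's E-utilities ESearch service, which the slides do not mention. The helper names `build_query_url` and `count_matches` are hypothetical. Counts returned today will differ from the 2003 figure quoted on the slide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI's public E-utilities ESearch endpoint (not part of the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query_url(keywords):
    """Build an ESearch URL asking PubMed for records that match the keywords."""
    params = {"db": "pubmed", "term": keywords, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"

def count_matches(keywords):
    """Query PubMed over the network and return the number of matching records."""
    with urlopen(build_query_url(keywords)) as response:
        result = json.load(response)
    return int(result["esearchresult"]["count"])

# Building the URL needs no network access; count_matches() does.
print(build_query_url("interleukin"))
```

Calling `count_matches("interleukin")` sends one request to NCBI, so it is left out of the default run.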


PubMed web-site (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)

PubMed Search (http://www3.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed)

Key words                               No. of entries
Cancer                                  1.45M
Cancer, blood supply                    22K
Cancer, blood supply, protein           3.9K
Cancer, blood supply, enzyme            1.5K
Cancer, blood supply, enzyme, drug      500

Cancer treatment by targeting blood supply:

Cancer growth depends on blood supply (why?) and thus requires the growth of new blood vessels (angiogenesis).

Proteins involved in angiogenesis may be potential anticancer targets.

You can find some of these targets by searching PubMed.

The key words “cancer angiogenesis enzyme drug” produce 856 entries.
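
The shrinking hit counts in this search follow from requiring every keyword to match. A minimal sketch of that idea uses a toy inverted index and set intersection. The five "abstracts" are invented for illustration, and the implicit-AND behaviour is a simplifying assumption; PubMed's real query processing is more elaborate.

```python
# Hypothetical mini-corpus: abstract ID -> text (invented for illustration).
ABSTRACTS = {
    1: "cancer growth and blood supply",
    2: "enzyme inhibitors as cancer drug candidates",
    3: "angiogenesis enzyme targeted by a cancer drug",
    4: "blood supply in wound healing",
    5: "angiogenesis in cancer",
}

def build_index(abstracts):
    """Map each word to the set of abstract IDs that contain it."""
    index = {}
    for doc_id, text in abstracts.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """Return IDs of abstracts containing every word of the query (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())  # each extra keyword can only narrow the set
    return hits

index = build_index(ABSTRACTS)
for query in ["cancer", "cancer angiogenesis", "cancer angiogenesis enzyme drug"]:
    print(query, "->", sorted(search(index, query)))
```

Each added keyword intersects the current hits with another set, so the result can stay the same size or shrink but never grow.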


Nucleic Acids databases

What information is in these databases:
• DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic map, genetic disorder
• RNA sequence, secondary structure, 3D structure, interactions

Nucleic Acids databases

DNA databases – GenBank, EMBL, DDBJ

1. General-purpose databases focusing on DNA sequences and their properties

2. GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage, and accession numbers are managed consistently between the three centers.

Reading materials:
– Textbook, chapter 4

DNA databases

• GenBank database (http://www.ncbi.nih.gov/Genbank/)
– Contains publicly available DNA sequences from more than 100,000 organisms.
– Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
– Accessible through Entrez, NCBI’s integrated retrieval system (studied later)
– Sequence similarity search tools: BLAST (studied later)

• EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/)
– Contains nucleotide sequences collected from all public sources.
– Accessible through the Sequence Retrieval System (SRS), which allows keyword searching (studied later)
– Sequence similarity search tools: Blitz, Fasta, and BLAST (studied later)

DNA databases: GenBank Web page

DNA databases

• An example from GenBank – flat file

– Human alpha-lactalbumin gene

This protein is a complex of two proteins, A and B. In the absence of the B protein, the enzyme catalyzes the transfer of galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).
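
The entry itself is not reproduced in this transcript, so the sketch below uses a made-up record to show how a GenBank-style flat-file header can be read by a program. It relies on the flat-file layout in which the keyword sits in the first 12 columns and the value follows, with continuation lines left blank in the keyword columns. The accession `XX000000` and all field values are hypothetical placeholders, not the human alpha-lactalbumin entry, and `parse_header` is an illustrative helper rather than an NCBI tool.

```python
# Made-up record for illustration only; XX000000 is a placeholder accession.
RECORD = "\n".join([
    f"{'LOCUS':<12}XX000000 120 bp DNA linear PRI 01-JAN-2003",
    f"{'DEFINITION':<12}Hypothetical example gene, used only to illustrate the",
    " " * 12 + "flat-file layout.",
    f"{'ACCESSION':<12}XX000000",
    f"{'SOURCE':<12}Homo sapiens (human)",
    f"{'  ORGANISM':<12}Homo sapiens",
    f"{'FEATURES':<21}Location/Qualifiers",
    "ORIGIN",
    "//",
])

def parse_header(flat_file_text):
    """Collect header fields (everything before FEATURES) into a dict."""
    fields = {}
    current = None
    for line in flat_file_text.splitlines():
        if line.startswith("FEATURES"):
            break  # the header ends where the feature table begins
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None and value:
            # Blank keyword columns mark a continuation of the previous field.
            fields[current] += " " + value
    return fields

header = parse_header(RECORD)
for key, value in header.items():
    print(f"{key}: {value}")
```

For real work an established parser is the safer choice; this sketch only makes the column-based structure of the header visible.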


A GenBank entry – HEADER

GenBank Entry – Links provided in the Header

• MapViewer – find the gene position in the chromosome

• Related Sequences – other entries related to this gene (or sequence)

• OMIM – link to the catalog of human genes and genetic disorders

• Protein – retrieve the protein record from GenPept

• Medline and PubMed – literature abstracts related to this gene

• Taxonomy – classification of organisms

• UniGene – unified gene data

• UniSTS – unified sequence tagged sites, marker and mapping data

• LinkOut – links to publishers, aggregators, libraries, biological databases, sequence centers, and other Web resources

• REFSEQ – reference sequence standards

Note: These links are representative. Other links may also be found in GenBank entries.

GenBank entry - FEATURES


GenBank Entry – Links provided in the Feature section

LocusID – locus and display of genomic and mRNA sequences

MIM – link to OMIM description, other entries for this sequence

EC_number – link to the corresponding cataloged enzymes

Protein_id – retrieve protein record from GenPept

CD – conserved protein domain (SMART)

CDD – conserved protein domain (Pfam)

Biological databases: GenBank - SEQUENCE


GenBank - NOTES

The majority of GenBank entries have a form similar to our example.

When accessing the database, note the following:

• Some entries are huge, containing as many as 30,000 lines. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)

• Some entries have contig information instead of sequence information. (NT_021877 Homo sapiens chromosome 1 working draft sequence segment)

• Some entries are derived from cDNA sequences and thus represent putative genes/proteins. These should be used with caution. (AK007430. Mus musculus 10 d... [gi:12840976]).

• Some annotations are predicted using automated analysis. These should also be used with caution. (XM_131483 Mus musculus simi...[gi:20832685]).

GenBank - Statistics

Year    Base pairs        Sequences
1982    680,338           606
1992    101,008,486       78,608
2000    11,101,066,288    10,106,023
2001    15,849,921,438    14,976,310

The data size is large and increasing fast.
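
How fast the data grows can be estimated from the table itself. The sketch below takes the base-pair totals as given and assumes constant exponential growth within each interval, which is a simplification. The function names are illustrative.

```python
import math

# Base-pair totals copied from the statistics table.
BASE_PAIRS = {
    1982: 680_338,
    1992: 101_008_486,
    2000: 11_101_066_288,
    2001: 15_849_921_438,
}

def annual_growth_factor(year_a, year_b):
    """Average multiplicative growth per year between two years in the table."""
    ratio = BASE_PAIRS[year_b] / BASE_PAIRS[year_a]
    return ratio ** (1 / (year_b - year_a))

def doubling_time_years(year_a, year_b):
    """Years needed to double, assuming constant exponential growth over the interval."""
    return math.log(2) / math.log(annual_growth_factor(year_a, year_b))

for start, end in [(1982, 1992), (1992, 2000), (2000, 2001)]:
    factor = annual_growth_factor(start, end)
    doubling = doubling_time_years(start, end)
    print(f"{start}-{end}: x{factor:.2f} per year, doubling in about {doubling:.1f} years")
```

On these figures the database doubled in size in under two years in every interval, which is what "large and increasing fast" means in practice.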


Biological Databases

Database Searching

1. Databases must have methods for accessing and extracting the stored data.

2. The most basic search is keyword searching. A keyword can be any word that occurs somewhere in the database records. It can be the name of a gene or protein (e.g. lactalbumin), a species (e.g. Homo sapiens, human), a taxonomy term (e.g. primates), or a word from the reference title (e.g. cancer).

3. Other search types include entry ID number and sequence.

4. Databases typically have hyperlinks that provide access to additional information related to the entry from other sources.

Biological databases: OMIM

Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/Omim/)

• The OMIM database contains abstracts and texts describing genetic disorders to support genomics efforts and clinical genetics. It provides gene maps and known disorder maps in tabular listing formats, and supports keyword search.

Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucleic Acids Res. 2002 30: 52-55.

Biological databases: OMIM web-page

Biological databases: OMIM search engine

Biological databases: OMIM statistics

All Entries : 14088

Established Gene Locus : 10476

Phenotype Descriptions : 1194

Other Entries : 2418

Biological databases

Protein databases

1. SWISS-PROT (http://us.expasy.org/sprot/sprot-top.html) is a curated database focusing on a high level of annotation (sequence, function, structure, post-translational modifications, variants, etc.) of proteins.

2. TrEMBL is a computer-annotated supplement to SWISS-PROT.

Reading materials: Textbook, chapter 3

Protein databases

What is in these databases:
• Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions

• Proteomics: expression profile, proteins in disease processes, etc.

• Ligands and drugs (inhibitors, activators, substrates, metabolites)

Protein databases – SWISS-PROT

Notes:
• SWISS-PROT provides high-quality annotations and detailed information about sequence, structural, functional, and other properties of proteins.

• It provides a rich set of links to other sources of information on SWISS-PROT entries. Unfortunately, some of the links will not work at all times, because the Web changes constantly.

• It also provides a rich set of protein analysis tools.

SWISS-PROT web-page

SWISS-PROT entry P00709


SWISS-PROT entry P00709


SWISS-PROT entry P00709


Biological databases: Protein structure database: PDB (http://www.pdb.org)

1. More than 18,000 macromolecular structures of proteins, peptides, viruses, protein/nucleic acid complexes, nucleic acids, and carbohydrates.

2. Among the oldest databases – the first structure was deposited in 1972.

3. The number of newly deposited structures has been growing steadily (3298 in 2001, and 1486 from Jan 1 to June 5, 2002).

4. Structures are determined mainly by X-ray diffraction and NMR.

5. It contains tools for keyword search, comprehensive visualization, and information extraction, such as sequence, geometry, and structural-neighbor details.
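
The two deposition counts in item 3 cover periods of different length, so comparing them needs one extra step. A rough sketch, assuming deposits arrive at a constant daily rate (a simplification), scales the partial 2002 count to a full year. The function name `annualized` is illustrative.

```python
from datetime import date

DEPOSITS_2001 = 3298           # full year 2001, from the slide
DEPOSITS_2002_PARTIAL = 1486   # Jan 1 to June 5, 2002, from the slide

def annualized(count, start, end, days_in_year=365):
    """Scale a partial-year count to a full year, assuming a constant daily rate."""
    days_elapsed = (end - start).days + 1  # count both the first and the last day
    return count * days_in_year / days_elapsed

projected_2002 = annualized(DEPOSITS_2002_PARTIAL, date(2002, 1, 1), date(2002, 6, 5))
print(f"Projected 2002 deposits: about {projected_2002:.0f} (2001: {DEPOSITS_2001})")
```

Under that assumption the projected 2002 total exceeds the 2001 figure, consistent with the steady growth the slide describes.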


Biological databases: PDB web-page (http://www.rcsb.org/pdb/)

Biological databases: A PDB entry (http://www.rcsb.org/pdb/)

Biological databases: PDB statistics

Biological databases: Summary of Today’s lecture

• Types of biological information, data, and databases

• Simple data retrieval methods

• Popular databases: PubMed, GenBank, SWISS-PROT, OMIM, PDB

• Statistics:
– Large number of publications (MEDLINE: >12M since 1960)
– Large amount of sequence data (DNA: >14M, Protein: >120K)
– Fair amount of 3D structure data (Protein: >14K, Nucleic acid: >1K)