DATA ACQUISITION FROM BIO-DATABASES AND BLAST
Natapol Pornputtapong
18 January 2018
DATABASE
• Collections of data
• To share – multi-user interface
• To prevent data loss
• To make sure users get the right data
Bioinformatics for Phylogenetic Analysis Workshop 2
LIBRARY -> DIGITAL LIBRARY
DATABASE: A LIBRARY OF DATA
Database
• Files, tables, records
• Data structure
• Database management system
• Programming interface
• User interface
Library
• Books
• Building, shelves
• Librarian
• Protocols, SOPs
• Services
ADVANTAGE OF DATABASE
• Data integrity
• Smaller space
• Data availability
• Speed
DATABASE FOR USERS
[Diagram: Users and Database, linked by Search, Download and Submission]
HOW TO CHOOSE DATABASE?
• The NAR online Molecular Biology Database Collection lists 1695 bio-databases in 15 categories
DATA CONTENT
• Literature
• DNA sequence
• Protein sequence
[Figure labels: GenBank, RefSeq, TrEMBL]
CONCEPTS OF DATABASE
• Primary database
• Secondary database
[Diagram labels: Source (x3), Database, interface]
PRIMARY & SECONDARY DB
| | Primary database | Secondary database |
|---|---|---|
| Synonyms | Archival database | Curated database; knowledgebase |
| Source of data | Direct submission of experimentally derived data from researchers | Results of analysis, literature research and interpretation, often of data in primary databases |
| Examples | • ENA, GenBank and DDBJ (nucleotide sequence) • ArrayExpress Archive and GEO (functional genomics data) • Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures) | • InterPro (protein families, motifs and domains) • UniProt Knowledgebase (sequence and functional information on proteins) • Ensembl (variation, function, regulation and more layered onto whole genome sequences) |
DATA COLLECTION CRITERIA
GenBank RefSeq
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
ACCESSIBILITY: TOOLS & INTERFACES
• NCBI Entrez
• RESTful interface to the ENA
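Besides the web pages, NCBI Entrez can be reached programmatically through its E-utilities. As a minimal sketch, assuming the public `esearch.fcgi` and `efetch.fcgi` endpoints, the code below only builds the request URLs and does not send them. The helper names, the search term and the accession are illustrative choices, not part of the workshop material.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities, the programmatic face of Entrez.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns record IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(db, ids):
    """Build an EFetch URL that returns the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Illustrative query and accession (assumptions for this example).
search = esearch_url("nucleotide", "cytochrome b AND Homo sapiens[Organism]")
fetch = efetch_fasta_url("nucleotide", ["NC_012920.1"])
print(search)
print(fetch)
```

Opening either URL in a browser, or fetching it with any HTTP client, should return the same data that the Entrez web pages display.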
NCBI SEARCH TOOL
SIMPLE SEARCH
BOOLEAN OPERATOR
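Boolean operators (AND, OR, NOT) combine search terms into one query. The sketch below assembles such a query as a string, using Entrez-style field tags in square brackets. The helper functions are hypothetical conveniences, and the gene, species and title terms are illustrative assumptions.

```python
def field(term, tag):
    """Attach an Entrez-style field tag, e.g. field("Homo sapiens", "Organism")."""
    return f"{term}[{tag}]"


def combine(operator, *terms):
    """Join terms with AND, OR, or NOT and wrap the result in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Either of two species ...
species = combine("OR", field("Homo sapiens", "Organism"),
                  field("Pan troglodytes", "Organism"))
# ... restricted to one gene, excluding records titled as partial.
query = combine("NOT", combine("AND", field("cytb", "Gene Name"), species),
                field("partial", "Title"))
print(query)
```

The explicit parentheses make the evaluation order unambiguous, which matters once AND, OR and NOT are mixed in one query.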
FILTER
• Limit with filter
• Advanced search builder
RESULTS
BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL
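The "local alignment" in the name means finding the best-matching region between two sequences rather than aligning them end to end. BLAST itself is a fast heuristic. The sketch below instead uses the exact Smith-Waterman dynamic-programming recurrence, a different and slower technique, only to illustrate what a local alignment score is. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best Smith-Waterman local alignment score of a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment here
                previous[j - 1] + step,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The shared region "ACGT" scores 4; the unrelated flanks do not reduce it.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

Because every cell can fall back to zero, poorly matching flanks do not penalize a good internal match, which is the property that makes the alignment local.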
MAJOR BLAST PROGRAMS
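The five standard BLAST programs differ in the type of query and database they compare. As a small sketch, the lookup below picks a program from those two types; the function name and its `translate_both` flag are hypothetical names introduced for this example.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in six frames
}


def choose_blast(query_type, db_type, translate_both=False):
    """Pick a BLAST program for the given query and database types."""
    if translate_both:
        # tblastx compares translated nucleotide query and database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_blast("protein", "nucleotide"))
```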
BLAST SEARCH
OTHER BLAST PROGRAMS
WORLD OF FILES
• Text files
• Binary files
TEXT FILES: WORLD OF FORMATS
• Documents: .rtf, .txt (MS Word .doc and .docx files are binary, not plain text)
• Sequence: FASTA (.fasta), GenBank (.gbk)
• Protein structure: PDB (.pdb)
FASTA FORMAT
>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMP
FHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDL
SMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVY
LPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKI
SQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
>…
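A FASTA file is a series of records, each made of a header line starting with ">" followed by one or more sequence lines. As a minimal sketch of reading that layout, the parser below joins the wrapped sequence lines of each record. The function name and the two short example records are illustrative assumptions, not data from the workshop.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first illustrative record
ACGTAC
GTAC
>seq2 second illustrative record
TTGGCC
"""
for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```

The first word of the header is conventionally the sequence identifier, which is why the usage loop splits on whitespace.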
GENBANK FORMAT
NEXUS FORMAT
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=8 NCHAR=1202;
FORMAT MISSING=? DATATYPE=PROTEIN GAP=-;
OPTIONS GAPMODE=MISSING;
MATRIX
[ 10 20 ...]
[ ---------|---------|-...]
Homo_sapiens_4379045 TERLVLPPPDPLDLPLRAVEL...
Pan_troglodytes_114606536 TERLVLPPPDPLDLPLRAVEL...
Ailuropoda_melanoleuca_301788522 TERLVLPPPDPLDLPLRPVEL...
Mus_musculus_87252727 TERLVLPPLDPLNLPLRALEV...
Danio_rerio_113678409 MDKIDLPPVGPDDLPLSLLEM...
Xenopus_tropicalis_301627725 MNTLDLSNRDPLDLPLSVLEL...
Monodelphis_domestica_126309591 TERLVLPPRGPLDLPLCALEL...
Canis_familiaris_73972333 TERLALPPPDPLDLPLRPVEL...;
END;
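The DATA block above declares the number of taxa (NTAX) and alignment columns (NCHAR) before listing one aligned sequence per taxon. As a rough sketch of producing that layout, the function below writes a minimal block from a dictionary of aligned sequences. The function name and the two short taxa are hypothetical, and only the commands shown in the example above are emitted, apart from OPTIONS.

```python
def write_nexus(alignment, datatype="PROTEIN"):
    """Format a dict of name -> aligned sequence as a NEXUS DATA block."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment) + 2  # pad names into a column
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(alignment)} NCHAR={nchar};",
        f"FORMAT MISSING=? DATATYPE={datatype} GAP=-;",
        "MATRIX",
    ]
    lines += [f"{name.ljust(width)}{seq}" for name, seq in alignment.items()]
    lines += [";", "END;"]
    return "\n".join(lines)


# Illustrative two-taxon alignment (assumption for this example).
print(write_nexus({"Taxon_A": "TERL-V", "Taxon_B": "TERLAV"}))
```

Checking that all sequences share one length before writing catches unaligned input early, since NCHAR must describe every row of the matrix.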
NEXT
Inputs -> Analysis -> Results
QUESTIONS?