Sequence Formats Suchat Udomsopagit

Upload: geoffrey-shaw

Post on 12-Jan-2016


Page 1: Sequence Formats Suchat Udomsopagit. Sequences DNA and protein sequences Can be read and written in a variety of formats Sequence formats are ASCII TEXT

Sequence Formats

Suchat Udomsopagit

Sequences

DNA and protein sequences
  Can be read and written in a variety of formats
Sequence formats are ASCII text
  A required arrangement of characters, symbols and keywords that specify things, e.g. the sequence, ID name, comments, etc.
  A program looks for these when reading a sequence entry

Sequences

Never any hidden, unprintable 'control' characters in any sequence format.

All standard sequence formats can be printed out or viewed simply by displaying their file.
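The "no control characters" rule can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `find_control_characters` is hypothetical, and treating tabs and carriage returns as acceptable is an assumption.

```python
def find_control_characters(text):
    """Return (line, column, code point) for each character outside printable ASCII."""
    allowed_whitespace = {"\t", "\r"}  # assumption: tabs and CR line endings are tolerated
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some control
    # characters (e.g. form feed) and hide them from the check.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in allowed_whitespace:
                continue
            if not 32 <= ord(ch) <= 126:
                hits.append((line_no, col, ord(ch)))
    return hits


sample = ">seq1\nACGT\x07ACGT\n"  # hypothetical entry with a BEL character hidden in it
print(find_control_characters(sample))
```

An empty result suggests the file is plain printable ASCII; any hit points at the line and column to inspect in a text editor.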

MS Word

Microsoft Word format is not a sequence format.
If using a word processor to type a sequence:
  Save the sequence to a file as ASCII text
  Try selecting: File, Save As, Save as type: Text
Do not use word processors to write sequences. Simple text editors should be used:
  Notepad, WordPad
  For UNIX: Pico, nedit

Some common formats

Single sequence per file   Multiple sequences per file      Either single or multiple sequences per file
gcg                        Multiple sequence format (msf)   fasta
staden                     clustal                          embl
                           phylip

Plus some others, e.g. MacVector, GeneWorks, DNA Strider etc.

Each sequence analysis program has its own format for storing sequence data!

FASTA format

First line is a description line
  A single line
  Contains a greater-than (">") symbol in the first column
Followed by lines of sequence data
It is recommended that all lines of text be shorter than 80 characters in length
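The rules above are simple enough to sketch as a small writer. This is an illustrative example only: `format_fasta` and its `width` parameter are hypothetical names, and the default of 60 characters per sequence line is an assumption chosen to stay under the recommended 80.

```python
def format_fasta(description, sequence, width=60):
    """Build one FASTA record: a '>' description line, then wrapped sequence lines."""
    if "\n" in description or "\r" in description:
        raise ValueError("the description must fit on a single line")
    if not 0 < width < 80:
        raise ValueError("width should be between 1 and 79 characters")
    lines = [">" + description]  # '>' goes in the first column
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


print(format_fasta("mysequence hypothetical example", "ACGT" * 40), end="")
```

The sketch does not shorten an over-long description line; that is left to the caller.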

FASTA format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

The first line (starting with ">") is the description line; the remaining lines are the sequence.

Standard IUB/IUPAC amino acid and nucleic acid codes

Lower-case letters are accepted and are mapped into upper-case
No numerical digits in the sequence
The nucleic acid codes supported are:

A  adenosine           M  A C (amino)
C  cytidine            S  G C (strong)
G  guanine             W  A T (weak)
T  thymidine           B  G T C
U  uridine             D  G A T
R  G A (purine)        H  A C T
Y  T C (pyrimidine)    V  G C A
K  G T (keto)          N  A G C T (any)
                       -  gap of indeterminate length
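The code table above maps directly onto a lookup dictionary. The sketch below is illustrative: the names `IUPAC_NUCLEOTIDES`, `normalize_nucleotides` and `code_matches` are hypothetical, and rejecting every character outside the table (apart from the gap symbol) is an assumption about how strict a reader should be.

```python
# Each code maps to the bases it can stand for, as listed in the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def normalize_nucleotides(sequence):
    """Map lower case to upper case and reject digits or unknown symbols."""
    sequence = sequence.upper()
    unknown = sorted(set(sequence) - set(IUPAC_NUCLEOTIDES) - {"-"})
    if unknown:
        raise ValueError("unsupported characters: " + "".join(unknown))
    return sequence


def code_matches(code, base):
    """True if the (possibly ambiguous) code can stand for the given base."""
    return base in IUPAC_NUCLEOTIDES[code]


print(normalize_nucleotides("acgtryn-"))
print(code_matches("R", "A"), code_matches("Y", "A"))
```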

Standard IUB/IUPAC amino acid and nucleic acid codes

The accepted amino acid codes are:

A  alanine                    P  proline
B  aspartate or asparagine    Q  glutamine
C  cysteine                   R  arginine
D  aspartate                  S  serine
E  glutamate                  T  threonine
F  phenylalanine              U  selenocysteine
G  glycine                    V  valine
H  histidine                  W  tryptophan
I  isoleucine                 Y  tyrosine
K  lysine                     Z  glutamate or glutamine
L  leucine                    X  any
M  methionine                 *  translation stop
N  asparagine                 -  gap of indeterminate length

FASTA format

Multiple sequences
Blank lines inserted

> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCATCAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTACAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACGCAGTCGTAGCATGCTAACGTCGATCGTA

> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAACAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
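A reader for files like the one above only needs to watch for ">" lines and skip blank ones. This is a minimal sketch under those assumptions; `parse_fasta` is a hypothetical name, and real tools handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)  # sequence data may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = "> mysequence\nACGTCGATCG\nATCGATGCAT\n\n> mysequence2\nACCGTACGAT\n"
for name, seq in parse_fasta(example):
    print(name, len(seq))
```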

Genbank File Format

Sequence Data Entries

First line
  Begins with 'LOCUS' in the first 5 spaces
  Followed by the genetic locus name or identifier
  Length of the sequence
  Type of sequence
Second line
  DEFINITION in the first 10 spaces
  Followed by free-form text to identify the sequence
Third line
  ACCESSION in the first 9 spaces
  Spaces 13 - 18 must hold the primary accession number
Fourth line
  ORIGIN in the first 6 spaces
  Nothing else is required on this line; it indicates that the nucleic acid sequence begins on the next line

Genbank File Format

Fifth line
  Begins the nucleotide sequence
  The first 9 spaces of each sequence line may either be blank or may contain the position in the sequence of the first nucleotide on the line
  The next 66 spaces hold the nucleotide sequence in six blocks of ten nucleotides
  Each of the six blocks begins with a blank space followed by ten nucleotides
  Thus the first nucleotide is in space 11 of the line while the last is in space 75
Last line
  Must have // in the first 2 spaces to indicate termination of the sequence
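The minimal layout described above (LOCUS, DEFINITION, ACCESSION, ORIGIN, sequence lines, //) can be read with a short state machine. This is a sketch for that minimal layout only, not a full GenBank parser; `parse_genbank_minimal` and the sample entry are hypothetical.

```python
def parse_genbank_minimal(text):
    """Read one minimal GenBank-style entry into a dictionary."""
    entry = {"locus": None, "definition": None, "accession": None, "sequence": ""}
    chunks = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # terminator line ends the entry
        if in_sequence:
            # Drop the position number and the blanks between blocks of ten.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            fields = line.split()
            entry["locus"] = fields[1] if len(fields) > 1 else None
        elif line.startswith("DEFINITION"):
            entry["definition"] = line[10:].strip()
        elif line.startswith("ACCESSION"):
            fields = line.split()
            entry["accession"] = fields[1] if len(fields) > 1 else None
        elif line.startswith("ORIGIN"):
            in_sequence = True  # the sequence begins on the next line
    entry["sequence"] = "".join(chunks)
    return entry


sample = (
    "LOCUS       TEST01       20 bp    DNA\n"
    "DEFINITION  Hypothetical test entry.\n"
    "ACCESSION   X00000\n"
    "ORIGIN\n"
    "        1 acgtacgtac gtacgtacgt\n"
    "//\n"
)
print(parse_genbank_minimal(sample))
```

Continuation lines of DEFINITION and the feature table are ignored here; they would need extra handling in a real reader.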

Genbank File Format

LOCUS   name   size bp   type   date

name: Genbank locus name
size: total base count
type: DNA, RNA, PROTEIN, MASK, or TEXT
date: dd-MON-yyyy

Genbank Example

LOCUS       NM_079846    1190 bp    mRNA    linear   INV 15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
VERSION     NM_079846.1  GI:17864111
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1190)
  AUTHORS   Shaw-Lee,R.L., Lissemore,J.L. and Sullivan,D.T.
  TITLE     Structure and expression of the triose phosphate isomerase (Tpi)
            gene of Drosophila melanogaster
  JOURNAL   Mol. Gen. Genet. 230 (1-2), 225-229 (1991)
  MEDLINE   92079900
   PUBMED   1720860
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AE003772.1.
FEATURES             Location/Qualifiers
     source          1..1190
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E1-99E2"
     gene            1..1190
                     /gene="Tpi"
                     /note="TPI; TPIS; CG2171; CT6334"
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
     CDS             181..924
                     /gene="Tpi"
                     /EC_number="5.3.1.1"
                     /note="Nucleotide sequence of the Celera sequence differs
                     from the published sequence for this transcript."
                     /codon_start=1
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
                     /product="Triose phosphate isomerase"
                     /protein_id="NP_524585.1"
                     /db_xref="GI:17864112"
                     /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPA
                     IYLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFG
                     ESDALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVV
                     VAYEPVWAIGTGQTATPDQAQEVHAFLRQWLSDNISKEVSASLRIQYGGSVTAANAKE
                     LAKKPDIDGFLVGGASLKPEFVDIINARQ"
     misc_feature    187..921
                     /note="TIM; Region: Triosephosphate isomerase"
BASE COUNT      279 a    368 c    323 g    220 t
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac ttgaaattat cagttccaaa cactctaata gcagtcccct tgttttgtcc
      121 cccgatccgc agttctacgc caatttcagc accgattgca ccgacagcaa cagcaacaac
      181 atgagccgaa agttctgcgt gggaggcaac tggaagatga acggcgacca gaagtccatc
      241 gccgagatcg ccaagaccct gagctcggcc gccctcgacc ccaacacgga ggtggtcatc
      301 ggctgcccgg ccatctacct gatgtacgcc cgcaacctgc tgccctgcga gctgggtctg
      361 gccggccaga atgcctacaa ggtggccaag ggcgcattca ccggcgagat ctcccctgcg
      421 atgctgaagg
//

EMBL File Format

European Molecular Biology Laboratory
First line
  Begins with the two letters ID
  Followed by the EMBL identifier
Second line
  AC, followed by the accession number
Third line
  DE, followed by a free-form text definition
Fourth line
  SQ, followed by the length of the sequence
  After the sequence length there is a blank space and the two letters BP

EMBL File Format

Fifth line
  Nucleotide sequence begins
  Each line of sequence begins with four blank spaces
  The next 66 spaces hold the nucleotide sequence in 6 blocks of 10 nucleotides
  Each of the six blocks begins with a blank space followed by ten nucleotides
  Thus the first nucleotide is in space 6 of the line while the last is in space 70
The last line ~ terminator line
  Two characters // in the first two spaces
Multiple sequences may appear in each file

EMBL Example (1)

LINE 1 :ID ID_name
LINE 2 :AC Accession number
LINE 3 :DE Describe the sequence any way you want
LINE 4 :SQ Length BP
LINE 5 : ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTA...
LINE 6 : ACGT...
LINE 7 ://
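Because EMBL files may hold several entries, a reader has to reset its state at every // line. The sketch below covers only the minimal ID / AC / DE / SQ layout shown above; `parse_embl_minimal` and the two sample entries are hypothetical.

```python
def parse_embl_minimal(text):
    """Return a list of entries (dicts) from minimal EMBL-style text."""
    entries = []
    entry, in_sequence = None, False
    for line in text.splitlines():
        if line.startswith("//"):
            if entry is not None:
                entries.append(entry)
            entry, in_sequence = None, False
        elif line.startswith("ID"):
            fields = line[2:].split()
            entry = {"id": fields[0] if fields else None,
                     "accession": None, "description": "", "sequence": ""}
            in_sequence = False
        elif entry is None:
            continue  # ignore anything before the first ID line
        elif in_sequence:
            # Keep letters only: drops blanks and trailing position numbers.
            entry["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("AC"):
            fields = line[2:].split()
            if fields:
                entry["accession"] = fields[0].rstrip(";")
        elif line.startswith("DE"):
            entry["description"] = (entry["description"] + " " + line[2:].strip()).strip()
        elif line.startswith("SQ"):
            in_sequence = True
    return entries


sample = (
    "ID   TEST1\nAC   X00001;\nDE   Hypothetical entry one\nSQ   14 BP\n"
    "     ACGTACGTAC GTAC\n//\n"
    "ID   TEST2\nAC   X00002;\nDE   Hypothetical entry two\nSQ   4 BP\n"
    "     TTTT\n//\n"
)
for e in parse_embl_minimal(sample):
    print(e["id"], e["accession"], e["sequence"])
```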


EMBL Example (2)

EMBL Example (3)

ID   DMTPIG     standard; DNA; INV; 3419 BP.
XX
AC   X57576; S70377;
XX
SV   X57576.1
XX
DT   20-JAN-1992 (Rel. 30, Created)
DT   19-AUG-1996 (Rel. 49, Last updated, Version 10)
XX
DE   D.melanogaster Tpi gene for Triosephosphate isomerase
XX
KW   glycolytic enzyme; tpi gene; triosephosphate isomerase.
XX
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila.
XX
RN   [1]
RP   1-3419
RA   Sullivan D.T.;
RT   ;
RL   Submitted (07-FEB-1991) to the EMBL/GenBank/DDBJ databases.
RL   D.T. Sullivan, Biological Research Laboratories, 130 College Pl, Syracuse
RL   University, Syracuse, NY 13244, USA
XX
RN   [3]
RX   MEDLINE; 92079900.
RA   Shaw-Lee R.L., Lissemore J.L., Sullivan D.T.;
RT   "Structure and expression of the triose phosphate isomerase (Tpi) gene of
RT   Drosophila melanogaster.";
RL   Mol. Gen. Genet. 230:225-229(1991).
XX
DR   FLYBASE; FBgn0003738; Tpi.
DR   SWISS-PROT; P29613; TPIS_DROME.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3419
FT                   /db_xref="taxon:7227"
FT                   /germline
FT                   /organism="Drosophila melanogaster"
FT                   /strain="Oregon-R"
FT                   /clone_lib="EMBL-4"
FT   CDS             join(2237..2773,2830..3036)
FT                   /db_xref="FLYBASE:FBgn0003738"
FT                   /db_xref="SWISS-PROT:P29613"
FT                   /gene="Tpi"
FT                   /EC_number="5.3.1.1"
FT                   /product="triosephosphate isomerase"
FT                   /protein_id="CAA40804.1"
FT                   /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPAI
FT                   YLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFGES
FT                   DALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVVVAY
FT                   EPVWAIGTGKTATPDQAQEVHASLRQWLSDNISKEVSASLRIQYGGSVTAANAKELAKK
FT                   PDIDGFLVGGASLKPEFLDIINARQ"
FT   mRNA            join(2004..2028,2186..2773,2830..3036)
FT                   /gene="Tpi"
FT   prim_transcript 2004..3296
FT   exon            2008..2032
FT                   /number=1
FT   exon            2189..2773
FT                   /number=2
FT   exon            2830..3296
FT                   /number=3
FT   intron          2033..2188
FT                   /number=1
FT   intron          2774..2829
FT                   /number=2
FT   misc_feature    2147..2151
FT                   /note="intron 1 lariat sequence"
FT   misc_feature    2789..2793
FT                   /note="intron 2 lariat sequence"
FT   polyA_signal    3258..3262
XX
SQ   Sequence 3419 BP; 855 A; 933 C; 849 G; 778 T; 4 other;
     gatctcgagc gagaaatgtg gaacatagtg gaggcctcca gtggcgccga gctgggtgaa        60
     accagctacg agttcccttc ccccgctccg gttcccagcg cagcagtgaa cgaaatagca       120
     gttccacagt cccaccagct cctcctgctc ctgcgaagcc ctcagttccg tccgcctcct       180
     atgacaacca caactacagt ttcagccagg atgaggacga agatgatgat gatctggagt       240
     ttgaggacgt attcgtgccg gccagctctg ttccaaatcc cgttcagcct ggcatagatc       300
     ccgtggaact gcgtcgctcc ctggctttgg tcatgaggga gaaattgcga tcggatgaca       360
     cggactccag gccaatgggc aacaatcagg atcttcccat agatgaacag tccagggaga       420
     gaccgctctc cactcaaaca tctcccacaa atggcccact tccggctctt ctgagggcca       480
     aactgcttgc tgggcaactc nnnncaatag cgctcactgc ctgccaggat ccacggcgag       540
     tcctgctccc caggagcaat ccggtatctt tgtgatcgat agtgaggcga gtcccggctc       600
     aaatgggcac aagcctaagt atcgaaaggg cacggcattc actcggagtt cgctgaagaa       660
     gagccgatcc tgcaactgta gctccatcgc taagggacga ggggtccacg acgagcccag       720
     cagtaatctc tgcagggatc aggagtcctc tgtacttcca cagcatccgc agccagccaa       780
     ccatcccaca gagaactttt
//

National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

First line
  Begins with a greater-than symbol (>)
  Immediately followed by a 2-character sequence type specifier

    Specifier   Sequence type
    P1          protein, complete
    F1          protein, fragment
    DL          DNA, linear
    DC          DNA, circular
    RL          RNA, linear
    RC          RNA, circular
    N1          functional RNA, other than tRNA
    N3          tRNA

  Then a semicolon (;)
  Followed by the sequence name or identification code for the NBRF database
    Four to six letters and numbers

>P1;CBRT

National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

Second line contains two kinds of information
First:
  Sequence name
  Followed by 3 characters: a hyphen with a blank space on either side (" - ")
Second:
  Organism or organelle name

>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)

National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format

Amino acid or nucleic acid sequence begins on line three
  Free format
  May be interrupted by blanks for ease of reading
Protein sequences
  May contain special punctuation to indicate various indeterminacies in the sequence
The last character in the sequence must be an asterisk (*).

NBRF/PIR Example

LINE 1 :>P1;CBRT
LINE 2 :Cytochrome b - Rat mitochondrion (SGC1)
LINE 3 :M T N I R K S H P L F K I I N H S F I D L P A P S
LINE 4 : VTHICRDVN Y GWL IRY
LINE 5 :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
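The three-part layout above (type and code line, title line, free-format sequence ending in an asterisk) can be read as follows. This is a minimal sketch: `parse_pir` is a hypothetical name, and the special punctuation that PIR allows inside protein sequences is kept as-is rather than interpreted.

```python
def parse_pir(text):
    """Return a list of records (dicts) from NBRF/PIR-formatted text."""
    records = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">") and line[3:4] == ";":
            type_code, identifier = line[1:3], line[4:].strip()
            title = lines[i + 1].strip() if i + 1 < len(lines) else ""
            # The title holds "name - organism", separated by " - ".
            name, _, organism = title.partition(" - ")
            residues, done = [], False
            i += 2
            while i < len(lines) and not done and not lines[i].startswith(">"):
                for ch in lines[i]:
                    if ch == "*":  # the asterisk terminates the sequence
                        done = True
                        break
                    if not ch.isspace():  # blanks are only there for readability
                        residues.append(ch)
                i += 1
            records.append({"type": type_code, "id": identifier, "name": name,
                            "organism": organism, "sequence": "".join(residues)})
        else:
            i += 1
    return records


sample = (
    ">P1;CBRT\n"
    "Cytochrome b - Rat mitochondrion (SGC1)\n"
    "M T N I R K S H P L\n"
    " VTHICRDVN Y GWL IRY*\n"
)
print(parse_pir(sample))
```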

MolGen/Stanford File Format

Molecular Genetics Group at Stanford U.
First line ~ comment line
  Begins with a semi-colon (;)
  Followed by descriptive text
  May be as many comment lines as desired
  Need not be present
Second line
  Must be present
  Contains an identifier or name for the sequence

MolGen/Stanford File Format

Third line
  Sequence begins
  Occupies up to 80 spaces
  Spaces may be included in the sequence for ease of reading
  Terminated with 1 or 2
    1 indicates a linear sequence
    2 marks a circular sequence


MolGen/Stanford Example

LINE 1 :; Describe the sequence any way you want

LINE 2 :ECTRNAGLY2

LINE 3 :ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT

LINE 4 : GCTTA GG G C T A1
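The example above, without its "LINE n :" labels, can be read with a short routine that treats the first non-comment line as the identifier and stops at the 1 or 2 terminator. This is a sketch for a single entry; `parse_stanford` is a hypothetical name.

```python
def parse_stanford(text):
    """Read one MolGen/Stanford entry: comments, identifier, sequence, topology."""
    comments, identifier, residues = [], None, []
    for line in text.splitlines():
        if identifier is None:
            if line.startswith(";"):
                comments.append(line[1:].strip())  # any number of comment lines
            elif line.strip():
                identifier = line.strip()  # taken whole, so digits in a name are safe
            continue
        for ch in line:
            if ch in "12":  # terminator: 1 = linear, 2 = circular
                topology = "linear" if ch == "1" else "circular"
                return {"comments": comments, "id": identifier,
                        "sequence": "".join(residues), "topology": topology}
            if not ch.isspace():
                residues.append(ch)
    raise ValueError("no 1 or 2 terminator found")


sample = (
    "; Describe the sequence any way you want\n"
    "ECTRNAGLY2\n"
    "ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT\n"
    " GCTTA GG G C T A1\n"
)
print(parse_stanford(sample))
```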

PHYLIP File Format

Interleaved and Sequential formats
Created and used by several phylogenetics programs

PHYLIP File Format

Interleaved format
  Similar to the output of alignment programs
  The first part of the file contains the first part of each of the sequences
  Only the first parts of the sequences should be preceded by names
  Then the second part of each sequence, and so on

18 206
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1kauf    MNTTDCFIAL VQAIREIKAL FLSRTTG-KM ELTLYNGEKK TFYSRPNNHD
ken1-76   MNTTDCFIAL LRAFREIKTL FLSRVRG-KM EFTLYNGEKK TFYSRPNNHD
ken34-84  MNTTDCFIAL VRAIREFKIL FSLRPLARKM EFTLYNGIKK TFYSRPNKHD
ken       MNTTDCFIAL VQAIREIKLL FKG--IR-KM KLTLYNGEKK TFYSRPNSHD
uga97-1   MNTTDCFIAL VQAIREIKSL FRS--SR-KM EFTLYNGEKK TFYSRPNNHD
bec1-65   MKTTDCFNVL FEIFHRFGQT FKA--DR-KM EFTLYNGEKK TFYSRPNTHG
zim88-3   MKTTDCFDVL LEIFHRFRQT FKT--DR-KM EFTLYNGEKK TFYSRPNTHG
knp10-90  MKTTDCFNVL LETFHRFRNV FKT--DR-KM EFTLYNGDKK TFYSRPNTHG
zim96-3   MKTTGCFDVL IEIAHRLRQL NKT--DR-KM EFTLYNGEKK TFYSRPNTHG
zim7-83   MKTTDCFNVL LEIIYRFRHT FKT--DR-KM EFTLYNGEKK TFYSRPNKHG
knp196-9  MKTTDCFSVL FEIFHRLRHT LKT--ER-KM EFTLYNGERK TFYSRPNKHG
zam4-96   MKTTDCFDAL LEAFHRLRQT FKT--DR-KM EFTLYNGEKK TFYSRPNRHG

NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
NCWLNAILQL FRYVEEPFFD WVYSSPENLT LEAIKQLEDL TGLELHEGGP
NCWLNAILQL FRYVDEPFFE WVYDSPENLT VEAIRQLEEL TGLELHEGGP
NCWLNAILQL FRYVDEPFFD WVYESPENLT IQAIGQLEEL TGLDLREGGP
NCWLNTILQL FRYVDEPFFD WVYNSPENLT LRAIEQLEEL TGLELREGGP
NCWLNTILQL FRYVDEPFFD WVYNSPENLT LQAIEQLEEL TGLELHEGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKRLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP

PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
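Reassembling an interleaved file means appending each later block to the sequences named in the first block. The sketch below assumes a "relaxed" layout in which names contain no blanks and are separated from the sequence by whitespace; strict PHYLIP instead reserves a fixed-width name field, which this hypothetical `parse_phylip_interleaved` does not handle.

```python
def parse_phylip_interleaved(text):
    """Return {name: sequence} from relaxed interleaved PHYLIP text."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    n_seqs, n_sites = int(header[0]), int(header[1])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        parts = line.split()
        if i < n_seqs:
            # First block: each line starts with the sequence name.
            names.append(parts[0])
            seqs.append("".join(parts[1:]))
        else:
            # Later blocks: no names, same order as the first block.
            seqs[i % n_seqs] += "".join(parts)
    for name, seq in zip(names, seqs):
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    return dict(zip(names, seqs))


sample = (
    "2 12\n"
    "seqA      ACGTAC GTA\n"
    "seqB      ACGTAC GTT\n"
    "\n"
    "CGT\n"
    "CGA\n"
)
print(parse_phylip_interleaved(sample))
```

The length check against the header is a cheap way to catch a truncated file, such as an alignment whose later blocks are incomplete.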

PHYLIP File Format

Sequential format
  All of one sequence is given, possibly on multiple lines, before the next starts.

18 206 YF
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GGWKANVQRK
          LK----
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDDDFYP WTPDPSDVLV FVPYDQEPLN GEWKTKVQQK
          LK----
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKASVQRK
          LKGAGQ
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKANVQRK
          LKGAGQ
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GEWKAKVQRK
          LK----
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFAC…


Protein Data Bank (PDB) File Format

Each line is 80 columns wide and is terminated by an end-of-line indicator.

The first 6 columns of every line contain a "record name".

The list of ATOM records in each polymer chain must be terminated by a TER record.

ATOM records for polymer atoms must include non-blank chain ID fields.

To use the automatic validation check, the coordinate file must include a complete CRYST1 record defining the unit cell and space group information.

Each file should terminate with a line containing only the word END.

Protein Data Bank (PDB) File Format

COLUMNS   DATA TYPE      FIELD        DEFINITION
---------------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier, left-justified.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.

Protein Data Bank (PDB) File Format

Pattern:

RTyp    Num  Atm Res Ch ResN      X       Y       Z    Occ  Temp      PDB Line
ATOM      1  N   ASP L   1       4.060   7.307   5.186  1.00 51.58      1FDL  93
ATOM      2  CA  ASP L   1       4.042   7.776   6.553  1.00 48.05      1FDL  94

Where:

RTyp: Record Type

Num: Serial number of the atom. Each atom has a unique serial number.

Atm: Atom name (IUPAC format).

Res: Residue name (IUPAC format).

Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).

ResN: Residue sequence number.

X, Y, Z: Cartesian coordinates specifying atomic position in space.

Occ: Occupancy factor

Temp: Temperature factor (atoms disordered in the crystal have high temperature factors).

PDB: The PDB data file unique identifier.

Line: Line (record) number in the data file.
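Because ATOM records are laid out in fixed columns, they are read by slicing rather than by splitting on whitespace. The sketch below uses the column positions from the ATOM table (1-based columns become 0-based Python slices). Both function names are hypothetical, and the formatter is a simplified helper used only to build a correctly aligned sample line.

```python
def format_atom_record(serial, name, res_name, chain_id, res_seq,
                       x, y, z, occupancy, temp_factor, element):
    """Build an ATOM line with fields in their fixed columns (simplified helper)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id:1}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{temp_factor:6.2f}"
            f"          {element:>2}")


def parse_atom_record(line):
    """Slice one ATOM record into named fields by column position."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    line = line.ljust(80)  # pad so that every slice is in range
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "alt_loc": line[16].strip(),      # column 17
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21].strip(),     # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "i_code": line[26].strip(),       # column 27
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }


record = format_atom_record(1, "N", "ASP", "A", 35, 34.731, -5.686, 15.0, 1.0, 98.44, "N")
print(record)
print(parse_atom_record(record))
```

Slicing matters because adjacent fields may touch with no blank between them, for example a large serial number next to the atom name.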

PDB Example

HEADER    LYASE                                   06-JUL-99   1QU4
TITLE     CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE
TITLE    2 DECARBOXYLASE
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: ORNITHINE DECARBOXYLASE;
COMPND   3 CHAIN: A, B, C, D;
COMPND   4 EC: 4.1.1.17;
COMPND   5 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI;
SOURCE   3 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE   4 EXPRESSION_SYSTEM_COMMON: BACTERIA;
SOURCE   5 EXPRESSION_SYSTEM_STRAIN: B21/DG3;
SOURCE   6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID
KEYWDS    POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA
KEYWDS   2 BARREL, LYASE
EXPDTA    X-RAY DIFFRACTION
AUTHOR    N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
AUTHOR   2 E.J.GOLDSMITH
REVDAT   2   29-DEC-99 1QU4    1       JRNL   COMPND REMARK
REVDAT   1   17-NOV-99 1QU4    0
JRNL        AUTH   N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
JRNL        AUTH 2 E.J.GOLDSMITH
JRNL        TITL   X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM
JRNL        TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE
JRNL        TITL 3 STRUCTURE IN COMPLEX WITH
JRNL        TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE
JRNL        REF    BIOCHEMISTRY                  V.  38 15174 1999
JRNL        REFN   ASTM BICHAW  US ISSN 0006-2960
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.90 ANGSTROMS.
REMARK …

DBREF  1QU4 A    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 B    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 C    1   425  SWS    P07805   DCOR_TRYBB      21    445
DBREF  1QU4 D    1   425  SWS    P07805   DCOR_TRYBB      21    445
SEQRES   1 A  425  GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES   2 A  425  ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES   3 A  425  LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES   4 A  425  PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES   5 A  425  GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES   6 A  425  TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES   7 A  425  THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES   8 A  425  ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES   9 A  425  PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES  10 A  425  SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES  11 A  425  MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES  12 A  425  LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES  13 A  425  THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES  14 A  425  PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES  15 A  425  GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES  16 A  425  PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES  17 A  425  ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES  18 A  425  GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES  19 A  425  GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES  20 A  425  PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES  21 A  425  LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES  22 A  425  GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES  23 A  425  ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES  24 A  425  GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES  25 A  425  SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES  26 A  425  PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES  27 A  425  LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES  28 A  425  PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES  29 A  425  GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES  30 A  425  GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES  31 A  425  VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES  32 A  425  THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES  33 A  425  VAL ARG GLU LEU LYS SER GLN LYS SER

HET    PLP  A 600      15
HET    PLP  B 600      15
HET    PLP  C 600      15
HET    PLP  D 600      15
HETNAM     PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN     PLP VITAMIN B6 COMPLEX
FORMUL   5  PLP    4(C8 H10 N1 O6 P1)
HELIX    1   1 LEU A   45  LEU A   59  1                                  15
HELIX    2   2 LYS A   69  ASN A   71  5                                   3
HELIX    3   3 ASP A   73  GLY A   84  1                                  12
HELIX    4   4 SER A   91  ILE A  101  1                                  11
HELIX    5   5 PRO A  104  GLU A  106  5                                   3
HELIX    6   6 GLN A  116  SER A  126  1                                  11
HELIX    7   7 CYS A  135  HIS A  146  1                                  12
HELIX    8   8 LYS A  173  GLU A  175  5                                   3
HELIX    9   9 ASP A  176  LEU A  187  1                                  12
HELIX   10  10 ALA A  205  LEU A  225  1                                  21
HELIX   11  11 LYS A  247  PHE A  263  1                                  17
HELIX   12  12 GLY A  276  ALA A  281  1                                   6
HELIX   13  13 PHE A  326  HIS A  333  1                                   8
HELIX   14  14 THR A  390  THR A  394  5                                   5
HELIX   15  15 SER A  396  PHE A  400  5                                   5
SHEET    1   A 6 GLN A 365  PRO A 373  0
SHEET    2   A 6 LEU A 350  TRP A 356 -1  N  TYR A 351   O  LEU A 372
SHEET    3   A 6 SER A 313  VAL A 318  1  O  PHE A 314   N  SER A 354
SHEET    4   A 6 PHE A 284  THR A 296 -1  N  ILE A 291   O  TYR A 317
SHEET    5   A 6 PHE A  40  ASP A  44 -1  O  PHE A  40   N  ALA A 287
SHEET    6   A 6 THR A 404  VAL A 408  1  O  THR A 404   N  PHE A  41
SHEET    1  A1 6 GLN A 365  PRO A 373  0
SHEET    2  A1 6 LEU A 350  TRP A 356 -1  N  TYR A 351   O  LEU A 372
SHEET    3  A1 6 SER A 313  VAL A 318  1  O  PHE A 314   N  SER A 354
SHEET    4  A1 6 PHE A 284  THR A 296 -1  N  ILE A 291   O  TYR A 317
SHEET    5  A1 6 TRP A 380  PHE A 383 -1  N  LEU A 381   O  VAL A 288
SHEET    6  A1 6 PRO A 338  PRO A 340 -1  O  LEU A 339   N  LEU A 382

CRYST1   66.800  151.700   85.350  90.00 102.30  90.00 P 1 21 1      8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.014970  0.000000  0.003264        0.00000
SCALE2      0.000000  0.006592  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011992        0.00000
ATOM      1  N   ASP A  35      34.731  -5.686  15.000  1.00 98.44           N
ATOM      2  CA  ASP A  35      34.249  -5.884  13.629  1.00 98.39           C
ATOM      3  C   ASP A  35      33.320  -4.750  13.203  1.00 98.13           C
ATOM      4  O   ASP A  35      33.474  -3.594  13.603  1.00 98.29           O
ATOM      5  CB  ASP A  35      33.558  -7.247  13.545  1.00 98.38           C
ATOM      6  CG  ASP A  35      33.566  -7.887  12.170  1.00 98.36           C
ATOM      7  OD1 ASP A  35      33.717  -9.133  12.114  1.00 98.26           O
ATOM      8  OD2 ASP A  35      33.419  -7.182  11.148  1.00 98.39           O
ATOM      9  N   GLU A  36      32.332  -5.073  12.378  1.00 97.79           N
ATOM     10  CA  GLU A  36      31.446  -4.080  11.787  1.00 95.51           C
ATOM     11  C   GLU A  36      32.259  -2.944  11.199  1.00 90.65           C
ATOM     12  O   GLU A  36      32.220  -1.813  11.692  1.00 94.96           O
ATOM     13  CB  GLU A  36      30.419  -3.638  12.840  1.00 97.63           C
ATOM     14  CG  GLU A  36      29.111  -3.155  12.261  1.00 98.19           C
ATOM     15  CD  GLU A  36      27.791  -3.597  12.824  1.00 98.33           C
ATOM     16  OE1 GLU A  36      27.308  -4.727  12.601  1.00 98.28           O
ATOM     17  OE2 GLU A  36      27.115  -2.806  13.520  1.00 98.43           O
ATOM     18  N   GLY A  37      33.018  -3.192  10.131  1.00 52.86           N
ATOM     19  CA  GLY A  37      33.624  -2.167   9.299  1.00 39.88           C
ATOM     20  C   GLY A  37      32.598  -1.167   8.712  1.00 34.34           C
ATOM     21  O   GLY A  37      32.236  -1.162   7.531  1.00 31.44           O
ATOM     22  N   ASP A  38      32.135  -0.248   9.564  1.00 37.23           N
ATOM     23  CA  ASP A  38      31.136   0.700   9.138  1.00 36.44           C
ATOM     24  C   ASP A  38      31.794   1.722   8.228  1.00 33.49           C
ATOM     25  O   ASP A  38      33.029   1.896   8.156  1.00 34.06           O
ATOM     26  CB  ASP A  38      30.500   1.242  10.405  1.00 42.06           C
ATOM     27  CG  ASP A  38      29.583   0.207  11.047  1.00 44.59           C
ATOM     28  OD1 ASP A  38      29.408  -0.876  10.434  1.00 45.72           O
         38  CA  PHE A  40      32.728   6.727   7.615  1.00 20.51           C
...
CONECT1117911177
CONECT1118011177
MASTER      482    0    4   60   80    0    0    611176    4   64  132
END
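The SEQRES records in the example carry the chain sequence in three-letter codes, which ties the PDB file back to the one-letter sequence formats earlier in the deck. The sketch below converts them; `THREE_TO_ONE` and `seqres_to_sequences` are hypothetical names, splitting on whitespace assumes a non-blank chain ID, and mapping unknown residues to X is an assumption.

```python
# Standard three-letter to one-letter amino acid codes (20 common residues).
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} built from SEQRES records."""
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue
        # Fields: SEQRES, line number, chain ID, chain length, then residues.
        fields = line.split()
        chain_id, residues = fields[2], fields[4:]
        chains.setdefault(chain_id, []).extend(
            THREE_TO_ONE.get(residue, "X") for residue in residues
        )
    return {chain_id: "".join(codes) for chain_id, codes in chains.items()}


sample = (
    "SEQRES   1 A  425  GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS\n"
    "SEQRES   2 A  425  ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS\n"
)
print(seqres_to_sequences(sample))
```

The resulting string could then be written out as FASTA or any of the other formats described earlier.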


Conversion of Sequence Formats

readseq (all flavors of UNIX)

1. IG/Stanford
2. GenBank/GB
3. NBRF
4. EMBL
5. GCG
6. DNAStrider
7. Fitch
8. Pearson/Fasta
9. Zuker (in-only)
10. Olsen (in-only)
11. Phylip3.2 (Sequential)
12. Phylip (Interleaved)
13. Plain/Raw
14. PIR/CODATA
15. MSF
16. ASN.1
17. PAUP/NEXUS
18. Pretty (out-only)


Conversion of Sequence Formats

seqret (EMBOSS)

gcg: GCG 9.x and 10.x format
embl
swiss
fasta
genbank
nbrf, pir: NBRF (PIR)
codata: CODATA format
strider: DNA strider format
clustal
phylip: PHYLIP non-interleaved multiple alignment format
acedb: ACeDB format
msf: Wisconsin Package GCG's MSF multiple sequence format
hennig86: Hennig86 format
jackknifer: Jackknifer format
jackknifernon: Jackknifernon format
nexus, paup: Nexus/PAUP format
treecon: Treecon format
mega: Mega format
ig: IntelliGenetics format
staden
text
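The format names above are what seqret expects when a format is given explicitly. The following Python sketch shows one way a conversion call might be assembled. It assumes EMBOSS is installed and uses the `format::filename` notation for the input and output; the helper names `build_seqret_command` and `convert` and the file names are hypothetical.

```python
import shutil
import subprocess


def build_seqret_command(in_path, in_format, out_path, out_format):
    """Build a seqret command line as a list of arguments.

    Assumption: input and output are both written as format::filename,
    and -auto turns off interactive prompting.
    """
    return [
        "seqret",
        f"{in_format}::{in_path}",
        f"{out_format}::{out_path}",
        "-auto",
    ]


def convert(in_path, in_format, out_path, out_format):
    """Run seqret if it is on the PATH; return True if it was run."""
    if shutil.which("seqret") is None:
        return False  # EMBOSS is not installed on this machine
    cmd = build_seqret_command(in_path, in_format, out_path, out_format)
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file names, shown only to illustrate the command layout.
cmd = build_seqret_command("input.gb", "genbank", "output.fasta", "fasta")
print(" ".join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect before anything is executed.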


Conversion of Sequence Formats

Using Perl

Downloadable from Biomolecular Engineering Research Center (BMERC)
http://bmerc-www.bu.edu/needle-doc/latest/seq-tools.html

pdb-to-seq.pl: pdb to several standard formats
fa2tbl.pl: fasta to sequence table (.tbl) file
tbl2fa.pl: .tbl file to fasta
etc.
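To make the fasta and table conversions concrete, here is a small Python sketch of the same idea. It is not the BMERC code. It assumes a .tbl file holds one record per line, with the name and the sequence separated by a tab; the actual layout used by fa2tbl.pl may differ. The function names `fasta_to_table` and `table_to_fasta` are hypothetical.

```python
def fasta_to_table(fasta_text):
    """Convert FASTA text to one 'name<TAB>sequence' line per record."""
    rows = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                rows.append(f"{name}\t{''.join(chunks)}")
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if name is not None:
        rows.append(f"{name}\t{''.join(chunks)}")
    return "\n".join(rows)


def table_to_fasta(table_text, width=60):
    """Convert 'name<TAB>sequence' lines back to FASTA, wrapped at width."""
    out = []
    for row in table_text.splitlines():
        if not row.strip():
            continue
        name, seq = row.split("\t", 1)
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out)


fasta = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
table = fasta_to_table(fasta)
print(table)
print(table_to_fasta(table, width=4))
```

A one-line-per-record table is convenient for tools such as sort or grep, which is a common reason for converting out of FASTA. Names that themselves contain a tab would break this simple layout.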


Conversion of Sequence Formats

Web-based

WWW READSEQ Sequence Conversion at NIH
http://bimas.cit.nih.gov/molbio/readseq/

WWW READSEQ at Human Genome Mapping Project (HGMP) Center
http://www.hgmp.mrc.ac.uk/Registered/Webapp/readseq/




Conversion of Sequence Formats

Biological software at Institut Pasteur
http://bioweb.pasteur.fr/seqanal/formats-uk.html

READSEQ
EMBOSS Abiview: trace files (ABI) to fasta
EMBOSS:
cutseq: Removes a specified section from a sequence.
pasteseq: Inserts one sequence into another.
nthseq: Writes one sequence from a multiple set of sequences.
extractseq: Extracts regions from a sequence.
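The four EMBOSS editing operations listed above can be pictured as simple string operations. The Python functions below are illustrative stand-ins that borrow the EMBOSS program names; they are not the EMBOSS implementations. They assume 1-based, inclusive positions, and that the paste position means "insert after this position".

```python
def cutseq(seq, start, end):
    """Remove positions start..end (1-based, inclusive) from seq."""
    return seq[:start - 1] + seq[end:]


def pasteseq(seq, insert, pos):
    """Insert one sequence into another after position pos (0 = at start)."""
    return seq[:pos] + insert + seq[pos:]


def nthseq(seqs, n):
    """Return the n-th sequence (1-based) from a list of sequences."""
    return seqs[n - 1]


def extractseq(seq, regions):
    """Join the given (start, end) regions, 1-based and inclusive."""
    return "".join(seq[start - 1:end] for start, end in regions)


seq = "ACGTACGTAC"
print(cutseq(seq, 3, 5))                      # prints ACCGTAC
print(pasteseq("ACGT", "NN", 2))              # prints ACNNGT
print(nthseq(["AAA", "CCC", "GGG"], 2))       # prints CCC
print(extractseq(seq, [(1, 3), (8, 10)]))     # prints ACGTAC
```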


Conversion of Sequence Formats

Windows-based program: SeqVerter

Downloadable from GeneStudio, Inc. (free)
http://www.genestudio.com/seqverter.htm

Read: ABI traces, Clustal, DCSE, DNASIS, DNAStar, DNAStrider (including binary), EMBL, FASTA, GDE, GenBank, IBI/Pustell, Macaw, MSF, Nexus/PAUP, PHYLIP Interleaved, PIR/NBRF, SCF 2.0 and SCF 3.0 traces, Swiss-Prot, and TreeCon.

Write: Clustal, DNASIS, DNAStar, FASTA and FASTA-SequIn, GenBank, IBI/Pustell, MSF, Nexus/PAUP, PHYLIP Interleaved, and TreeCon.

Tutorial also available

