Influenza Research Database (IRD)


To provide comprehensive, consistent annotations for influenza virus sequence data, the IRD team has developed a variety of custom annotation and computational pipelines. This special newsletter issue highlights the variant protein annotation efforts carried out by IRD.

Variant Protein Annotations

In recent years, the influenza community has identified several novel proteins generated from non-canonical translation strategies such as leaky ribosomal scanning (PB1-F2, PB1-N40, PA-N155 and PA-N182), ribosomal frameshift (PA-X) and alternative splicing (M42 and NS3). Anticipating the desire to search and analyze these newly discovered variant proteins, the IRD team developed a custom annotation algorithm that predicts the open reading frames and protein sequences for each of the PB1-N40, PA-N155, PA-N182, PA-X, M42 and NS3 variant proteins based on the presence of experimentally defined sequence features (SOP).
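The idea of predicting an N-terminally truncated variant from its parent coding sequence can be sketched as follows. This is a minimal illustration, not IRD's annotation algorithm (which is defined in the SOP): the function name `truncated_variant_orf` is hypothetical, and the assumption that PA-N155 and PA-N182 begin at an in-frame ATG at codons 155 and 182 of PA is taken from the protein names. PA-X (frameshift) and the spliced variants M42 and NS3 are not covered. The example input is synthetic, not a real PA sequence.

```python
from typing import Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def truncated_variant_orf(cds: str, start_codon: int) -> Optional[Tuple[int, int, int]]:
    """Hypothetical helper: locate a truncated variant inside a parent CDS.

    Returns (start_nt, end_nt, protein_length) using 1-based nucleotide
    coordinates, or None if the codon at `start_codon` (1-based) is not ATG
    or the parent CDS is not a clean open reading frame.
    """
    cds = cds.upper()
    # The parent CDS must be a whole number of codons and end in a stop codon.
    if len(cds) % 3 != 0 or cds[-3:] not in STOP_CODONS:
        return None
    offset = (start_codon - 1) * 3
    # Assumed sequence feature: an in-frame ATG at the expected codon.
    if cds[offset:offset + 3] != "ATG":
        return None
    # Reject premature in-frame stop codons downstream of the new start.
    for i in range(offset, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return None
    protein_length = (len(cds) - offset) // 3 - 1  # exclude the stop codon
    return offset + 1, len(cds), protein_length


# Synthetic 716-codon CDS (plus stop) with ATG at codons 1, 155 and 182.
synthetic_pa = "ATG" + "GCT" * 153 + "ATG" + "GCT" * 26 + "ATG" + "GCT" * 534 + "TAA"

print(truncated_variant_orf(synthetic_pa, 155))  # (463, 2151, 562)
print(truncated_variant_orf(synthetic_pa, 182))  # (544, 2151, 535)
```

For a 2151-nucleotide PA coding sequence, the codon-155 result (start 463, end 2151, 562 residues) appears to match the coordinates and length embedded in the IRD record name "PA-N155 protein(562)" and accession IRD_528916889_463_2151.1, although the accession format itself is not documented here.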

Using this algorithm, the IRD team has annotated every relevant influenza segment sequence with the variant proteins predicted to be present. As of March 2017, over 93% of complete genome strains in IRD have predicted PB1-N40, PA-N155, PA-N182 and PA-X (in three variant forms: +41, +61 or other) proteins (Table 1). M42 and NS3 depend on rare alternative splicing events with strict sequence requirements, and are therefore found in only 0.2% and 0.1% of influenza strains, respectively. These annotations can be viewed on the Strain Details page (Figure 1).

Search for Variant Proteins

These predicted sequences can be retrieved from the Nucleotide Sequence Search and Protein Sequence Search pages (Figure 2), passed to any of the IRD analysis tools (Figure 3), or downloaded.

Outreach Events

• May 29, 2017: IRD/ViPR workshop, Erasmus University Medical Center, Netherlands

• June 24-28, 2017: American Society for Virology Annual Meeting, Madison, Wisconsin

Figure 2. The Protein Sequence Search page supports queries based on ‘classical proteins’, ‘variant proteins’ and sequence-associated metadata.

[Screenshot: Strain Details page for A/northern shoveler/Minnesota/Sg-00651/2008, showing the PA-N155 protein(562) record (Gene Symbol: PA-N155; IRD Protein Accession: IRD_528916889_463_2151.1; Source: IRD) together with its identical sequences, Gene Ontology classification, database cross references, HMM/Pfam domains (PF00603, Flu_PA) and predicted epitopes.]

[Screenshot: Protein Sequence Search Results page for a PA-X query (39,757 proteins returned; page 59 of 1988; 20 records per page, sorted by Strain Name in ascending order). The listed records are annotated as PA-X protein(+41), PA-X protein(+61) or PA-X protein(other), with columns for sequence accession, complete genome, segment, segment length, subtype, collection date, host species, country, state/province, flu season and strain name. The 'Run Analysis' dropdown menu appears below the results.]

Figure 1. Variant protein annotations are displayed on the Strain Details page

Figure 3. A portion of the Protein Sequence Search Results page from a query for PA-X, showing annotations of the three different PA-X variants: PA-X (+41), PA-X (+61) and PA-X (other). Selected records from this page can be input to any of the analysis tools under the ‘Run Analysis’ dropdown menu (red arrow), or downloaded to a local computer.

Release Date: Mar 16, 2017

This system is provided for authorized users only. Anyone using this system expressly consents to monitoring while using the system. Improper use of this system may be referred to law enforcement officials. This project is funded by the National Institute of Allergy and Infectious Diseases (NIH / DHHS) under Contract No. HHSN272201400028C and is a collaboration between Northrop Grumman Health IT, J. Craig Venter Institute, and Vecna Technologies.

[Screenshot: Protein Sequence Search form, with filters for data to return (segment/nucleotide, protein, strain), virus type (A, B, C, provisional Influenza D), subtype, strain name, complete sequences, 2009 pH1N1 sequences, 'classical' proteins (PB2, PB1, PA, HA, NP, NA, M1, M2, NS1, NS2), 'variant' proteins (PB1-F2, PB1-N40, PA-N155, PA-N182, PA-X, M42, NS3), date range, host, geographic grouping, country and USA state. Results matching the criteria shown: 8,360.]

The March 2017 release of IRD is now available; visit www.fludb.org

Variant Protein        Variant Proteins from Complete Genomes   Percentage*   Source
PB1-F2                 22585                                    70.9%         GenBank
PB1-N40                31509                                    99.0%         IRD
PA-N155                31650                                    99.1%         IRD
PA-N182                29755                                    93.1%         IRD
PA-X                   31649                                    99.1%         IRD
PA-X protein(+41)      9178                                     28.7%         GenBank & IRD
PA-X protein(+61)      22392                                    70.1%         GenBank & IRD
PA-X protein(other)    79                                       0.2%          GenBank & IRD
PA-X protein           2837                                     8.9%          GenBank
M42                    76                                       0.2%          IRD
NS3                    38                                       0.1%          IRD

Table 1. Variant protein annotations in IRD as of March 10, 2017
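As a quick consistency check on Table 1, the counts for the three PA-X forms can be summed and compared with the overall PA-X row. The sketch below uses only the counts printed in the table; the variable names are illustrative.

```python
# Counts copied from Table 1 (complete genomes, as of March 10, 2017).
pa_x_forms = {"+41": 9178, "+61": 22392, "other": 79}
pa_x_total = 31649  # the overall "PA-X" row

# The three annotated forms should account for every PA-X prediction.
forms_match_total = sum(pa_x_forms.values()) == pa_x_total
print(forms_match_total)  # True
```

The separate "PA-X protein" row (2837, GenBank) is not part of this sum.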

New Features in IRD

Influenza Research Database (IRD)

March 2017