ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data
Gabriella Rustici, PhD
Functional Genomics Team
EBI-EMBL
Talk structure
Why do we need a database for functional genomics data?
ArrayExpress database
• Archive
• Gene Expression Atlas
ArrayExpress content
How to query the database
How to download data
How to submit data
What is functional genomics (FG)?
• The aim of FG is to understand the function of genes and
other parts of the genome
• FG experiments typically utilize genome-wide assays to
measure and track many genes (or proteins) in parallel
under different conditions
• High-throughput technologies such as microarrays and
high-throughput sequencing (HTS) are frequently used in
this field to interrogate the transcriptome
What biological questions is FG addressing?
• When and where are genes expressed?
• How do gene expression levels differ in various cell types and states?
• What are the functional roles of different genes and in what cellular processes do they participate?
• How are genes regulated?
• How do genes and gene products interact?
• How is gene expression changed in various diseases or following a treatment?
Components of a FG experiment
ArrayExpress
www.ebi.ac.uk/arrayexpress/
Is a public repository for FG data, which provides easy access to
well annotated data in a structured and standardized format
Serves the scientific community as an archive for data supporting
publications, together with GEO at NCBI and CIBEX at DDBJ
Facilitates the sharing of experimental information associated with
the data such as microarray designs, experimental protocols,……
Based on community standards: MIAME guidelines & MAGE-TAB
format for microarray, MINSEQE guidelines for HTS data
(http://www.mged.org/minseqe/)
Reporting standards for sequencing
MINSEQE checklist
Minimal Information about a high-throughput Nucleotide SEQuencing Experiment
The proposed guidelines for MINSEQE (still a work in progress) are:
• General information about the experiment
• Essential sample annotation including experimental factors and their values (e.g. compound and dose)
• Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….)
• Essential experimental and data processing protocols
• Sequence read data with quality scores, raw intensities and processing parameters for the instrument
• Final processed data for the set of assays in the experiment
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:
• IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.
• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
• Data files: raw and processed data files.
The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA).
The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
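Because the SDRF is a tab-delimited table, the sample-to-file relationships it records can be read with a few lines of code. The following is a minimal Python sketch, not an official parser. The column headings ('Source Name', 'Characteristics[organism]', 'Factor Value[compound]', 'Array Data File') are assumed to follow common MAGE-TAB usage, and the sample names, file names and factor values are invented for illustration; a real SDRF may use different columns.

```python
import csv
import io

# Hypothetical SDRF content (tab-delimited); real files have more columns.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample1\tMus musculus\tcontrol\tsample1.CEL\n"
    "sample2\tMus musculus\tdrugX\tsample2.CEL\n"
)


def read_sdrf(handle):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(handle, delimiter="\t"))


def files_by_factor(rows, factor_column, file_column="Array Data File"):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(io.StringIO(SDRF_TEXT))
groups = files_by_factor(rows, "Factor Value[compound]")
print(groups)
```

To read a downloaded SDRF instead of the inline text, pass an open file handle to `read_sdrf`.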
Standards for microarray & sequencing
MAGE-TAB format
HTS data in ArrayExpress and Atlas
ArrayExpress – two databases
What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Query to retrieve experimental information and associated data
Expression Atlas
• Central object: gene/condition
• Query for gene expression changes across experiments and across platforms
ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your research
• Download data and re-analyze it.
Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments.
• Submit microarray or HTS data that you want to publish.
Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.
How much data in AE Archive?
GEO import
Browsing the AE Archive
• AE unique experiment ID
• Curated title of experiment
• Number of assays
• Species investigated
• The date when the data were loaded into the Archive
• Direct link to raw and processed data. An icon indicates that this type of data is available.
• ‘Loaded in Atlas’ flag
• Raw sequencing data available in ENA
• The total number of experiments and assays retrieved
• The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed
Browsing the AE Archive
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
Application-focused ontology modeling experimental factors (EFs) in AE – selected by default
Developed to:
• increase the richness of annotations that are currently made in AE Archive
• promote consistency
• facilitate automatic annotation and integrate external data
EFs are transformed into an ontological representation, forming classes and relationships between those classes
EFO terms map to multiple existing domain-specific ontologies, such as the Disease Ontology and Cell Type Ontology
Building EFO
An example
Take all experimental factors:
• sarcoma
• cancer
• neoplasm
• disease
• Kaposi’s sarcoma
Find the logical connection between them:
• disease is the parent term
• neoplasm is a type of disease
• cancer is a synonym of neoplasm
• sarcoma is a type of cancer
• Kaposi’s sarcoma is a type of sarcoma
Organize them in an ontology:
disease
  neoplasm
    cancer
      sarcoma
        Kaposi’s sarcoma
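The three steps of the sarcoma example (collect the terms, link them, organize them into a hierarchy) can be sketched in Python. This is a toy illustration only, not how EFO itself is built or stored. It follows the tree on the slide in nesting 'cancer' under 'neoplasm', and the function names are hypothetical.

```python
# Toy "is a type of" links from the sarcoma example (child -> parent).
IS_A = {
    "neoplasm": "disease",
    "cancer": "neoplasm",
    "sarcoma": "cancer",
    "Kaposi's sarcoma": "sarcoma",
}


def children_index(is_a):
    """Invert child -> parent links into parent -> [children]."""
    index = {}
    for child, parent in is_a.items():
        index.setdefault(parent, []).append(child)
    return index


def descendants(term, index):
    """All terms below `term` in the hierarchy, depth first."""
    found = []
    for child in index.get(term, []):
        found.append(child)
        found.extend(descendants(child, index))
    return found


def print_tree(term, index, depth=0):
    """Print the hierarchy with two spaces of indentation per level."""
    print("  " * depth + term)
    for child in index.get(term, []):
        print_tree(child, index, depth + 1)


index = children_index(IS_A)
print_tree("disease", index)

# Expanding a query term to itself plus everything beneath it.
expanded = ["cancer"] + descendants("cancer", index)
print(expanded)
```

Expanding a term to its descendants is what lets a search for a broad factor such as 'cancer' also retrieve experiments annotated with a narrower one such as 'Kaposi's sarcoma'.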
Exploring EFO
An example
Searching AE Archive
Simple query
Search across all fields:
• AE accession number, e.g. E-MEXP-568
• Secondary accession numbers, e.g. GEO series accession GSE5389
• Experiment name
• Submitter's experiment description
• Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
• Publication title, authors and journal name, PubMed ID
Synonyms for terms are always included in searches, e.g. 'human' and 'Homo sapiens'
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
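The three highlight colours correspond to three ways an annotation can match a query term. The Python sketch below shows that classification on toy data. The synonym and child-term tables and the function name are hypothetical, and the real Archive search is not claimed to work this way internally.

```python
# Hypothetical lookup tables; in the Archive these relations come from EFO.
SYNONYMS = {
    "human": {"homo sapiens"},
    "homo sapiens": {"human"},
}
CHILD_TERMS = {
    "cancer": {"sarcoma", "kaposi's sarcoma"},
}
# Colours as described for the query output.
HIGHLIGHT = {"exact": "yellow", "synonym": "green", "child": "pink"}


def classify_match(query, annotation):
    """Return 'exact', 'synonym', 'child', or None if there is no match."""
    q, a = query.lower(), annotation.lower()
    if a == q:
        return "exact"
    if a in SYNONYMS.get(q, set()):
        return "synonym"
    if a in CHILD_TERMS.get(q, set()):
        return "child"
    return None


for query, annotation in [
    ("cancer", "cancer"),
    ("human", "Homo sapiens"),
    ("cancer", "Kaposi's sarcoma"),
]:
    kind = classify_match(query, annotation)
    print(query, "->", annotation, ":", kind, HIGHLIGHT[kind])
```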
RNA-seq data in AE Archive
AE Archive – experiment view
Link to raw data in ENA
AE Archive – experiment view
SDRF file – sample & data relationship
ArrayExpress – two databases
Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas;
• Discover which genes are differentially expressed in a particular biological condition that you are interested in.
Expression Atlas construction
Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
• Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)
• High MIAME scores
• Experiment must have 6 or more hybridizations
• Sufficient replication and large sample size
• EF and EFV must be well annotated
• Adequate sample annotation must be provided
• Processed data must be provided, or raw data which can be renormalized must be available
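Some of these criteria are mechanical enough to express as a filter. The Python sketch below encodes only the checkable ones. The record fields and function name are hypothetical, the MIAME score cut-off is left as a required argument because the slides do not state one, and the replication and sample-size criterion is not modelled. In practice the selection also involves curator judgement.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Hypothetical summary record for one Archive experiment."""
    name: str
    hybridizations: int
    miame_score: int
    has_array_design: bool
    factors_annotated: bool
    has_processed_data: bool
    has_renormalizable_raw_data: bool


def atlas_candidate(exp, min_miame_score, min_hybridizations=6):
    """True if the experiment passes the mechanically checkable criteria."""
    return (
        exp.has_array_design
        and exp.miame_score >= min_miame_score
        and exp.hybridizations >= min_hybridizations
        and exp.factors_annotated
        and (exp.has_processed_data or exp.has_renormalizable_raw_data)
    )


# Invented records for illustration.
large = Experiment("large study", 24, 5, True, True, True, False)
small = Experiment("small study", 4, 5, True, True, True, True)

# The cut-off of 4 is an arbitrary value chosen for this example.
print(atlas_candidate(large, min_miame_score=4))
print(atlas_candidate(small, min_miame_score=4))
```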
Expression Atlas construction
Analysis pipeline
New meta-analytical tool for searching gene expression profiles across experiments in AE
Data is taken as normalized by the submitter
Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments
The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples.
The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
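To make the shape of the result concrete, the Python sketch below builds a small gene-by-condition matrix of signed p-values. It is an illustration of the output format, not the Atlas pipeline: in place of limma's linear models it uses an exact permutation test on the difference of means, chosen only because it needs no external libraries. The gene names, sample labels and expression values are invented.

```python
from itertools import combinations
from statistics import mean


def signed_p_value(values, in_condition):
    """Exact two-sided permutation test on the difference of means.

    values: expression values for one gene, one per sample.
    in_condition: set of sample indices that belong to the condition.
    Returns (sign, p): sign is +1 (up), -1 (down) or 0 (no difference).
    """
    n = len(values)

    def diff(group):
        inside = [values[i] for i in group]
        outside = [values[i] for i in range(n) if i not in group]
        return mean(inside) - mean(outside)

    observed = diff(in_condition)
    # Every way of choosing the same number of samples as the condition.
    all_diffs = [diff(set(g)) for g in combinations(range(n), len(in_condition))]
    extreme = sum(1 for d in all_diffs if abs(d) >= abs(observed) - 1e-12)
    sign = (observed > 0) - (observed < 0)
    return sign, extreme / len(all_diffs)


def signed_p_matrix(expression, samples):
    """Rows are genes, columns are conditions, entries are (sign, p)."""
    conditions = sorted(set(samples))
    matrix = {}
    for gene, values in expression.items():
        matrix[gene] = {}
        for cond in conditions:
            idx = {i for i, s in enumerate(samples) if s == cond}
            matrix[gene][cond] = signed_p_value(values, idx)
    return matrix


# Invented, already-normalized data: six samples, two conditions.
samples = ["control", "control", "control", "treated", "treated", "treated"]
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.3],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}

matrix = signed_p_matrix(expression, samples)
for gene, row in matrix.items():
    print(gene, row)
```

Note that the columns are conditions rather than samples, so the matrix stays the same width however many replicates each condition has.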
Expression Atlas construction
Expression Atlas
Atlas home page
http://www.ebi.ac.uk/gxa/
Query for genes
Query for conditions
Restrict query by direction of differential expression
The ‘advanced query’ option allows building more complex queries
Atlas home page
The ‘Genes’ and ‘Conditions’ search boxes
Atlas home page
A single gene query
Atlas gene summary page
Atlas experiment page
Atlas experiment page – HTS data
Atlas home page
A ‘Conditions’ only query
Atlas heatmap view
Atlas gene-condition query
Atlas advanced search
Data submission to AE
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
Submission of HTS gene expression data
• Submit via MAGE-TAB submission route
• Submit:
• MAGE-TAB spreadsheet containing details of the samples and protocols used.
• Trace data files for each sample (in SRF, FASTQ or SFF format)
• Processed data files
• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit it to the European Genome-phenome Archive (EGA), not to ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
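The routing rule for HTS submissions can be written as a small pre-submission check. This Python sketch is illustrative only: the function name is hypothetical, and matching file types by the extensions `.srf`, `.fastq` and `.sff` is an assumption (real files may be compressed or named differently).

```python
# Assumed file extensions for the accepted trace formats (SRF, FASTQ, SFF).
ACCEPTED_TRACE_EXTENSIONS = (".srf", ".fastq", ".sff")


def submission_route(human_identifiable, trace_files):
    """Return the archive a sequencing data set should be submitted to."""
    if human_identifiable:
        # Human identifiable data is not accepted by ArrayExpress.
        return "European Genome-phenome Archive"
    unsupported = [
        name for name in trace_files
        if not name.lower().endswith(ACCEPTED_TRACE_EXTENSIONS)
    ]
    if unsupported:
        raise ValueError("Unsupported trace file format: " + ", ".join(unsupported))
    return "ArrayExpress"


# Invented file names for illustration.
print(submission_route(False, ["liver_rep1.fastq", "liver_rep2.fastq"]))
print(submission_route(True, ["patient1.fastq"]))
```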
Types of data that can be submitted
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.
Find out more
• Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and Atlas
• Email us at: [email protected]
• Atlas mailing list: [email protected]