
Post on 28-Dec-2015


Page 1: ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL gabry@ebi.ac.uk

ArrayExpress and Gene Expression Atlas: Mining Functional Genomics data

Gabriella Rustici, PhD

Functional Genomics Team

EBI-EMBL

gabry@ebi.ac.uk

Page 2

Talk structure

Why do we need a database for functional genomics data?

ArrayExpress database

• Archive

• Gene Expression Atlas

ArrayExpress content

How to query the database

How to download data

How to submit data

Page 3

What is functional genomics (FG)?

• The aim of FG is to understand the function of genes and other parts of the genome

• FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions

• High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome

Page 4

What biological questions is FG addressing?

• When and where are genes expressed?

• How do gene expression levels differ in various cell types and states?

• What are the functional roles of different genes and in what cellular processes do they participate?

• How are genes regulated?

• How do genes and gene products interact?

• How is gene expression changed in various diseases or following a treatment?

Page 5

Components of a FG experiment

Page 6

ArrayExpress

www.ebi.ac.uk/arrayexpress/

• Is a public repository for FG data, which provides easy access to well-annotated data in a structured and standardized format

• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ

• Facilitates the sharing of experimental information associated with the data, such as microarray designs, experimental protocols, …

• Is based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Page 7

Reporting standards for sequencing

MINSEQE checklist

Minimal Information about a high-throughput Nucleotide SEQuencing Experiment

The proposed MINSEQE guidelines (still a work in progress) are:

• General information about the experiment

• Essential sample annotation, including experimental factors and their values (e.g. compound and dose)

• Experimental design, including sample data relationships (e.g. which raw data file relates to which sample, …)

• Essential experimental and data processing protocols

• Sequence read data with quality scores, raw intensities and processing parameters for the instrument

• Final processed data for the set of assays in the experiment
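A checklist like this can be applied mechanically before submission. The sketch below is only an illustration: the dictionary keys are hypothetical names chosen for this example, not part of MINSEQE, which does not prescribe a machine-readable schema.

```python
# Hypothetical keys, one per MINSEQE checklist item listed on the slide.
MINSEQE_ITEMS = {
    "general_information": "General information about the experiment",
    "sample_annotation": "Essential sample annotation (factors and values)",
    "experimental_design": "Experimental design (sample data relationships)",
    "protocols": "Essential experimental and data processing protocols",
    "sequence_reads": "Sequence read data with quality scores",
    "processed_data": "Final processed data for the set of assays",
}


def missing_minseqe_items(submission):
    """Return the checklist keys that are absent or empty in `submission`."""
    return [key for key in MINSEQE_ITEMS if not submission.get(key)]


# Made-up example submission; "processed_data" is deliberately left out.
submission = {
    "general_information": "RNA-seq of treated vs untreated cells",
    "sample_annotation": [{"sample": "s1", "compound": "drug A", "dose": "10 uM"}],
    "experimental_design": {"s1": "s1.fastq"},
    "protocols": ["library preparation", "read alignment"],
    "sequence_reads": ["s1.fastq"],
}

missing = missing_minseqe_items(submission)
for key in missing:
    print("Missing:", MINSEQE_ITEMS[key])
```

An empty list means every item on the checklist has at least some content; it says nothing about the quality of that content, which is what curation is for.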

Page 8

Standards for microarray & sequencing

MAGE-TAB format

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment. We adapted it to handle HTS data:

• IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.

• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.

• Data files: raw and processed data files.

The ‘raw’ data files are the trace data files (.srf or .sff). FASTQ format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA).

The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
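Because an SDRF is a tab-delimited spreadsheet, the sample-to-file relationships it records can be read with standard tools. The sketch below parses a tiny, made-up SDRF held in a string and groups data files by one experimental factor. The column headings follow the MAGE-TAB naming style (`Source Name`, `Characteristics[...]`, `Factor Value[...]`, `Array Data File`); the rows and file names are invented for illustration.

```python
import csv
import io

# A made-up, minimal SDRF: one row per sample, columns separated by tabs.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[compound]\tArray Data File\n"
    "sample 1\tMus musculus\tnone\tsample1.CEL\n"
    "sample 2\tMus musculus\tdrug A\tsample2.CEL\n"
)


def read_sdrf(text):
    """Parse SDRF text into a list of dicts keyed by column heading."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor):
    """Group data file names by the value of one experimental factor."""
    column = f"Factor Value[{factor}]"
    grouped = {}
    for row in rows:
        grouped.setdefault(row[column], []).append(row["Array Data File"])
    return grouped


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "compound")
print(groups)
```

Real SDRF files can repeat a column heading (for example several protocol columns); `csv.DictReader` keeps only the last of any repeated heading, so a production parser would need to handle columns by position instead.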

HTS data in ArrayExpress and Atlas

Page 9

ArrayExpress – two databases

Page 10

What is the difference between them?

ArrayExpress Archive

• Central object: experiment

• Query to retrieve experimental information and associated data

Expression Atlas

• Central object: gene/condition

• Query for gene expression changes across experiments and across platforms

Page 11

ArrayExpress – two databases

Page 12

ArrayExpress Archive – when to use it?

• Find FG experiments that might be relevant to your research

• Download data and re-analyze it.

Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments.

• Submit microarray or HTS data that you want to publish.

Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.

Page 13

How much data in AE Archive?


GEO import

Page 14

Browsing the AE Archive

Page 15

Browsing the AE Archive

The direct link to raw and processed data. An icon indicates that this type of data is available.

The total number of experiments and assays retrieved

Species investigated

Curated title of experiment

The date when the data were loaded in the Archive

AE unique experiment ID

Number of assays

The list of experiments retrieved can be printed, saved in tab-delimited format, or exported to Excel or as an RSS feed

‘Loaded in Atlas’ flag

Raw sequencing data available in ENA

Page 16

Browsing the AE Archive

Page 17

Experimental factor ontology (EFO)

http://www.ebi.ac.uk/efo

Application-focused ontology modeling experimental factors (EFs) in AE – selected by default

Developed to:

• increase the richness of annotations that are currently made in AE Archive

• promote consistency

• facilitate automatic annotation and integrate external data

EFs are transformed into an ontological representation, forming classes and relationships between those classes

EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology

Page 18

Building EFO

An example

Take all experimental factors:

• sarcoma

• cancer

• neoplasm

• disease

• Kaposi’s sarcoma

Find the logical connection between them:

• disease is the parent term

• neoplasm is a type of disease

• cancer is a synonym of neoplasm

• sarcoma is a type of cancer

• Kaposi’s sarcoma is a type of sarcoma

Organize them in an ontology:

[-] disease
    [-] neoplasm
        [-] cancer
            [-] sarcoma
                Kaposi’s sarcoma
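The resulting hierarchy can be represented very simply as child-to-parent links. The sketch below is a toy illustration using only the five terms from this slide; it is not how EFO itself is stored or distributed, and the function names are made up for this example.

```python
# Child -> parent links, following the tree shown on the slide.
PARENT = {
    "neoplasm": "disease",
    "cancer": "neoplasm",
    "sarcoma": "cancer",
    "Kaposi's sarcoma": "sarcoma",
}


def ancestors(term):
    """Walk upwards from `term` to the root, nearest parent first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def descendants(term):
    """Collect every term below `term` in the hierarchy."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        for child, parent in PARENT.items():
            if parent == current:
                found.append(child)
                stack.append(child)
    return found


print(ancestors("Kaposi's sarcoma"))
print(descendants("cancer"))
```

Walking down the tree is what lets a query for a broad term such as "cancer" also retrieve experiments annotated with a more specific term such as "Kaposi's sarcoma".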

Page 19

Exploring EFO

An example

Page 20

Searching AE Archive

Simple query

Page 21

Searching AE Archive

Simple query

Search across all fields:

• AE accession number, e.g. E-MEXP-568

• Secondary accession numbers, e.g. GEO series accession GSE5389

• Experiment name

• Submitter's experiment description

• Sample attributes, experimental factors and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)

• Publication title, authors and journal name, PubMed ID

Synonyms for terms are always included in searches, e.g. ‘human’ and ‘Homo sapiens’

Page 22

AE Archive query output

• Matches to exact terms are highlighted in yellow

• Matches to synonyms are highlighted in green

• Matches to child terms in the EFO are highlighted in pink
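The three highlight colours correspond to three ways a term in an experiment record can match the query. A minimal sketch of that classification is shown below; the synonym and child-term tables are tiny made-up fragments built from examples used elsewhere in this talk, and the function names are hypothetical rather than anything exposed by ArrayExpress.

```python
# Made-up lookup tables for illustration only.
SYNONYMS = {"human": {"Homo sapiens"}, "Homo sapiens": {"human"}}
CHILDREN = {"cancer": {"sarcoma"}, "sarcoma": {"Kaposi's sarcoma"}}


def child_terms(term):
    """All terms below `term`, following CHILDREN links transitively."""
    result = set()
    stack = [term]
    while stack:
        for child in CHILDREN.get(stack.pop(), ()):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def highlight_colour(query, matched_term):
    """Return the highlight colour for a match, or None if it is not a match."""
    if matched_term.lower() == query.lower():
        return "yellow"  # exact term
    if matched_term in SYNONYMS.get(query, set()):
        return "green"  # synonym
    if matched_term in child_terms(query):
        return "pink"  # child term in the ontology
    return None


for term in ["cancer", "Kaposi's sarcoma", "disease"]:
    print(term, "->", highlight_colour("cancer", term))
```

Note that the parent term "disease" is not a match for the query "cancer": expansion runs down the ontology, not up.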

Page 23

RNA-seq data in AE Archive

Page 24

AE Archive – experiment view

Page 25

Link to raw data in ENA

Page 26

AE Archive – experiment view

Page 27

SDRF file – sample & data relationship

Page 28

ArrayExpress – two databases

Page 29

Expression Atlas – when to use it?

• Find out whether the expression of a gene (or of a group of genes with a common gene attribute, e.g. a GO term) changes across all the experiments available in the Expression Atlas

• Discover which genes are differentially expressed in a particular biological condition that you are interested in.

Page 30

Expression Atlas construction

Experiment selection criteria

The criteria we use for selecting experiments for inclusion in the Atlas are as follows:

• Array designs relating to the experiment must be provided to enable re-annotation using Ensembl or UniProt (or have the potential for this to be done)

• High MIAME scores

• Experiment must have 6 or more hybridizations

• Sufficient replication and large sample size

• Experimental factors (EF) and experimental factor values (EFV) must be well annotated

• Adequate sample annotation must be provided

• Processed data must be provided, or raw data which can be renormalized must be available
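Several of these criteria can be expressed as a simple filter over experiment metadata. The sketch below is an assumption-laden illustration: the `Experiment` fields are hypothetical names, the accession is a placeholder, and only the "6 or more hybridizations" rule uses a threshold actually stated on the slide. The MIAME score and replication criteria are left out because no cut-offs are given for them.

```python
from dataclasses import dataclass

MIN_HYBRIDIZATIONS = 6  # the only numeric threshold stated on the slide


@dataclass
class Experiment:
    """Hypothetical metadata record; field names are invented for this sketch."""
    accession: str
    hybridizations: int
    array_design_available: bool
    factors_annotated: bool
    samples_annotated: bool
    has_processed_data: bool
    has_renormalizable_raw_data: bool


def atlas_rejection_reasons(exp):
    """Return the reasons an experiment fails the checks; empty if it passes."""
    reasons = []
    if not exp.array_design_available:
        reasons.append("array design missing")
    if exp.hybridizations < MIN_HYBRIDIZATIONS:
        reasons.append("fewer than 6 hybridizations")
    if not exp.factors_annotated:
        reasons.append("EF/EFV annotation insufficient")
    if not exp.samples_annotated:
        reasons.append("sample annotation insufficient")
    if not (exp.has_processed_data or exp.has_renormalizable_raw_data):
        reasons.append("no usable data files")
    return reasons


# Placeholder accession and made-up values.
candidate = Experiment("E-XXXX-1", 4, True, True, True, False, True)
print(atlas_rejection_reasons(candidate))
```

Returning a list of reasons rather than a single yes/no makes it easier to tell a submitter what would need to change for an experiment to qualify.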

Page 31

Expression Atlas construction

Analysis pipeline

New meta-analytical tool for searching gene expression profiles across experiments in AE

Data is taken as normalized by the submitter

Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions and across experiments

The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples.

The matrix entries are p-values together with a sign, indicating the significance and direction of differential expression
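To make the shape of that matrix concrete, the sketch below builds a small gene-by-condition table of signed p-values. It does not reimplement limma: an exact two-sided permutation test on the difference of means is swapped in so that the example needs only the Python standard library. The expression values are invented.

```python
from itertools import combinations
from statistics import mean


def signed_p_value(condition_values, other_values):
    """Exact two-sided permutation test on the difference of means.

    Returns (sign, p): sign is +1 (up), -1 (down) or 0 (no difference).
    """
    observed = mean(condition_values) - mean(other_values)
    pooled = list(condition_values) + list(other_values)
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(condition_values)):
        chosen_set = set(chosen)
        group = [pooled[i] for i in chosen]
        rest = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group) - mean(rest)) >= abs(observed) - 1e-12:
            extreme += 1
    sign = (observed > 0) - (observed < 0)
    return sign, extreme / total


def build_matrix(expression):
    """Rows are genes, columns are conditions, entries are (sign, p)."""
    matrix = {}
    for gene, by_condition in expression.items():
        matrix[gene] = {}
        for condition, values in by_condition.items():
            others = [v for c, vs in by_condition.items() if c != condition for v in vs]
            matrix[gene][condition] = signed_p_value(values, others)
    return matrix


# Made-up normalized expression values: gene -> condition -> replicates.
expression = {
    "geneA": {"control": [1.0, 1.1, 0.9], "treated": [3.0, 3.2, 2.9]},
    "geneB": {"control": [2.0, 2.1, 1.9], "treated": [2.0, 1.9, 2.1]},
}

matrix = build_matrix(expression)
for gene, row in matrix.items():
    print(gene, row)
```

With three replicates per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 0.1. This is one reason moderated statistics such as limma's, which borrow information across genes, are preferred for small experiments.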

Page 32

Expression Atlas construction

Page 33

Expression Atlas construction

Page 34

Expression Atlas

Page 35

Atlas home page

http://www.ebi.ac.uk/gxa/

Query for genes

Query for conditions

Restrict query by direction of differential expression

The ‘advanced query’ option allows building more complex queries

Page 36

Atlas home page

The ‘Genes’ and ‘Conditions’ search boxes

Page 37

Atlas home page

A single gene query

Page 38

Atlas gene summary page

Page 39

Atlas experiment page

Page 40

Atlas experiment page – HTS data

Page 41

Atlas home page

A ‘Conditions’ only query

Page 42

Atlas heatmap view

Page 43

Atlas gene-condition query

Page 44

Atlas advanced search

Page 45

Atlas advanced search

Page 46

Atlas advanced search

Page 47

Data submission to AE

Page 48

Data submission to AE

www.ebi.ac.uk/microarray/submissions.html

Page 49

Submission of HTS gene expression data

• Submit via MAGE-TAB submission route

• Submit:

• MAGE-TAB spreadsheet containing details of the samples and protocols used.

• Trace data files for each sample (in SRF, FASTQ or SFF format)

• Processed data files

• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).

• If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not to ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
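The routing rule and the list of required files can be summarised as a small pre-submission check. The sketch below is purely illustrative: the function names are hypothetical, file types are guessed from file extensions, and it is not part of any ArrayExpress submission tool.

```python
import os

# Trace formats named on the slide; compressed variants (e.g. .gz) are not handled.
ACCEPTED_TRACE_FORMATS = {".srf", ".fastq", ".sff"}


def submission_route(human_identifiable):
    """Identifiable human data goes to EGA; everything else to ArrayExpress."""
    return "EGA" if human_identifiable else "ArrayExpress"


def submission_problems(has_magetab, trace_files, processed_files):
    """List what is missing or unsupported in an HTS submission package."""
    problems = []
    if not has_magetab:
        problems.append("MAGE-TAB spreadsheet missing")
    if not trace_files:
        problems.append("no trace data files")
    for name in trace_files:
        extension = os.path.splitext(name)[1].lower()
        if extension not in ACCEPTED_TRACE_FORMATS:
            problems.append(f"unsupported trace format: {name}")
    if not processed_files:
        problems.append("no processed data files")
    return problems


# Made-up file names for illustration.
route = submission_route(human_identifiable=False)
problems = submission_problems(True, ["s1.fastq", "s2.bam"], ["matrix.txt"])
print(route, problems)
```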

Page 50

Types of data that can be submitted

Page 51

What happens after submission?

• Email confirmation

• Curation

• The curation team will review your submission and will email you with any questions.

• Possible reopening for editing

• We will send you an accession number when all the required information has been provided.

• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

Page 52

Find out more

• Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and Atlas

• Email us at: [email protected]

• Atlas mailing list: [email protected]
