ArrayExpress and Expression Atlas: Mining Functional Genomics data

ArrayExpress and Expression Atlas: Mining Functional Genomics data Gabriella Rustici, PhD Functional Genomics Team EBI-EMBL [email protected]

Upload: didina

Post on 24-Feb-2016

TRANSCRIPT

Page 1: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress and Expression Atlas: Mining Functional Genomics data

Gabriella Rustici, PhD, Functional Genomics Team, EBI-EMBL

[email protected]

Page 2: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

What is functional genomics (FG)?

• The aim of FG is to understand the function of genes and other parts of the genome

• FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions

• High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome

2 ArrayExpress

Page 3: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

What biological questions is FG addressing?

• When and where are genes expressed?

• How do gene expression levels differ in various cell types and states?

• What are the functional roles of different genes and in what cellular processes do they participate?

• How are genes regulated?

• How do genes and gene products interact?

• How is gene expression changed in various diseases or following a treatment?

Page 4: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Components of an FG experiment

Page 5: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

FG public repositories: ArrayExpress

• Is a public repository for FG data, which provides easy access to well-annotated data in a structured and standardized format

• Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ

• Facilitates the sharing of experimental information associated with the data, such as microarray designs, experimental protocols, …

• Based on community standards: MIAME guidelines and MAGE-TAB format for microarray data, MINSEQE guidelines for HTS data (http://www.mged.org/minseqe/)

Page 6: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Community standards for data requirement

MIAME = Minimal Information About a Microarray Experiment

MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment

The checklist (requirements compared across MIAME and MINSEQE):

1. Experiment design / background description
2. Sample annotation and experimental factor
3. Array design annotation (e.g. probe sequence)
4. All protocols (wet-lab bench and data processing)
5. Raw data files (from scanner or sequencing machine)
6. Processed data files (normalised and/or transformed)

Page 7: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Standards for microarray & sequencing: MAGE-TAB format

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray or HTS experiment:

• IDF: Investigation Description Format file. Contains top-level information about the experiment, including title, description, submitter contact details and protocols.

• SDRF: Sample and Data Relationship Format file. Contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.

• ADF: Array Design Format file. Describes probes on an array, e.g. sequence, genomic mapping location (for array data only).

• Data files: raw and processed data files. For HTS, the ‘raw’ data files are the trace data files (.srf or .sff). Fastq format files are also accepted, but SRF format files are preferred. The trace data files that you submit to ArrayExpress will be stored in the European Nucleotide Archive (ENA). The processed data file is a ‘data matrix’ file containing processed values, e.g. files in which the expression values are linked to genome coordinates.
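Because MAGE-TAB files are tab-delimited text, they can be read with ordinary spreadsheet or scripting tools. The sketch below is only an illustration: the IDF and SDRF contents are invented, the helper names `parse_idf` and `parse_sdrf` are hypothetical, and it assumes the usual layout in which the IDF holds one tag per row and the SDRF holds a header row followed by one row per sample.

```python
import csv
import io

# Hypothetical IDF content: one tag per row, values in the following cells.
IDF_TEXT = (
    "Investigation Title\tExample tumour vs normal study\n"
    "Person Last Name\tDoe\n"
    "Protocol Name\tP-EXAMPLE-1\tP-EXAMPLE-2\n"
    "SDRF File\texample.sdrf.txt\n"
)

# Hypothetical SDRF content: a header row, then one row per sample/assay.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\tsarcoma\tsample1.CEL\n"
    "sample 2\tHomo sapiens\tnormal\tsample2.CEL\n"
)


def parse_idf(text):
    """Return {tag: [values]} from row-oriented IDF text."""
    idf = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if row and row[0].strip():
            idf[row[0].strip()] = [cell for cell in row[1:] if cell]
    return idf


def parse_sdrf(text):
    """Return one dict per SDRF row, keyed by the column headings."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


idf = parse_idf(IDF_TEXT)
samples = parse_sdrf(SDRF_TEXT)

# Link each data file to the experimental factor value of its sample.
factor_by_file = {
    row["Array Data File"]: row["Factor Value[disease]"] for row in samples
}
print(idf["Investigation Title"][0])
print(factor_by_file)
```

Real SDRF files can repeat a column heading (for example several protocol or comment columns); a dictionary per row keeps only the last of those, so a fuller reader would work with column positions instead.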

Page 8: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress – two databases

Page 9: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

What is the difference between them?

ArrayExpress Archive

• Central object: experiment

• Query to retrieve experimental information and associated data

Expression Atlas

• Central object: gene/condition

• Query for gene expression changes across experiments and across platforms

Page 10: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress – two databases

Page 11: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress Archive – when to use it?

• Find FG experiments that might be relevant to your research

• Download data and re-analyze it.

Often data deposited in public repositories can be used to answer different biological questions from the one asked in the original experiments.

• Submit microarray or HTS data that you want to publish.

Major journals will require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.

Page 12: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

How much data in AE Archive? (as of mid-September 2012)

Page 13: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

HTS data in AE Archive (as of mid-September 2012)

Microarray vs HTS

RNA-, DNA-, ChIP-seq breakdown

Page 14: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress: www.ebi.ac.uk/arrayexpress/

Page 15: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Browsing the AE Archive

• AE unique experiment ID

• Curated title of experiment

• Number of assays

• Species investigated

• The date when the data were loaded in the Archive

• The direct link to raw and processed data. An icon indicates that this type of data is available.

• ‘Loaded in Atlas’ flag

• Raw sequencing data available in ENA

• The total number of experiments and assays retrieved

• The list of experiments retrieved can be printed, saved in tab-delimited format, exported to Excel, or exported as an RSS feed

Page 16: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Browsing the AE Archive

Page 17: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Searching AE with the Experimental Factor Ontology (EFO)

An application-focused ontology modeling the relationship between experimental factors (EFs) in AE

Developed to:

• increase the richness of annotations that are currently made in AE Archive

• promote consistency

• facilitate automatic annotation and integrate external data

EFs are transformed into an ontological representation, forming classes and relationships between those classes

Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology, NCBI Taxonomy

Page 18: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Building EFO: an example

1. Take all experimental factors: sarcoma, cancer, neoplasm, disease, Kaposi’s sarcoma

2. Find the logical connection between them:

• disease is the parent term

• neoplasm is a type of disease

• cancer is a synonym of neoplasm

• sarcoma is a type of cancer

• Kaposi’s sarcoma is a type of sarcoma

3. Organize them in an ontology:

[-] disease
  [-] neoplasm
    [-] cancer
      [-] sarcoma
        Kaposi’s sarcoma
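The ontology-building example on this slide can be sketched in a few lines of code. This is a toy illustration, not how EFO itself is built or stored: it encodes only the five terms from the slide, follows the final tree (where cancer sits under neoplasm), and the names `PARENT`, `ancestors`, `descendants` and `render` are hypothetical.

```python
# Is-a relations transcribed from the slide's final tree; a real ontology
# carries many more terms plus synonyms and cross-references.
PARENT = {
    "neoplasm": "disease",
    "cancer": "neoplasm",
    "sarcoma": "cancer",
    "Kaposi's sarcoma": "sarcoma",
}


def ancestors(term):
    """Walk up the is-a chain and return the parents, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def descendants(term):
    """Return every term that sits below `term` in the hierarchy."""
    found = []
    for child, parent in PARENT.items():
        if parent == term:
            found.append(child)
            found.extend(descendants(child))
    return found


def render(term, depth=0):
    """Return the subtree under `term` as indented lines."""
    lines = ["  " * depth + term]
    for child, parent in PARENT.items():
        if parent == term:
            lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render("disease")))
print(ancestors("Kaposi's sarcoma"))
```

Storing only the parent of each term keeps the structure small; a query for "cancer" can then be widened to everything returned by `descendants("cancer")`.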

Page 19: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Exploring EFO: an example

More information at: http://www.ebi.ac.uk/efo

Page 20: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Searching AE Archive: simple query

Page 21: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

AE Archive query output

• Matches to exact terms are highlighted in yellow

• Matches to synonyms are highlighted in green

• Matches to child terms in the EFO are highlighted in pink
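The three kinds of match can be made concrete with a small sketch. It assumes a miniature, hand-written ontology (the `SYNONYMS` and `CHILDREN` tables and the `classify` function are hypothetical and are not the Archive's search code); only the colour scheme is taken from the slide.

```python
# Hypothetical miniature ontology: synonyms and direct children per term.
SYNONYMS = {"cancer": {"neoplasm"}}
CHILDREN = {"cancer": {"sarcoma"}, "sarcoma": {"Kaposi's sarcoma"}}

# Highlight colours as listed on the slide.
COLOURS = {"exact": "yellow", "synonym": "green", "child": "pink"}


def all_children(term):
    """Collect child terms at any depth below `term`."""
    result = set()
    for child in CHILDREN.get(term, ()):
        result.add(child)
        result |= all_children(child)
    return result


def classify(query, annotation):
    """Return 'exact', 'synonym', 'child' or None for one annotation value."""
    query, annotation = query.lower(), annotation.lower()
    if annotation == query:
        return "exact"
    if annotation in {s.lower() for s in SYNONYMS.get(query, ())}:
        return "synonym"
    if annotation in {c.lower() for c in all_children(query)}:
        return "child"
    return None


for value in ["cancer", "neoplasm", "Kaposi's sarcoma", "diabetes"]:
    kind = classify("cancer", value)
    print(value, kind, COLOURS.get(kind))
```

Checking exact matches first, then synonyms, then child terms means each annotation gets a single, most specific label.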

Page 22: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

AE Archive – experiment view

Page 23: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

SDRF file – sample & data relationship

Page 24: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

ArrayExpress – two databases

Page 25: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas – when to use it?

• Find out if the expression of a gene (or a group of genes with a common gene attribute, e.g. GO term) change(s) across all the experiments available in the Expression Atlas;

• Discover which genes are differentially expressed in a particular biological condition that you are interested in.

Page 26: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas construction: experiment selection criteria during curation

• Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to enable re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)

• At least 3 replicates for each value of the experimental factor

• Maximum 4 experimental factors

• Adequate sample annotation using EFO terms

• Presence of data files: CEL raw data files for Affymetrix assays, processed data files for non-Affymetrix ones
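Two of the selection criteria are simple counts and can be checked mechanically. The sketch below is an illustration under assumptions: `check_experiment` is a hypothetical helper, and the input is an invented in-memory form of the sample annotation (one dictionary of factor name to factor value per assay), not the curation tooling itself.

```python
from collections import Counter

MIN_REPLICATES = 3  # at least 3 replicates for each experimental factor value
MAX_FACTORS = 4     # maximum 4 experimental factors


def check_experiment(samples):
    """Return a list of problems; an empty list means both checks pass.

    `samples` is a list of dicts mapping factor name -> factor value,
    one dict per assay (a hypothetical in-memory form of the SDRF).
    """
    problems = []
    factors = sorted({name for sample in samples for name in sample})
    if len(factors) > MAX_FACTORS:
        problems.append(
            f"{len(factors)} experimental factors (maximum {MAX_FACTORS})"
        )
    for factor in factors:
        counts = Counter(s[factor] for s in samples if factor in s)
        for value, n in sorted(counts.items()):
            if n < MIN_REPLICATES:
                problems.append(
                    f"{factor}={value}: {n} replicate(s), need {MIN_REPLICATES}"
                )
    return problems


samples = [{"disease": "sarcoma"}] * 3 + [{"disease": "normal"}] * 2
print(check_experiment(samples))
```

The remaining criteria (array design availability, EFO annotation quality, presence of data files) need curator judgement or file checks and are not modelled here.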

Page 27: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas construction: analysis pipeline

Input data (Affy CEL, non-Affy processed): matrix of genes × conditions (Cond.1, Cond.2, Cond.3)

Linear model* (Bioconductor limma)

Output: 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed

* More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full

Page 28: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas construction: analysis pipeline

“Is gene X differentially expressed in condition 1 in this experiment?”

[Figure: expression values for gene X (each point = a single expression value for gene X), with the Cond.1 mean, Cond.2 mean, Cond.3 mean and the mean of all samples marked]

Compare and calculate statistic
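The comparison of each condition mean with the mean of all samples can be illustrated with a deliberately simple statistic. This is not the Atlas pipeline, which fits a linear model with limma: the sketch swaps in a plain pooled-variance t-like statistic, the expression values are invented, and the cutoff in `call` is an arbitrary placeholder.

```python
from math import sqrt
from statistics import mean


def condition_vs_overall(values_by_condition):
    """Return {condition: t-like statistic} for one gene in one experiment.

    Each condition mean is compared with the mean of all samples, using a
    pooled within-condition variance. With a shared variance sigma^2,
    Var(condition mean - overall mean) = sigma^2 * (1/n_c - 1/N).
    Needs at least two conditions and some within-condition spread.
    """
    all_values = [v for vals in values_by_condition.values() for v in vals]
    n_total = len(all_values)
    overall = mean(all_values)
    df = n_total - len(values_by_condition)
    residual_ss = sum(
        (v - mean(vals)) ** 2
        for vals in values_by_condition.values()
        for v in vals
    )
    pooled_var = residual_ss / df
    stats = {}
    for condition, vals in values_by_condition.items():
        se = sqrt(pooled_var * (1 / len(vals) - 1 / n_total))
        stats[condition] = (mean(vals) - overall) / se
    return stats


def call(stat, cutoff=3.0):
    """Turn a statistic into a 1/0 call; the cutoff is a placeholder."""
    return 1 if abs(stat) > cutoff else 0


# Invented expression values for gene X in three conditions.
gene_x = {
    "Cond.1": [8.1, 7.9, 8.0],
    "Cond.2": [5.0, 5.2, 4.8],
    "Cond.3": [5.1, 4.9, 5.0],
}
stats = condition_vs_overall(gene_x)
print({cond: call(s) for cond, s in stats.items()})
```

The sign of the statistic gives the direction (above or below the overall mean), which is what allows a query to be restricted to up- or down-regulated genes.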

Page 29: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas construction

Exp. 1: genes × (Cond.1, Cond.2, Cond.3) → statistical test

Exp. 2: genes × (Cond.4, Cond.5, Cond.6) → statistical test

Exp. n: genes × (Cond.X, Cond.Y, Cond.Z) → statistical test

Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition

Page 30: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas construction

Summary of the “verdicts” from different experiments
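A summary of this kind amounts to counting votes per gene and condition. The sketch below assumes each experiment reports a signed call (+1 up, -1 down, 0 not differentially expressed); that encoding, the example verdicts and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-experiment verdicts: (experiment, gene, condition, call)
# with call +1 = up, -1 = down, 0 = not differentially expressed.
VERDICTS = [
    ("Exp. 1", "gene X", "sarcoma", +1),
    ("Exp. 2", "gene X", "sarcoma", +1),
    ("Exp. 3", "gene X", "sarcoma", -1),
    ("Exp. 4", "gene X", "sarcoma", 0),
    ("Exp. 1", "gene Y", "sarcoma", 0),
]


def summarize(verdicts):
    """Count up/down/non-DE votes per (gene, condition) pair."""
    summary = defaultdict(lambda: {"up": 0, "down": 0, "none": 0})
    for _experiment, gene, condition, call in verdicts:
        key = "up" if call > 0 else "down" if call < 0 else "none"
        summary[(gene, condition)][key] += 1
    return dict(summary)


for (gene, condition), votes in summarize(VERDICTS).items():
    print(gene, condition, votes)
```

Keeping the counts separate, rather than averaging them, preserves disagreement between experiments, so a gene that is up in two studies and down in one remains visible as such.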

Page 31: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Expression Atlas

Page 32: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas home page: http://www.ebi.ac.uk/gxa/

• Query for genes

• Query for conditions

• Restrict query by direction of differential expression

• The ‘advanced query’ option allows building more complex queries

Page 33: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas home page: the ‘Genes’ and ‘Conditions’ search boxes

Page 34: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas home page: a single gene query

Page 35: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas gene summary page

Page 36: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas experiment page

Page 37: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas experiment page – HTS data

Page 38: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas home page: a ‘Conditions’ only query

Page 39: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas heatmap view

Page 40: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas gene-condition query

Page 41: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas advanced search

Page 42: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas advanced search

Page 43: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Atlas advanced search

Page 44: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

A glimpse of what’s coming… “Differential atlas”

“Is gene X differentially expressed in condition 1 in this experiment?”

Cond.1 mean Cond.2 mean Cond.3 mean Mean of all samples

= a single expression value for gene X

Compare and calculate statistic

Page 45: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

A glimpse of what’s coming… “Differential atlas” mock-up (1)

Page 46: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

A glimpse of what’s coming… “Differential atlas” mock-up (2)

Page 47: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

A glimpse of what’s coming… “Baseline atlas”

• Gene expression in normal tissues, not looking for differentially expressed genes based on different conditions

• E.g. “Give me all the genes expressed in normal human kidney”

• Can also filter genes by expression level (e.g. FPKM values)

• Start with Illumina Body Map 2.0 RNA-seq data

• 16 tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells

• We are working on something similar for mouse
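A baseline query such as “all the genes expressed in normal human kidney” reduces to filtering a gene-by-tissue table on an expression threshold. The sketch below is a minimal illustration: the FPKM values, gene names, the default threshold of 1.0 and the `expressed_in` helper are all invented for the example.

```python
# Hypothetical FPKM values per gene and tissue; real values would come from
# processed RNA-seq data such as the Illumina Body Map 2.0 set.
FPKM = {
    "gene A": {"kidney": 35.2, "liver": 0.3, "heart": 1.1},
    "gene B": {"kidney": 0.0, "liver": 80.4, "heart": 0.2},
    "gene C": {"kidney": 2.4, "liver": 2.1, "heart": 2.9},
}


def expressed_in(tissue, min_fpkm=1.0):
    """Return (gene, FPKM) pairs at or above `min_fpkm`, highest first."""
    hits = [
        (gene, levels[tissue])
        for gene, levels in FPKM.items()
        if levels.get(tissue, 0.0) >= min_fpkm
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


print(expressed_in("kidney"))
print(expressed_in("kidney", min_fpkm=10))
```

Raising `min_fpkm` narrows the list to strongly expressed genes, which corresponds to the expression-level filter mentioned on the slide.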

Page 48: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

A glimpse of what’s coming… “Baseline atlas” mock-up display

Page 49: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Data submission to AE

Page 50: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Data submission to AE: www.ebi.ac.uk/microarray/submissions.html

Page 51: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Submission of HTS data

ArrayExpress acts as a “broker” for the submitter.

• Meta-data and processed data: ArrayExpress

• Raw sequence reads* (e.g. fastq, bam): ENA

*See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats

Page 52: ArrayExpress  and  Expression  Atlas: Mining Functional Genomics data

Find out more

• Visit our eLearning portal, Train online, at http://www.ebi.ac.uk/training/online/ for courses on ArrayExpress and Atlas

• Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y

• Email us at: [email protected]

• Atlas mailing list: [email protected]
