SMD Data Quality Assessment and Repository Tools Tutorial November 10, 2007 Catherine Ball Janos Demeter

Page 1: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Data Quality Assessment and Repository

Tools Tutorial

November 10, 2007

Catherine Ball

Janos Demeter

Page 2: SMD Data Quality Assessment and Repository Tools Tutorial

SMD: Getting Help

• Click on the “Help” menu
  – Tool-specific links will be listed at the top.
• Use the SMD help index to look for specific subjects
• Send e-mail to: [email protected]

Page 3: SMD Data Quality Assessment and Repository Tools Tutorial

Quality Assessment and Repository Tools Tutorial

• Quality Assessment Tools
  – Ratios on Array
  – HEEBO/MEEBO plots
  – Graphing tool
  – Q-score
• Repository
  – Repository
  – SVD
  – Synthetic Gene Tool
  – kNNimpute

Page 4: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Data Repository Help

• How to use the tool
• Limitations of file sizes
• Sharing data
• Options
• Links to help for analysis methods, data file formats, data retrieval and clustering

Page 5: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Help: File Formats

Page 6: SMD Data Quality Assessment and Repository Tools Tutorial

File Formats: Pre-clustering (PCL) File

UID is the Unique Identifier for the Spot/Reporter

NAME is the sequence label for the Spot/Reporter

GWEIGHT indicates the weight the Spot/Reporter is given in clustering

Names and orders of arrays (if arrays are not clustered)

EWEIGHT indicates the weight the Array/Experiment is given in clustering

Values are for each spot/reporter on each array (usually log ratios)
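
As a rough illustration of the layout described above, the following Python sketch reads a small tab-delimited PCL table. The sample data, the array names, and the function name `parse_pcl` are made up for this example. It assumes the header row holds UID, NAME, GWEIGHT and then one column per array, that the EWEIGHT row comes directly beneath the header, and that missing values are empty cells.

```python
from typing import Dict, List, Optional


def parse_pcl(text: str) -> Dict[str, object]:
    """Parse a tab-delimited PCL table into plain Python structures."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    arrays = header[3:]  # array/experiment names follow UID, NAME, GWEIGHT

    eweight_row = lines[1].split("\t")
    if eweight_row[0] != "EWEIGHT":
        raise ValueError("expected an EWEIGHT row beneath the header")
    eweights = [float(v) for v in eweight_row[3:]]

    rows = []
    for ln in lines[2:]:
        cells = ln.split("\t")
        cells += [""] * (len(header) - len(cells))  # pad short rows
        values: List[Optional[float]] = [
            float(v) if v.strip() else None for v in cells[3:]
        ]
        rows.append({
            "uid": cells[0],
            "name": cells[1],
            "gweight": float(cells[2]),
            "values": values,  # usually log ratios; None marks a missing value
        })
    return {"arrays": arrays, "eweights": eweights, "rows": rows}


# Hypothetical two-array, two-spot example.
SAMPLE_PCL = (
    "UID\tNAME\tGWEIGHT\tarray_1\tarray_2\n"
    "EWEIGHT\t\t\t1\t1\n"
    "spot_1\tgene A\t1\t0.5\t-1.2\n"
    "spot_2\tgene B\t1\t\t0.8\n"
)

pcl = parse_pcl(SAMPLE_PCL)
```

Keeping missing values as `None` rather than zero matters later, because tools such as SVD need them estimated rather than silently treated as "no change".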

Page 7: SMD Data Quality Assessment and Repository Tools Tutorial

File Formats: Clustering Design Tree (CDT) File

Page 8: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Data Repository

• What is the SMD Data Repository?
  – What is the repository?
  – Using the repository to save or upload data
  – Using the repository to share data
  – Using the repository to analyze data
• Options for PCL files via the repository
  – View
  – Data
  – Delete
  – Edit
  – Cluster
  – Filter
  – SVD
  – Synthetic Genes
  – KNN Impute
• Options for CDT files via the repository
  – GeneXplorer
  – TreeView
  – View Clusters, spots

Page 9: SMD Data Quality Assessment and Repository Tools Tutorial

What is the SMD Repository?

• A method to save data sets to prevent repeatedly performing the same data retrieval

• A method to share processed data with others

• A way SMD can provide you with access to new and/or computationally-intensive tools

Page 10: SMD Data Quality Assessment and Repository Tools Tutorial

Accessing the SMD Data Repository

Here!

Page 11: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Data Repository

Page 12: SMD Data Quality Assessment and Repository Tools Tutorial

Uploading files to Repository

• If uploading clustered data, enter “CDT” files

• If uploading pre-clustering data, enter “PCL” files

• Choose an organism
• Give a unique name to your data set
• Provide a useful description of your data set

Page 13: SMD Data Quality Assessment and Repository Tools Tutorial

Using Your Repository: CDT Deposits

• View cluster using GeneXplorer or TreeView
• View cluster images
• View retrieval and clustering report
• Download files
• Assign access

Page 14: SMD Data Quality Assessment and Repository Tools Tutorial

Using Your Repository: PCL Deposits

View information about your repository entry

Download data

Delete the repository entry

Edit the entry

Cluster data

Filter data

Apply SVD to data

Apply “Synthetic Genes” to data

Estimate missing data with KNN impute

Page 15: SMD Data Quality Assessment and Repository Tools Tutorial

Using the Repository: CDT File Options

CDT files have a few other options

GeneXplorer

TreeView

Clustering with Proxy images

Clustering with Spot images

Clustering with Proxy and Spot images

Page 16: SMD Data Quality Assessment and Repository Tools Tutorial

Viewing Repository Entries

• Name
• Organism
• Number of genes
• Number of arrays
• Size of file
• Date uploaded
• Description
• Data retrieval summary

Page 17: SMD Data Quality Assessment and Repository Tools Tutorial

Downloading Repository Entries

Downloading puts the file(s) into a folder labeled with your SMD user name on your computer’s desktop

Page 18: SMD Data Quality Assessment and Repository Tools Tutorial

Deleting Repository Entries

• Details about your repository entry

• Asks you to confirm before deleting!

Page 19: SMD Data Quality Assessment and Repository Tools Tutorial

Editing Entries -- How to Share!

• Change repository entry name

• Change description
• Give a GROUP access to a repository entry
• Give an SMD USER access to a repository entry

Page 20: SMD Data Quality Assessment and Repository Tools Tutorial

Filtering Data in Repository Entries

• If your repository entry is a PCL file, you can re-enter the SMD filtering pipeline

Page 21: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: Singular Value Decomposition

• The goal of SVD is to find a set of patterns that describe the greatest amount of variance in a dataset

• SVD determines unique orthogonal (or uncorrelated) gene and corresponding array expression patterns (i.e. "eigengenes" and "eigenarrays," respectively) in the data

• Patterns might be correlated with biological processes OR might be correlated with technical artifacts
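
In matrix terms, the decomposition described above can be written as follows. The symbols are notation chosen for this note rather than taken from the slides: X is the genes-by-arrays data matrix with no missing values, and the σ values are the singular values in decreasing order.

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
U^{\mathsf{T}} U = V^{\mathsf{T}} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0
```

Under the convention usually used for microarray SVD, the rows of V^T play the role of eigengenes (patterns across arrays) and the columns of U play the role of eigenarrays (patterns across genes). Larger σ values mark patterns that account for more of the variance.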

Page 22: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: The Concept (easy version)

• Let’s imagine we have a three-dimensional cigar, as shown in A

• We can represent this in one dimension, by looking at its lengthwise shadow (B)

• Looking at its cross-wise shadow (C), we get an orthogonal view of the cigar that tells us more about the three-dimensional object than B alone.

Page 23: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: Missing Data Estimation

• Some algorithms (such as SVD) cannot operate with missing data

• You can use this simple method or you can use KNNImpute to estimate missing data

Page 24: SMD Data Quality Assessment and Repository Tools Tutorial

SVD Display in SMD

Page 25: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: Raster Display

• Each row represents an “eigengene” -- an orthogonal representation of the genes in the dataset

• The topmost eigengene contributes the most to the data set

Page 26: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: View Projection

• Clicking on a row in the Raster Display brings you to the Projection View

• You can select genes that have high and low contributions from an eigengene and download them in a PCL file

• In this way, you might use SVD to help classify subtypes

Page 27: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: Eigenexpression

• Each bar shows the probability of expression of each eigengene

• You can compare the probabilities to see which eigengenes contribute more to the overall “view” of the data
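
One common way to put a number on an eigengene's contribution is its share of the total squared singular values. Whether the bars in SMD use exactly this quantity is an assumption here. The sketch below estimates only the leading eigengene and its share, using power iteration on XᵀX instead of a full SVD so that it needs nothing beyond the standard library. The function name and the toy matrix are illustrative.

```python
import math
import random
from typing import List, Tuple


def leading_eigengene(
    x: List[List[float]], iterations: int = 200, seed: int = 0
) -> Tuple[List[float], float]:
    """Return (eigengene, fraction) for the dominant pattern across arrays.

    x is a genes (rows) by arrays (columns) matrix with no missing values.
    fraction is sigma_1^2 divided by the sum of all sigma_k^2.
    """
    n_arrays = len(x[0])
    # Gram matrix G = X^T X (arrays x arrays); its eigenvalues are sigma_k^2.
    gram = [
        [sum(row[i] * row[j] for row in x) for j in range(n_arrays)]
        for i in range(n_arrays)
    ]
    # Seeded random start, so an exactly orthogonal start is very unlikely.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n_arrays)]
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n_arrays)) for i in range(n_arrays)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    # Rayleigh quotient v^T G v approximates sigma_1^2 for unit-length v.
    sigma_sq = sum(
        v[i] * gram[i][j] * v[j] for i in range(n_arrays) for j in range(n_arrays)
    )
    total = sum(gram[i][i] for i in range(n_arrays))  # trace = sum of sigma_k^2
    fraction = sigma_sq / total if total > 0 else 0.0
    return v, fraction


# Toy data: every gene follows the same pattern, so one eigengene explains it all.
eigengene, fraction = leading_eigengene([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

For real data sets a numerical library's SVD routine would return all eigengenes at once. Power iteration is used here only to keep the example dependency-free.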

Page 28: SMD Data Quality Assessment and Repository Tools Tutorial

SVD: Plot selected eigengenes

• You can plot as many or as few eigengenes as you like

• This plot gives you an easy-to-understand view of the behavior of each eigengene

Page 29: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes

• Purpose: average data based on arbitrary groupings of genes
  - for biological reasons
  - for technical reasons
• Can average data using:
  - common genelists
  - your own genelists
• After averaging:
  - a new row for the synthetic gene data
  - original data can be removed/included

Page 30: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes

• Common lists available (only mouse and human data):
  – Unigene (all clones/oligos that report on a given Unigene id will be averaged and shown as the Unigene id)
  – LocusLink (same as above, but for LocusLink id)

These lists are useful to collapse data by gene, rather than suid/luid.

They allow comparison of experiments between different platforms (oligo print to cDNA print, or spotted arrays to Agilent arrays) where the arrays don’t share common suids. They can also be used to compare cDNA prints with HEEBO/MEEBO arrays.

These synthetic gene lists are updated on a regular basis.

Page 32: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes

• You can use your own genelists:
  – 1 genelist for each synthetic gene
  – Name of the genelist is the synthetic gene’s name
  - tab-delimited text file
  - File must have header (NAME, WEIGHT)
  - NAME contains cloneid
  - WEIGHT can be -1 to 1 (weight of clone during averaging)
  - Can have comment lines (start with #)
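
The genelist format above can be sketched in Python as follows. The slides do not say how the weights enter the average, so the weighted mean used here (sum of weight times value, divided by the sum of absolute weights) is an assumption, as are the function names and the clone ids in the sample.

```python
from typing import Dict, List, Optional


def parse_genelist(text: str) -> Dict[str, float]:
    """Read a tab-delimited genelist with a NAME/WEIGHT header and # comments."""
    weights: Dict[str, float] = {}
    header_seen = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if not header_seen:
            header_seen = True  # first remaining line is the NAME, WEIGHT header
            continue
        fields = line.split("\t")
        weights[fields[0].strip()] = float(fields[1])
    return weights


def synthetic_gene(
    weights: Dict[str, float], data: Dict[str, List[Optional[float]]]
) -> List[Optional[float]]:
    """Weighted average per array over the clones named in the genelist."""
    n_arrays = len(next(iter(data.values())))
    result: List[Optional[float]] = []
    for col in range(n_arrays):
        numerator = 0.0
        denominator = 0.0
        for clone_id, weight in weights.items():
            row = data.get(clone_id)
            if row is None or row[col] is None:
                continue  # clone absent or value missing on this array
            numerator += weight * row[col]
            denominator += abs(weight)  # assumed normalisation, see lead-in
        result.append(numerator / denominator if denominator else None)
    return result


# Hypothetical genelist and data for two clones on two arrays.
GENELIST = "# example list\nNAME\tWEIGHT\nIMAGE:1\t1\nIMAGE:2\t-1\n"
DATA = {"IMAGE:1": [2.0, 1.0], "IMAGE:2": [-1.0, None]}

row = synthetic_gene(parse_genelist(GENELIST), DATA)
```

A weight of -1 flips the sign of a clone's contribution, which is one plausible reading of why the allowed range runs from -1 to 1.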

Page 33: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes

• Tool only works on pcl files in repository
• During data retrieval the ‘include UIDs’ option should not be used
• After collapsing, file can be downloaded, added to your repository, and/or clustered
• Currently works only for human and mouse data

Page 34: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes/Merge PCL Files

• Related tool: Merge PCL Files
  – On main page (lists menu -> all programs) under tools section
  – Can be used to combine 2 pcl files from different sources into a single pcl file.
  – Cloneids that belong to the same gene can be combined into single row (based on a translation file provided).

Page 35: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes/Merge PCL Files

Page 36: SMD Data Quality Assessment and Repository Tools Tutorial

Synthetic Genes/Merge PCL Files

• Same experiments in the pcl files can be averaged
• Averaging method can be mean/median
• Translation file:
  – Tab-delimited text file
  – First column: desired final identifier
  – Second column: desired final annotation
  – Third and subsequent columns: identifiers (first column of a pcl file) in the pcl files that should be collapsed to the identifier in the first column.
  – Data for identifiers not included in the translation file will not be collapsed
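
A minimal Python sketch of collapsing rows with a translation file, following the column layout listed above. The function name, the sample identifiers, and the choice to pass unlisted identifiers through unchanged are assumptions for illustration, not a description of the SMD implementation.

```python
from statistics import mean, median
from typing import Dict, List, Optional, Tuple

Row = List[Optional[float]]


def collapse_rows(
    translation_text: str, data: Dict[str, Row], method: str = "mean"
) -> Dict[str, Tuple[str, Row]]:
    """Collapse pcl rows to the identifiers named in a translation file."""
    combine = mean if method == "mean" else median
    collapsed: Dict[str, Tuple[str, Row]] = {}
    used = set()
    for line in translation_text.splitlines():
        if not line.strip():
            continue
        fields = [f.strip() for f in line.split("\t")]
        final_id, annotation, members = fields[0], fields[1], fields[2:]
        present = [m for m in members if m in data]
        if not present:
            continue
        used.update(present)
        n_arrays = len(data[present[0]])
        values: Row = []
        for col in range(n_arrays):
            column = [data[m][col] for m in present if data[m][col] is not None]
            values.append(combine(column) if column else None)
        collapsed[final_id] = (annotation, values)
    # Identifiers missing from the translation file are kept as they are.
    for identifier, row in data.items():
        if identifier not in used:
            collapsed[identifier] = ("", list(row))
    return collapsed


# Hypothetical translation file: one gene reported by a clone and an oligo.
TRANSLATION = "GENE1\tgene one\tcloneA\toligoB\n"
DATA = {"cloneA": [1.0, 2.0], "oligoB": [3.0, None], "cloneC": [0.5, 0.5]}

merged = collapse_rows(TRANSLATION, DATA)
```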

Page 37: SMD Data Quality Assessment and Repository Tools Tutorial

KNNImpute: The Missing Values Problem

• Microarrays can have systematic or random missing values

• Some algorithms aren’t robust to missing values

• Large literature on parameter estimation exists

• What’s best to do for microarrays?

Page 38: SMD Data Quality Assessment and Repository Tools Tutorial

Why Estimate Missing Values?

[Figure panels: complete data set; data set with 30% of entries missing (missing values appear black); data set with missing values estimated by the KNNimpute algorithm]

Page 39: SMD Data Quality Assessment and Repository Tools Tutorial

KNNimpute Algorithm

• Idea: use genes with similar expression profiles to estimate missing values

Row labels in the figure: Gene X, Gene B, Gene C; j marks the column with the missing value.

Before imputation:

2 | 4   | 5 | 7 | 3 | 2
2 |     | 5 | 7 | 3 | 1
3 | 5   | 6 | 7 | 3 | 2

After imputation:

2 | 4   | 5 | 7 | 3 | 2
2 | 4.3 | 5 | 7 | 3 | 1
3 | 5   | 6 | 7 | 3 | 2
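
The idea can be sketched in a few lines of Python, using the row with the blank entry from the example above as the target. The slide does not state how neighbors are weighted, so the inverse Euclidean distance weighting here is an assumption. With it, the estimate comes out near 4.37 rather than exactly the 4.3 shown. The function name is illustrative.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]


def knn_impute_value(target: Row, candidates: List[Row], column: int, k: int = 2) -> Optional[float]:
    """Estimate target[column] from the k most similar rows that have a value there."""
    scored = []
    for row in candidates:
        if row[column] is None:
            continue  # a neighbor must have the value we want to borrow
        shared = [
            (t, r)
            for i, (t, r) in enumerate(zip(target, row))
            if i != column and t is not None and r is not None
        ]
        if not shared:
            continue
        distance = math.sqrt(sum((t - r) ** 2 for t, r in shared))
        scored.append((distance, row[column]))
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # Assumed weighting: closer neighbors count more (inverse distance).
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


target = [2, None, 5, 7, 3, 1]
neighbors = [[2, 4, 5, 7, 3, 2], [3, 5, 6, 7, 3, 2]]
estimate = knn_impute_value(target, neighbors, column=1)
```

The estimate always lies between the neighbors' values in column j, which is why a poor choice of neighbors cannot produce wild values.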

Page 40: SMD Data Quality Assessment and Repository Tools Tutorial

Clustering: Cluster Image

• Scale is indicated on the color bar

• Gene names are at the right

• Tree generated by hierarchical clustering is at the left

Page 41: SMD Data Quality Assessment and Repository Tools Tutorial

Clustering Display: Clustered Spot Images

• Spot images can also be viewed in a clustered image

• This can give you a visual impression of the data that are the basis of your analysis

Page 42: SMD Data Quality Assessment and Repository Tools Tutorial

Clustering Display: Adjacent Cluster and Clustered Spot Images

Page 43: SMD Data Quality Assessment and Repository Tools Tutorial

GENEXPLORER

Page 44: SMD Data Quality Assessment and Repository Tools Tutorial

TREEVIEW

Page 46: SMD Data Quality Assessment and Repository Tools Tutorial

Quality Assessment and Repository Tutorial

• Quality assessment tools
  – Ratios on Array
  – H/Meebo plots
  – Graphing tool
  – Q-score
• Repository
  – Repository
  – SVD
  – Synthetic Gene Tool
  – kNNimpute

Page 47: SMD Data Quality Assessment and Repository Tools Tutorial

Ratios on Array Tool

• Accessible from the display data -> view data pages

• Ratios on array

Page 48: SMD Data Quality Assessment and Repository Tools Tutorial

Ratios on Array Tool

• Quick visualization of log-ratio distribution on the slide

• Color assignments are based on log-ratio values and also intensity

• Can visualize normalized or non-normalized log-ratios

• PLUS: ANOVA analysis to detect spatial bias (print-tip or plate)

Page 49: SMD Data Quality Assessment and Repository Tools Tutorial

Ratios on Array Tool

• Not normalized vs. normalized (loess intensity, print-tip)

Page 50: SMD Data Quality Assessment and Repository Tools Tutorial

Ratios on Array Tool

• One way ANOVA to test dependence of log-ratios on print-tip and printing plate

• F-statistic is given for the hypothesis: no bias in data

• In the example, normalization significantly improved print-tip bias
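
To make the test concrete, here is a small Python sketch of a one-way ANOVA F-statistic computed over log-ratios grouped by print-tip. The group names and numbers are invented, and the function is a textbook calculation rather than SMD's own code.

```python
from typing import Dict, List


def one_way_anova_f(groups: Dict[str, List[float]]) -> float:
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for values in groups.values() for v in values]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within  # fails if every group has zero spread


# Hypothetical log-ratios from three print-tips.
log_ratios_by_tip = {
    "tip_1": [1.0, 2.0, 3.0],
    "tip_2": [2.0, 3.0, 4.0],
    "tip_3": [5.0, 6.0, 7.0],
}
f_stat = one_way_anova_f(log_ratios_by_tip)
```

A large F means the group means differ more than the scatter within groups would explain, which is evidence against the "no bias" hypothesis. After a successful print-tip normalization the same calculation should give a much smaller value.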

Page 51: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO plots

• HEEBO/MEEBO quality assessment graphs from BioConductor package

• If you used doping controls on the slide, the graphs are automatically generated during experiment loading

• Accessible from
  – For single experiment: display data -> view data pages
  – For batch: from main page, under tools
• You can create new graphs or look at existing ones
• Help page: http://smd.stanford.edu/help/arrayQuality.shtml

[Screenshots: single experiment access; batch access]

Page 52: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO plots

• Can be used for a gpr file uploaded from desktop - print has to be present in SMD and oligo_ids in the id/name column

• In batch for a result set list on loader.stanford.edu

• If called for a specific experiment, the values are already filled in.

• Normalization options available from limma. Note: this normalization will NOT change data stored in SMD, only used for generating graphs

• Background subtraction methods - same story as normalization

• Job is placed in the job-queue - email is sent with link

Page 55: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO plots

Page 56: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: diagnostics

• MA-plots before and after normalization
  A = 1/2 * (log2(Cy5) + log2(Cy3))
  M = log2(Cy5 / Cy3)

• Loess lines are shown for sectors if print-tip normalization was selected

• Distribution should be centered around M=0, with no intensity dependence
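
The M and A definitions above translate directly into code. This is a small sketch with an illustrative function name; it assumes both channel intensities are positive.

```python
import math
from typing import Tuple


def ma_values(cy5: float, cy3: float) -> Tuple[float, float]:
    """Return (M, A) for one spot from its Cy5 and Cy3 intensities."""
    m = math.log2(cy5 / cy3)                        # log-ratio
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))     # average log-intensity
    return m, a


m, a = ma_values(8.0, 2.0)   # Cy5 is four times Cy3
```

For a spot with equal intensities in both channels M is 0, so a well-normalized array shows its cloud of points centered on the M = 0 line across the whole range of A.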

Page 57: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: diagnostics

• Distribution of ranked log-ratios (M-values) on slide, before and after normalization

• Spatial distribution of non-normalized A-values

Page 58: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: diagnostics

• Histograms of signal-to-noise ratios for Cy5 (upper) and Cy3 (lower) channels

• Histogram for all probes (probe) and curves for subgroups (doping, negative, positive controls and actual probes)

Page 59: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: diagnostics

• Box-plots for groups of reporters (colors same as on previous)

• A-values without background subtraction

• Normalized M-values for positive/negative controls (should be around 0 for type 1 experiment)

Page 60: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: doping controls

• Amount of doping control (DC) vs. observed fluorescence intensity

• Expected sigmoid curve

• Additional graphs for individual DCs

Page 61: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: doping controls

• non-normalized Cy5 vs. Cy3 signal intensity (log2 scale) (background corrected if selected)

• DCs with same ratio should fit line parallel to diagonal

• Log-ratio increases from top left to bottom right

Page 62: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: doping controls

• Observed vs. expected log-ratios (normalized and bg corrected) for each doping control group

• Ratios should be aligned on the diagonal

• Graphs for individual doping controls as well

Page 63: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: mismatch and tiling controls

• Mismatch and tiling probes are used for 2 tests:
  – Assess integrity of sample (degradation) - tiling probes
  – Degree of cross-hybridization - mismatch probes

• Mutations are anchored (at the extremities) or distributed (along transcript)

• Calculated binding energies vs. normalized (i.e. divided by median of corresponding wild type probes) raw intensities

Page 64: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: mismatch and tiling controls

• Percent mismatch vs. log2 intensity for anchored (blue) and distributed (green) probes

• Wild-type probe on left (red box-plot) and negative controls on the right (red box-plot)

• Right axis: fraction of all A-values

Page 65: SMD Data Quality Assessment and Repository Tools Tutorial

HEEBO/MEEBO: mismatch and tiling controls

• Tiling probes were designed along the transcript

• Non-normalized signal intensities (Cy5 and Cy3) vs. probe’s distance from 3’-end

• Quick drop in signal indicates problem in sample (degradation/ivt)

Page 66: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool

• Can be accessed directly from display data page or from view data page.

• It allows you to create graphs of any two data columns in linear or log space

• Can be applied for individual experiment or in batch for experiment set
• Interactive tool

[Figure: example graphs (scatter plots or histograms)]

Page 67: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool

In batch mode (for experiment set) it can be configured to work on a subset of the experiments in the set.

Page 68: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool

• Can create scatter plots or histograms
• Can transform data to log space
• Wide selection of data columns to choose from
• Combine with data filter to look at distribution of subset of the data

Page 71: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool: Filter selection

• Data filters should be customized for the data retrieved.
• Graphing tool helps in filter selection and finding a cut-off value

Page 72: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool: Filter selection

• Plot filter field (here regression correlation) against test field (log ratio).

• Log ratios should center around 0.

• Here, the log ratios appear to diverge below a regression correlation of about 0.4 - 0.6.

Page 73: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool: Filter selection

• Foreground / Background (log scale) plotted against log-ratio

• Data should center around a log ratio of zero

• Impose cutoff at 2.0 (linear, ~0.3 log10) to eliminate “flare” at low relative intensity.
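
As a sketch of what such a cutoff does, the Python below keeps only spots whose foreground/background ratio is at least 2.0. The spot records, field names, and the decision to drop spots with a non-positive background are assumptions for illustration.

```python
import math


def passes_fg_bg_cutoff(foreground: float, background: float, cutoff: float = 2.0) -> bool:
    """True when foreground/background is at least `cutoff` on the linear scale."""
    if background <= 0:
        return False  # ratio undefined; this sketch simply drops such spots
    return foreground / background >= cutoff


# Hypothetical spots with foreground (fg) and background (bg) intensities.
spots = [
    {"id": "spot_1", "fg": 1200.0, "bg": 300.0},  # ratio 4.0
    {"id": "spot_2", "fg": 450.0, "bg": 300.0},   # ratio 1.5
    {"id": "spot_3", "fg": 600.0, "bg": 300.0},   # ratio 2.0
]
kept = [s["id"] for s in spots if passes_fg_bg_cutoff(s["fg"], s["bg"])]

log10_cutoff = math.log10(2.0)  # the same cutoff on a log10 axis, about 0.30
```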

Page 74: SMD Data Quality Assessment and Repository Tools Tutorial

Graphing Tool: Filter selection

• As intensity decreases, the log(ratio) tends to scatter

• Spots with low intensities might seem falsely significant

• A cut-off value of ~250 (2**8) is suggested for Ch2 normalized net

Page 75: SMD Data Quality Assessment and Repository Tools Tutorial

Q-score Tool

• Tool to use for filter and cut-off value selection

• Currently usable for cDNA slides (uses UNIGENE clusterid), for human and mouse (will be extended to HEEBO/MEEBO arrays)

• Still in experimental stage
• Simple idea: pool reporters that belong to same gene, calculate their spread and combine values for each gene into score for whole array => Q-score

• Filtering that removes bad quality spots should decrease spread of measurements for genes, hence improve (decrease) Q-score
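
The slides do not give the exact formula, so the following Python is only one possible reading of the idea. It takes the spread to be the standard deviation of log-ratios among reporters of the same gene and the array score to be the mean of those spreads. The function name and the UniGene-style ids are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple


def q_score(reporters: List[Tuple[str, float]]) -> Optional[float]:
    """Mean within-gene spread of log-ratios; lower means more consistent data."""
    by_gene: Dict[str, List[float]] = defaultdict(list)
    for gene_id, log_ratio in reporters:
        by_gene[gene_id].append(log_ratio)
    # Only genes with at least two reporters have a measurable spread.
    spreads = [pstdev(values) for values in by_gene.values() if len(values) >= 2]
    return mean(spreads) if spreads else None


# Hypothetical array: (gene id, log-ratio) for each reporter that passed the filter.
unfiltered = [("Hs.1", 1.0), ("Hs.1", 1.0), ("Hs.2", 0.0), ("Hs.2", 2.0)]
score = q_score(unfiltered)
```

Running the same function on the reporters that survive each filter setting would give the kind of curve described for the tool's output: if the filter removes poor spots, the score should fall as the filter becomes more stringent.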

Page 76: SMD Data Quality Assessment and Repository Tools Tutorial

Q-score Tool

• Works in batch for a group of slides (from same print) in a result set list

• Requires a genelist that specifies which reporters to use. Common genelists for human and mouse clusterids

• Filters and their ranges need to be defined.

• Log-ratio mean/median is used to calculate Q-score

• Run from the job-queue, email is sent to user with the link

Page 77: SMD Data Quality Assessment and Repository Tools Tutorial

Q-score Tool

• Output is a set of graphs showing the fraction of reporters not filtered out and the corresponding Q-scores at increasingly stringent filter values

• Cut-off values (if found) saved in a new result set list for data retrieval

Page 78: SMD Data Quality Assessment and Repository Tools Tutorial

SMD: Office Hours

• Grant S201
• Mondays 3 - 5 pm
• Wednesdays 2 - 4 pm

Page 79: SMD Data Quality Assessment and Repository Tools Tutorial

SMD Staff

Gavin Sherlock, Co-Investigator
Catherine Ball, Director
Janos Demeter, Computational Biologist
Catherine Beauheim, Scientific Programmer
Heng Jin, Scientific Programmer
Patrick Brown, Co-Investigator
Farrell Wymore, Lead Programmer
Michael Nitzberg, Database Administrator
Zac Zachariah, Systems Administrator
Don Maier, Senior Software Engineer
Takashi Kido, Visiting Scholar