Expression Profile Clustering
-
Expression Profile Clustering
Expression Profile = the pattern of signal values for one gene over several chips.
Expression Profile Clustering = the clustering of similar profiles.
Why? Similar expression profiles suggest:
  regulation (by shared factor or cluster member)
  related function
ALSO: opposite expression profiles suggest regulation (e.g. inhibition)
Software:
  EpClust
  J-Express
  Etc.
[Charts: example expression profiles, plotted as signal value (y-axis) against chips/experiments (x-axis); the first chart shows Gene1.]
-
Guided Tour of EpClust
-
Data file format
Can enter in a variety of formats.
See EpClust's data upload page.
One simple way:
  Download tab-delimited data from NASC
  Open with (or paste into) Excel
  Delete all but one name column and the signal value columns. Do not include any hyphens! (I try to avoid all punctuation)
  Save as tab-delimited text
Annotation can be added as either:
  A second file with the same first column of names
  A single second column in the signal file (if specified)
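As a rough illustration of the file layout described above, the following Python sketch reads tab-delimited text with one name column followed by signal value columns, and strips punctuation (including hyphens) from the gene names. The header row, the gene names and the helper names `clean_name` and `read_profiles` are assumptions made for the example; check EpClust's data upload page for the exact layout it expects.

```python
import csv
import io
import re


def clean_name(name):
    """Drop every character that is not a letter, digit or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "", name)


def read_profiles(text):
    """Parse tab-delimited text: a header row, then one gene per row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    chips = header[1:]  # first column holds the gene names
    profiles = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        profiles[clean_name(row[0])] = [float(v) for v in row[1:]]
    return chips, profiles


# Hypothetical example data (made-up names and values).
example = (
    "gene\tchip1\tchip2\tchip3\n"
    "At1g-01010\t1.0\t2.0\t1.0\n"
    "At1g-01020\t4.0\t5.0\t4.0\n"
)
chips, profiles = read_profiles(example)
print(chips)
print(profiles)
```

Each gene ends up as one list of signal values, which is the "expression profile" used in the rest of the tour.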
-
To Upload Data
-
To select uploaded data
-
Select Specific Experiments within Input File
-
To Remove Unreliable Data
-
To Choose Algorithm Type
Hierarchical
  Measures distance between each profile (i.e. gene) and every other profile
  So time increases roughly with the square of the number of genes
  Then clusters the closest genes together, followed by increasingly distant ones
  Into a tree of clusters within clusters
  Can be visualised and finally split at a chosen distance with knowledge of the tree
K-means
  Initially choose a set number (K) of clusters
  1) Chooses the K most different profiles
  2) Clusters each remaining profile with one of the K
  So, much less time than Hierarchical (if many genes)
  Cluster size can vary, so can return too many or too few genes, with no way to select.
  Then must repeat with different K values.
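The K-means steps above can be sketched in a few lines of Python. This is only an illustration, not EpClust's implementation: "the K most different profiles" is interpreted here as farthest-first seeding, the loop that recomputes cluster centres is the textbook K-means update, and the profile values are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_seeds(profiles, k):
    """Farthest-first seeding (an assumption): start with the first profile,
    then keep adding the profile that is farthest from all seeds so far."""
    seeds = [profiles[0]]
    while len(seeds) < k:
        seeds.append(max(profiles, key=lambda p: min(euclidean(p, s) for s in seeds)))
    return seeds


def kmeans(profiles, k, rounds=10):
    """Return one cluster label (0..k-1) per profile."""
    centres = pick_seeds(profiles, k)
    labels = []
    for _ in range(rounds):
        # Assign every profile to its nearest centre.
        labels = [min(range(k), key=lambda i: euclidean(p, centres[i])) for p in profiles]
        # Move each centre to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == i]
            if members:
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up profiles: two genes with a low, variable signal and two with a high, flat one.
profiles = [
    [1.0, 2.0, 1.0, 3.0, 0.0],
    [1.1, 2.1, 0.9, 3.0, 0.2],
    [5.0, 5.0, 6.0, 5.0, 5.0],
    [5.2, 4.9, 6.1, 5.0, 5.1],
]
labels = kmeans(profiles, 2)
print(labels)

# Cluster sizes depend on K, so in practice the run is repeated with several K values.
for k in (2, 3):
    print(k, kmeans(profiles, k))
```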
-
So Which is Better?
K-means is better able to manage large data sets.
Hierarchical seems a more objective approach
  In that you don't need to decide cluster number at start
But which is more biologically informative?
  Opinion divided.
  Both artificial.
  Don't prove, just suggest.
  K-means gives more consistent results.
...and then there's SOTA too!
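To illustrate the point that hierarchical clustering needs no cluster number up front, here is a minimal Python sketch. It merges the closest clusters step by step and stops once the closest remaining pair is farther apart than a chosen cut distance, which gives the same groups as building the whole tree and splitting it at that height. Average linkage, Euclidean distance and the profile values are assumptions for the example; EpClust offers its own choices.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean distance between every member of c1 and every member of c2."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical(profiles, cut):
    """Return clusters (lists of profile indices) for a tree cut at distance `cut`."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        # Find the closest pair of clusters: this compares every pair each time.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], profiles),
        )
        if average_linkage(clusters[a], clusters[b], profiles) > cut:
            break  # everything left is farther apart than the cut
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters


# Made-up profiles, as above: two low/variable genes and two high/flat genes.
profiles = [
    [1.0, 2.0, 1.0, 3.0, 0.0],
    [1.1, 2.1, 0.9, 3.0, 0.2],
    [5.0, 5.0, 6.0, 5.0, 5.0],
    [5.2, 4.9, 6.1, 5.0, 5.1],
]
tight = hierarchical(profiles, cut=1.0)
loose = hierarchical(profiles, cut=20.0)
print(tight)
print(loose)
```

The number of clusters falls out of the cut distance rather than being fixed in advance; the cost is that every pair of clusters is compared, which is what makes the method slow for many genes.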
-
Alternative Distance Measurements
-
Pearson-Based Distance Measurement
The most commonly used
How similar the SHAPES of the two profiles are
Based on average of values and the standard deviation
Rates from identical (1) to completely uncorrelated (0) to perfect opposites (-1)
Centred Test?
If profiles have identical shape, but are offset from each other by a fixed value (or magnitude)
  Centred: Identical (1)
  Uncentred: Not (
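The difference between the centred and uncentred measures can be checked with a short Python sketch. The formulas are the standard Pearson correlation and its uncentred variant (which skips subtracting the mean); the three profiles are illustrative values, chosen so that one is the first shifted up by 3 and another is its mirror image.

```python
import math


def centred_correlation(a, b):
    """Standard Pearson correlation: subtract each profile's mean first."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    return sum(x * y for x, y in zip(da, db)) / math.sqrt(
        sum(x * x for x in da) * sum(y * y for y in db)
    )


def uncentred_correlation(a, b):
    """Same formula without subtracting the means."""
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(
        sum(x * x for x in a) * sum(y * y for y in b)
    )


gene_a = [1, 2, 1, 3, 0]
gene_b = [4, 5, 4, 6, 3]  # same shape as gene_a, offset by +3
gene_c = [2, 1, 2, 0, 3]  # mirror image of gene_a

print(centred_correlation(gene_a, gene_b))    # identical shape: 1 (up to rounding)
print(uncentred_correlation(gene_a, gene_b))  # the offset pulls this below 1
print(centred_correlation(gene_a, gene_c))    # perfect opposites: -1 (up to rounding)
```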
-
Euclidean-Based Distance Measurement
Measures distance between gene expression levels directly
Based on magnitude of changes
More about signal VALUES. Less about profile shapes.
Data must be suitably normalized, e.g. use log-ratios of signals
Euclidean: shortest path between points
Manhattan: the sum of distances along each dimension
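A small Python sketch of the two distances and of the log-ratio normalization mentioned above. The points are made up, and taking the log-ratio against a reference signal (for example a control chip) is an assumption about how the ratios would be formed.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of the absolute differences along each dimension (chip)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def log_ratios(signals, reference):
    """log2 of signal / reference, chip by chip (the reference is an assumption)."""
    return [math.log2(s / r) for s, r in zip(signals, reference)]


# Two genes measured on two chips (made-up values).
gene_a = [1.0, 2.0]
gene_b = [4.0, 6.0]

print(euclidean_distance(gene_a, gene_b))   # 5.0
print(manhattan_distance(gene_a, gene_b))   # 7.0
print(log_ratios([8.0, 2.0], [2.0, 2.0]))   # [2.0, 0.0]
```

On the same pair of points the Manhattan distance is never smaller than the Euclidean one, since it follows each axis in turn instead of the direct path.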
[Charts: example points plotted on chip1 signal, chip2 signal and chip3 signal axes.]
-
To Choose Algorithm Type
-
RESULTS
-
RESULTS_MyGene
-
_MyGene
-
Extra Options
-
Extra Options
  Phylip Format
  Search for Promoter Motifs
  Format Results Picture
  Export Lists as Text
  Statistics
Export clusters as:
  text (lists of gene names), tree cut at chosen height.
  Or formatted for Phylip (tree drawing software)
Visualise promoter region next to each profile
Highlight motifs
  You must tell it the sequences of motifs though.
Must prepare file of upstream sequences
  Currently must arrange with EpClust staff to upload
  Can get from querying a database such as PlantCARE using the upstream sequence of our particular gene of most interest
  Can get from TAIR using a list of gene names
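To give a feel for what highlighting a supplied motif in an upstream sequence involves, here is a minimal Python sketch. It is not EpClust's code: the function name `highlight_motifs`, the sequence and the motif are made up for illustration, and only exact matches on the given strand are marked.

```python
def highlight_motifs(sequence, motifs):
    """Return the sequence in lower case with every motif occurrence in upper case."""
    seq = sequence.lower()
    marked = list(seq)
    for motif in motifs:
        m = motif.lower()
        start = seq.find(m)
        while start != -1:
            for i in range(start, start + len(m)):
                marked[i] = marked[i].upper()
            start = seq.find(m, start + 1)  # also catches overlapping matches
    return "".join(marked)


# Made-up upstream sequence and motif, for illustration only.
upstream = "ttgacgtgtcaaacgtgg"
highlighted = highlight_motifs(upstream, ["ACGTG"])
print(highlighted)  # ttgACGTGtcaaACGTGg
```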
-
PlantCARE
Cis-Acting Regulatory Elements
http://intra.psb.ugent.be:8080/PlantCARE/
Database of CAREs
Tools
  Search for CARE
    Enter upstream sequence of a gene
    Lists and then highlights known motifs from database
  Motif Sampler
    Enter upstream sequences of your cluster
    Highlights 8mers conserved more in cluster members than in genome
  Other tools
    Clustering
    Query for info on motifs
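The idea of 8mers that are conserved more in cluster members than in the genome can be sketched as a naive count in Python. This is not the method the Motif Sampler itself uses; it only compares the fraction of cluster sequences containing each 8-mer with the fraction of background sequences containing it. The function names, the `min_ratio` threshold and all sequences are assumptions made for the example.

```python
from collections import Counter


def kmer_presence(sequences, k=8):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(cluster_seqs, background_seqs, k=8, min_ratio=2.0):
    """List k-mers found in more than one cluster sequence and at least
    min_ratio times more often (as a fraction) than in the background."""
    cluster = kmer_presence(cluster_seqs, k)
    background = kmer_presence(background_seqs, k)
    hits = []
    for kmer, n in cluster.items():
        cluster_frac = n / len(cluster_seqs)
        background_frac = background[kmer] / len(background_seqs)
        if n > 1 and cluster_frac >= min_ratio * background_frac:
            hits.append((kmer, cluster_frac, background_frac))
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Made-up upstream sequences: the cluster members share the 8-mer ACGTGTCA.
cluster_seqs = ["AAACGTGTCATT", "GGACGTGTCAGG", "CCACGTGTCACC"]
background_seqs = ["TTTTTTTTTTTT", "ACGTGTCAAAAA", "GGGGGGGGGGGG", "CCCCCCCCCCCC"]
hits = enriched_kmers(cluster_seqs, background_seqs)
print(hits)
```

A real tool would also need to allow for mismatches and for the reverse strand, and to test whether the difference is statistically meaningful.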
-
Gene Ontology
A hierarchical structure to describe gene function.
As PlantCARE compares:
  Expression Profile Clusters to Promoter Motif Conservation
There are also tools to compare:
  Expression Profile Clusters to Gene Ontology