Data Mining
ANALYSIS OF A SINGLE DATASET
• Visualization Techniques
• Combined Graph
• Charts and Pies
• Search for specific functions
Data Mining on the DAG
• When working with large datasets, annotation results need to be summarized
• The DAG provides visualization of annotation data within its biological context
• In Blast2GO: the Combined Graph function
Combined Graph
Each term has a number of associated sequences
Nodes can be coloured to indicate relevance
Each term is displayed within its biological context
Node shape differentiates between direct and indirect annotation
Combined Graph
Different GO branches
Reduces nodes by number of annotated sequences
Criterion for highlighting and filtering nodes
Node data to be displayed
Let's draw the DAG of the dataset analyzed yesterday (1000 sequences)
Too many nodes!!!
Combined Graph
We need a way to find the relevant information
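Accumulated by node (Sequence Count)
[Figure: example DAG labelled with accumulated sequence counts: 1 and 3 at the two lowest terms, 4 at their parent, 5 at the top node]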
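A minimal sketch of how an accumulated sequence count could be computed, assuming a toy DAG that matches the figure. The term names, the `parents` map and the sequence IDs are placeholders, not Blast2GO data structures. Sequences are collected as sets so that a sequence reaching a node along two paths is counted once.

```python
from collections import defaultdict

# Hypothetical toy DAG: child term -> list of parent terms.
parents = {"GO1": ["GO3"], "GO2": ["GO3"], "GO3": ["GO4"], "GO4": []}

# Hypothetical direct annotations: term -> IDs of sequences annotated to it.
direct = {
    "GO1": {"seq1"},
    "GO2": {"seq2", "seq3", "seq4"},
    "GO4": {"seq5"},
}

def accumulated_sequences(parents, direct):
    """Propagate every direct annotation to the term and all its ancestors."""
    acc = defaultdict(set)
    for term, seqs in direct.items():
        stack, seen = [term], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            acc[node] |= seqs          # the node inherits these sequences
            stack.extend(parents.get(node, []))
    return acc

counts = {t: len(s) for t, s in accumulated_sequences(parents, direct).items()}
print(counts)  # GO1: 1, GO2: 3, GO3: 4, GO4: 5
```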
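Incoming information (Node Score)
[Figure: example DAG labelled with node scores: 1 and 3 at the two lowest terms, 2.4 at their parent, 2.5 at the top node]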
Node Information Content
Node score
We compute a node score that reflects the amount of direct information at the node
[Figure: example DAG labelled with node scores]
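Node score
$\alpha = 0.6$
[Figure: example DAG with direct sequence counts GO1 = 1, GO2 = 3, GO4 = 1; GO3 is the parent of GO1 and GO2, and GO4 is the parent of GO3]
Each node sums the sequences annotated directly to itself and to its descendants, each weighted by $\alpha$ raised to the distance from the node:
$\mathrm{NodeScore}(\mathrm{GO1}) = 1 \cdot 0.6^{0} = 1$ (dist = 0)
$\mathrm{NodeScore}(\mathrm{GO2}) = 3 \cdot 0.6^{0} = 3$ (dist = 0)
$\mathrm{NodeScore}(\mathrm{GO3}) = 1 \cdot 0.6^{1} + 3 \cdot 0.6^{1} = 0.6 + 1.8 = 2.4$ (dist = 1 to GO1 and GO2)
$\mathrm{NodeScore}(\mathrm{GO4}) = 1 \cdot 0.6^{2} + 3 \cdot 0.6^{2} + 1 \cdot 0.6^{0} = 0.36 + 1.08 + 1 = 2.44$ (dist = 2 to GO1 and GO2, dist = 0 to itself; labelled 2.5 in the figure)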
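The worked example can be reproduced with a short Python sketch. It assumes the toy DAG implied by the distances in the example and, as a further assumption, uses the shortest path when a term can be reached by several routes; the slides do not say how Blast2GO resolves that case. All names are illustrative.

```python
from collections import defaultdict, deque

ALPHA = 0.6  # distance weight used on the slide

# Hypothetical toy DAG: child term -> list of parent terms.
parents = {"GO1": ["GO3"], "GO2": ["GO3"], "GO3": ["GO4"], "GO4": []}
# Number of sequences annotated directly to each term.
direct_counts = {"GO1": 1, "GO2": 3, "GO3": 0, "GO4": 1}

def node_scores(parents, direct_counts, alpha=ALPHA):
    """Score each node by its own and its descendants' direct sequences,
    each contribution damped by alpha ** distance."""
    scores = defaultdict(float)
    for term, n_seqs in direct_counts.items():
        if n_seqs == 0:
            continue
        # Breadth-first walk upwards gives the shortest distance to each ancestor.
        dist = {term: 0}
        queue = deque([term])
        while queue:
            node = queue.popleft()
            scores[node] += n_seqs * alpha ** dist[node]
            for parent in parents.get(node, []):
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    queue.append(parent)
    return dict(scores)

scores = node_scores(parents, direct_counts)
for term in sorted(scores):
    print(term, round(scores[term], 2))
```

Unlike the plain sequence count, the score of a node drops quickly when its sequences sit several levels below it, which is what lets very general terms be filtered out.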
Node score vs Annotation score
[Figure: the node-score example DAG shown next to an annotation-score example for a single sequence with Blast hits hit1, hit2 and hit3, whose candidate GO terms carry scores between 50 and 60]
Annotation Score:
- In annotation context
- Relates to Blast results of ONE sequence
Node Score:
- In data-mining context
- Relates to analysis of a GROUP of sequences
DO NOT MIX THEM UP!
AS = max{%sim * ECw} + (#TPR_GOs - 1) * GOw
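To make the formula concrete, here is a small sketch that simply evaluates it. Every value is hypothetical: the similarities, the evidence-code weights and the GO weight are made up for illustration, and `n_gos` stands in for the slide's `#TPR_GOs`, whose exact definition the slide does not give.

```python
def annotation_score(hits, n_gos, go_weight):
    """Evaluate AS = max(%sim * ECw) + (#GOs - 1) * GOw for one candidate term.

    hits      : list of (percent_similarity, evidence_code_weight) pairs,
                one per Blast hit supporting the term
    n_gos     : stand-in for #TPR_GOs on the slide (assumed meaning)
    go_weight : GOw
    """
    first_term = max(sim * ec_w for sim, ec_w in hits)
    second_term = (n_gos - 1) * go_weight
    return first_term + second_term

# Hypothetical hits for one sequence and one candidate GO term.
hits = [(80.0, 0.7), (65.0, 1.0), (90.0, 0.5)]
score = annotation_score(hits, n_gos=2, go_weight=5)
print(score)  # 65 * 1.0 + (2 - 1) * 5
```

The score is computed per sequence and per candidate term, which is why it belongs to the annotation step and not to the analysis of a group of sequences.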
Filtered Graph
Direct annotations
Transition nodes
# Filtered Nodes
Compacting Graphs by GOSlim
Show node content
Saving Options
Save as picture and as txt
Graph Charts
• Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)
• Sequence Distribution/GO as Bar-Chart
• Sequence Distribution/GO as Level-Pie (level selection)
Multilevel vs. GO-Slim Chart
GO-Slim: Handy to summarize functional content
Multi-level Pie with a sequence filter of 20
Use DAG to analyze a function
How many sequences are annotated to the function “photosynthesis”?
Option 1: Find in the GO graph → direct & indirect annotation
Option 2: Find through the Select function. Two sub-options:
Option 2.1: Direct annotation (use GO ID or description)
Option 2.2: Direct & indirect (use GO ID and “include GO parents”)
The DAG can be used to query general concepts that have no direct annotations
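The difference between a direct query and a direct-plus-indirect query can be sketched as follows. The data are a toy example: GO:0015979 is photosynthesis, while the child term `GO:child_term`, the parent links and the sequence names are hypothetical. The function names are illustrative and are not Blast2GO calls.

```python
# Hypothetical toy DAG: child term -> list of parent terms.
parents = {
    "GO:child_term": ["GO:0015979"],  # a more specific photosynthesis term
    "GO:0015979": [],                 # photosynthesis
    "GO:0006412": [],                 # unrelated term
}

# Hypothetical annotations: sequence -> directly annotated GO IDs.
annotations = {
    "Sequence_1": {"GO:0015979"},
    "Sequence_2": {"GO:child_term"},
    "Sequence_3": {"GO:0006412"},
}

def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def sequences_for(term, annotations, parents, include_indirect=False):
    """Sequences annotated to `term`, optionally also via a descendant term."""
    found = set()
    for seq, terms in annotations.items():
        if term in terms:
            found.add(seq)
        elif include_indirect and any(term in ancestors(t, parents) for t in terms):
            found.add(seq)
    return found

direct_only = sequences_for("GO:0015979", annotations, parents)
direct_and_indirect = sequences_for(
    "GO:0015979", annotations, parents, include_indirect=True
)
print(sorted(direct_only), sorted(direct_and_indirect))
```

A general term with no direct annotations returns an empty set in the first query and can still return many sequences in the second.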
Example: analyze a specific function
• Find a function on the graph (search, export)
• Select all sequences annotated to this function and its descendants
• Locate these sequences
• Explore the annotation diversity of a given function within the graph
By exporting the sequence table, you can see all sequences annotated to a given function (GO term)
Conclusions
• DAGs are interesting for browsing functional annotation but can be too large
• With filtering and pruning options you can create more navigable DAGs
• Pies are good for compacting information: try out different levels
• GO-Slim compacts to more equivalent terms than filtering the GO
• You can use the DAG to query on general terms
Data Mining
ANALYSIS OF SEVERAL DATASETS
• Functional Enrichment
• Enriched Graphs
• Meta-analysis
                One gene list (A)   The other list (B)
Biosynthesis    54%                 18%
Sporulation     18%                 27%
Are these two groups of genes carrying out different biological roles?
Enrichment Analysis
Are these differences statistically significant?
Interpretation of a large list of genes: which are relevant functions?
Fisher's Exact Test
Contingency table
                   A   B
Biosynthesis       6   2
No biosynthesis    5   9
p-value for biosynthesis < 0.05 (illustrative)
                   A   B
Sporulation        2   3
No sporulation     9   8
p-value for sporulation > 0.05
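A minimal sketch of a two-tailed Fisher's exact test on these tables, using only the Python standard library. The function name and the cell layout (rows: with and without the function; columns: list A and list B) are assumptions made for illustration; the slides do not show how Blast2GO implements the test.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell
        # when all row and column totals are fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is at most as likely as the observed one
    # (the small tolerance guards against floating-point ties).
    p = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p)

p_biosynthesis = fisher_exact_two_sided(6, 2, 5, 9)
p_sporulation = fisher_exact_two_sided(2, 3, 9, 8)
print(round(p_biosynthesis, 3), round(p_sporulation, 3))
```

With only 11 genes per list the exact test needs a large difference to reach significance: these counts give p ≈ 0.18 for biosynthesis and p = 1.0 for sporulation, so the slide's "< 0.05" is best read as an illustration of the decision rule and not as the value computed from this table.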
Multiple testing correction
We do this for every GO term in our dataset!
Many tests => many false positives => we need a correction!
FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses.
FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests. FWER control is more conservative than FDR control.
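The slides do not name the specific procedures. As an illustration only, the sketch below uses two common choices: Bonferroni for FWER control and Benjamini-Hochberg for FDR control. The p-values are hypothetical.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: adjusted p-values from the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is the minimum of p * m / rank over its own and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-GO-term p-values
fwer = bonferroni(raw)
fdr = benjamini_hochberg(raw)
print([round(p, 3) for p in fwer])
print([round(p, 3) for p in fdr])
```

In this toy run all four terms stay below 0.05 after the FDR adjustment, while only two survive the FWER adjustment, which shows in miniature why FWER control is the more conservative option.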
         A   B
GO       2   3
No GO    9   8
One list is the Test-set, the other the Ref-set
Fisher’s Exact Test in Blast2GO
Three files:
• Blast2GO project with annotations (.dat/.annot)
• One txt file with IDs: Test-set (.txt)
• Another txt file with IDs: Ref-set (.txt)
Different types of comparisons
● Compare one condition against another
- Remove Common IDs
- Test and Ref-Set are interchangeable
[Figure: Set 1 and Set 2 overlapping in their Common IDs]
● Compare a subset against the total
- Gossip default setting
- Test and Ref-Set are NOT interchangeable
[Figure: Test-Set and Ref-Set with their Common IDs, drawn for two different layouts]
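The ID handling for the two comparison types can be sketched with plain set operations. This reflects one reading of the slide, namely that common IDs are dropped from both lists when two conditions are compared; the function names and IDs are hypothetical and do not correspond to Blast2GO options.

```python
def remove_common_ids(set_1, set_2):
    """Condition vs. condition: drop the IDs shared by both lists."""
    common = set_1 & set_2
    return set_1 - common, set_2 - common

def is_subset_comparison(test_set, ref_set):
    """Subset vs. total: every test ID also appears in the reference."""
    return test_set <= ref_set

# Hypothetical ID lists, as they might be read from the two txt files.
set_1 = {"seq1", "seq2", "seq3"}
set_2 = {"seq3", "seq4"}

clean_1, clean_2 = remove_common_ids(set_1, set_2)
print(sorted(clean_1), sorted(clean_2))
print(is_subset_comparison({"seq1"}, set_1))
```

After the shared IDs are removed the two lists are disjoint, so swapping them only swaps the direction of the result. In the subset case the roles are fixed, because the test list is contained in the reference.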
FET in Blast2GO
● The two-tailed test identifies not only over-represented but also under-represented functions.
● If no Ref-Set is chosen, all annotations are used as reference
Enrichment Results
● Result table with link outs to sequence lists
Most specific terms
Retains only the lowest, most specific enriched term per GO branch
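One way to express this filter is to drop every enriched term that has an enriched descendant. The sketch below assumes a toy DAG with placeholder term names; it illustrates the idea and is not Blast2GO's implementation.

```python
# Hypothetical toy DAG: child term -> list of parent terms.
parents = {
    "term_c": ["term_b"],
    "term_b": ["term_a"],
    "term_d": ["term_a"],
    "term_a": [],
}

def most_specific(enriched, parents):
    """Keep only enriched terms that have no enriched descendant."""
    has_enriched_descendant = set()
    for term in enriched:
        # Every enriched ancestor of `term` is less specific than `term`.
        stack, seen = list(parents.get(term, [])), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in enriched:
                has_enriched_descendant.add(node)
            stack.extend(parents.get(node, []))
    return {t for t in enriched if t not in has_enriched_descendant}

single_branch = most_specific({"term_a", "term_b", "term_c"}, parents)
two_branches = most_specific({"term_a", "term_b", "term_d"}, parents)
print(sorted(single_branch), sorted(two_branches))
```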
Enriched Graph
View enriched terms data as DAG graphs!
=> Filter to reduce the graph; to draw all nodes, set the filter to 1
Bar-Chart
● Export enriched terms as chart!
=> Filter results
% of sequences in Ref group
% of sequences in Test group
If Test > Ref = over-represented
If Ref > Test = under-represented
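The bar-chart rule can be sketched as a comparison of two percentages. The counts reuse the biosynthesis and sporulation example, with list A taken as the Test group and list B as the Ref group, which is an assumption; the function name is illustrative.

```python
def representation(test_hits, test_size, ref_hits, ref_size):
    """Compare the share of sequences carrying a term in Test and Ref."""
    test_pct = 100.0 * test_hits / test_size
    ref_pct = 100.0 * ref_hits / ref_size
    if test_pct > ref_pct:
        label = "over-represented"
    elif ref_pct > test_pct:
        label = "under-represented"
    else:
        label = "equal"
    return test_pct, ref_pct, label

biosynthesis = representation(6, 11, 2, 11)
sporulation = representation(2, 11, 3, 11)
print(round(biosynthesis[0]), round(biosynthesis[1]), biosynthesis[2])
print(round(sporulation[0]), round(sporulation[1]), sporulation[2])
```

The label only gives the direction of the difference; whether it is significant still depends on the corrected p-value of the test.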
Meta-analysis in Blast2GO
Annotation Result (.annot)
Sequence_1 GO:0005792
Sequence_1 GO:0006412
Sequence_1 GO:0003735
Sequence_2 GO:0016705
Sequence_2 GO:0005840
Sequence_2 GO:0005506
⇔ Equivalent formats
Enrichment Result
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735
By joining different functional enrichment results we can create an annotation file of conditions that captures their functional profile
Enrichment Result (.annot)
Treatment_1 GO:0005792
Treatment_1 GO:0006412
Treatment_1 GO:0003735
Treatment_2 GO:0016705
Treatment_2 GO:0005840
Treatment_2 GO:0005506
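The joining step can be sketched as writing one line per condition and enriched term. The tab separator and the function name are assumptions; the treatment names and GO IDs are the ones shown on the slide.

```python
def enrichment_to_annot(enriched_by_condition):
    """Turn {condition: [GO IDs]} into .annot-style text, one pair per line."""
    lines = []
    for condition, go_ids in enriched_by_condition.items():
        for go_id in go_ids:
            # Assumed layout: name and GO ID separated by a tab.
            lines.append(f"{condition}\t{go_id}")
    return "\n".join(lines) + "\n"

enriched_by_condition = {
    "Treatment_1": ["GO:0005792", "GO:0006412", "GO:0003735"],
    "Treatment_2": ["GO:0016705", "GO:0005840", "GO:0005506"],
}

annot_text = enrichment_to_annot(enriched_by_condition)
print(annot_text, end="")
```

Because each treatment now plays the role of a sequence, the resulting file can be explored with the same graph and chart views used for a normal annotation file.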
Meta-analysis in Blast2GO
Use seq names to see treatments
Use color by SeqCount
FIND SIMILARITIES BETWEEN TREATMENTS
Meta-analysis in Blast2GO
DISPLAY FUNCTIONAL DISSIMILARITIES ON DAG
Use second column number for color
Exercises: Data Mining