syn presentation(6!05!10)1
-
8/8/2019 Syn Presentation(6!05!10)1
Efficient Clustering Approaches for
Organizing Document Collection
School of Computer & System Sciences
Jawaharlal Nehru University
New Delhi-110067
Dr. Aditi Sharan, Assistant Professor
Sonia, PhD Scholar
-
Table of Contents
Information Retrieval
Efficient Retrieval System
Document Clustering
Clustering Algorithm
Feature Selection
Dimensionality Reduction
Subspace Clustering
Subspace Creation
Research Proposal
Objective
-
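IR System
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). In the web search system, a spider collects the corpus from the Web.]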
-
IR System
A perfect IR system always retrieves all relevant documents without retrieving any non-relevant document.
In reality, systems retrieve relevant as well as non-relevant documents.
To measure the effectiveness of retrieval, two ratios are used: precision and recall.
The goal is to present documents according to the user's need.
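The two ratios can be sketched in a few lines. This is a minimal illustration using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the function name and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["Doc1", "Doc2", "Doc3", "Doc4"]
relevant = ["Doc1", "Doc3", "Doc7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

A perfect IR system in the sense above would score 1.0 on both ratios.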
-
Document Clustering
Automatically partition documents into clusters based on content
Documents within each cluster should be similar
Documents in different clusters should be different
Discover categories in an unsupervised manner
No sample category labels provided by humans
It is a common and important task that finds many applications in IR and other places
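"Similar" needs a concrete measure before any clustering can run. The sketch below assumes cosine similarity over raw term counts, which is one common choice and not necessarily the one used in this work; the three sample texts are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up documents: d1 and d2 share a topic, d3 only shares one word.
d1 = "star cluster in the night sky"
d2 = "bright star in the sky"
d3 = "movie star wins film award"
sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
print(round(sim_12, 3), round(sim_13, 3))
```

Under this measure d1 and d2 would fall in the same cluster well before d1 and d3 would.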
-
Example Star
-
Why cluster documents?
Whole corpus analysis/navigation: better user interface
For improving recall in search applications: better search results
For better navigation of search results: effective user recall will be higher
For speeding up vector space retrieval: faster search
-
Challenging Task
What are the challenges of Web data?
Why is it difficult to cluster Web data?
Structure-based problems: unstructured, heterogeneous, distributed, language dependent
Information-based problems: larger repository, unlabelled, dynamic, duplication, interconnected (hyperlinks)
User-based problems: insufficient query, heterogeneous users, dynamic requirements, behavioral changes
-
Why is it difficult to cluster Web data?
Data is heterogeneous
High dimensionality of data
No good definition of similarity itself
Pre-clustering of data
Therefore traditional clustering algorithms have to be modified, or new algorithms should be developed, to cluster Web data
-
Clustering Algorithms
Partitioning: K-means, Buckshot, Fractionation
Hierarchical: Bottom-up clustering, Top-down clustering
Geometric Embedding: Self-organizing maps (SOM), Multidimensional scaling (MDS), Latent Semantic Indexing (LSI)
Generative Models & Probabilistic: Generative Distributions for Documents, Expectation Maximization (EM), Multiple Cause Mixture Model (MCMM), Aspect Models and Probabilistic LSI, Model and Feature Selection
Combining Clustering with IR: Pre-Clustering, Post-Clustering
Pre-Clustering: retrieve one or more clusters in their entirety in response to a query
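Pre-clustering retrieval can be sketched as follows: the corpus is clustered ahead of time, the query is matched against cluster centroids, and the best-matching cluster is returned whole. The clusters, their names, and the overlap-based scoring are assumptions made for this illustration.

```python
from collections import Counter


def centroid(docs):
    """Average term count per document in a cluster."""
    total = Counter()
    for doc in docs:
        total.update(doc.lower().split())
    return {term: count / len(docs) for term, count in total.items()}


def retrieve_cluster(query, clusters):
    """Return the name and full contents of the cluster closest to the query."""
    q_terms = query.lower().split()

    def score(name):
        c = centroid(clusters[name])
        return sum(c.get(t, 0.0) for t in q_terms)

    best = max(clusters, key=score)
    return best, clusters[best]


# Hypothetical pre-computed clusters.
clusters = {
    "astronomy": ["star cluster galaxy", "telescope star sky"],
    "cinema": ["movie star award", "film director award"],
}
name, docs = retrieve_cluster("star telescope", clusters)
print(name, docs)
```

In practice the centroids would be computed once at indexing time rather than per query; they are recomputed here only to keep the sketch short.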
-
Post-Clustering Approaches
Clustering is used in improving document search and retrieval
An attempt to improve conventional search techniques
Enhancing near-neighbor search
Scatter/Gather Method
A document browsing technique that employs document clustering as its primary operation
Document clustering algorithms are often slow, with quadratic running times
How clustering can be an effective method in its own right
-
Scatter/Gather: A Cluster-Based Approach
How it works
The system clusters documents into a small number of groups - Scatter
The system displays short summaries of them
The user chooses one or more of the groups for further study
Selected groups are gathered together to form a subcollection
With each successive iteration the groups become smaller and more detailed
When the groups become small enough, this process bottoms out by displaying individual documents
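The loop above can be sketched as below. The clustering step is a deliberately crude stand-in (each document joins the most similar of k seed documents), the user is simulated by a callback, and all names and sample texts are hypothetical; a real system would plug in Buckshot or Fractionation for the scatter step.

```python
from collections import Counter


def scatter(docs, k):
    """Stand-in clustering: assign each doc to the most similar of k seed docs."""
    seeds = docs[:k]
    groups = [[] for _ in seeds]
    for doc in docs:
        words = set(doc.split())
        overlaps = [len(words & set(seed.split())) for seed in seeds]
        groups[overlaps.index(max(overlaps))].append(doc)
    return [g for g in groups if g]


def summarize(group, n_terms=3):
    """Short summary of a group: its most frequent terms."""
    counts = Counter(word for doc in group for word in doc.split())
    return [word for word, _ in counts.most_common(n_terms)]


def scatter_gather(docs, choose, k=2, stop_size=2):
    """Alternate scatter and gather until the collection is small enough."""
    collection = list(docs)
    while len(collection) > stop_size:
        groups = scatter(collection, k)
        picked = choose([summarize(g) for g in groups])
        gathered = [doc for i in picked for doc in groups[i]]
        if not gathered or len(gathered) == len(collection):
            break  # the choice no longer narrows the collection
        collection = gathered
    return collection


def choose_orbit(summaries):
    """Simulated user who picks every group whose summary mentions 'orbit'."""
    return [i for i, terms in enumerate(summaries) if "orbit" in terms]


docs = [
    "star galaxy telescope", "movie star award", "galaxy telescope orbit",
    "film award actor", "telescope orbit planet", "movie actor film",
]
result = scatter_gather(docs, choose_orbit)
print(result)
```

Each pass shrinks the collection, so the summaries shown to the user become more specific until individual documents are listed.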
-
Application to Scatter/Gather
Zooming into a large document collection
Interactive browsing paradigm
Effective information access tool
Helpful in situations where the query is unspecified
Comparatively fast algorithms: Buckshot and Fractionation
Linear-time preprocessing
Constant-time query processing
Effective geometric clustering tool
Limitations
Even the Buckshot or Fractionation algorithms may be too slow for a large corpus on the Web
Quality of clustering
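For reference, the Buckshot idea can be sketched as: cluster a random sample of about sqrt(k*n) items with a slow agglomerative method, then use the resulting k centroids as seeds and assign every item to its nearest seed. The version below is a simplified illustration on numeric vectors; it merges by centroid distance instead of group-average similarity, skips the refinement passes, and the data points are invented.

```python
import math
import random


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def agglomerate(points, k):
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [
            (math.dist(mean(clusters[i]), mean(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def buckshot(points, k, seed=0):
    """Seed k centers from a sqrt(k*n) sample, then assign all points once."""
    rng = random.Random(seed)
    size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    sample = rng.sample(points, size)
    centers = [mean(c) for c in agglomerate(sample, k)]
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        groups[nearest].append(p)
    return centers, groups


# Invented 2-D points forming two obvious blobs.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
centers, groups = buckshot(points, k=2)
print(centers, [len(g) for g in groups])
```

The expensive quadratic step only ever sees the sample, which is where the speed-up comes from; the quality of the final clusters depends on how representative that sample is.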
-
Web Document Clustering Using Suffix Tree Algorithm
Preparing the Doc
Suffix Tree Construction
Scoring Clusters: $SC = NC \cdot p(l_i)$
Merging Clusters: merge clusters x and y if $|B_x \cap B_y| / |B_x| > k$ and $|B_x \cap B_y| / |B_y| > k$
Labeling Clusters: one or more labels in the original suffix tree
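The merging rule can be illustrated with a short sketch. It assumes base clusters are non-empty sets of document IDs, uses k = 0.5 as an example threshold, and treats the final clusters as the connected components of the "similar" relation; the IDs are invented.

```python
def similar(bx, by, k=0.5):
    """Merging rule: the overlap must exceed k relative to both base clusters."""
    common = len(bx & by)
    return common / len(bx) > k and common / len(by) > k


def merge_base_clusters(base_clusters, k=0.5):
    """Union base clusters that are linked, directly or transitively, by similar()."""
    n = len(base_clusters)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similar(base_clusters[i], base_clusters[j], k):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())


# Invented base clusters (sets of document IDs).
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9, 10, 11}]
result = merge_base_clusters(base)
print(result)
```

Document 8 ends up in two final clusters here, because an overlap of exactly one half does not pass the strict threshold; this is the non-exclusiveness that comes with the approach.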
-
Analysis of STC
Definition: STC is an incremental, O(n)-time clustering algorithm that satisfies these requirements
Applications
Effective for information retrieval
Snippets versus whole-document clustering
Execution time is less
Drawbacks
Non-exclusiveness: documents may appear in more than one cluster; may share only a few short words
Incompleteness: no specific category; does not contain all documents
Absoluteness: no information about document lengths or suffix mismatches
Topic generating: topic identification for document clusters
-
Clustering High-Dimensional Data
Clustering high-dimensional data
Many applications: text documents, DNA micro-array data
Major challenges:
Many irrelevant dimensions may mask clusters
Distance measure becomes meaningless due to equi-distance
Clusters may exist only in some subspaces
Methods
Feature transformation: only effective if most dimensions are
relevant
PCA & SVD useful only when features are highly
correlated/redundant
Feature selection: wrapper or filter approaches
useful to find a subspace where the data have nice clusters
Subspace-clustering: find clusters in all the possible subspaces
CLIQUE, ProClus, and frequent pattern-based clustering
-
Feature Selection
Feature selection strategy
Remove non-informative words from documents
Improve categorization effectiveness
Reduce computational complexity
Remove redundant data
Result: Dimensionality Reduction
[Figure: Dimensionality Reduction. Data Space -> Feature Space -> Cluster/Class, with dimension labels n, m1, m2, k separated by ">>"]
-
Document Clustering using Feature Selection
[Figure: Documents -> Preprocessing (Stop word Elimination, Stemming) -> Feature Extraction (Document-Term Matrix) -> Feature Selection -> Clustering Algorithm -> Clusters]
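The stages before the clustering algorithm can be sketched as below. Everything specific is an assumption made for illustration: the tiny stop-word list, the trailing-"s" rule standing in for a real stemmer, and document-frequency thresholding as the feature selection criterion.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "is"}  # tiny illustrative list


def preprocess(text):
    """Stop-word elimination plus a crude stand-in for stemming."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def document_term_matrix(docs):
    """Feature extraction: one row per document, one column per term."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]


def select_features(vocab, matrix, min_df=2):
    """Feature selection: keep terms that occur in at least min_df documents."""
    keep = [j for j in range(len(vocab))
            if sum(1 for row in matrix if row[j] > 0) >= min_df]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]


docs = ["The stars in the galaxy", "A galaxy of stars",
        "The movie star", "Awards and movies"]
vocab, matrix = document_term_matrix(docs)
selected, reduced = select_features(vocab, matrix)
print(vocab, selected)
```

The reduced matrix is what the clustering algorithm would receive; other filter criteria could replace the document-frequency threshold without changing the shape of the pipeline.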
-
Feature Selection
A good feature set is
Efficient: dimension as low as possible (objective)
Effective: discriminating documents as much as possible (subjective)
Feature selection process: an optimization process, minimizing the number of features and maximizing the discriminating property of the feature set
Problem statements
Searching the feature space to find an optimum subset of features to satisfy the goal
Silent about the clusters of different subspaces
-
The Curse of Dimensionality
When the number of dimensions increases, the distances between any two points become nearly the same
Surprising results!
This is the reason why we need to study subspace clustering
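A small experiment makes the effect visible. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast (farthest minus nearest distance, divided by the nearest distance) from one query point; the sample size and seed are arbitrary.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast should shrink sharply as the dimension grows, meaning the nearest and farthest neighbours become hard to tell apart.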
-
Document Clustering using Subspace
[Figure: Documents -> Preprocessing (Stop word Elimination, Stemming) -> Subspace Clustering -> Clusters]
-
Why Subspace Clustering?
To integrate feature evaluation and clustering in order to find
clusters in different subspaces
Uncover complex relationship in data set
Subspace clustering: find clusters in all the subspaces
Cover the whole document collection to make subspaces
Can handle the new features
Extension of feature selection
Top-down subspace clustering search
Bottom-up subspace clustering search
Dense Unit-based Method
Entropy-Based Method
Transformation-Based Method
-
Subspace Clustering
Top-down Subspace Clustering Algorithms
Multiple iterations of expensive clustering algorithms
Find an initial clustering in the full set of dimensions
Evaluate the subspace of each cluster
Iterative processing is done to improve the result
Bottom-up Subspace Clustering Algorithms
Integrate the clustering and subspace selection
Find the dense regions in low-dimensional spaces
Combine them to form clusters
Text mining is particularly relevant and presents unique challenges to subspace clustering.
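The bottom-up search can be illustrated with a dense-unit sketch in the spirit of grid-based methods such as CLIQUE. It is a simplification under stated assumptions: coordinates lie in [0, 1), the grid has a fixed number of bins, density is a plain count threshold, and the search stops at 2-D units. The sample points are invented.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, n_bins=4, min_count=3):
    """Return the dense 1-D units and the dense 2-D units built from them."""
    dims = range(len(points[0]))

    def cell(x):
        return min(int(x * n_bins), n_bins - 1)

    # 1-D pass: count points per (dimension, bin) unit.
    counts_1d = Counter((d, cell(p[d])) for p in points for d in dims)
    dense_1d = {unit for unit, c in counts_1d.items() if c >= min_count}

    # 2-D pass: only pair up units that are already dense in 1-D.
    counts_2d = Counter()
    for p in points:
        units = [(d, cell(p[d])) for d in dims if (d, cell(p[d])) in dense_1d]
        counts_2d.update(combinations(units, 2))
    dense_2d = {unit for unit, c in counts_2d.items() if c >= min_count}
    return dense_1d, dense_2d


# Invented 3-D points: four of them are close in dimensions 0 and 1 only.
points = [
    (0.10, 0.12, 0.05), (0.15, 0.05, 0.30), (0.05, 0.20, 0.55),
    (0.20, 0.10, 0.80), (0.60, 0.90, 0.10), (0.90, 0.40, 0.60),
]
dense_1d, dense_2d = dense_units(points)
print(dense_1d, dense_2d)
```

The result names both the dense region and the dimensions it lives in, so the subspace is found together with the cluster instead of being chosen beforehand.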
-
Applications of Subspace Clustering
Information Integration
Web Text Mining
DNA Microarray
Web Text Mining
Web pages in a document-term matrix: instances (pages) by features (keywords)
Find a set of keywords (subspace) for a given group of pages
Keywords connect the group
Clusters represent the domain
-
Example Data Set (400 instances)
[Figure: four clusters of 100 instances each (Cluster I, II, III, IV) in a 3-D space (a,b,c); the clusters exist in the 2-D subspaces (a,b) and (b,c)]
Apply k-means
Does a poor job of finding the clusters
Each cluster is spread over irrelevant dimensions
Consider fewer dimensions
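A data set of this shape can be generated to see why full-dimensional distances mislead. The sketch below assumes tight Gaussian clusters in their two relevant dimensions and uniform noise in the third; the centers, the spread, the noise range, and the assignment of clusters to subspaces are invented for illustration.

```python
import random
import statistics

A, B, C = 0, 1, 2  # indices of dimensions a, b, c


def make_cluster(rng, center, relevant, n=100, spread=0.2, noise=(-3.0, 3.0)):
    """Tight Gaussian in the relevant dimensions, uniform noise elsewhere."""
    return [
        [rng.gauss(center[d], spread) if d in relevant else rng.uniform(*noise)
         for d in range(3)]
        for _ in range(n)
    ]


rng = random.Random(0)
clusters = {
    "I": make_cluster(rng, (1, 1, 0), {A, B}),
    "II": make_cluster(rng, (-1, -1, 0), {A, B}),
    "III": make_cluster(rng, (0, 1, 1), {B, C}),
    "IV": make_cluster(rng, (0, -1, -1), {B, C}),
}

# Per-cluster standard deviation along a, b and c.
spreads = {
    name: [statistics.pstdev(col) for col in zip(*rows)]
    for name, rows in clusters.items()
}
for name, s in spreads.items():
    print(name, [round(v, 2) for v in s])
```

Each cluster is compact in two dimensions and scattered in the remaining one, so a distance computed over all three dimensions is dominated by the noise.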
-
Apply Feature Transformation
Transforms the dimensions from high to low
Relative distances are preserved
The effect of irrelevant dimensions is not removed
Apply Feature Selection
Reduces the dimensionality
Finds clusters in the same subspace
Does not explain clusters in different subspaces
Find the clusters in each subspace
-
Apply Subspace Clustering
Represents the clusters in interpretable and meaningful ways
Represents each cluster as well as the subspace in which it exists
Uncovers the complex relationships found in data
In order to do this
Unique challenges in subspace clustering
Finding an appropriate result depends on the clustering technique
Strengths, weaknesses and biases of potential clustering algorithms
-
Research Proposal
To investigate computationally efficient ways of combining information retrieval with clustering.
Efforts will be made to explore efficient clustering algorithms that work better on high-dimensional datasets and apply them to document clustering.
Feature vector representation and reduction of its dimensionality using feature selection and subspace clustering will be investigated to make clustering algorithms more efficient for large sets of documents. Specifically, we will focus on word co-occurrence frequency to reduce the feature space for clustering.
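As a starting point, word co-occurrence frequency can be counted at document level as sketched below. How these counts would then be used to reduce the feature space is not specified in the proposal, so the sketch stops at counting; the function name and sample texts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered word pair, the documents containing both."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))  # pairs in alphabetical order
    return counts


docs = ["star galaxy telescope", "galaxy telescope orbit", "movie star award"]
pairs = cooccurrence_counts(docs)
print(pairs.most_common(3))
```

Pairs are stored in alphabetical order, so ("galaxy", "telescope") is the key to look up, not ("telescope", "galaxy").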
-
Thanks
Suggestions!