Poznan University of Technology
Institute of Computing Science
Descriptive Clustering as a Method
for Exploring Text Collections
Dawid Weiss
A dissertation submitted to
the Council of the Faculty of Computer Science and Management
in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Supervisor
Jerzy Stefanowski, PhD Dr Habil.
Poznan, Poland
2006
Poznan University of Technology
Institute of Computing Science
Descriptive Clustering as a Method
for Exploring Text Collections
Dawid Weiss
Doctoral dissertation
Submitted to the Council of the Faculty of Computer Science and Management
of Poznan University of Technology
Supervisor
Jerzy Stefanowski, PhD Dr Habil.
Poznan 2006
A B S T R A C T
This dissertation is concerned with information retrieval systems and with the
identification of thematically related subgroups in document collections.
The main motivation of the thesis is the creation of readable, grammatically
correct and comprehensible descriptions of the discovered groups. Currently
used text clustering algorithms are not suited to automatically producing
sufficiently good descriptions, while new applications in data exploration
demonstrate their practical importance.
The dissertation makes the above postulates concerning group descriptions
concrete in the form of a definition of the descriptive clustering problem.
It then presents a general approach, named Description Comes First (dcf),
to constructing clustering algorithms that attempt to satisfy these requirements.
In contrast to the classical approach, where only the assignment of documents
to groups is evaluated, dcf treats the group description as one of the key
elements of the algorithm's result and uses that description both while
constructing the model of document clusters and while creating the group
descriptions. In the dcf approach, the search for a set of candidate group
descriptions and for a mathematical cluster model proceeds independently.
Among the candidate descriptions, those are then selected which have support
in the discovered cluster model. In the last step, documents are assigned
to the selected descriptions.
The thesis presents two algorithms exemplifying a practical implementation of
the dcf approach. The first algorithm, Lingo, is applied to clustering the
results of Web search engines. The second, Descriptive k-Means, serves to
cluster large numbers of longer documents. Both algorithms implement the
same general dcf-based scheme of operation, but the differing characteristics
of the processed data call for different concrete solutions: Lingo uses
frequent phrases and dimensionality reduction of the term matrix, while
Descriptive k-Means uses frequent phrase extraction, noun phrases and
clustering with the k-Means algorithm.
The thesis reports computational experiments for both algorithms. The
experimental results compare the clustering quality (understood as the
ability to reconstruct a known assignment of documents to groups) of Lingo
and Descriptive k-Means with their closest counterparts in the literature:
the Suffix Tree Clustering and k-Means algorithms. Another important
practical aspect of the evaluation is the presentation of data collected from
the public version of the Carrot2 system, available as open source software.
Contents
Preface IV
1 Introduction 1
1.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Goals and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Assumptions and Overview of the Solution . . . . . . . . . . . . . . . . . . . . 8
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Typographical Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Background 14
2.1 Text Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Word and Sentence Segmentation . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Shallow Linguistic Preprocessing . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Finding Phrases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Text Chunking in English 16 ⋄ Text Chunking in Polish 17 ⋄ Heuristic Chunking
in Polish 18 ⋄ Phrases as Frequent Sequences of Words 19
2.2 Text Representation: Vector Space Model . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Document Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Similarity Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Document Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Overview of Selected Clustering Algorithms . . . . . . . . . . . . . . . 29
Partitioning Methods 30 ⋄ Hierarchical Methods 30 ⋄ Clustering Based on Phrase
Co-occurrence 31 ⋄ Other Clustering Methods 32
2.3.3 Applications of Document Clustering . . . . . . . . . . . . . . . . . . 32
2.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5 Evaluation of Clustering Quality . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.1 User Surveys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.2 Measures of Distortion from Predefined Classes . . . . . . . . . . . . 38
3 Descriptive Clustering 40
3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 Cluster Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Comprehensibility 42 ⋄ Conciseness 43 ⋄ Transparency 43
3.2.2 Document Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Internal Consistency 45 ⋄ External Consistency 45 ⋄ Overlaps and Outliers 45
3.3 Relationship with Conceptual Clustering . . . . . . . . . . . . . . . . . . . . . 45
4 Solving the Descriptive Clustering Task: Description Comes First Approach 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Anatomy of Description Comes First . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Phase 1: Cluster Label Candidates . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Phase 2: Document Clustering (Dominant Topic Detection) . . . . . 50
4.2.3 Phase 3: Pattern Phrase Selection and Document Assignment . . . . 50
4.2.4 An Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Discussion of Clustering Quality . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5 The Lingo Algorithm 57
5.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Input Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2.2 Frequent Phrase Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.3 Cluster Label Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.4 Cluster Content Allocation . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.5 Final Cluster Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 An Illustrative Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6 Descriptive k-Means Algorithm 65
6.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2 Overview of the Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.2 Dominant Topic Detection . . . . . . . . . . . . . . . . . . . . . . . . . 69
Preparation of Document Vectors 69 ⋄ Clustering with k-Means 69
6.2.3 Selecting Pattern Phrases and Allocating Cluster Content . . . . . . . 71
6.3 Dealing With Large Instances: Implementation Details . . . . . . . . . . . . . 73
6.3.1 Data Storage and Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.3 Searching for Pattern Phrases . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.4 Searching for Documents Matching Pattern Phrases . . . . . . . . . . 75
6.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7 Evaluation 79
7.1 Evaluation Scope and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2 Experiment 1: Clustering in Lingo . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.1 Test Data and Experiment’s Setting . . . . . . . . . . . . . . . . . . . . 80
7.2.2 Output Clusters Structure and Quality . . . . . . . . . . . . . . . . . . 81
7.2.3 Analysis of Cluster Contamination . . . . . . . . . . . . . . . . . . . . 88
7.2.4 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3 Experiment 2: Clustering in Descriptive k-Means . . . . . . . . . . . . . . . . 90
7.3.1 Test Data and Experiment’s Setting . . . . . . . . . . . . . . . . . . . . 90
7.3.2 Output Clusters Structure and Quality . . . . . . . . . . . . . . . . . . 93
7.3.3 Manual Inspection of Cluster Content . . . . . . . . . . . . . . . . . . 103
7.3.4 A Look at Cluster Labels . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.3.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4 User Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4.1 Open Sourcing Carrot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.2 Online Demo Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8 Summary and Conclusions 112
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A Cluster Contamination 116
Bibliography 120
Web Resources 128
Index 129
Preface
The nature of processes leading to useful classifications remains lit-
tle understood, despite considerable effort in this direction.
— Ryszard Michalski, Robert Stepp [66]
Grouping objects into larger structures is a natural impulse exhibited by most humans —
we like it when things are in order, when they have clear relationships with other things. We
like it when structures we discover or create form a hierarchy of some sort. Our intuition is
driven by potential gains: categorization lets us shorten the time needed to find things we
look for (compare a well organized file drawer to a cluttered desktop), grouping things into
logical, higher level structures relieves our consciousness from digesting excessive amounts
of information (a car is composed of hundreds of different parts, yet when we look at it we
still perceive a single object).
The ability to organize and link objects into larger structures is very natural — we all
constantly use it after all — but not at all trivial to explain or formalize. The field of natu-
ral science provides an excellent example. A taxonomy of living things on Earth was for a
long time an unsolved puzzle — many people tried to come up with a classification encom-
passing all the known organisms, but it took the genius of Carl Linnaeus’ intuition to put
together a hierarchical organization of species in which most of the known creatures fitted.
Linnaeus' concept was to group species based on similarities of the anatomic features they
exhibited. His methodology was revolutionary even if its outcome was in many places
doubtful or even laughed at (it included dragons and the phoenix). The point we try to make is
that creating any form of sensible, useful taxonomy is always difficult and subjective. Even
when it is performed by a computer algorithm, all the assumptions, intuition and practical
experience of its designer are inevitably entwined somewhere inside it.
The work on this thesis started with the following practical scenario. Imagine a set of text
documents, each one discussing one or more subjects (topics) and a human user interested
in an overview of these topics. I decided to call this activity of exploring an unknown col-
lection of documents exploratory text mining — human inspection of a group of documents
with a vague (impossible to express with a precise query) or no particular information need
known in advance. The goal is to reduce the time it takes the user to extract and compre-
hend the structure of topics in the original collection.
There are a few scientific methods for identifying topics in texts. I focused my interests
on text clustering because it is traditionally associated with information retrieval and has
gained popularity recently due to its combined use with Web search engines. What I hope
distinguishes this work is that instead of optimizing document allocation quality, we suggest
a slightly different viewpoint on the goals of text clustering when applied to exploratory text
mining — a viewpoint based on the human perspective, where the meaning of cluster labels
is at least as important as their content. It would be untrue to claim that the methods contained
in this thesis provide the best answer to clustering textual content in general; very likely no
unique all-purpose “best” algorithm for this task exists. The main value of this work is in the
idea of constructing clustering algorithms in a slightly different way from the mainstream
research in the field. Our primary objective was to show an approach that facilitates con-
struction of sensible cluster descriptions and makes the results easier to comprehend to the
human user — a task traditionally very difficult and hence often neglected in information
retrieval. The fact that the algorithms we describe seem to perform at least as well as
methods known in the literature is an additional, and of course quite pleasing, bonus.
Acknowledgment
Numerous people helped me through this effort and I’d like to mention some of them here.
A number of my past and present teachers filled me with deep understanding of the di-
versity of computer science problems, so valuable in undertaking new challenges. I deeply
thank my supervisor Jerzy Stefanowski for his persistence and dedication to guiding my
work. Professor Roman Słowinski was an endless source of inspiration and thoughtful criti-
cism. Professor Jerzy Nawrocki gently introduced me to real-life software engineering which
still constitutes a significant part of my research.
I appreciate the invaluable help of Marek Kubiak, Stanisław Osiński, Przemysław Wesołek
and other doctoral students I learned so much from.
Finally, I’d like to thank my family for the enormous support and encouragement they
offered me, and of course for the endless patience they showed during my late-night
study sessions.
Chapter 1
Introduction
1.1 Problem Setting
Two Types of Information Needs Just a few years ago size seemed to be everything —
search engines published weekly counts of Web pages indexed and available for immedi-
ate access to the public. But as the numbers grew into billions, size became just another in-
comprehensible factor. Nowadays electronic information sources on the Web include a great
variety of different content. Alongside traditional Web pages in html, we have newswire
stories, books, e-mails, blogs, source code repositories, video streams, music and even tele-
phone conversations. Even narrowing the scope to textual content, the range of different
possibilities is overwhelming. A piece of text downloaded from the Internet is usually un-
structured, multilingual, touching upon all kinds of subjects (from encyclopaedia entries to
personal opinions) and in general unpredictable (think of all the typographical conventions,
abbreviations, new words that come with electronic publications). The sheer availability of
information is no longer that exciting; the new revolution lies in improving the ways of com-
prehending and utilizing the information we have gathered so far.
Let us narrow the focus to a collection of textual documents (possibly very large) and
distinguish two types of information needs: (1) searching for a concrete piece of information,
and (2) exploring the structure of a collection of documents.
We search for concrete pieces of information when we need to find a particular Web
page, document, historical fact or a person. This kind of information need has an interesting
property: we can express it to a computer system using a query. The computer system may
try to find a direct answer to our query (as in question-answering systems), but more often
just locate a resource (document) that possibly contains the answer to our query. The latter
systems are called document-retrieval systems or, in short, search engines; in this thesis
our discussion concerns mostly programs of this type.
A single query typically matches a number of documents and a search engine must ar-
range the result into an ordered list, sorting documents according to their relevance to the
query. This final list is often called a hit list and its topmost entries are shown to the user.
Obviously, users rarely have the time to browse through all the returned documents and
Figure 1.1: Selected computer science problems, disciplines and their relation to the two
types of information needs. Searching for information: document retrieval, ranking
algorithms, query-answering systems. Exploring the structure: summarization, document
clustering, topic detection and tracking. Related computer science disciplines: natural
language processing; probability and graph analysis; information retrieval; Web and text
mining; machine learning; human-computer interaction.
limit their effort to the hit list’s topmost few entries, so calculating relevance is a key factor
ensuring that documents most likely to contain answers to the query are pushed up in the final
ranking. These are the basic foundations of all search engines available at the time of writ-
ing this thesis: use a very simple, but fast Boolean model of term containment [4] to find
matching documents and rely on two things to provide valuable service to the user: qualita-
tive ranking algorithms and, most of all, the user’s ability to rephrase the query until his or
her information need is satisfied.
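The term-containment retrieval scheme just outlined can be sketched with an inverted index. The toy corpus and function names below are illustrative assumptions, not any particular engine's implementation; ranking is omitted entirely.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query_terms):
    """Return ids of documents containing *all* query terms."""
    sets = [index.get(t, set()) for t in query_terms]
    if not sets:
        return set()
    return set.intersection(*sets)

docs = [
    "descriptive clustering of text collections",
    "clustering search results with phrases",
    "text mining and information retrieval",
]
index = build_index(docs)
print(sorted(boolean_and(index, ["clustering", "text"])))  # → [0]
```

The Boolean match is fast and simple, which is exactly why, as noted above, engines must additionally rely on ranking and query reformulation to be genuinely useful.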
The activity of exploring a collection of documents takes place when there is no infor-
mation need or it is too vague to formulate a specific query. For example, imagine creating a
query for the following question: given a set of previously unseen documents, what subjects
are they about? An alternative task could be this: what subjects dominate the headlines of all
major newspapers today? A human being could answer these questions simply by reading
through all the available documents, but such a solution is usually unacceptable as it requires
too much time and effort. Exploration problems are also encountered in combination with
search engines. Queries issued to search engines are mostly short and ambiguous [104] and
match vast numbers of documents concerning a variety of subjects. Creating a linear hit list
out of such a broad set of results often requires trade-offs and hiding documents that could
prove useful to the user. If shown an explicit structure of topics present in a search result,
users quickly narrow the focus to just a particular subset (slice) of all returned documents.
Computer science responds to the above information needs by specifying problems and
devising solutions rooted in a number of disciplines (see Figure 1.1). This thesis concen-
trates on text clustering methods developed primarily in information retrieval and attempts
to improve their applicability to the task of exploration of document collections by making
sure clusters are described in a meaningful way.
Text Clustering Text clustering, or simply clustering (although this term denotes a much
broader class of problems), is about discovering semantically related groups in an unstruc-
tured collection of documents. Clustering has been very popular for a long time because it
provides unique ways of digesting and generalizing large amounts of information.
An initial application of text clustering was in document retrieval. Keith Van Rijsbergen
observed that “Closely associated documents tend to be relevant to the same requests” (this
statement is referred to as the cluster hypothesis [96]). The idea was to cluster the collection cluster hypothesis
of documents offline and enrich the search result with documents from the same clusters
as these matching the query directly. This way documents related to the query but not con-
taining all of its terms could be found and shown to the user. Obviously, there is a question
of how good the clustering algorithm is at finding related documents which then become
clusters. Inspired by this question researchers started to evaluate clustering algorithms by
measuring how accurate they are at reproducing a man-made taxonomy. Several large con-
ferences, such as the Text Retrieval Conference (trec) [E], published reference data with
samples of documents clustered manually by humans. The evaluation of “quality” of an
algorithm was based solely on the difference between automatically discovered document
groups and predefined classes.
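The cluster-hypothesis enrichment described above can be sketched as follows; the cluster assignment is assumed to have been computed offline, and all names here are hypothetical illustrations.

```python
def expand_with_clusters(matching_ids, cluster_of, clusters):
    """Add every document that shares a cluster with a direct query match.

    matching_ids: set of doc ids matching the query directly.
    cluster_of:   doc id -> cluster id (assigned offline).
    clusters:     cluster id -> set of doc ids in that cluster.
    """
    expanded = set(matching_ids)
    for doc_id in matching_ids:
        expanded |= clusters[cluster_of[doc_id]]
    return expanded

clusters = {0: {0, 1}, 1: {2, 3, 4}}
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

# Document 2 matches the query; its cluster-mates 3 and 4 are added
# even though they may not contain all query terms.
print(sorted(expand_with_clusters({2}, cluster_of, clusters)))  # → [2, 3, 4]
```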
Up to this point, all applications of text clustering methods exploited the set of discov-
ered clusters with no intention of presenting them to the user; clustering was hidden in the
background. This situation changed when people realized that clusters could be used as a
compact user interface for browsing document collections. This observation inspired Marti
Hearst and Jan Pedersen and resulted in their influential paper about a search system Scat-
ter/Gather [37]. By scanning the description of a cluster the user could assess the relevance
of the remaining documents in that cluster and narrow the search to the interesting docu-
ments (or at least identify the irrelevant clusters and skip their content).
Note that bringing clusters forward to the user interface introduces a great deal of prob-
lems with their presentation. A cluster is basically a group of documents, typically discov-
ered by means of a mathematical model that is unexplainable in plain text. Most cluster-
ing algorithms used very simple display techniques: selected titles of documents within the
cluster, excerpts of documents, and the most popular — a list of the most prominent clus-
ter terms called keywords. Unfortunately, none of these methods truly solves the problem of
creating sensible cluster descriptions. In our opinion, a true solution is basically impossible
without changing the goals of clustering and including cluster labels as one of the crucial
elements of the overall result. In conclusion, instead of optimizing the document allocation
quality, our focus shifts to how users perceive clusters — to the comprehensibility of their
descriptions.
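A minimal sketch of the keyword display technique discussed above: label a cluster with its most frequent terms. Plain counts and a tiny stop list are used here as simplifying assumptions; a real system would use tf-idf-style weights.

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "with"}  # toy stop list

def keyword_label(cluster_docs, k=3):
    """Label a cluster with its k most frequent non-stopword terms."""
    counts = Counter(
        term
        for text in cluster_docs
        for term in text.lower().split()
        if term not in STOP
    )
    return [term for term, _ in counts.most_common(k)]

cluster = [
    "apache ant build files",
    "ant build scripts for java",
    "java build automation with ant",
]
print(keyword_label(cluster))  # → ['ant', 'build', 'java']
```

A label such as "ant, build, java" hints at the topic but, as argued above, loses the phrase structure ("Apache Ant build files") that would make it readable.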
Cluster Descriptions At the moment of writing, clustering of search results and short doc-
uments is a popular research topic. A good few commercial search engines with clustering
functionality also exist. But, contrary to expectations, clustering is far from dominating the
search world. While politely nodded at as very useful, search engines enriched with dy-
namic clustering are still considered an interesting curiosity rather than a daily search tool.
This reluctance may be caused by the fact that clustering algorithms have been notoriously
weak at generating sensible cluster labels, discouraging users. Representing clusters using
keywords strips the semantics from their description and makes them difficult to under-
stand (see Figure 1.2 on the following page for an example). The order of keywords may also
suggest an entirely different meaning. A humorous example is a cluster named poświęcony,
Figure 1.2: Keyword representation of clusters is often insufficient to comprehend the
meaning of documents in a cluster. Examples above come from research systems [53] (left)
and [37] (right).
Figure 1.3: Search results enriched with clusters in Carrot2. The discovered clusters are shown
on the left side of the screen, with a fragment of the original set of search results on the right
side of the screen.
serwis (consecrated, service), where the original documents contained the phrase serwis poświęcony
(service dedicated to).
The work on this thesis was initially inspired by Oren Zamir’s stc algorithm [104, 106],
which spawned a new research field called search results clustering. The input in search
results clustering contains only the fragments of original documents retrieved by a search
engine (called snippets) and the expected output is a set of labeled groups of topics domi-
nating in the input. Figure 1.3 on the previous page shows a screenshot of a search results
clustering system, Carrot2. The number of “documents” (snippets) is usually quite small (a
few hundred entries at most) and the input text is incomplete, fragmented and really short
(a few sentences). With this kind of input, the breakthrough idea in stc was to utilize fre-
quently recurring sequences of terms (phrases) to construct document groups and to use the
same phrases to describe the identified document groups later.
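The recurring-phrase idea behind stc can be sketched as a brute-force n-gram count over snippets. An actual implementation uses a generalized suffix tree; this quadratic version, with made-up snippets and thresholds, is purely illustrative.

```python
from collections import Counter

def frequent_phrases(snippets, min_len=2, max_len=4, min_count=2):
    """Collect word n-grams that recur across several snippets."""
    counts = Counter()
    for text in snippets:
        words = text.lower().split()
        for n in range(min_len, max_len + 1):
            # set(): count each phrase at most once per snippet
            grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
            counts.update(grams)
    return [" ".join(g) for g, c in counts.items() if c >= min_count]

snippets = [
    "the apache ant build tool",
    "apache ant is a java build tool",
    "download apache ant binaries",
]
print(sorted(frequent_phrases(snippets)))  # → ['apache ant', 'build tool']
```

The surviving phrases can serve both as grouping features and as candidate labels, which is precisely the dual role phrases play in stc.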
We reimplemented stc and used it to cluster search results in Polish [101]. The results
caused mixed feelings. The semantics of cluster descriptions was often very poor — frequent
phrases with no meaning crept into cluster labels and problems with handling Polish inflec-
tion were evident. Similar issues applied to clustering search results in English, although on
a smaller scale.
The most valuable contribution of stc was the inclusion of phrases in the process of
clustering with the thought of creating more comprehensible cluster labels. Unfortunately,
the use of phrases as a similarity feature for grouping documents is quite often incon-
venient. In many cases it is impossible to distinguish meaningful phrases from frequent
word groups that are not at all relevant — collocations, phrasal verbs and others (e.g. home
page, e-mail address, see Figure 1.4(b) on the following page for more examples). We could
summarize these linguistic problems in the following list:
• Inflection. In many languages the same word can appear as different tokens depend-
ing on its role in a sentence. A mechanism such as stemming or lemmatisation is re-
quired to transform input words into identical tokens, regardless of their grammatical
function. Unfortunately, this causes difficulties in finding the right form of the phrase
when it is needed for display (see Figure 1.4(a) on the next page).
• Less strict word order. Word order in English usually determines the meaning of a sen-
tence, so similar subjects are indeed referred to using similar phrases. In Polish, for
example, this is no longer the case. Word order is fixed only to a certain degree and
the meaning of a sentence is a result of grammar relationships between words, usually
manifested by different word suffixes. The assumption that phrases make good features
is therefore much weaker for such languages.
• Phrase boundary detection. Detecting proper phrase boundaries is difficult in any lan-
guage. Consider the following example.
John went to San Francisco to see Mary and Peter
(a) Inflection changes the visual representation of
words, but does not change the meaning. See
inflected forms of the phrase Polska Akademia
Nauk.
(b) Good cluster label selection without linguistic
and domain hints might be impossible. Frequent
word groups Strona Główna (home page) and
Zobacz Temat (see the subject) are not meaningful
phrases.
Figure 1.4: Problems with phrase and linguistic aspects present in a popular commercial
clustering search engine Vivísimo [D].
Potentially interesting phrases in the above example could have the following bound-
aries:
[John went to [San Francisco]] to see [Mary and Peter]

but also:

[[John went to San Francisco to see Mary] and Peter]

or even single-term phrases like:

[John] went to San Francisco to see [Mary] and [Peter]
Without full linguistic analysis (often involving contextual knowledge), which is still
a rarity in systems with on-line processing requirements, finding the real extent of a
phrase is usually impossible.
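The inflection issue from the list above can be illustrated with a toy suffix-stripping stemmer. The suffix list below is a made-up illustration, not a real stemmer or lemmatiser for Polish:

```python
SUFFIXES = ["ii", "ego", "ach", "ami", "ia", "ie", "a", "i", "y", "e", "u"]  # toy list

def toy_stem(word):
    """Strip the longest matching suffix, keeping a stem of at least 3 letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Inflected forms of 'akademia' collapse to one token for matching...
print({toy_stem(w) for w in ["akademia", "akademii", "akademie"]})  # → {'akadem'}
```

The inflected forms collapse to a single token for matching, but the resulting stem "akadem" is no longer a displayable word, which is exactly the display problem noted in the inflection item above.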
Summarizing, neither keyword-based cluster representation nor phrase-based clustering is
satisfactory. If we use a text representation model convenient for computation, it is very
difficult to return to sensible cluster labels. On the other hand, if we decide to use phrases
as similarity features, cluster label construction is simpler, but the clustering quality and
detection of weak clusters are much more problematic, especially in languages like Polish.
We try to find alternative ways of clustering documents which permit finding cluster
descriptions that are informative, expressive and possibly grammatically correct, and at the
same time enable the use of standard word-based similarity models. To the best of our
knowledge this problem has not been sufficiently addressed in the literature.
1.2 Motivation
In the emerging new wave of applications where people are the ultimate target of text clus-
tering methods, cluster labels are intended to be read and comprehended by humans. The
primary objective of a clustering method should be to focus on providing good, descrip-
tive cluster labels in addition to optimizing traditional clustering quality indicators such as
document-to-group assignment. In other words: in document browsing, text clustering
serves the main purpose of describing and summarizing a larger set of documents; the
particular document assignment is of lesser importance. Even if this statement seems
blasphemous at first, it is true: users expect labels to convey all the information about a
cluster’s content. For example, we found that inspection of documents inside a cluster in
search results clustering rarely takes place and is about as likely as re-querying the search
engine using terms taken from the cluster’s label.
There seems to be a distant analogy to search algorithms in information retrieval. At
first, these algorithms were focused only on precision and recall (how many relevant doc-
uments are retrieved for a query and how many irrelevant documents are retrieved along
with the good ones). Nowadays there is a third important factor contributing to the quality
of a search engine: the ranking among the retrieved documents (which documents should
be placed at the top of the hit list). Text clustering, originally focused on recreating a pre-
defined, man-made taxonomy of documents, must now take into account a new important
factor: how to describe and present the result to the human user so that he or she under-
stands it.
We believe that creating a proper representation of clusters is an independent research problem,
that its investigation is worthwhile, and that none of the existing research directions
addresses it directly. To make a clear distinction from traditional document clustering, we will call it the
descriptive clustering problem:
Descriptive clustering is a problem of discovering diverse groups of semantically
related documents described with meaningful, comprehensible and compact
text labels.
To conclude, the motivation behind this thesis can be expressed by the following state-
ments.
• We believe there is a new type of application of clustering methods in information re-
trieval which focuses on revealing the structure of document collections, summarizing
their content and presenting this content to a human user in a compact way.
• A problem stemming from this application differs from original text clustering: the
quality of cluster descriptions is at least as important as their content. We call it descriptive
clustering. The most obvious difference between descriptive clustering and
traditional clustering is that clusters with no apparent label can (and should) be
discarded because they provide no value to the user.
• Descriptive clustering lies at the border of text summarization, topic identification,
classification and text clustering, but none of these disciplines address it directly.
• Existing methods for labeling clusters are insufficient and often fail to create compre-
hensible cluster descriptions. These problems are present especially in Polish (and
presumably other inflectional languages).
• Elements of descriptive clustering itself are underspecified and require a more formal
description.
1.3 Goals and Objectives
The overall goal of this thesis is to find a text clustering method fulfilling the requirements
concerning cluster labels and applicable for exploratory text mining. Specific objectives are
given below:
1. Define the descriptive clustering problem and specify its requirements.
2. Devise a method — Description Comes First (dcf) — that yields comprehensible, con-
cise and grammatically correct (to the extent possible) descriptions of clusters. The
method should assure a clear, transparent relationship between the cluster label and
its contents.
3. Show practical implementation of dcf in two clustering algorithms: Lingo (applicable
to search results clustering) and Descriptive k-Means (dkm) (applicable to collections
of several thousand short and medium-length documents).
4. Address the performance requirements and scalability of the presented algorithms.
Experimentally verify how dcf affects the quality of document assignment.
A secondary objective of this work is to provide efficient implementations of the pre-
sented algorithms.
1.4 Assumptions and Overview of the Solution
This section starts with important assumptions delimiting the boundaries of this work and
our interpretation of several key concepts, such as comprehensibility. Following is a brief
outline of the presented solution, stripped of many details, but hopefully useful for grasping
the major concepts detailed in later chapters.
Comprehensibility and Clustering Properties We mentioned comprehensibility and clar-
ity of cluster descriptions a few times so far, including the definition of descriptive cluster-
ing. Obviously, terms such as “meaningful”, “accurate” or “comprehensible” leave a great
deal of freedom for interpretation. While the meaning or accurateness of a cluster label is
intuitively easy for a human to verify, it is quite difficult to formalize. We define several
requirements concerning cluster labels and document-to-cluster assignment and name them
altogether a problem of descriptive clustering. Chapter 3 is devoted entirely to this subject,
but because descriptive clustering plays such an important role in this thesis, we briefly
introduce these requirements now, leaving their full description and interpretation for later.
With regard to cluster labels, we have three expectations:
• comprehensibility — understood as grammatical correctness (word order, inflection,
agreement between words if applicable); we consider elliptical phrases comprehensi-
ble, although undesirable;
• label conciseness — phrases selected for a cluster label should minimize its total length
(without sacrificing its comprehensibility);
• transparency — the relationship between a cluster’s label and its content should be
clear for the user. This is best explained by ability to answer questions such as: “Why
was this label selected for these documents?” and “Why is this document in a cluster
labeled X?”.
Let us emphasize that the above label-related requirements are specific and essential to de-
scriptive clustering, but other characteristics of a clustering method remain just as impor-
tant. We expect the following from clusters:
• internal consistency — similar documents should be grouped together;
• external consistency — groups of documents should be dissimilar to one another;
• group overlap is allowed — documents may belong to more than one group (because
multi-topic objects are natural when dealing with texts);
• ungrouped documents are allowed — clustering must not enforce strict assignment of
each document to a group; there will always be documents not similar to anything
else;
• flat cluster structure — we will focus on flat clustering algorithms; whether the presented
approach can produce a hierarchy is an open question included in future research
directions.
Input Data and Text Representation Model The size and properties of input documents
have a great impact on the choice and performance of a clustering method. We restrict
our interest to snippets (as in search results) and short or medium documents (as in news
articles, mailing lists or Web pages). We additionally assume that all the input is given in
one language, although we try to make our methods more universal by considering caveats
of clustering in English and Polish.
A typical text clustering algorithm, regardless of the actual method used, follows the
steps depicted in Figure 1.5(a). Input documents are first transformed into a mathematical
model where each document is described by certain features, facilitating calculation of doc-
ument proximity and therefore cluster discovery. In this thesis we restrict our discussion to
(a) A typical clustering algorithm outline.
(b) Outline of an algorithm following the dcf approach.
Figure 1.5: Clustering procedure in a typical clustering algorithm and in the dcf approach.
Note parallelization in dcf.
term:    Java   Sun   language   coffee   bean   . . .
weight:  0.2    0.1   0.16       0.05     0.3    . . .
Figure 1.6: An example document vector in vector space model.
one of the most popular text representation models – the Vector Space Model (vsm) [4, 102].
In the vsm, an input document is transformed into a vector, where each individual compo-
nent denotes a single word in a language and its value represents the importance of that
word. Such document vectors are often very sparse, so typically a document’s representation
is limited to components with non-zero weight (see Figure 1.6 for an example). Note that
this one-way transformation from the source text to its vector representation is partially re-
sponsible for the problems with reconstructing cluster labels: viable syntactical information
is lost, making a document just an uninterpretable set of words.
Overview of Description Comes First Approach To overcome the problems mentioned so
far, we suggest an approach called Description Comes First (dcf). The dcf changes the
troublesome conventional order of a typical clustering algorithm (cluster discovery → la-
bel induction) in a way that splits cluster discovery and candidate label discovery into two
independent phases (see Figure 1.5(b)):
• Candidate label discovery is responsible for collecting all phrases potentially useful as
good cluster labels.
• Cluster discovery provides a computational data model of document groups present
in the input data.
By splitting the process into these two phases the most difficult element so far — creating
proper cluster description from a mathematical model — is avoided and replaced by a prob-
1.4. Assumptions and Overview of the Solution 11
lem of selection of appropriate labels for each cluster found. We mentioned that a mathe-
matical model of clusters is never likely to be understandable to the user. The dcf approach
tries to decrease this “semantic gap” between a model of clusters and a set of selected labels
by discarding the model of clusters entirely and building final groups of documents starting
with just the selected cluster labels. This way the actual model of clusters is used only
internally and can be suitably complex because it never surfaces to the user interface. On the
other hand, the candidate cluster label selection procedure should ensure the labels' comprehensibility
to the user. The whole trick lies in merging these two steps.
To put it in yet other words, an algorithm following the dcf approach first attempts to
find meaningful patterns (cluster labels) and then looks for objects exhibiting these patterns
(documents).1 Clusters in the traditional sense are used solely to select the strongest patterns
and have no later use; final groups of documents are created by assigning documents
to their matching patterns.
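As an illustration only, the final dcf step described above (selected labels acting as queries) might be sketched as follows; the all-terms-must-match rule and the toy data are our own simplifying assumptions, not the thesis' actual retrieval model:

```python
def assign_documents(labels, documents):
    """Final dcf step sketch: each selected label acts as a query; a document
    joins a cluster when it contains every term of that cluster's label.
    Documents may fall into several clusters or into none (ungrouped)."""
    clusters = {label: [] for label in labels}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        for label in labels:
            if set(label.lower().split()) <= words:
                clusters[label].append(doc_id)
    return clusters

docs = {
    1: "the Java language was created at Sun",
    2: "how to brew coffee from fresh beans",
    3: "Java programs and the Java language runtime",
    4: "a completely unrelated document",
}
clusters = assign_documents(["java language", "coffee"], docs)
```

Note that document 4 matches no label and stays ungrouped, which is exactly the behavior postulated in the clustering requirements above.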
Algorithms In our opinion the dcf is a meta-method which can be used to construct algo-
rithms following a certain paradigm. We are going to demonstrate its practical implementa-
tion using two examples: Lingo and Descriptive k-Means.
The Lingo algorithm was the first implementation of the dcf approach. The algorithm
was originally developed by Stanisław Osinski and presented in his master’s thesis written
under the supervision of Jerzy Stefanowski. Several concepts presented here are a result of the
author's close collaboration with Stanisław. Lingo's motivation and implementation fall exactly
in the general pattern introduced in the dcf approach, so we present the algorithm and our
joint experiments as an example of a dcf-based method applicable to clustering search results.
A detailed overview of Lingo can be found in Chapter 5; the next paragraph gives a
general outline of the algorithm.
All candidate cluster labels in Lingo are discovered directly from the input text by select-
ing frequently repeated phrases and terms (so-called recurring phrases). This step is similar
to phrase discovery in the stc algorithm. Concurrently with candidate label discovery, a vector
space model is built for all documents in the input. Lingo uses dimensionality reduction
methods applied to the term-document space to find synthetic vectors that approximate
topics present in the input documents. These vectors are then used to select final cluster la-
bels from the set of all candidates. In the final step, the selected cluster labels are treated as
queries to a conventional vsm-based search system and documents matching a given label
are assigned to it forming a final cluster.
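The core numerical step of this outline — finding synthetic topic vectors via dimensionality reduction and matching candidate labels against them — can be sketched with a plain SVD; the toy matrix, the binary phrase vectors and the cosine-style scoring below are illustrative assumptions, not Lingo's actual implementation:

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents), already weighted.
terms = ["java", "language", "coffee", "bean"]
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # java
    [1.0, 0.8, 0.0, 0.0],   # language
    [0.0, 0.0, 1.0, 0.9],   # coffee
    [0.0, 0.0, 0.8, 1.0],   # bean
])

# Left singular vectors approximate the dominant topics in the collection.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
topics = U[:, :2]  # keep the two strongest topics

def best_label(topic, candidates):
    """Pick the candidate phrase whose (toy) term vector lies closest
    to the given synthetic topic vector."""
    def phrase_vec(phrase):
        v = np.array([1.0 if t in phrase.split() else 0.0 for t in terms])
        return v / np.linalg.norm(v)
    return max(candidates, key=lambda c: abs(phrase_vec(c) @ topic))

candidates = ["java language", "coffee bean"]
labels = {best_label(topics[:, k], candidates) for k in range(2)}
```

The absolute value absorbs the sign ambiguity of singular vectors; in this toy collection each of the two topics selects a different candidate label.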
Descriptive k-Means (dkm), introduced in this thesis, is the second algorithm following
the dcf approach. The dkm was created for two reasons: first, to deal with a different type
of input data — thousands of short and medium documents. Second, to provide answers to
a few interesting questions that arose during the work on descriptive clustering. The field
of search results clustering is still quite recent and there is a significant shortage of data
sets and methods of quality evaluation. Full-text clustering, on the other hand, has a much
1 This pattern-search way of looking at the dcf approach was suggested to us by prof. Tadeusz Morzy.
longer history and well established methodologies. Even if we criticize them a bit later in
this thesis, it was tempting to see if a well-known text clustering algorithm can be modified
to follow dcf’s principles and how its results would compare against the original version. We
decided that the k-Means algorithm would be a good candidate for such an extension because
it is widely recognized and used.
In Descriptive k-Means the phase of cluster label candidate discovery can be performed
using two methods: extraction of frequent phrases (similar to Lingo and stc) or shallow
linguistic preprocessing resulting in extraction of noun phrases. We experiment with both
methods. A baseline version of k-Means algorithm is used to discover clusters in the input
and cluster centroids are used to select pattern phrases from the set of candidate cluster
labels. The obvious challenge was to keep the algorithm efficient considering the predicted
large number of input documents. We implement dkm by reusing efficient data structures
and algorithms known in document retrieval and already present in a typical search engine
used for document storage.
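The centroid-to-label selection step described above can be sketched as follows; the toy four-term space and the candidate vectors are our own illustrative assumptions, not the actual dkm code:

```python
def cosine(u, v):
    """Plain cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def select_pattern_phrases(centroids, candidate_vectors):
    """For each k-Means centroid pick the candidate label whose vector is
    closest in cosine similarity -- the dkm step that turns a purely
    numerical cluster model into a human-readable description."""
    chosen = []
    for c in centroids:
        best = max(candidate_vectors,
                   key=lambda name: cosine(c, candidate_vectors[name]))
        chosen.append(best)
    return chosen

# Toy 4-dimensional term space: [java, language, coffee, bean]
centroids = [[0.7, 0.6, 0.0, 0.1], [0.0, 0.1, 0.8, 0.7]]
candidates = {"java language": [1, 1, 0, 0], "coffee bean": [0, 0, 1, 1]}
chosen = select_pattern_phrases(centroids, candidates)
```

The centroids themselves would come from a baseline k-Means run; only the label-selection step is shown here.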
Experiments Experimental evaluation of clustering algorithms is difficult [59]. Experimental
evaluation of the comprehensibility of cluster labels is even more difficult. We delve into the
discussion of these problems in later sections; here we only briefly point out that all the
experiments we conducted had a common goal: to show that the quality of clustering, measured
in different ways, does not decrease when the dcf approach is applied. We are convinced
that, a bit “by design”, the dcf approach must yield more comprehensible cluster labels. The
point we wanted to verify was whether this comes at a price in document allocation quality.
1.5 Thesis Outline
Chapter 2 introduces several key concepts used in this thesis: foundations of linguistic pre-
processing in information retrieval, data structures such as suffix trees and suffix arrays and
an overview of selected clustering methods and their applications. In the linguistic section
we also outline the complexity of shallow linguistic processing in Polish and report on our
initial attempts to select sensible cluster label candidates in that language. The chapter ends
with a discussion of issues related to evaluation of clustering quality.
In Chapter 3 we try to provide a semi-formal set of requirements and goals of the de-
scriptive clustering problem and discuss its relationship with conceptual clustering present
in machine learning.
Chapter 4 contains a description of a generalized approach for solving the descriptive
clustering problem — the Description Comes First (dcf) approach.
In the two following chapters, we describe algorithms that “implement” the concepts
of dcf in practice: Chapter 5 describes the Lingo algorithm and Chapter 6 presents Descriptive
k-Means. For each we illustrate the way they implement the dcf approach. The chapter
devoted to Descriptive k-Means additionally contains the technical details required for its
efficient implementation.
Chapter 7 presents the results of two computational experiments. The first experiment
assesses Lingo’s quality in the task of clustering search results. In the second experiment
we evaluate Descriptive k-Means’ performance on a standard evaluation data set compared
to the “baseline” k-Means algorithm. We wrap up the evaluation chapter with some conclu-
sions resulting from making the source code of the system an open source project, including
the statistics from an on-line demo of the system.
Finally, in Chapter 8 we end with a summary of conclusions, list of contributions and an
outline of future research directions.
1.6 Typographical Conventions
We use several typographical conventions in this thesis. The meaning of particular mathe-
matical symbols and variables is explained in their context.
new term Important concepts, terms.
code Program code, statements, variables, classes.
A phrase Example phrases and queries in the text.
key term Margin notes about key terms and concepts appearing in the corresponding
paragraph.
C A set of classes (given a priori partitions of the input documents).
K A set of clusters (returned by the algorithm).
[12] Citations to books, articles and other, mostly printed, literature.
[A] Citations to resources found on the Internet (listed on page 128).
Chapter 2
Background
This chapter presents the research background we build upon in the following parts of this
thesis. It is impossible to squeeze a thorough introduction to information retrieval, data
mining and natural language processing into one chapter, so whenever a deeper context is
necessary we provide references to books and articles that delve deeper into the subject.
If the Reader is at least remotely familiar with the subjects mentioned above, then it is probably
best to skip this chapter entirely and refer back to it when the presented concepts are actually
put to use (we always try to make explicit references to relevant background sections).
2.1 Text Preprocessing
2.1.1 Word and Sentence Segmentation
Any text processing application starts with the analysis of a stream of bytes. A single byte can
take up to 256 different values, clearly insufficient to represent all characters available in
the world’s languages. For this reason, characters are usually encoded in a binary stream as
multi-byte sequences (as in Unicode [F]) or single-byte values mapped to specific characters
according to a predefined encoding (byte-to-character pairs). We further assume that
the application knows how to interpret the bytes of an input stream and, using proper
conversion, turn them into a stream of characters.
The problem of text segmentation (or tokenization) starts when we have a stream of characters
and the need to delimit boundaries of individual terms and sentences. The simplest
definition of a term is that it is a continuous sequence of letters. This definition covers most
words in languages written in the Latin script and is quite useful, but obviously not powerful
enough to properly interpret terms such as these (examples partially from [60]):
północno-wschodni Micro$oft :-) ASP.NET
$22.50 child’s i18n c net
The difficulties start to pile up when we consider languages with no explicit word bound-
aries, such as Chinese, or languages where letters in words may be written in right-to-left
order (such as Arabic or Hebrew). In the worst case, we can have a mix of foreign words
written in both directions in the same fragment of text.
In practice authors usually fall back on simple heuristic rules to determine term boundaries.
Manning [60] (and others) give the following procedure, applicable to text segmentation
in most Western languages:
• white space and punctuation characters delimit terms,
• a full stop followed by a white space delimits a sentence.
In our experiments we used a slightly more complex set of rules, implemented in the Carrot2
project. We detect and parse hyphenated terms, question and exclamation marks, and
determine certain token types (term, number, abbreviation). An in-depth discussion of text
segmentation problems and potential solutions can be found in [60].
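A minimal sketch of the two heuristic rules above, with a rough nod to hyphenated terms; this is an illustration only, not the actual Carrot2 rule set:

```python
import re

def segment(text):
    """Split text into sentences and terms using the heuristics from [60]:
    a full stop followed by white space ends a sentence; white space and
    punctuation delimit terms. Hyphenated terms are kept whole."""
    sentences = re.split(r"(?<=\.)\s+", text)
    # a term: a run of letters, optionally joined by hyphens; or a number
    token = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\d+", re.UNICODE)
    return [token.findall(s) for s in sentences]

result = segment("Ala ma kota. The north-eastern wind blew.")
```

Even this tiny sketch already fails on cases like "$22.50" or "ASP.NET" from the examples above, which is precisely why real tokenizers need richer rules.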
2.1.2 Shallow Linguistic Preprocessing
Once the text is split into tokens, we would like to identify their meaning and role in the
sentence. In information retrieval this task is typically accomplished with methods based on
mutual relationships between words and rarely with deeper grammatical analysis. A family
of methods relying on statistics, probability and heuristic techniques of preprocessing text
is called shallow linguistic preprocessing.
When a single term denotes more than one word in a language, we call such words homographs.
A single word, on the other hand, may appear in many different graphical word
forms, depending on its function in a sentence. A single unambiguous “meaning” of a given
word, regardless of its actual notation, is called a lemma. Lemmas are abstract concepts and
in order to write them down, we use their lexemes, a unique token that identifies a lemma.
Lexemes may be just about anything. In information retrieval it is quite convenient to represent
them as numbers to speed up processing. For humans, a lexeme is typically the head
entry of a given lemma listed in a dictionary. For example, a lexeme mean1 can denote an
arithmetical mean and lexeme mean2 can identify the verb “to have in the mind as a purpose”.
We said that a single lexeme may represent many different word forms. This happens fre-
quently in inflectional languages, like Polish, where the graphical representation of a lemma
depends on its role in the sentence and the surrounding context. Knowing whether two
tokens represent one lemma, and what function they have in a sentence, is valuable in text
mining applications. This task is called morphological analysis and is the target of a number
of methods in natural language processing [60].
When full morphological analysis is prohibitive because of its high computational cost, a
simpler alternative is needed. An approximate method of conflating all word forms of a
given lemma to a single identifier (which may or may not be the lexeme) is called stemming;
the algorithm implementing it is a stemmer.
The most common stemmers for English are Martin Porter’s stemmer (or simply the
Porter stemmer) [76], Chris Paice’s stemmer [71] and historically important Beth Lovins’
stemmer [54]. All of them are based on a set of iterative transformation rules converting
the input word into its conflated version (by stripping or replacing sequences of letters).
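The rule-based mechanism shared by these stemmers can be illustrated with a drastically simplified sketch; the rule list below is invented for illustration and is not the actual Porter, Paice or Lovins rule set:

```python
# Ordered (suffix, replacement) rewrite rules, in the spirit of the Porter
# family of stemmers. This is NOT the real Porter rule set -- just an
# illustration of suffix stripping and replacement.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("ing", ""), ("s", "")]

def stem(word):
    """Apply the first matching rule, keeping a minimal stem of 3 letters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

stems = [stem(w) for w in ["caresses", "ponies", "relational", "running", "cats"]]
```

Note that such rules happily produce non-words ("poni", "runn"); a stem need not be a lexeme, only a common identifier for the conflated forms.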
The work on stemmers for Polish is much more recent. Krzysztof Szafran’s morphologi-
cal analyzer sam [92] could be used as a stemmer. Jan Daciuk [14] wrote a dictionary stem-
mer encoded in a finite state automaton. A few commercial stemmers are also available
([33] contains an overview). Recently there has been some interest in building a fast heuristic
stemmer for Polish [97, 75], [B]; see also [98] for an overview. We contributed to this
area with an idea and implementation of a dictionary stemmer called Lametyzator [89] and
then with a hybrid (dictionary and heuristic) stemmer called Stempelator [100]. From the
feedback we received we know that both packages have been successfully used in practice.
It should be clearly stated that the importance of stemming in information retrieval has
been questioned in the past [42]. Word conflation makes it possible to retrieve more documents
for a given query (all documents containing terms that conflate to an identical stem, to be
precise), but it introduces noise into the output by including documents that merely happen
to contain terms conflating to the same stem (and are not relevant to the query). In our earlier
experiments we tried to establish the importance of stemming in search results clustering [89, 101].
According to the results, stemming was almost a necessity for Polish texts, but
the results for English were not as clear-cut, supporting the conclusions reported in the literature.
2.1.3 Finding Phrases
Remembering that the purpose of descriptive clustering is to find a clear, concise and com-
prehensible description of a cluster, let us examine potential candidates that could function
as cluster labels. We could use entire sentences for this purpose, but they seem to be a bit
coarse (the conciseness requirement). Looking for something more fine-grained, we reach the
level of a phrase. As usual, what is intuitively quite clear becomes very hard to define.
There are at least a few different ways of looking at phrases, and hence of extracting them from
the input text. We will take a look at frequency-based methods and text chunking in English
and Polish.
Text Chunking in English
The task of text chunking is about dividing a piece of text into syntactically correlated parts.
It was first defined by Steven Abney in his influential paper Parsing By Chunks [2]. Abney
starts with the following intuitive instructions for identifying a chunk:
• when articulating a sentence, the strongest stresses fall on chunks with pauses be-
tween them,
• the order of words in a chunk is more predictable than the order of chunks.
The above rules are then refined in linguistic terms, which we take the liberty of omitting
here because they are of no relevance to our problem. Instead, let us take a look at the
intuitive example Abney gives:
[I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time].
Figure 2.1: Classification of different noun types in English (based on [8]).
Abney suggests that “a simple, context-free grammar is quite adequate to describe the struc-
ture of chunks” [2] and gives such a grammar for English. The task proved to be not so sim-
ple and research on chunk extraction flourished into an independent research field, espe-
cially with the use of machine learning techniques [78, 110] and statistical methods [12, 74].
The great thing about chunks is that they seem to be “self-contained” short groups inside
the text and should be comprehensible when taken out of their original context — perfect
candidates to build cluster labels. But not all chunks are equally interesting for our purposes.
Most of the vocabulary in any language consists of nouns and, simplifying a lot, nouns are
the elements we usually use when identifying people, places or things (with modifiers and
verbs used to make the message more detailed or identify actions of nouns). We narrow our
interest in this thesis to noun phrases. Let us quote the definition of a noun phrase verbatim
after the Routledge Dictionary of Language and Linguistics [8]:
noun phrase, grammatical category (or phrase) which normally contains a noun
(fruit, happiness, Phil) or a pronoun (I, someone, one) as its head and which can
be modified (specified) in many ways. [. . . ]
The variety of different noun types and noun phrase configurations is quite impressive
(see Figure 2.1). Some of the modifiers mentioned in the definition are: adjuncts, usually
placed before the noun (very good beer), complements in the form of a genitive attribute
(Phil’s house), prepositional phrases (the house on the hill) or relative clauses (the family that lives next
door). A comprehensive overview of a noun phrase can be found in [1] and in [28].
From the above classification of noun phrases we could identify a certain subset that
could function best as comprehensible cluster labels: all concrete nouns (perhaps with mod-
ifiers). Unfortunately, the available software for noun phrase extraction did not offer the
possibility of distinguishing the head noun type, so we did not explore this direction any
further. Summarizing, when we talk about noun phrases in this thesis, we mean any noun
phrase extracted using statistical methods (as described in [60]), although ideally we would
only be interested in the subset identified above.
Text Chunking in Polish
Statistical text chunking in Polish computational linguistics trails behind its English cousin,
mostly because for a long time the electronic resources to work with were very scarce. A
large tagged corpus of Polish texts was published only recently [L], so hopefully we should
see some progress in this area.
Among the existing resources we should mention a few parsers that attempt to decompose
the full grammatical structure of Polish sentences (from which noun phrases could be
extracted). Janusz Bien implemented a formal representation of Polish grammar first described
by Stanisław Szpakowicz [93] and then refined by Zygmunt Saloni and Marek Swidzinski
[81]. Janusz Bien then collaborated with Marcin Wolinski, eventually leading to the
publication of Swigra — Wolinski’s doctoral thesis [62], which attempts to efficiently implement
the formal syntax of Polish described in [81].
Effective use of these parsers in information retrieval applications is problematic. For ex-
ample, Swigra’s average speed of parsing is reportedly around 500 milliseconds per sentence
— too slow to parse thousands of documents. But Marcin Wolinski’s contribution does
not end with the parser; it also includes a highly efficient morphological analyzer for Polish
called Morfeusz [103]. It seemed quite interesting to see if we could build an approximate
noun phrase extractor for Polish based solely on tag sequences. We report our conclusions
from an initial attempt to create such a heuristic chunker in the next section. We should
perhaps justify its presence in the background chapter — we place it here because the out-
comes are still too premature to call them a contribution, but at the same time they provide
some insight into the problems present in phrase extraction in Polish.
Heuristic Chunking in Polish
We started from scratch and analyzed what was known about the syntax of noun phrases in
Polish. Swidzinski introduces an entity intuitively similar to Abney’s chunk: a distributionally
equivalent group. Such groups can be freely replaced or moved around in the syntactical
decomposition of a sentence without sacrificing its correctness. There is one problem: syntactical
decomposition in Polish forms a tree-like structure where nodes may sometimes be
freely pivoted with respect to their parents. A noun phrase composed of several words can
appear in different order in the same text (e.g. author’s intentional stylistic manipulation to
avoid repetition). In the example shown in Figure 2.2, the same sentence appears in two
different word orders — note how fragments of it are pivoted around certain nodes.
Yet another problem with identifying phrases in Polish lies in the fact that words in Polish
undergo inflection and have internal and external constraints that modify their appearance
(usually the suffix), so apart from pivoting, the same noun phrase may also consist of slightly
different word forms each time it occurs in the text.
To have an initial look at the complexity of the problem we performed the following ex-
periment. We parsed a large amount of text in English (articles from the 20-newsgroups data
set [A]) and Polish (articles from the Rzeczpospolita newspaper corpus) and tagged all words
appearing in the input with morphosyntactical categories. For English, we used the MontyLingua
[C] package and a set of part-of-speech tags from the Penn Treebank tagset (approximately
40 active tags). Polish texts were processed with Morfeusz [103]. We created a synthetic
tagset for Polish: each tag was a composition of number, case, gender and person,
moja starsza kolezanka zwykle kupuje wieczorne gazety
gazety wieczorne kupuje zwykle moja starsza kolezanka
Figure 2.2: Decomposition of two equivalent sentences: moja starsza kolezanka zwykle kupuje
wieczorne gazety (my older (girl)friend usually buys evening papers) and gazety wieczorne kupuje
zwykle moja starsza kolezanka (papers evening buys usually my older (girl)friend). Pivoted nodes
marked in circles. Example from [81].
other grammatical elements were ignored. We then extracted the most frequent sequences
of tags in both languages.
The results were easy to predict, but their extent was striking. There seemed to
be almost no tag sequences longer than two elements in Polish, compared to a significant
number of sequences of two, three and more tags in English (see Figure 2.3 on the next page).
This is empirical evidence of the loose word order in Polish — something to worry about
if we want to extract candidate cluster labels using frequent phrases.
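The core of this experiment — counting frequent tag n-grams — can be sketched as follows; the toy Penn-Treebank-style input and the low thresholds are illustrative (the real runs used a cutoff of 50 occurrences):

```python
from collections import Counter

def frequent_tag_sequences(tagged_sentences, max_len=4, min_count=2):
    """Count tag n-grams (length 2..max_len) over tagged sentences and keep
    those occurring at least min_count times."""
    counts = Counter()
    for tags in tagged_sentences:
        for n in range(2, max_len + 1):
            for i in range(len(tags) - n + 1):
                counts[tuple(tags[i : i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

sentences = [
    ["DT", "JJ", "NN", "VBZ"],
    ["DT", "JJ", "NN", "VBD"],
    ["PRP", "VBZ", "DT", "NN"],
]
frequent = frequent_tag_sequences(sentences)
```

Run over a large tagged corpus, the length distribution of the surviving sequences is exactly what Figure 2.3 compares between the two languages.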
Out of sheer curiosity we also created a directed graph of the most frequent tag sequences.
We took a few hundred entries and connected tags (nodes in the graph) with an
edge if they were present in any of the frequent sequences. It turned out to be quite an
interesting form of generative art1 (see Figure 2.4 on page 21), but it also shows the difference in
complexity of part-of-speech sequences between the two languages.
In the end we decided that the construction of a statistical noun phrase chunker for Polish
might be too hard and, at least temporarily, exceeded the scope of this thesis. We did,
however, construct a heuristic chunker — an automaton for detecting several dozen manually
selected frequent tag sequences. The automaton accepted words on its input, established
their tag or tags (using Morfeusz) and triggered an output action whenever the input
sequence matched any of the patterns it was configured to recognize. Note that we do not
perform any morphosyntactical disambiguation — if a given word has more than one pos-
sible tag, we simply spawn more processing states in the automaton. This simple approach
gave quite promising results — see Figure 2.5 on page 22 for several examples of extracted
chunks.
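The automaton idea can be sketched as follows. This is a hypothetical simplification: every word may carry several candidate tags, so instead of disambiguating we advance all matching pattern states in parallel and emit a chunk whenever a pattern is fully matched. The two patterns below are taken from Figure 2.5; the real chunker used several dozen of them and Morfeusz-produced tags.

```python
PATTERNS = [
    ("subst:sg:nom:f", "adj:sg:nom:f:pos"),     # noun + adjective (fem.)
    ("subst:sg:nom:m3", "adj:sg:nom:m3:pos"),   # noun + adjective (masc.)
]

def chunk(tagged_words):
    """tagged_words: iterable of (word, set-of-tags) pairs; yields chunks."""
    active = []  # (pattern, position in pattern, words matched so far)
    for word, tags in tagged_words:
        # spawn a fresh state for every pattern at every input position
        for pattern in PATTERNS:
            active.append((pattern, 0, []))
        survivors = []
        for pattern, pos, words in active:
            if pattern[pos] in tags:           # any of the word's tags may match
                matched = words + [word]
                if pos + 1 == len(pattern):
                    yield tuple(matched)       # output action: pattern complete
                else:
                    survivors.append((pattern, pos + 1, matched))
        active = survivors

sentence = [("wyspa", {"subst:sg:nom:f"}), ("koralowa", {"adj:sg:nom:f:pos"})]
print(list(chunk(sentence)))  # [('wyspa', 'koralowa')]
```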
Phrases as Frequent Sequences of Words
A more pragmatic view on phrases is that each one contains several words that occur in the
same order whenever found in the text [104]. Obviously, more than just a single sentence
1 Generative art is art or design generated, composed or constructed through computer software algorithms [G].
[Plots omitted: panels (a) English and (b) Polish, each plotting occurrences of tag sequence X (50–400) against length of tag sequence X (1–9).]
Figure 2.3: Length distribution of frequent tag sequences in Polish and English. The window
shows tag sequences at most 9 terms long and occurring more than 50 times in the input.
[Figure omitted: two large node-link graphs of part-of-speech tags with transition counts on the edges, panels (a) English and (b) Polish.]
Figure 2.4: An ordered graph of the most frequent tag sequences in English and Polish (each
tag sequence starts on the left). The complex relationships between Polish tags are quite
evident, although we include this illustration only as eye candy.
Id: 732, tags [subst:sg:nom:f, adj:sg:nom:f:pos]
i wyzej wyspa koralowa pod nazwa
dzien dzisiejszy siła zbrojna narodu
traktowany jak wina osobista , przedsionek
załoba spowita klasa robotnicza chyliła czoła
poniewaz maksymalna moc przerobowa przedsiebiorstw jest
sie wytrawna grupa przestepcza , w
dwóch synów pani owa ma jeszcze
obu jest odkrywczosc tematyczna
Mikrogrupa społeczna , jaka
Sprawa ta była niedawno
Id: 623, tags [subst:sg:nom:m3, adj:sg:nom:m3:pos]
ze polski film krótkometrazowy miewa sie
nie tylko styl poetycki , to
Kosciół katolicki znalazł sie
zagadnienie jak Teatr polski podczas wojny
I program ten ma wówczas
( to grzech ciezki ) ,
( to grzech smiertelny )
zycia społecznego zwiazek ten ma szereg
sali był zespół eskimoski , jutro
tego powodu artykuł ten bedzie miał
tanca – fakt ten nie musi
Id: 606, tags [subst:sg:nom:m1, subst:sg:nom:m1]
Ale ktos kto zdany jest
Inaczej Rudolf Augstein , znany
wojnie swiatowej Bolesław Strzelewicz zerwał z
czwartego roku Wiesław Gomułka mówił miedzy
wygłosił przemówienie poseł Zygmunt Marek ,
przemówienie poseł Zygmunt Marek , składajac
do NRF Alfred Döblin , emigrujac
i nie doktor Borowik sa odpowiedzialni
oficjalnie był Wacław Gebethner , a
redaktorem rzeczywistym doktor Stanisław Rogoz
rzeczywistym doktor Stanisław Rogoz
panstwowosci ( profesor Bogusław Lesnodorski )
( profesor Bogusław Lesnodorski ) ,
Figure 2.5: A few phrases extracted for three example tag patterns encoded in the heuristic
chunker discussed on page 19. Chunk phrases are in the center column; the left and right
context of each phrase is also shown. We marked in red the phrases we considered invalid
(not of maximum length or ambiguous).
is needed to successfully apply this definition to phrase extraction, but when applicable
it turns out to be a remarkably effective heuristic. Moreover, any algorithm suitable for finding
frequent sequences of items can perform phrase extraction, and there is a wide selection of
efficient methods to choose from. In this thesis we will use suffix trees (in Descriptive k-
Means) and suffix arrays (in Lingo). We characterize the most important elements of both
methods below.
A suffix tree [95, 27, 51] is a tree where all suffixes of a given sequence of elements can
be found on the way from the root node to a leaf node. What makes this data structure
efficient is that a single node in the tree may contain more than one element of the sequence
(this feature distinguishes it from another data structure — the suffix trie). Figure 2.6(a) shows
an example suffix tree for the word mississippi.
An interesting property of suffix trees is that any path starting at the root node and end-
ing in an internal node denotes a subsequence of elements that occurred at least twice in
the input (take a look at the node corresponding to character sequence issi in Figure 2.6(a)
on the following page). This observation leads to a simple and effective frequent phrase
extraction algorithm: build a suffix tree such that each element of the input sequence is a
single word and analyze the internal nodes — a path from the root node to each internal
(a) Suffix tree. Highlighted is the internal node and the path to the root for the subsequence
“issi”. [Tree drawing omitted.]

(b) Suffix array. Highlighted is the continuous block for the subsequence “issi”:

    index  substring
       10  i
        7  ippi
        4  issippi
        1  ississippi
        0  mississippi
        9  pi
        8  ppi
        6  sippi
        3  sissippi
        5  ssippi
        2  ssissippi
Figure 2.6: A suffix tree and a suffix array for the word mississippi.
node is a frequent phrase.
Suffix trees have become very popular mostly due to the low computational cost of their
construction — linear in the size of the input sequence. Their practical implementation
is quite tricky due to the non-locality of tree traversal (a number of algorithms have been
proposed to overcome this issue).
Another data structure permitting frequent phrase detection is the suffix array [57, 61]. A
suffix array is a sorted array of all suffixes of a given input sequence. Figure 2.6(b) illustrates
a suffix array for the same word mississippi we used earlier. Frequent phrase extraction is
implemented by scanning the suffix array top-to-bottom, looking for continuous blocks of
identical prefixes (of maximum length) [109, 19].
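The scanning step can be sketched at the word level as follows. This is a simplified illustration under the naive sorting construction; efficient implementations avoid both the O(n log n) generic sort and the repeated prefix comparisons.

```python
# Frequent phrase extraction with a word-level suffix array: sort all
# suffixes, then identical prefixes of neighbouring suffixes mark word
# sequences occurring more than once (they form continuous blocks).
def frequent_phrases(words, min_count=2):
    n = len(words)
    sa = sorted(range(n), key=lambda i: words[i:])   # naive suffix sort
    counts = {}
    for a, b in zip(sa, sa[1:]):
        # longest common prefix (in words) of two neighbouring suffixes
        k = 0
        while a + k < n and b + k < n and words[a + k] == words[b + k]:
            k += 1
        for length in range(1, k + 1):
            phrase = tuple(words[a:a + length])
            counts[phrase] = counts.get(phrase, 0) + 1
    # a phrase shared by m neighbouring suffixes is incremented m-1 times
    return {p: c + 1 for p, c in counts.items() if c + 1 >= min_count}

text = "singular value analysis of singular value computations".split()
print(frequent_phrases(text))   # ('singular', 'value') occurs twice
```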
Algorithms for building suffix arrays have much better properties (locality of computation,
memory consumption); in fact, the only information we really need to store in a suffix
array is an index to the first element of each suffix. Of course, a straightforward construction
of a suffix array using generic sorting routines would slow the algorithm down to the
order of O(n log n) (assuming, a bit unrealistically, that string comparisons take O(1) time).
More efficient algorithms that preserve the desired properties of suffix arrays have been suggested
in the literature [45, 51].
Summarizing this section: with the help of suffix trees and suffix arrays we can locate frequently
recurring subsequences of words in the input. Anticipating further sections, we will
use frequent phrases as document features (for detecting similar documents) and, primarily,
to construct cluster labels. A number of problems arise from these applications of frequent
phrases.
• A frequent sequence of words is not necessarily a good phrase — it can be a meaningless
collocation (like: vitally important) or a frequent structural element of a language
(like: out of, it is).
• A frequent sequence can be virtually any junk that just happens to be in the input. For
example, when searching for frequent phrases in a corpus of mailing list messages, all
messages starting long discussion threads end up as frequent because they are usually
cited in replies.
• Not all phrases are sequences. We have already pointed out that in Polish the same
phrase can be rewritten in many ways with different word order and still be perfectly
comprehensible (as in: sto lat samotnosci, lat sto samotnosci, samotnosci sto lat,
samotnosci lat sto).
• Essentially the same phrase may be non-continuous (interrupted by other words or
word forms). For example, compare: Franklin D. Roosevelt and Franklin Delano Roosevelt,
or Earvin Johnson and Earvin “Magic” Johnson.
2.2 Text Representation: Vector Space Model
To assess the similarity or dissimilarity of two or more documents, we need a model in which
these operations are defined. The model is usually selected to match a particular task's
requirements and objectives. Several text representation models have been suggested in the
literature; a good overview of them can be found in [4]. To keep this chapter's size reasonable we
will focus only on the Vector Space Model (vsm) and the elements it consists of: document
indexing, feature weighting and similarity coefficients.
2.2.1 Document Indexing
Vector Space Model2 uses the concepts of linear algebra to address the problem of representing
and comparing textual data.
A document d is represented in the vsm as a document vector [wt0 , wt1 , . . . , wtΩ ], where
t0, t1, . . . , tΩ is a set of words of a given language and wti expresses the weight (importance)
of term ti to document d. Weights in a document vector typically reflect the distribution of
words in that document. In other words, the value wti in a document vector d represents
the importance of word ti to that document.
Components of the document's vector are commonly called its features because their
collection provides a footprint of the document's contents. Note that we can hardly speak
about the meaning of a document vector anymore, since it is basically a collection of unrelated
terms. For this reason, the vsm is sometimes called a bag-of-words model. The process
of translating input documents into their term vectors is called document indexing.
Given a set of documents, their document vectors can be put together to form a matrix
called a term-document matrix. The value of a single component of this matrix depends on
the strength of the relationship between a document and the given term. An example with the
following input documents (each consisting of a single sentence) demonstrates this.
2 Vector Space Model is usually credited to Gerard Salton, although the concept had already been known in the
literature when Gerard Salton started to use it — [21] is an interesting essay on the subject.
document content
d0 Large Scale Singular Value Computations
d1 Software for the Sparse Singular Value Decomposition
d2 Introduction to Modern Information Retrieval
d3 Linear Algebra for Intelligent Information Retrieval
d4 Matrix Computations
d5 Singular Value Analysis of Cryptograms
In the first step, we identify all possible terms appearing in the input and build a matrix
where columns correspond to terms and rows correspond to input documents. We exclude
certain terms that we know are not useful for identifying the topic of a document (these
are called stop words) and restrict the presentation to just a few selected terms with at least
one non-zero weight. On the intersection of each column and each row we place the count
(number of occurrences) of the column's term in the row's document. For our example
input, the term-document matrix looks as shown below.
         Information  Scale  Analysis  Singular  Value  ...
d0            0         1       0         1        1
d1            0         0       0         1        1
d2            1         0       0         0        0
d3            1         0       0         0        0
d4            0         0       0         0        0
d5            0         0       1         1        1
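The indexing step above can be reproduced with a short sketch. The stop-word list and lower-casing below are illustrative choices for this toy example, not the preprocessing pipeline described earlier in the chapter.

```python
# Build a term-document matrix of raw counts, skipping stop words.
# Rows correspond to documents, columns to terms.
STOP_WORDS = {"for", "the", "to", "of"}

docs = [
    "Large Scale Singular Value Computations",
    "Software for the Sparse Singular Value Decomposition",
    "Introduction to Modern Information Retrieval",
    "Linear Algebra for Intelligent Information Retrieval",
    "Matrix Computations",
    "Singular Value Analysis of Cryptograms",
]

tokenized = [[w.lower() for w in d.split() if w.lower() not in STOP_WORDS]
             for d in docs]
terms = sorted({w for doc in tokenized for w in doc})
matrix = [[doc.count(t) for t in terms] for doc in tokenized]

col = terms.index("singular")
print([row[col] for row in matrix])   # occurrences of "singular": [1, 1, 0, 0, 0, 1]
```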
Two questions arise from the example. First, an average document will contain just
a small subset of all possible words of a given language, which is additionally an unconstrained
set. We can solve this problem in the simplest way by storing just the indices
of terms for which the document has non-zero weights (as we did in the example). More
advanced techniques encode gaps between indices or use other forms of bit packing and
coding [102, 4].
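The simplest sparse storage scheme can be sketched in a couple of lines; the gap-encoding and bit-packing techniques cited above build on the same index list.

```python
# Keep only non-zero weights as (term index -> weight) pairs.
dense = [0, 1, 0, 1, 1]                       # d0 from the example above
sparse = {i: w for i, w in enumerate(dense) if w != 0}
print(sparse)                                 # {1: 1, 3: 1, 4: 1}
```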
The second problem is related to our prospective application — measuring similarity
between documents. Certain words, such as pronouns, conjunctions or prepositions, occur
frequently in any document and are useless as features. We can simply ignore such words
(by adding them to the set of stop words), but a better idea is to recalculate weights in
document vectors in a way that highlights words that are more important for a given document
with respect to others and downplays words that are very common. This task is called
feature weighting and we list a few known weighting methods in the next section.
2.2.2 Feature Weighting
Feature weighting methods can be divided into local (only one document's term counts are
available) and global (term counts of all documents are available). We list the weighting schemes
that find application somewhere in this thesis. A full overview of the subject can be found
in [83, 4].
The following notation is used: tf_ij — the number of occurrences of term i in document j;
df_i — the number of documents containing term i in the entire collection; w(i, j) — the weight of
term i in document j; N — the number of all documents in the collection.
Term Frequency, Inverse Document Frequency Certainly the most widely known feature
weighting formula, usually abbreviated to the acronym tf-idf. Credited to Gerard Salton [82],
tf-idf tries to balance the importance of a word in a document with how common it is in the
entire collection.

w(i, j) = \mathrm{tf}_{ij} \times \log_2 \frac{N}{\mathrm{df}_i} \qquad (2.1)
Modified tf-idf This modification of the original tf-idf downplays the count of terms in a
document and contains certain algebraic modifications for faster calculation of w(i, j) on a
precached index. It is implemented in the document retrieval library Lucene [H].

w(i, j) = \sqrt{\mathrm{tf}_{ij}} \times \left( \log_e \frac{N}{\mathrm{df}_i + 1} + 1 \right) \qquad (2.2)
Pointwise Mutual Information A widely used weighting scheme, although known to be
biased towards infrequent events (terms) — an interesting discussion can be found in [60].
We show practical implications of this property in our experiments in chapter 7 on page 79.
w(i, j) = \log_2 \frac{\mathrm{tf}_{ij} / N}{\left( \sum_k \mathrm{tf}_{ik} / N \right) \times \left( \sum_k \mathrm{tf}_{kj} / N \right)} \qquad (2.3)
Discounted Mutual Information Similar to pointwise mutual information, but multiplied
by a discounting factor to compensate for the problems mentioned above (formula after
[53]).

w(i, j) = \mathrm{mi}_{ij} \times \frac{\mathrm{tf}_{ij}}{\mathrm{tf}_{ij} + 1} \times \frac{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right)}{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right) + 1} \qquad (2.4)
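Equation 2.1 applied to the example term-document matrix from the previous section can be sketched as follows (a minimal illustration of the weighting step, not an optimized indexer):

```python
import math

# tf-idf weighting: rows of `matrix` are documents, columns are terms.
def tf_idf(matrix):
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # df_i: number of documents containing term i
    df = [sum(1 for row in matrix if row[i] > 0) for i in range(n_terms)]
    return [[row[i] * math.log2(n_docs / df[i]) if df[i] else 0.0
             for i in range(n_terms)]
            for row in matrix]

counts = [[0, 1, 0, 1, 1],     # columns: Information, Scale, Analysis,
          [0, 0, 0, 1, 1],     #          Singular, Value
          [1, 0, 0, 0, 0],
          [1, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 1, 1, 1]]
weights = tf_idf(counts)
print(round(weights[0][3], 3))   # "Singular" in d0: 1 * log2(6/3) = 1.0
```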
2.2.3 Similarity Coefficients
Two documents in the Vector Space Model represent two points in a multidimensional term
space (each term is assumed to be an independent dimension). If we define a notion of
distance in this space, we can compare documents against each other and thus start looking
for similarities or dissimilarities.
Any distance metric defined on a multidimensional vector space is applicable, but two
methods are widely used: Euclidean distance and the cosine measure.
A simple Euclidean distance is quite often used, but it requires document vector length
normalization prior to calculation, or the number of words (proportion of weights) in each
document will distort the result.
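A small sketch of why normalization matters: two documents with the same term proportions but different lengths appear distant until their vectors are normalized.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

short_doc = [1, 1, 0]
long_doc = [10, 10, 0]          # same proportions, ten times longer
print(euclidean(short_doc, long_doc))                                  # large: ~12.73
print(round(euclidean(normalize(short_doc), normalize(long_doc)), 6))  # 0.0
```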
The cosine measure is a more robust technique, stemming from the observation that if two
vectors have approximately the same features then they should “point” in a very similar direction
in the space determined by the term-document matrix, regardless of their Euclidean
distance. To calculate the similarity between two documents we need to look at the angle between
them, which we can compute using the dot product of their document vectors.
To simplify things even more, we can use the cosine of this angle, which is easier to compute
(it does not require an inverse trigonometric function). We hence define the cosine measure of similarity
between the vector representations of documents d_i and d_j in the term vector space as:

sim(d_i, d_j) = \cos(\alpha) = \frac{d_i \cdot d_j}{|d_i| |d_j|}, \qquad (2.5)

where x · y denotes the dot product between vectors x and y and |x| is the norm of vector x.
The cosine measure is widely used in text clustering and many other text processing applications
because its definition is quite intuitive and its implementation efficient. However,
it is also known that in highly dimensional spaces any two random vectors are very
likely to be orthogonal. An attempt to solve this problem is to reduce the dimensionality
of the feature space using feature selection, feature construction or term-document matrix
decomposition techniques [4].
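Equation 2.5 can be sketched directly, reusing the example document vectors from the term-document matrix above:

```python
import math

# Cosine similarity between two document vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

d0 = [0, 1, 0, 1, 1]   # Large Scale Singular Value Computations
d1 = [0, 0, 0, 1, 1]   # Software for the Sparse Singular Value Decomposition
d2 = [1, 0, 0, 0, 0]   # Introduction to Modern Information Retrieval
print(round(cosine(d0, d1), 3))   # shared terms: 0.816
print(cosine(d0, d2))             # no shared terms, orthogonal: 0.0
```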
2.3 Document Clustering
2.3.1 Introduction
Let us start with a general definition of clustering after Brian Everitt et al. [22]:

Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteristics
of each class are to be determined.

By analogy to the above definition, document clustering, or text clustering, can be defined
as a process of organizing pieces of textual information into groups whose members are
similar in some way, and groups as a whole are dissimilar to each other. But before we delve
into text clustering, let us take a look at clustering in general.
There are many kinds of clustering algorithms, suitable for different types of input data
and diverse applications. A great deal depends on how we define similarity between objects.
We can measure similarity in terms of objects' proximity (distance), or as a relation
between the features they exhibit. An intuitive demonstration of this difference is shown in
Figure 2.7 on the following page — the same set of objects is grouped depending on their
relative distance, or their features — shape and color.
Brian Everitt et al. suggest the following classification of clustering methods [22]:
• hierarchical techniques — in which clusters are recursively grouped to form a tree,
Figure 2.7: The same group of objects (1) “clustered” based on their relative distance (2) and
features they exhibit — shape (3) and color (4).
• optimization techniques — where clusters are formed by the optimization of a clustering
criterion,
• density or mode-seeking techniques — in which clusters are formed by searching for
regions containing a relatively dense concentration of entities,
• clumping techniques — in which classes or clumps can overlap,
• others — methods which do not fall clearly into any of the above.
Alternative classifications of clustering algorithms can be suggested, depending on the
aspect we look at. In our opinion it is worthwhile to take a look at several aspects. Looking
at the structure of discovered clusters we can distinguish flat and hierarchical clustering
algorithms. Depending on the type of assignment between documents and clusters we can
have:
• partitioning algorithms — which assign each document to exactly one cluster,
• clumping techniques — described above; note that this type of clustering is natural
and desirable for texts because a single document can be assigned to more than one
topic,
• partial clustering — algorithms which may leave some objects entirely unassigned; in
this thesis we will use the term “others” to refer to a synthetic group of unclustered
objects.
Finally, the classification can be made depending on the strength of relationship between
an object and a cluster:
Figure 2.8: A simplified visual representation of different clustering algorithms: partitioning,
overlapping and partial assignment of objects to groups; hierarchical structure; binary and
other types of assignment.
• crisp clustering — with a binary assignment when a document is either assigned to a
cluster or not assigned to it,
• fuzzy clustering — when the degree of assignment is expressed on the scale of “not
associated” to “fully associated”, typically with a number between 0 and 1.
Figure 2.8 depicts different ways of looking at clustering algorithms depending on their char-
acteristics.
As a final element of this section, we should mention another interesting aspect of clus-
tering, related to our transparency requirement. Mark Sanderson and Bruce Croft [84] divide
clustering methods depending on how many features contributed to the inclusion of a given ob-
ject in a cluster. They distinguish monothetic algorithms, which assign objects to clusters based
on a single feature, and polythetic algorithms, which use multiple features. Our work is a bit
of both. We try to make the results monothetic (transparent relationship of cluster labels
to documents), but at the same time we use polythetic clustering algorithms for detecting
groups of documents.
2.3.2 Overview of Selected Clustering Algorithms
Cluster analysis is a very broad field and the number of available methods and their vari-
ations can be overwhelming. A good introduction to numerical clustering can be found in
Brian Everitt’s Cluster Analysis [22] or in Allan Gordon’s Classification [30]. A more up-to-
date view of clustering in the context of data mining is available in Jiawei Han and Miche-
line Kamber’s Data Mining: Concepts and Techniques [35]. Shorter surveys on the topic
are also available in [6] and in [41]. Resources in the Polish language include, for example, a
chapter on clustering methods in Jacek Koronacki and Jan Cwik’s book Statystyczne systemy
uczace sie [46] and a Polish translation of David Hand, Heikki Mannila and Padhraic Smyth’s
Principles of Data Mining (Eksploracja Danych) [36]. We should emphasize again that most
text clustering algorithms attempt to transform the input text into a mathematical repre-
sentation directly suitable for use with numerical clustering algorithms, so any book
about cluster analysis will be relevant to the topic of this thesis.
In the following part of this section we describe a few selected clustering algorithms that
are important from the point of view of further chapters.
Partitioning Methods
Partitioning clustering methods divide the input data into disjoint subsets attempting to find
a configuration which maximizes some optimality criterion. Because enumeration of all
possible subsets of the input is usually computationally infeasible, partitioning clustering
employs an iterative improvement procedure which moves objects between clusters until
the optimality criterion can no longer be improved.
The most popular partitioning algorithm is the k-Means algorithm. In k-Means, we de-
fine a global objective function and iteratively move objects between partitions to optimize
this function. The objective function is usually a sum of distances (or sum of squared dis-
tances) between objects and their clusters' centers, and the objective is to minimize it. As-
suming K is a set of clusters, t_i ∈ K is a cluster (a set of objects), C_{t_i} is the representation of
a cluster's center and d is an element, we try to minimize the following expression:

\sum_{t_i \in K} \sum_{d \in t_i} \mathrm{distance}(d, C_{t_i}) \qquad (2.6)
The representation of a cluster can be an average of its elements (its centroid) or a medoid
(the object closest to the centroid of a cluster). In the latter case we call the algorithm
k-Medoids. Given the number of clusters k a priori, a generic k-Means procedure is imple-
mented in four steps:
1. partition objects into k nonempty subsets (most often randomly),
2. compute representation of centers for current clusters,
3. assign each object to the closest cluster,
4. repeat from step 2 until no more reassignments occur.
By moving objects to their closest partition and recalculating partitions' centers in each step
the method eventually converges to a stable state, which is usually a local optimum.
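The four-step procedure above can be sketched in a few lines of Python. The function and parameter names (`distance`, `centroid`) are ours, chosen for illustration; a production implementation would be considerably more careful about initialization and degenerate cases:

```python
import random

def k_means(objects, k, distance, centroid, max_iter=100):
    """A sketch of generic k-Means; `distance` and `centroid` are
    problem-specific functions supplied by the caller."""
    # Step 1: partition objects into k non-empty subsets, randomly.
    assignment = [i % k for i in range(len(objects))]
    random.shuffle(assignment)
    centers = []
    for _ in range(max_iter):
        # Step 2: compute the representation of each cluster's center;
        # an empty cluster falls back to a randomly drawn object.
        centers = []
        for j in range(k):
            members = [o for o, a in zip(objects, assignment) if a == j]
            centers.append(centroid(members) if members else random.choice(objects))
        # Step 3: assign each object to the closest cluster.
        new_assignment = [min(range(k), key=lambda j: distance(o, centers[j]))
                          for o in objects]
        # Step 4: repeat until no more reassignments occur.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centers
```

For points on a plane, `distance` would be the Euclidean distance and `centroid` the coordinate-wise mean of a cluster's members.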
We discuss the computational complexity of k-Means in later sections; for now let us just
comment that the entire procedure is efficient in practice and usually converges in just a
few iterations on non-degenerate data. Another thing worth mentioning is that clusters
created by k-Means are spherical with respect to the distance metric — the algorithm is
known to have problems with non-convex, and in general complex, shapes.
Hierarchical Methods
A family of hierarchical clustering methods can be divided into agglomerative and divisive
variants. Agglomerative Hierarchical Clustering (ahc) initially places each object in its own
cluster and then iteratively combines the closest clusters, merging their content. The cluster-
ing process is interrupted at some point, leaving a dendrogram with a hierarchy of clusters.
Many variants of hierarchical methods exist, depending on the procedure of locating
pairs of clusters to be merged. In the single link method, the distance between clusters is the
minimum distance between any pair of elements drawn from these clusters (one from each),
in the complete link it is the maximum distance and in the average link it is correspondingly
an average distance (a discussion of other merging methods can be found in [22]). Each of
these has a different computational complexity and runtime behavior. The single link method
is known to follow "bridges" of noise and link elements in distant clusters (a chaining ef-
fect). The complete link method is computationally more demanding, but is known to produce
more sensible hierarchies [30, 4]. The average link method is a trade-off between speed and
quality, and efficient algorithms for its incremental calculation exist, such as in the Buck-
shot/Fractionation algorithm [13].
Typical problems in hierarchical methods include finding a stop criterion for the cluster
merging process [20], tuning parameters and finding a method of "flattening" dendrogram
levels to create clusters with more than two subgroups.
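The agglomerative merging loop itself is easy to sketch. The following naive single-link version (identifiers are ours, for illustration) runs in cubic time; practical implementations rely on much better data structures, but the merging logic is the same:

```python
def ahc_single_link(objects, distance, k):
    """Naive agglomerative clustering sketch: repeatedly merge the two
    closest clusters under the single-link criterion until k remain."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: minimum distance over all cross-cluster pairs.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the two closest clusters
        del clusters[j]
    return clusters
```

Replacing the inner `min` with `max` or an average turns this into the complete link or average link variant, respectively.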
Clustering Based on Phrase Co-occurrence
The Internet brought a new challenge to the task of text clustering: incomplete data. Par-
titioning and hierarchical methods typically used vector space representation and required
verbose input (full documents). Web pages and mailing lists are typically shorter and snip-
pets found in search results — fragments retrieved from documents matching the query —
are an extreme example of incomplete data, often just a few words long. These new types of
data had to be reflected in novel approaches to clustering.
Algorithms utilizing phrase co-occurrence use frequently recurring sequences of terms
as features of similarity between documents. Assuming that documents discussing related
subjects should use similar vocabulary and phrasing, frequent phrases can be used to iden-
tify documents discussing the same (or related) topics. Note that what really makes the dif-
ference is the use of variable-length features that are later used for describing the discovered
clusters. This idea first appeared in the Suffix Tree Clustering (stc) algorithm, published in a
seminal paper by Oren Zamir and Oren Etzioni [105].
Suffix Tree Clustering works in two phases: first it discovers base clusters (groups of doc-
uments that share a single frequent phrase) and then merges base clusters together to form
the output.
The discovery of base clusters starts from segmenting the input into words and sen-
tences. Each sentence is essentially a sequence of words and is as such inserted into a gen-
eralized suffix tree. A generalized suffix tree is similar to a suffix tree, but contains suffixes of
more than one input sequence. Internal nodes in the tree also keep pointers to the sequences a
given suffix originated from. This way each internal node of the tree represents a sequence
of elements that occurred at least twice in the input, together with the sentences it occurred in.
Figure 2.9: A generalized suffix tree for three sentences: (1) cat ate cheese, (2) mouse ate
cheese too and (3) cat ate mouse too. Paths to internal nodes (circles) contain phrases that
occurred more than once in documents indicated by children nodes (rectangles). The dollar
symbol is used as a unique end-of-sentence marker. Example after: [105].
After all the input sentences have been added to the suffix tree, the algorithm traverses
the tree’s internal nodes looking for phrases that occurred a certain number of times in more
than one document. Any node exceeding the minimal count of documents and phrase fre-
quency immediately becomes a base cluster. Figure 2.9 shows a generalized suffix tree with
three short example phrases.
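The base-cluster phase can be imitated without a suffix tree at all. The sketch below (our names, for illustration) naively enumerates every contiguous word sequence of every sentence instead of building a generalized suffix tree, so it loses the linear-time property, but it finds the same shared phrases:

```python
from collections import defaultdict

def base_clusters(documents, min_docs=2):
    """Illustration of stc's first phase only: find every phrase (a
    contiguous word sequence) shared by at least `min_docs` documents.
    Each document is given as a list of sentences."""
    phrase_docs = defaultdict(set)
    for doc_id, sentences in enumerate(documents):
        for sentence in sentences:
            words = sentence.split()
            # Every contiguous subsequence corresponds to a path
            # leading to some node of the generalized suffix tree.
            for i in range(len(words)):
                for j in range(i + 1, len(words) + 1):
                    phrase_docs[tuple(words[i:j])].add(doc_id)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}
```

Run on the three sentences of Figure 2.9, it reports, among others, the phrase "cat ate" as shared by documents 1 and 3.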
The strongest element of stc is its use of a proper data structure — suffix tree construc-
tion is linear and permits very fast and convenient base cluster detection. Interestingly,
after this clever step, stc falls back to a simple single-link merge procedure where base clus-
ters that overlap too much are iteratively combined into larger clusters and the process is
repeated. This step is not fully justified and may result in merging base clusters that should
not be merged. An improved version of the algorithm was suggested by Irmina Masłowska
in [63, 64]. Nonetheless, Suffix Tree Clustering was the first algorithm to emphasize the im-
portance of comprehensible cluster labels and it was very inspiring for other authors. We
list several spin-off algorithms utilizing phrase co-occurrence in Section 2.4 on page 34.
Other Clustering Methods
A number of other clustering methods are known in the literature — density-based methods,
model-based and fuzzy clustering, self organizing maps and even biology-inspired algo-
rithms. An interested Reader can find many surveys and books providing comprehensive
information on the subject [22, 30, 6, 36, 46].
2.3.3 Applications of Document Clustering
Applications of text clustering algorithms changed over time due to record-breaking
data availability, new ideas and algorithms, and new types of input data. We summa-
rize a list of major text clustering applications because it nicely outlines the evolution of
clustering methods from a background utility for modelling similarities among objects to
a first-hand part of the user experience.
Improving Document Retrieval Efficiency
The initial application of text clustering was in document retrieval. Keith Van Rijsbergen ob-
served that “Closely associated documents tend to be relevant to the same requests” (cluster
hypothesis). Clustering was applied to a collection of documents prior to searching to detect
similar groups of documents. When the user typed a query, an information retrieval algo-
rithm retrieved documents matching the query and documents in their clusters to improve
recall. Note that clusters were never explicitly revealed to the user, so there was no need to
describe them.
Organizing Large Document Collections
Document retrieval focuses on finding documents relevant to a particular query, but it fails
to solve the problem of making sense of a large number of uncategorized documents. The
challenge here is to organize these documents in a taxonomy identical to the one humans
would create given enough time and use it as a browsing interface to the original collection
of documents.
Several large conferences, such as the Text Retrieval Conference (trec), published ref-
erence data with samples of documents clustered manually by humans. The ongoing work
in this area focused primarily on replicating the man-made taxonomy as closely as possible,
maximizing the score calculated as a conformity to predefined document-to-cluster assign-
ments. Comprehensibility of cluster descriptions was usually neglected because it was not a
direct factor affecting the score.
Browsing Document Collections
The observation that clusters alone present a certain value to the user of an information
retrieval system is very important and was first noticed by Marti Hearst and Jan Pedersen
in their paper about the Scatter/Gather search system [37]. By scanning the description of a
cluster the user can assess the relevance of the remaining documents in that cluster and
find the interesting information faster (or at least identify the irrelevant clusters and avoid
them). The techniques for extracting cluster descriptions were very simple: selected titles of
documents within the cluster, excerpts of documents and keywords.
Duplicate Content Detection
In many applications there is a need to find duplicates or near-duplicates in a large number
of documents. Clustering is employed for plagiarism detection, grouping of related news
stories and to reorder search results rankings (to assure higher diversity among the topmost
documents). Note that in such applications the description of clusters is rarely needed.
Integration with Search Engines
Modern Internet and intranet search engines index countless Web pages, docu-
ments and news articles, and can search all this content very fast. Users simply rephrase the
query if the information they need does not show up at the top of search results and they
rarely need a full taxonomy of documents to find what they need. However, in some situa-
tions when the query is ill-defined, or the information need is not clear (an overview-type
query for example), a different type of text clustering may be helpful — search results clus-
tering. Search results clustering is about clustering each query’s results and presenting the
user with an overview of what the result set contains. The clusters can be used to filter out
irrelevant hits or to refine the query with additional terms.
We mentioned this type of clustering a few times already, but let us underline the key
elements of difficulty again. The input information for the algorithm is very limited: only
the titles and snippets are available. The algorithm must be very fast to avoid slowing down
the user interface of a search engine. Finally, the clusters must be accurately and clearly
described because the user expects an overview of topics related to the query and does not
have the time to guess the meaning of clusters described using, for example, bare key-
words.
The concepts presented in this thesis are applicable whenever the description of clusters
needs to be shown to the user. The most likely targets among the applications presented
above are in document collection browsing and search results clustering.
2.4 Related Works
The purpose of this section is to present currently available algorithms and methods that
closely correspond to the ideas presented in this thesis.
Clustering
Antonio Gulli and Paolo Ferragina [24, 23] start from the limitations of Grouper — stc's ini-
tial implementation — and build an algorithm called SnakeT (see errata E-17). SnakeT uses non-contiguous
phrases as features, which the authors call approximate sentences. The criterion forming
a cluster is still (as in stc) the fact of sharing a sufficient number of approximate phrases.
The implementation and design of the algorithm are far more complex than stc's, using cus-
tom data structures similar to frequent itemset detection in data mining, but the authors
pay attention to cluster label comprehensibility and enrich cluster descriptions with data
extracted from a predefined ontology.
In [48], the authors present a search results clustering algorithm which attempts to associate
documents with a single concept where labels are chosen so that “they are good indicators
of the documents they contain”. The algorithm uses frequent terms, but also preprocessed
noun phrases. Unfortunately, as the authors put it: “[stems] are not usually very meaning-
ful for use as node labels, therefore we replace each stemmed term by the most frequently
occurring original term". Note that such a heuristic would obviously fail for documents in Pol-
ish. The provided screenshots show that the generated cluster labels are mostly based on
single words.
2.4. Related Works 35
Hotho, Staab and Stumme build a Conceptual Clustering system that refines cluster de-
scriptions using concept lattices [39, 38]. Their cluster descriptions are still single words but
they use a large thesaurus and formal concept analysis to avoid repetitions and synonyms
in cluster keywords.
In [107], the authors perform an interesting experiment with supervised training of a clus-
ter label selection procedure. First, a set of fixed-length word sequences (the article reports
3-grams) is created from the input. Each label is then scored with an aggregative formula
combining several factors: phrase frequency, length, intra-cluster similarity, entropy and
phrase independence. The specific weights and scores for each of these factors are learnt
from examples of manually prioritized cluster labels.
Pantel and Lin [53, 72] present a very interesting clustering algorithm called Clustering
with Committees, which builds clusters around small groups of strongly associated (most sim-
ilar) documents, called committees. Because committees are so strongly related with their
set of features, they usually point to an unambiguous concept (which the authors even evaluate
using semantic relationships from WordNet). The cluster description remains a list of
strong features, but a hopefully unambiguous one. Pantel recently attempted to label the output
classes with more “semantic” labels [73], but this work goes definitely deeper into natural
language processing than information retrieval.
A concept of clustering combined with pattern selection (similar to the dcf approach)
appears in [108], where the authors present a classification system which uses clusters to select
labeled objects and expand the set of labeled objects with elements from within the cluster
to improve classification. Cluster descriptions are not part of the consideration.
Mark Sanderson and Bruce Croft [84] present a completely different, yet related ap-
proach to exploring document collections. Instead of clustering input documents, they start
with salient terms and phrases taken from predefined queries to a document collection and
expand this set with a technique called Local Context Analysis. Once a big enough collec-
tion of terms and phrases is gathered, it is automatically organized into a hierarchy, starting
with most generic terms at the top and descending to most detailed ones at the bottom.
The technique used by the authors is very interesting as it involves no clustering techniques, yet
provides a hierarchy of (quite comprehensible) descriptions of groups of documents in the
output. The disadvantage is that the authors bootstrap their method with a predefined set of
queries, which would be unavailable for another collection of documents.
An interesting cluster labeling procedure is also shown in the Weighted Centroid Cover-
ing algorithm [90]. The authors start from a representation of clusters (centroids of document
groups) and then build their (word-based) descriptions by iterative assignment of highest
scoring terms to each category, making sure each unique term is assigned only once. Inter-
estingly, the authors point out that this kind of procedure could be extended to use existing
ontologies and labels, but they provide no experimental results of any kind.
Summarization
Similarities to our work can be found in the field of document summarization and especially
multi-document text summarization. The goal of summarization is to present a concise tex-
tual summary of a single document or a group of documents. What differs summarization
from descriptive clustering is that no document groups are ever shown in the results, the
output consist of exactly one summary, usually longer compared to cluster labels.
Summarization dates back to 1958, when Hans Peter Luhn published a pioneering paper
called The Automatic Creation of Literature Abstracts [55]. The algorithm works by forming a
definite set of keywords and then looking for sentences containing these keywords, assem-
bling an abstract from the highest scoring sentences. The majority of later works presented in
the field are not too far from Luhn's idea, focusing mostly on refining the procedure of se-
lecting sentences that best abbreviate the content of a document. One approach, for instance, is to
take into account lexical clues, boosting the score of sentences in proximity of words such as
important or significant and decreasing their score in the neighborhood of phrases like might
be, or unlikely [80].
In [77], the authors describe a summarization engine called mead which extracts sentences
and ranks them using a clustering method. The highest ranking sentences are selected for the
summary.
A broader overview of summarization and topic segmentation techniques and systems
can be found in [52] or in [58].
Topic Segmentation
A piece of text, such as an article, rarely talks about a single subject. The analysis of how
topics change in a document is a matter of topic segmentation techniques. A large group
of methods in this field is based on theories of discourse modelling, and is represented by in-
fluential papers by Eduard Skorochod'ko [88], Michael Halliday and Ruqaiya Hasan [34], or
Barbara Grosz and Candace Sidner [32]. An overview of the models of discourse along with
their applications to topic segmentation and summarization can be found in Jeffrey Reynar’s
thesis [79].
Summarization, topic segmentation and clustering have a high degree of overlap in the
motivations and even in methodology of solving their respective problem areas. Having said
that, each discipline has its own niche where it fits best. The ideas presented in this work
to a certain extent combine the goals of multi-document text summarization, topic identifica-
tion and clustering, although we tend to stay in the field of information retrieval with regard
to the algorithmic solutions.
2.5 Evaluation of Clustering Quality
Experts seem to agree that objective measures of clustering quality are not feasible [59]. Text
clustering depends on such a variety of factors (implementation details, parametrization,
input data, preprocessing) that each experiment becomes quite unique and drawing con-
clusions about the supremacy of one algorithm over another seems a bit far-fetched. Moreover,
there is usually more than one “good” result and even human experts are rarely consistent
in their choice of the best one [56]. On the other hand, relying on anecdotal evidence of
improvement is obviously unsatisfactory.
There are two mainstream clustering evaluation methodologies: user surveys and mea-
sures of distortion from an “ideal” set of clusters. Neither method is perfect.
2.5.1 User Surveys
User surveys are a very common method of evaluating clustering algorithms [106, 63, 23, 85]
and often the only one possible. Unfortunately, a number of elements speak against this
method of evaluation.
• It is difficult to find a significantly large and representative group of evaluators. When
the users are familiar with the subject (like computer science students or fellow sci-
entists), their judgment is often biased. On the other hand, people not familiar with
clustering and used to regular search engines have difficulty adjusting to a different
type of search interface.
• People are rarely consistent in what they perceive as “good” clusters or clustering. This
has been reported both in literature [56] and in our past experiments on the Carrot2
framework and affects both preparation of the answers sheet and analysis of results.
• Experiment results are unique, unreproducible and incomparable. User surveys are
one-shot experiments that cannot be compared with one another and are difficult to
perform repeatedly or periodically.
• Human evaluators learn by example and their judgment and performance are not con-
stant throughout the experiment. This makes performing subsequent experiments
with the same evaluators impossible (because they have gained experience).
• User surveys usually require considerable time and effort, both in the preparation of
the experiment and in its practical realization.
We used user surveys in a few of our experiments in the past and were usually discour-
aged by the results. During the course of work on this thesis we tried to avoid controlled
user-studies and instead relied on empirical experiments, numerical investigation of qual-
ity and user feedback collected from an open demonstration of the search results clustering
system Carrot2. We summarize this experience in Section 7.4 on page 107.
2.5.2 Measures of Distortion from Predefined Classes
Another evaluation method is based on defining a mathematical notion of difference be-
tween the set of clusters and a reference desirable set of partitions (called a ground truth
set). A clustering algorithm should minimize this difference to mimic the behavior of the
person or algorithm that put together the ground truth set. A few popular data sets are avail-
able for full document clustering, the majority created by mixing documents from thematically
different sources (such as different mailing lists) and more rarely by human selection and
tagging. Interestingly, in spite of a few attempts to create ground truth data sets for search
results clustering [87], no “standard” test collection for this problem exists at the moment of
writing.
Let us assume a set of clusters K = \{k_1, k_2, \dots, k_n\} and a set of ideal partitions
C = \{c_1, c_2, \dots, c_m\} with a total of N objects. The following metrics are typically used
for measuring the difference between clusters and the ground truth set.
F-measure A measure popular in information retrieval: an aggregation of precision and re-
call, here adapted to clustering evaluation purposes. Recall that precision is the ratio of the
number of relevant documents to the total number of documents retrieved for a query. Re-
call is the ratio of the number of relevant documents retrieved for a query to the total num-
ber of relevant documents in the entire collection. In terms of evaluating clustering, the
f-measure of each single class c_i is:

F(c_i) = \max_{j=1 \dots n} \frac{2 P_j R_j}{P_j + R_j}, \qquad (2.7)

where:

P_j = \frac{|c_i \cap k_j|}{|k_j|}, \qquad R_j = \frac{|c_i \cap k_j|}{|c_i|}. \qquad (2.8)

The final f-measure for the entire set of clusters is:

\sum_{i=1}^{m} \frac{|c_i|}{N} F(c_i). \qquad (2.9)
Higher values of the f-measure indicate better clustering.
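A direct transcription of Equations 2.7–2.9 might look as follows; representing clusters and classes as lists of sets of object identifiers is our choice, made for illustration:

```python
def f_measure(clusters, classes):
    """Sketch of the clustering f-measure: `clusters` is the
    algorithm's output K, `classes` the ground truth C, both given
    as lists of sets of object identifiers."""
    n_objects = sum(len(c) for c in classes)
    total = 0.0
    for c in classes:
        best = 0.0
        for k in clusters:
            overlap = len(c & k)          # |c_i ∩ k_j|
            if overlap == 0:
                continue
            precision = overlap / len(k)  # P_j
            recall = overlap / len(c)     # R_j
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(c) / n_objects * best  # weighted by |c_i| / N
    return total
```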
Shannon's Entropy Entropy is often used to express the disorder of objects within a clus-
ter in information-theoretic terms. We can define the entropy of each cluster k_j as [11]:

E(k_j) = -\sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.10)

Defined this way, entropy is not normalized, so we normalize it (after: [16]):

E(k_j) = -\frac{1}{\log m} \sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.11)
Entropy of the entire clustering is a weighted entropy of its clusters:

\sum_{j=1}^{n} \frac{|k_j|}{N} E(k_j). \qquad (2.12)

Zero entropy means the cluster is comprised entirely of objects from a single class.
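The normalized cluster entropy and its weighted aggregation above translate into a short sketch (identifiers are ours, for illustration):

```python
from math import log

def cluster_entropy(members, class_of, m):
    """Normalized entropy of a single cluster: `members` lists object
    ids, `class_of` maps each id to its ground-truth class, `m` is the
    number of ground-truth classes."""
    counts = {}
    for obj in members:
        counts[class_of[obj]] = counts.get(class_of[obj], 0) + 1
    size = len(members)
    e = -sum((c / size) * log(c / size) for c in counts.values())
    return e / log(m) if m > 1 else 0.0

def clustering_entropy(clusters, class_of, m):
    """Weighted entropy of a whole clustering (a list of clusters)."""
    n = sum(len(members) for members in clusters)
    return sum(len(members) / n * cluster_entropy(members, class_of, m)
               for members in clusters)
```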
Byron E. Dom’s Clustering Entropy This is yet another information-theoretic measure. An
interesting thing about it is that it takes into account the difference between the number of
clusters and the number of classes (if such a difference exists) — a useful property we used
in our previous research on the influence of language properties on the quality of cluster-
ing [89]. We omit the exact formula here because we do not use it in this thesis — details
can be found in Byron Dom’s report [18].
Clustering Purity Purity gives the average ratio of the dominating class in each cluster to the
cluster size and is defined as:

P(k_j) = \frac{1}{|k_j|} \max_i \left( h(c_i, k_j) \right), \qquad (2.13)

where h(c, k) is the number of documents from partition c ∈ C assigned to cluster k ∈ K.
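The per-cluster purity is the simplest of the measures; a sketch (identifiers are ours, for illustration):

```python
def purity(cluster, class_of):
    """Purity of a single cluster: the fraction of the cluster occupied
    by its dominating ground-truth class. `cluster` is a list of object
    ids, `class_of` maps each id to its class."""
    counts = {}
    for obj in cluster:
        counts[class_of[obj]] = counts.get(class_of[obj], 0) + 1
    return max(counts.values()) / len(cluster)
```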
Evaluation against a ground truth set is quite reliable and convenient because it yields a
numeric and repeatable result, but it comes with its own issues. First of all, a single number
does not explain what went wrong in the clustering process. It only provides an average
figure that is comparable, but hardly interpretable, and the range of source errors is broad
and varies in "severity" (a mix of two related classes is, for example, preferable to a uniform
mixture of documents).
Each cluster validation measure also comes with inconvenient requirements concerning
cluster structure; most require explicit partitioning into a number of clusters identical to
the number of ground truth’s partitions. This requirement is very often hard to meet, espe-
cially when we want our clustering algorithm to adjust the number of clusters freely or allow
clusters that are pure subsets of original classes with no additional penalty. In this thesis we
introduce a cluster validation measure similar to entropy but hopefully easier to interpret —
cluster contamination measure (see Appendix A on page 116). We are also convinced that
visualization methods showing the allocation of documents within clusters help a great deal
when assessing the quality of a clustering algorithm, and we use many such visualizations in
Chapter 7.
Chapter 3
Descriptive Clustering
In this chapter we outline the differences between the traditional understanding of the docu-
ment clustering problem in information retrieval and descriptive clustering. We define the
latter as a distinct problem with a specific set of requirements, applicable to a certain class of
text browsing problems. Finally, we present a loose association with conceptual clustering
known in machine learning.
3.1 Problem Statement
Let us start by repeating the textbook definition of a clustering problem after [22]:
Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteris-
tics of each class are to be determined.
Note that the above definition does not mention cluster labels at all; the objective is to find
groups of similar objects (documents in our case). Any application that brings clusters to
the user interface will need to find their textual description — an additional requirement
not stated in the definition of the problem. A good clustering algorithm (in terms of the
definition) may appear completely useless from the user’s point of view, because it fails to
explain the reasons why clusters were formed. We believe the core difficulty is in the transi-
tion between the algorithm discovering groups of documents and the method of attaching
descriptions to these groups. Taking a Vector Space Model as an example, it seems almost
impossible to find a way of reconstructing comprehensible cluster labels from a mathemat-
ical bag-of-words model of a group of documents. In our opinion the approaches known
in the literature, such as keyword tagging or the use of frequent phrases as cluster labels, do not
provide satisfactory answers to all the requirements of a cluster browsing application.
The idea presented in this thesis attempts to avoid this difficult phase of cluster labeling
instead of solving it. We do it by slightly relaxing the requirements concerning document
groups and shifting the emphasis to cluster labels. Compare the definition of the descriptive
clustering problem with document clustering shown above:
Descriptive clustering is a problem of discovering diverse groups of semanti-
cally related documents described with meaningful, comprehensible and com-
pact text labels.
Ideally, an algorithm solving the descriptive clustering problem should present docu-
ment groups for which clear and comprehensible descriptions exist. Document clustering is
therefore a step towards the final result, not the ultimate goal.
According to the above definition, we agree to discard clusters without sensible descrip-
tions. It may be disturbing at first, but our decision is deeply rooted in practical experience
gained with the Carrot2 framework and is an outcome of the following observations:
• the user will not spend additional time trying to figure out the meaning of an unclear
cluster label,
• the user will not inspect documents of a cluster with an unintuitive label,
• unclear or obscure relationships between the cluster description and documents in-
side it are discouraging and frustrating for the user.
All our further considerations try to take these facts into account, even if this potentially
affects the “ideal quality” of clustering understood as the expected allocation of documents
to groups.
3.2 Requirements
Evaluation of cluster labels presents a great challenge. Initially, inspired by approaches from
natural language processing, we tried to define strict formal requirements concerning clus-
ter labels, based on their grammatical decomposition. This direction turned out to be unre-
alistic — the structure of natural language, especially in Polish, seems to be far too complex
for reliable automatic evaluation.
Unable to specify the requirements formally, we instead define certain expecta-
tions that can hardly replace a formal definition, but hopefully convey our intuition of what
cluster labels should be like. Therefore, to clarify the terminology, when we speak about
requirements concerning the problem of descriptive clustering, we mean two things:
• expectations concerning cluster labels (naturally imprecise and hard to verify, but pro-
viding certain intuition), and
• traditional requirements concerning clusters (groups of documents) which are taken
directly from the definition of clustering in information retrieval.
We describe these requirements in the following sections of this chapter.
3.2.1 Cluster Labels
We define three requirements concerning cluster labels: comprehensibility, conciseness and
transparency.
Comprehensibility
Zygmunt Saloni and Marek Swidzinski make a very interesting observation of how people
perceive elliptical statements:
Jestesmy przekonani, ze kazdy uzytkownik jezyka ma stosunkowo jasna (choc
nie wyrazna!) intuicje elipsy, tzn. potrafi okreslic stopien kompletnosci danego
wypowiedzenia. ([81], page 56)
We are convinced that every speaker of a given language has a relatively clear (but
not explicit!) intuition of ellipsis, that is, can determine the degree of completeness
of a given pronouncement.
Extending this observation to comprehensibility, we suspect that native speakers of a given
language can easily determine whether a given sequence of words can function as a cluster
label, yet without being able to provide explicit rules for this judgment. Instead of defining
a good cluster label we may pinpoint the negative cases and reject the clearly bad ones. A
list of reasons for rejecting or at least penalizing a cluster label along with some examples is
shown below. Good cluster labels should not fall into any of these categories.
• Grammatical inconsistency (not a sentence, not a pronouncement or an incomplete
phrase).
– wooden A go if
– byli krzywy noga z do (were crooked leg from to)
– of Computer Science
– z Torunia (from Torun)
• Internal grammatical or inflectional constraint violated (the phrase is incorrect, words
inside it are not in agreement).
– Europe snowboarding resorts [→European snowboarding resorts]
– Samorzadom miasta Poznaniu [→Samorzad miasta Poznania]
• External grammatical or inflectional constraint violated (the phrase is grammatically
correct, but is used in inflected form or lacks the required context).
– Instytucie Informatyki Politechniki Poznanskiej [→Instytut Informatyki Politechniki Poz-
nanskiej]
– Alicji w Krainie Czarów [→Alicja w Krainie Czarów]
• Ellipsis or ambiguity.
– piłem Okocimy (drank Okocim)
– to i tamto (this and that)
Obviously, fully automatic and reliable verification of these constraints is impossible.
Even human assessment is often difficult, e.g. is inspector gadget a meaningful phrase when
it lacks context? A reasonable solution is to minimize the number of potentially bad descrip-
tions by their careful selection. That means, for instance, allowing only entire sentences
or pronouncements — such entities should be self-contained and less ambiguous by def-
inition. Unfortunately, they are also too long to form concise cluster labels (expressed in
the next requirement), so we decided to use the more fine-grained level of chunks (see Sec-
tion 2.1.3 on page 16). Chunks should be grammatically consistent, potentially self-contained
and hopefully meaningful when extracted from the text and stripped of their surrounding
context, so they seem like good candidates, unlikely to fall into any of the unwanted cate-
gories mentioned above.
Conciseness
Our goal is to show the user a brief, concise view of the structure of topics present in a set of
documents. Cluster labels should be as short as possible to minimize the amount of infor-
mation the user must process, but sufficient to convey the information about the cluster’s
documents. If a word in the description can be removed without sacrificing comprehensi-
bility of the phrase, then it should be.
Anticipating our further discussion, let us mention that this requirement is quite difficult
to realize without linguistic and contextual knowledge. Our algorithms satisfy this require-
ment by allowing the user to express the desired length of cluster descriptions. We agree,
however, that this is a partial solution to the problem.
Transparency
To the user of a clustering algorithm, all its internal elements (the model of text represen-
tation, the similarity measures, the algorithm used for grouping documents) remain a black
box which he or she expects to work flawlessly. Any mistake made by the algorithm, especially
one that manifests itself in cluster descriptions, introduces confusion and decreases the
user’s trust in the entire algorithm.
We believe that the relationship between any document inside a cluster and its descrip-
tion must be clear and evident as in monothetic clustering. Similar clarity must exist in the
other direction — when looking at a description of a cluster, the user must be able to tell
which elements of this description can be found in the cluster’s documents. We will call a
clustering method transparent if the user is able to easily answer the following questions:
• Why was label X selected for documents in cluster Y?
• Why was document X placed in cluster Y?
Cluster keywords    Excerpts from sample documents
apple               New York reminds us of the warmhearted program of Big Apple Greeter.
apache, server      [. . . ] Median hourly earnings of nonrestaurant food servers were $7.95
                    in May 2004. This figure was even lower among Native American tribes
                    of Zuni, Navajo or Apache. [. . . ]
jacek, placek       [. . . ] Jacek był grubym i nieporadnym chłopcem. (Jacek was a fat and
                    clumsy boy.) [100 pages later] Na stole stał pachnacy placek. (A fragrant
                    cake stood on the table.) [. . . ]
Table 3.1: Examples of cluster labels consisting of keywords, shown with fragments of doc-
uments that match these keywords but not at all their common-sense meaning.
This requirement is partially inspired by the history of search engines in information re-
trieval. In the beginning, most search engines placed a default Boolean disjunction operator
(or) between query keywords; the result included documents containing any term of the
query. But people soon realized that a default conjunction (and) is less confusing, because
there is no guessing which combination of terms made a given document appear in the
result; with the default and, the relationship between the query and the set of retrieved doc-
uments is very clear (or, in our terms: transparent).
Returning to the field of clustering, its algorithms employ mechanisms more complex
than vsm-based document retrieval, so the transparency requirement becomes even more
important. For instance, the traditional keyword-based cluster presen-
tation may lead to mistakes because users will assign the most common sense to a set of
keywords — consider the examples of cluster keywords and documents not truly relevant to
such clusters shown in Table 3.1.
As for how the transparency requirement can be satisfied, in our opinion a cluster label
containment relationship is a good, clear rule of thumb: every document in a cluster must
contain a phrase from its description. Such a rule is very restrictive — the number of docu-
ments containing an exact copy of a given phrase is likely to be small. Recalling the discus-
sion in Section 2.1.3 about the loose order of phrases in languages such as Polish, exact
phrase containment may not even be a correct heuristic.
We may relax the above common sense rule and require the cluster to contain docu-
ments where the label’s phrase appears with possibly reordered words or other terms in-
jected inside it. The user should be able to control how much distortion from the cluster
label he or she allows. If such a definition is still too narrow, because the input is so big or
larger clusters are needed, the cluster label can be ultimately replaced with a more generic
term. However, there should always be a possibility of expanding the abbreviated cluster
label into low-level, fully transparent elements to provide explanation to the user about how
the cluster was formed and what can be found inside it.
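To make the strict containment rule concrete, here is a minimal sketch in Python (an illustration of ours, not part of any implementation discussed in this thesis; the function names are hypothetical): a document satisfies the rule if the label phrase occurs in it as a contiguous word sequence.

```python
def contains_phrase(document: str, phrase: str) -> bool:
    """Strict transparency rule: the document contains the cluster label
    as an exact, contiguous (case-insensitive) sequence of words."""
    doc_words = document.lower().split()
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))

def is_transparent(cluster_docs: list[str], label: str) -> bool:
    """A cluster is transparent under the strict rule if every member
    document contains the label phrase."""
    return all(contains_phrase(doc, label) for doc in cluster_docs)
```

The relaxed variants discussed below would replace `contains_phrase` with a more permissive matching predicate.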
3.2.2 Document Groups
The problem of descriptive clustering is focused on cluster labels; nonetheless, it essentially
still remains a document clustering problem. In this thesis we consider a subset of clustering
algorithms producing flat, overlapping clusters with the possibility of leaving some
documents unassigned.
Internal Consistency
An algorithm solving the descriptive clustering problem should ensure documents inside a
cluster are similar to each other. We believe that internal consistency, regardless of its math-
ematical definition, corresponds strongly with the concept of cluster label transparency. If
all documents in a cluster have a clear relationship with its description then such a cluster
must appear consistent to the user and therefore fulfills the requirement.
External Consistency
Descriptive clustering should provide an overview of the topics present in the input, so we
search for clusters that are diverse (different from each other) and of varying size (not only
the largest ones, which might be obvious to the user).
Overlaps and Outliers
We can expect a single document to contain references to many different subjects, so an al-
gorithm solving descriptive clustering must allow placing it in more than one cluster. More-
over, we can also expect a situation where a document does not belong to any cluster at all.
Such outlier documents can be abandoned entirely or form a synthetic group of unrelated
documents. The point is not to force documents to their closest cluster if such relationship
is not justified.
Note that we assumed that the structure of clusters is flat. This is partially a consequence
of transparency — if we need a clear relationship between a cluster label and its content, it
would be difficult to come up with a transparent label for a compound cluster in a hierarchi-
cal clustering. On the other hand, hierarchical clusters have a number of desirable features,
most importantly a more compact presentation compared to flat clusters. We consider it an
open question whether hierarchical, transparent clustering is feasible.
3.3 Relationship with Conceptual Clustering
During the work on this thesis it was pointed out to us that the difference between tradi-
tional and descriptive clustering resembles to some extent ideas introduced earlier in
machine learning. Conceptual clustering was introduced fairly independently by Douglas
Fisher, Ryszard Michalski, Robert Stepp and Joel Martin [40, 25, 66] and implemented in
algorithms such as Cluster/2 or CobWeb.
A conceptual clustering system accepts a tabular list of objects, described using a fixed
set of attributes (events, observations, facts) and produces a classification scheme over the
domain of these attributes. Conceptual clustering algorithms are usually unsupervised and
use some notion of a quality evaluation function to discover classes with “good” descrip-
tions. Evaluation of class quality is performed by looking at summaries (descriptions) of
classes and confronting them with the training set. In other words, conceptual clustering sys-
tems measure the adequacy of classification and employ iterative search strategies for its
optimization, keeping in mind that the description of classes is an integral and important
part of an investigation [30].
A class in conceptual clustering is described with a set of attributes (concrete values,
probability distributions or other properties). For example, in Cluster/2, the algorithm cre-
ates descriptions of groups of objects based on conjunctions of simple conditions defined
on attributes of these objects. A description can look as shown below:
[height > 1290 cm] & [eye color = blue or green]
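Such a description is simply a conjunctive predicate over object attributes. A direct transcription of the description above into Python (an illustration of ours; the objects are made up):

```python
# A Cluster/2-style class description: a conjunction of simple
# conditions over numeric and nominal attributes of objects.
def class_description(obj: dict) -> bool:
    return obj["height"] > 1290 and obj["eye_color"] in ("blue", "green")

objects = [
    {"height": 1300, "eye_color": "blue"},   # satisfies both conditions
    {"height": 1100, "eye_color": "green"},  # fails the height condition
]
members = [o for o in objects if class_description(o)]
```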
The similarity in motivation between conceptual clustering and our problem of descriptive
clustering is fairly clear: the class (or cluster) label is the key element driving the rest of the
process. Having said that, a straightforward application of conceptual methods to text
clustering seems problematic — conceptual clustering is strongly tied to a specific type of
input data: tabular lists of objects, each described with a set of (typically nominal) attributes.
Text representation models have different data characteristics — a great number of numeric
features. Adapting conceptual clustering algorithms to clustering text is of course possible
(and has been done in the past), but seems a bit artificial.
Summarizing, what is similar in conceptual clustering and descriptive clustering is the
motivation, the emphasis on describing the result using concepts understandable to a hu-
man. Their application domain and implementation remain quite different.
Chapter 4
Solving the Descriptive Clustering Task:
Description Comes First Approach
4.1 Introduction
Description Comes First (dcf) is our suggested solution to the problem of descriptive clus-
tering. We perceive dcf as a general method into which different algorithmic components
can be plugged; the two algorithms we present later in this document are its concrete in-
stances. In this chapter we would like to describe the common denominator — a high-level
procedure which helps in overcoming the most difficult problems of cluster labeling and in
our opinion fulfills the requirements of descriptive clustering. We can summarize the De-
scription Comes First approach by the following statement:
The Description Comes First approach is a general method for constructing text
clustering algorithms suited to solving the problem of descriptive clustering.
4.2 Anatomy of Description Comes First
The dcf approach consists of several phases, illustrated in Figure 4.1, but the core idea is the
separation of candidate cluster label selection from cluster discovery:
• Candidate label discovery (phase 1) is responsible for collecting all phrases potentially
useful as good cluster labels (comprehensible and concise phrases).
• Cluster discovery provides a data model about document groups present in the input
data.
By splitting the process into these two phases, the most difficult element so far — creat-
ing proper cluster descriptions from a mathematical model — is avoided and replaced by the
problem of selecting appropriate cluster labels for each group of related documents found
in the input. The only purpose of cluster discovery (in phase 2) is to build a model of dom-
inant topics — major subjects the documents are about. This model is subsequently used
Figure 4.1: Key elements of the dcf approach.
to select appropriate labels from the set of candidates and is discarded afterwards. The fi-
nal document groups (clusters) are built around the selected cluster labels (called pattern
phrases) to further reduce the “semantic gap” between cluster descriptions and documents
they contain (to fulfill the transparency requirement). The process ends with pruning of
groups that did not collect enough documents and elimination of very similar cluster labels.
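The phases above can be condensed into a toy, self-contained Python sketch (our own drastic simplification, not the actual Lingo or Descriptive k-Means code): candidate labels are frequent bigrams, dominant topics are crudely approximated by frequent terms, and document allocation uses strict phrase containment.

```python
from collections import Counter

def has_seq(words, pattern):
    """True if `pattern` occurs in `words` as a contiguous subsequence."""
    p = pattern.split()
    n = len(p)
    return any(words[i:i + n] == p for i in range(len(words) - n + 1))

def dcf_toy(documents, min_size=2):
    tokenized = [d.lower().split() for d in documents]
    # Phase 1: candidate cluster labels = bigrams occurring in > 1 document.
    bigram_df = Counter()
    for words in tokenized:
        bigram_df.update({" ".join(b) for b in zip(words, words[1:])})
    candidates = [b for b, df in bigram_df.items() if df > 1]
    # Phase 2: dominant topics, here approximated by terms occurring in
    # more than one document (a stand-in for k-Means centroids).
    term_df = Counter()
    for words in tokenized:
        term_df.update(set(words))
    topics = sorted(t for t, df in term_df.items() if df > 1)
    # Phase 3: pattern phrases = candidates matching a topic; allocate
    # documents by strict containment, then prune small groups and
    # skip duplicate labels.
    clusters, seen = [], set()
    for topic in topics:
        for pattern in (c for c in candidates if topic in c.split()):
            if pattern in seen:
                continue
            seen.add(pattern)
            members = [d for d, w in zip(documents, tokenized)
                       if has_seq(w, pattern)]
            if len(members) >= min_size:
                clusters.append((pattern, members))
    return clusters
```

Each placeholder choice (bigrams, term frequencies, containment) corresponds to one pluggable component of the general method.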
In the following sections we discuss the rationale behind each phase of the dcf ap-
proach, provide certain implementation clues and end with an illustrative example.
4.2.1 Phase 1: Cluster Label Candidates
As we mentioned in the introduction, previous research on text clustering shows that finding
cluster labels always encounters great difficulties. We can avoid this problem by preparing
candidate cluster labels prior to the clustering process and then picking only those for which
significant groups of documents exist.1 Because the process of cluster label selection is in-
dependent of clustering, it can fully utilize raw text input to assure the comprehensibility
and conciseness described in Section 3.2.1.
An interesting side-effect of making candidate label selection a separate phase is its
influence on the efficiency of the entire procedure. Note that because cluster label extraction
is basically independent, it can precede clustering or run in parallel. It can be centralized
or easily distributed (each computational unit extracting candidate labels from a single
document). Moreover, a collection of cluster label candidates can be prepared a priori (e.g.
as an ontology) and reused with no additional computational cost.
1It should be mentioned that this reversed order, labels→clusters instead of the traditional clusters→labels, was
first suggested on the Web site of a commercial clustering search engine Vivisimo [D]. Obviously, no algorithmic
details had been released, so we do not know whether our ideas align in any way with Vivisimo’s. Proper credit is
due to Vivisimo’s authors for inspiring our further work on the subject.
Implementation Ideas A set of candidate cluster labels can be prepared in several ways.
One possibility is to utilize existing dictionaries or ontologies. This scenario is interesting
because cluster labels are then given a priori, so we can assume they fulfill the requirements
and are comprehensible for end users. The entire dcf process then effectively becomes
a classification task to a set of predefined categories (implied by candidate cluster labels),
where only categories that collect enough documents are shown back to the user.
When candidate cluster labels are not given in advance, we must extract them directly
from the input documents. Several methods can be employed to do this:
• extraction of frequent phrases, much like in the stc algorithm,
• extraction of simple coherent linguistic chunks — noun phrases or other coherent
groups of words,
• full linguistic analysis to extract independent phrases or sentences.
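As a sketch of the first option, here is a simple frequent-phrase extractor (our own simplified illustration, not the stc implementation) that rejects candidates starting or ending with a stopword — a crude guard against incomplete phrases such as of Computer Science:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # tiny illustrative list

def frequent_phrases(documents, max_len=3, min_df=2):
    """Return word n-grams (n = 1..max_len) occurring in at least min_df
    documents, excluding those that start or end with a stopword."""
    df = Counter()
    for doc in documents:
        words = doc.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                    seen.add(gram)
        df.update(seen)
    return sorted(" ".join(g) for g, c in df.items() if c >= min_df)
```

Even with the stopword filter, such candidates remain noisy; as argued below, a dcf algorithm should tolerate this noise.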
Each one of the above methods has its advantages and disadvantages. Frequent phrase
extraction is a fast and scalable method, but may result in nonsensical candidates in the
output (non-grammatical, incomplete or common clichés). We nevertheless use frequent
phrases in both algorithms presented later in this thesis, mostly because their extraction is
so efficient, but we are aware of the shortcomings of this solution. To defend frequent
phrases and dcf a bit:
we believe (and our experiments support this belief) that an algorithm following dcf should
be able to deal with certain noise in the set of candidate cluster labels. A noisy cluster label
should not be supported by any dominant topic and should not become a pattern phrase.
Even if this happens, such a pattern phrase should not collect enough documents and be
discarded as a result. These elements are a clear improvement over plain stc for example,
which lacked such a verification step, often permitting junk frequent phrases to become
clusters. We return to this discussion in later sections.
To find better cluster label candidates we need to look at the methods of shallow lin-
guistic processing introduced in Section 2.1.2. The most common way of finding coherent,
sensible groups of words (in English) is to divide the text into chunks. Chunks are the small-
est (conciseness) grammatically consistent (comprehensibility) elements that the input text
can be divided into, so they offer much more in terms of our needs than frequent
phrases do.
Statistical chunkers for English are reasonably efficient and accurate, and we use them
later in this thesis to extract noun phrases in Descriptive k-Means. Note that as with any
automatic method, chunks retrieved using tools based on statistical processing of text are
just an approximation and may still return incorrect results.
For Polish, we assumed an equivalent of an English chunk to be a group as defined
in [81]. Unlike chunks, however, groups may be unordered and distributed throughout the
sentence, so their direct use for cluster label candidates is more complex. We already men-
tioned our experiments with a simple heuristic automaton for detecting certain tag
sequences, but this solution was too immature to be employed in this thesis. As a result, at
the moment of writing we are limited to frequent phrases (and possibly predefined ontolo-
gies) for experimenting with Polish texts.
4.2.2 Phase 2: Document Clustering (Dominant Topic Detection)
The intention of this phase is to construct a model of dominant topics2 present in the input.
Each dominant topic consists of a group of documents that are about the same, or closely
related subject. A dominant topic must also have a suitable representation which can be
used later (in the pattern phrase selection phase) to calculate similarity between each dom-
inant topic and phrases from the set of candidate cluster labels.
Note that while we refer to this phase as document clustering, any method producing
a model of dominant topics is actually sufficient. We present two different approaches
to topic approximation in this thesis. In Descriptive k-Means we use a regular
clustering algorithm (k-Means) and assume each cluster’s centroid represents a single dom-
inant topic. In the Lingo algorithm, on the other hand, clustering is replaced by Singular
Value Decomposition (dimensionality reduction) of the term-document matrix. Dominant
topics are approximated with the base vectors of one of the reduced matrices (we provide details
later).
Another element worth emphasizing is that dominant topics remain an internal artifact
in the process of dcf and never need to be shown to the user explicitly. This implies that
the model used for discovering dominant topics can be arbitrarily complex without hurt-
ing cluster label comprehensibility. This is a clear advantage with documents in Polish, for
instance. We can take into account loose syntax (use the vsm model instead of the phrase
co-occurrence model) and apply destructive text transformations to accommodate inflec-
tion (stemming, diacritic marks removal) without worrying about the problems of cluster
labeling which is essentially resolved in the next phase of dcf.
Implementation Ideas The most obvious and natural choice of a representation model for
this phase is the Vector Space Model and we use it in combination with cosine measure in
both our algorithms. We suppose that other models of text representation could be used,
but this direction has not been explored.
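A minimal sketch of that combination — term-frequency vectors, the cosine measure, and a centroid as one dominant topic's representation (a simplification of ours; real implementations would use tf-idf weighting):

```python
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector (simplest VSM variant)."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items())  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def centroid(vectors: list) -> Counter:
    """Mean of document vectors — one dominant topic's representation
    in the k-Means variant of this phase."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})
```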
4.2.3 Phase 3: Pattern Phrase Selection and Document Assignment
The role of this step is to pick those cluster label candidates which are most similar to the
representation of the previously discovered dominant topics. We will call such labels pattern
2We will use the phrases: dominant topic, dominant concept and abstract concept interchangeably for historic
reasons.
phrases. A different way of looking at this phase is that we approximate the representation
of dominant topics, which we know is traditionally difficult to describe, with existing com-
prehensible labels.
As with any approximation, there are certain risks involved. For example, there is a risk
that no candidate cluster label will match a given topic. This is almost impossible if candidate
cluster labels have been extracted from the input documents, but is much more likely for
a predefined set of labels. While at first it may seem like a disadvantage, this stems from the
intuition of the user’s anticipated behavior (Section 3.1 on page 41) — a cluster which cannot
be properly described is useless, even if the documents inside it make sense from the point
of view of the clustering method. Moreover, we rarely encountered this problem in real life
and believe the ability to hide clusters for which no sensible label can be found is actually a
strong point of the approach. The discussion of potential differences and distortions from
an ideal clustering is continued in Section 4.3 on page 53.
Once pattern phrases have been identified, the representation of dominant topics is dis-
carded and pattern phrases replace them as seeds of final document groups. This is a conse-
quence of the transparency requirement — we want a clear relationship between a cluster’s
label and its content. Pattern phrases will become cluster descriptions, so we must use them
directly to find the documents matching the topic they represent.
Note that the document allocation phase fulfills the requirements concerning group over-
lap and partial clustering defined in Section 3.2.2. Documents are assigned to each pattern
phrase independently, so they may belong to more than one group. Documents not rel-
evant to any pattern phrase at all may also exist, obviously, forming a synthetic group of
non-clustered documents.
The last step (pruning) is meant to remove any pattern phrases which failed to collect
enough documents. The following scenarios are possible:
• No documents are assigned to the pattern phrase. A rare, but possible, case when the
pattern phrase was similar to the dominant topic’s model, but matches no actual
documents. Consider the following example: the representation of a dominant topic
contains keywords lemony and snicket. A candidate cluster label Lemony Snicket3 will
be selected, but, unfortunately, no document contains this exact phrase. The group is
(correctly) discarded.
• Very few documents are assigned to the pattern phrase. This may indicate that the
pattern phrase encompasses just a part of the original topic or the dominant topic was
a combination of more than one subject. We can either discard the pattern phrase
or use it for merging with other small groups with overlapping documents (but this,
as we know from the stc algorithm, may lead to problems and is generally against
the transparency as we defined it). The threshold at which we consider the pattern
phrase and its documents irrelevant is a tuning parameter of a concrete algorithm
implementing dcf.
3Lemony Snicket is a pseudonym of Daniel Handler, an American novelist and the author of a series of darkly
comic children’s books known as A Series of Unfortunate Events.
• A significant number of documents is assigned to the pattern phrase. In this case the
pattern phrase and its associated documents become part of the final result, that is
become a cluster described with a pattern phrase and containing the documents allo-
cated to it.
Implementation Ideas There are two elements of difficulty: pattern phrase selection and
document allocation.
To select pattern phrases we must seek candidate cluster labels similar (or “close”) to
the discovered dominant topics. Assuming both cluster label candidates and dominant top-
ics are expressed in the same model (in the same vector space, for example), a simple cal-
culation of similarity between them should suffice.
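Assuming sparse vectors (here plain dicts) in a shared space, the selection can be sketched as a similarity threshold — geometrically, a hypercone around the topic vector (the helper names are ours):

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def select_patterns(topic_vec: dict, candidate_vecs: dict, threshold=0.7):
    """Pattern phrases = candidate labels whose vectors lie within the
    'hypercone' around the topic vector; the threshold corresponds to
    the cone's opening angle (a tuning parameter)."""
    return [label for label, vec in candidate_vecs.items()
            if cosine(topic_vec, vec) >= threshold]
```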
Document allocation is trickier, since we want a clear relationship between
a pattern phrase and the documents allocated to it. We have already discussed several “rule
of thumb” heuristics that could be used for this task when we talked about the transparency
requirement (on page 43). Let us recall them now:
• allocate all documents containing an exact copy of the pattern phrase (strict rule),
• allocate all documents containing a possibly distorted copy of the pattern phrase (re-
ordered words, foreign words injected inside); the user should be able to control the
allowed level of distortion,
• allocate all documents containing the phrase and any synonymous phrases that could
be related to it, but offer the user a possibility of expanding the cluster label to explain
which phrases contributed to the allocated documents.
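The second heuristic can be approximated with a windowed, order-free match, where the window slack is the user-controlled distortion level (a sketch of ours, not the dkm implementation):

```python
def distorted_match(document: str, phrase: str, slack: int = 2) -> bool:
    """True if all of the phrase's words occur, in any order, within some
    window of len(phrase) + slack consecutive document words. slack = 0
    allows reordering only; larger values also allow injected terms."""
    doc = document.lower().split()
    needed = set(phrase.lower().split())
    window = len(needed) + slack
    return any(needed <= set(doc[i:i + window])          # subset test per window
               for i in range(max(1, len(doc) - window + 1)))
```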
Implementation of pattern phrase selection and document allocation in practice may
become tricky, especially with large problem instances when efficiency of processing be-
comes critical. We show two different implementations of these elements in Lingo and De-
scriptive k-Means — the algorithms presented later in this thesis. Lingo uses a relatively
simple vsm-based retrieval model which results in several problems in document allocation
phase. Learning upon this experience, we improved the document allocation procedure in
dkm to scale to large problem instances and accommodate different document allocation
heuristics we mentioned above.
4.2.4 An Illustrative Example
This section demonstrates the dcf approach on a simple two-dimensional Vector Space
Model example. All referenced illustrations are collected in Figure 4.2 on page 54.
Let us narrow the “language” of documents in our example to only two terms: X and Y .
We can represent all input documents as points in a two-dimensional vector space, where
the horizontal axis represents the weight (importance) of term X and the vertical axis rep-
resents the weight (importance) of term Y . For estimating similarity between documents
represented in the term vector space we will use the cosine measure — the angle between
vectors starting at point (0,0) and ending at each respective document vector’s location. In
all subsequent figures we represent objects in the term vector space as small circles cast to
a unit sphere (the angle obviously does not change). A dcf approach to finding clusters in
the input documents would proceed as follows.
In the first step we collect cluster candidate labels. In our example these candidate labels
are already represented as red circles in the term vector space (see Figure 4.2(a)). Angle
vectors are also shown for clarity.
In a concurrent step we parse input documents and represent them in the same vector
space model. Each document is depicted as a faded blue circle with a direction vector going
through it (see Figure 4.2(b)).
We proceed to the second phase and detect dominant topics in the input documents. In
our example we look for groups of documents with a similar angle and note that two clear
groups exist (see Figure 4.2(c)). The average angle over a group’s documents gives the centroid
vector — the dominant topic’s representation in our model, depicted with larger blue arrows.
In the third phase we select pattern phrases for the dominant topics. Since our example
uses the same model for representing documents and labels, we can simply put everything
in the same space (see Figure 4.2(d)). Selecting pattern phrases is about choosing candidate
cluster labels “close to” topic vectors. Technically, we look for any vectors representing la-
bel candidates that lie within an infinite hypercone around the topic vector’s axis. In our
example’s two-dimensional space, the hypercone is simply an angle around a topic vector
(see Figure 4.2(e)) — we select three pattern phrases for the next phase. Note that the cone’s
opening angle is a tuning parameter of the algorithm.
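The cone test reduces to a simple cosine threshold: a vector lies inside the cone around a topic axis exactly when its cosine similarity to the axis is at least the cosine of the cone's angle. A minimal sketch with invented label and topic vectors, treating the tuning parameter as the angle measured from the cone's axis:

```python
import numpy as np

# Hypothetical candidate labels and a dominant topic vector in a 2-D term space.
labels = {
    "term X":  np.array([1.0, 0.05]),
    "X and Y": np.array([0.7, 0.7]),
    "term Y":  np.array([0.05, 1.0]),
}
topic = np.array([0.9, 0.15])     # dominant topic (centroid) vector
cone_angle_deg = 30.0             # tuning parameter of the algorithm

def in_cone(v, axis, angle_deg):
    """True if v lies within the (hyper)cone of the given angle around axis."""
    cos_sim = np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis))
    return cos_sim >= np.cos(np.radians(angle_deg))

pattern_phrases = [name for name, v in labels.items()
                   if in_cone(v, topic, cone_angle_deg)]
```

With these values only "term X" falls inside the cone; widening the angle to about 40 degrees would also admit "X and Y".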
Finally, we assign documents to pattern phrases in a way similar to the one previously
used to find pattern phrases. For each pattern phrase we look for documents within their
pattern hypercones and assign these documents to the pattern phrase. As shown in Fig-
ure 4.2(f), documents can be assigned to more than one pattern phrase. Each final cluster
contains documents similar to the selected pattern phrase (the transparency requirement);
the original dominant topic's vector is no longer taken into account directly.
Note that we use a simple cosine similarity model in the last phase of this example to
convey the idea of how dcf works. In a real algorithm the document assignment phase
would have to take into consideration word order and proximity of terms in the pattern
phrase, something the cosine measure omits entirely.
4.3 Discussion of Clustering Quality
The Description Comes First approach avoids the most difficult problems of labeling clusters,
but the trade-offs of course surface elsewhere. There are two points at which clustering
quality may degrade (compared to an "ideal" clustering):
• when pattern phrases are selected, they are an approximation of dominant topics con-
structed in phase 2,
• when documents are assigned to the selected pattern phrases, we use the phrases as the
reference point instead of the original representation of dominant topics.

Figure 4.2: Example showing the dcf approach applied to documents and labels in a two-
dimensional term vector space: (a) candidate labels and their "position" in the model;
(b) documents and their "position" in the model; (c) concept vectors discovered in the
documents; (d) preparation for merging — cluster label candidates and topics are represented
in the same model; (e) candidate labels close to topics (within cones) become pattern phrases;
(f) documents matching pattern phrases (within cones) are selected into their groups.
The first issue seems to be more important as it means we are “cheating” the user a bit
by replacing the original dominant topics with groups of documents created around perfect
candidate labels. We believe this element is a virtue rather than a vice. Dominant topics
represent ideal groups expressed in a model used for clustering, but this model is obscure
and incomprehensible to the user. A “semantic gap” between the cluster’s representation
and its perception by a human is inevitable and introduces just as much confusion. In our
opinion the approximation of dominant topics using pattern phrases is not about “cheating”,
but rather choosing the closest comprehensible image of dominant topics that the user can
fully understand.
The second problem — documents assigned to pattern phrases instead of the original
dominant topic — is a straightforward consequence of the transparency requirement. We
again attempt to minimize the semantic gap, this time between the pattern phrases and
documents inside them. If we used the dominant topics to allocate documents to pattern
phrases, the cluster labels would remain comprehensible, but their content would be rele-
vant to something the user never sees explicitly.
The document assignment step must ensure that there really are documents that match
selected pattern phrases and that the link between cluster labels and documents inside them
is clear. This step is also a safety valve for the entire dcf procedure: even if incorrect (unrelated)
cluster labels are selected as pattern phrases, they are unlikely to collect enough
documents to form final clusters.
4.4 Summary
We believe the strengths of the dcf approach lie in the following properties:
• Candidate phrases can be extracted from the input text automatically based on fre-
quency or other statistics, just as in the stc algorithm. Alternatively, candidate phrases
can come from a completely different source or even a predefined ontology (to guar-
antee they are comprehensible).
• Cluster discovery and candidate phrase extraction are independent and can be easily
parallelized. In fact, the extraction can be done incrementally as the documents are
added to the system.
• Dominant topic detection can use an arbitrarily complex model of text representation
and cluster analysis without making the cluster labeling procedure any more difficult.
• If the method used for detecting dominant topics returns an unclear representation of
a topic, then it is less likely to find a matching label in the set of label candidates. Even
if a matching label is found, the document assignment phase provides a second-level
pruning. We thus ensure that groups of documents in the output are really relevant
and well described.
• The final assignment of documents to pattern phrases fulfills the transparency re-
quirements we established for the descriptive clustering problem. All documents in
a final cluster must contain a phrase from its label (possibly distorted), so the rela-
tionship between the cluster and its label should be clear to the user.
Chapter 5
The Lingo Algorithm
The motivation for creating Lingo was to come up with an algorithm for clustering search
results capable of discovering diverse groups of documents while at the same time keeping
cluster labels sensible. The work on Lingo must be credited to Stanisław Osinski, who worked
on the algorithm under the supervision of Jerzy Stefanowski [68] and later contributed a great
deal of effort to the Carrot2 framework. The author of this thesis worked with Stanisław on a
number of co-authored papers [70, 68] and this fruitful cooperation gradually resulted in a
conceptual basis for defining descriptive clustering and the dcf approach.
The aim of this section is to show how Lingo fits in the general scheme introduced by
the dcf. The algorithm is an example of dcf’s application to the domain of search results
clustering and several elements of its implementation are designed specifically to deal with
this type of input data.
5.1 Application Domain
Clustering search results differs significantly from other types of document clustering. Each
matching result (hit) in a list of results returned by a search engine contains a resource
locator (url), an optional title, and a short fragment of text called a snippet, which is optional
as well. Modern search engines assemble snippets individually for each query by scanning the
body of a document and looking for short spans of text that contain as much of the query
as possible. The two or three best matching spans are joined and returned as a short block of
text that gives the user insight into the original document. This technique of generating
snippets is called kwic — keyword in context. Figure 5.1 shows a typical snippet.
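A crude sketch of the kwic idea: score every fixed-size window of words by the number of query terms it contains and join the best windows. The function and its parameters are invented for illustration; production engines score and trim spans far more carefully.

```python
import re

def kwic_snippet(document, query_terms, window=8, spans=2):
    """Pick the `spans` windows of `window` words covering the most query
    terms and join them, in document order, into a snippet."""
    words = document.split()
    terms = {t.lower() for t in query_terms}
    scored = []
    for i in range(max(1, len(words) - window + 1)):
        chunk = words[i:i + window]
        hits = sum(1 for w in chunk if re.sub(r"\W+", "", w).lower() in terms)
        scored.append((hits, i))
    best = sorted(scored, key=lambda s: -s[0])[:spans]
    best.sort(key=lambda s: s[1])   # restore document order
    return " ... ".join(" ".join(words[i:i + window]) for _, i in best)
```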
In the remaining part of this chapter we will use the term document to refer to a single
hit, even though the entire document is obviously not returned with the search result.
The usual number of hits returned by a search engine is anywhere between a few dozen and
a few hundred entries, so the input is relatively small. Moreover, it is not likely to grow sub-
stantially larger because search engines (Google, Yahoo and others) limit the number of
accessible hits to a few thousand.
Figure 5.1: A typical “hit” returned by a search engine: document title on top, snippet with
query terms in the middle and an information line (with the document’s address) on the
bottom.
Figure 5.2: Generic elements of dcf and their counterparts in Lingo. The svd decomposition
takes place inside the cluster label induction phase; it is shown separately for clarity.
The conclusions are twofold: on one hand, a search results clustering algorithm must work
with incomplete, fragmented data (an extreme example is a document with an empty snippet
and an empty title). On the other hand, the scalability of the algorithm is not that important
as long as it is unnoticeably fast (for a human user) on typical input data sizes.
5.2 Overview of the Algorithm
Lingo processes the input in four phases: snippet preprocessing, frequent phrase extraction,
cluster label induction and content allocation. The parallels to the generic scheme
introduced in the dcf are illustrated in Figure 5.2. Algorithm 5.1 contains the full
pseudocode of the algorithm; we discuss the details of each step in the sections below.
5.2.1 Input Preprocessing
In the preprocessing phase the input documents (titles and snippets) are tokenized and split
into terms. Lingo is implemented as a component embedded in the Carrot2 framework
and uses its infrastructure to perform certain text preprocessing tasks — stemming, marking
stop words and simple text segmentation heuristics. After tokenization is complete, a
term-document matrix is constructed out of the terms that exceed a predefined term
frequency threshold. After that, document vectors are weighted using the tf-idf formula [82].
Terms present in document titles are additionally boosted compared to those appearing in
snippets by a predefined constant, because titles are more likely to contain sensible (human-
edited) information.

1: D ← input documents (or snippets)
/* Preprocessing */
2: for all d ∈ D do
3: perform text segmentation of d; /* Segmentation, stemming. */
4: if language of d recognized then
5: apply stemming and mark stop-words in d;
6: end if
7: end for
/* Frequent Phrase Extraction */
8: concatenate all documents;
9: P_c ← discover complete phrases;
10: P_f ← {p : p ∈ P_c ∧ frequency(p) > Term Frequency Threshold};
/* Cluster Label Induction */
11: A ← term-document matrix of terms not marked as stop-words and
with frequency higher than the Term Frequency Threshold;
12: S, U, V ← SVD(A); /* Product of svd decomposition of A. */
13: k ← 0; /* Start with zero clusters. */
14: n ← rank(A);
15: repeat
16: k ← k + 1;
17: q ← ‖S_k‖_F / ‖S‖_F;
18: until q ≥ Candidate Label Threshold or k = n;
19: P ← phrase matrix for P_f;
20: for all columns of U_k^T · P do
21: find the largest component m_i in the column;
22: add the corresponding phrase to the Cluster Label Candidates set;
23: labelScore ← m_i;
24: end for
25: calculate cosine similarities between all pairs of candidate labels;
26: identify groups of labels that exceed the Label Similarity Threshold;
27: for all groups of similar labels do
28: select one label with the highest score; /* cluster description */
29: end for
/* Cluster Content Discovery */
30: for all L ∈ Cluster Label Candidates do
31: create cluster C described with L;
32: add to C all documents whose similarity to C exceeds the Snippet Assignment Threshold;
33: end for
34: put all unassigned documents in the “Others” group;
/* Final Cluster Formation */
35: for all clusters do
36: clusterScore ← labelScore × ‖C‖;
37: end for
38: Sort final clusters.
Algorithm 5.1: Pseudo-code of the Lingo algorithm.
5.2.2 Frequent Phrase Extraction
The aim of this step is to discover a set of cluster label candidates — phrases (but also sin-
gle terms) that can potentially become cluster labels later. Lingo extracts frequent phrases
using a modification of an algorithm presented in shoc [19]. A word-based
suffix array is constructed and extended with an auxiliary data structure — the lcp (Longest
Common Prefix) array. This allows the algorithm to identify all frequent complete phrases
in O(n) time, n being the total length of all input snippets.
The frequent phrase extraction algorithm ensures that each discovered label fulfills the
following conditions; a label must:
• appear in the input at least a given number of times (a tuning threshold);
• not cross sentence boundaries; sentence markers indicate a topical shift, therefore a
phrase extending beyond one sentence is unlikely to be meaningful;
• be a complete frequent phrase (the longest possible phrase that is still frequent); com-
pared to partial phrases, complete phrases should allow clearer description of clusters
(compare: “Hillary Rodham” and “Senator Hillary Rodham Clinton”);
• neither begin nor end with a stop word; stop words that appear in the middle of a
phrase should not be discarded.
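The thesis relies on a word-based suffix array with an lcp array for linear-time extraction; the quadratic sketch below only demonstrates the filtering rules above using a plain n-gram counter (the stop-word list is illustrative):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for", "to", "and", "in"}  # illustrative only

def frequent_complete_phrases(sentences, min_freq=2, max_len=5):
    """Count word n-grams within sentence boundaries; keep phrases that are
    frequent, neither start nor end with a stop word, and are complete
    (every one-word extension is strictly less frequent)."""
    counts = Counter()
    for s in sentences:
        words = s.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    result = []
    for phrase, freq in counts.items():
        if freq < min_freq:
            continue
        if phrase[0] in STOP_WORDS or phrase[-1] in STOP_WORDS:
            continue
        extensions = [p for p in counts
                      if len(p) == len(phrase) + 1
                      and (p[:-1] == phrase or p[1:] == phrase)]
        if all(counts[p] < freq for p in extensions):
            result.append((" ".join(phrase), freq))
    return sorted(result, key=lambda x: -x[1])
```

For instance, given several sentences mentioning "senator hillary rodham clinton", the partial phrase "senator hillary" is rejected because one of its extensions is just as frequent, while the complete phrase survives.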
5.2.3 Cluster Label Induction
During the cluster label induction phase, Lingo identifies the abstract concepts (or domi-
nant topics in the terminology used in dcf) that best describe the input collection of snip-
pets. There are two steps to this: abstract concept discovery, followed by phrase matching
and label pruning.
In abstract concept discovery, singular value decomposition (svd) is applied to the term-
document matrix A, breaking it into three matrices U, S and V in such a way that
A = U · S · V^T. An interesting property of svd is that the first r columns of matrix U, r being
the rank of A, form an orthogonal basis for the term space of the input matrix A [29]. It is
commonly believed that base vectors of the decomposed term-document matrix represent
an approximation of “topics” — collections of terms connected with an obscure net of latent
relationships. Although this fact is difficult to prove, singular decomposition is widely used
in text processing, for example in Latent Semantic Indexing (lsi). From Lingo’s point of view,
basis vectors (column vectors of matrix U ) contain exactly what it has set out to find — a
vector representation of the abstract concepts.
The most significant k base vectors of matrix U are determined by comparing the Frobenius
norms of the term-document matrix A and its k-rank approximation A_k. Let threshold q be
a percentage-expressed value that determines to what extent the k-rank approximation
should retain the original information in matrix A. We hence define k as the minimum value
that satisfies the following condition:

    ‖A_k‖_F / ‖A‖_F ≥ q,
where the symbol ‖X ‖F denotes the Frobenius norm of matrix X . Clearly, the larger the
value of q the more cluster candidates will be induced. The choice of the optimal value for
this parameter ultimately depends on the preferences of users, so we make it one of Lingo’s
control thresholds — Candidate Label Threshold.
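Since ‖A_k‖_F can be computed from the k largest singular values alone, searching for the smallest such k needs only the vector of singular values. A small numpy sketch (the helper name choose_k is ours):

```python
import numpy as np

def choose_k(A, q=0.9):
    """Smallest k with ||A_k||_F / ||A||_F >= q; ||A_k||_F^2 equals the sum
    of the k largest squared singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    ratios = np.sqrt(np.cumsum(s ** 2) / np.sum(s ** 2))
    return int(np.searchsorted(ratios, q) + 1)
```

For a matrix with singular values 3 and 1 the ratio after one value is sqrt(9/10), roughly 0.949, so q = 0.9 gives k = 1 while q = 0.96 gives k = 2.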
The phrase matching and label pruning step, where group descriptions are discovered, relies
on an important observation: both abstract concepts and frequent phrases are expressed
in the same vector space — the column space of the original term-document matrix A. This
enables us to use the cosine distance to calculate how “close” a phrase or a single term is to
an abstract concept. Let us denote by P a matrix of size t × (p + t), where t is the number
of frequent terms and p is the number of frequent phrases. P can be easily built by treating
phrases as pseudo-documents and using one of the term weighting schemes.
Given the matrix P and the i-th column vector U_i of the svd's U matrix, a vector m_i of
cosines of the angles between the i-th abstract concept vector and the phrase vectors can
be calculated as:

    m_i = U_i^T · P.
The phrase that corresponds to the maximum component of the m_i vector is selected as
the human-readable description of the i-th abstract concept. Additionally, the value of
the cosine (similarity) becomes the score of the cluster label candidate.
The same process can be extended from a single abstract concept to the entire U_k matrix:
a single matrix multiplication M = U_k^T · P yields the result for all pairs of abstract concepts
and frequent phrases.
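The matching step can be sketched with numpy on an invented toy instance (the matrices below are illustrative and unrelated to the worked example in Section 5.3):

```python
import numpy as np

# 4 terms, k = 2 abstract concepts (columns of Uk, roughly unit length),
# and 3 candidate phrases as columns of P.
Uk = np.array([[0.90, 0.0],
               [0.44, 0.0],
               [0.0,  0.8],
               [0.0,  0.6]])
P = np.array([[0.7, 0.0, 0.5],
              [0.7, 0.0, 0.5],
              [0.0, 0.7, 0.5],
              [0.0, 0.7, 0.5]])
phrases = ["singular value", "information retrieval", "mixed phrase"]

P = P / np.linalg.norm(P, axis=0)   # unit-length phrase vectors
M = Uk.T @ P                        # cosines: concepts x phrases
best = M.argmax(axis=1)             # best-matching phrase per concept
labels = [(phrases[j], float(M[i, j])) for i, j in enumerate(best)]
```

Each abstract concept receives the single phrase closest to it, together with the cosine value as the label score.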
The final step of label induction is to prune overlapping labels. Let V be a vector of clus-
ter label candidates and their scores. We create another term-document matrix Z , where
cluster label candidates serve as documents. After column length normalization we calculate
Z^T · Z, which yields a matrix of similarities between cluster labels. For each row we then
pick the columns that exceed the Label Similarity Threshold and discard all but the one cluster
label candidate with the maximum score, which becomes the description of a future cluster.
5.2.4 Cluster Content Allocation
The process of cluster content allocation very much resembles document retrieval based on
the plain vsm model. The only difference is that instead of one query, the input snippets are
matched against a series of queries, each of which is a single cluster label. Thus, if for a
certain query-label, the similarity between a document and the label exceeds a predefined
threshold, it will be allocated to the corresponding cluster. Note that from the point of view
of dcf, traditional Vector Space Model used for comparisons is not ideal — the label’s word
order and proximity is not taken into account.
Let us define a matrix Q in which each cluster label is represented as a column vector.
Let C = Q^T · A, where A is the original term-document matrix for the input documents. This
way, element c_ij of the matrix C indicates the strength of membership of the j-th document
in the i-th cluster. A document is added to a cluster if c_ij exceeds the Snippet Assignment
Threshold, yet another control parameter of the algorithm. Documents not assigned to any
cluster end up in an artificial cluster called “Other documents”.
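A sketch of the allocation step with made-up, roughly unit-normalized matrices:

```python
import numpy as np

# Term-document matrix A (terms x documents) and label matrix Q (terms x labels).
A = np.array([[0.8, 0.0, 0.1, 0.3],
              [0.6, 0.0, 0.1, 0.3],
              [0.0, 0.9, 0.7, 0.3],
              [0.0, 0.44, 0.7, 0.3]])
Q = np.array([[0.71, 0.0],
              [0.71, 0.0],
              [0.0, 0.71],
              [0.0, 0.71]])

threshold = 0.5                   # the Snippet Assignment Threshold
C = Q.T @ A                       # c_ij: membership of document j in cluster i
assignment = C > threshold        # a document may join several clusters
others = ~assignment.any(axis=0)  # unassigned documents go to "Other documents"
```

In this instance the third document joins only the second cluster, and the last document matches neither label, so it falls into the "Other documents" group.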
5.2.5 Final Cluster Formation
Finally, clusters are sorted for display based on their score, calculated using the following
formula:

    C_score = labelScore × ‖C‖,
where ‖C‖ is the number of documents assigned to cluster C . The scoring function, al-
though simple, prefers well-described and relatively large groups over smaller ones.
5.3 An Illustrative Example
Let the input collection of documents contain d = 7 documents. We omit the preprocessing
stage and assume t = 5 terms and p = 2 phrases are given (these appear more than once and
thus will be treated as frequent). The input is shown in Figure 5.3.
The t = 5 terms
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
The p = 2 phrases
P1: Singular Value
P2: Information Retrieval
The d = 7 documents
D1: Large Scale Singular Value Computations
D2: Software for the Sparse Singular Value Decomposition
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D5: Matrix Computations
D6: Singular Value Analysis of Cryptograms
D7: Automatic Information Organization
Figure 5.3: Input documents, frequent terms and phrases.
We now preprocess the input term-document matrix: tf-idf weighting and normalization
result in matrix A_tfidf, and svd decomposition of that matrix yields matrix U containing
the abstract concepts.
A_tfidf =
| 0     0     0.56  0.56  0     0     1    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.72  0     0     0     1     0     0    |
| 0     0     0.83  0.83  0     0     0    |

U =
| 0     0.75   0     −0.66  0     |
| 0.65  0     −0.28   0    −0.71  |
| 0.65  0     −0.28   0     0.71  |
| 0.39  0      0.92   0     0     |
| 0     0.66   0      0.75  0     |
Now we look for the value of k — the estimated number of clusters. Let us define quality
threshold q = 0.9. Then the process of estimating k is as follows:

    k = 0 → q = 0.62,   k = 1 → q = 0.856,   k = 2 → q = 0.959

and the number of expected clusters is k = 2.
To find relevant descriptions of our clusters (the k = 2 columns of matrix U), we calculate
similarities between candidate phrases and concept vectors as the matrix M = U_k^T · P, where
P is a synthetic term-document matrix created out of our frequent phrases and terms (values
in matrix P are again weighted using tf-idf and normalized):
P =
| 0     0.56  1  0  0  0  0 |
| 0.71  0     0  1  0  0  0 |
| 0.71  0     0  0  1  0  0 |
| 0     0     0  0  0  1  0 |
| 0     0.83  0  0  0  0  1 |

M =
| 0.92  0     0     0.65  0.65  0.39  0    |
| 0     0.97  0.75  0     0     0     0.66 |
Rows of matrix M represent clusters, columns — their descriptions. For each row we select
the column with the maximum value. The two selected labels are: Singular Value (score: 0.92)
and Information Retrieval (score: 0.97). We skip label pruning as it is not necessary in this
example. Finally, documents are allocated to clusters by applying matrix Q, created out of
the cluster labels, to the original matrix A_tfidf. The final result is shown below. Note the
fifth column in matrix C, representing the unassigned document D5.
Q =
| 0     0.56 |
| 0.71  0    |
| 0.71  0    |
| 0     0    |
| 0     0.83 |

C =
| 0.69  1  0  0  0  1  0    |
| 0     0  1  1  0  0  0.56 |
Information Retrieval [score: 1.0]
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D7: Automatic Information Organization
Singular Value [score: 0.95]
D2: Software for the Sparse Singular Value Decomposition
D6: Singular Value Analysis of Cryptograms
D1: Large Scale Singular Value Computations
Other: [unassigned]
D5: Matrix Computations
5.4 Computational Complexity
The time complexity of Lingo is quite high, mostly bound by the cost of the term-document
matrix decomposition (recall that a suffix array can be built in time linear with
respect to the input size). To the best of our knowledge, svd decomposition can be performed
in the order of O(m²n + n³) for an m × n matrix [29]. Moreover, memory requirements are
demanding because of all the matrix transformations.
Note, however, that Lingo has been designed for a very specific application — search
results clustering — and in this setting scalability to large data sets is of no practical impor-
tance (the information is more often limited than abundant). In the next chapter we will
discuss another algorithm that scales well to large numbers of documents and also implements
the dcf approach.
5.5 Summary
Strong Points
• Lingo was the first algorithm implementing cluster description search prior to actual
document allocation.
• Lingo handles fragmented, incomplete input. It discovers a diverse structure of topics
using dimensionality reduction applied to the term-document matrix and subsequent
label search with base vectors of the reduced space.
• Lingo has few tuning parameters that fit well in its application domain. The number
of clusters is determined by taking advantage of a side-product of the singular matrix
decomposition — the accuracy of approximation of the original term vector space.
Weak Points
• The document assignment step breaks the transparency requirement of descriptive
clustering: documents containing subphrases or even isolated words from the cluster
label can become part of that label’s cluster.
• Troublesome scalability to larger problem instances.
• Candidate label discovery is bound to frequent ordered sequences of words in the in-
put. Even though we could try to use an external set of labels for cluster label induc-
tion, this possibility has not been exercised so far.
Fulfillment of Requirements
• Comprehensibility and Conciseness — Lingo extracts candidate cluster labels from a
set of frequent phrases. The danger of selecting frequent, but meaningless labels (as
in stc) exists, but is not an annoyance in practice for two reasons. First, we select
only a subset of all frequent phrases that correspond to dominant topics present
in the input and detected using matrix decomposition techniques. Second, the input
to Lingo is very specific — it is short and contextual with regard to the query (snippets)
and this context is quite likely to contain recurring phrases that denote meanings syn-
onymous to the query, which helps the algorithm in selection of candidate phrases.
• Transparency — Transparency in Lingo suffers from the generic vsm document alloca-
tion procedure, which allows documents that contain only subphrases of the original
cluster label to be added to the group. This often yields unintuitive results.
• Clusters Structure — Cluster diversity is ensured by the use of singular matrix decomposition:
the base vectors representing dominant topics are orthogonal, and orthogonal
vectors are commonly believed to represent different topics. Internal consistency is sometimes
broken as a result of the document allocation procedure. The algorithm is able to
produce overlapping clusters.
Chapter 6
Descriptive k-Means Algorithm
Our initial experiments with dcf concerned clustering search results and, as we already
mentioned, this is a very specific application domain. A few challenging questions arose:
• Is it possible to create an algorithm implementing dcf that scales well to large num-
bers (tens of thousands) of documents?
• Is it possible to adapt a well-known text clustering algorithm to the dcf approach,
retaining at least the original quality of clustering while improving the comprehensibility
of cluster labels? What will be the difference in clustering quality between the derived
algorithm and the original?
Descriptive k-Means (dkm) is an attempt to provide answers to the above questions. It
combines cluster label discovery — we experiment with two techniques: frequent phrase
extraction and noun phrase extraction — with the very well known numerical clustering
algorithm k-Means, resulting in a novel algorithm that follows the dcf approach.
We are aware that k-Means is very often criticized: it produces spherical clusters (with
respect to the distance metric used), it requires the number of clusters to be given in
advance, and it always assigns each object to its closest cluster centroid, regardless of their
actual resemblance. However, we still chose to extend k-Means for a few important reasons.
First of all, we wanted to have a scalable, very fast baseline algorithm. Running times of
k-Means are practically linear with the size of input (the algorithm is interrupted when it is
close enough to convergence) and the procedure scales to very large data sets [16, 50].
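For reference, the core k-Means iteration that the chapter builds on can be sketched in a few lines (random initialization and Euclidean distance here; dkm itself seeds centroids by finding the most dissimilar documents and works with cosine similarity):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-Means: alternate assignment of points to the closest
    centroid and recomputation of centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its closest centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```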
Second, remembering our very good experience with the diversity of topics detected using
svd decomposition, we looked for a similar method that would handle large input data.
Interestingly, cluster centroid vectors created by k-Means are reported to be a close approxi-
mation of singular value decomposition’s base vectors [43, 17]. This made us believe that we
could use k-Means as an efficient and scalable algorithm consistent with the behavior once
observed in the Lingo algorithm.
Finally, k-Means is a widely recognized and very often used numerical clustering algo-
rithm — many researchers use it as a benchmark result for their own achievements, so it
is relatively easy to cross-compare results with others. By choosing k-Means, an algorithm
with a notorious reputation when applied to text clustering, we hoped to demonstrate that
an adaptation to the dcf approach can help improve, or at least retain, the clustering quality
and yield more comprehensible cluster labels, consistent with the requirements defined for
descriptive clustering.
6.1 Application Domain
The envisioned application domain for Descriptive k-Means consists of short and medium-length
documents such as news stories, Web pages, e-mails and other documents not exceeding
a few pages of text. We expect the input to be real text (not completely random, noisy
documents and not fragments like snippets) written in one language (left-to-right word order,
available word segmentation heuristic). In this thesis we consider texts written in English
and Polish. The designed algorithm must be able to handle thousands of input documents
for off-line clustering and return results in a reasonable time.
6.2 Overview of the Algorithm
Descriptive k-Means closely follows the dcf approach. The cluster label discovery phase is
implemented in two alternative variants: using frequent phrase extraction and with shallow
linguistic processing for English texts (extraction of noun phrase chunks). Dominant topic
discovery is performed by running a variant of k-Means algorithm on a sample of input doc-
uments. We experimented with various types of features and weighting schemes for docu-
ment representation and found out that, except for pointwise mutual information which is
known to cause problems, all of them gave similar results.
In the pattern phrase selection phase the algorithm uses a Vector Space Model to calculate
similarities between cluster label candidates and dominant topics (represented by cluster
centroids). The document assignment phase uses a mix of vsm and a Boolean model implemented
on top of a search engine (and utilizing its data structures) to ensure processing
efficiency. Unlike in Lingo, the document assignment phase searches for documents that
contain a pattern phrase, while allowing certain distortions such as minor word reordering
and different words injected inside. The level of pattern phrase distortion is adjustable and
is a parameter of the algorithm.
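The distorted matching described above can be approximated by a windowed test, loosely mimicking a search engine's sloppy phrase query; this simplified function is ours, not dkm's actual implementation:

```python
def sloppy_phrase_match(document_words, phrase_words, slop=2):
    """True if all words of the phrase occur (in any order) inside some
    window of length len(phrase) + slop."""
    need = [w.lower() for w in phrase_words]
    words = [w.lower() for w in document_words]
    win = len(need) + slop
    for i in range(max(1, len(words) - win + 1)):
        window = words[i:i + win]
        if all(window.count(w) >= need.count(w) for w in set(need)):
            return True
    return False
```

With slop = 0 the phrase must appear as a contiguous (though possibly reordered) window; increasing the slop tolerates injected words, as in "singular matrix value".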
The matching between the dcf approach and Descriptive k-Means is depicted graphically
in Figure 6.1. Algorithm 6.1 contains the full pseudocode of the algorithm, and we
discuss each major step in the sections below.
6.2.1 Preprocessing
In the preprocessing step we initialize two important data structures: an index of documents
and an index of cluster candidate labels.
Figure 6.1: Generic elements of dcf and their counterparts in Descriptive k-Means.

An index is a fundamental structure in information retrieval. Each entry added to an index
(a document or a candidate cluster label in our case) is accompanied by a vector of terms
and their counts appearing in that entry. The index also maintains an associated list containing
all unique terms and pointers to the entries a given term occurred in (an inverted index).
The index allows performing queries, that is, searching for entries that contain a given set of
terms and sorting them according to the weights associated with these terms. In our
experiments we utilize Lucene [H], a document retrieval library that creates such indices.
Indices are essential in dkm to keep the processing efficient. Note that the index of docu-
ments is usually created anyway to allow searching in the collection and the index of cluster
labels may be reused in the future, so the overhead of introducing these two auxiliary data
structures should not be too big.
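The essential structure can be sketched as a tiny in-memory inverted index; dkm itself relies on Lucene, so this is only an illustration of the idea, not of Lucene's API:

```python
from collections import Counter, defaultdict

class TinyInvertedIndex:
    """Maps each term to its postings: the entries it occurs in, with counts."""

    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {entry_id: count}
        self.vectors = {}                   # entry_id -> Counter of its terms

    def add(self, entry_id, text):
        counts = Counter(text.lower().split())
        self.vectors[entry_id] = counts
        for term, c in counts.items():
            self.postings[term][entry_id] = c

    def search(self, query):
        """Entries containing all query terms, ranked by summed term counts."""
        terms = query.lower().split()
        hits = set(self.postings.get(terms[0], {}))
        for t in terms[1:]:
            hits &= set(self.postings.get(t, {}))
        return sorted(hits, key=lambda e: -sum(self.postings[t][e] for t in terms))
```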
Each incoming document is segmented into tokens using the heuristics implemented in
the Carrot2 framework. A unique identifier is assigned to the document, which is then added
to the index I_D.
If cluster candidate labels are to be extracted directly from the input documents, this
process takes place concurrently with document indexing. Depending on the variant of dkm,
we extract frequent phrases or noun phrases (from English documents). The resulting set of
candidate labels is added to a separate index I_P. Each candidate cluster label is indexed as
if it were a single document. To minimize the number of identical index entries, we keep a
buffer of unique labels in memory and flush them to the index in batches.
1: D ← a set of input documents
2: k ← number of “topics” expected in the input (used for k-Means).
/* Preprocessing */
3: I_D ← empty inverted index of documents;
4: I_P ← empty inverted index of candidate cluster labels;
5: for all d ∈ D do
6: I_D ← I_D ∪ {d}; /* add d to index I_D */
7: if extract candidate labels then
8: T = a set of noun phrases and/or frequent phrases extracted from d;
9: for all t ∈ T do
10: I_P ← I_P ∪ {t}; /* add t to index I_P */
11: end for
12: end if
13: end for
14: if predefined labels available then
15: for all t ∈ (a set of predefined labels) do
16: I_P ← I_P ∪ {t}; /* add t to index I_P */
17: end for
18: end if
/* Topic Detection (k-Means) */
19: for all d ∈ D do
20: f_d = extract_features(d); /* Feature extraction; mutual information, tf-idf. . . */
21: end for
22: C = initialize(D); /* Initialize cluster centroids by finding most dissimilar documents. */
23: repeat
24: τ = 0;
25: for all d ∈ D do
26: c_d = reassign_to_closest_centroid(f_d, C);
27: τ = τ + sim(f_d, c_d);
28: end for
29: for all c ∈ C do
30: update_centroid(c);
31: end for
32: until reassignments > r_min and τ > τ_min
/* Select pattern phrases */
33: P ← empty set of candidate labels;
34: F ← empty set of final pairs (description, documents);
35: for all c ∈ C do
36: q_vsm = a Boolean query for terms in c, terms boosted with the centroid's weights f_c;
37: P_c = search(I_P, q_vsm); /* Execute the query against the index of labels. */
38: for all p ∈ P_c do
39: s = length_penalty(p) × hit_score(p); /* Penalize the score with a length function. */
40: if s > s_min then
41: P = P ∪ {p}; /* Add to pattern phrases. */
42: end if
43: end for
44: end for
/* Assign documents to pattern phrases */
45: for all p ∈ P do
46: q_vsm = construct a phrase query for pattern phrase p;
47: R_p = search(I_D, q_vsm); /* Execute the query against documents. */
48: if R_p contains more than a minimum number of documents then
49: F = F ∪ {(p, R_p)}; /* Add a new cluster with label p and documents R_p. */
50: end if
51: end for
52: Return the final set of clusters F.
Algorithm 6.1: Pseudo-code of the Descriptive k-Means algorithm.
6.2.2 Dominant Topic Detection
Dominant topic detection reuses data structures already present in the index of documents
ID and runs the k-Means clustering algorithm on a sample of documents to detect dominant
topics.
Preparation of Document Vectors
Let us recall that the index contains, for each document, a vector of terms and their occur-
rence counts. Depending on the input size, we either take all documents or select a uniform
random subset, and fetch their feature vectors from the index. To speed up computations
(cosine similarities), we weight all features and then truncate each selected document's vector
to a given number of most significant terms, making document vectors even sparser.
We experimented with several feature weighting formulas: tf-idf, mutual information, mod-
ified mutual information and modified tf-idf (see Section 2.2.2 on page 25). Anticipating
the experimental results shown in the next section: the choice of a weighting formula turned
out to be of marginal importance; with the exception of pointwise mutual information, all
other strategies behaved similarly.
The limit on the number of features per document is a parameter; we experimented with
vector lengths of 30, 50, 70 and 100 features. According to the results reported in [86], the
optimal term vector length should lie somewhere within this range.
After feature weighting and truncation, the sample of documents (and their document vec-
tors) is ready for clustering using k-Means.
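The weighting-and-truncation step can be sketched as follows. This is a simplified illustration in plain Java using a basic tf-idf weight; the exact formulas from Section 2.2.2 differ, and all names here are illustrative:

```java
import java.util.*;

/** Toy tf-idf weighting and truncation of document term vectors. */
class FeatureWeighting {
    /** Weights each document's raw term counts by tf-idf and keeps the top maxFeatures terms. */
    static List<Map<String, Double>> weightAndTruncate(
            List<Map<String, Integer>> docs, int maxFeatures) {
        final int n = docs.size();
        // Document frequency of each term.
        final Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> d : docs)
            for (String t : d.keySet()) df.merge(t, 1, Integer::sum);

        final List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> d : docs) {
            // tf-idf weight for every term of the document.
            final List<Map.Entry<String, Double>> weighted = new ArrayList<>();
            for (Map.Entry<String, Integer> e : d.entrySet()) {
                double tfidf = e.getValue() * Math.log((double) n / df.get(e.getKey()));
                weighted.add(new AbstractMap.SimpleEntry<>(e.getKey(), tfidf));
            }
            // Keep only the maxFeatures most significant terms (sparser vectors).
            weighted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
            final Map<String, Double> truncated = new HashMap<>();
            for (Map.Entry<String, Double> e :
                    weighted.subList(0, Math.min(maxFeatures, weighted.size())))
                truncated.put(e.getKey(), e.getValue());
            result.add(truncated);
        }
        return result;
    }
}
```

Note how terms occurring in every document receive a zero idf weight and are pruned first, which is exactly the noise-removal effect the truncation step is meant to achieve.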
Clustering with k-Means
The variant of k-Means internally used for topic detection is characterized by the choice of
similarity measure, initialization of cluster centroids and the convergence criterion.
Similarity Measure  Cosine similarity is used to compare document vectors. This choice
is motivated by the computational efficiency needed for handling large numbers of documents.
For normalized vectors di and dj the cosine similarity simplifies to:

sim(di, dj) = cos(α) = (di · dj) / (|di| |dj|) = di · dj.    (6.1)

By representing cluster centroids as dense vectors and documents as sparse vectors, the multi-
plication of two document vectors in Equation 6.1 can be implemented in a single loop iterating
over the components of the sparse vector only (this is the reason for making feature vectors
as short as possible). The total cost of calculating the similarity between two documents is of
the order of Θ(m) floating point multiplications, where m is the number of components of the
sparse vector.
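The sparse-times-dense product can be written in a few lines (a minimal sketch; the parallel-array representation of the sparse vector is an assumption, not the representation used in dkm):

```java
/** Dot product of a sparse document vector and a dense centroid vector. */
class SparseDot {
    /**
     * The sparse vector is given as parallel arrays of component indices and values;
     * for unit-length vectors this dot product equals the cosine similarity (Equation 6.1).
     */
    static double dot(int[] indices, double[] values, double[] dense) {
        double sum = 0.0;
        // Iterate over the m non-zero components of the sparse vector only: Θ(m) multiplications.
        for (int i = 0; i < indices.length; i++) {
            sum += values[i] * dense[indices[i]];
        }
        return sum;
    }
}
```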
1: D ← a set of input documents
2: k ← number of “topics” expected in the input
3: C ← a set of k centroid vectors c1 . . . ck, initially empty
4: s ← 0.1 × |D| /* Subsample ratio. */
5: Ds ← random_sample(D, s) /* Select a random sample of size s. */
6: c1 ← average(Ds) /* The first centroid is the average of the sample. */
7: for all i = 2, 3, . . . , k do
8:   /* The next centroid is initialized to the document most dissimilar to the previous centroids. */
9:   ci ← argmin_{d ∈ Ds} ( Σ_{j=1,...,i−1} sim(d, cj) )
10: end for
11: C contains the initial centroids.
Algorithm 6.2: An algorithm used to bootstrap k-Means.
Initial State  Proper selection of the initial state is crucial in k-Means to ensure that the clusters
are diverse and truly representative. We chose to initialize the algorithm by selecting the most
mutually dissimilar documents from a subsample of the input (see Algorithm 6.2).
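The bootstrap procedure of Algorithm 6.2 can be sketched as follows (vectors as plain double arrays, dot-product similarity as in Equation 6.1 for normalized vectors; the class and method names are illustrative):

```java
import java.util.*;

/** Farthest-first initialization of k-Means centroids (a sketch of Algorithm 6.2). */
class KMeansBootstrap {
    static double sim(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Picks k initial centroids: the sample average, then repeatedly the least similar document. */
    static List<double[]> initialize(List<double[]> sample, int k) {
        final int dims = sample.get(0).length;
        // First centroid: the average of the sample.
        final double[] avg = new double[dims];
        for (double[] d : sample)
            for (int i = 0; i < dims; i++) avg[i] += d[i] / sample.size();
        final List<double[]> centroids = new ArrayList<>();
        centroids.add(avg);
        // Each next centroid: the document with the smallest summed similarity
        // to all previously selected centroids (the most dissimilar one).
        while (centroids.size() < k) {
            double[] best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (double[] d : sample) {
                double total = 0.0;
                for (double[] c : centroids) total += sim(d, c);
                if (total < bestScore) { bestScore = total; best = d; }
            }
            centroids.add(best);
        }
        return centroids;
    }
}
```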
Convergence Criterion  We used a composite convergence criterion for interrupting the
computation loop of k-Means. The computation ends when either of the conditions below is
true:
1. the global objective function (the sum of distances from documents to their centroids, see
Equation 2.6 on page 30) improves by less than the threshold τ, or
2. fewer than rmin documents are reassigned between clusters.
The objective function of k-Means is by itself monotonic (a general proof can be found
in [5], a proof specific to the cosine measure in [17]), so the algorithm always terminates at
some point. We added the second condition to trim the trailing iterations which might keep
improving the global objective function without noticeably affecting the centroid vectors.
Selection of the Desired Number of Clusters  We assumed that the number of dominant
topics to be discovered, and at the same time the number of desired clusters for the k-Means
algorithm, must be given a priori. Our decision to leave k as a parameter was partially
justified by the planned experiments, where k had to be given in advance to allow cross-
algorithm comparison of results. The second reason was that clustering is a process internal
to Descriptive k-Means: the number of topics to be discovered can be set to a fixed value,
because if there are no sensible cluster candidates, the final number of clusters is reduced
automatically anyway. Nonetheless, we are aware that this is a weak point of the entire
algorithm and that k could at least be estimated from the data using one of the known
methods.
Once k-Means converges to stable cluster centroids, they become the final result of this
phase — the representation of dominant topics expressed by a set of centroid vectors in the
term vector space.
6.2.3 Selecting Pattern Phrases and Allocating Cluster Content
Pattern Phrase Selection Pattern phrase selection is again tightly related to the data struc-
ture used throughout the algorithm — the index of phrases and documents. Recall that we
seek the closest approximation of cluster centroid vectors among candidate labels acquired
in the previous phases of the algorithm.
For each cluster centroid we assemble a list of its top ranking terms and their weights.
We then build a weighted Boolean query and execute it against the index of cluster candi-
date labels, retrieving phrases that best match the “profile” of weights of cluster centroid
terms. The intuition behind this operation is that we search for cluster label candidates that
best match the terms and weights of our dominant topic’s representation (cluster centroid
vector). Note that at this stage we are not concerned with the order of words or their prox-
imity — cluster label candidates that match the query, but have no coverage in the set of
input documents will be pruned later at the allocation phase anyway.
To clear away any doubts: the query that collects matching candidate cluster labels is of
course not written explicitly; it is constructed programmatically in Descriptive k-Means. However,
it is much like a document retrieval query and can be represented visually as a list of alterna-
tives with a numeric boost associated with each term, as shown below.
java(0.52) OR coffee(0.24) OR island(0.12) OR language(0.09) OR . . .
A query like the one shown above fetches an ordered list of candidate cluster labels for a
given dominant topic. Each cluster label has a score, calculated as its relevance to the
query by the underlying document retrieval system. Let us denote the score of a given can-
didate label p as query_score(p). At this point we can influence the process of cluster label
selection by allowing the user to express a preference for the expected cluster description
length (recall the conciseness requirement, page 43). We do so by adjusting the score of each
label p, penalizing it for being longer or shorter than the desired length of m terms. The
penalty function is a simple bell-shaped curve, shown in Equation 6.2:
length_penalty(p) = exp( −(length(p) − m)² / (2d²) ),    (6.2)
where length(p) is the number of terms in p and d controls the strictness of the penalty. The
penalized score of a candidate label then becomes:
score(p) = query_score(p) × length_penalty(p).    (6.3)
In our experiments we used fixed values of m = 4 and d = 8; Figure 6.2 illustrates
the shape of this particular function.
In the last step, the set of re-scored candidate labels is sorted again and the highest scor-
ing elements become pattern phrases for a given dominant topic.
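Equations 6.2 and 6.3 translate directly into code. A minimal sketch with the values m = 4 and d = 8 used in our experiments (names are illustrative):

```java
/** Length penalty and re-scoring of candidate labels (Equations 6.2 and 6.3). */
class LabelScoring {
    static final int m = 4;    // desired label length in terms
    static final double d = 8; // penalty strictness

    /** Bell-shaped penalty: equals 1.0 at the desired length, decays for longer/shorter labels. */
    static double lengthPenalty(int length) {
        final double diff = length - m;
        return Math.exp(-(diff * diff) / (2 * d * d));
    }

    /** Penalized score of a candidate label (Equation 6.3). */
    static double score(double queryScore, int length) {
        return queryScore * lengthPenalty(length);
    }
}
```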
Figure 6.2: Shape of the phrase length penalization function from Equation 6.2. The desired
phrase length is m = 4 in this example; strictness d is plotted for 4 (innermost curve), 6, 8
(solid line), 10 and 12 (outermost curve).

Allocation of Documents  To allocate documents to pattern phrases, we use a procedure very
similar to the one used to fetch pattern phrases. Recall the possible ways of implementing
the allocation of documents in dcf (pages 43 and 52). In dkm, we implement a search for
documents with possible distortions of the pattern phrase.
For each pattern phrase we build a query against the index of documents ID . This time, how-
ever, the query is not a simple Boolean alternative; we use a phrase query with a given slop
factor. Both concepts are borrowed from the Lucene project and are explained below.
A phrase query matches documents containing all terms of the query, but allows reordering
and injection of other words in between the query's terms. The slop factor controls how
mangled the phrase can be for the document to still be considered relevant to the query. For exam-
ple, a phrase query [Bolek, Lolek] with a large enough slop factor should cover documents
containing phrases such as Bolek i Lolek but also Lolek oraz Bolek, but not a document where
Bolek and Lolek are too far apart.
The slop factor is technically defined as the difference in positions between the two terms
maximally moved out of their original positions in the phrase. This is best explained using
an example, see Example 6.1 on the following page. Formally, the slop factor for a query p
containing ordered terms t1, t2, . . . , tn and a document d with the query's terms at
positions d1, d2, . . . , dn is defined as:

slop(p, d) = max_i (d_i − i) − min_i (d_i − i).    (6.4)
The phrase query for a given pattern phrase returns a list of relevance-ordered documents.
Note that exact matches are scored higher than “sloppier” matches; documents containing
a more compact copy of the pattern phrase (and preserving the original order of its terms)
are scored higher and end up at the top of the cluster's document list. By performing phrase
queries instead of simple Boolean retrieval, we try to fulfill the transparency requirement:
we prefer documents that contain an exact copy of the pattern phrase. The decision how
much distortion (and hence confusion) is allowed between cluster labels and documents is
controlled by the slop factor, and its exact setting is left to the user. To conclude the
discussion of the slop factor: our experience shows that this tuning parameter can be set
automatically as a function of the pattern phrase's length. In fact, we use a dynamic
Consider a phrase query with four terms: a b c d. The term vector of this phrase can be written down
as:

term  a  b  c  d
i     0  1  2  3

Assume two short documents “match” this phrase:
• a c b d (reordered terms),
• a b x c d (a non-phrase term inside).
For the first document we have the following term positions and positional differences with respect to
the query phrase:

term    a   c   b   d
di      0   1   2   3
di − i  0  −1   1   0   →  slop = 1 − (−1) = 2.

A similar calculation follows for the second document:

term    a   b   x   c   d
di      0   1   2   3   4
di − i  0   0   -   1   1   →  slop = 1 − 0 = 1.

Example 6.1: Example calculation of the slop factor between a phrase and two short docu-
ments.
slop factor in our experiments, calculated with the formula in Equation 6.5:
slop_factor(p) = 4 + 2 × length(p).    (6.5)
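The slop computation of Equation 6.4 and the dynamic slop factor of Equation 6.5 can be sketched as follows (a minimal illustration; the positions of the phrase terms within the document are assumed to be given):

```java
/** Slop between a phrase and a document (Equation 6.4) and the dynamic slop factor (Equation 6.5). */
class Slop {
    /**
     * positions[i] is the position in the document of the i-th phrase term;
     * the slop is the spread of the positional differences d_i - i.
     */
    static int slop(int[] positions) {
        int max = Integer.MIN_VALUE, min = Integer.MAX_VALUE;
        for (int i = 0; i < positions.length; i++) {
            final int diff = positions[i] - i;
            max = Math.max(max, diff);
            min = Math.min(min, diff);
        }
        return max - min;
    }

    /** Dynamic slop factor used in our experiments, growing with the pattern phrase length. */
    static int slopFactor(int phraseLength) {
        return 4 + 2 * phraseLength;
    }
}
```

Replaying Example 6.1: for the query a b c d, the document a c b d places terms a, b, c, d at positions 0, 2, 1, 3 and yields slop 2, while a b x c d places them at 0, 1, 3, 4 and yields slop 1.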
In the final step of the document allocation phase we remove those pattern phrases
which did not collect enough documents. The remaining pattern phrases and their asso-
ciated documents become the algorithm's final result.
6.3 Dealing With Large Instances: Implementation Details
A number of elements of Descriptive k-Means have been designed in anticipation of an efficient
implementation. The key bottlenecks are:
• extraction and storage of documents and unique candidate cluster labels,
• k-Means clustering,
• searching for candidate labels matching abstract topics,
• searching for documents matching selected pattern phrases.
In this section we discuss implementation techniques and design choices we used to over-
come the above problems.
Table 6.1: The structure (fields) of the document index and the candidate labels index.

Index                     Field     Store  Inv. index  Positions  Description
Cluster candidate labels  label     Yes    No          No         Real representation of the label.
                          keywords  No     Yes         Yes        The label's model (keywords).
Documents                 id        Yes    No          No         Document identifier.
                          terms     No     Yes         Yes        Document terms.
6.3.1 Data Storage and Retrieval
The algorithm relies heavily on the data structures and document retrieval model present in a
typical search engine. We implemented dkm around Lucene, an open source indexing and
document retrieval library [H]. Lucene provides efficient implementations of algorithms
for building inverted indices and storing term vectors. We replaced the default input parser
with our own from the Carrot2 project.
Technically, a new incoming document triggers two separate threads. One extracts can-
didate labels and adds them to an internal in-memory buffer of unique candidate labels,
occasionally flushing them to cluster label candidates index. The other thread adds the doc-
ument to an index of documents.
The internal structure of the fields stored in the two core indices is shown in Table 6.1. In
short, we tokenize and index the terms of documents and of cluster candidate labels. The index
of documents is enriched with positional information to allow running phrase queries, but it
does not store the full content of documents (merely their identifiers and an inverted index
of terms). Conversely, in the index of candidate labels IP we do not need the positional
information, but we store the full cluster labels because they are needed later, when a given label
becomes a pattern phrase.
Extraction of candidate labels is the most time-consuming operation. We use our
own implementation of suffix trees for detecting frequent phrases and an external noun
chunker for English, the MontyLingua library [C]. In our experiments, the time for pars-
ing a single document ranged from a few milliseconds for extracting frequent phrases (using
suffix trees) to a good few seconds for noun phrase chunking in English (Pentium III, 1.1
GHz). The latter result seems almost prohibitive but, as we pointed out, parsing is very easy
to parallelize: each document can be processed by a separate processing node. The en-
tire index of candidate cluster labels can also be reused in subsequent algorithm runs or can
be prepared a priori from an existing ontology.
6.3.2 Clustering
The computational complexity usually reported for k-Means is of the order of O(k × t × n), where
k is the number of clusters, t the number of iterations needed to reach convergence and n the
number of clustered objects. This estimate is obviously simplified: it does not take into
account the cost of atomic operations, which in the case of feature vectors can skew the result
significantly. David Arthur and Sergei Vassilvitskii recently reported a superpolynomial
worst-case lower bound of 2^Ω(√n) for k-Means [3]. Nonetheless, the algorithm's
behavior in most applications is satisfying and subsampling or parallelization techniques
permit its application to enormous data sets [31, 16].
We implemented k-Means from scratch, reusing the document term vectors already present
in Lucene's index. We first choose a uniform random sample of the required size out of all
documents stored in the index and fetch the document vectors of the selected documents for
feature weighting. This is the most memory-demanding element of the entire procedure, but
we can adjust the sample size to limit memory consumption.
After feature weighting, we truncate document vectors to a certain length, remembering
that the sparsity of feature vectors provides substantial gains when calculating cosine similarity
(see the discussion on page 69). This step also lets us ignore a certain amount of noisy terms
present in documents.
The main clustering routine uses the preprocessed document vectors of the sample and
runs entirely in main memory for efficiency. All vectors are normalized to the unit sphere to
optimize vector operations such as additions and dot products (we avoid certain divisions
and the calculation of vector norms).
As a result, the clustering routine easily handles thousands of documents and clusters
them in a few seconds on commodity hardware.
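The normalization trick can be illustrated in a few lines (a minimal sketch: once vectors are scaled to unit length, a plain dot product alone yields the cosine similarity, so no norms or divisions are needed per comparison):

```java
/** Normalizing vectors to unit length so that a plain dot product equals cosine similarity. */
class UnitNormalization {
    /** Scales v to unit Euclidean length in place (done once per vector). */
    static void normalize(double[] v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }

    /** For unit vectors this is already the cosine similarity of Equation 6.1. */
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```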
6.3.3 Searching for Pattern Phrases
Searching for candidate labels is performed directly using the document retrieval model im-
plemented by Lucene. We build a Boolean query from the terms of a dominant topic's vector rep-
resentation, boosting each term with its corresponding weight. Listing 6.1 on the next
page shows the code fragment responsible for this process.
6.3.4 Searching for Documents Matching Pattern Phrases
Searching for documents is again performed using Lucene’s built-in query type, a phrase
query. The actual implementation of document assignment to a pattern phrase varied a bit
between the baseline dkm version described in Section 6.2.3 and the one we used for exper-
iments in Chapter 7. In the experiments, we had to keep a predefined number of output
partitions, so instead of allocating documents to each pattern phrase, we searched for all
documents matching a union of phrase queries of all pattern phrases selected for a single
dominant topic. The code fragment implementing this behavior is shown in Listing 6.2 on
the following page.
6.4 Computational Complexity
Estimating computational complexity of the entire Descriptive k-Means is difficult. A great
deal depends on the method used to extract candidate cluster labels. Linguistic shallow pre-
processing algorithms rarely specify computational complexity and are often heuristic. Our
experiments show that even with parallelization this phase may be the most time consum-
ing element of the entire algorithm.
// Search for pattern phrases.
BooleanQuery query = new BooleanQuery();
for (int j = 0; j < Math.min(100, featureVector.size()); j++) {
    final TermQuery tk = new TermQuery(
        new Term("keywords", featureVector.get(j).feature));
    tk.setBoost((float) featureVector.get(j).weight);
    query.add(tk, BooleanClause.Occur.SHOULD);
}
final Hits hits = searcher.search(query);

// Hits contains raw pattern phrases. Rescore, taking into account phrase length.
final ArrayList<ScoredPhrase> phrases = new ArrayList<ScoredPhrase>();
for (int j = 0; j < hits.length(); j++) {
    final Document doc = hits.doc(j);
    final String label = doc.get("label");
    final double score = hits.score(j);
    final int length = label.split("[\\t\\ ]").length;
    final double penalty = Math.exp(
        - (length - optimalPhraseLength) * (length - optimalPhraseLength)
        / (2 * optimalPhraseLengthDev * optimalPhraseLengthDev));
    phrases.add(new ScoredPhrase(label, score * penalty));
}

// Sort pattern phrases for this cluster according to the final score.
Collections.sort(phrases, new PhraseScoreComparator());
Listing 6.1: A code fragment building a query for retrieving pattern phrases.
for (final ScoredPhrase p : phrases) {
    final BooleanQuery query = new BooleanQuery();
    final PhraseQuery pq = new PhraseQuery();
    final String [] keywords = p.keywords;
    pq.setSlop(4 + keywords.length * 2);
    for (final String term : keywords) {
        pq.add(new Term("terms", term));
    }
    query.add(pq, Occur.MUST);

    // Search the index for matching documents.
    final Hits hits = docSearcher.search(query);
    if (hits.length() < MIN_DOCUMENTS_PER_PHRASE_CLUSTER) {
        continue;
    }
    for (int j = 0; j < hits.length(); j++) {
        documentsSet.add(hits.doc(j).get("id"));
    }
}
Listing 6.2: A code fragment implementing selection of a union of documents for pattern
phrases of a single dominant topic.
Cluster label discovery aside, the overall complexity of the algorithm is bound by k-Means.
Assuming the complexity estimation given by Arthur and Vassilvitskii [3] is correct, Descrip-
tive k-Means is at least of the order of O(n^(2+2/d) (D/δ)^2 2^(2n/d)), where δ is the smoothness
factor, D is the diameter of the point set, n is the number of documents and d is the number of
dimensions of the feature space. We believe that encountering this pessimistic scenario in
practice is unlikely. More importantly, a stop criterion waiting for full convergence of
k-Means is rarely used and is replaced with a practical trick limiting the number of
iterations of the algorithm. This reduces the complexity to O(i × n × |K|). The use of docu-
ment sampling techniques can reduce the problem size even further.
6.5 Summary
Strong Points
• We have shown a derivative of k-Means that attempts to solve the descriptive clus-
tering problem by following the dcf approach. The dkm algorithm internally uses a
term-based document similarity model and clustering algorithm, enabling it to over-
come certain problems present in inflectional languages, while at the same time ensuring
that cluster labels are comprehensible and their relationship to the documents trans-
parent.
• Thanks to the use of data structures known in document retrieval (inverted indices,
queries), the algorithm is scalable to large numbers of documents and efficient in
practice.
• Preprocessing, although costly, can be parallelized easily. The index of cluster label
candidates can be reused on subsequent algorithm runs.
Weak Points
• The algorithm requires an explicit initial number of predicted topics for the internally
used k-Means algorithm. Although the final number of clusters may differ, this
explicit parameter should ideally be estimated from the data.
• The algorithm creates a flat structure of clusters. This is in fact a property of the
dcf approach in general. A hierarchy can of course be induced later, but it would be
desirable to have a hierarchical clustering algorithm from the start.
Fulfillment of Requirements
• Comprehensibility and Conciseness — We demonstrated the use of two different meth-
ods of selecting cluster label candidates: frequent phrases and chunks (noun phrases).
While frequent phrases have the same characteristics as previously in Lingo, the use
of noun phrases allows us to limit cluster label candidates to comprehensible and
concise entries (or rather: to increase the likelihood of selecting good cluster label
candidates, since we rely on approximate methods of identifying chunks). The user
can control the preferred length of cluster labels by adjusting the phrase penalty
function.
• Transparency — Transparency of the relationship between cluster labels and the docu-
ments assigned to them is ensured by the use of phrase queries. We allow certain dis-
tortions of the pattern phrase (such as word injections or reordering) and let the user
control them using the slop factor.
• Clusters Structure — Cluster diversity and consistency is presumed to result from
using k-Means for dominant topic detection. While the predicted number of topics
needs to be given a priori (which we recognize as a weak point), the number of output
clusters may differ depending on how many pattern phrases are selected and collect
enough documents. In line with our expectations, the algorithm can produce
overlapping clusters and leave documents unassigned.
Chapter 7
Evaluation
This chapter presents the results of experiments evaluating Lingo and Descriptive k-Means.
7.1 Evaluation Scope and Goals
The goals of descriptive clustering differ slightly from those of typical clustering: we try to
find coherent, properly described groups of documents, not just groups of documents. This
particular aspect seemed entirely resistant to any kind of objective assessment, and
we knew that finding a way of performing the evaluation would be difficult.
A user survey is practically the only way of evaluating the quality of cluster labels. We
were, however, very hesitant about using surveys; recall the reasons already discussed in Sec-
tion 2.5. In the end we decided to look at the problem of evaluation from another angle.
The key advantage of dcf is that it provides more comprehensible cluster labels as a conse-
quence of how the algorithms are built — hopefully with meaningful cluster label candidates
from the start. By showing that the document clustering quality does not degrade, while at the
same time knowing that a dcf-based algorithm should be able to explain its results more clearly,
we can indicate its advantage.
To summarize, the evaluation presented in this chapter has two different aspects.
• The aspect of clustering quality, measured as conformity to a predefined structure of
classes (comparison against a set of given classes, the ground truth).
• The utilitarian value of the concepts presented in this thesis and published in the
Carrot2 framework. The system has been available as an open source project for a few
years, so we have a good perspective on who has been using it and how. We present
the feedback we received from its users.
7.2 Experiment 1: Clustering in Lingo
In the first experiment, we compared the clustering quality of Lingo against the benchmark
algorithm — Suffix Tree Clustering (stc) [104, 105, 106]. We were interested in the structure
of returned snippet groups (we will refer to snippets as documents in the remaining part of
this chapter). Specifically, we asked the following questions:
• Is Lingo able to cluster similar documents? If so, what is the algorithm’s performance
for search results containing unrelated and closely related documents?
• Is Lingo able to highlight outliers, defined as minor subsets of documents sharing a
common topic but unrelated to the majority of the input?
• Is Lingo able to capture cross-topic relationships (generalizations) among closely re-
lated subjects?
• Are cluster labels meaningful with respect to the topic they supposedly represent?
• What are key differences between clusters created by Lingo and stc?
We used two different ways of inspecting the results. In the first analysis the results for
a few synthetic data sets were inspected and evaluated manually. The second analysis at-
tempted to measure the distribution of original ground truth partitions in the final result.
7.2.1 Test Data and Experiment’s Setting
The set of ground truth partitions was acquired from document groups present in the Open
Directory Project (odp) [I]. The Open Directory Project is a tree-like, human-curated thematic
directory of Internet resources. Each branch of this tree, called a category, represents
a single topic and contains links to related resources on the Internet. Every link added to the
odp must be accompanied by a short description of the resource it points to (25–30 words).
We assumed these short descriptions could serve as a substitute for snippets because of their
similar length.
We selected 10 categories out of the approximately 575,000 present in the odp database. The
exact choice of categories was random, subject to the constraints that each one contained at
least 10 documents and had a meaningful English description of each document inside. We
decided to perform the experiment on categories with English documents because stc is known
to have problems when clustering Polish search results [89].
The selected categories were related to four subjects: movies (2 categories), health care
(1), photography (1) and computer science (6) (see Table 7.1). We assumed that documents
within each category should have enough in common to be linked into a cluster. To verify
how the algorithms would handle the separation of similar but not identical topics, some
categories were drawn from one parent branch of the odp and the document count between
categories varied significantly. For example, there were four categories related to various
database systems.
To address the experiment’s questions, we mixed the original categories into 7 test sets,
each one aimed to verify a certain aspect of the clustering algorithms under consideration.
Table 7.2 on the next page lists the test sets, their content (categories) and rationale. We
used default values of parameters for both algorithms (as implemented in the Carrot2 frame-
work).
Table 7.1: odp categories selected for the experiment.

Category  Number of  Description of the Contents
Name      Documents
BRunner   77         Information about the Blade Runner movie.
LRings    92         Information about the Lord of the Rings movie.
Ortho     77         Orthopedic equipment and manufacturers.
Infra     15         Infrared photography references.
DWare     27         Articles about data warehouses (integrator databases).
MySQL     42         The MySQL database.
XMLDB     15         Native XML databases.
Postgr    38         The PostgreSQL database.
JavaTut   39         Java programming language tutorials and guides.
Vi        37         The Vi text editor.
Table 7.2: Merged test data sets, their categories and rationale.

Id  Categories                                Rationale (hypothesis to verify)
G1  LRings, MySQL                             Can Lingo separate two unrelated categories?
G2  LRings, MySQL, Ortho                      Can Lingo separate three unrelated categories?
G3  LRings, MySQL, Ortho, Infra               Can Lingo separate four unrelated categories, one
                                              significantly smaller than the rest (Infra)?
G4  MySQL, XMLDB, DWare, Postgr               Can Lingo separate four similar, but not identical
                                              categories (all related to databases)?
G5  MySQL, XMLDB, DWare, Postgr,              Can Lingo separate four very similar categories
    JavaTut, Vi                               (databases) and two distinct, but loosely related ones
                                              (computer science)?
G6  MySQL, XMLDB, DWare, Postgr, Ortho        Outlier highlight test: four dominating, conceptually
                                              close categories (databases) and one outlier (Ortho).
G7  All categories                            All categories mixed together. Can Lingo generalize
                                              categories (into movies, databases)?
7.2.2 Output Clusters Structure and Quality
In the first part of the evaluation, we analyzed the distribution of original categories in clusters returned by Lingo for each input test set.

Before we comment on the results, let us describe how the distributions are visualized graphically (see Figure 7.3 on page 84 for an example). Each bar on a chart represents a single cluster (with a fragment of the cluster's label on the horizontal axis) and each color represents a single original class (partition of the data). Ideally, coherent clusters representing single topics should be of solid color (originate from a single category). Clusters are sorted according to their final score, starting with the strongest clusters on the left side of each chart. Solid bars of alternating colors should therefore appear to the left; this would mean that clear, coherent clusters representing different classes were discovered properly. The vertical axis represents the number of documents in clusters.
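The distributions shown in these charts boil down to a per-cluster histogram of the original classes. A sketch of the underlying computation (the names are illustrative, not taken from the actual evaluation code):

```python
from collections import Counter

def class_distribution(clusters, true_class):
    """For each cluster (a list of document ids), count how many of its
    documents come from each original class -- the data behind one
    stacked bar in the charts. A solid-color bar corresponds to a
    Counter with a single key."""
    return {label: Counter(true_class[d] for d in docs)
            for label, docs in clusters.items()}
```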
For each test set, Lingo created between 24 and 36 clusters. Unrelated topics (tests G1–G3) were properly separated and represented. The LRings category in test G1 has a much more complex internal structure and spreads into more clusters, but the MySQL clusters are present and not mixed with any other category (Figure 7.3 on page 84). A similar situation can be observed in test G3, where the four input categories are represented in the top five clusters. Even though category Infra was much smaller, it is still highlighted as a separate subject.
Topics were successfully separated even for conceptually similar categories in test G4 (Figure 7.4 on page 85). Note, however, the cluster labeled MySQL Server: only half of its documents come from the MySQL category. This indicates a misassignment problem in Lingo's cluster content discovery phase, which is also present in the cluster labeled Information on infrared in test G7. Documents containing any terms from the cluster's label end up inside it, so the Information on infrared cluster matches documents containing the terms Information and infrared, but not necessarily both of them. For example, a document containing the phrase Information and Images could be found in that cluster even though it originated from the LRings category.
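The misassignment described above stems from matching documents against individual label terms instead of the whole phrase. A toy illustration of the difference (a hypothetical matcher, not Lingo's actual code):

```python
def matches_any_term(doc, label):
    # loose matching, as described above: a document joins the cluster
    # if it contains ANY term of the label
    return any(t in doc.lower().split() for t in label.lower().split())

def matches_phrase(doc, label):
    # stricter alternative: the whole label must occur as a phrase
    return label.lower() in doc.lower()
```

With these definitions, a LRings document such as "Information and Images from Middle Earth" is accepted by the loose matcher for the label "Information on infrared" (it shares the term Information), while the phrase matcher rejects it.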
The outlier test G6 was handled correctly (Figure 7.5 on page 86): the Ortho category was not obscured by database-related documents and was highlighted at the top positions of the group ranking. The same held for smaller categories, for example Infra in test G7, or XMLDB in test G5. Interestingly, category XMLDB vanished from the results of tests G6 and G7, with some of its documents assigned to other database-related clusters. We suppose Lingo did not separate XMLDB because it was too close to the other database-related categories and at the same time too small to create a separate conceptual group during label discovery.

Lingo captured some of the cross-topic relationships. For example, in G7 (Figure 7.6 on page 87), the cluster labeled Movie review nicely combined documents from the LRings and BRunner categories. Similarly in G4, the clusters SQL and Tables combined documents spanning different categories.
In spite of minor noise groups, Lingo's clusters seemed more meaningful than stc's. We think that the selection of group labels in stc, based solely on common phrase frequency, is inferior to Lingo's dcf-based one. For example, in test G6, the outlier category Ortho was dominated by database-related documents. stc also failed to clearly separate topics in test G7, choosing common terms for group labels and mixing all categories based on frequent, but meaningless words like used or site; see Figure 7.5 on page 86 and Figure 7.6 on page 87.
A look at cluster labels showed that, in the case of Lingo, they were mostly meaningful and comprehensible. Figure 7.1 shows the labels of the topmost clusters. The few questionable descriptions consisted mostly of single terms, which were either ambiguous (as in free) or very broad (as in News). Several elliptical cluster labels were also extracted (Information on Infrared /photography/ ). But even with these errors, a cross comparison of cluster labels for the same data set clearly shows the advantage of Lingo over stc. Figure 7.2 demonstrates such a comparison for test G7. Lingo tends to pick longer and thus more meaningful phrases, while stc prefers single terms that often turn out to be very generic or simply junk (written, site, database).
G1: Fan Fiction Fan Art, Images Galleries, MySQL, Wallpapers, LOTR Humor, Links, Middle Earth, Special Report
G2: News, MySQL Database, Images Galleries, Foot Orthotics, Lord of the Rings Wallpapers, Information on the Films, Lotr humor, Orthopedic support
G3: MySQL, News, Information on Infrared, Images Galleries, Foot Orthotics, Lord of the Rings Movie, Orthopedic Products, Humor
G4: Federated Data Warehouse, Xml Database, Postgresql Database, Mysql Server, Intelligent Enterprise Magazine, Web Based, Postgres, Tables
G5: Java Tutorial, Vim Page, Federated Data Warehouse, Native Xml Database, Web, PostgreSQL Database, MySQL Server, Free
G6: Mysql Database, Federated Data Warehouse, Foot Orthotics, Orthopedic Products, Access Postgresql, Web, MySQL Server, Medical
G7: Blade Runner, Mysql Database, Java Tutorial, Lord of the Rings, News, Movie Review, Information on Infrared, Data Warehouse

Figure 7.1: Lingo's cluster labels of the topmost clusters.
      Cluster Description
 #    STC                                    Lingo
 1    xml, native, native xml database       Blade Runner
 2    includes                               Mysql Database
 3    blade runner, blade, runner            Java Tutorial
 4    information                            Lord of the Rings
 5    dm, dm review article, dm review       News
 6    used                                   Movie Review
 7    database                               Information on Infrared
 8    ralph, article by ralph kimball        Data Warehouse
 9    mysql                                  Image Galleries
10    articles                               BBC Film
11    written                                Vim Macro
12    site                                   Web Site
13    cast                                   Fan Fiction Fan Art
14    dm review article by douglas hackne    Custom Orthotics
15    review                                 Layout Management
16    characters                             DBMS Online

Figure 7.2: Cluster labels in stc and Lingo, test G7.
[Bar charts: LINGO, test G1 (classes: lord of the rings, mysql; top clusters: Fan Fiction Fan Art, Images Galleries, MySQL, Wallpapers) and LINGO, test G3 (classes: lord of the rings, mysql, orthopedic, infrared photography; top clusters: MySQL, News, Information on Infrared, Images Galleries, Foot Orthotics).]
Figure 7.3: Partition distribution in clusters, Lingo algorithm, simple unrelated categories
separation tests G1 and G3. Number of documents on the vertical axis, clusters on the hori-
zontal axis (the first cluster on the left side).
[Bar charts: LINGO, test G4 (classes: mysql, data warehouses articles, native xml databases, postgres; top clusters: Federated Data Warehouse, Xml Database, Postgresql Database, Mysql Server) and LINGO, test G5 (the same classes plus java tutorials and vi; top clusters: Java Tutorial, Vim Page, Federated Data Warehouse, Native Xml Database).]
Figure 7.4: Partition distribution in clusters, Lingo algorithm. Similar category separation
test G4 and related category separation test G5. Number of documents on the vertical axis,
clusters on the horizontal axis (the first cluster on the left side).
[Bar charts: test G6 (classes: mysql, data warehouses articles, native xml databases, postgres, orthopedic), Lingo above and stc below. Lingo's top clusters: Mysql Database, Federated Data Warehouse, Foot Orthotics, Orthopedic Products; stc's top clusters are phrase fragments such as "xml, natively, native xml database" and "dm, dm review article, review".]
Figure 7.5: Partition distribution in clusters. Outlier highlight test G6. Lingo algorithm
above, stc below. Number of documents on the vertical axis, clusters on the horizontal axis
(the first cluster on the left side).
[Bar charts: test G7, all classes mixed (blade runner, infrared photography, java tutorials, lord of the rings, mysql, orthopedic, vi, data warehouses articles, native xml databases, postgres); Lingo above, stc below. Lingo's top clusters: Blade Runner, Mysql Database, Java Tutorial, Lord of the Rings; stc's top clusters include "xml, native, native xml database", includes, "blade runner, blade, runner" and information.]
Figure 7.6: Partition distribution in clusters. All categories mixed together; test G7. Lingo
algorithm above, stc below. Number of documents on the vertical axis, clusters on the hor-
izontal axis (the first cluster on the left side).
7.2.3 Analysis of Cluster Contamination
A comparison of similarity between two cluster structures can be performed in a number of ways; we mentioned some of them in Section 2.5. Most of these methods aggregate similarity into a single figure over all clusters produced by the algorithm. Our main interest was quite the opposite: measuring the quality of individual clusters, starting with those that the user sees first.

None of the existing quality evaluation formulas seemed to fully satisfy our needs. The f-measure penalizes clusters that are incomplete subsets of documents from a single original partition. Average cluster purity takes into account only the dominating document subset in each cluster, even though it makes a difference whether the remaining documents are a mixture of different partitions or come from a single one.
We eventually came up with our own measure of similarity that captures our intuition of what a good cluster should be like: cluster contamination. A cluster is considered pure if it consists of all, or a subset, of the documents from a single original category. Cluster contamination for a cluster ki is defined as the number of pairs of objects found in the same cluster ki, but not in any of the original partitions, divided by the worst-case scenario: the maximum number of such bad pairs in ki, considering its size. For pure clusters, the cluster contamination measure equals 0. When a cluster consists of documents from more than one partition, the contamination is between 0 and 1. The worst case is an even mix of documents from the original partitions, when the cluster is said to be fully contaminated and the measure equals 1. The full definition of the cluster contamination measure is given in Appendix A on page 116.
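The definition above can be sketched directly in code. This is an illustrative approximation, not the exact formula from Appendix A; in particular, we assume here that the worst case is obtained by splitting the cluster's documents as evenly as possible across all original partitions:

```python
from itertools import combinations

def contamination(cluster_labels, n_partitions):
    """Sketch of the cluster contamination measure: the fraction of
    'bad' document pairs (same cluster, different original partition)
    relative to the worst case, an even mix of the cluster's documents
    across the original partitions."""
    n = len(cluster_labels)
    if n < 2:
        return 0.0
    # pairs of documents in the cluster that come from different partitions
    bad = sum(1 for a, b in combinations(cluster_labels, 2) if a != b)
    # worst case: documents split as evenly as possible over all partitions
    q, r = divmod(n, n_partitions)
    sizes = [q + 1] * r + [q] * (n_partitions - r)
    worst = n * (n - 1) // 2 - sum(s * (s - 1) // 2 for s in sizes)
    return bad / worst if worst else 0.0
```

A pure cluster yields 0, an even two-partition mix yields 1, and intermediate mixes fall in between.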
Figure 7.7 presents the contamination measure of clusters acquired from Lingo and stc for the input test set G7. One can observe that Lingo creates purer clusters than stc, especially at the top of the cluster ranking. stc is also unable to distinguish between frequent junk terms and important phrases. For example, see the clusters labeled includes or information: essentially common words with no meaning specific to any partition, and hence a high contamination. Lingo also produces contaminated clusters, such as those labeled Web site or Movie review, but these can be explained with knowledge of the input data set (Web site is a common phrase in the Open Directory Project) or understood (the movie review cluster is contaminated because it merges documents from two categories, LRings and BRunner). This latter example shows that a blind analysis of purely numerical quality indicators can sometimes lead to incorrect conclusions; the movie review cluster is a sensible generalization of two original categories, even though it was marked as contaminated.

We omit the remaining test sets because the results and conclusions were very similar. A simple average aggregation of the contamination measure over all clusters also indicates that Lingo has lower overall contamination.
7.2.4 Summary and Conclusions
While it seems clear that manual inspection of a few input data sets does not provide reli-
able evidence of Lingo’s superiority, we believe the results are convincing enough to risk the
[Charts: per-cluster contamination (0–1) and cluster size for test set G7; Lingo above, stc below. Lingo's top-ranked clusters (Blade Runner, Mysql Database, Java Tutorial, Lord of the Rings) have contamination values of 0, 0.471, 0.183 and 0 respectively, while several of stc's top clusters (includes: 0.898, information: 0.88, used: 0.965) are almost fully contaminated.]
Figure 7.7: Contamination measure and cluster size for test set G7. Cluster labels have been truncated to fit the chart. Clusters are ordered on the horizontal axis from the most important clusters (left side) to the least significant ones (right). The stc algorithm has no explicit "Others" group.
following conclusions:
• The dcf approach implemented in Lingo clusters similar documents together and picks up diverse topics present in the input data. Output clusters are not always complete copies of the input partitions, but they are rarely complete noise, as is often the case with stc.
• Cluster labels discovered by Lingo are sensible and rarely denote common phrases, unlike with stc, where even with threshold tuning we were not able to filter out nonsensical descriptions.
7.3 Experiment 2: Clustering in Descriptive k-Means
In Section 4.3 we mentioned two potential elements degrading clustering quality in dcf: approximation of the "ideal" model of dominant topics with pattern phrases, and document allocation to pattern phrases instead of the original topics. In this experiment we tried to verify whether this degradation really takes place. Clustering quality in this section is again understood as the difference between a ground truth set of classes and the clustered result returned by the algorithm. The comprehensibility of cluster labels is not taken into account directly, but we will take a tentative look at them and show a few example descriptions of clusters returned by the algorithm.

The experiment is about assessing aggregate quality rather than the quality of individual results: we measure the average performance of clustering with several quality indicators over a range of configuration parameters. The point is to establish whether an algorithm following dcf is indeed of poorer quality compared to the baseline clustering. Descriptive k-Means provides a good testbed for this, because we can reuse its internal k-Means implementation as a reference clustering algorithm.
The goals of the experiment can be summarized by the following questions.
• Description Comes First has two elements that may degrade clustering quality: ap-
proximation of dominant topics with pattern phrases and document assignment to
pattern phrases. Does dcf indeed degrade clustering quality and how?
• Given two methods of candidate cluster label extraction — one extracting frequent
phrases, the second extracting noun phrases — is the quality of clustering different
and how does it change?
• Does subsampling of documents used internally in k-Means during dominant topic
discovery change the quality of results and how?
7.3.1 Test Data and Experiment’s Setting
Ground Truth Data Set
To have a cross-comparison with previous research, we decided to use the widely known 20-newsgroups [A] data set, which consists of approximately 20000 mailing list messages partitioned nearly evenly into 20 groups, each corresponding to a different topic (although some topics are closely related). Figure 7.8 shows the headers of the newsgroups in the 20-newsgroups data set and their relation to each other. We chose one of the typical subsets of the original data set, with redundancies and empty documents removed, called a "bydate split" and consisting of 18941 documents.

comp.graphics                rec.autos               sci.crypt
comp.os.ms-windows.misc      rec.motorcycles         sci.electronics
comp.sys.ibm.pc.hardware     rec.sport.baseball      sci.med
comp.sys.mac.hardware        rec.sport.hockey        sci.space
comp.windows.x

misc.forsale                 talk.politics.misc      talk.religion.misc
                             talk.politics.guns      alt.atheism
                             talk.politics.mideast   soc.religion.christian

Figure 7.8: Groups of messages in the 20-newsgroups data set. Related groups are in the same box (source: [A]).
We assumed that each original newsgroup in the data set should be reconstructed as a
cluster in the output clustering.
Algorithms
We selected three algorithms for the evaluation:
• Descriptive k-Means with English noun phrases as label candidates,
• Descriptive k-Means with frequent phrases as label candidates,
• pure k-Means.
Descriptive k-Means appears in two variants because we wanted to see whether a different candidate label selection affects clustering quality. Both noun phrases and frequent phrases were extracted from the input data set (noun phrases with MontyLingua, frequent phrases using a suffix tree). The "baseline" k-Means implementation was identical to the one used internally in Descriptive k-Means (to select dominant topics). Note that the goal of the experiment was to see whether dcf introduces any changes to the quality of clustering compared to a baseline version of the clustering algorithm, so it made little sense to add other clustering algorithms to the list.
The evaluation methodology, described later in this section, required a complete partitioning of the input data set. This was quite an inconvenient assumption, because dcf-based algorithms are not designed to produce a full partitioning (a consequence of the requirements stated in Section 3.2.2 on page 45). To fulfill the needs of the evaluation methods, we had to introduce a few minor changes to the original dkm algorithm.
In the modified algorithm, pattern phrases selected for each dominant topic are not treated independently. Instead, we build a compound query with an or operator between the phrase queries corresponding to each pattern phrase and execute this query against the collection of documents. This results in one output cluster for each dominant topic and lets us predefine the expected number of clusters in advance. This modification still cannot guarantee a full partitioning of the input documents, so we exclude unassigned documents from the result.

Table 7.3: Threshold values used in the experiment.

Threshold                                          k-Means   Descriptive k-Means
maximum reassignments rmin                           20             20
minimal global objective function increase τ         0.001          0.001
minimum documents allocated to a pattern phrase      n/a            10
minimum documents in a final cluster                 n/a             5
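The compound-query assignment described above can be sketched as follows. This is an assumption-laden illustration of the described behavior (substring matching stands in for real phrase queries), not the actual dkm implementation:

```python
def assign_documents(docs, dominant_topics):
    """All pattern phrases of a dominant topic are OR-ed into one
    compound query, so each topic yields exactly one output cluster.
    Documents matched by no compound query are excluded from the
    result, as in the modified experiment setup."""
    clusters = {}
    for topic_id, pattern_phrases in dominant_topics.items():
        # a document matches the compound query if it contains
        # any of the topic's pattern phrases
        matches = [d for d in docs if any(p in d for p in pattern_phrases)]
        if matches:
            clusters[topic_id] = matches
    return clusters
```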
We need to point out that the changes introduced to Descriptive k-Means for the needs of the experiment actually penalize the algorithm. We artificially merge documents matched by different pattern phrases, leading to potentially higher cluster contamination. This adjustment was unfortunately necessary to keep the structure of the results similar to the ground truth set and the evaluation metrics comparable.
Thresholds and Variables
Thresholds used in k-Means and Descriptive k-Means were a result of prior manual tuning of
both algorithms. Precise values of thresholds are listed in Table 7.3. The number of expected
clusters k was set to 20 for both algorithms.
We clustered the ground truth data set 5 times for each combination of the following variable elements.
• Sample size: we used a sample of documents from the original data set (picked with a uniform distribution) to see how the size of the sample affects clustering quality.
• Feature vector length: we experimented with different maximum lengths of feature vectors for documents: 30, 50, 70 and 100 elements.
• Term weighting: we tried several term weighting formulas: mutual information, discounted mutual information, tf-idf and Lucene's tf-idf (described in Section 2.2.2). The weights of terms for a document were sorted and truncated at the maximum feature vector length limit. The truncated vector of weights became the set of features of a document.
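The sort-and-truncate feature selection above can be sketched for the tf-idf case (the other weighting schemes plug into the same template; the helper name and the plain-log idf variant are assumptions, not the exact formulas from Section 2.2.2):

```python
import math
from collections import Counter

def top_features(doc_tokens, corpus, k=50):
    """Weigh each term of a document with tf-idf, sort the weights and
    truncate the vector at the maximum feature length k, as described
    in the term-weighting step above."""
    n = len(corpus)
    # document frequency of each term across the corpus
    df = Counter()
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)
    weights = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    # keep only the k highest-weighted terms as the document's features
    return dict(sorted(weights.items(), key=lambda kv: -kv[1])[:k])
```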
Evaluation Methodology
For every run of the experiment we compared the clusters to the reference ground truth
data set using the following quality indicators (described in Section 2.5 on page 37 and Ap-
pendix A on page 116):
• cluster purity,
• entropy,
• f-measure,
• contamination measure.
For each indicator we calculated the arithmetic mean over five runs of the experiment under fixed values of the variables mentioned above.
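For reference, a minimal sketch of two of these indicators for a single cluster (the exact formulas used in the thesis are given in Appendix A; treat this as an illustrative approximation):

```python
import math
from collections import Counter

def purity(cluster_classes):
    """Fraction of the cluster occupied by its dominating class."""
    counts = Counter(cluster_classes)
    return max(counts.values()) / len(cluster_classes)

def entropy(cluster_classes):
    """Shannon entropy of the class distribution inside one cluster:
    0 for a pure cluster, higher for more mixed ones."""
    counts = Counter(cluster_classes)
    n = len(cluster_classes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```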
7.3.2 Output Clusters Structure and Quality
Before we proceed to the results and conclusions, we admit that our expectation was for Descriptive k-Means to be slightly worse at clustering documents than pure k-Means. After all, the dcf approach has a slightly different goal than traditional clustering and will not try to group documents for which there are no sensible descriptions. We were pleasantly surprised when the experiment's results turned out to show an increase in quality in most aspects of the analysis.
Clustering Quality
In this section we present conclusions derived from several descriptive statistics and observations of the associated quality distribution charts. A statistical analysis of the significance of these observations follows later in Section 7.3.2.
Figure 7.9 on page 95 presents the average contamination for each feature selection type. Surprisingly, descriptive clustering in both variants is less contaminated than the baseline k-Means. This is confirmed by two other quality metrics: average purity (Figure 7.10 on page 95) and average entropy (Figure 7.11 on page 96). In these metrics, however, k-Means is slightly better when mutual information (or rather pointwise mutual information) is used for feature weighting. The explanation of this phenomenon is hidden in the behavior of this weighting scheme, which is known to give preference to low-frequency elements [60]. If such elements are selected to represent a dominant topic, then they find no supporting candidate labels, or find some low-scoring junk phrases that somehow got into the candidate label set. Evidence of these suspicions can be found in Figure 7.13 on page 97. The average cluster size for candidates selected among frequent phrases is much lower than with noun phrases, whose selection is not related to frequency of occurrence. This disproportion does not occur with the other weighting schemes.
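The preference of pointwise mutual information for low-frequency terms is easy to reproduce numerically: with co-occurrence fixed at its maximum, PMI grows as a term gets rarer. A small sketch (not the exact weighting formula from Section 2.2.2; the counts are invented):

```python
import math

def pmi(n_tc, n_t, n_c, n):
    """Pointwise mutual information between a term t and a class c,
    estimated from document counts: log of P(t,c) / (P(t) * P(c))."""
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))

# a term occurring in 200 of 1000 documents, 100 of them in class c
# (class size 500): perfectly proportional, PMI = 0
common = pmi(n_tc=100, n_t=200, n_c=500, n=1000)
# a term occurring in just 2 documents, both in class c: PMI = log 2
rare = pmi(n_tc=2, n_t=2, n_c=500, n=1000)
```

The rare term scores higher even though it carries far less evidence, which is exactly the behavior that hurts dominant topic selection described above.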
The f-measure is the only indicator showing that k-Means produces higher-quality results. We know this result is biased, because the f-measure gives preference to larger clusters and we removed unassigned documents from the output of Descriptive k-Means. Perhaps the selection of the f-measure was not a good choice for the experiment at all, but we include it in the results for completeness.
Figure 7.14 illustrates the average size of a cluster (the number of documents inside a cluster), depending on the size of the feature vector. Descriptive k-Means produces smaller groups of documents (recall that k-Means assigns all documents to their closest cluster, so the average cluster size remains constant). Note that increasing the size of the sample affects the average cluster size gradually. We can speculate that dcf tends to produce smaller, but more accurate clusters.
The average number of clusters (Figure 7.15 on page 98) shows that Descriptive k-Means reduces the number of clusters k given a priori. The average number of output groups depends mostly on the length of document vectors (which influences cluster centroid calculation and, finally, the selection of pattern phrases). Interestingly, the size of the sample has no visible influence on the number of output groups. This is a promising result, because it encourages the use of sampling as a technique for scaling the method to larger data sets.
Noun Phrases and Frequent Phrases
The difference in clustering quality between the two versions of dkm was in most cases negligible. A small preference towards noun phrases seemed to exist, especially with larger document samples (see Figure 7.16 on page 98). This was a bit surprising (we hoped noun phrases would be much more accurate pattern phrases), and a manual inspection of the candidate cluster labels showed that many noun phrases were in fact corrupted. Perhaps if we used a better NP-chunker or prepared noun phrases offline, the quality would increase to a more evident difference.
Statistical Analysis of Quality Differences
The discussion so far was based on an experiment where we re-ran the clustering five times for each configuration. This number of samples was insufficient to derive statistically significant conclusions, so we repeated the whole procedure for a subset of the original experiment's range of settings, running the clustering a hundred times in each configuration. We were interested in testing whether the difference in average values of our quality measures between pairs of different algorithms (k-Means, dkm with noun phrases and dkm with frequent phrases) is statistically significant. Even though the distribution of the original data is unknown, we assumed the number of samples was sufficiently high to use a test of the difference of means between two populations [67, 47].

Tables 7.4, 7.5, 7.6 and 7.7 on pages 99–102 show the mean values of all quality measures, averaged over all experiment runs in each configuration. The rightmost columns contain pairwise comparisons between each combination of algorithms. We used a two-tailed test with confidence level α = 0.05. When the null hypothesis (equality of means) could be rejected, we also show which algorithm had a "better" mean value (lower or higher, depending on the interpretation of the quality measure).
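The pairwise comparison described above can be reproduced with a large-sample two-tailed test of the difference between two means (critical value 1.96 for α = 0.05). A sketch under the usual normal-approximation assumption:

```python
import math

def means_differ(xs, ys, z_crit=1.959964):
    """Large-sample two-tailed test of the difference between two
    population means (alpha = 0.05). Returns True when the equality
    hypothesis can be rejected."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # unbiased sample variances of the two populations
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return abs(z) > z_crit
```

When the test rejects equality, the sign of the mean difference tells which algorithm "wins" for the given quality measure.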
We have already mentioned that the f-measure is not a good quality indicator for comparing k-Means and dkm, because the two algorithms produce output clusters of different sizes. We still present the results for completeness in Table 7.6 on page 101, but one can see that the results for the f-measure are clearly inconsistent with the other quality indicators. The remaining measures provide more reliable results and support our earlier observations.

Entropy, contamination and purity indicate that dkm is better at clustering the input data set in the majority of configurations (with the exception of raw mutual information used to extract document features, a phenomenon we already discussed and explained). Our suspicion that noun phrases are not much of an improvement to the numerical quality of results has been confirmed. For example, in Table 7.4 on page 99 the noun phrase version is better 12 times and the frequent phrase version only 6 times, with 14 differences that
Figure 7.9: Average cluster contamination depending on the feature type (higher values in-
dicate more contaminated clusters).
Figure 7.10: Average cluster purity depending on the feature type (higher values indicate
purer clusters).
Figure 7.11: Average entropy depending on the feature type (lower values indicate better
clusters).
Figure 7.12: Average f-measure depending on the feature type (higher values indicate better
clusters).
Figure 7.13: Average size of a cluster depending on the feature type.
Figure 7.14: Average size of a cluster depending on the number of features and sample.
Figure 7.15: Average number of clusters depending on the number of features and sample.
Figure 7.16: Average cluster contamination for different cluster label selection methods (noun phrases and frequent phrases), depending on feature vector length and sample size. Results for samples with discounted mutual information only.
Table 7.4: Mean value of contamination and pairwise comparison between algorithms. Relation symbols indicate the better algorithm within the pair: A≻B means A wins, A≺B means B wins. The ins. symbol means the equality hypothesis cannot be rejected.
Configuration Average mean value Pairwise comparison
feature sample vector k-Means DKM-fp DKM-np k-Means/ k-Means/ DKM-fp/
DKM-fp DKM-np DKM-np
tfidf 5000 30 0.863 0.552 0.556 ins.
50 0.793 0.536 0.539 ins.
70 0.787 0.590 0.565
100 0.774 0.628 0.576
7000 30 0.821 0.539 0.538 ins.
50 0.739 0.524 0.520 ins.
70 0.717 0.586 0.544
100 0.709 0.587 0.570
mi 5000 30 0.911 0.670 0.808
50 0.890 0.649 0.678
70 0.817 0.628 0.646 ins.
100 0.752 0.629 0.634 ins.
7000 30 0.912 0.647 0.753
50 0.836 0.639 0.691
70 0.776 0.620 0.600 ins.
100 0.707 0.594 0.586 ins.
mid 5000 30 0.838 0.555 0.568 ins.
50 0.767 0.560 0.556 ins.
70 0.751 0.609 0.574
100 0.719 0.665 0.604
7000 30 0.800 0.523 0.561
50 0.693 0.529 0.527 ins.
70 0.689 0.587 0.564
100 0.675 0.634 0.582
tfidf2 5000 30 0.875 0.575 0.585 ins.
50 0.794 0.563 0.552 ins.
70 0.759 0.600 0.567
100 0.719 0.644 0.585
7000 30 0.847 0.538 0.564
50 0.711 0.532 0.521 ins.
70 0.678 0.582 0.551
100 0.700 0.629 0.595
Table 7.5: Mean value of average entropy and pairwise comparison between algorithms. Relation symbols indicate the better algorithm within the pair: A≻B means A wins, A≺B means B wins. The ins. symbol means the equality hypothesis cannot be rejected.
Configuration Average mean value Pairwise comparison
feature sample vector k-Means DKM-fp DKM-np k-Means/ k-Means/ DKM-fp/
DKM-fp DKM-np DKM-np
tfidf 5000 30 0.741 0.345 0.365
50 0.661 0.360 0.372
70 0.643 0.407 0.395 ins.
100 0.634 0.462 0.415
7000 30 0.685 0.343 0.362
50 0.599 0.352 0.358 ins.
70 0.569 0.410 0.384
100 0.561 0.423 0.415 ins.
mi 5000 30 0.506 0.460 0.646
50 0.452 0.459 0.520 ins.
70 0.444 0.461 0.503 ins.
100 0.478 0.478 0.495 ins. ins.
7000 30 0.474 0.452 0.593 ins.
50 0.312 0.453 0.541
70 0.369 0.458 0.468 ins.
100 0.414 0.445 0.453 ins.
mid 5000 30 0.716 0.361 0.397
50 0.631 0.382 0.400
70 0.615 0.449 0.425
100 0.579 0.511 0.452
7000 30 0.667 0.337 0.388
50 0.558 0.364 0.383
70 0.542 0.432 0.420 ins.
100 0.534 0.491 0.448
tfidf2 5000 30 0.763 0.364 0.388
50 0.651 0.383 0.387 ins.
70 0.622 0.443 0.421
100 0.571 0.502 0.451
7000 30 0.722 0.351 0.384
50 0.569 0.361 0.368 ins.
70 0.532 0.421 0.404
100 0.559 0.493 0.467
Table 7.6: Mean value of f-measure and pairwise comparison between algorithms. Relation symbols indicate the better algorithm within a pair: A ≻ B (A wins), A ≺ B (B wins). The ins. symbol means the equality hypothesis cannot be rejected.
Configuration (feature, sample, vector) | Average mean value (k-Means, DKM-fp, DKM-np) | Pairwise comparison (k-Means/DKM-fp, k-Means/DKM-np, DKM-fp/DKM-np)
tfidf 5000 30 0.319 0.126 0.163
50 0.398 0.144 0.171
70 0.413 0.138 0.182
100 0.426 0.145 0.183
7000 30 0.371 0.130 0.171
50 0.448 0.144 0.174
70 0.471 0.143 0.178
100 0.484 0.138 0.173
mi 5000 30 0.161 0.106 0.151
50 0.233 0.121 0.160
70 0.304 0.129 0.154
100 0.354 0.146 0.181
7000 30 0.181 0.110 0.152
50 0.265 0.117 0.152
70 0.329 0.127 0.163
100 0.408 0.142 0.176
mid 5000 30 0.341 0.139 0.178
50 0.419 0.152 0.198
70 0.441 0.160 0.197
100 0.467 0.151 0.197
7000 30 0.389 0.146 0.199
50 0.482 0.169 0.217
70 0.496 0.159 0.196
100 0.500 0.159 0.201
tfidf2 5000 30 0.298 0.134 0.166
50 0.389 0.151 0.181
70 0.424 0.155 0.191
100 0.462 0.167 0.199
7000 30 0.343 0.138 0.174
50 0.463 0.148 0.188
70 0.492 0.156 0.191
100 0.480 0.156 0.187
Table 7.7: Mean value of purity and pairwise comparison between algorithms. Relation symbols indicate the better algorithm within a pair: A ≻ B (A wins), A ≺ B (B wins). The ins. symbol means the equality hypothesis cannot be rejected.
Configuration (feature, sample, vector) | Average mean value (k-Means, DKM-fp, DKM-np) | Pairwise comparison (k-Means/DKM-fp, k-Means/DKM-np, DKM-fp/DKM-np)
tfidf 5000 30 0.321 0.593 0.596 ins.
50 0.399 0.603 0.608 ins.
70 0.401 0.554 0.590
100 0.412 0.517 0.576
7000 30 0.375 0.602 0.615
50 0.450 0.612 0.629
70 0.470 0.560 0.604
100 0.472 0.556 0.580
mi 5000 30 0.537 0.477 0.359
50 0.581 0.494 0.477 ins.
70 0.581 0.511 0.495 ins.
100 0.550 0.512 0.514 ins.
7000 30 0.566 0.492 0.406
50 0.707 0.498 0.451
70 0.650 0.519 0.529 ins.
100 0.599 0.541 0.551 ins.
mid 5000 30 0.353 0.587 0.579 ins.
50 0.426 0.579 0.589 ins.
70 0.427 0.534 0.569
100 0.449 0.483 0.538
7000 30 0.398 0.617 0.589
50 0.488 0.615 0.616 ins.
70 0.487 0.556 0.574
100 0.491 0.512 0.563
tfidf2 5000 30 0.305 0.565 0.565 ins.
50 0.398 0.576 0.590
70 0.427 0.542 0.575
100 0.467 0.494 0.556
7000 30 0.345 0.601 0.584
50 0.475 0.610 0.620 ins.
70 0.497 0.558 0.590
100 0.471 0.510 0.547
cannot be judged significant from the sample. Having said that, there are some subtle regularities with respect to the "winning" version of dkm and the test configuration. First, there seems to be a preference toward noun phrases as the number of features in a document vector grows. If we filter the observations in Table 7.4 on page 99 to exclude mutual information and short document vectors (30 and 50 elements), the noun phrase version of the algorithm always wins over frequent phrases.
Subsampling Input Documents
Assuming cluster centroids remain very similar in dkm, they should select the same pattern phrases, which should in turn allocate nearly identical final cluster content. From this we can expect that the subsampling used in k-Means should not significantly affect document allocation in Descriptive k-Means. Indeed, only the smallest sample (2000 documents) had a different quality characteristic; samples of 5000, 7000 and 9000 documents were very much alike.
These results suggest that subsampling can be successfully used to decrease the computational effort needed to cluster large numbers of documents, which aligns with previous research on the subject [16].
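As a rough illustration of this scheme (not the Carrot2 implementation), the following Python sketch clusters only a random subsample and then assigns every document to the nearest resulting centroid; the farthest-point initialization is our addition to keep the toy example deterministic:

```python
import numpy as np

def _init_farthest(X, k, rng):
    # Farthest-point initialization: spreads the initial centroids apart.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[dist.argmax()])
    return np.array(centroids, dtype=float)

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = _init_farthest(X, k, rng)
    for _ in range(iters):
        # Assign every point to its nearest centroid, then recompute means.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def cluster_with_subsample(X, k, sample_size, seed=0):
    """Cluster a random subsample only, then assign *all* documents
    to the nearest of the resulting centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    centroids = kmeans(X[idx], k, seed=seed)
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d.argmin(axis=1)
```

With well-separated topics, the centroids found on the subsample differ little from those found on the full set, which is exactly why the pattern phrases they select (and hence the final clusters) stay stable.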
An interesting aspect we did not measure is how stable the set of pattern phrases remains as the sample size changes. An experimental validation of this hypothesis would require a way of detecting how pattern phrases change (or rather: how their rank changes in the list of results returned for a Boolean query issued to the index of candidate labels). This is an interesting direction for future research.
7.3.3 Manual Inspection of Cluster Content
Performing an inspection of cluster content similar to the one we did in the first experiment with Lingo was difficult. The number of input documents, the variable combinations and even the count of original partitions were so large that any analysis of a particular instance would be very limited and selective. In the end we did take a tentative look at a few dozen results and used graphical visualizations to interpret them. The results mostly confirmed the conclusions from the quantitative experiment: Descriptive k-Means produces smaller, but less contaminated (more accurate) clusters.
Let us have a look at a single instance. Figure 7.17 on page 105 shows confusion matrices for an experiment run numbered 20n-en-7000-70-tfidf-EV-2 (English segmentation, sample of 7000 documents, 70 elements in the feature vector, sample run number 2). All algorithms seem able to reconstruct the original partitions (the diagonal contains a "trace" of mostly non-zero elements), although the number of noisy documents spreading to other clusters is higher in the case of k-Means. We can illustrate this effect with a color-map visualization. Figure 7.18 shows the same confusion matrices with color-coded cells. We normalized the distribution of documents among partitions for each cluster and assigned white to zero, red to the case where 60% or more of a cluster's documents fall into a given partition, and a transitional blue to values in between, to improve the clarity of the image. The "blue noise" represents documents misassigned from partitions other than the dominating one (usually depicted as a red cell). The amount of noise is clearly visible in the color map for k-Means and much less evident for both variants of Descriptive k-Means.
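One possible reading of this coloring scheme is sketched below; the exact RGB blend used to render the figure is not specified in the text, so the values here are our assumption:

```python
def cell_color(fraction, threshold=0.6):
    """Map a row-normalized confusion-matrix cell to an RGB triple:
    0 maps to white, values >= threshold map to red, and intermediate
    values blend from white toward blue."""
    if fraction <= 0.0:
        return (255, 255, 255)      # white: no documents
    if fraction >= threshold:
        return (255, 0, 0)          # red: dominant partition
    t = fraction / threshold        # 0 < t < 1
    fade = int(round(255 * (1.0 - t)))
    return (fade, fade, 255)        # transitional "blue noise"

def row_normalize(row):
    """Normalize one confusion-matrix row to fractions of its documents."""
    total = sum(row)
    return [v / total if total else 0.0 for v in row]
```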
A closer look at this instance reveals a subtle problem with cluster candidate selection in a few rows that contain mostly blue cells. We tracked down the pattern phrases responsible for forming these clusters. These rows either contained very few documents scattered among the original partitions because of a very specific pattern phrase (proper names, e-mail addresses), or contained very common phrases like Web Server or e-mail address. The first problem could be solved by increasing the minimal document count threshold. The second problem actually stems from the cluster model built by k-Means: the centroid vectors of dominant topics contained highly ranked popular terms such as Web and Server, so dcf selected the pattern phrases rather correctly. To avoid forming such obvious clusters we could either manually tune the stopword set used in k-Means or remove common phrases from the set of candidate cluster labels.
7.3.4 A Look at Cluster Labels
Even though the emphasis of the experiment was placed on quantitative analysis, we also
inspected a set of clusters manually to see what kind of cluster labels were created for the
two variations of dkm.
Pattern phrases selected among noun phrases were more precise than frequent phrases, which often contained a certain amount of common junk:
David Sternlight writes
work just fine
Thanks in advance for any help
For comparison, a few pattern phrases selected from noun phrase candidates in sample 20n-en-7000-50-tfidf2-EV-3:
Palestinians, Israel
Israel Gaza
Serbs, Croats and Muslims
Israeli Jews
Bosnian Serbs and Bosnian Muslims
Another example: the topmost labels from a cluster most likely corresponding to the group soc.religion.christian:
Lord Jesus Christ
grace of God
God does not exist
existence of God
salvation through our Lord Jesus Christ
Recall that pattern phrases in this experiment were somewhat artificially merged into single clusters to keep the number of clusters close to the number given a priori. In a real application, each pattern phrase would select its own set of documents.
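The allocation step can be illustrated with a naive sketch that replaces the phrase query against an index with plain substring matching (function name ours):

```python
def allocate_by_phrase(pattern_phrases, documents):
    """Each pattern phrase selects the documents that contain it verbatim,
    i.e. the result of a phrase query; a document may land in several
    clusters or in none."""
    clusters = {}
    for phrase in pattern_phrases:
        needle = phrase.lower()
        clusters[phrase] = [i for i, doc in enumerate(documents)
                            if needle in doc.lower()]
    return clusters
```

In the real system this lookup runs against an inverted index of candidate labels, but the resulting document allocation is the same.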
[Three confusion matrices, one per panel: k-Means, Descriptive k-Means (noun phrases) and Descriptive k-Means (frequent phrases); rows correspond to clusters, columns to classes. The column labels are the 20 Newsgroups classes: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc. The per-cell document counts are not reproduced here.]
Figure 7.17: Confusion matrices of all three evaluated algorithms for an instance named
20n-en-7000-70-tfidf-EV-2. A confusion matrix represents a cluster as a row and the
original partition as a column. The intersection shows the number of documents in the
row’s cluster assigned to the column’s partition. Ideal clustering would have non-zero ele-
ments only on the diagonal.
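For reference, a confusion (contingency) matrix like those in the figure can be computed from two assignment vectors; the function below is an illustrative sketch, not code from Carrot2:

```python
import numpy as np

def confusion_matrix(cluster_of, class_of, n_clusters, n_classes):
    """Rows are clusters, columns are original classes; cell (j, i) counts
    documents placed in cluster j that belong to class i."""
    M = np.zeros((n_clusters, n_classes), dtype=int)
    for j, i in zip(cluster_of, class_of):
        M[j, i] += 1
    return M
```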
[Three color-map panels: k-Means, Descriptive k-Means (noun phrases), Descriptive k-Means (frequent phrases).]
Figure 7.18: Color maps of confusion matrices for all three algorithms, instance named 20n-en-7000-70-tfidf-EV-2. Values in each row have been normalized. The color of each cell represents the fraction of the row's documents coming from the column's partition (pale blue: few documents, intense blue: more documents, red: almost all documents).
7.3.5 Summary and Conclusions
The second experiment provided observations that applying dcf to a well-known clustering algorithm does not decrease its quality. Quite the contrary: all the results seem to indicate that we gained in purity of the resulting clusters and, knowing how dcf constructs cluster labels, these should improve as well. Obviously, the baseline algorithm in our experiment (k-Means) is quite weak and a state-of-the-art text clustering algorithm would have a higher quality rating, but we believe the initial conclusion is justified: dcf does not have a destructive effect on the original clustering and offers rewards in the area of cluster labeling.
Summarizing, the outcomes from the experiment are:
• Modifying an existing clustering algorithm by applying dcf did not seem to have any
negative effect on clustering quality.
• Descriptive k-Means tends to produce small, compact but also relatively pure clusters.
• Pointwise mutual information is not a good feature weighting scheme for dcf because
it tends to pick low-frequency terms with no matches in the candidate cluster labels
set.
• We noticed no extra quality gain from using noun phrases instead of frequent phrases, although manual inspection of the output cluster labels shows that in many cases noun phrases were clearer and more accurate (this judgment is admittedly subjective).
• It seems that Descriptive k-Means can be successfully applied to efficiently cluster thousands of documents (as in the experiment), and even if memory or disk space becomes a problem, subsampling of the original document set delivers a sufficient approximation of the exact result.
7.4 User Feedback
The ultimate goal of a descriptive clustering algorithm is to provide some gain for the user seeking information. This gain is very difficult to define, and we agree with Christopher Manning [59] that measuring "user happiness" depends a lot on the target context of measurement (Who is measuring? Who is the target? What is the surrounding environment?). Perhaps the best criteria are based on an increase in user productivity translated into economic aspects such as time savings or access to more relevant, accurate (and thus valuable) information. As Manning notes, however, this kind of evaluation "makes experimental work hard, especially on large scale". For this and other reasons mentioned in the introduction, we decided not to conduct a controlled user study. Instead, we present some real feedback resulting from publishing the software implementing the concepts of this thesis in the open source arena.
Table 7.8: Carrot2 committers (alphabetical order).
Name Component
Karol Gołembniak haog clustering
Paweł Kowalik Search engine wrapper
Lang Ngo Chi Fuzzy Rough Set clustering
Stanisław Osinski Lingo Classic clustering, core
Steven Schockaert Fuzzy Ants clustering
Dawid Weiss stc clustering, core
Michał Wróblewski AHC clustering
7.4.1 Open Sourcing Carrot2
The Carrot2 framework was registered as an open source, bsd-licensed project on SourceForge in July 2003. All the initial code and design [99] was contributed by the author of this thesis. At the moment of writing, two people actively commit new code to the project: Dawid Weiss and Stanisław Osinski. Other developers have occasionally contributed components and patches integrating with the rest of the framework (see Table 7.8 for a list of contributors).
The most interesting part of an open source project is how people really react to it. With free software it is impossible to lure people with marketing slogans or advertising tricks: the software must stand on its own and be usable in order to attract a user community. Carrot2 has been moderately successful so far: according to SourceForge's statistics, the project has a constant rate of about 120 downloads per month, which for such a specific piece of server-side software seems a decent result. We are most proud of the integration with other popular open source projects such as Nutch, an open source search engine, or Lucene, an information retrieval and indexing library. The user community around these projects has been very supportive and kind. In fact, we know of several free and commercial projects using Carrot2 clustering components (see Figure 7.19). This means people consider the clustering functionality useful, and it provides us with motivation to improve it.
7.4.2 Online Demo Usage
Carrot2's development follows Martin Fowler's continuous integration principles [26]: the project is regularly rebuilt and automatically tested. A side effect of this process is a head-of-development demonstration of the system made available on-line at [K]. Even though this was not meant to be any sort of evaluation, we kept collecting log files from the Web server; they provide some insight into who has been using the system and how (summary made at the end of February 2006).

On average, 56 queries were made to the demo service daily, and this number is increasing (see monthly counts in Figure 7.22). Queries were sent from many different locations (acquired by a reverse lookup of IP addresses) and in many different languages, including Far East language families and Arabic (see Figure 7.21).

The demo service exposed several different search results clustering components and data sources. The most frequently used clustering algorithm was Lingo (most likely because it is the default one), followed by FuzzyAnts and trc (see Figure 7.23).
Figure 7.19: Two visual search systems using Carrot2 internally: meX-Search (above) and
Grokker (below).
Figure 7.20: Search results for the query prawo issued to the Yahoo search engine and clustered with Lingo.
Figure 7.21: A screenshot of user queries to the demo service. The meaning and final quality of clustering for queries in Mandarin (?) are unknown.
[Bar chart: number of queries to the demo service per month. 2003-07: 551, 2003-08: 178, 2003-09: 228, 2003-10: 421, 2003-11: 705, 2003-12: 591, 2004-01: 1242, 2004-02: 393, 2004-06: 981, 2004-07: 979, 2004-08: 558, 2004-09: 915, 2004-10: 1043, 2004-11: 628, 2004-12: 1593, 2005-01: 5065, 2005-02: 1536, 2005-03: 905, 2005-04: 1180, 2005-05: 1136, 2005-06: 2360, 2005-07: 1099, 2005-08: 820, 2005-09: 1384, 2005-10: 4219, 2005-11: 3911, 2005-12: 3207, 2006-01: 4062, 2006-02: 3237.]
Figure 7.22: Number of queries to the demo service each month.
[Bar chart: total number of queries per clustering process. lingo-google-en: 16003, lingo-yahooapi: 3464, lingo-googleapi-en: 2588, lingo-alltheweb-en: 2079, fuzzyAntsapi-en: 1978, trc.kmeans-api: 1657, scripted.ahc-api-en: 1647, scripted.ahc-api-en-idf: 1635, trc.phrase-kmeans-api: 1588, trc.rough-kmeans-api: 1537, stc-full.googleapi: 1379, lingo.egothor-cs-only: 1143, lingo-google-pl: 1044, trc.phrase-rough-kmeans-api: 753, lingo-google-en.raw: 629.]
Figure 7.23: Total number of queries to clustering processes available in the demo service.
Chapter 8
Summary and Conclusions
Document clustering is becoming a front-end utility for searching, but also for comprehending information. We started this thesis with the observation that clustering methods in such applications are inevitably connected with finding concise, comprehensible and transparent cluster labels, a goal missing from the traditional definition of clustering in information retrieval. We then showed that current approaches to the problem try to allocate a sensible description to a model of clusters expressed in mathematical concepts that do not directly correspond to the input text (as in the vector space model). This is typically very difficult and often results in poor cluster labels (as in keyword tagging).
Realizing the constraints and the nature of the problem, and with the knowledge of user expectations acquired from the Carrot2 framework, we collected requirements and expectations to formulate the problem of descriptive clustering: a document grouping task emphasizing the comprehensibility of cluster labels and the relationship between cluster labels and their documents.
Next, we devised a general method called Description Comes First, which showed how the difficult step of describing a model of clusters can be replaced with the extraction of candidate labels and the selection of pattern phrases: labels that can function as an approximation of the dominant topics present in the collection of documents. We showed how this general method fulfills our requirements concerning document assignment and provided ideas for cluster label selection strategies in English (frequent phrases, noun phrases) and Polish (frequent phrases, heuristic chunk extraction).
We then demonstrated dcf on two concrete algorithms applicable to important practical problems: clustering results returned by search engines and clustering larger collections of longer documents, such as news stories or mailing lists.

The thesis ends with a presentation of the results collected from empirical experiments with the two presented algorithms.
Fulfillment of Goals The motivation for this thesis arose as a consequence of observing new applications of clustering methods in information retrieval and the needs of the real users of these applications. Our initial goal was to create a method able to accurately describe existing clusters, but it soon changed when we realized that the problem itself needed to be restated to permit sensible solutions. The definition of descriptive clustering is, in our opinion, a better way of reflecting the needs of a user who needs to browse a collection of texts, whether they are snippets or other documents.
The dcf approach, which we describe in this thesis, is meant to show that descriptive clustering can be solved in a way which is substantially different from traditional clustering approaches and, as we show through the algorithms and experiments, not at all worse with respect to the known clustering quality measures. Moreover, we show that dcf combined with smart candidate label selection (noun phrases, for example) yields cluster labels that are more likely to fulfill the requirements of descriptive clustering defined at the beginning of this thesis, especially comprehensibility and transparency.

Obviously, we hope that dcf is not the only method for solving the problem. Quite the contrary: we are aware of the limitations of dcf and have tried to point them out in the relevant sections of this thesis and among the future research directions. Having said that, we believe that the concepts and results presented in this thesis, along with the popularity of their implementations in the Carrot2 framework, allow us to say that the goals we initially assumed have been fulfilled.
8.1 Contributions
The scientific contributions of this thesis include:
• A description and overview of the requirements of the descriptive clustering problem.
Descriptive clustering differs from regular text clustering: it is characterized by a different set of requirements and a focus shifted to cluster label quality. This thesis defines descriptive clustering in terms of its expected properties and discusses its differences from and similarities to existing research fields.
• Description Comes First approach.
The dcf approach is an attempt to design a method for constructing algorithms to
solve the descriptive clustering problem. We describe the concepts behind dcf and
demonstrate its usefulness by evaluating two algorithmic instantiations — Lingo and
Descriptive k-Means.
• Descriptive k-Means algorithm.
An algorithm based on the dcf approach, applicable to longer documents and incremental processing (unlike Lingo). This algorithm also demonstrates the use of different label extraction techniques (noun phrases, frequent phrases, predefined labels). While the experiments show no clear advantage in the quality of document allocation between frequent phrases and noun phrases, the latter usually yielded more comprehensible cluster labels, satisfying our initial expectations defined in the problem of descriptive clustering.
• A method of clustering quality evaluation.
Aggregative measures of clustering quality do not explain how documents from the original partitioning are spread among the target clusters. We devise a method for calculating a normalized score of disorder in each individual cluster that is trivial to compute and easier to interpret than normalized cluster entropy. The cluster contamination score can be used to create comprehensible visualizations of the distribution of partitions in clusters.
Scientific contributions are accompanied by technical deliverables.
• Carrot2 — an open source framework for processing text data.
All our experiments and implementations have been tested in the open source frame-
work of the Carrot2 system. Carrot2 has become a widely known and cited open source
project, especially in the domain of clustering search results. It features an industry-strength, component-based software architecture which allowed us to test and deploy the discussed algorithms in a real production environment.
• Fast hybrid stemming algorithm for Polish.
We fill the existing gap in the area of linguistic preprocessing tools for the Polish language with a fast, heuristic stemmer capable of processing several thousand words per second on commodity hardware. We also provide initial experimental results from the implementation of a pattern-based chunker for Polish texts.
8.2 Future Directions
This thesis certainly does not exhaust the subject and a great deal of work is still ahead. The most promising directions for future work are briefly outlined below.
Fast and Accurate Cluster Candidate Selection The dcf approach relies heavily on the quality of cluster candidates. Frequent phrase extraction is very convenient because of its efficiency, but it has several disadvantages, such as word-order dependence and an inability to separate junk from meaningful phrases. More advanced techniques like statistical noun phrase extraction are computationally more expensive, but even allowing for the cost they do not yield the "ideal" descriptions expected by descriptive clustering. A combined approach may be a solution to the problem: if we could extract frequently occurring terms, filter out the junk and put the remaining terms back in the right order, then we would have a fast and accurate cluster candidate extraction method.
Predefined cluster label ontologies seem to be a closer (and more realistic) alternative. There are many semantic nets of relations available on the Web. This knowledge could be used to create a set of comprehensible cluster label candidates.
Cluster Label Candidates in Polish We tried to approximate English chunking with the extraction of specific tag sequences in Polish; the results were promising, but far from perfect. We know of research being done in the area of proper name extraction in Polish (private communication with Polish computational linguists), so hopefully the results achieved in that field can be applied to cluster label extraction. Having a cluster label candidate extraction method would allow us to perform an experiment assessing the quality of clustering documents in Polish. The problem is still not trivial because no "standard" document collection for this task exists, so there is no point of reference.
Improving Topic Detection Quality It would be interesting to see if we can employ other types of clustering algorithms for the task of building a model of topics (Phase 2 in dcf). Building a model of clusters applicable to dcf's pattern phrase selection task is not always going to be easy, especially with methods that do not rely on the term vector space, but we are convinced it is possible in most scenarios.
Pattern Phrases with Broader Meaning There are contradictory goals in descriptive clustering: on one hand, we would like a compact view of the underlying collection of documents; on the other, a transparent relationship between a cluster label and its contents. The pattern phrases presented in this thesis allocate documents using phrase queries, which will inevitably retrieve just a small number of input documents. An interesting problem would be to have "general" pattern phrases, such as digital photography, which could extract all documents related to the subject even if they did not contain the keywords of the pattern phrase. Interestingly, we could use the mechanisms already present in Descriptive k-Means to do this: it would be sufficient to build an index of candidate cluster labels where the presentation is fixed to a given phrase (as in digital photography), but the index contains synonymous phrases that denote its meaning (digital photos, digital imaging, digital pictures and others).
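A minimal sketch of such an index entry follows; the synonym table is a made-up illustration, not an existing Carrot2 data structure:

```python
# Hypothetical index entry: one display label, several phrases that
# are treated as equivalent when matching documents.
SYNONYMS = {
    "digital photography": {"digital photography", "digital photos",
                            "digital imaging", "digital pictures"},
}

def broad_allocate(label, documents, synonyms=SYNONYMS):
    """Select documents matching any phrase denoting the label's meaning,
    not only the label's literal keywords."""
    variants = {v.lower() for v in synonyms.get(label, {label})}
    return [i for i, doc in enumerate(documents)
            if any(v in doc.lower() for v in variants)]
```

The cluster is still presented under the fixed label, but its contents are gathered by the whole set of synonymous phrases.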
Appendix A
Cluster Contamination
Let there be a set of $N$ objects, an original partitioning of these objects $C = \{c_1, c_2, \ldots, c_m\}$ and a set of clusters $K = \{k_1, k_2, \ldots, k_n\}$.

A contingency matrix is a two-dimensional matrix $H$ in which columns correspond to clusters, rows to classes, and $h(c_i, k_j)$ contains the number of objects from class $c_i$ present in cluster $k_j$:

$$H = \begin{vmatrix}
h(c_1,k_1) & h(c_1,k_2) & \cdots & h(c_1,k_n) \\
h(c_2,k_1) & h(c_2,k_2) & \cdots & h(c_2,k_n) \\
\vdots & \vdots & \ddots & \vdots \\
h(c_m,k_1) & h(c_m,k_2) & \cdots & h(c_m,k_n)
\end{vmatrix}$$
In a perfect clustering, where $C = K$, matrix $H$ is a square matrix with only one non-zero element in each row and column, and it can be transformed into a diagonal matrix by rearranging the order of columns or rows.
Given $H$, we can express the number of pairs of objects found in the same cluster $k_j$ but in two different partitions:

$$a_{10}(k_j) = \sum_{i=2}^{m} \sum_{t=1}^{i-1} h(c_i,k_j) \times h(c_t,k_j). \tag{A.1}$$
The notation $a_{10}(k_j)$ is derived from the contingency matrix aggregation factor; see [18] for details. Let us denote by $a_{\max}(k_j)$ the worst-case scenario of object distribution in cluster $k_j$:

$$a_{\max}(k_j) = \sum_{i=2}^{m} \sum_{t=1}^{i-1} \bar{h}(c_i,k_j) \times \bar{h}(c_t,k_j), \tag{A.2}$$
where:

$$p = \sum_{t=1}^{m} h(c_t,k_j) \quad \text{(the number of objects in } k_j\text{)},$$

$$\bar{h}(c_i,k_j) = \begin{cases} \left\lfloor p/m \right\rfloor + 1 & \text{if } i \le (p \bmod m), \\ \left\lfloor p/m \right\rfloor & \text{otherwise.} \end{cases}$$
As we show later on, $a_{\max}(k_j)$ models a situation in which we take (nearly) equal numbers of objects from each partition and combine them into one cluster.
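The quantities defined above translate directly into code. The sketch below computes $a_{10}$, $a_{\max}$ and their ratio, which is one natural normalization matching the description of the cluster contamination score (0 for a pure cluster, 1 for the worst case):

```python
def a10(h):
    """Number of pairs of objects in one cluster that come from two
    different classes (Equation A.1); h is the cluster's column of the
    contingency matrix."""
    m = len(h)
    return sum(h[i] * h[t] for i in range(1, m) for t in range(i))

def a_max(h):
    """Worst case (Equation A.2): the same number of objects spread as
    evenly as possible over all m classes."""
    m, p = len(h), sum(h)
    base, rem = divmod(p, m)
    return a10([base + 1] * rem + [base] * (m - rem))

def contamination(h):
    """Normalized disorder: 0 for a pure cluster, 1 for the worst case."""
    worst = a_max(h)
    return a10(h) / worst if worst else 0.0
```

Because $a_{10}$ is symmetric in the class counts, the order in which the evenly spread counts are listed inside `a_max` does not matter.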
Theorem 1. For a given cluster $k_j$ and a contingency matrix $H$, $a_{\max}(k_j)$ is the maximum possible value of $a_{10}(k_j)$.
Before we prove the theorem, let us simplify the notation a bit. We are concerned with a single cluster $k_j$ and the column vector of matrix $H$ containing the distribution of the cluster's objects among the original partitions. Only the elements of this vector matter for the discussion that follows, so we simplify Equation A.1 by replacing $h(c_i,k_j)$ with $h_i$:

$$a_{10}(k_j) = \sum_{i=2}^{m} \sum_{t=1}^{i-1} h(c_i,k_j) \times h(c_t,k_j) = \sum_{i=2}^{m} \sum_{t=1}^{i-1} h_i h_t$$
$$= h_1 (h_2 + h_3 + \cdots + h_{m-1} + h_m) + h_2 (h_3 + h_4 + \cdots + h_{m-1} + h_m) + \cdots + h_{m-2} (h_{m-1} + h_m) + h_{m-1} h_m.$$
Lemma 1. Moving a single object from a class with fewer elements to a class with more elements always decreases $a_{10}(k_j)$.

Proof. Let us reorder the classes so that $h_{m-1}$ contains at least as many elements as $h_m$ ($h_{m-1} \ge h_m$). By moving a single object from $h_m$ to $h_{m-1}$ we change these classes to, correspondingly:

$$h'_{m-1} = h_{m-1} + 1, \qquad h'_m = h_m - 1.$$
Before and after the move we have:

    a_{10}(k_j)  = h_1 (h_2 + h_3 + ... + h_{m-2}) + h_1 (h_{m-1} + h_m)
                 + h_2 (h_3 + h_4 + ... + h_{m-2}) + h_2 (h_{m-1} + h_m)
                   ...
                 + h_{m-2} (h_{m-1} + h_m)
                 + h_{m-1} h_m,

    a'_{10}(k_j) = h_1 (h_2 + h_3 + ... + h_{m-2}) + h_1 (h'_{m-1} + h'_m)
                 + h_2 (h_3 + h_4 + ... + h_{m-2}) + h_2 (h'_{m-1} + h'_m)
                   ...
                 + h_{m-2} (h'_{m-1} + h'_m)
                 + h'_{m-1} h'_m.
The difference a_{10}(k_j) - a'_{10}(k_j) is therefore:

    a_{10}(k_j) - a'_{10}(k_j) = h_1 (h_{m-1} + h_m) - h_1 (h'_{m-1} + h'_m)
                               + h_2 (h_{m-1} + h_m) - h_2 (h'_{m-1} + h'_m)
                                 ...
                               + h_{m-2} (h_{m-1} + h_m) - h_{m-2} (h'_{m-1} + h'_m)
                               + h_{m-1} h_m - h'_{m-1} h'_m
                               = h_1 (h_{m-1} + h_m) - h_1 (h_{m-1} + 1 + h_m - 1)
                               + h_2 (h_{m-1} + h_m) - h_2 (h_{m-1} + 1 + h_m - 1)
                                 ...
                               + h_{m-2} (h_{m-1} + h_m) - h_{m-2} (h_{m-1} + 1 + h_m - 1)
                               + h_{m-1} h_m - (h_{m-1} + 1)(h_m - 1)
                               = h_{m-1} - h_m + 1,

which is always positive, since h_{m-1} \ge h_m.
Corollary 1. From Lemma 1 it follows that in a cluster where the numbers of objects from every pair of classes differ by at most one, moving any object between classes decreases or retains the value of a_{10}(k_j). In other words, a cluster k_j reaches a_max(k_j) when:

    \forall_{i,t = 1...m}  h(c_i, k_j) - h(c_t, k_j) \le 1.    (A.3)

Lemma 2. Starting from any cluster k_j, we can reassign objects in its contingency matrix column to reach the distribution with the maximum value of a_{10}(k_j) (Equation A.3).
Proof. The proof consists in exhibiting an iterative object reassignment procedure that starts from any cluster k_j and terminates in a finite number of steps when the state described in Equation A.3 is reached.

Assuming the column vector h(c_1, k_j), h(c_2, k_j), ..., h(c_m, k_j) of the contingency matrix H is given, the following procedure has the required properties:

1. Find the class with the most objects (if more than one, take any):
       h_max = max_i h(c_i, k_j).
2. Find the class with the fewest objects (if more than one, take any):
       h_min = min_i h(c_i, k_j).
3. If h_max - h_min \le 1, stop.
4. Move a single object from the class with h_max objects to the class with h_min objects and repeat from step 1.

The procedure terminates because each move strictly decreases \sum_i h(c_i, k_j)^2, which is a non-negative integer.
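The steps above can be sketched in Python as follows (a small illustration; `balance` is a name chosen here, not used in the dissertation):

```python
def balance(h):
    """Repeatedly move one object from the largest to the smallest
    class until all counts differ by at most one (Equation A.3)."""
    h = list(h)
    while True:
        i_max = max(range(len(h)), key=h.__getitem__)  # step 1
        i_min = min(range(len(h)), key=h.__getitem__)  # step 2
        if h[i_max] - h[i_min] <= 1:                   # step 3: stop
            return h
        h[i_max] -= 1                                  # step 4: move one object
        h[i_min] += 1
```

Each iteration moves exactly one object and strictly reduces the sum of squared counts, so the loop always halts with the near-even distribution of Equation A.3.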
Proof of Theorem 1. The theorem is a consequence of Lemma 1 and Lemma 2. Starting from any cluster k_j, we can iteratively move single objects from the largest class to the smallest class, never decreasing a_{10}(k_j) (the argument is symmetric to the proof of Lemma 1, so we omit it here). Eventually we always reach a state in which a_{10}(k_j) = a_max(k_j).
The cluster contamination measure for a cluster k_j is defined as the ratio between a_{10}(k_j) and the worst-case value a_max(k_j):

    contamination(k_j) = a_{10}(k_j) / a_max(k_j).    (A.4)

Formula A.4 has the expected properties: it equals 0 for clusters that consist of objects from a single class only ("pure" clusters) and it equals 1 in the worst-case scenario of an even distribution of objects across the original classes ("contaminated" clusters).
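As an illustration (not code from the dissertation), the complete measure can be computed in a few lines of Python; the guard for a zero worst-case value covers the degenerate single-class case:

```python
def contamination(h):
    """Equation A.4: ratio of a10 to its worst-case value, where h[i]
    is the number of objects from class i found in the cluster."""
    def a10(v):  # Equation A.1
        return sum(v[i] * v[t] for i in range(1, len(v)) for t in range(i))
    p, m = sum(h), len(h)
    # hypothetical even distribution of the same p objects over m classes
    even = [p // m + (1 if i < p % m else 0) for i in range(m)]
    amax = a10(even)
    return a10(h) / amax if amax else 0.0

assert contamination([6, 0, 0]) == 0.0  # pure cluster
assert contamination([2, 2, 2]) == 1.0  # evenly contaminated cluster
```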
Bibliography
[1] Steven Abney. The English Noun Phrase in Its Sentential Aspect. PhD thesis, MIT, Cambridge,
Massachusetts, 1987.
[2] Steven Abney. Parsing by Chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, ed-
itors, Principle-Based Parsing: Computation and Psycholinguistics, pages 257–278. Kluwer Aca-
demic Publishers, 1991.
[3] David Arthur and Sergei Vassilvitskii. On the Worst Case Complexity of the k-Means Method. In
Proceedings of the 22nd Annual ACM Symposium on Computational Geometry, Sedona, Arizona,
2006. (awaiting publication).
[4] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press
/ Addison-Wesley, 1999.
[5] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, and Joydeep Ghosh. Clustering with Breg-
man Divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[6] Pavel Berkhin. Survey Of Clustering Data Mining Techniques. Technical report, Accrue Software,
San Jose, CA, USA, 2002.
[7] Janusz S. Bien. Komputerowa weryfikacja gramatyki Swidzinskiego. Biuletyn Polskiego To-
warzystwa Jezykoznawczego, LII:147–164, 1996.
[8] Hadumod Bussmann. Routledge Dictionary of Language and Linguistics. Routledge Reference. Routledge, 1996.
[9] Peter Cheeseman, James Kelly, Matthew Self, John Stutz, Will Taylor, and Don Freeman. Auto-
Class: A Bayesian Classification System. In Proceedings of the 5th International Conference on
Machine Learning, Ann Arbor, MI, USA, pages 54–64. Morgan Kaufmann, 1988.
[10] Peter Cheeseman and John Stutz. Bayesian Classification (AutoClass): Theory and Results. In
Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, ed-
itors, Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI Press/ MIT Press,
1996.
[11] David Cheng, Santosh Vempala, Ravi Kannan, and Grant Wang. A Divide-And-Merge Method-
ology for Clustering. In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on
Principles of Database Systems, Baltimore, Maryland, USA, pages 196–205. ACM Press, 2005.
[12] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted
Text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin,
Texas, USA, pages 136–143. Association for Computational Linguistics, 1988.
[13] Douglass R. Cutting, Jan O. Pedersen, David Karger, and John W. Tukey. Scatter/Gather: A
Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th
International ACM Conference on Research and Development in Information Retrieval, Copen-
hagen, Denmark, pages 318–329. ACM Press, 1992.
[14] Jan Daciuk. Incremental Construction of Finite-State Automata and Transducers, and their Use
in the Natural Language Processing. PhD thesis, Technical University of Gdansk, Poland, 1998.
[15] Arthur Dempster, Nan Laird, and Donald Rubin. Maximum Likelihood from Incomplete Data
via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
[16] Inderjit S. Dhillon, James Fan, and Yuqiang Guan. Efficient Clustering of Very Large Document
Collections. In Robert Grossman, Chandrika Kamath, Vipin Kumar, and Raju R. Namburu, edi-
tors, Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
Invited book chapter.
[17] Inderjit S. Dhillon and Dharmendra S. Modha. Concept Decompositions for Large Sparse Text
Data Using Clustering. Machine Learning, 42(1–2):143–175, 2001.
[18] Byron E. Dom. An Information-Theoretic External Cluster-Validity Measure. Research Report RJ
10219, IBM, 2001.
[19] Zhang Dong. Towards Web Information Clustering. PhD thesis, Southeast University, Nanjing,
China, 2002.
[20] Richard C. Dubes. How many clusters are best? — An experiment. Pattern Recognition,
20(6):645–663, 1987.
[21] David Dubin. The most influential paper Gerard Salton never wrote. Library Trends, 52(4):748–
764, 2004.
[22] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Oxford University Press,
fourth edition, 2001.
[23] Paolo Ferragina and Antonio Gulli. The Anatomy of a Hierarchical Clustering Engine for Web-
page, News and Book Snippets. In Proceedings of the 4th IEEE International Conference on Data
Mining, ICDM 2004, Brighton, UK, pages 395–398. IEEE Computer Society, 2004.
[24] Paolo Ferragina and Antonio Gulli. The Anatomy of SnakeT: A Hierarchical Clustering Engine for
Web-Page Snippets. In Proceedings of the 8th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Pisa, Italy, volume 3202 of Lecture Notes in Computer Science,
pages 506–508. Springer, 2004.
[25] Douglas Fisher. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learn-
ing, 2(2):129–172, 1987.
[26] Martin Fowler and Matthew Foemmel. Continuous integration. Available on-line (May 2006):
http://martinfowler.com/articles/continuousIntegration.html .
[27] Robert Giegerich and Stefan Kurtz. From Ukkonen to McCreight and Weiner: A Unifying View
of Linear-Time Suffix Tree Construction. Algorithmica, 19(3):331–353, 1997.
[28] Alessandra Giorgi and Giuseppe Longobardi. The Syntax of Noun Phrases, Configuration, Pa-
rameters and Empty Categories. Number 57 in Cambridge Studies in Linguistics. Cambridge
University Press, 1991.
[29] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, London, third edition, 1996.
[30] Allan D. Gordon. Classification. Chapman & Hall, London, second edition, 1999.
[31] Anjan Goswami, Ruoming Jin, and Gagan Agrawal. Fast and Exact Out-of-Core K-Means Clus-
tering. In Proceedings of the 4th IEEE International Conference on Data Mining, ICDM 2004,
Brighton, UK, pages 83–90. IEEE Computer Society, 2004.
[32] Barbara L. Grosz and Candace L. Sidner. Attentions, Intentions, and the Structure of Discourse.
Computational Linguistics, 12(3):175–204, 1986.
[33] Elzbieta Hajnicz and Anna Kupsc. Przeglad analizatorow morfologicznych dla jezyka polskiego.
Research Report 937, Institute of Computer Science, Polish Academy of Sciences, 2001.
[34] Michael Halliday and Ruqaiya Hasan. Cohesion in English. English Language Series. Longman,
1976.
[35] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann,
second edition, 2006.
[36] David Hand, Heikki Mannila, and Padhraic Smyth. Eksploracja Danych. Wydawnictwa Nauko-
wo-Techniczne WNT, 2005.
[37] Marti A. Hearst and Jan O. Pedersen. Reexamining the Cluster Hypothesis: Scatter/Gather on
Retrieval Results. In Proceedings of the 19th ACM International Conference on Research and De-
velopment in Information Retrieval, Zürich, Switzerland, pages 76–84, 1996.
[38] Andreas Hotho, Steffen Staab, and Gerd Stumme. Explaining Text Clustering Results Using Se-
mantic Structures. In Proceedings of the 7th European Conference on Principles and Practice of
Knowledge Discovery in Databases, Cavtat-Dubrovnik, Croatia, volume 2838 of Lecture Notes in
Computer Science, pages 217–228. Springer, 2003.
[39] Andreas Hotho and Gerd Stumme. Conceptual Clustering of Text Clusters. In Proceedings of FGML
Workshop, Special Interest Group of German Informatics Society, Hannover, Germany, pages 37–
45, 2002.
[40] Wayne Iba and Pat Langley. Unsupervised Learning of Probabilistic Concept Hierarchies. In
Georgios Paliouras, Vangelis Karkaletsis, and Constantine D. Spyropoulos, editors, Machine
Learning and Its Applications, volume 2049 of Lecture Notes in Computer Science, pages 39–70.
Springer, 2001.
[41] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. Data Clustering: A Review. ACM Com-
puting Surveys, 31(3):264–323, 1999.
[42] Mark Kantrowitz, Behrang Mohit, and Vibhu Mittal. Stemming and Its Effects on TFIDF Rank-
ing. In Proceedings of the 23rd ACM International Conference on Research and Development in
Information Retrieval, Athens, Greece, pages 357–359. ACM Press, 2000.
[43] George Karypis and Eui-Hong Han. Fast Supervised Dimensionality Reduction Algorithm with
Applications to Document Categorization and Retrieval. In Proceedings of the 9th International
Conference on Information and Knowledge Management, McLean, VA, USA, pages 12–19. ACM
Press, 2000.
[44] Donald E. Knuth. The TeXbook. Computers and Typesetting. Addison-Wesley, 1986.
[45] Pang Ko and Srinivas Aluru. Space Efficient Linear Time Construction of Suffix Arrays. Journal
of Discrete Algorithms, 3(2–4):143–156, 2005.
[46] Jacek Koronacki and Jan Cwik. Statystyczne systemy uczace sie. Wydawnictwa Naukowo-Techni-
czne WNT, 2005.
[47] Jacek Koronacki and Jan Mielniczuk. Statystyka dla studentów kierunków technicznych i przy-
rodniczych. Wydawnictwa Naukowo-Techniczne WNT, 2001.
[48] Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal, and Raghu Krishnapuram.
A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing
Search Results. In Proceedings of the 13th International Conference on World Wide Web, New
York, NY, USA, pages 658–665. ACM Press, 2004.
[49] Leslie Lamport. LaTeX — A Document Preparation System — User’s Guide and Reference Manual.
Addison-Wesley, 1985.
[50] Bjornar Larsen and Chinatsu Aone. Fast and Effective Text Mining Using Linear-Time Docu-
ment Clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, San Diego, CA, USA, pages 16–22. ACM Press, 1999.
[51] N. Jesper Larsson. Structures of String Matching and Data Compression. PhD thesis, Department of Computer Science, Lund University, 1999.
[52] Elizabeth Liddy. Advances in Automatic Text Summarization. Information Retrieval, 4(1):82–83,
2001.
[53] Dekang Lin and Patrick Pantel. Concept Discovery from Text. In Proceedings of the 19th In-
ternational Conference on Computational Linguistics, Taipei, Taiwan, pages 1–7. Association for
Computational Linguistics, 2002.
[54] Julie Beth Lovins. Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11:22–31, 1968.
[55] Hans Peter Luhn. The Automatic Creation of Literature Abstracts. The IBM Journal of Research
and Development, 2(2):159–165, 1958.
[56] Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison, and Haym Hirsh. Human Performance
on Clustering Web Pages: A Preliminary Study. In Proceedings of the 4th International Conference
on Knowledge Discovery and Data Mining, New York City, NY, USA, pages 264–268. AAAI Press,
1998.
[57] Udi Manber and Gene Myers. Suffix Arrays: A New Method for On-line String Searches. SIAM
Journal on Computing, 22(5):935–948, 1993.
[58] Inderjeet Mani and Mark T. Maybury, editors. Advances in Automatic Text Summarization. MIT
Press, 1999.
[59] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Informa-
tion Retrieval. Cambridge University Press, 2007. (Pre-press publication notes).
[60] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Pro-
cessing. MIT Press, 1999.
[61] Giovanni Manzini and Paolo Ferragina. Engineering a Lightweight Suffix Array Construction
Algorithm. Algorithmica, 40(1):33–50, 2004.
[62] Marcin Wolinski. Komputerowa weryfikacja gramatyki Swidzinskiego. PhD thesis, Instytut Pod-
staw Informatyki PAN, Warszawa, 2004.
[63] Irmina Masłowska. Phrase-Based Hierarchical Clustering of Web Search Results. In Proceedings
of the 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, volume 2633 of Lecture
Notes in Computer Science, pages 555–562. Springer, 2003.
[64] Irmina Masłowska and Roman Słowinski. Hierarchical Clustering of Text Corpora Using Suffix
Trees. In Proceedings of the International Intelligent Information Processing and Web Mining
Conference, Zakopane, Poland, Advances in Soft Computing, pages 179–188. Springer, 2003.
[65] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley,
1996.
[66] Ryszard S. Michalski and Robert E. Stepp. Learning From Observation: Conceptual Clustering.
In Ryszard S. Michalski, Jaime G. Carbonell, and Tom M. Mitchell, editors, Machine Learning:
An Artificial Intelligence Approach. Morgan Kaufmann, 1983.
[67] Douglas C. Montgomery and George C. Runger. Applied Statistics and Probability for Engineers.
Wiley, third edition, 2002.
[68] Stanisław Osinski, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search Results Clustering Algo-
rithm Based on Singular Value Decomposition. In Proceedings of the International Intelligent
Information Processing and Web Mining Conference, Zakopane, Poland, Advances in Soft Com-
puting, pages 359–368. Springer, 2004.
[69] Stanisław Osinski and Dawid Weiss. Conceptual Clustering Using Lingo Algorithm: Evaluation
on Open Directory Project Data. In Proceedings of the International Intelligent Information Pro-
cessing and Web Mining Conference, Zakopane, Poland, Advances in Soft Computing, pages 369–
378. Springer, 2004.
[70] Stanisław Osinski and Dawid Weiss. A Concept-Driven Algorithm for Clustering Search Results.
IEEE Intelligent Systems, 20(3):48–54, 2005.
[71] Chris D. Paice. Another Stemmer. SIGIR Forum, 24(3):56–61, 1990.
[72] Patrick Pantel and Dekang Lin. Document Clustering With Committees. In Proceedings of the
25th ACM International Conference on Research and Development in Information Retrieval, Tam-
pere, Finland, pages 199–206. ACM Press, 2002.
[73] Patrick Pantel and Deepak Ravichandran. Automatically Labeling Semantic Classes. In Proceed-
ings of Human Language Technology Conference of the North American Chapter of the Association
for Computational Linguistics, Boston, MA, USA, pages 321–328, 2004.
[74] Ferran Pla, Antonio Molina, and Natividad Prieto. Tagging and Chunking with Bigrams. In Pro-
ceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Ger-
many, pages 614–620. Morgan Kaufmann, 2000.
[75] Zbigniew Płotnicki. Słownik morfologiczny jezyka polskiego na licencji LGPL. Master’s thesis,
Poznan University of Technology, 2003.
[76] Martin F. Porter. An Algorithm for Suffix Stripping. In Karen Sparck Jones and Peter Willett,
editors, Readings in Information Retrieval, pages 313–316. Morgan Kaufmann, 1997.
[77] Dragomir R. Radev, Hongyan Jing, Małgorzata Stys, and Daniel Tam. Centroid-based Summa-
rization of Multiple Documents. Information Processing and Management, 40(6):919–938, 2004.
[78] Lance Ramshaw and Mitch Marcus. Text Chunking Using Transformation-Based Learning. In
Proceedings of the 3rd Workshop on Very Large Corpora, Boston, MA, USA, pages 82–94. Associa-
tion for Computational Linguistics, 1995.
[79] Jeffrey C. Reynar. Topic Segmentation: Algorithms and Applications. PhD thesis, University of
Pennsylvania, Philadelphia, USA, 1998.
[80] James E. Rush, Antonio Zamora, and Ricardo Salvador. Automatic Abstracting and Indexing:
Production of Indicative Abstracts by Application of Contextual Inference and Syntactic Coher-
ence Criteria. Journal of the American Society for Information Science, 22(4):260–274, 1971.
[81] Zygmunt Saloni and Marek Swidzinski. Składnia współczesnego jezyka polskiego. Wydawnictwo
Naukowe PWN, 1998.
[82] Gerard Salton. Automatic Text Processing — The Transformation, Analysis, and Retrieval of In-
formation by Computer. Addison-Wesley, 1989.
[83] Gerard Salton and Chris Buckley. Term Weighting Approaches in Automatic Text Retrieval. Re-
search report, Cornell University, Ithaca, NY, USA, 1987.
[84] Mark Sanderson and Bruce Croft. Deriving Concept Hierarchies from Text. In Proceedings of
the 22nd ACM International Conference on Research and Development in Information Retrieval,
Berkeley, USA, pages 206–213, 1999.
[85] Adam Schenker, Mark Last, and Abraham Kandel. A Term-Based Algorithm for Hierarchical
Clustering of Web Documents. In Proceedings of the Joint 9th IFSA World Congress and 20th
NAFIPS International Conference, Vancouver, Canada, pages 3076–3081, 2001.
[86] Heinrich Schütze and Craig Silverstein. Projections for Efficient Document Clustering. In Pro-
ceedings of the 20th ACM International Conference on Research and Development in Information
Retrieval, Philadelphia, PA, USA, pages 74–81. ACM Press, 1997.
[87] Mark Sinka and David Corne. A Large Benchmark Dataset for Web Document Clustering. Soft
Computing Systems: Design, Management and Applications, 87:881–890, 2002.
[88] Eduard F. Skorochod’ko. Adaptive Method of Automatic Abstracting and Indexing. In Informa-
tion Processing 71: Proceedings of the IFIP Congress, pages 1179–1182. North-Holland Publishing
Company, 1972.
[89] Jerzy Stefanowski and Dawid Weiss. Carrot2 and Language Properties in Web Search Results
Clustering. In Proceedings of the 1st International Atlantic Web Intelligence Conference, Madrid,
Spain, volume 2663 of Lecture Notes in Computer Science, pages 240–249. Springer, 2003.
[90] Benno Stein and Sven Meyer zu Eissen. Topic Identification: Framework and Application. In
Proceedings of the 4th International Conference on Knowledge Management, Graz, Austria, pages
353–360, 2004.
[91] Michael Steinbach, George Karypis, and Vipin Kumar. A Comparison of Document Clustering
Techniques. In KDD Workshop on Text Mining, Proceedings of the 6th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 2000.
[92] Krzysztof Szafran. Analizator morfologiczny SAM-95. Technical Report TR 96 226, Faculty of
Mathematics, Informatics and Mechanics, Warsaw University, Poland, 1996.
[93] Stanisław Szpakowicz. Automatyczna analiza składniowa polskich zdan pisanych. PhD thesis,
Warsaw University, Poland, 1978.
[94] Kevin Thompson and Pat Langley. Concept Formation in Structured Domains. In Douglas
Fisher, Michael Pazzani, and Pat Langley, editors, Concept Formation: Knowledge and Experi-
ence in Unsupervised Learning. Morgan Kaufmann, 1991.
[95] Esko Ukkonen. On-Line Construction of Suffix Trees. Algorithmica, 14(3):249–260, 1995.
[96] Keith van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 1979.
[97] Justyna Wachnicka. Odkrywanie reguł lematyzacji dla jezyka polskiego w oparciu o słownik
ispell-pl. Master’s thesis, Poznan University of Technology, 2004.
[98] Dawid Weiss. A Survey of Freely Available Polish Stemmers and Evaluation of Their Applica-
bility in Information Retrieval. In Proceedings of the 2nd Language and Technology Conference,
Poznan, Poland, pages 216–221, 2005.
[99] Dawid Weiss. Carrot2: Design of a Flexible and Efficient Web Information Retrieval Framework.
In Proceedings of the 3rd International Atlantic Web Intelligence Conference, Łódz, Poland, vol-
ume 3528 of Lecture Notes in Computer Science, pages 439–444. Springer, 2005.
[100] Dawid Weiss. Stempelator: A Hybrid Stemmer for the Polish Language. Research Report RA-
002/05, Institute of Computing Science, Poznan University of Technology, Poland, 2005.
[101] Dawid Weiss and Jerzy Stefanowski. Web Search Results Clustering in Polish: Experimental Eval-
uation of Carrot. In Proceedings of the International Intelligent Information Processing and Web
Mining Conference, Zakopane, Poland, Advances in Soft Computing, pages 209–218. Springer,
2003.
[102] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and In-
dexing Documents and Images. Morgan Kaufmann, second edition, 1999.
[103] Marcin Wolinski. Morfeusz — a Practical Tool for the Morphological Analysis of Polish. To be
published at International Intelligent Information Processing and Web Mining Conference, Us-
tron, Poland, 2006.
[104] Oren Zamir. Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine
Results. PhD thesis, University of Washington, 1999.
[105] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration. In Re-
search and Development in Information Retrieval, pages 46–54, 1998.
[106] Oren Zamir and Oren Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results.
Computer Networks, 31(11–16):1361–1374, 1999.
[107] Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, and Jinwen Ma. Learning to Cluster Web
Search Results. In Proceedings of the 27th ACM International Conference on Research and Devel-
opment in Information Retrieval, Sheffield, United Kingdom, pages 210–217. ACM Press, 2004.
[108] Hua-Jun Zeng, Xuan-Hui Wang, Zheng Chen, Hongjun Lu, and Wei-Ying Ma. CBC: Clustering
Based Text Classification Requiring Minimal Labeled Data. In Proceedings of the 3rd IEEE In-
ternational Conference on Data Mining (ICDM 2003), Melbourne, Florida, USA, pages 443–450,
2003.
[109] Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results.
In Proceedings of 6th Asia-Pacific Web Conference (APWeb), Hangzhou, China, volume 3007 of
Lecture Notes in Computer Science, pages 69–78. Springer, 2004.
[110] Tong Zhang, Fred Damerau, and David Johnson. Text Chunking Based on a Generalization of
Winnow. Journal of Machine Learning Research, 2:615–637, 2002.
List of Web Resources
[A] Twenty Newsgroups (20-newsgroups) Data Set.
http://people.csail.mit.edu/jrennie/20Newsgroups/
[B] Andrzej Białecki, Stempel — algorithmic stemmer for the Polish language.
http://getopt.org/stempel/
[C] Hugo Liu, MontyLingua — An end-to-end natural language processor with common sense.
http://web.media.mit.edu/~hugo/montylingua
[D] Vivísimo Search Engine.
http://www.vivisimo.com/
[E] Text Retrieval Conference.
http://trec.nist.gov/
[F] The Unicode standard.
http://www.unicode.org/
[G] Generative Art, Wikipedia.
http://en.wikipedia.org/wiki/Generative_art
[H] Lucene text search and index library.
http://lucene.apache.org/
[I] Open Directory Project.
http://dmoz.org/
[J] Carrot2 framework.
http://www.carrot2.org
[K] Carrot2 framework demo.
http://carrot.cs.put.poznan.pl
[L] IPI PAN Corpus.
http://www.korpus.pl
Index
abstract topic, see dominant topic
algorithm
monothetic, 29
polythetic, 29
character encoding, 14
chunk, 16, 49
chunking, 16
cluster contamination, 39, 88, 119
cluster hypothesis, 3
Cluster/2, 45
clustering problem, 27
CobWeb, 45
codepage, see character encoding
conceptual clustering, 45
contingency matrix, 116
cosine measure, 27
DCF, see Description Comes First
Description Comes First, 47
Descriptive k-Means, 65
dimensionality reduction, 27
discourse modelling, 36
DKM, see Descriptive k-Means
document indexing, 24, 66
document vector, 10
dominant topic, 50
f-measure, 38
feature, 24
ground truth, 38
group
distributionally equivalent, 18
hit, see hit list
hit list, 1
homographs, 15
index, see document indexing
indexing, see document indexing
information needs, 1
k-Means, 30
keyword in context, 57
keywords, 3
KWIC, see keyword in context
LCP, see Longest Common Prefix
lemma, 15
lexeme, 15
Lingo, 11
Longest Common Prefix, 60
MI, see mutual information
mutual information, 26
noun phrase, 17, 49
NP-chunks, see noun phrase
ODP, see Open Directory Project
Open Directory Project, 80
pattern phrase, 51
phrase, 5, 16
phrase query, 72, 75
pointwise mutual information, 26
precision, 7, 38
query
phrase query, 72
search engines, 1
ranking, 7
recall, 7, 38
recurring phrases, 11
recurring terms, see phrase
search engine, 1
search ranking, see ranking
search results clustering, 5
shallow linguistic preprocessing, 15
slop factor, 72
snippet, 5, 57
STC, see Suffix Tree Clustering
stemming, 15
suffix tree, 22
Suffix Tree Clustering, 31
summarization, 36
term, 14
term vector, see document vector
term-document matrix, 24
text chunking, see chunking
text segmentation, 14
tf-idf, 26
topic segmentation, 36
Vector Space Model, 10, 24
VSM, see Vector Space Model
word form, 15
© 2006 Dawid Weiss
Institute of Computing Science
Poznan University of Technology, Poznan, Poland
Typeset using LaTeX in Adobe Utopia, Helvetica and Luxi Mono.
Compiled on November 22, 2006 from SVN Revision: 2467
BibTeX entry:

    @phdthesis{dweiss2006,
        author  = "Dawid Weiss",
        title   = "Descriptive Clustering as a Method for Exploring Text~Collections",
        school  = "Pozna\'n University of Technology",
        address = "Pozna\'n, Poland",
        year    = "2006",
    }
Errata
This chapter contains a list of corrections made to Dawid Weiss's doctoral dissertation Descriptive Clustering as a Method for Exploring Text Collections.

The motivation is to highlight changes and corrections made between publicly distributed versions (paper or electronic). Each version contains a revision number in the colophon (found on the last page). Many of the changes below have been applied in consecutive releases of the dissertation, but all page numbers and rows in each change set refer to the defended version 2358.
— Changes between revisions 2343–2358 (internal review) —
Page: Front page E-1
Front pages order changed: English title page, Polish title page, summary in Polish.
Page: I–III E-2
Chapter page numbers in plain font (not bold).
Page: multiple pages E-3
Changed labelling to American English spelling for consistency: labeling.
Page: 15 4th line from the top E-4
White space instead of whitespace.
• white space and punctuation characters delimit terms,
• a full stop followed by a white space delimits a sentence.
Page: 23 6th line from the top E-5
Whitespace added after suffix array and before the citation.
Page: 29 Figure 2.8 on top of the page E-6
The legend on top of the figure should be the opposite: squares denote objects, circles groups.
Page: 37 First line, third element of the list E-7
Grammatical correction: uncomparable changed to incomparable.
Experiment results are unique, unreproducible and incomparable. User surveys are [. . . ]
— Changes between revisions 2359–2427 (post-review) —
Page: 1 E-8
Contraction we’ve gathered expanded to we have gathered.
Page: 8 pt. 3 in Section 1.3 E-9
The acronym dcf is used prior to its definition. Reworded pt. 2 to introduce the term.
• Devise a method — Description Comes First (dcf) — that yields comprehensible, [. . . ]
Page: multiple pages E-10
Contraction let’s reworded or expanded to let us.
Page: 43 Second paragraph in the "Conciseness" block. E-11
Typo: Anticipating out changed to Anticipating our.
Page: 47 chapter title E-12
Chapter title changed to: Solving the Descriptive Clustering Task: Description Comes First Approach.
Page: 116 4th line from the top, wrong word used E-13
Change partition to class.
[. . . ] rows to classes and h(ci ,k j ) contains the number of objects from class ci present [. . . ]
Page: 118 2nd line of Corollary 1 E-14
Change partition to class.
[. . . ] every class differs at most by one, any object moved between classes decreases [. . . ]
Page: 111 Figures 7.22 and 7.23 E-15
Integers on the charts shown with an unnecessary floating point precision.
Page: 23 third paragraph from the top E-16
Inaccurate estimate of suffix array construction complexity (explanation added).
Of course, a straightforward construction of a suffix array using generic sorting routines would slow down the algorithm to the order of O(n log n) (assuming, a bit unrealistically, that string comparisons take O(1) time).
Page: 34 Related Works section E-17
Missing references. A few other algorithms for clustering search results have been mistakenly omitted
in Section 2.4 (shamefully, some of them are even implemented in Carrot2).
• Chi Lang Ngo and Hung Son Nguyen. A method of web search result clustering based
on rough sets. In Proceedings of 2005 IEEE / WIC / ACM International Conference on
Web Intelligence, 19-22 September 2005, Compiegne, France, pages 673–679. IEEE Com-
puter Society, 2005
• C. Carpineto, A. Della Pietra, S. Mizzaro, and G. Romano. Mobile clustering engine.
In Proceedings of the 28th European Conference on Information Retrieval, LNCS 3936,
pages 155–166, London, 2006. Springer
• Steven Schockaert, Martine De Cock, Chris Cornelis, and Etienne E. Kerre. Fuzzy ant
based clustering. In Ant Colony Optimization and Swarm Intelligence, Proceedings
of the 4th International Workshop, ANTS 2004, volume 3172 of LNCS, pages 342–349.
Springer-Verlag, 2004
— Changes between revisions 2427–2466 (margin notes from prof. Jacek Koronacki) —
Page: 7 9th line from the top E-18
Replaced ambiguous sentence.
[. . . ] users expect labels to convey all the information about clusters’ content.
Page: 10 4th line from the bottom E-19
Replaced awkward sentence.
Cluster discovery provides a computational data model of document groups present in the
input data.
Page: 12 2nd line from the bottom E-20
Replaced awkward characters.
[. . . ] devoted to Descriptive k-Means additionally [. . . ]
Page: 19 10th line from the bottom E-21
Corrected word: frequently → frequent.
Page: 22 9th line from the bottom E-22
Corrected word repetition: node node → node.
Page: 23 2nd line from the top E-23
Corrected awkward sentence.
Suffix trees have become very popular mostly due to low computational cost of their con-
struction — linear with the size of input sequence’s elements.
Page: 25 4th line from the bottom E-24
Added missing word.
[. . . ] and downplays words that are very common.
Page: 27 first paragraph from the top E-25
Replaced articles in phrases: a dot product, a norm of vector.
Page: 27 Section 2.3.1 E-26
Added proper attribution — Brian Everitt et al.
Page: 31 midpage E-27
Added missing article.
The Internet brought a new challenge [. . . ]
Page: 38 midpage E-28
Incorrect word usage corrected: remember → recall.
Page: 47 10th line from the bottom E-29
Reworded awkward sentence.
[. . . ] but the core idea is in separating selection of candidate cluster labels from cluster dis-
covery.
Page: 48 label in Figure 4.1 E-30
Incorrect word usage corrected: potential → potentially.
Page: 57 9th line from the bottom E-31
Removed extra ‘s’ in contains.
Page: 57 7th line from the bottom E-32
Added kwic to the index and the margin.
Page: 66 7th line from the top E-33
Reworded awkward sentence.
We expect the input to be real text (not completely random, noisy documents and not frag-
ments like snippets) written in one language [. . . ]
Page: 66 second and third paragraph in Section 6.2 E-34
Corrected omitted words.
[. . . ] certain distortions such as minor [. . . ]
[. . . ] pseudocode of the algorithm, and we discuss each major step in sections below.
Page: 69 15th line from the top E-35
Reworded phrase: very similar → similarly.
Page: 79 multiple lines in Section 7.1 E-36
Minor rewordings.
[. . . ] differ slightly from those of typical clustering [. . . ]
[. . . ] look at the problem of evaluation from another angle.
The system has been available as an open source project[. . . ]
Page: 80 9th line from the top E-37
Reworded awkward sentence.
• Are cluster labels meaningful with respect to the topic [. . . ]
Page: 80 3rd line from the bottom E-38
Removed extra parenthesis.
Page: 92 3rd line from the top E-39
Added missing the.
We need to point out that the changes introduced to [. . . ]