Processing of Large Document Collections 1
Helena Ahonen-Myka
University of Helsinki
Organization of the course
Classes: 17.9., 22.10., 23.10., 26.11.
  lectures (Helena Ahonen-Myka): 10-12, 13-15
  exercise sessions (Lili Aunimo): 15-17
  required presence: 75%
Exercises are given (and returned) each week
  required: 75%
Exam: 4.12. at 16-20, Auditorio
Points: Exam 30 pts, exercises 30 pts
Schedule
17.9. Character sets, preprocessing of text, text categorization
22.10. Text summarization
23.10. Text compression
26.11. … to be announced…
self-study: basic transformations for text data, using linguistic tools, etc.
In this part...
Character sets
preprocessing of text
text categorization
1. Character sets
Abstract character vs. its graphical representation
abstract characters are grouped into alphabets
  each alphabet forms the basis of the written form of a certain language or a set of languages
Character sets
For instance for English:
  uppercase letters A-Z
  lowercase letters a-z
  punctuation marks
  digits 0-9
  common symbols: +, =
ideographic symbols of Chinese and Japanese
phonetic letters of Western languages
Character sets
To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
this mapping is a character set
the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)
Character sets
For each character in the character repertoire, the character set defines a code value in the set of code points
in English:
  26 letters in both lower- and uppercase
  ten digits + some punctuation marks
in Russian: Cyrillic letters
both could use the same set of code points (if not a bilingual document)
in Japanese: could be over 6000 characters
Character sets
The mere existence of a character set supports operations like editing and searching of text
usually character sets have some structure
  e.g. integers within a small range
  all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)
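The benefit of consecutive code values can be seen in a small sketch. It uses Python's built-in `ord` and `chr`, which expose Unicode code values (identical to ASCII for these letters); the variable names are only illustrative.

```python
# Code values of the lowercase ASCII letters are consecutive integers,
# so alphabetical order coincides with numeric order of the code values.
lower = [ord(c) for c in "abcdefghijklmnopqrstuvwxyz"]
is_consecutive = all(b - a == 1 for a, b in zip(lower, lower[1:]))

# Case conversion becomes simple arithmetic on code values.
offset = ord("a") - ord("A")
to_upper = chr(ord("q") - offset)

# Sorting strings by code value gives alphabetical order (within one case).
sorted_words = sorted(["pear", "apple", "fig"])
```

Note that this only holds within one case and one alphabet; sorting mixed-case or accented text by raw code value generally does not give dictionary order.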
Character sets: standards
Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs,...)
early standards were designed for English only, or for a small group of languages at a time
Character sets: standards
ASCII
ISO-8859 (e.g. ISO Latin1)
Unicode
UTF-8, UTF-16
ASCII
American Standard Code for Information Interchange
A seven-bit code -> 128 code points
actually 95 printable characters only
  code points 0-31 and 127 are assigned to control characters (mostly outdated)
ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
ASCII
With 7 bits, the set of code points is too small for anything other than American English
solution:
  8 bits brings more code points (256)
  ASCII character repertoire is mapped to the values 0-127
  additional symbols are mapped to other values
Extended ASCII
Problem:
  different manufacturers each developed their own 8-bit extensions to ASCII
    different character repertoires -> translation between them is not always possible
  also 256 code values is not enough to represent all the alphabets -> different variants for different languages
ISO 8859
Standardization of 8-bit character sets
In the 80's: multipart standard ISO 8859 was produced
  defines a collection of 8-bit character sets, each designed for a group of languages
the first part: ISO 8859-1 (ISO Latin1)
  covers most Western European languages
  0-127: identical to ASCII, 128-159 (mostly) unused, 96 code values for accented letters and symbols
Unicode
256 is not enough code points
  for ideographically represented languages (Chinese, Japanese…)
  for simultaneous use of several languages
solution: more than one byte for each code value
a 16-bit character set has 65,536 code points
Unicode
16-bit character set, i.e. 65,536 code points
not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions
  CJK-consolidation: characters of these scripts are given the same value if they look the same
Unicode
Code values for all the characters used to write contemporary ’major’ languages
  also the classical forms of some languages
  Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
  Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
Unicode
punctuation marks
technical and mathematical symbols
arrows
dingbats (pointing hands, stars, …)
both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters
  can also create problems: two characters that look the same may have different code values
  -> normalization may be necessary
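The normalization problem can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch: it compares a precomposed "é" with the same letter built from "e" plus a combining accent, and the choice of the NFC normalization form is an assumption made for the example.

```python
import unicodedata

precomposed = "\u00e9"   # é stored as a single code value
composite = "e\u0301"    # e followed by a combining acute accent

# The two strings look the same on screen but differ as code value sequences.
same_before = precomposed == composite

# NFC normalization composes the pair into the single precomposed character.
normalized = unicodedata.normalize("NFC", composite)
same_after = precomposed == normalized
```

Searching and comparing text reliably therefore usually requires normalizing both sides to the same form first.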
Unicode
Code values for nearly 39,000 symbols are provided
some part is reserved for an expansion method (see later)
6,400 code points are reserved for private use
  they will never be assigned to any character by the standard, so they will not conflict with the standard
Unicode: encodings
Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
identity mapping for an 8-bit code?
  it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
  e.g. Quoted-Printable (QP)
    code values 128-255 as a sequence of 3 bytes
      1: ASCII code for ’=’, 2 & 3: hexadecimal digits of the value
      233 -> E9 -> =E9
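The Quoted-Printable idea can be sketched in a few lines. This is a simplified illustration, not a complete implementation of the format: the function name `qp_encode` is hypothetical, and it only handles code values 128-255 and the escape character '=' itself, ignoring line-length limits and control characters.

```python
def qp_encode(data: bytes) -> str:
    """Simplified Quoted-Printable: escape 8-bit values and '=' as =XX."""
    out = []
    for value in data:
        if value >= 128 or value == ord("="):
            # '=' followed by the two hexadecimal digits of the code value
            out.append("=%02X" % value)
        else:
            # 7-bit ASCII characters are passed through unchanged
            out.append(chr(value))
    return "".join(out)

# 'é' has code value 233 (hex E9) in ISO 8859-1
encoded = qp_encode("café".encode("latin-1"))
```

Python's standard library also ships a `quopri` module for the full format.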
Unicode: encodings
UTF-8
  ASCII code values are likely to be more common in most text than any other values
    in UTF-8 encoding ASCII characters are sent themselves (high-order bit 0)
    other characters (two bytes) are encoded using up to six bytes in the original definition, at most four since RFC 3629 (high-order bit is set to 1)
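The variable-length behaviour of UTF-8 can be checked with Python's built-in `str.encode`. The sample characters below are arbitrary choices for illustration.

```python
samples = {"A": "A", "e_acute": "\u00e9", "euro": "\u20ac"}

# Number of bytes each character occupies in UTF-8
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# An ASCII character is sent as itself: one byte with high-order bit 0
ascii_byte = "A".encode("utf-8")[0]

# Every byte of a non-ASCII character has its high-order bit set to 1
high_bits = [b >> 7 for b in "\u00e9".encode("utf-8")]
```

Because ASCII text is unchanged, an ASCII file is also a valid UTF-8 file.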
Unicode: encodings
UTF-16: expansion method
  two 16-bit values are combined to a 32-bit value -> a million characters available
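A sketch of how the UTF-16 expansion method (surrogate pairs) combines two 16-bit values. The function name `to_surrogates` is hypothetical; the constants are those of the UTF-16 definition, and each half of the pair carries 10 bits, which gives 1024 x 1024 = 1,048,576 additional characters.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into a high and a low 16-bit value."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low

# U+1D11E (musical symbol G clef) lies outside the 16-bit range
high, low = to_surrogates(0x1D11E)
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The result can be compared with what Python's own `utf-16-be` codec produces for the same character.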
2. Preprocessing of text
Text cannot be directly interpreted by many document processing applications
an indexing procedure is needed
  mapping of a text into a compact representation of its content
which are the meaningful units of text?
how should these units be combined?
usually not ”important”
Vector model
A document is usually represented as a vector of term weights
the vector has as many dimensions as there are terms (or features) in the whole collection of documents
the weight represents how much the term contributes to the semantics of the document
Vector model
Different approaches:
  different ways to understand what a term is
  different ways to compute term weights
Terms
Words
  typical choice
  set of words, bag of words
phrases
  syntactical phrases
  statistical phrases
  usefulness not yet known?
Terms
Part of the text is not considered as terms
  very common words (function words): articles, prepositions, conjunctions
  numerals
these words are pruned
  stopword list
other preprocessing possible
  stemming, base words
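A minimal sketch of this preprocessing step. The stopword list and the suffix-stripping function `naive_stem` are tiny, hypothetical stand-ins for illustration; a real system would use a full stopword list and a proper stemmer (for example the Porter algorithm).

```python
import re

# Tiny illustrative stopword list (assumption, not a standard list)
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}

def naive_stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text: str) -> list:
    # Keeping only alphabetic tokens also drops numerals and punctuation
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

terms = extract_terms("The 3 classifiers are indexing the documents of a collection")
```

The resulting list is a bag of words: the order carries no meaning, only which terms occur and how often.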
Weights of terms
Weights usually range between 0 and 1
binary weights may be used
  1 denotes presence, 0 absence of the term in the document
often the tfidf function is used
  higher weight, if the term occurs often in the document
  lower weight, if the term occurs in many documents
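One common way to compute tfidf weights is sketched below. The exact variant is an assumption, since the slides do not fix one: here the weight is the term frequency times log(N / document frequency), followed by cosine normalization so that the weights fall between 0 and 1. The toy documents are invented for the example.

```python
import math
from collections import Counter

docs = [
    ["wheat", "farm", "wheat", "export"],
    ["wheat", "price"],
    ["bank", "price", "bank"],
]

def tfidf_vectors(docs):
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Cosine normalization keeps every weight in the range [0, 1]
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors

vectors = tfidf_vectors(docs)
```

In the first document "farm" ends up with a higher weight than "wheat", although "wheat" occurs twice, because "wheat" also appears in another document.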
Structure
Either the full text of the document or selected parts of it are indexed
e.g. in a patent categorization application
  title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
some parts may be considered more important
  e.g. higher weight for the terms in the title
Dimensionality reduction
Many algorithms cannot handle high dimensionality of the term space (= large number of terms)
  usually dimensionality reduction is applied
dimensionality reduction also reduces overfitting
  classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data
Dimensionality reduction
Local dimensionality reduction
  for each category, a reduced set of terms is chosen for classification under that category
  hence, different subsets are used when working with different categories
global dimensionality reduction
  a reduced set of terms is chosen for the classification under all categories
Dimensionality reduction
Dimensionality reduction by term selection
  the terms of the reduced term set are a subset of the original term set
Dimensionality reduction by term extraction
  the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones
Dimensionality reduction by term selection
Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application
wrapper approach
  the reduced set of terms is found iteratively and tested with the application
filtering approach
  keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task
Dimensionality reduction by term selection
Many functions available
document frequency: keep the high frequency terms
  stopwords have been already removed
  50% of the words occur only once in the document collection
  e.g. remove all terms occurring in at most 3 documents
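Term selection by document frequency is a simple filter. The sketch below uses a hypothetical helper `select_by_document_frequency` and a threshold parameter `min_df`; removing "all terms occurring in at most 3 documents" corresponds to `min_df=4`, while the toy corpus here uses a smaller threshold.

```python
from collections import Counter

def select_by_document_frequency(docs, min_df):
    """Keep the terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}

docs = [
    ["wheat", "farm", "wheat", "export"],
    ["wheat", "price"],
    ["bank", "price", "bank"],
]

# Remove all terms occurring in at most 1 document
kept = select_by_document_frequency(docs, min_df=2)
```

Repeated occurrences inside one document do not count: "bank" appears twice but only in a single document, so it is removed.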
Dimensionality reduction by term selection
Information-theoretic term selection functions, e.g.
  chi-square
  information gain
  mutual information
  odds ratio
  relevancy score
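As an illustration of one of these functions, the chi-square score of a term for a category can be computed from a 2x2 contingency table. This sketch uses the commonly cited form N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)); the function name, the toy documents, and the labels are assumptions made for the example.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between term occurrence and category membership."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_category = label == category
        if has_term and in_category:
            a += 1      # term present, document in category
        elif has_term:
            b += 1      # term present, document not in category
        elif in_category:
            c += 1      # term absent, document in category
        else:
            d += 1      # term absent, document not in category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denominator if denominator else 0.0

docs = [{"wheat", "farm"}, {"wheat", "export"}, {"bank", "loan"}, {"bank", "price"}]
labels = ["grain", "grain", "finance", "finance"]

score_wheat = chi_square(docs, labels, "wheat", "grain")
score_price = chi_square(docs, labels, "price", "grain")
```

Terms with the highest scores are the ones whose occurrence depends most strongly on the category, so a filtering approach would keep those.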
Dimensionality reduction by term extraction
Term extraction attempts to generate, from the original term set, a set of ”synthetic” terms that maximize effectiveness
due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation
Dimensionality reduction by term extraction
Term clustering
  tries to group words with a high degree of pairwise semantic relatedness
  groups (or their centroids) may be used as dimensions
latent semantic indexing
  compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
3. Text categorization
Text classification, topic classification/spotting/detection
problem setting:
  assume: a predefined set of categories, a set of documents
  label each document with one (or more) categories
Text categorization
Two major approaches:
knowledge engineering -> end of 80’s
  manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
machine learning, 90’s ->
  an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization
Let
  D: a domain of documents
  C = {c1, …, c|C|}: a set of predefined categories
  T = true, F = false
The task is to approximate the unknown target function Φ’: D x C -> {T,F} by means of a function Φ: D x C -> {T,F}, such that the functions ”coincide as much as possible”
function Φ’: how documents should be classified
function Φ: classifier (hypothesis, model…)
We assume...
Categories are just symbolic labels
  no additional knowledge of their meaning is available
No knowledge outside of the documents is available
  all decisions have to be made on the basis of the knowledge extracted from the documents
  metadata, e.g., publication date, document type, source etc. is not used
-> general methods
Methods do not depend on any application-dependent knowledge
  in operational applications all kinds of knowledge can be used
content-based decisions are necessarily subjective
  it is often difficult to measure the effectiveness of the classifiers
  even human classifiers do not always agree
Single-label vs. multi-label
Single-label text categorization
  exactly 1 category must be assigned to each dj ∈ D
Multi-label text categorization
  any number of categories may be assigned to the same dj ∈ D
Special case of single-label: binary
  each dj must be assigned either to category ci or to its complement ¬ci
Single-label, multi-label
The binary case (and, hence, the single-label case) is more general than the multi-label
  an algorithm for binary classification can also be used for multi-label classification
  the converse is not true
Category-pivoted vs. document-pivoted
Two different ways for using a text classifier
given a document, we want to find all the categories, under which it should be filed -> document-pivoted categorization (DPC)
given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)
Category-pivoted vs. document-pivoted
The distinction is important, since the sets C and D might not be available in their entirety right from the start
DPC: suitable when documents become available at different moments in time, e.g. filtering e-mail
CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)
Category-pivoted vs. document-pivoted
Some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode
Hard-categorization vs. ranking categorization
Hard categorization
  the classifier answers T or F
Ranking categorization
  given a document, the classifier might rank the categories according to their estimated appropriateness to the document
  respectively, given a category, the classifier might rank the documents
Applications of text categorization
Automatic indexing for Boolean information retrieval systems
document organization
text filtering
word sense disambiguation
hierarchical categorization of Web pages
Automatic indexing for Boolean IR systems
In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
  keywords belong to a finite set called controlled dictionary
TC problem: the entries in a controlled dictionary are viewed as categories
  k1 ≤ x ≤ k2 keywords are assigned to each document
  document-pivoted TC
Document organization
Indexing with a controlled vocabulary is an instance of the general problem of document base organization
e.g. a newspaper office has to classify the incoming ”classified” ads under categories such as Personals, Cars for Sale, Real Estate etc.
organization of patents, filing of newspaper articles...
Text filtering
Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
e.g. newsfeed
  producer: news agency; consumer: newspaper
  the filtering system should block the delivery of documents the consumer is likely not interested in
Word sense disambiguation
Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence
E.g.
  Bank of England
  the bank of river Thames
  ”Last week I borrowed some money from the bank.”
Word sense disambiguation
Indexing by word senses rather than by words
text categorization
  documents: word occurrence contexts
  categories: word senses
also resolving other natural language ambiguities
  context-sensitive spelling correction, part of speech tagging, prepositional phrase attachment, word choice selection in machine translation
Hierarchical categorization of Web pages
E.g. Yahoo-like hierarchical web catalogues
typically, each category should be populated by ”a few” documents
new categories are added, obsolete ones removed
usage of link structure in classification
usage of the hierarchical structure
Knowledge engineering approach
In the 80’s: knowledge engineering techniques
  manually building expert systems capable of making text categorization decisions
  expert system: consists of a set of rules
    wheat & farm -> wheat
    wheat & commodity -> wheat
    bushels & export -> wheat
    wheat & winter & ~soft -> wheat
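The four rules for the category "wheat" can be written down directly as data. This is only a sketch of how such a rule set might be evaluated: the representation as (required words, forbidden words) pairs and the function name `is_wheat` are assumptions, and tokenization is reduced to whitespace splitting.

```python
# Each rule: (words that must all occur, words that must not occur)
RULES = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "winter"}, {"soft"}),   # wheat & winter & ~soft
]

def is_wheat(text: str) -> bool:
    """Return True if any rule fires for the document text."""
    words = set(text.lower().split())
    return any(
        required <= words and not (forbidden & words)
        for required, forbidden in RULES
    )

decisions = [
    is_wheat("winter wheat harvest"),
    is_wheat("soft winter wheat"),
    is_wheat("wheat farm subsidies"),
    is_wheat("bank loan rates"),
]
```

Every rule has to be written and maintained by hand, which is the main cost of the knowledge engineering approach.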
Knowledge engineering approach
Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert
  any update again necessitates human intervention
  totally domain dependent
  -> expensive and slow process
Machine learning approach
A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
supervised learning (= supervised by the knowledge of the training documents)
Machine learning approach
The learner is domain independent
  usually available ’off-the-shelf’
the inductive process is easily repeated, if the set of categories changes
manually classified documents often already available
  manual process may exist
if not, it is still easier to manually classify a set of documents than to build and tune a set of rules
Training set, test set, validation set
Initial corpus of manually classified documents
let dj belong to the initial corpus
for each pair <dj, ci> it is known if dj should be filed under ci
positive examples, negative examples of a category
Training set, test set, validation set
The initial corpus is divided into two sets
  a training (and validation) set
  a test set
the training set is used to build the classifier
the test set is used for testing the effectiveness of the classifiers
  each document is fed to the classifier and the decision is compared to the manual category
Training set, test set, validation set
The documents in the test set are not used in the construction of the classifier
alternative: k-fold cross-validation
  k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets construct a training set and 1 set is used as a test set
  individual results are then averaged
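The k-fold procedure can be sketched as follows. The function names and the callback `train_and_score` are hypothetical: the callback is assumed to build a classifier from the training part and return an effectiveness score measured on the test part.

```python
def k_fold_splits(items, k):
    """Yield (training set, test set) pairs over k disjoint folds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The remaining k-1 folds together form the training set
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items, k, train_and_score):
    """Average the scores of the k train-and-test rounds."""
    scores = [train_and_score(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(list(range(10)), 5))
```

Each document is used for testing exactly once and for training k-1 times, so no document is ever in both sets of the same round.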
Training set, test set, validation set
Training set can be split to two parts
one part is used for optimising parameters
  test which values of parameters yield the best effectiveness
test set and validation set must be kept separate
Inductive construction of classifiers
A ranking classifier for a category ci
definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
documents are ranked according to their categorization status value
Inductive construction of classifiers
A hard classifier for a category
  definition of a function that returns true or false, or
  definition of a function that returns a value between 0 and 1, followed by a definition of a threshold
    if the value is higher than the threshold -> true
    otherwise -> false
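The relation between a ranking classifier and a hard classifier can be sketched with a few lines. The categorization status values and the threshold of 0.5 below are invented for illustration; in practice the threshold would be tuned on a validation set.

```python
def rank_documents(status_values):
    """Ranking classifier: order documents by categorization status value."""
    return sorted(status_values, key=status_values.get, reverse=True)

def hard_decisions(status_values, threshold):
    """Hard classifier: true if the value is higher than the threshold."""
    return {doc: value > threshold for doc, value in status_values.items()}

# Hypothetical categorization status values for one category
status_values = {"d1": 0.91, "d2": 0.35, "d3": 0.62}

ranking = rank_documents(status_values)
decisions = hard_decisions(status_values, threshold=0.5)
```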
Learners
Probabilistic classifiers (Naïve Bayes)
decision tree classifiers
decision rule classifiers
regression methods
on-line methods
neural networks
example-based classifiers (k-NN)
support vector machines
Rocchio method
Linear classifier method
for each category, an explicit profile (or prototypical document) is constructed
  benefit: profile is understandable even for humans
Rocchio method
A classifier is a vector of the same dimension as the documents
weights:

$$w_{ki} = \sum_{d_j \in POS_i} \frac{w_{kj}}{|POS_i|} - \sum_{d_j \in NEG_i} \frac{w_{kj}}{|NEG_i|}$$

classifying: cosine similarity of the category vector and the document vector
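The Rocchio weight computation and the cosine-based classification can be sketched as below. Documents are represented as dictionaries of term weights. The parameters `beta` and `gamma` are assumptions taken from common formulations of the Rocchio method, where they weight the positive and negative examples; with both set to 1 the profile is simply the average positive weight minus the average negative weight of each term. The toy training documents are invented.

```python
import math

def rocchio_profile(pos, neg, beta=1.0, gamma=1.0):
    """Category profile: weighted average of positives minus negatives."""
    terms = {t for doc in pos + neg for t in doc}
    profile = {}
    for t in terms:
        pos_avg = sum(doc.get(t, 0.0) for doc in pos) / len(pos) if pos else 0.0
        neg_avg = sum(doc.get(t, 0.0) for doc in neg) / len(neg) if neg else 0.0
        profile[t] = beta * pos_avg - gamma * neg_avg
    return profile

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

pos = [{"wheat": 1.0, "farm": 0.5}, {"wheat": 0.8, "export": 0.4}]
neg = [{"bank": 1.0, "price": 0.5}]

profile = rocchio_profile(pos, neg)
score_wheat_doc = cosine(profile, {"wheat": 1.0})
score_bank_doc = cosine(profile, {"bank": 1.0})
```

The profile itself is readable: terms with large positive weights characterize the category, and terms with negative weights speak against it.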