Topic No. 1. NATURAL LANGUAGE SIGN SYSTEMS
MAIN SECTIONS
1.1. Models and methods of knowledge representation and organization (lectures 1-2).
1.2. Quantitative specification of natural language systems (lectures 3-4, 8).
1.3. Logical-statistical methods of knowledge retrieval (lectures 5-7).
OPTIONAL SECTIONS FOR SELF-STUDY
1.4. Technology of automated thesaurus formation.
1.5. Example of studying a natural language resource.
Lecture 5.
LOGICAL-STATISTICAL METHODS OF
KNOWLEDGE ACQUISITION
• Distribution-statistical method
• Component analysis
• Frequency-semantic method
References
Lecture materials can be found in:
Yu.N. Filippovich, A.V. Prohorov. Semantics of information technologies: practices of dictionary-thesaurus description. Computer linguistics series. Introductory article by A.I. Novikov. M.: MGUP, 2002. CD-ROM included. pp. 46–54.
DISTRIBUTION-STATISTICAL METHOD
Basic hypothesis: meaningful language elements (words) that occur together within a text interval are semantically connected to each other.
• Quantitative (frequency) characteristics of the separate or joint occurrence of meaningful language elements
• Formula of the ‘connection strength’ coefficient
• Semantic classification of meaningful language elements
FREQUENCY CHARACTERISTICS OF CONTEXTS
Context $C_i(T)$: a piece of text, a sequence (chain) of syntagmas.
$T = C_1(T) + \dots + C_q(T)$, where $C_i(T) \cap C_j(T) = \varnothing$ for all $i, j \in [1, q]$, $i \neq j$.
If a syntagma is a meaningful language element (word), then:
$N_A$, $f_A = N_A/N$: number and frequency of contexts in which only word A occurred;
$N_B$, $f_B = N_B/N$: number and frequency of contexts in which only word B occurred;
$N_{AB}$, $f_{AB} = N_{AB}/N$: number and frequency of contexts in which words A and B occurred jointly;
$N$: total number of contexts.
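The counts defined above can be sketched in a few lines of Python. This is only an illustration: the function name `context_frequencies`, the representation of each context as a set of words, and the toy contexts are assumptions, not part of the lecture.

```python
def context_frequencies(contexts, a, b):
    """Count contexts with only `a`, only `b`, and both; return counts and relative frequencies."""
    n = len(contexts)
    n_a = sum(1 for c in contexts if a in c and b not in c)   # only word A occurred
    n_b = sum(1 for c in contexts if b in c and a not in c)   # only word B occurred
    n_ab = sum(1 for c in contexts if a in c and b in c)      # joint occurrence
    return {"N": n, "N_A": n_a, "N_B": n_b, "N_AB": n_ab,
            "f_A": n_a / n, "f_B": n_b / n, "f_AB": n_ab / n}

# Hypothetical toy text already split into four non-overlapping contexts.
contexts = [
    {"language", "sign"},
    {"language", "system"},
    {"sign", "system", "language"},
    {"text"},
]
stats = context_frequencies(contexts, "language", "sign")
```

Each context is stored as a set, so repeated words inside one context are counted once, which matches counting contexts rather than word tokens.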
FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)
$K_{AB} = f_{AB} = \frac{N_{AB}}{N}$;
[formula not preserved in the transcript] (T.T. Tanimoto, L.B. Doyle);
$K_{AB} = N_{AB} - \frac{f_A f_B}{N}$ (M.E. Maron, J. Kuhns).
FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
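$K_{AB} = \frac{f_{AB} N}{f_A f_B}$ (A.Ya. Shaikevich, G. Salton, R.M. Curtice);
[formula not preserved in the transcript] (S. Dennis);
[formula not preserved in the transcript] (H.E. Stiles).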
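As a rough illustration, two of the coefficients can be computed directly from context counts. The sketch assumes the unattributed formula reads $K_{AB} = f_{AB} = N_{AB}/N$ and the Shaikevich, Salton and Curtice formula reads $K_{AB} = f_{AB} N / (f_A f_B)$; the function names and the sample counts are hypothetical.

```python
def k_joint(n_ab, n):
    """Assumed reading: K_AB = f_AB = N_AB / N."""
    return n_ab / n

def k_ratio(n_a, n_b, n_ab, n):
    """Assumed reading: K_AB = f_AB * N / (f_A * f_B), with f = count / N."""
    f_a, f_b, f_ab = n_a / n, n_b / n, n_ab / n
    if f_a == 0 or f_b == 0:
        return None  # coefficient is undefined when either word never occurs
    return f_ab * n / (f_a * f_b)

# Hypothetical counts: 100 contexts, A in 20, B in 10, both in 5.
simple = k_joint(5, 100)
ratio = k_ratio(20, 10, 5, 100)
```

The two values live on different scales, which is one reason the estimated coefficient needs interpretation before words are classified.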
ANALYSIS OF FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)
All formulae of the ‘connection strength’ coefficient share one premise: the occurrences of words A and B are treated as a system of random events.
The method makes it possible to test independence: if A and B are independent events, then P(AB) = P(A)P(B).
The estimated value of the ‘connection strength’ coefficient requires interpretation (explanation).
The size of the context (number of surrounding words) most likely determines the type of connection detected:
a) 1–2 words: contact syntagmatic connections of word combinations;
b) 5–10 words: distant syntagmatic connections and paradigmatic relations;
c) 50–100 words: thematic connections between words.
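The effect of context size on the independence test can be sketched as follows. The helper names `windows` and `independence_gap`, the non-overlapping split, and the toy token sequence are assumptions made for illustration; here P(A) and P(B) are estimated as the share of contexts containing the word at all.

```python
def windows(tokens, size):
    """Split a token list into consecutive non-overlapping contexts of `size` tokens."""
    return [set(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def independence_gap(contexts, a, b):
    """Return P(AB) - P(A)P(B), estimated from the share of contexts containing each word."""
    n = len(contexts)
    p_a = sum(a in c for c in contexts) / n
    p_b = sum(b in c for c in contexts) / n
    p_ab = sum(a in c and b in c for c in contexts) / n
    return p_ab - p_a * p_b

tokens = "a b x x a b x x".split()
gap_size_2 = independence_gap(windows(tokens, 2), "a", "b")  # a and b share contexts
gap_size_1 = independence_gap(windows(tokens, 1), "a", "b")  # one word per context
```

A gap near zero is consistent with independence; a clearly positive gap suggests the words attract each other at that context size.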
ANALYSIS OF FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)
Matrix of cohesion of language units (words), or associative matrix:
word | word frequency | ... | a_i  | ...
     |                | ... | f_a  | ...
...  |                |     |      |
b_j  | f_b            | ... | f_ab | ...
...  |                |     |      |
Directions of method implementation:
• formation of the core of thematically connected texts;
• automated construction of a thesaurus;
• information search and indexing;
• automated abstracting.
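A minimal sketch of how such an associative matrix might be filled from contexts is given below. The sparse representation (two `Counter` objects instead of a dense table), the function name, and the sample contexts are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

def associative_matrix(contexts):
    """Return per-word context counts and per-pair joint context counts."""
    word_freq = Counter()
    pair_freq = Counter()
    for context in contexts:
        words = sorted(set(context))                # each word counted once per context
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))    # unordered pairs, alphabetical order
    return word_freq, pair_freq

contexts = [
    ["sign", "language"],
    ["language", "system"],
    ["sign", "language", "system"],
]
word_freq, pair_freq = associative_matrix(contexts)
```

Only pairs that actually co-occur are stored, which keeps the structure small for a large vocabulary.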
METHODOLOGY FOR THESAURUS CONSTRUCTION BASED ON DISTRIBUTION-STATISTICAL METHOD
1. Compilation of frequency glossaries and concordances.
2. Analysis of the joint occurrence of words (language units) and, on its basis, compilation of the associative matrix.
3. Subjective interpretation of the associative matrix and formation of classes of typical connections (relations).
4. Grouping (separation) of specific relation types (genus-species, causal, etc.).
5. Interpretation of individual word connections.
6. Grouping of semantic fields.
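The first step of this methodology, a frequency glossary and a concordance, might look like the sketch below. The function names, the keyword-in-context layout, and the sample sentence are illustrative assumptions rather than the tooling used in the lecture.

```python
from collections import Counter

def frequency_glossary(tokens):
    """Return (word, count) pairs ordered by decreasing count."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines

tokens = "the sign system of the language is a sign system".split()
glossary = frequency_glossary(tokens)
kwic = concordance(tokens, "sign", width=1)
```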
COMPONENT ANALYSIS
The method of component analysis makes it possible to trace the connection between two notions by analysing their definitions:
Definition of notion A - Notion A - $f_{AB}$ - Notion B - Definition of notion B
Main modifications of the method:
• Quantitative specification of connection.
• Hypertext link.
QUANTITATIVE SPECIFICATION OF CONNECTION
Two words A and B are considered connected with connection strength $f_{AB} = k$ if their definitions contain k common words:
$\{x_i^{AB}\}$: the set of words used in the definitions of both A and B;
$k = |\{x_i^{AB}\}|$: the number of such words, where $f_{AB} = k > 1$.
Clusters of words connected with connection strength f = k, k = 1, 2, 3, ..., K.
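A small sketch of this count: connection strength as the number of words shared by two definitions, with word pairs grouped by that strength. The toy definitions and the function names are hypothetical, and definitions are assumed to be already tokenized and normalized.

```python
from itertools import combinations

def connection_strength(def_a, def_b):
    """k = number of words common to both definitions."""
    return len(set(def_a) & set(def_b))

def clusters_by_strength(definitions):
    """Group word pairs by connection strength k (pairs with k = 0 are skipped)."""
    clusters = {}
    for a, b in combinations(sorted(definitions), 2):
        k = connection_strength(definitions[a], definitions[b])
        if k >= 1:
            clusters.setdefault(k, []).append((a, b))
    return clusters

# Hypothetical, heavily simplified definitions.
definitions = {
    "river": ["large", "natural", "stream", "water"],
    "lake": ["large", "body", "water"],
    "brook": ["small", "natural", "stream", "water"],
}
clusters = clusters_by_strength(definitions)
```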
HYPERTEXT LINK
Two words A and B are considered connected if their definitions share a common word: $f_{AB} = k = 1$.
Uses of hypertext links:
• lexicographical systems (e-dictionaries and encyclopedias);
• e-texts;
• information and reference systems, etc.
Possible uses for knowledge analysis:
• analysis of a system of definitions or of an explanatory dictionary;
• assessment of the quality of dictionary entries (by number of connections with other entries, by chain length);
• examination of extracts in explanatory dictionaries;
• analysis of text dictionaries;
• examination of help systems.
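The two quality measures mentioned above, number of connections and chain length, can be sketched over a link graph built from shared definition words. The graph construction, the breadth-first search, and the toy entries are assumptions for illustration.

```python
from collections import deque

def link_graph(definitions):
    """Link two entries when their definitions share at least one word."""
    words = list(definitions)
    return {a: {b for b in words
                if b != a and set(definitions[a]) & set(definitions[b])}
            for a in words}

def chain_length(graph, start, goal):
    """Length of the shortest chain of links from start to goal, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical dictionary entries with simplified definitions.
definitions = {
    "river": ["stream", "water"],
    "lake": ["body", "water"],
    "pond": ["small", "body"],
    "rock": ["hard", "mineral"],
}
graph = link_graph(definitions)
```

An entry with no links, such as "rock" here, is a candidate for review because it is isolated from the rest of the definition system.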
FREQUENCY-SEMANTIC METHOD
The frequency-semantic method uses two characteristics of word definitions as criteria for estimating connection strength: similarity of elements and their frequency.
Idea of the method: «...imagine the forces of semantic cohesion as a field that exists everywhere, diffused through the language, with bodies in it: the lexical units of the language. Different units interact the same way as atoms, molecules, macroscopic bodies, planets and space objects do: on one level, i.e. between homogeneous units, as well as between levels.»
Basic data:
• ideographic dictionaries;
• a concise explanatory dictionary of Russian for foreigners;
• the explanatory dictionaries of S.I. Ozhegov and D.N. Ushakov.
References
Karaulov Yu.N. Frequency dictionary of semantic multipliers of the Russian language. M.: Nauka, 1980.
Karaulov Yu.N., Molchanov V.I., Afanasiev V.A., Mihalev N.V. Analysis of dictionary metalanguage using a computer. M.: Nauka, 1982. 96 p.
FORMATION OF SEMANTIC FIELDS (1)
If $a_{w_i d_j} \in A_k^{DW}$, then $w_i \in D_j$, where:
$a_{w_i d_j}$: value of the semantic connection strength between word $w_i$ and descriptor $d_j$;
$A_k^{DW}$: set of acceptable values of the semantic connection strength between descriptors and words;
$D_j = \{w_{ij}\}$: set of words of a descriptor;
$w_i$: word, $i = 1 \dots |W|$, $W = \{w_i\}$: set of words;
$d_j$: descriptor, $j = 1 \dots |D|$, $D = \{d_j\}$: set of descriptors.
Practical task: distribute 9000 words among 1600 descriptors.
FORMATION OF SEMANTIC FIELDS (2)
ISSUES OF PRACTICAL TASK SOLUTION
1. Determine the way to compare words:
• choose the way to obtain (indicate) the semantic multiplier (lemmatization, folding, root indication, indication of the word stem or quasi-stem);
• develop a methodology for obtaining the semantic code of a word.
2. Determine the frequency characteristics of semantic multipliers.
3. Identify the criterion for the semantic connection of words and descriptors:
• phenomenological model of unit connectivity;
• phenomenological model of K connectivity;
• connectivity model that accounts for the frequency of multipliers.
DETERMINE THE WAY TO COMPARE WORDS
Definition of a word/descriptor: ~10 word forms.
Total number in the experiment: ~110000 word forms.
Semantic multiplier: an elementary unit of the content plane.
Basic assumptions:
a) the semantic decomposition of language is discrete;
b) the set of elements of the decomposition is finite and observable;
c) the number of their combinations is practically infinite;
d) the semantic decomposition is elementary, i.e. it consists of indecomposable elements;
e) semantic elements are homogeneous, i.e. they refer to content (they are elements of perception and thinking);
f) semantic elements form a universal set, i.e. they are of general character, and their number and range are similar across languages.
WAYS TO OBTAIN (INDICATE) SEMANTIC MULTIPLIER
Lemmatization: reduction of a word to its canonical form.
Folding: compression of the word by deleting all vowels except the vowel of the first syllable.
Root indication: representation of the word by its root morpheme.
Word stem indication: representation of the word by several morphemes, for example prefix and root.
Quasi-stem indication: representation of the word by an arbitrary initial part, based on the fact that the meaning (content) of a word is shifted toward its beginning.
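Two of these techniques, folding and quasi-stem indication, are simple enough to sketch. The English vowel set, the reading of "vowel of the first syllable" as the first vowel letter, and the default quasi-stem length of four letters are all assumptions; the lecture's own experiments concern Russian.

```python
VOWELS = set("aeiouy")  # assumption: English vowel letters, for illustration only

def fold(word):
    """Keep the first vowel letter and drop every later vowel."""
    out, seen_vowel = [], False
    for ch in word.lower():
        if ch in VOWELS:
            if seen_vowel:
                continue        # later vowels are deleted
            seen_vowel = True   # the first vowel is kept
        out.append(ch)
    return "".join(out)

def quasi_stem(word, length=4):
    """Represent the word by its first `length` letters (length is a free parameter)."""
    return word.lower()[:length]

folded = fold("semantic")
stem = quasi_stem("descriptor")
```

Both transformations are purely formal, so words with different meanings can collapse to the same multiplier; the filtering and frequency steps are meant to limit that effect.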
METHOD OF OBTAINING SEMANTIC CODE OF THE WORD
METHOD PROCEDURES
1. Inclusion of the coded word in its own code.
2. Exclusion of repeated semantic multipliers.
3. Filtering (deletion) of:
• «zero» semantic multipliers;
• grammatical words (prepositions, conjunctions, etc.).
4. Lexicalisation of collocations.
5. Formation of quasi-stems.
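A rough sketch of the procedure, covering steps 1, 2, 3 and 5 (lexicalisation of collocations is omitted). The stop-word list, the four-letter quasi-stem, the function name, and the order in which the steps are applied are assumptions made for illustration.

```python
# Assumption: a tiny illustrative list of grammatical words to filter out.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to", "that", "is"}

def semantic_code(word, definition, stem_len=4):
    """Return the ordered list of semantic multipliers (quasi-stems) for a word."""
    tokens = [word.lower()] + definition.lower().split()   # step 1: the word enters its own code
    kept = [t for t in tokens if t not in STOP_WORDS]      # step 3: drop grammatical words
    stems = [t[:stem_len] for t in kept]                   # step 5: form quasi-stems
    return list(dict.fromkeys(stems))                      # step 2: drop repetitions, keep order

code = semantic_code("river", "a large natural stream of water")
```

Repetitions are removed after the quasi-stems are formed, so different word forms that reduce to the same quasi-stem count as one multiplier.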
RESULTS OF METHOD IMPLEMENTATION
a) descriptor: $d_j = \{s_x^{d_j}\}$;
b) word: $w_i = \{s_x^{w_i}\}$.
DETERMINATION OF FREQUENCY CHARACTERISTICS OF SEMANTIC MULTIPLIERS
Two frequency characteristics are associated with a semantic multiplier X:
$f_x^D$: frequency of occurrence of the multiplier in descriptor definitions;
$f_x^W$: frequency of occurrence of the multiplier in word definitions.
Methodology for frequency analysis of semantic multipliers:
a) computing the frequencies;
b) ranking the multipliers and ordering them within definitions by increasing rank.
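The frequency analysis can be sketched as below. The sketch assumes that the frequency of a multiplier is the number of definitions in which it occurs and that rank 1 goes to the most frequent multiplier; both choices, like the function names and the sample codes, are illustrative assumptions.

```python
from collections import Counter

def multiplier_frequencies(codes):
    """codes maps a name to its semantic multipliers; count definitions containing each multiplier."""
    freq = Counter()
    for multipliers in codes.values():
        freq.update(set(multipliers))
    return freq

def rank_by_frequency(freq):
    """Assign rank 1 to the most frequent multiplier; ties are broken alphabetically."""
    ordered = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return {multiplier: rank for rank, (multiplier, _) in enumerate(ordered, start=1)}

# Hypothetical descriptor codes.
descriptor_codes = {"WATER": {"wate", "liqu"}, "NATURE": {"natu", "wate"}}
freq_d = multiplier_frequencies(descriptor_codes)
ranks = rank_by_frequency(freq_d)
```

The same function applied to the word codes would give the second frequency characteristic.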
CRITERION OF SEMANTIC CONNECTIVITY BETWEEN WORDS AND DESCRIPTORS
Stages of development of the criterion:
1. Phenomenological model of unit connectivity: the definitions of the word and the descriptor have at least one common multiplier:
$|d_j \cap w_i| \geq 1$;
2. Phenomenological model of K connectivity: the definitions of the word and the descriptor have K common semantic multipliers:
$|d_j \cap w_i| = |\{s_x^{d_j}\} \cap \{s_x^{w_i}\}| = K$;
3. Connectivity model that accounts for the frequency of multipliers (selective criterion of Karaulov):
$K > 2$; $f_x^D > 6$.
SELECTIVE CRITERION OF KARAULOV
$a_{w_i d_j}$: $|\{s_x^{d_j}\} \cap \{s_x^{w_i}\}| = K > 2$ or ($|\{s_x^{d_j}\} \cap \{s_x^{w_i}\}| = K \geq 1$ and $f_x^D > 6$).
A word and a descriptor are semantically connected if their definitions share more than two semantic multipliers, or if they share at least one semantic multiplier whose frequency in the set of descriptors is greater than six.
Semantic fields construction procedure
1. Construction of the field according to the unit connectivity model.
2. Narrowing of the field by the number of coinciding multipliers.
3. Narrowing of the field with regard to the frequency of semantic multipliers.
If the criterion is satisfied, then $w_i \in D_j$.
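A minimal sketch of the selective criterion and of field construction, using the thresholds stated in the text (more than two common multipliers, or at least one common multiplier with descriptor frequency above six). It assumes that, when several multipliers are shared, one sufficiently frequent multiplier is enough; the function names and sample data are hypothetical.

```python
def is_connected(word_code, descriptor_code, descriptor_freq, min_common=2, min_freq=6):
    """Selective criterion: K > min_common, or some common multiplier with f_x^D > min_freq."""
    common = set(word_code) & set(descriptor_code)
    if len(common) > min_common:
        return True
    # Assumption: any one common multiplier above the frequency threshold suffices.
    return any(descriptor_freq.get(m, 0) > min_freq for m in common)

def build_fields(word_codes, descriptor_codes, descriptor_freq):
    """Assign each word to every descriptor it is connected with."""
    fields = {d: set() for d in descriptor_codes}
    for word, w_code in word_codes.items():
        for descriptor, d_code in descriptor_codes.items():
            if is_connected(w_code, d_code, descriptor_freq):
                fields[descriptor].add(word)
    return fields

# Hypothetical codes and descriptor frequencies.
descriptor_codes = {"WATER": {"wate", "liqu", "flow"}, "STONE": {"ston", "hard"}}
descriptor_freq = {"wate": 9, "liqu": 3, "flow": 2, "ston": 2, "hard": 4}
word_codes = {
    "river": {"rive", "wate", "stre"},
    "pebble": {"pebb", "ston", "smal"},
    "brook": {"liqu", "flow", "wate", "smal"},
}
fields = build_fields(word_codes, descriptor_codes, descriptor_freq)
```

In this toy data "pebble" shares one multiplier with STONE, but its descriptor frequency is too low, so the word stays outside the field; this is the narrowing effect of the third stage.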
QUESTIONS FOR SELF-CHECK
Name the logical-statistical methods of knowledge retrieval from texts.
Describe the distribution-statistical method of text analysis.
Describe the frequency-semantic method of text analysis.
Describe the component analysis of text.