1 l5eng

Topic No 1. NATURAL LANGUAGE SIGN SYSTEMS

MAIN SECTIONS

1.1. Models and methods of representation and organization of knowledge: lectures 1-2.

1.2. Quantitative specification of natural language systems: lectures 3-4, 8.

1.3. Logical-statistical methods of knowledge retrieval: lectures 5-7.

OPTIONAL SECTIONS FOR SELF-STUDY

1.4. Technology of automated formation of a thesaurus.

1.5. Example of studying a natural language resource.

Lecture 5.

LOGICAL-STATISTICAL METHODS OF KNOWLEDGE ACQUISITION

• Distribution-statistical method
• Componential analysis
• Frequency-semantic method

References

Lecture materials can be found in:

Yu.N. Filippovich, A.V. Prohorov. Semantics of information technologies: practices of dictionary-thesaurus description. Computer linguistics series. Introductory article by A.I. Novikov. M.: MGUP, 2002. CD-ROM in package. pp. 46–54.

DISTRIBUTION-STATISTICAL METHOD

Basic hypothesis: meaningful language elements (words) that occur together within a text interval are semantically connected with each other.

Stages of the method:

1. Quantitative (frequency) characteristics of the sole or joint occurrence of meaningful language elements.
2. ‘Connection strength’ coefficient formula.
3. Semantic classification of meaningful language elements.

FREQUENCY CHARACTERISTICS OF CONTEXTS

Context $C_i(T)$: a piece of text, a sequence (chain) of syntagmas.

$T = C_1(T) + \dots + C_q(T)$, where $C_i(T) \cap C_j(T) = \varnothing$ for all $i, j \in [1, q]$, $i \neq j$.

If a syntagma is a meaningful language element (a word), then:

$N_A$, $f_A = N_A/N$: number and frequency of contexts in which only word A occurred;

$N_B$, $f_B = N_B/N$: number and frequency of contexts in which only word B occurred;

$N_{AB}$, $f_{AB} = N_{AB}/N$: number and frequency of contexts in which words A and B occurred jointly;

$N$: total number of contexts.
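The counts defined above can be sketched in Python. This is only an illustration: it assumes each context is already tokenized into a list of words, and the function name `context_counts` and the sample contexts are hypothetical, not taken from the lecture. "Only word A occurred" is read here as "A is present and B is absent".

```python
from typing import Iterable, Sequence


def context_counts(contexts: Iterable[Sequence[str]], a: str, b: str) -> dict:
    """Count contexts by which of the words a and b they contain."""
    n = n_a = n_b = n_ab = 0
    for context in contexts:
        words = set(context)
        has_a, has_b = a in words, b in words
        n += 1
        if has_a and has_b:
            n_ab += 1      # joint occurrence of A and B
        elif has_a:
            n_a += 1       # only A occurred
        elif has_b:
            n_b += 1       # only B occurred
    if n == 0:
        raise ValueError("at least one context is required")
    return {
        "N": n, "N_A": n_a, "N_B": n_b, "N_AB": n_ab,
        "f_A": n_a / n, "f_B": n_b / n, "f_AB": n_ab / n,
    }


# Hypothetical sample: four contexts, words A = "language", B = "sign".
contexts = [
    ["language", "sign", "system"],
    ["language", "model"],
    ["sign", "code"],
    ["text", "corpus"],
]
stats = context_counts(contexts, "language", "sign")
print(stats)
```

With this sample, one context contains only A, one only B, and one both, out of four contexts in total.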

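FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)

$K_{AB} = f_{AB} = \frac{N_{AB}}{N}$.

[Formula lost in extraction.] (T.T. Tanimoto, L.B. Doyle)

[Formula only partly recoverable: it combines $N_{AB}$, $f_A$, $f_B$ and $N$.] (M.E. Maron, J. Kuhns)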

FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)

$K_{AB} = \frac{f_{AB} N}{f_A f_B}$ (A.Ya. Shaikevich, G. Salton, R.M. Curtice)

[Formula lost in extraction.] (S. Dennis)

[Formula lost in extraction.] (H.E. Stiles)

ANALYSIS OF FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (1)

All formulae of the ‘connection strength’ coefficient share one view: the events related to the occurrence of words A and B are treated as a system of random events.

The method procedure makes it possible to test for independence: if A and B are independent events, then $P(AB) = P(A)P(B)$.

The estimated value of the ‘connection strength’ coefficient needs interpretation (explanation).

The size of the context (the number of surrounding words) most likely determines the kind of connection that is detected:

a) 1–2 words: contact syntagmatic connections of word combinations;

b) 5–10 words: distant syntagmatic connections and paradigmatic relations;

c) 50–100 words: thematic connections between words.
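The independence test and the role of context size can be illustrated with a small Python sketch. It assumes that the text is cut into consecutive non-overlapping contexts of a fixed number of words, and that $P(A)$ is estimated as the share of contexts containing A (with or without B). Both choices, and the names `split_into_contexts` and `independence_gap`, are assumptions made for the example.

```python
def split_into_contexts(tokens, size):
    """Cut a token list into consecutive, non-overlapping contexts of `size` words."""
    if size < 1:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def independence_gap(tokens, a, b, size):
    """Return (P(AB), P(A) * P(B)) estimated over contexts of the given size."""
    contexts = [set(c) for c in split_into_contexts(tokens, size)]
    n = len(contexts)
    if n == 0:
        raise ValueError("empty text")
    p_a = sum(a in c for c in contexts) / n          # contexts containing A
    p_b = sum(b in c for c in contexts) / n          # contexts containing B
    p_ab = sum(a in c and b in c for c in contexts) / n
    return p_ab, p_a * p_b


# Hypothetical toy text; real studies use far larger corpora.
tokens = "the cat sat on the mat the dog sat on the rug".split()
for size in (2, 6):
    observed, expected = independence_gap(tokens, "cat", "sat", size)
    print(size, observed, expected)
```

A large difference between the observed joint frequency and the product of the individual frequencies suggests that the two words do not occur independently at that context size.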

ANALYSIS OF FORMULAE OF ‘CONNECTION STRENGTH’ COEFFICIENT (2)

Matrix of language unit (word) cohesion, or associative matrix:

word | | ... | $a_i$ | ...
 | word frequency | ... | $f_a$ | ...
... | | | |
$b_j$ | $f_b$ | ... | $f_{ab}$ | ...
... | | | |

Directions of method implementation:

• formation of the core of thematically connected texts;
• automated construction of a thesaurus;
• information search and indexing;
• automated abstracting.
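A minimal Python sketch of how such an associative matrix might be filled from tokenized contexts. It stores the matrix sparsely as two dictionaries. The function name `associative_matrix`, the sample contexts, and the choice to measure word frequency as the share of contexts containing the word are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def associative_matrix(contexts):
    """Return (word frequencies, pair frequencies) measured over contexts."""
    n = len(contexts)
    if n == 0:
        raise ValueError("at least one context is required")
    word_counts = Counter()
    pair_counts = Counter()
    for context in contexts:
        words = sorted(set(context))
        word_counts.update(words)
        # Unordered pairs (a, b) with a < b; each pair counted once per context.
        pair_counts.update(combinations(words, 2))
    f_word = {w: c / n for w, c in word_counts.items()}
    f_pair = {p: c / n for p, c in pair_counts.items()}
    return f_word, f_pair


# Hypothetical sample contexts.
sample = [
    ["language", "sign", "system"],
    ["language", "model"],
    ["sign", "code"],
    ["text", "corpus"],
]
f_word, f_pair = associative_matrix(sample)
print(f_word["language"], f_pair[("language", "sign")])
```

Pairs that never co-occur are simply absent from `f_pair`, which keeps the structure small for large vocabularies.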

METHODOLOGY FOR THESAURUS CONSTRUCTION BASED ON DISTRIBUTION-STATISTICAL METHOD

1. Compilation of frequency glossaries and concordances.
2. Analysis of the joint occurrence of words (language units) and, on its basis, compilation of the associative matrix.
3. Subjective interpretation of the associative matrix and formation of classes of typical connections (relations).
4. Grouping (segregation) of specific relation types (genus-species, causal, etc.).
5. Interpretation of separate word connections.
6. Grouping of semantic fields.
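The first step of this methodology, compiling a frequency glossary and a concordance, can be sketched as follows. The tokenization rule, the function names and the sample text are assumptions for the example; the lecture does not prescribe them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word forms."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_glossary(tokens):
    """Word forms with their counts, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, keyword, width=2):
    """Each occurrence of `keyword` with `width` words of context on both sides."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, token, right))
    return lines


# Hypothetical sample text.
text = "A sign system is a system of signs. A language is a sign system."
tokens = tokenize(text)
glossary = frequency_glossary(tokens)
lines = concordance(tokens, "system")
print(glossary[:2])
print(lines)
```

The concordance lines give the raw material for the second step, the analysis of joint occurrence.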

COMPONENTIAL ANALYSIS

The method of componential analysis makes it possible to trace the connection between two notions on the basis of an analysis of their definitions.

[Diagram: notion A with its definition and notion B with its definition, linked by the connection $f_{AB}$.]

Main method modifications:

• Quantitative specification of connection.
• Hypertext link.

QUANTITATIVE SPECIFICATION OF CONNECTION

Two words A and B are considered connected with connection strength $f_{ab} = k$ if their definitions have $k$ words in common:

$\{x_i^{AB}\}$: the set of words used in the definitions of both A and B;

$|\{x_i^{AB}\}| = k$, where $k > 1$: the number of these common words.

Clusters of words are formed from words connected with connection strength $f = k$, $k = 1, 2, 3, \dots, K$.
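A short Python sketch of this quantitative specification, assuming definitions are plain strings split on whitespace. The toy dictionary, the function names and the decision to group word pairs by their value of $k$ are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def connection_strength(definition_a, definition_b, stop_words=frozenset()):
    """Return (k, common words) for two definitions given as strings."""
    words_a = set(definition_a.lower().split()) - stop_words
    words_b = set(definition_b.lower().split()) - stop_words
    common = words_a & words_b
    return len(common), common


def pairs_by_strength(definitions):
    """Group word pairs by the number k of words their definitions share."""
    groups = defaultdict(list)
    for a, b in combinations(sorted(definitions), 2):
        k, _ = connection_strength(definitions[a], definitions[b])
        if k >= 1:
            groups[k].append((a, b))
    return dict(groups)


# Hypothetical toy dictionary.
definitions = {
    "dictionary": "book listing words with their meanings",
    "thesaurus": "book listing words grouped by meanings",
    "atlas": "book of maps",
}
k, common = connection_strength(definitions["dictionary"], definitions["thesaurus"])
groups = pairs_by_strength(definitions)
print(k, sorted(common))
print(groups)
```

In practice grammatical words would be passed in `stop_words`, otherwise prepositions and articles inflate $k$.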

HYPERTEXT LINK

Two words A and B are considered connected if the definition of each word contains a common word: $f_{ab} = k = 1$.

Hypertext link usage:

• lexicographical systems (e-dictionaries and encyclopedias);
• e-texts;
• information and reference systems, etc.

Possible usage for knowledge analysis:

• analysis of a definition system or a definition dictionary;
• examination of the quality of dictionary articles (by the number of connections with other dictionary articles, by the length of the chain);
• examination of extracts in definition dictionaries;
• analysis of text dictionaries;
• examination of help systems.

FREQUENCY-SEMANTIC METHOD

The frequency-semantic method uses two characteristics of word definitions as the criterion for estimating connection strength: similarity of elements and frequency.

Method idea: «...imagine the forces of semantic adhesion as a field that exists everywhere and pervades language, a field that contains bodies: lexical units of language. Different units interact in the same way as atoms, molecules, macroscopic bodies, planets and space objects interact: on one level, i.e. between homogeneous units, as well as between levels.»

Basic data:

• ideographic dictionaries;
• a concise definition dictionary of Russian for foreigners;
• the definition dictionaries of S.I. Ozhegov and D.N. Ushakov.

References

Karaulov Yu.N. Frequency dictionary of semantic multipliers of the Russian language. M.: Nauka, 1980.

Karaulov Yu.N., Molchanov V.I., Afanasiev V.A., Mihalev N.V. Analysis of dictionary metalanguage using a computer. M.: Nauka, 1982. 96 p.

FORMATION OF SEMANTIC FIELDS (1)

If $a_{w_i d_j} \in A_k^{DW}$, then $w_i \in D_j$, where:

$a_{w_i d_j}$: value of the semantic connection strength between word $w_i$ and descriptor $d_j$;

$A_k^{DW}$: set of acceptable values of the semantic connection strength between descriptors and words;

$D_j = \{w_{ij}\}$: set of words of a descriptor;

$w_i$: word, $i = 1 \dots |W|$, $W = \{w_i\}$: set of words;

$d_j$: descriptor, $j = 1 \dots |D|$, $D = \{d_j\}$: set of descriptors.

Practical task: distribute 9000 words among 1600 descriptors.

FORMATION OF SEMANTIC FIELDS (2)

ISSUES OF PRACTICAL TASK SOLUTION

1. Determine the way to compare words.

• Choose the way to obtain (indicate) the semantic multiplier (lemmatization, folding, root indication, word stem indication, quasi stem indication).
• Develop a methodology for obtaining the semantic code of a word.

2. Determine the frequency characteristics of semantic multipliers.

3. Identify the criterion for the semantic connection of words and descriptors.

• Phenomenological model of unit connectivity.
• Phenomenological model of K connectivity.
• Connectivity model that takes the frequency of multipliers into account.

DETERMINE THE WAY TO COMPARE WORDS

Word definition/descriptor: ~10 word forms.

Total number in the experiment: ~110000 word forms.

Semantic multiplier: an elementary unit of the content plane.

Basic assumptions:

a) the semantic space of language is discrete;
b) the set of elements of this space is finite and surveyable;
c) the number of their combinations is practically infinite;
d) the semantic space is elementary, i.e. it consists of indecomposable elements;
e) semantic elements are homogeneous, i.e. they refer to content (they are elements of perception and thinking);
f) semantic elements form a universal set, i.e. they are of a general character, and their number and range are similar for different languages.

WAYS TO OBTAIN (INDICATE) SEMANTIC MULTIPLIER

Lemmatization: obtaining the canonical word form.

Folding: contraction of the word, i.e. deletion of vowels except for the vowel of the first syllable.

Root indication: representation of the word by its root morpheme.

Word stem indication: representation of the word by several morphemes, for example prefix and root.

Quasi stem indication: representation of the word by an arbitrary initial part, based on the fact that the meaning (content) of a word is shifted toward its beginning.
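Two of these techniques, folding and quasi stem indication, are simple enough to sketch in Python. The sketch makes several assumptions that are not in the lecture: it works on Latin-script words (the lecture's data is Russian), it treats the first vowel letter as "the vowel of the first syllable", and it uses a hypothetical quasi stem length of four characters.

```python
VOWELS = set("aeiouy")  # assumption: Latin-script vowels, "y" included


def fold(word):
    """Keep the first vowel of the word and delete every later vowel."""
    result = []
    seen_vowel = False
    for ch in word.lower():
        if ch in VOWELS:
            if seen_vowel:
                continue          # drop vowels after the first one
            seen_vowel = True
        result.append(ch)
    return "".join(result)


def quasi_stem(word, length=4):
    """Represent the word by its first `length` characters (hypothetical length)."""
    return word.lower()[:length]


for w in ("semantic", "thesaurus", "descriptor"):
    print(w, fold(w), quasi_stem(w))
```

Lemmatization, root indication and word stem indication need morphological knowledge of the language and are not attempted here.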

METHOD OF OBTAINING SEMANTIC CODE OF THE WORD

METHOD PROCEDURES

1. Inclusion of the coded word in its own code.
2. Exclusion of repeated semantic multipliers.
3. Filtration (deletion) of:
• «zero» semantic multipliers;
• grammatical words (prepositions, conjunctions, etc.).
4. Lexicalisation of collocations.
5. Formation of quasi word stems.
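A rough Python sketch of procedures 1, 2, 3 and 5. Lexicalisation of collocations (procedure 4) and the detection of «zero» multipliers are left out because the lecture does not say how they are done. The stop-word list, the four-character quasi stem and the function name `semantic_code` are assumptions for the example.

```python
def semantic_code(word, definition, stop_words, multiplier=lambda w: w.lower()[:4]):
    """Build the semantic code of `word` from its definition.

    Step 1: the coded word itself enters the code.
    Step 2: repeated multipliers are dropped.
    Step 3: grammatical words are filtered out.
    Step 5: each remaining word form is reduced to a quasi stem.
    """
    code = []
    for form in [word] + definition.split():
        form = form.lower().strip(".,;:()")
        if not form or form in stop_words:
            continue                      # step 3: filtration
        m = multiplier(form)              # step 5: quasi stem
        if m not in code:                 # step 2: no repetitions
            code.append(m)
    return code


# Hypothetical stop-word list and definition.
STOP_WORDS = {"a", "of", "the", "and", "in", "or"}
code = semantic_code(
    "thesaurus", "A dictionary of words and the relations of words", STOP_WORDS
)
print(code)
```

The code is kept as an ordered list without duplicates, so it can be used both as a set of multipliers and as a sequence to be reordered later by rank.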

RESULTS OF METHOD IMPLEMENTATION

a) descriptors: $d_j = \{s_x^{d_j}\}$;

b) words: $w_i = \{s_x^{w_i}\}$.

DETERMINATION OF FREQUENCY CHARACTERISTICS OF SEMANTIC MULTIPLIERS

Two frequency characteristics are associated with a semantic multiplier $x$:

$f_x^D$: frequency of occurrence of the multiplier in descriptor definitions;

$f_x^W$: frequency of occurrence of the multiplier in word definitions.

Methodology for the frequency analysis of semantic multipliers:

a) computing the frequencies;

b) ranking the multipliers and ordering them within definitions by increasing rank.
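Both steps of this frequency analysis can be sketched in Python, assuming each definition has already been turned into a list of semantic multipliers. The convention that rank 1 belongs to the most frequent multiplier, the alphabetical tie-break and the sample codes are assumptions for the example.

```python
from collections import Counter


def multiplier_frequencies(codes):
    """Count in how many definitions (codes) each semantic multiplier occurs."""
    counts = Counter()
    for code in codes.values():
        counts.update(set(code))
    return counts


def order_by_rank(code, counts):
    """Order the multipliers of one code by increasing rank (rank 1 = most frequent)."""
    ranked = sorted(counts, key=lambda m: (-counts[m], m))
    rank = {m: r for r, m in enumerate(ranked, start=1)}
    unknown = len(rank) + 1               # multipliers never seen go last
    return sorted(code, key=lambda m: rank.get(m, unknown))


# Hypothetical descriptor codes.
descriptor_codes = {
    "book": ["book", "page", "text"],
    "word": ["word", "text", "sign"],
    "sign": ["sign", "code"],
}
f_d = multiplier_frequencies(descriptor_codes)
ordered = order_by_rank(descriptor_codes["word"], f_d)
print(f_d["text"], ordered)
```

The same functions applied to the word definitions would give the second characteristic, the frequency in word definitions.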

CRITERION OF SEMANTIC CONNECTIVITY BETWEEN WORDS AND DESCRIPTORS

Stages of development of the criterion:

1. Phenomenological model of unit connectivity: there is at least one common multiplier in the definitions of the word and the descriptor:

$|d_j \cap w_i| = 1$;

2. Phenomenological model of K connectivity: there are K common semantic multipliers in the definitions of the word and the descriptor:

$|d_j \cap w_i| = K$, i.e. $|\{s_x^{d_j}\} \cap \{s_x^{w_i}\}| = K$;

3. Connectivity model that takes the frequency of multipliers into account (selective criterion of Karaulov):

$K > 2$; $f_x^D > 6$.

SELECTIVE CRITERION OF KARAULOV

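A word $w_i$ and a descriptor $d_j$ are semantically connected if

$$\left|\{s_x^{d_j}\} \cap \{s_x^{w_i}\}\right| = K > 2 \quad \text{or} \quad \left( \left|\{s_x^{d_j}\} \cap \{s_x^{w_i}\}\right| = K \geq 1 \ \text{and} \ f_x^D > 6 \right),$$

where $\{s_x^{d_j}\}$ and $\{s_x^{w_i}\}$ are the sets of semantic multipliers of the descriptor and of the word, and $f_x^D$ is the frequency of the common multiplier $x$ in the set of descriptors.

In words: a word and a descriptor are semantically connected if their definitions have more than two common semantic multipliers, or if their definitions have at least one common semantic multiplier whose frequency in the set of descriptors is more than six.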
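The criterion, exactly as it is worded in the sentence above, can be sketched in Python. The thresholds 2 and 6 come from the slide; the function names, the default parameters and the sample data are hypothetical, and the sketch should not be read as Karaulov's original implementation.

```python
def is_connected(word_code, descriptor_code, descriptor_freq,
                 min_common=2, min_freq=6):
    """Selective criterion as worded on the slide.

    Connected if more than `min_common` multipliers are shared, or if at
    least one shared multiplier has a descriptor frequency above `min_freq`.
    """
    common = set(word_code) & set(descriptor_code)
    if len(common) > min_common:
        return True
    return any(descriptor_freq.get(m, 0) > min_freq for m in common)


def build_field(descriptor_code, word_codes, descriptor_freq):
    """Words whose codes satisfy the criterion for the given descriptor."""
    return sorted(
        w for w, code in word_codes.items()
        if is_connected(code, descriptor_code, descriptor_freq)
    )


# Hypothetical multiplier frequencies in the set of descriptors.
freq = {"book": 7, "text": 3, "sign": 2, "word": 1}
word_codes = {
    "volume": ["book", "page"],
    "essay": ["text", "page"],
    "novel": ["book", "text", "page"],
}
field = build_field(["book", "text"], word_codes, freq)
print(field)
```

Here "essay" is left out of the field because its only common multiplier, "text", has a descriptor frequency of 3.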

Semantic fields construction procedure

1. Construction of the field according to the unit connectivity model.
2. Narrowing of the field by the number of coinciding multipliers.
3. Narrowing of the field taking the frequency of semantic multipliers into account.

If the criterion is satisfied for word $w_i$ and descriptor $d_j$, then $w_i \in D_j$.

QUESTIONS FOR SELF-CHECK

Name the logical-statistical methods of knowledge retrieval from texts.

Describe the distribution-statistical methodology of text analysis.

Describe the frequency-semantic methodology of text analysis.

Describe componential text analysis.
