labeling turkish news stories with crf

LABELING TURKISH NEWS STORIES WITH CRF

Prof. Dr. Eşref Adalı

ISTANBUL TECHNICAL UNIVERSITY
COMPUTER ENGINEERING

PURPOSE of STUDY

• As the internet grows dramatically, the number of electronic text documents increases considerably.

• As the number of documents increases, information extraction grows in importance.

• This study introduces an approach to information extraction that extracts the main subject, main predicate, main location and main date of a text document and labels the document with them for use in semantic web applications.

PURPOSE of STUDY

LABEL       MEANING
SUBJECT     The most important person, place, thing, or idea in the document
PREDICATE   The action or state of being of the main subject
LOCATION    Location of the main predicate
DATE        Date of the main predicate

PURPOSE of STUDY

• The most pronounced difference between key phrase extraction studies and this labeling study is that the labeling study extracts the most significant phrases together with their functions in the document.

• The extracted labels give the reader an idea of the main topic of the document at a glance.

A SAMPLE LABELED DOCUMENT


SCOPE of STUDY

• The documents inspected are written in Turkish.
• The documents are gathered from news distributors.
• Each document contains 50-300 words.

LABELING by ANNOTATORS

• The data set is composed of 1000 raw news stories gathered from the RSS feeds of Turkish news distributors and then labeled by annotators.

• The manually labeled documents are used in the training and test phases of the CRF model.

Manual Labeling Process

• Capturing RSS feeds from news distributors
• Arranging the captured news in XML format
• Reading of the news by human annotators and manual labeling
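
The first two steps of this process can be sketched with Python's standard library. This is only an illustration: the RSS snippet is made up, and the element names `news`, `story` and `text` and the `id` attribute are hypothetical, since the slides do not describe the XML layout that was actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a captured feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Headline one</title><description>Body of story one.</description></item>
  <item><title>Headline two</title><description>Body of story two.</description></item>
</channel></rss>"""


def rss_to_story_xml(rss_text):
    """Collect the <item> entries of an RSS feed into one <news> document."""
    channel = ET.fromstring(rss_text).find("channel")
    news = ET.Element("news")  # assumed root element name
    for number, item in enumerate(channel.findall("item"), start=1):
        # One <story> per RSS item, numbered in feed order.
        story = ET.SubElement(news, "story", id=str(number))
        ET.SubElement(story, "title").text = item.findtext("title", default="")
        ET.SubElement(story, "text").text = item.findtext("description", default="")
    return news


news = rss_to_story_xml(SAMPLE_RSS)
print(ET.tostring(news, encoding="unicode"))
```

A file in this shape gives each story a stable identifier that annotators can attach their labels to.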

FIRST STEP of STUDY

• Because Turkish is an agglutinative language, the input file is converted into a file that includes the stems, inflectional suffixes and parser results of the raw news stories.

Morphological analyzer → Morphological disambiguator → Dependency parser

Morphological Analyzer

• Each word in a raw news story is morphologically analyzed using Oflazer’s morphological analyzer. The output of the morphological analyzer presents one or more possible analyses for each word.

MORPHOLOGICAL DISAMBIGUATOR

• The most probable analysis must be selected from the output of the morphological analyzer. The morphological disambiguator developed by Sak et al. is used for this purpose. At this point, the roots or stems are obtained.
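
As a rough illustration of what "selecting one analysis and obtaining the stem" means, the sketch below assumes analyses are written in the common `stem+Tag+Tag` notation. The two candidate strings for the ambiguous word "evin" are illustrative, the scores are made up, and `select_analysis` is only a stand-in for a real disambiguator; none of this is the actual output or interface of the tools named above.

```python
def split_analysis(analysis):
    """Split 'stem+Tag+Tag...' into the stem and its list of tags."""
    stem, *tags = analysis.split("+")
    return stem, tags


def select_analysis(candidates, scores):
    """Pick the highest-scoring candidate (placeholder for a real disambiguator)."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Two illustrative analyses of "evin": "your house" vs. "of the house".
candidates = ["ev+Noun+A3sg+P2sg+Nom", "ev+Noun+A3sg+Pnon+Gen"]
chosen = select_analysis(candidates, scores=[0.3, 0.7])  # made-up scores
stem, tags = split_analysis(chosen)
print(stem, tags)
```

The sketch ignores derivation boundaries, which a real pipeline would need to handle separately.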

DEPENDENCY PARSER

• The dependency parser assigns each word in a sentence its syntactic attribute. A multilingual dependency parser is used for this purpose.

CONSTRUCTING THE MODEL

• At first, we developed a rule-based model using the features provided by the morphological analyzer, disambiguator and dependency parser.

• Because its success rates were not high enough for practical use, we developed a new model with machine learning techniques.

• In our case, a label may consist of one word but generally consists of more than one word. We can therefore treat our problem as a sequence classification problem.

CONSTRUCTING THE MODEL

• Each word in the document belongs to one of five classes: subject, predicate, location, date or none of them.
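
One way to picture this per-word classification is to turn labeled phrases into one class per token. The sketch below is a minimal illustration: the class names, the `(start, end, label)` span format and the annotation of the sample sentence are assumptions made for the example, not the annotation scheme of the study.

```python
CLASSES = ("SUBJECT", "PREDICATE", "LOCATION", "DATE", "NONE")


def tokens_to_classes(tokens, labeled_spans):
    """Assign one class per token from (start, end, label) spans; end is exclusive."""
    classes = ["NONE"] * len(tokens)
    for start, end, label in labeled_spans:
        if label not in CLASSES:
            raise ValueError(f"unknown label: {label}")
        for i in range(start, end):
            classes[i] = label
    return classes


tokens = ["Mustafa", "Kemal", "Atatürk", "Ankara’ya", "gitti", "."]
# Assumed annotation of the sentence, for illustration only.
spans = [(0, 3, "SUBJECT"), (3, 4, "LOCATION"), (4, 5, "PREDICATE")]
classes = tokens_to_classes(tokens, spans)
for token, cls in zip(tokens, classes):
    print(token, cls)
```

A multi-word label such as the subject here simply becomes a run of consecutive tokens with the same class, which is what makes the task a sequence problem.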


Rule based features

• Because the experimental set of this study consists of news stories, the main subject of a text is expected to be a proper noun phrase.

• This assumption was reached after inspecting all manually annotated subject labels.

• To obtain proper noun phrases in Turkish, all words that start with a capital letter are gathered first. However, capitalization alone is not a reliable indicator, because other words may also start with a capital letter, such as the first word of a sentence, titles, and month or day names in dates.

Rule based features

• Rule 1: If a word is the first word of a sentence and it is a proper name, it is a possible candidate for a proper noun phrase.

• Rule 2: If a word starts with a capital letter and is not the first word of a sentence, select it as a possible candidate for a proper noun phrase.

• Rule 3: If a conjunction lies between two possible candidates for proper noun phrases, select this word as well.

These rules alone are not enough to divide the selected words into separate proper noun phrases. For instance, in the sample Turkish sentence “Mustafa Kemal Atatürk Ankara’ya gitti.”, “Mustafa Kemal Atatürk” and “Ankara” are two different proper noun phrases. However, the rules above select “Mustafa Kemal Atatürk Ankara’ya” as a single proper noun phrase. New boundary rules are therefore defined.

Rule based features

• Boundary Rule 1: If a possible candidate for a proper noun phrase ends with a punctuation mark such as a quotation mark or comma, this word is the last word of the proper noun phrase.

• Boundary Rule 2: If a possible candidate has the suffix “P3sg”, this word is the last word of the proper noun phrase.
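
The candidate rules and boundary rules can be combined roughly as follows. This is a minimal sketch under stated assumptions: the `Token` fields are hypothetical flags that would be filled in from the morphological and syntactic steps, tokens are split on whitespace so punctuation stays attached, and the two sample sentences are constructed for the example (the first adds a comma to the sentence quoted earlier so that Boundary Rule 1 applies).

```python
from dataclasses import dataclass

PUNCTUATION = "\"'“”‘’,.;:!?"


@dataclass
class Token:
    text: str
    sentence_first: bool = False
    proper: bool = False       # marked as a proper name by the morphological step
    conjunction: bool = False
    p3sg: bool = False         # analysis carries the P3sg tag


def is_candidate(tok):
    if tok.sentence_first:
        return tok.proper                  # Rule 1
    return tok.text[:1].isupper()          # Rule 2


def is_boundary(tok):
    # Boundary Rule 1 (trailing punctuation) or Boundary Rule 2 (P3sg suffix).
    return tok.text.endswith(tuple(PUNCTUATION)) or tok.p3sg


def proper_noun_phrases(tokens):
    phrases, current = [], []

    def flush():
        if current:
            phrases.append(" ".join(current).rstrip(PUNCTUATION))
            current.clear()

    for i, tok in enumerate(tokens):
        # Rule 3: keep a conjunction only when it sits between two candidates.
        joins = (tok.conjunction and len(current) > 0
                 and i + 1 < len(tokens) and is_candidate(tokens[i + 1]))
        if is_candidate(tok) or joins:
            current.append(tok.text)
            if is_boundary(tok):
                flush()
        else:
            flush()
    flush()
    return phrases


comma_sentence = [Token("Mustafa", sentence_first=True, proper=True), Token("Kemal"),
                  Token("Atatürk,"), Token("Ankara’ya"), Token("gitti.")]
p3sg_sentence = [Token("Türkiye", sentence_first=True, proper=True), Token("Büyük"),
                 Token("Millet"), Token("Meclisi", p3sg=True),
                 Token("Ankara’da"), Token("toplandı.")]
phrases_comma = proper_noun_phrases(comma_sentence)
phrases_p3sg = proper_noun_phrases(p3sg_sentence)
print(phrases_comma)
print(phrases_p3sg)
```

Note that without the comma, neither boundary rule fires between "Atatürk" and "Ankara’ya", so the two stated boundary rules do not by themselves separate the phrases in the original sample sentence.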


Other Features

• Morphological Features: Outputs of the morphological disambiguator.
• Syntactic Features: Output of the dependency parser.
• Structural Features:
  • The sequence number of the document in the data set, which identifies the document a word belongs to.
  • The sequence number of the sentence in the document, which distinguishes sentences.
  • The term frequency of the word in the document.
  • The sequence number of the sentence in which the word is first observed in the document.
  • Whether the first letter of the word is a capital letter.
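
The structural features listed above are simple to compute per word. The following is a minimal sketch, assuming term frequency is counted over surface forms (the slides do not say whether surface forms or stems are counted); the feature names and the toy document are hypothetical.

```python
from collections import Counter


def structural_features(doc_id, sentences):
    """Return one feature dict per word; `sentences` is a list of token lists."""
    term_freq = Counter(w for sent in sentences for w in sent)
    first_seen = {}
    for sent_no, sent in enumerate(sentences, start=1):
        for w in sent:
            first_seen.setdefault(w, sent_no)  # keep the earliest sentence only
    rows = []
    for sent_no, sent in enumerate(sentences, start=1):
        for w in sent:
            rows.append({
                "word": w,
                "doc_id": doc_id,
                "sentence_no": sent_no,
                "term_freq": term_freq[w],
                "first_sentence_no": first_seen[w],
                "capitalized": w[:1].isupper(),
            })
    return rows


# Toy two-sentence document, for illustration only.
doc = [["Atatürk", "Ankara’ya", "gitti"], ["Atatürk", "konuştu"]]
rows = structural_features(doc_id=1, sentences=doc)
for row in rows:
    print(row)
```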

TRAINING CRF SYSTEM

The manually annotated documents are used together with the features of each word. 950 news stories are used as the training set, and the CRF model is generated as shown in Figure 1.

TESTING CRF SYSTEM

To measure the success of the system, the remaining 50 manually annotated documents are evaluated with the generated CRF model.
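
The 950/50 division and the preparation of input for a CRF toolkit might look like the sketch below. Several things here are assumptions: the slides do not say whether the split was random, nor which CRF toolkit or file format was used. The column layout (one token per line, label last) is the kind of format that toolkits such as CRF++ read, and the integer document placeholders are only there to make the example runnable.

```python
import random


def split_documents(documents, n_train=950, seed=0):
    """Shuffle a copy of the documents and split it into training and test sets."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # assumed random split, fixed seed
    return shuffled[:n_train], shuffled[n_train:]


def to_columns(document):
    """Render [(word, features, label), ...] as one 'word feat... label' line per token."""
    return "\n".join(" ".join([word, *features, label])
                     for word, features, label in document)


documents = list(range(1000))  # placeholders for 1000 annotated stories
train, test = split_documents(documents)
print(len(train), len(test))
print(to_columns([("Ankara’ya", ["Noun", "CAP"], "LOCATION"),
                  ("gitti", ["Verb", "LOW"], "PREDICATE")]))
```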

EVALUATION

• In this study, the main concerns are precision and recall: how many of the suggested keywords are correct (precision), and how many of the manually assigned labels are found (recall).
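
Both measures can be computed by comparing the set of suggested labels with the set of manually assigned ones. The sketch below assumes exact matching of `(document_id, label, phrase)` triples, which may be stricter than the matching used in the study; the toy triples are invented for the example and are not results.

```python
def precision_recall(predicted, gold):
    """Compare sets of (document_id, label, phrase) triples by exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: one of two suggestions is correct, one of three gold labels is found.
gold = {(1, "SUBJECT", "Mustafa Kemal Atatürk"),
        (1, "LOCATION", "Ankara"),
        (1, "PREDICATE", "gitti")}
predicted = {(1, "SUBJECT", "Mustafa Kemal Atatürk"),
             (1, "LOCATION", "Ankara’ya")}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```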


EVALUATION


Conclusion

Factors affecting the success rate:

• Human annotators are not 100% reliable; humans make mistakes.
• Spell checking is needed, because spelling errors also affect the results of the morphological analyzer.
• Errors of the morphological analyzer
• Errors of the morphological disambiguator
• Errors of the dependency parser
• Size and scope of the training set

THANK YOU
