lis6 18 lecture 2 preparing and preprocessing

LIS618 lecture 2

preparing and preprocessing

Thomas Krichel2011-04-21

today

• text files and non-text files• characters• lexical analysis

preliminaries

• Here we discuss some preparatory stages that has the way the IR system is built.

• This may sound unduly technical.• But the reality is that often, when querying IR

systems– we don’t know how they have been built– we can’t find something that is not there– we have to make guesses and try several things

key aid: index

• An index is a list of features, with a list of locations where the feature is to be found.– The feature may be a word– The locations depend on where we need to find

the feature at.

example: book

• In a book the index links terms and locations are page numbers.

• Index terms are selected manually by the author or a specially trained indexer.

• Once indexing terms are chosen the index can be built by computer.

• Note that any index terms may be on several page.

example: computer files

• Here the feature can be a word.• An a location can be the location of the file

with the location of the start term within the file.

• The location of the term can be found by looking at the start and add the length of the term.

working with files

• Usually, when indexing takes place, we study the contents of documents that are stored in files.

• Usually you know something about the files– by the name of the directory– by commonly used file name part, e.g. extension

• JHOVE is a tool that tries to guess file types when you don’t know much about them.

character extraction

• The first thing to do is to extract the characters out of these documents.

• That means translating the file content, a series of 0s and 1s into a sequence of characters.

• For some types the textual information is limited.

image file, music files

• Even if you have image files and music files, there usually is some textual data that is store.

• jpeg files have exif metadata.• mp3 file have id3 metadata.• These data has simple attribute: value data.

example for id3 version, start

• header: "TAG" • title: 30 bytes of the title • artist: 30 bytes of the artist name • album: 30 bytes of the album name • year: 4 A four-digit year • comment: 30 bytes of comment• … more fields

mixed contents files

• If you look at MS word, spreadsheet or PDF file, they contain text surrounded by not textual code that the software reading it can understand.

• Such code contains descriptions of document structure, as well as formatting instructions.

markup?

• Markup is the non-textual contents of a text documents.

• Markup is used to provide structure to a document and to prepare for its formatting.

• An easy example for markup is any page written in HTML

example from a very fine page

<li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel

schedule</a></li> <li><a tabindex="500" class="int" accesskey="l"

href="links.html">favorite links</a>.</li></ul><div class="hidden"> This text is hidden.</div>

text extracted

• The extracted text is “travel schedule favorite links. This text is hidden.”

• The phrase “This text is hidden” is not normally shown on the page but it is still extracted and indexed by Google.

• Well, I think it is.

tokenization

• Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language.

• In the most common case, we think about splitting the text into words.

• Hey that is easy: “travel” “schedule” “favorite” “links” “this” “text” “is” “hidden”

• Spot a problem?

text files

• Some files can either contain text, meant for the end user.

• Some others contain text, plus some textual instruction that are means for processing software. Such instructions are called markup.

• Typical examples for markup are web pages.

key aid: index terms• The index term is a part of the document that

has a meaning on its own.• It is usually a word. Sometimes it is a phrase.• Let us stick with the idea of the indexing term

being a word.

the representation of text

• In principle, every file is just a sequence of on/off electrical signals.

• How can these signals translated into text?• Well text is a sequence of characters and

every character needs to be recognized.• Character recognition leads to text

recognition.

bits and bytes

• A bit is a unit of digital information. It can take the values 0 and 1.

• A byte (respelling of “bite” to avoid confusion) is a unit of storage. It should be hardware dependent but it is de facto 8 bits.

characters and computer• Computers can not deal with characters

directly. They can only deal with numbers.• There we need to associate a number with

every character that we want to use in an information encoding system. • A character set combines characters with

number.

characters

• When we are working with characters, two concepts appear.– The character set.– The character encoding.

• To properly recognize digital data, we need to know both.

The character set

• The character set is an association of characters with numbers. Example– ‘ ‘ 32– bell 7– ‘’ 8594

• Without such a concordance the processing of text by computer would be impossible.

the character encoding

• This regards the way the character number is written out as a sequence of bytes.

• There are several ways to do this. Going into details here would be a bit too technical.

• Let us look at some examples.

ascii

• The most famous, but old character set is the American Standard for Information Interchange ASCII.

• It contains 128 characters.• Its representation in bytes has every byte start

with a 0 and then the seven bits that the character number makes up.

iso-8859-1 (iso-latin-1)

• This is an extension of ASCII that has additional characters that are suitable for the Western European languages.

• There are 256 characters.• Every character occupies a byte.

Windows 1252

• This is an extension of iso-latin-1.• It is a legacy character set used on the

Microsoft Windows operating system. • Some code positions of iso-latin-1 have been

filled with characters that Microsoft liked.

UCS / Unicode

• UCS is a universal character set. • It is maintained by the International Standards

Organization.• Unicode is an industry standard for characters.

It is better documented than UCS.• For what we discuss here, UCS and Unicode

are the same.

Basic multilingual plane

• This is a name for the first 65536 characters in Unicode.• Each of these characters fits into two bytes

and is conveniently represented by four hex numbers.• Even for these characters, there are numerous

complications associated with them.

ascii and unicode

• The first 128 characters of UCS/Unicode are the same as the ones used by ASCII.

• So you can think of UCS/Unicode as an extension of ASCII.

in foreign languages

• Everything becomes difficult. • As an example consider the characters

– o– ő– ö

• The latter two can be considered o with diarcitics or as separate characters.

most problematic: encoding

• One issue is how to map characters to numbers.

• This is complicated for languages other than English.

• But assume UCS/Unicode has solved this.• But this is not the main problem that we

have when working with non-ASCII character data.

encoding

• The encoding determines how the numbers of each character should be put into bytes.

• If you have a character set that is has one byte for each character, you have no encoding issue.

• But then you are limited to 256 characters in your character set.

fixed-length encoding

• If you have a fixed length encoding, all characters take the same number of bytes.

• Say for the basic-multilingual plane of unicode, you need two bytes for each character, and then you are limited to that.

• If you are writing only ASCII, it appears a waste.

variable length encoding

• The most widely used scheme to encode Unicode is a variable length scheme, called UTF-8.

• It is important to understand that the encoding needs to known and correct.

bascis of UTF-8

• Every ASCII character, represented as a byte, starts with a zero.

• Characters that are beyond ASCII require two or three bytes to be complete.

• The first byte will tell you how many bytes are coming to make the character complete.

byte shapes in UTF-8 encoding

• 0?????? ASCII• 110???? first octet of two-byte character • 1110???? first byte of three-byte character • 11110??? first octet of four-byte character• 10??????? byte that is not the first byte• as you can see, there are sequences of bytes

that are not valid

hex range to UTF-8

• 0000 to 007F 0???????• 0080 to 07FF 110????? 10??????• 0800 to FFFF 1110???? 10?????? 10??????

in the simplest case

• In the simplest case, let us assume that we index a bunch of text files as full text.

• These documents contain text written in some sort of a human language.

• They may also contain some markup, but we assume we can strip the markup.

lexical analysis

• Lexical analysis is the process of preparing a sequence or characters into a sequence of words.

• These candidate words can then be used as index term (or what I have called features).

text direction

• Text is not necessarily linear. • When right to left writing is used, you

sometimes have to jump in the opposite direction when a left to right expression appears. Example:

• عام استقاللها الجزائر 132بعد 1962حققتالفرنسي االحتالل من .سنة

spaces

• One important delimiter between words are spaces.

• The blank, the newline, the carriage return and the tab characters are collectively known as whitespace

• We can collapse whitespace and get the words.

• But there are special cases.

text without spaces: Japanese

• Japanese uses 3 scripts, kanji, hiragana and katakana.

• All foreign words are written with katakana. Any conitunous katakana sequence is a word.

• A Japanese word has at most 4 kanji/hiragana. A dictionary has to find backward from 4 to 1 if a word exists.

punctuation

• Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say “333B.C.”

• What we have to understand here is that probably at query time, the same removal of punctuation occurs.

• The user searches for “333B.C”, the query is translated to 333BC and the occurrence is found.

hyphen

• It’s a particularly problematic punctuation mark.

• Breaking them, like, for example, in “state-of-the-art” is good.

• End of line hyphenation breaks at a line end may not be ends of words.

• Some hyphenation is part of a word, e.g. UTF-8. I should use the non-breaking hyphen.

apostrophe

• Example:– Mr. O’Neill thinks that the boys’ stories about

Chile’s capital aren’t amusing.• O Neil … aren t• The first seem ok, but second looks silly.

stop words

• Some words are considered to carry no meaning because they are too frequent.

• Sometimes, they are completely removed, and not removed in the index.

• Then you can’t search for the phrase “to be or not to be”.

• Stop word are very language dependent.

stemming

• Frequently, text contains grammatical forms that a user may not look at. In English, e.g.– plurals– gerund form– past tense suffixes

• The most famous algorithm for English stemming is Martin Porter’s stemming algorithm.

• Evidence on usefulness is mixed.

grammatical variants

• These are much more important in other languages. E.g., in Russian – “My name is Irina.” “Меня зовут Ирина.” – “I love Irina.” “Я люблю Ирину.”

• Text in a query in Russian must be matched against a variety of grammatical forms, so of which are very tricky to derive.

casing

• Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term.

• However in English– “US” vs “us”– “IT” vs “it”…

treatment of diacritics

• There are rules in Unicode how diacritics can be stripped.

• These involve decomposing diacritic• This is helpful for people who find the input of

diacritics tough.• But there are lots of other tough choices.

historic equivalents

• Many languages that use diacritics or ligatures have developed spelling variants without them. – Jaspersstraße Jasperstrasse– Völklingen Voelklingen

• A mechanical de-diacritic operation would yield “Volklingen” which few people would chose.

• Such operations are very language specific.

http://openlib.org/home/krichel

Please shutdown the computers whenyou are done.

Thank you for your attention!

lis6 18 lecture 2 preparing and preprocessing

Documents

text documents

text extractedthe

text filessome

used file

file types

index terms

pdf file

file content