LIS618 lecture 2

preparing and preprocessing

Thomas Krichel 2011-04-21

today

• text files and non-text files
• characters
• lexical analysis

preliminaries

• Here we discuss some preparatory stages that shape the way the IR system is built.

• This may sound unduly technical.

• But the reality is that often, when querying IR systems
– we don’t know how they have been built
– we can’t find something that is not there
– we have to make guesses and try several things

key aid: index

• An index is a list of features, each with a list of locations where the feature is to be found.
– The feature may be a word.
– The form of the locations depends on where we need to find the feature.
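
A minimal sketch of such an index, assuming the features are lowercased, whitespace-separated words and a location is a (document id, word position) pair. The names `build_index` and `docs` are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word (the feature) to the list of places where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Two toy documents, invented for this example.
docs = {
    "doc1": "the cat sat",
    "doc2": "the dog sat down",
}
index = build_index(docs)
print(index["sat"])
```

Looking up a word is then a single dictionary access, and a word that was never indexed simply has no entry.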

example: book

• In a book, the index links terms to locations, and the locations are page numbers.

• Index terms are selected manually by the author or a specially trained indexer.

• Once indexing terms are chosen, the index can be built by computer.

• Note that any index term may appear on several pages.

example: computer files

• Here the feature can be a word.

• A location can be the location of the file together with the start position of the term within the file.

• The end of the term can be found by taking the start and adding the length of the term.
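
A small sketch of this kind of location, assuming positions are counted in characters from the start of the file content. The function name `term_locations` and the file name are hypothetical.

```python
import re


def term_locations(file_name, text, term):
    """Return (file_name, start, end) for each occurrence of term in text."""
    locations = []
    for match in re.finditer(re.escape(term), text):
        start = match.start()
        end = start + len(term)  # end = start plus the length of the term
        locations.append((file_name, start, end))
    return locations


text = "travel schedule and travel links"
locs = term_locations("example.txt", text, "travel")
print(locs)
```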

working with files

• Usually, when indexing takes place, we study the contents of documents that are stored in files.

• Usually you know something about the files
– by the name of the directory
– by a commonly used part of the file name, e.g. the extension

• JHOVE is a tool that tries to guess file types when you don’t know much about them.

character extraction

• The first thing to do is to extract the characters out of these documents.

• That means translating the file content, a series of 0s and 1s, into a sequence of characters.

• For some file types the textual information is limited.

image files, music files

• Even if you have image files and music files, there usually is some textual data that is stored.

• JPEG files have Exif metadata.

• MP3 files have ID3 metadata.

• These data have a simple attribute: value structure.

example for id3 version, start

• header: "TAG"
• title: 30 bytes of the title
• artist: 30 bytes of the artist name
• album: 30 bytes of the album name
• year: 4 bytes, a four-digit year
• comment: 30 bytes of comment
• … more fields
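
A sketch of reading these fields, assuming the layout is the 128-byte ID3 version 1 tag that sits at the very end of an MP3 file. The format does not declare a text encoding, so the use of latin-1 is an assumption. The function name `parse_id3v1` and the sample values are invented.

```python
def parse_id3v1(tag: bytes) -> dict:
    """Parse a 128-byte ID3v1 tag into a dictionary of text fields."""
    if len(tag) != 128 or tag[:3] != b"TAG":
        raise ValueError("not an ID3v1 tag")

    def text(raw: bytes) -> str:
        # Fields are padded with NUL bytes or spaces; latin-1 is assumed.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
    }


# A made-up tag: 3 + 30 + 30 + 30 + 4 + 30 bytes, plus one final byte.
sample = (
    b"TAG"
    + b"Some Title".ljust(30, b"\x00")
    + b"Some Artist".ljust(30, b"\x00")
    + b"Some Album".ljust(30, b"\x00")
    + b"2011"
    + b"".ljust(30, b"\x00")
    + b"\xff"
)
fields = parse_id3v1(sample)
print(fields)
```

With a real file one would pass the last 128 bytes of the file to the function.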

mixed contents files

• If you look at an MS Word, spreadsheet or PDF file, it contains text surrounded by non-textual code that the software reading it can understand.

• Such code contains descriptions of document structure, as well as formatting instructions.

markup?

• Markup is the non-textual content of a text document.

• Markup is used to provide structure to a document and to prepare for its formatting.

• An easy example of markup is any page written in HTML.

example from a very fine page

<li><a tabindex="400" class="int" href="travel_schedule.html" accesskey="t">travel schedule</a></li>
<li><a tabindex="500" class="int" accesskey="l" href="links.html">favorite links</a>.</li>
</ul>
<div class="hidden"> This text is hidden.</div>

text extracted

• The extracted text is “travel schedule favorite links. This text is hidden.”

• The phrase “This text is hidden” is not normally shown on the page but it is still extracted and indexed by Google.

• Well, I think it is.
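
A sketch of this extraction step using Python's standard `html.parser` module. It keeps every piece of character data, including the hidden text, because it knows nothing about how the page is displayed. The class name `TextExtractor` and the function `extract_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the character data found between the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


fragment = (
    '<li><a tabindex="400" class="int" href="travel_schedule.html" '
    'accesskey="t">travel schedule</a></li> '
    '<li><a tabindex="500" class="int" accesskey="l" '
    'href="links.html">favorite links</a>.</li></ul>'
    '<div class="hidden"> This text is hidden.</div>'
)
extracted = extract_text(fragment)
print(extracted)
```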

tokenization

• Tokenization refers to the splitting of the source to be indexed into tokens that can be used in a query language.

• In the most common case, we think about splitting the text into words.

• Hey that is easy: “travel” “schedule” “favorite” “links” “this” “text” “is” “hidden”

• Spot a problem?
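
One way to see a problem is to split the extracted text naively on whitespace. This is only a sketch: the second variant, which lowercases and keeps runs of word characters, is one possible repair and not necessarily what any particular IR system does.

```python
import re

text = "travel schedule favorite links. This text is hidden."

# Naive approach: split on whitespace only.
naive_tokens = text.split()
print(naive_tokens)  # "links." keeps its period and "This" keeps its capital

# One possible repair: lowercase, then keep runs of word characters.
clean_tokens = re.findall(r"\w+", text.lower())
print(clean_tokens)
```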

text files

• Some files contain only text, meant for the end user.

• Others contain text plus some textual instructions that are meant for processing software. Such instructions are called markup.

• Web pages are typical examples of files with markup.

key aid: index terms

• The index term is a part of the document that has a meaning on its own.

• It is usually a word. Sometimes it is a phrase.

• Let us stick with the idea of the indexing term being a word.

the representation of text

• In principle, every file is just a sequence of on/off electrical signals.

• How can these signals be translated into text?

• Well, text is a sequence of characters and every character needs to be recognized.

• Character recognition leads to text recognition.

bits and bytes

• A bit is a unit of digital information. It can take the value 0 or 1.

• A byte (a respelling of “bite” to avoid confusion with “bit”) is a unit of storage. In principle its size is hardware dependent, but it is de facto 8 bits.

characters and computers

• Computers cannot deal with characters directly. They can only deal with numbers.

• Therefore we need to associate a number with every character that we want to use in an information encoding system.

• A character set associates characters with numbers.

characters

• When we are working with characters, two concepts appear.
– The character set.
– The character encoding.

• To properly recognize digital data, we need to know both.

The character set

• The character set is an association of characters with numbers. Example
– ‘ ’ (space) 32
– bell 7
– ‘→’ 8594

• Without such a concordance the processing of text by computer would be impossible.
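
These three associations can be checked with Python's built-in `ord` and `chr`, which use Unicode numbers. The variable name `pairs` is just for illustration.

```python
# Space, the bell control character, and the rightwards arrow.
characters = [" ", "\a", "\u2192"]

# ord() gives the number of a character, chr() goes the other way.
pairs = [(ch, ord(ch)) for ch in characters]
for ch, number in pairs:
    print(repr(ch), number)

arrow = chr(8594)
```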

the character encoding

• The character encoding is the way the character number is written out as a sequence of bytes.

• There are several ways to do this. Going into details here would be a bit too technical.

• Let us look at some examples.

ascii

• The most famous, but old, character set is the American Standard Code for Information Interchange (ASCII).

• It contains 128 characters.

• In its byte representation, every byte starts with a 0 bit, followed by the seven bits that make up the character number.

iso-8859-1 (iso-latin-1)

• This is an extension of ASCII with additional characters that are suitable for the Western European languages.

• There are 256 code positions.

• Every character occupies a byte.

Windows 1252

• This is an extension of iso-latin-1.

• It is a legacy character set used on the Microsoft Windows operating system.

• Some code positions that iso-latin-1 reserves for control characters (hex 80 to 9F) have been filled with printable characters that Microsoft liked.
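
A short sketch of the difference, using the codec names `latin-1` and `cp1252` from Python's standard library. The three sample bytes are chosen for illustration: E9 means the same in both sets, while 80 and 93 do not.

```python
raw = bytes([0xE9, 0x80, 0x93])

as_latin1 = raw.decode("latin-1")  # 0x80 and 0x93 become control characters
as_cp1252 = raw.decode("cp1252")   # 0x80 is the euro sign, 0x93 a curly quote

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1252])
```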

UCS / Unicode

• UCS is a universal character set.

• It is maintained by the International Organization for Standardization (ISO).

• Unicode is an industry standard for characters. It is better documented than UCS.

• For what we discuss here, UCS and Unicode are the same.

Basic multilingual plane

• This is a name for the first 65536 code points in Unicode.

• Each of these characters fits into two bytes and is conveniently represented by four hex digits.

• Even for these characters, there are numerous complications.

ascii and unicode

• The first 128 characters of UCS/Unicode are the same as the ones used by ASCII.

• So you can think of UCS/Unicode as an extension of ASCII.

in foreign languages

• Everything becomes difficult.

• As an example consider the characters
– o
– ő
– ö

• The latter two can be considered o with diacritics or as separate characters.
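
Unicode allows both views. A sketch with the standard `unicodedata` module: normalization form NFD splits a character into a base letter and combining marks, and NFC recombines them.

```python
import unicodedata

# o, o with double acute (U+0151), o with diaeresis (U+00F6)
letters = ["o", "\u0151", "\u00f6"]

for letter in letters:
    decomposed = unicodedata.normalize("NFD", letter)
    print(letter, [f"U+{ord(c):04X}" for c in decomposed])

# Recomposing gives back the single precomposed character.
recomposed = unicodedata.normalize("NFC", "o\u0308")
```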

most problematic: encoding

• One issue is how to map characters to numbers.

• This is complicated for languages other than English.

• But assume UCS/Unicode has solved this.

• The mapping is not the main problem that we have when working with non-ASCII character data.

encoding

• The encoding determines how the number of each character should be put into bytes.

• If you have a character set that has one byte for each character, you have no encoding issue.

• But then you are limited to 256 characters in your character set.

fixed-length encoding

• If you have a fixed-length encoding, all characters take the same number of bytes.

• Say, for the basic multilingual plane of Unicode, you need two bytes for each character, and then you are limited to that plane.

• If you are writing only ASCII, this appears wasteful.

variable length encoding

• The most widely used scheme to encode Unicode is a variable-length scheme called UTF-8.

• It is important to understand that the encoding needs to be known and correct.
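
A sketch of what goes wrong when the encoding is guessed incorrectly. The place name is taken from a later slide; the variable names are made up.

```python
original = "V\u00f6lklingen"  # Völklingen

data = original.encode("utf-8")

right = data.decode("utf-8")    # the correct encoding gives the text back
wrong = data.decode("latin-1")  # the wrong one yields two junk characters for ö
print(right, wrong)

# The other direction fails outright: latin-1 bytes are not valid UTF-8 here.
try:
    original.encode("latin-1").decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```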

basics of UTF-8

• Every ASCII character, represented as a byte, starts with a zero bit.

• Characters that are beyond ASCII require two, three or four bytes to be complete.

• The first byte will tell you how many bytes are coming to make the character complete.

byte shapes in UTF-8 encoding

• 0??????? ASCII
• 110????? first byte of two-byte character
• 1110???? first byte of three-byte character
• 11110??? first byte of four-byte character
• 10?????? byte that is not the first byte
• As you can see, there are sequences of bytes that are not valid.

hex range to UTF-8

• 0000 to 007F: 0???????
• 0080 to 07FF: 110????? 10??????
• 0800 to FFFF: 1110???? 10?????? 10??????
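
The three rows can be turned into a small encoder. This is a sketch for illustration only: it covers just the ranges listed here, does not reject surrogate code points, and in practice one would call `str.encode("utf-8")`. The function name `utf8_encode` is made up.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point up to FFFF following the three bit patterns."""
    if code_point <= 0x7F:
        # 0???????
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110????? 10??????
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b00111111),
        ])
    if code_point <= 0xFFFF:
        # 1110???? 10?????? 10??????
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b00111111),
            0b10000000 | (code_point & 0b00111111),
        ])
    raise ValueError("code point is outside the ranges covered here")


for ch in ("A", "\u00f6", "\u2192"):
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```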

in the simplest case

• In the simplest case, let us assume that we index a bunch of text files as full text.

• These documents contain text written in some sort of a human language.

• They may also contain some markup, but we assume we can strip the markup.

lexical analysis

• Lexical analysis is the process of converting a sequence of characters into a sequence of words.

• These candidate words can then be used as index terms (or what I have called features).

text direction

• Text is not necessarily linear.

• When right-to-left writing is used, you sometimes have to jump in the opposite direction when a left-to-right expression appears. Example:

• حققت الجزائر استقلالها عام 1962 بعد 132 سنة من الاحتلال الفرنسي. (“Algeria achieved its independence in 1962 after 132 years of French occupation.”)

spaces

• Spaces are an important delimiter between words.

• The blank, the newline, the carriage return and the tab characters are collectively known as whitespace.

• We can collapse whitespace and get the words.

• But there are special cases.
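
In Python, collapsing whitespace is what `str.split()` does when called without arguments. A tiny sketch with an invented sample string:

```python
# Blanks, a tab, a carriage return and a newline between the words.
raw = "travel   schedule\tfavorite\r\nlinks"

# split() without arguments treats any run of whitespace as one delimiter.
words = raw.split()
print(words)
```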

text without spaces: Japanese

• Japanese uses 3 scripts: kanji, hiragana and katakana.

• All foreign words are written with katakana. Any continuous katakana sequence is a word.

• A Japanese word has at most 4 kanji/hiragana. A dictionary lookup has to work backward from 4 characters to 1 to see whether a word exists.
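
A rough sketch of the two rules, under strong assumptions: the dictionary is a tiny hand-made set, katakana is detected by the Unicode block U+30A0 to U+30FF, and an unknown character is emitted as a token on its own. Real Japanese segmenters are far more elaborate. The names `segment` and `toy_dictionary` are hypothetical.

```python
def is_katakana(ch):
    return "\u30a0" <= ch <= "\u30ff"


def segment(text, dictionary, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        if is_katakana(text[i]):
            # Rule 1: a continuous katakana sequence is one word.
            j = i
            while j < len(text) and is_katakana(text[j]):
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Rule 2: try the longest candidate first, from 4 characters down to 1.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


toy_dictionary = {"を", "買う", "東京", "大学", "に", "行く"}
print(segment("コンピュータを買う", toy_dictionary))
```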

punctuation

• Normally lexical analysis discards punctuation. But some punctuation can be part of a word, say “333B.C.”

• What we have to understand here is that the same removal of punctuation probably occurs at query time.

• The user searches for “333B.C.”, the query is translated to “333BC” and the occurrence is found.
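
The point is that one and the same function is applied to the indexed text and to the query. A sketch, assuming punctuation means the ASCII punctuation characters in Python's `string.punctuation`; the function name `normalize` is made up.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(token):
    return token.translate(_DROP_PUNCTUATION)


indexed_term = normalize("333B.C.")  # applied when the document is indexed
query_term = normalize("333B.C")     # applied again when the user searches
print(indexed_term, query_term)
```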

hyphen

• The hyphen is a particularly problematic punctuation mark.

• Breaking at hyphens, as in “state-of-the-art”, is good.

• With end-of-line hyphenation, a hyphen at a line end may not mark the end of a word.

• Some hyphenation is part of a word, e.g. UTF-8. I should use the non-breaking hyphen.

apostrophe

• Example:
– Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

• O Neill … aren t

• The first seems ok, but the second looks silly.
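
A sketch that reproduces this outcome by keeping only runs of ASCII letters. The example sentence is typed with plain ASCII apostrophes here, which is an assumption about the input.

```python
import re

sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")

# Keep runs of letters; apostrophes and periods act as delimiters.
tokens = re.findall(r"[A-Za-z]+", sentence)
print(tokens)
```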

stop words

• Some words are considered to carry no meaning because they are too frequent.

• Sometimes they are completely removed and not included in the index.

• Then you can’t search for the phrase “to be or not to be”.

• Stop words are very language dependent.
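
A sketch with a toy stop list, invented for this example; real lists are longer and differ by language and by system.

```python
# A made-up English stop list, far shorter than any real one.
STOP_WORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


kept = remove_stop_words("to be or not to be".split())
print(kept)  # nothing is left to search for

kept_other = remove_stop_words("the representation of text".split())
print(kept_other)
```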

stemming

• Frequently, text contains grammatical forms that a user may not look for. In English, e.g.
– plurals
– gerund forms
– past tense suffixes

• The most famous algorithm for English stemming is Martin Porter’s stemming algorithm.

• Evidence on usefulness is mixed.

grammatical variants

• These are much more important in other languages. E.g., in Russian
– “My name is Irina.” “Меня зовут Ирина.”
– “I love Irina.” “Я люблю Ирину.”

• Text in a query in Russian must be matched against a variety of grammatical forms, some of which are very tricky to derive.

casing

• Normalization usually means that we treat uppercase and lowercase the same, meaning that the uppercase and lowercase variants of a token become the same term.

• However, in English case can carry meaning:
– “US” vs “us”
– “IT” vs “it”
– …
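
A sketch of case normalization with Python's `str.casefold`, showing how these pairs collapse into the same term:

```python
tokens = ["US", "us", "IT", "it"]

# After case folding, the country and the pronoun are the same term.
folded = [token.casefold() for token in tokens]
print(folded)
```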

treatment of diacritics

• Unicode has rules for how diacritics can be stripped.

• These involve decomposing a character with a diacritic into a base character and a combining mark.

• This is helpful for people who find the input of diacritics tough.

• But there are lots of other tough choices.
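
One common way to do this, sketched with the standard `unicodedata` module: decompose with normalization form NFD, then drop the combining marks. The function name `strip_diacritics` is made up, and this is only one possible procedure.

```python
import unicodedata


def strip_diacritics(text):
    # Split each character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_diacritics("V\u00f6lklingen"))  # Völklingen
print(strip_diacritics("Stra\u00dfe"))      # ß has no decomposition, so it stays
```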

historic equivalents

• Many languages that use diacritics or ligatures have developed spelling variants without them.
– Jaspersstraße → Jaspersstrasse
– Völklingen → Voelklingen

• A mechanical de-diacritic operation would yield “Volklingen”, which few people would choose.

• Such operations are very language specific.
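
A sketch of a language-specific replacement for German, using a hand-made mapping table. The table and the function name `german_ascii` are assumptions for illustration and would be wrong for other languages.

```python
# German convention: umlauts become vowel + e, sharp s becomes ss.
GERMAN_VARIANTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def german_ascii(text):
    return text.translate(GERMAN_VARIANTS)


print(german_ascii("Völklingen"))
print(german_ascii("Jaspersstraße"))
```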

http://openlib.org/home/krichel

Please shut down the computers when you are done.

Thank you for your attention!