Collocations and Information Management Applications

Gregor Erbach

Saarland University

Saarbrücken

Outline

• Information Management Applications

• Information Retrieval Techniques

• Categorization, Clustering

• Summarization

• Information Extraction

• Question Answering

• Points for discussion

Well-known Applications of Collocations

• Lexicography

• Machine Translation

• NL Generation

• NL Parsing

• Terminology Extraction

• Foreign Language Teaching

• Speech Recognition

Information Management Applications

• Information Retrieval

• Text Categorisation (by language, topic, author, genre ...)

• Clustering

• Summarisation / Keyword Extraction

• Information Extraction

• Question Answering

Information Retrieval

• Most IR systems don't retrieve information, but documents

• Boolean retrieval: an unordered set of documents is returned as the result of a query

• Ranked retrieval: an ordered list of documents is returned; relevance of documents is determined by matching with a query

IR System Model

[Diagram: documents are mapped to a representation RD and the query to a representation RQ; matching RD against RQ yields a score in [0, 1]]

Query Languages

• Co-occurrence within a document: information AND retrieval

• Negation: information AND (NOT retrieval)

• Multi-word expression: "information retrieval"

• Proximity operators: information NEAR retrieval
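
The four query operators can be illustrated with a minimal sketch over a tokenized document. The function names and the NEAR window of 5 tokens are assumptions made for illustration, not part of any particular query language.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def positions(tokens, term):
    """Return every index at which `term` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def match_and(tokens, a, b):
    """Co-occurrence within a document: a AND b."""
    return a in tokens and b in tokens

def match_and_not(tokens, a, b):
    """Negation: a AND (NOT b)."""
    return a in tokens and b not in tokens

def match_phrase(tokens, a, b):
    """Multi-word expression: a immediately followed by b."""
    return any(tokens[i + 1:i + 2] == [b] for i in positions(tokens, a))

def match_near(tokens, a, b, window=5):
    """Proximity: a NEAR b, here within `window` tokens (assumed value)."""
    return any(abs(i - j) <= window
               for i in positions(tokens, a)
               for j in positions(tokens, b))

doc = tokenize("Retrieval of information is studied in information science.")
print(match_and(doc, "information", "retrieval"))     # True
print(match_and_not(doc, "information", "retrieval")) # False
print(match_phrase(doc, "information", "retrieval"))  # False
print(match_near(doc, "information", "retrieval"))    # True
```

The example document matches the AND and NEAR queries but not the phrase query, because the two words never occur in adjacent positions in that order.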

Evaluation

• Precision

• Recall

• Precision/recall graphs

• 11-point average precision

• TREC (Text Retrieval Conference)

• TREC Tasks: ad-hoc, web, spoken documents, multimedia, cross-language ...
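
As a sketch of these measures, the code below computes set-based precision and recall and an 11-point interpolated average precision for a single query. The function names and the toy ranking are hypothetical; the interpolation rule (highest precision at any recall at or above each level) is the usual convention and is assumed here.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_average_precision(ranking, relevant):
    """Mean of interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for step in range(11):
        level = step / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11

ranking = ["d1", "d2", "d3", "d4"]   # toy ranked result list
relevant = {"d1", "d3"}              # toy relevance judgements
print(precision_recall(ranking[:2], relevant))            # (0.5, 0.5)
print(eleven_point_average_precision(ranking, relevant))
```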

Relevance

• Relevance is the match between a document and an information need expressed through a query

• Relevance is treated as binary and determined by human assessors for document-query pairs

• Relevance is modelled by a similarity measure that compares query and document representations

Document and Query Representations in IR

• Documents and queries are generally represented as vectors of term weights

• Documents are treated as bags of words

• Preprocessing: stemming or morphological analysis

• POS tagging, chunking and syntactic analysis did not improve information retrieval performance

Similarity Measures

• Term weighting: TF, TF/ICF, TF/IDF

• Similarity measures determine how close two documents are, or how alike a document and a query are

• A common similarity measure is the cosine of the angle between the vector representations
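
A minimal sketch of TF/IDF weighting and cosine similarity over bag-of-words vectors. The weighting variant (raw term frequency times log(N/df)) is one common choice and is an assumption here, as are the toy documents.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "information retrieval systems rank documents".split(),
    "retrieval systems return ranked documents".split(),
    "collocations are multi-word units".split(),
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # positive: three shared terms
print(cosine(vectors[0], vectors[2]))  # 0.0: no shared terms
```

Note that a term occurring in every document gets weight zero under this variant, which is the intended effect of the IDF factor.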

Cosine Similarity

[Figure: vectors plotted in a two-dimensional term space with axes term1 and term2]

Result Ranking

• Adjacency or proximity of search terms can be taken into account in the ranking of retrieval results

• This accounts for phrases and collocations

• Search terms occurring near each other (e.g. within a paragraph) are more likely to be related than search terms occurring in different parts of a document
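
One way to take proximity into account is to score a document by the shortest token window that contains all search terms. The scoring formula below (number of distinct terms divided by window length) is a hypothetical choice for illustration only.

```python
def min_span(tokens, terms):
    """Length of the shortest token window containing every term, or None."""
    wanted = set(terms)
    best = None
    for start, tok in enumerate(tokens):
        if tok not in wanted:
            continue
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in wanted:
                seen.add(tokens[end])
            if seen == wanted:
                span = end - start + 1
                if best is None or span < best:
                    best = span
                break
    return best

def proximity_score(tokens, terms):
    """Hypothetical score in [0, 1]; 1.0 means the terms are adjacent."""
    span = min_span(tokens, terms)
    return 0.0 if span is None else len(set(terms)) / span

query = ["information", "retrieval"]
close = "fast information retrieval engine".split()
far = "information about cats and dogs and the retrieval of sticks".split()
print(proximity_score(close, query))  # 1.0
print(proximity_score(far, query))    # 0.25
```

Such a score would typically be combined with a term-weight similarity rather than used on its own.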

Latent Semantic Indexing

• LSI: Singular value decomposition, dimensionality reduction

• LSI associates terms that share the same context, i.e. terms that can be substituted for one another

• Applications: information retrieval, cross-language IR, language learning, text categorisation, vocabulary tests

Query Expansion

• Query expansion with related terms (e.g. from WordNet or a thesaurus)

• Relevance Feedback: Query expansion with terms from relevant documents

• Blind Relevance Feedback: Query expansion with terms from top-ranking documents. Expansion with co-occurring terms improves precision/recall.
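
Blind relevance feedback can be sketched as follows: the top-ranked documents of an initial retrieval run are assumed to be relevant, and their most frequent new terms are appended to the query. The function name, the cut-offs and the stopword list are assumptions for illustration.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_docs=2, new_terms=2, stopwords=()):
    """Add the most frequent new terms from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc
                      if t not in query_terms and t not in stopwords)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ordered[:new_terms]]

# Toy ranked result list for the query "collocation" (already tokenized).
ranked_docs = [
    "collocation extraction from a corpus".split(),
    "corpus statistics and association measures".split(),
    "server hosting and collocation facility".split(),
]
expanded = expand_query(["collocation"], ranked_docs,
                        stopwords={"a", "from", "and"})
print(expanded)  # ['collocation', 'corpus', 'association']
```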

Language Models for IR

• Language models generate queries from documents

• Estimate probability that a given query was generated by a particular document

• Unigram language models

• (special case of probabilistic IR)
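
A minimal sketch of query-likelihood ranking with a unigram language model. The slides do not name a smoothing method; linear interpolation with collection statistics (Jelinek-Mercer smoothing) and the mixing weight 0.5 are assumptions made so that unseen query terms do not zero out the score.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) under a smoothed unigram model.

    `lam` mixes the document and collection estimates; both the
    smoothing method and the value 0.5 are assumptions.
    """
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

d1 = "information retrieval ranks documents".split()
d2 = "collocations in machine translation".split()
collection = d1 + d2
query = ["information", "retrieval"]
# d1 contains both query terms, so it receives the higher score.
print(query_log_likelihood(query, d1, collection))
print(query_log_likelihood(query, d2, collection))
```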

Cross-language IR

• Methods: document translation, query translation, parallel/comparable corpora

Document Categorization

• Similar techniques to IR (document representation, similarity measures)

• Document base contains categorized documents

• New document as query, which retrieves the best matching documents from the database

• Support Vector Machines achieve very good performance on various text categorization tasks
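
The retrieval-style approach (new document as query against categorized documents) can be sketched as a nearest-neighbour classifier. This is not an SVM; the majority vote over k neighbours, the raw term-frequency vectors and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def categorize(new_doc, labelled_docs, k=3):
    """Majority category among the k labelled documents most similar to new_doc."""
    query = Counter(new_doc)
    ranked = sorted(labelled_docs,
                    key=lambda item: cosine(query, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

labelled = [
    ("stocks fell as markets closed".split(), "finance"),
    ("the bank raised interest rates".split(), "finance"),
    ("the team won the match".split(), "sport"),
    ("the striker scored a late goal".split(), "sport"),
]
print(categorize("markets react to interest rates".split(), labelled))  # finance
```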

Document Clustering

• Similar techniques to IR (document representation, similarity measures)

• Each cluster is represented by a centroid

• Iterative hierarchical grouping of similar documents
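
A minimal sketch of iterative hierarchical (agglomerative) grouping in which each cluster is represented by its centroid: at every step the two clusters with the most similar centroids are merged. The stopping criterion (a target number of clusters) and the toy two-dimensional vectors are assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, n_clusters):
    """Merge the two clusters with the most similar centroids until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]  # lists of document indices
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([vectors[i] for i in clusters[a]])
                cb = centroid([vectors[i] for i in clusters[b]])
                sim = cosine(ca, cb)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy term-weight vectors: two documents dominated by term1, two by term2.
doc_vectors = [[4, 1], [5, 1], [1, 4], [1, 6]]
print(cluster(doc_vectors, 2))  # [[0, 1], [2, 3]]
```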

Summarization

• Two approaches: Extraction (of sentences or keywords) and abstraction (summary generation)

• Indicative vs. informative summaries

• Query-independent vs. query-biased summaries

• Evaluation criteria: informativeness, coherence
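
The extraction approach can be sketched as sentence scoring: each sentence is scored by the average document-wide frequency of its words, and the best sentences are returned in their original order. The scoring rule, the naive sentence splitter and the stopword list are assumptions; this produces a query-independent summary.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1, stopwords=()):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [[w for w in re.findall(r"\w+", s.lower()) if w not in stopwords]
                 for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    # Score = average document-wide frequency of the sentence's words.
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in tokenized]
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]

text = ("Collocations are multi-word units. "
        "Collocations help retrieval. "
        "The weather was nice.")
print(summarize(text, 1, stopwords={"are", "the", "was"}))
# ['Collocations help retrieval.']
```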

Information Extraction

• Tasks: named entity extraction, coreference, template extraction

• Named entities: person, organisation, location, time, date, money, percentage

• Methods: finite-state grammars, finite-state transducers

• Evaluation: precision, recall, F-measure
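
As a small illustration of finite-state named entity extraction, the sketch below uses regular expressions (which describe finite-state patterns) for three of the entity types listed above. The patterns are hypothetical and deliberately tiny; real IE grammars are far richer.

```python
import re

# Hypothetical, deliberately small patterns for illustration only.
PATTERNS = {
    "money": re.compile(r"(?:\$|€)\s?\d+(?:[.,]\d+)*(?:\s(?:million|billion))?"),
    "percentage": re.compile(r"\d+(?:\.\d+)?\s?(?:%|percent)"),
    "date": re.compile(
        r"\d{1,2}\s(?:January|February|March|April|May|June|July|"
        r"August|September|October|November|December)\s\d{4}"),
}

def extract_entities(text):
    """Return (type, matched string) pairs for every pattern match in the text."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        found.extend((entity_type, m.group(0)) for m in pattern.finditer(text))
    return found

sample = ("On 4 January 2016 the company reported a profit of "
          "$12.5 million, up 8 percent.")
print(extract_entities(sample))
# [('money', '$12.5 million'), ('percentage', '8 percent'), ('date', '4 January 2016')]
```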

Question Answering

• Answer extraction (passage retrieval) vs. information extraction + answer generation

• Combination of IR-based and NLP-based approaches (semantic concepts, dependency relations)

• TREC open domain QA evaluation: extract 50-word passage containing the answer to a factual question

Co-location spaces

• Linear (speech, text)

• Document (as bag of words)

• Hierarchical structure (tree, dependency relations)

• Semantic/conceptual space (e.g. WordNet)

• Cyberspace (hyperlinks)

Collocations and IM

Collocations are

• multi-word units

• with statistical associations

• with restricted semantic compositionality

Are they useful for information management applications?
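
The "statistical association" in this definition can be measured in several ways. The sketch below uses pointwise mutual information (PMI) over adjacent word pairs as one common association measure; the choice of PMI, the minimum count and the toy text are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information for adjacent word pairs seen >= min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "strong tea and strong coffee but weak tea and strong tea".split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than independence would predict; on such a tiny sample the values are only illustrative.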

Collocations and Document Representations

• Common representations treat terms in the document and query as independent

• Collocations research shows that they are not independent

• Implications?

Collocations / Query Formulation

• Use of collocations for query expansion? (e.g. collocation, corpus, association ... vs. collocation, facility, service, server, hosting ...)

• Automatic or interactive?

Collocations / Categorisation & Clustering

• How much can category-specific collocations improve performance?

• Collocations for identification of genre, author, dialect?

Collocations / IE

• IE techniques (finite-state shallow parsing) for collocation identification

• Use of collocations in IE grammars (e.g. German Gewinn machen "to make a profit", Umsatz erzielen "to achieve turnover" ...)

Collocations / QA

• Use of collocations for finding answers (e.g. function-proper_name)

Collocations and Summarisation

• Keyword / key phrase extraction

• Evaluation of coherence: which association measures can be used?

Questions for Discussion

• Are collocations a useful level of representation for indexing and retrieval?

• Or are they only useful in establishing semantic representations?