Elasticsearch: implementing document full-text search Bastian Mathes Elasticsearch Meetup Köln 2015-08-27


Elasticsearch: implementing document full-text search

Bastian Mathes

Elasticsearch Meetup Köln
2015-08-27

Slides: files.meetup.com/18804222/raytion_elasticsearch_full_text.pdf


2 Introduction

• Elasticsearch is very successful as a log analysis tool
• but it is also a very good search engine
  • ... with some unique features for handling structured data
• in this talk let's focus on unstructured data


3 What do we mean by unstructured data?

• can be websites ...
• but often this is a more diverse office mix
  • various formats
  • many languages
  • large in file size or pages
  • various source systems


4 Where is that used?

• Website search
• eCommerce search
• Enterprise search
  • one place to find all information inside the company
  • honor access rights


5 What are the challenges?

• Document conversion / text extraction
• Linguistics
• Secure search
• Source systems access


6 Document conversion

• Extraction of text (and metadata) from your documents
• Need converters
  • Commercial: Oracle Outside In, Microsoft IFilter, HP/Autonomy KeyView
  • Open source: Apache Tika
• Move processing out of the search cluster
  • Near the source system, somewhere in between
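As a rough illustration of converting documents before they reach the search cluster, the sketch below extracts title and body text from HTML with Python's standard library and builds a plain dictionary that a separate step could send to the index. It is only a stand-in for a real converter such as Apache Tika; the names `TextExtractor` and `convert` and the field names `id`, `title`, `body` are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title> from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def convert(html, source_id):
    """Turn raw HTML into a plain dict that is ready to be sent to the index."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the index only sees clean text.
    body = " ".join(" ".join(parser.text_parts).split())
    title = " ".join("".join(parser.title_parts).split())
    return {"id": source_id, "title": title, "body": body}


doc = convert(
    "<html><head><title>Report</title><style>p {}</style></head>"
    "<body><p>Quarterly   numbers</p><p>are up.</p></body></html>",
    "doc-1",
)
print(json.dumps(doc))
```

Because the conversion runs in its own process, it can be placed near the source system and scaled independently of the search nodes.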


7 Linguistics

• at least tokenization (Standard Tokenizer)
• a lot of room for improvement
  • Language specific tokenization
  • Stemming or better lemmatization
    • Handle tokens with the same meaning as equal
    • Raise recall, keep precision
    • Overstemming: universal, universe, university
  • Synonyms (Synonym Token Filter)
  • Decompounding
  • Named Entity Recognition
    • Detecting entities (locations, organizations, people) in the text
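The overstemming example can be reproduced with a deliberately naive suffix stripper. This is a toy written for illustration, not the algorithm of any Lucene stemmer; the suffix list and the function name `naive_stem` are assumptions.

```python
SUFFIXES = ("ity", "al", "e")  # illustrative only, not a real stemming rule set


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


words = ["universal", "universe", "university"]
stems = {word: naive_stem(word) for word in words}
print(stems)
```

All three words end up with the same stem, so a query for one of them also matches the other two: recall goes up, but precision suffers. A dictionary-based lemmatizer would aim to keep them apart.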


8 Linguistics cont.

• Language detection in Apache Tika
• a lot of Analyzers in Elasticsearch/Lucene
• Try Hunspell for lemmatization
• Play with Stanford NER (English, German, Chinese) and Apache OpenNLP
• there is more in open-source academia, but very specific
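A custom analyzer that combines the building blocks mentioned so far might be configured roughly as follows. The settings are written as a Python dictionary that serializes to the JSON body of an index-creation request. The names `en_synonyms`, `en_hunspell` and `en_fulltext` and the synonym pair are made up for this sketch, the Hunspell filter additionally needs matching dictionary files installed on the nodes, and the available options differ between Elasticsearch versions.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Synonym Token Filter: treat both terms as equivalent.
                "en_synonyms": {
                    "type": "synonym",
                    "synonyms": ["notebook, laptop"],
                },
                # Hunspell Token Filter: dictionary-based stemming.
                "en_hunspell": {
                    "type": "hunspell",
                    "locale": "en_US",
                },
            },
            "analyzer": {
                "en_fulltext": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "en_synonyms", "en_hunspell"],
                }
            },
        }
    }
}
print(json.dumps(index_settings, indent=2))
```

The order of the filter chain matters: lowercasing first means the synonym and Hunspell filters only ever see lowercase tokens.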


9 Linguistics cont.

[Figure not included in the transcript; image © Basis Technology]


10 Secure Search

• Document level security
• Have a look at Shield for protecting cluster and indexes
• Basic idea is simple:
  • Transfer access rights from source system to search index
  • at search time create a filter to only show results the searcher is authorized to see
• Common pitfalls
  • User-to-groups resolution has to be cached
  • Multiple source systems with different authentication/authorization schemas
  • Domain migrations etc.
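The basic idea and the caching pitfall can be sketched together. The example assumes that every indexed document carries a field `allowed_principals` listing the users and groups that may read it; that field name, `GroupCache`, `secure_query` and the fake directory lookup are all invented for illustration. The query uses the `bool` query with a `filter` clause from newer Elasticsearch versions; older 1.x releases expressed the same thing with the `filtered` query.

```python
import time


class GroupCache:
    """Caches user-to-groups lookups for a limited time (TTL in seconds)."""

    def __init__(self, resolver, ttl=300.0, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # user -> (expiry time, groups)

    def groups_for(self, user):
        now = self._clock()
        entry = self._entries.get(user)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, skip the directory lookup
        groups = sorted(self._resolver(user))
        self._entries[user] = (now + self._ttl, groups)
        return groups


def secure_query(text, user, cache):
    """Full-text query restricted to documents the user may read."""
    principals = [user] + cache.groups_for(user)
    return {
        "query": {
            "bool": {
                "must": {"match": {"body": text}},
                "filter": {"terms": {"allowed_principals": principals}},
            }
        }
    }


calls = []


def fake_directory(user):
    # Stand-in for a slow LDAP / Active Directory lookup.
    calls.append(user)
    return {"alice": ["staff", "finance"]}.get(user, [])


cache = GroupCache(fake_directory)
query = secure_query("budget", "alice", cache)
secure_query("forecast", "alice", cache)  # second call is served from the cache
print(query["query"]["bool"]["filter"])
print(len(calls))
```

The time-to-live is a trade-off: a longer value saves directory lookups, a shorter one makes revoked group memberships take effect sooner.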


11 Secure Search cont.


12 Connectors

• Get data from source system to search index
• Events vs. synchronization, track changes
• Access control lists
• Open source solutions: Apache Nutch, Apache ManifoldCF
• Make or buy
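When a source system does not emit change events, a connector has to synchronize: compare what the source holds with what the index holds and derive the necessary actions. A minimal sketch, assuming both sides can be listed as a mapping from document id to some version marker (the function name `plan_sync` and the action names are hypothetical):

```python
def plan_sync(source, indexed):
    """Compare source and index state; both map document id -> version marker.

    Returns a list of (action, doc_id) tuples in a stable order.
    """
    actions = []
    for doc_id in sorted(source):
        if doc_id not in indexed:
            actions.append(("index", doc_id))    # new in the source system
        elif source[doc_id] != indexed[doc_id]:
            actions.append(("reindex", doc_id))  # changed since last crawl
    for doc_id in sorted(indexed):
        if doc_id not in source:
            actions.append(("delete", doc_id))   # removed from the source system
    return actions


source_state = {"a": "v2", "b": "v1", "d": "v1"}
index_state = {"a": "v1", "b": "v1", "c": "v1"}
plan = plan_sync(source_state, index_state)
print(plan)
```

If the version marker also covers the access control list, a permission change on an otherwise unchanged document triggers a reindex as well.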


Thank you

Bastian Mathes
[email protected]

Raytion GmbH

Benrather Strasse 18-20
40213 Düsseldorf
Germany

T +49. 211. 55 02 66. 0

www.raytion.com

© Copyright 2015 Raytion GmbH, Düsseldorf