Elasticsearch
CS 4422/7263 Information Retrieval

Jiho Noh
Kennesaw State University

Contents from:
• Sundog Education / Elasticsearch
• Official Elasticsearch tutorials

Installing and Basics

Elasticsearch — Overview

• Open source search and analytics engine for all types of data
• Started off as a scalable version of Apache Lucene
• Part of the ELK Stack (Elasticsearch, Logstash, and Kibana)
• Elasticsearch maintains “shards”, where each shard is an inverted index of documents
• But not just for full-text search!
  • Can handle structured data, and can aggregate data quickly
  • Often a faster solution than Hadoop/Spark/Flink/etc. for search and aggregation queries

Shards

• Documents are hashed, and thereby split, across shards; each document lands in one particular shard.

[Figure: An Index, drawn as a row of shards]

• Each shard may be on a different node in a cluster
• Every shard is a self-contained Lucene index of its own
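
The hashing step can be illustrated with a small sketch. In simplified form, Elasticsearch describes routing as shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document _id. The function name pick_shard and the use of zlib.crc32 are assumptions made for illustration only; Elasticsearch uses its own Murmur3-based hash, so real shard numbers will differ.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document _id) to a primary shard number."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # Stand-in hash (assumption): CRC32 instead of Elasticsearch's Murmur3.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


# Hypothetical document ids, placed into an index with two primary shards.
doc_ids = ["1", "2", "3", "62427"]
placement = {doc_id: pick_shard(doc_id, 2) for doc_id in doc_ids}
print(placement)
```

Because the result depends on the number of primary shards, the same document id would generally map to a different shard if that number changed.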

Fault Tolerance

• This index has two primary shards, each with two replicas

[Figure: three nodes (node 1, node 2, node 3) holding the shards Primary0, Primary1, two copies of Replica0, and two copies of Replica1]

• Write requests are routed to the primary shard, then replicated
• Read requests are routed to the primary or any replica
• The number of primary shards cannot be changed once the index is set up

Logstash / Beats

• Tools for publishing data into Elasticsearch
• Filebeat monitors log files, parses them, and imports them into Elasticsearch in near real time
• Logstash reads data from machines and feeds it into Elasticsearch
• General-purpose systems; not just for log files

Kibana

• Web UI for searching and visualizing

• Complex aggregations, graphs, charts

• Often used for log analysis

Elasticsearch Installation

We are going to use a Linux (e.g., Ubuntu) system for installing and running Elasticsearch 7.

What if I use Mac or Windows?

• Cloud computing resources, such as Amazon EC2
• VirtualBox + Ubuntu on your machine

Elasticsearch Installation

• Elasticsearch official installation guide

Installing on Ubuntu

Before installing, import the Elasticsearch signing key (PGP).

• Download and install the key
  wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

• Install the apt-transport-https package
  sudo apt-get install apt-transport-https

• Save the repository definition
  echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list

• The following command installs the Elasticsearch package on your machine
  sudo apt-get update && sudo apt-get install elasticsearch

Configuring Elasticsearch

• Edit the configuration file
  sudo vi /etc/elasticsearch/elasticsearch.yml

• Uncomment (and edit) the lines below:
  • node.name: node-1
  • network.host: 0.0.0.0
  • discovery.seed_hosts: ["127.0.0.1"]
  • cluster.initial_master_nodes: ["node-1"]

Running Elasticsearch with systemd

• The following commands enable the Elasticsearch service so that it runs as a daemon:

sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service

• Elasticsearch can be started and stopped using these commands:

sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service

Checking that Elasticsearch is running

curl -XGET 127.0.0.1:9200

• This should give you a response like this:

{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "_UFe2UXmQka_Zmrkijn0IA",
  "version" : {
    "number" : "7.14.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1",
    "build_date" : "2021-07-29T20:49:32.864135063Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

HTTP and RESTful APIs

HTTP Request

Component   Description
Method      The “verb” of the request: GET, POST, PUT, DELETE, etc.
Protocol    The protocol and version used for the communication (e.g., HTTP/1.1)
Host        The web server you want to talk to
URL         The resource being requested
Body        Auxiliary data needed in some cases
Headers     User-agent, content-type, etc.

HTTP Response example

>> curl -ivs --raw https://www.kennesaw.edu | less

HTTP/1.1 200 OK
Date: Tue, 17 Aug 2021 18:34:02 GMT
Server: Apache
X-Powered-By: PHP/5.4.16
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

20c8
<!DOCTYPE html><html lang="en" id="_63608cfd-163b-4208-b021-dd4e916bb815">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Kennesaw State University in Georgia</title>
...

RESTful APIs

• REpresentational State Transfer
• Web service using HTTP requests

Examples:

• GET: retrieve information (like search results)
• PUT: insert new information or replace existing information
• DELETE: delete existing information

RESTful APIs

Six guiding constraints:

• Client-server architecture
• Statelessness: every request (and response) must be self-contained
• Cacheability: responses can be cached
• Layered system
• Code on demand (i.e., sending JavaScript)
• Uniform interface: your data should have a predictable structure

Why REST?

Virtually every language and system supports HTTP requests and responses. Thus, communicating with Elasticsearch is language- and system-independent.

• All we need to learn is how to write an HTTP request for a particular command and how to parse the response.

The Curl Command

A way to issue HTTP requests from the command line

curl -H "Content-Type: application/json" <URL> -d '<BODY>'

Python example

import requests

r = requests.get('https://xkcd.com/1906/')
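
The parts of the curl template (method, URL, header, body) map directly onto a request object. The sketch below builds such a request with Python's standard urllib without sending it; the helper name build_request, the index name my-index, and the sample document are assumptions made for illustration.

```python
import json
import urllib.request


def build_request(method, url, body=None):
    """Build (but do not send) an HTTP request carrying an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical example: the index name and the document are made up.
req = build_request("PUT", "http://localhost:9200/my-index/_doc/1", {"message": "hello"})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# Sending it requires a server listening on localhost:9200:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```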

Examples

curl -X PUT "localhost:9200/my-index-000001/_doc/1" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000",
  "user": {
    "id": "kimchy"
  }
}
'

Insert a document API: PUT /<target>/_doc/<_id>

Examples

curl -X GET "localhost:9200/my-index-000001/_search?from=40&size=20" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "user.id": "kimchy"
    }
  }
}
'

Search API: GET /<target>/_search

Indexing

Example data

• The Complete Works of William Shakespeare
• Suitably parsed into fields
• Download the JSON file
• The Shakespeare dataset has the following structure:

{
  "line_id": INT,
  "play_name": "String",
  "speech_number": INT,
  "line_number": "String",
  "speaker": "String",
  "text_entry": "String"
}

Create an index for the corpus

curl -X PUT "localhost:9200/shakespeare?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "line_id": {"type": "integer"},
      "play_name": {"type": "keyword"},
      "speech_number": {"type": "integer"},
      "line_number": {"type": "text"},
      "speaker": {"type": "keyword"},
      "text_entry": {"type": "text"}
    }
  }
}'

Create Index with mapping API: PUT /<index>

Mapping

• A mapping is a schema definition.
• Elasticsearch sets up an index with reasonable defaults
• But you will often need to provide particular mapping properties
• Field types
  • text, keyword, byte, short, integer, long, float, double, boolean, date
• Field index: do you want the field to be indexed?
  • true / false (analyzed / not_analyzed / no in Elasticsearch versions before 5.0)
• Field analyzer: define your tokenizer and token filters.
  • standard / whitespace / simple / english, etc.

Loading data

Get the mapping

curl "localhost:9200/shakespeare/_mapping"

Insert data by loading the dataset in bulk

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json

List all the indexes

curl "localhost:9200/_cat/indices"

Search a phrase

curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text_entry": "all that glitters is not gold"
    }
  }
}'

Update a single entry

curl -X POST "localhost:9200/shakespeare/_update/62427" -H 'Content-Type: application/json' -d'
{
  "doc": {
    "text_entry": "ALL THAT GLITTERS IS NOT GOLD;"
  }
}'

Search

"Query Lite" search

curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text_entry": "glitters"
    }
  }
}'

The same search via URI search:

curl "localhost:9200/shakespeare/_search?q=text_entry:glitters&pretty"

"Query Lite" search

curl "localhost:9200/shakespeare/_search?q=text_entry:%22make%20choice%22~3&pretty"

• URL needs to be encoded
• Cryptic
• Security issue if exposed to end users
• Fragile and difficult to debug

It's better to send a JSON request
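
To see where the cryptic URL comes from, the sketch below percent-encodes the Lucene query text_entry:"make choice"~3 with Python's standard library. The helper name encode_query_lite is an assumption for illustration, and the output shown assumes Python 3.7 or later, where "~" is left unescaped.

```python
from urllib.parse import quote


def encode_query_lite(query: str) -> str:
    """Percent-encode a Lucene query string for use in the q= URL parameter."""
    # Keep ':' readable; double quotes and spaces are escaped (%22, %20).
    return quote(query, safe=":")


raw = 'text_entry:"make choice"~3'
encoded = encode_query_lite(raw)
print(encoded)  # text_entry:%22make%20choice%22~3

url = "localhost:9200/shakespeare/_search?q=" + encoded + "&pretty"
print(url)
```

A JSON request body needs no such escaping, which is part of why it is easier to read and debug.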

Queries and Filters

• Filters ask a yes/no question of your data
• Queries return data in terms of relevance

Use filters when you can — they are faster and cacheable

Some types of filters

Filter    Description
term      filter by exact values
terms     match if any exact values in a list match
range     find numbers or dates in a given range (gt, gte, lt, lte)
exists    find documents where a field exists
missing   find documents where a field is missing (removed in Elasticsearch 5.0; use must_not with exists instead)
bool      combine filters with Boolean logic (must, must_not, should)

Some types of queries

Query             Description
match_all         returns all documents (default)
match             searches analyzed results, such as full-text search
multi_match       runs the same query on multiple fields
match_phrase      matches exact phrases or word proximity
combined_fields   matches over multiple fields as if they had been indexed into one combined field
query_string      supports the Lucene query string syntax
bool              works like a bool filter, but results are scored by relevance

Example of query and filter contexts

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}
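
A script can assemble the same request body programmatically. The sketch below is one possible helper, not part of any Elasticsearch client; the name bool_query and its parameters are assumptions. Clauses under must run in query context and affect the relevance score, while clauses under filter only include or exclude documents.

```python
import json


def bool_query(must=None, filters=None):
    """Combine scored clauses (must) with yes/no clauses (filter) in one bool query."""
    bool_body = {}
    if must:
        bool_body["must"] = list(must)       # query context: contributes to the score
    if filters:
        bool_body["filter"] = list(filters)  # filter context: only includes/excludes
    return {"query": {"bool": bool_body}}


body = bool_query(
    must=[{"match": {"title": "Search"}}, {"match": {"content": "Elasticsearch"}}],
    filters=[
        {"term": {"status": "published"}},
        {"range": {"publish_date": {"gte": "2015-01-01"}}},
    ],
)
print(json.dumps(body, indent=2))
```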

Pagination

curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "from": 10,
  "size": 3,
  "query": {
    "match_phrase": { "text_entry": "life is" }
  }
}'

• The from parameter defines the number of hits to skip, defaulting to 0.
• The size parameter is the maximum number of hits to return, defaulting to 10.
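
User interfaces usually think in page numbers rather than offsets. A minimal sketch of the conversion, assuming 1-based page numbers (the helper name page_to_from_size is hypothetical):

```python
def page_to_from_size(page: int, page_size: int) -> dict:
    """Translate a 1-based page number into the from/size search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_to_from_size(1, 20))  # {'from': 0, 'size': 20}
print(page_to_from_size(3, 20))  # {'from': 40, 'size': 20}

# Merge the paging parameters into a search body like the one on this slide.
search_body = {**page_to_from_size(2, 3), "query": {"match_phrase": {"text_entry": "life is"}}}
print(search_body)
```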

Importing Data

Different ways of importing data

• Stand-alone scripts can submit bulk documents via the REST API
• Logstash and Beats can stream data from logs, S3, databases, and more
• AWS systems can stream in data via Lambda or Kinesis Firehose
• Kafka, Spark, and more have Elasticsearch integration add-ons

Write a script

• Write a script that generates a list of JSON entries
  • Read in data from a data source
  • Transform it into JSON bulk inserts
• Submit via HTTP / REST to your Elasticsearch cluster
• Bulk API:

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
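
The transform step can be sketched as follows: each record becomes an action line followed by a source line, and every line, including the last, ends with a newline, as the Bulk API expects. The helper name to_bulk_ndjson, the index name movies, and the sample records are assumptions made for illustration.

```python
import json


def to_bulk_ndjson(index_name, records, id_field=None):
    """Turn a list of dicts into a newline-delimited JSON body of index actions."""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name}}
        if id_field is not None:
            action["index"]["_id"] = str(record[id_field])
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(record))  # document source line
    # Each line is terminated by a newline, including the final one.
    return "".join(line + "\n" for line in lines)


# Hypothetical records standing in for rows read from a data source.
records = [
    {"movieId": 1, "title": "Example Movie A", "genres": "Drama"},
    {"movieId": 2, "title": "Example Movie B", "genres": "Comedy"},
]
payload = to_bulk_ndjson("movies", records, id_field="movieId")
print(payload, end="")
```

The resulting string could be saved to a file and submitted with curl --data-binary, or posted directly from the script.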

Importing with client libraries

• Elasticsearch client libraries are available for most languages
• Python has an elasticsearch package

Logstash

Logstash is a data loader

[Figure: Logstash shown between data sources and destinations; labels: Files, S3, Beats, Kafka, Elasticsearch, AWS, Hadoop, MongoDB]

Logstash is a data loader

• Server-side data processing pipeline
• It can parse, transform, and filter data
• It can derive structure from unstructured data
• It can anonymize personal data or exclude it entirely
• It can do geo-location lookups
• It can scale across many nodes
• A huge list of filter plugins is available

Installing and configuring Logstash

sudo apt-get install openjdk-8-jre-headless
sudo apt-get update && sudo apt-get install logstash
sudo vi /etc/logstash/conf.d/apache-access.conf

Import data from an Apache access_log into Elasticsearch

The grok filter can parse log data and program output.

Running Logstash

• Run Logstash as specified in the config file

cd /usr/share/logstash/
sudo bin/logstash --path.config /etc/logstash/conf.d/apache-access.conf

• Check the Elasticsearch indices for the data Logstash created

curl -XGET "127.0.0.1:9200/_cat/indices?v"

Importing CSV file with Logstash

filter {
  csv {
    separator => ","
    skip_header => "true"
    columns => ["movieId", "title", "genres"]
  }
}

• CSV filter plugin
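
What this filter does to each line can be approximated in a few lines of Python. This is only an analogy under stated assumptions, not how Logstash is implemented: the function name parse_csv_rows and the sample data are made up, and the header row is skipped when it exactly matches the configured column names.

```python
import csv
import io

COLUMNS = ["movieId", "title", "genres"]


def parse_csv_rows(text, columns=COLUMNS, skip_header=True):
    """Split CSV lines on ',' and name the values after the configured columns."""
    events = []
    for row in csv.reader(io.StringIO(text), delimiter=","):
        if not row:
            continue  # ignore blank lines
        if skip_header and row == list(columns):
            continue  # drop the header row that repeats the column names
        events.append(dict(zip(columns, row)))
    return events


# Hypothetical input: a header line followed by one data line with a quoted comma.
sample = 'movieId,title,genres\n1,"Example, The (2000)",Drama|Comedy\n'
events = parse_csv_rows(sample)
for event in events:
    print(event)
```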

Kibana

Installing Kibana

sudo apt install kibana
sudo vi /etc/kibana/kibana.yml

(change server.host to 0.0.0.0)

sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable kibana.service
sudo /bin/systemctl start kibana.service

• Kibana port number is 5601
• http://10.80.34.86:5601