TRANSCRIPT

Elasticsearch
CS 4422/7263 Information Retrieval
Jiho Noh, Kennesaw State University

Contents from:
• Sundog Education / Elasticsearch
• Official Elasticsearch tutorials
Installing and Basics
Elasticsearch — Overview
• Open source search and analytics engine for all types of data
• Started off as scalable Apache Lucene
• Part of the ELK Stack (Elasticsearch, Logstash, and Kibana)
• Elasticsearch maintains “shards”, where each shard is an inverted index of documents
• But not just for full-text search!
• Can handle structured data, and can aggregate data quickly
• Often a faster solution than Hadoop/Spark/Flink/etc.
Shards
• Documents are hashed (and split) into a particular shard
[Figure: an index split into multiple shards]
• Each shard may be on a different node in a cluster
• Every shard is a self-contained Lucene index of its own
Fault Tolerance
• This index has two primary shards and two replicas
• Write requests are routed to the primary shard, then replicated
• Read requests are routed to the primary or any replica
• The number of primary shards cannot be changed once set up
[Figure: primary shards (Primary0, Primary1) and replicas (Replica0, Replica1) distributed across node 1, node 2, and node 3]
Logstash / Beats
• Tools for publishing data into Elasticsearch
• FileBeat monitors log files, parses them, and imports them into Elasticsearch in near-real-time
• Logstash reads data from machines and feeds it into Elasticsearch
• General-purpose system; not just for log files
Kibana
• Web UI for searching and visualizing
• Complex aggregations, graphs, charts
• Often used for log analysis
Elasticsearch Installation
We are going to use a Linux (e.g., Ubuntu) system for installing and running Elasticsearch 7.
What if I use Mac or Windows?
• Cloud computing resources, such as Amazon EC2
• VirtualBox + Ubuntu on your machine
Elasticsearch Installation
• Elasticsearch official installation guide
Installing on Ubuntu
Before the installation, import the Elasticsearch signing key (PGP).
• Download and install the key
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
• Install the apt-transport-https package
sudo apt-get install apt-transport-https
• Save the repository definition
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list
• The following command will install the Elasticsearch package on your machine
sudo apt-get update && sudo apt-get install elasticsearch
Configuring Elasticsearch
• Edit the configuration file
sudo vi /etc/elasticsearch/elasticsearch.yml
• Uncomment (and edit) the following lines:
node.name: node-1
network.host: 0.0.0.0
discovery.seed_hosts: ["127.0.0.1"]
cluster.initial_master_nodes: ["node-1"]
Running Elasticsearch with systemd
• The following commands will enable the Elasticsearch service as a daemon:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
• Elasticsearch can be started and stopped using these commands:
sudo systemctl start elasticsearch.service
sudo systemctl stop elasticsearch.service
Checking that Elasticsearch is running
curl -XGET 127.0.0.1:9200
• This should give you a response something like this:
{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "_UFe2UXmQka_Zmrkijn0IA",
  "version" : {
    "number" : "7.14.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1",
    "build_date" : "2021-07-29T20:49:32.864135063Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
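The response body is ordinary JSON, so any language with a JSON parser can consume it. A minimal Python sketch (response text abbreviated from the example output above):

```python
import json

# The body returned by `curl -XGET 127.0.0.1:9200` is plain JSON.
# (Abbreviated from the example response above.)
response_body = """
{
  "name": "node-1",
  "cluster_name": "elasticsearch",
  "version": {"number": "7.14.0", "lucene_version": "8.9.0"},
  "tagline": "You Know, for Search"
}
"""

info = json.loads(response_body)
print(info["version"]["number"])  # 7.14.0
print(info["tagline"])            # You Know, for Search
```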
HTTP and RESTful API’s
HTTP Request
Component Description
Method The “verb” of the request {GET, POST, PUT, DELETE, etc.}
Protocol The protocol and version used for the communication (e.g., HTTP/1.1)
Host The web server you want to talk to
URL The resource being requested
Body Auxiliary data needed in some cases
Headers User-agent, content-type, etc.
HTTP Response example
>> curl -ivs --raw https://www.kennesaw.edu | less
HTTP/1.1 200 OK
Date: Tue, 17 Aug 2021 18:34:02 GMT
Server: Apache
X-Powered-By: PHP/5.4.16
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8

20c8
<!DOCTYPE html>
<html lang="en" id="_63608cfd-163b-4208-b021-dd4e916bb815">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Kennesaw State University in Georgia</title>
...
RESTful API’s
• REpresentational State Transfer
• Web service using HTTP requests
Examples:
• GET: retrieve information (like search results)
• PUT: insert or replace new information
• DELETE: delete existing information
RESTful API’s
Six guiding constraints:
• Client-server architecture
• Statelessness; every request (and response) must be self-contained
• Cacheability; responses can be cached
• Layered system
• Code on demand (i.e., sending JavaScript)
• Uniform interface; your data should have some structure and be predictable
Why REST?
Whatever language or system you use, it must support HTTP requests and responses. Thus, communicating with Elasticsearch is language- and system-independent.
• Learning how to write an HTTP request for a particular command, and how to parse the response, is all we need.
The Curl Command
A way to issue HTTP requests from the command line
curl -H 'Content-Type: application/json' <URL> -d '<BODY>'
Python example
import requests
r = requests.get('https://xkcd.com/1906/')
Examples
curl -X PUT "localhost:9200/my-index-000001/_doc/1" -H 'Content-Type: application/json' -d'
{
"@timestamp": "2099-11-15T13:12:00",
"message": "GET /search HTTP/1.1 200 1070000",
"user": {
"id": "kimchy"
}
}
'
Insert a document API: PUT /<target>/_doc/<_id>
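Because the API is just HTTP, the same indexing call can be composed from any language. A sketch in Python that builds the method, URL, and body for the PUT above (the helper name is made up for illustration; it only constructs the request, it does not send it):

```python
import json

def index_doc_request(host, index, doc_id, doc):
    """Build the pieces of an index-document call: PUT /<target>/_doc/<_id>."""
    url = f"http://{host}/{index}/_doc/{doc_id}"
    return "PUT", url, json.dumps(doc)

method, url, body = index_doc_request(
    "localhost:9200", "my-index-000001", 1,
    {"@timestamp": "2099-11-15T13:12:00",
     "message": "GET /search HTTP/1.1 200 1070000",
     "user": {"id": "kimchy"}})
print(method, url)  # PUT http://localhost:9200/my-index-000001/_doc/1
```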
Examples
curl -X GET "localhost:9200/my-index-000001/_search?from=40&size=20"
-H 'Content-Type: application/json' -d'
{
"query": {
"term": {
"user.id": "kimchy"
}
}
}
'
Search API: GET /<target>/_search
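The search body is plain JSON as well. A small Python helper (illustrative only, not part of any client library) that builds the term query used above:

```python
import json

def term_query(field, value):
    """Build the JSON body of a term query: exact match on a single field."""
    return json.dumps({"query": {"term": {field: value}}})

print(term_query("user.id", "kimchy"))
# {"query": {"term": {"user.id": "kimchy"}}}
```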
Indexing
Example data
• The Complete Works of William Shakespeare
• Suitably parsed into fields
• Download the JSON file
• The Shakespeare dataset has the following structure:
{
  "line_id": INT,
  "play_name": "String",
  "speech_number": INT,
  "line_number": "String",
  "speaker": "String",
  "text_entry": "String"
}
Create an index for the corpus
curl -X PUT "localhost:9200/shakespeare?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "line_id": {"type": "integer"},
      "play_name": {"type": "keyword"},
      "speech_number": {"type": "integer"},
      "line_number": {"type": "text"},
      "speaker": {"type": "keyword"},
      "text_entry": {"type": "text"}
    }
  }
}'
Create Index with mapping API: PUT /<index>
Mapping
• Mapping is a schema definition
• Elasticsearch sets up an index with reasonable defaults
• But mostly you will need to provide particular mapping properties
• Field types
• string, byte, short, integer, long, float, double, boolean, date
• Field index; do you want the field to be indexed?
• analyzed / not_analyzed / no
• Field analyzer; define your tokenizer and token filter
• standard / whitespace / simple / english, etc.
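As a sketch, the create-index body for the Shakespeare corpus can also be assembled programmatically. This only builds the JSON (it does not call Elasticsearch); the field names come from the dataset structure above:

```python
import json

# Map each Shakespeare dataset field to an Elasticsearch field type.
field_types = {
    "line_id": "integer",
    "play_name": "keyword",
    "speech_number": "integer",
    "line_number": "text",
    "speaker": "keyword",
    "text_entry": "text",  # analyzed, so full-text queries work on it
}

mapping = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 1},
    "mappings": {"properties": {f: {"type": t} for f, t in field_types.items()}},
}
print(json.dumps(mapping, indent=2))
```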
Loading data
curl "localhost:9200/shakespeare/_mapping"
Get the mapping
Insert data by loading the dataset in bulk
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare_6.0.json
curl "localhost:9200/_cat/indices"
List all the indexes
Search a phrase
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text_entry": "all that glitters is not gold"
    }
  }
}'
Update a single entry
curl -X POST "localhost:9200/shakespeare/_update/62427" -H 'Content-Type: application/json' -d'
{
  "doc": {
    "text_entry": "ALL THAT GLITTERS IS NOT GOLD;"
  }
}'
Update API: POST /<target>/_update/<_id>
Search
"Query Lite" search
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text_entry": "glitters"
    }
  }
}'
curl "localhost:9200/shakespeare/_search?q=text_entry:glitters&pretty"
via URI search
"Query Lite" search
curl "localhost:9200/shakespeare/_search?q=text_entry:%22make%20choice%22~3&pretty"
• URL needs to be encoded
• Cryptic
• Security issue if exposed to end users
• Fragile and difficult to debug
It's better to send a JSON request
Queries and Filters
• Filters ask a yes/no question of your data
• Queries return data in terms of relevance
Use filters when you can — they are faster and cacheable
Some types of filters
Filter Description
term filter by exact values
terms match if any exact values in a list match
range find numbers or dates in a given range (gt, gte, lt, lte)
exists find documents where a field exists
missing find documents where a field is missing
bool combine filters with Boolean logic (must, must_not, should)
Some types of queries
Query Description
match_all returns all documents (default)
match searches analyzed results, such as full text search
multi_match run the same query on multiple fields
match_phrase matching exact phrases or word proximity
combined_fields matches over multiple fields as if they had been indexed into one combined field
query_string supports the Lucene query string syntax
bool works like a bool filter, but results are scored by relevance
Example of query and filter contexts
GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "Search" }},
        { "match": { "content": "Elasticsearch" }}
      ],
      "filter": [
        { "term": { "status": "published" }},
        { "range": { "publish_date": { "gte": "2015-01-01" }}}
      ]
    }
  }
}
Pagination
curl "localhost:9200/shakespeare/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "from": 10,
  "size": 3,
  "query": {
    "match_phrase": { "text_entry": "life is" }
  }
}'
• The from parameter defines the number of hits to skip, defaulting to 0.
• The size parameter is the maximum number of hits to return.
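Assuming pages are numbered from 1, the from/size arithmetic is just the following (illustrative helper, not an Elasticsearch API):

```python
def page_params(page, page_size):
    """Translate a 1-based page number into Elasticsearch from/size."""
    return {"from": (page - 1) * page_size, "size": page_size}

print(page_params(1, 3))  # {'from': 0, 'size': 3}
print(page_params(4, 3))  # {'from': 9, 'size': 3}
```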
Importing Data
Different ways of importing data
• Stand-alone scripts can submit bulk documents via the REST API
• Logstash and Beats can stream data from logs, S3, databases, and more
• AWS systems can stream in data via Lambda or Kinesis Firehose
• Kafka, Spark, and more have Elasticsearch integration add-ons
Write a script
• Write a script that generates a list of JSON entries
• Read in data from a data source
• Transform it into JSON bulk inserts
• Submit via HTTP / REST to your Elasticsearch cluster
• Bulk API:
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
'
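A script that emits this bulk format can be very small. The sketch below (helper name and sample documents chosen for illustration) turns a list of records into the action-line/source-line NDJSON pairs that _bulk expects:

```python
import json

def to_bulk_ndjson(index, docs, id_field):
    """Serialize docs into an Elasticsearch _bulk payload:
    one {"index": ...} action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

docs = [
    {"line_id": 4, "speaker": "KING HENRY IV",
     "text_entry": "So shaken as we are, so wan with care,"},
]
print(to_bulk_ndjson("shakespeare", docs, "line_id"))
```

The resulting string can be POSTed to localhost:9200/_bulk with Content-Type: application/x-ndjson.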
Importing with client libraries
• Elasticsearch client libraries are available for most languages
• Python has an elasticsearch package
Logstash
Logstash is a data loader
[Figure: Logstash pipelines between data sources (files, S3, Beats, Kafka) and destinations (Elasticsearch, AWS, Hadoop, MongoDB)]

Logstash
Logstash is a data loader
• Server-side data processing pipeline
• It can parse, transform, and filter data
• It can derive structure from unstructured data
• It can anonymize personal data or exclude it entirely
• It can do geo-location lookups
• It can scale across many nodes
• A huge list of filter plugins is available
Installing and configuring Logstash
sudo apt-get install openjdk-8-jre-headless
sudo apt-get update && sudo apt-get install logstash
sudo vi /etc/logstash/conf.d/apache-access.conf
• Import data from the Apache access_log into Elasticsearch
The grok filter plugin can parse log data and program output.
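A minimal sketch of what apache-access.conf might contain (the log file path is an assumption; COMBINEDAPACHELOG is a standard grok pattern shipped with Logstash):

```conf
input {
  file {
    # Assumed location of the Apache access log.
    path => "/home/student/access_log"
    start_position => "beginning"
  }
}
filter {
  grok {
    # Parse each line with the standard combined Apache log pattern.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # Use the log's own timestamp as the event time.
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```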
Running Logstash
cd /usr/share/logstash/
sudo bin/logstash --path.config /etc/logstash/conf.d/apache-access.conf
• Run Logstash as specified in the config file
• Check the indices of elasticsearch for the Logstash created data
curl -XGET "127.0.0.1:9200/_cat/indices?v"
Importing CSV file with Logstash
filter {
  csv {
    separator => ","
    skip_header => "true"
    columns => ["movieId", "title", "genres"]
  }
}
• CSV filter plugin
Kibana
Installing Kibana
sudo apt install kibana
sudo vi /etc/kibana/kibana.yml
(change server.host to 0.0.0.0)
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable kibana.service
sudo /bin/systemctl start kibana.service
• Kibana's port number is 5601
• e.g., http://10.80.34.86:5601