Presented by Ahmed Elbery, Mohammed Farghally Project client Mohammed Magdy Virginia Tech, Blacksburg 5/1/2014 IDEAL Webpages CS6604 Digital Libraries


Page 1

Presented by

Ahmed Elbery, Mohammed Farghally

Project client

Mohammed Magdy

Virginia Tech, Blacksburg

5/1/2014

IDEAL Webpages

CS6604 Digital Libraries

Page 2

Agenda

Project overview

Solr and SolrCloud

Solr for indexing the events

Hadoop

Indexing using Hadoop and SolrCloud

Web Interface

Overall Architecture

Screen Shots

Page 3

Overview

About 10 TB of data crawled from the web is available, covering a variety of events.

The goal is to make this data conveniently accessible and searchable through the web.

The data is ≈ 10 TB of .warc files. Only the HTML files in them are used.
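The "only HTML files" step can be pictured as a filter over the records stored in each .warc file. The Python sketch below is a minimal illustration, assuming the filter looks at each record's HTTP Content-Type header. The function name `is_html` and the sample records are hypothetical; the project's actual extraction code is not part of this transcript.

```python
def is_html(content_type):
    """Return True if an HTTP Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


# Hypothetical (URL, Content-Type) pairs standing in for WARC response records.
records = [
    ("http://example.org/index.html", "text/html; charset=UTF-8"),
    ("http://example.org/logo.png", "image/png"),
    ("http://example.org/feed", "application/rss+xml"),
    ("http://example.org/page", "TEXT/HTML"),
]

html_urls = [url for url, ctype in records if is_html(ctype)]
print(html_urls)
```

Filtering on the declared media type rather than on the file extension also keeps pages whose URLs carry no ".html" suffix.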

Page 4

Big picture

[Diagram: Crawled Data → Hadoop → Index → Solr]

Page 5

Solr

Solr is an open source enterprise search server based on the Lucene Java search library.

Solr can be integrated with, among others: PHP, Java, Python, JSON

[Diagram: a client sends a Request to the Solr Server and receives a Reply]

Page 6

SolrCloud

What will happen if the server becomes full?

[Diagram: one Solr Server with Request and Reply; six Solr Servers grouped into Shard 1 and Shard 2]

Page 7

What is SolrCloud?

[Diagram: SolrCloud with Shard 1 and Shard 2, each holding one Leader and two Replicas]

Shard & Replicate

Scalability

Fault Tolerance and Throughput
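The shard-and-replicate idea can be sketched in a few lines of Python. This is only an illustration: SolrCloud's own router assigns hash ranges to shards and differs in detail, while the sketch uses a plain hash modulo the number of shards. The node names and the helper functions `shard_for`, `route`, and `nodes_able_to_serve` are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index for a document id using a stable hash."""
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() for str.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(doc_ids, num_shards):
    """Group document ids by the shard they are assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Two shards, each with one leader and two replicas, as in the slide's diagram.
cluster = {
    0: {"leader": "node1", "replicas": ["node2", "node3"]},
    1: {"leader": "node4", "replicas": ["node5", "node6"]},
}


def nodes_able_to_serve(shard, down=()):
    """Return the nodes holding a copy of the shard that are still up."""
    entry = cluster[shard]
    return [n for n in [entry["leader"], *entry["replicas"]] if n not in down]


print(route([f"doc{i}" for i in range(6)], num_shards=2))
print(nodes_able_to_serve(0, down={"node1"}))
```

Splitting the index across shards is what lets it outgrow a single server. Keeping several copies of each shard means queries can still be answered when one node is down, and can be spread across the copies.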

Page 8

Schema

schema.xml is usually the first file we configure when setting up a new Solr installation.

The schema declares:

what kinds of fields there are

which field should be used as the unique/primary key

which fields are required

how to index and search each field

Page 9

Schema (Cont.)

Page 10

SolrCloud control

solrctl instancedir --generate $HOME/solr_configs

solrctl instancedir --create collection1 $HOME/solr_configs

solrctl collection --create collection1 -s numOfShards

http://128.173.49.32:8983/solr/#/~cloud

Page 11

Event Fields

We use the following fields:

category: the event category or type

name: the event name

title: the file name

content: the file content

URL: the file path on the HDFS system

id: document ID

text: copy of the previous fields
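As an illustration of how these fields fit together, the Python sketch below assembles one document with the field names listed above. The helper `build_event_doc`, the sample values, and the choice of using the HDFS path as the default id are assumptions. In a Solr setup the catch-all `text` field is commonly filled on the server by copyField rules in schema.xml; the sketch builds it by hand only to show what ends up in it, and which fields are copied is also an assumption.

```python
def build_event_doc(category, name, title, content, url, doc_id=None):
    """Assemble one document using the field names from the slide."""
    doc = {
        "category": category,   # the event category or type
        "name": name,           # the event name
        "title": title,         # the file name
        "content": content,     # the file content
        "URL": url,             # the file path on HDFS
        # Assumption: fall back to the HDFS path when no id is supplied.
        "id": doc_id if doc_id is not None else url,
    }
    # Assumption: the catch-all field copies the four descriptive fields.
    doc["text"] = " ".join([category, name, title, content])
    return doc


# Hypothetical sample values.
doc = build_event_doc(
    category="Hurricane",
    name="Hurricane Sandy",
    title="page1.html",
    content="Storm surge floods the coast",
    url="/events/hurricane_sandy/page1.html",
)
print(sorted(doc))
```

A single catch-all field lets a plain keyword query match any of the descriptive fields without naming them in the query.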

Page 12

Hadoop

What is Hadoop?

Features:

Scalable

Economical

Efficient

Reliable

Uses two main services:

HDFS

MapReduce

Page 13

HDFS Architecture

http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Page 14

Some Terminology

Job – A “full program”: an execution of a Mapper and Reducer across a data set

Task – An execution of a Mapper or a Reducer on a slice of data

Task Attempt – A particular instance of an attempt to execute a task on a machine

Page 15

MapReduce Overview

[Diagram: the User Program forks a Master and several Workers. The Master assigns map and reduce tasks. Map workers read the Input Data splits (Split 0, Split 1, Split 2) and write intermediate results locally; reduce workers read them remotely, sort them, and write Output File 0 and Output File 1.]

Page 16

MapReduce in Hadoop (1)

Page 17

MapReduce in Hadoop (2)

Page 18

MapReduce in Hadoop (3)

Page 19

Job Configuration Parameters

On Cloudera:

/user/lib/hadoop-*-mapreduce/conf/mapred-site.xm

Page 20

Map to Reduce

Combiners

Often a map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k (e.g., popular words in Word Count).

Network time can be saved by pre-aggregating at the mapper:

combine(k1, list(v1)) → v2

The combiner is usually the same as the reduce function.

Partition Function

For reduce, we need to ensure that records with the same intermediate key end up at the same worker.

The system uses a default partition function, e.g., hash(key) mod R.
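The combiner and the partition function can be illustrated with a small in-memory Word Count in Python. This is a sketch of the idea, not Hadoop's API: the function names, the value of `R`, and the use of `zlib.crc32` as the hash are assumptions chosen to keep the example deterministic.

```python
import zlib
from collections import defaultdict

R = 2  # number of reduce workers (hypothetical)


def map_words(text):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in text.lower().split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output per key before it is sent."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers=R):
    """Default-style partition function: hash(key) mod R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_counts(key, values):
    """Reduce: sum all partial counts for one key."""
    return key, sum(values)


def run(splits):
    """Run map, combine, partition, and reduce; also count pairs sent to reducers."""
    buckets = [defaultdict(list) for _ in range(R)]
    sent = 0
    for split in splits:
        for key, value in combine(map_words(split)):
            buckets[partition(key)][key].append(value)
            sent += 1
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            word, total = reduce_counts(key, values)
            result[word] = total
    return result, sent


counts, sent = run(["the cat and the hat", "the end"])
print(counts, sent)
```

With these two splits the mappers emit 7 pairs in total, but only 6 cross to the reducers because the first mapper's two ("the", 1) pairs are merged locally. Since every occurrence of a key hashes to the same bucket, each word is summed by exactly one reducer.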

Page 21

IndexerDriver.java

Page 22

IndexerMapper
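The Java code shown on the two indexer slides is not part of this transcript. As a rough idea of what the map step of such an indexer might do, the Python sketch below turns one HTML file into a document with the title, content, URL, and id fields described under Event Fields. Everything here is an assumption: the function `map_page`, the class `VisibleText`, the sample path, and the decision to strip tags and skip script and style contents.

```python
import posixpath
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.parts.append(text)


def map_page(hdfs_path, html):
    """Map one (path, HTML) pair to a dict of index fields."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return {
        "id": hdfs_path,                          # assumption: path doubles as id
        "title": posixpath.basename(hdfs_path),   # the file name
        "content": " ".join(parser.parts),        # visible text only
        "URL": hdfs_path,                         # the file path on HDFS
    }


page = (
    "<html><head><style>p{}</style></head><body><h1>Flood</h1>"
    "<p>Water rose <b>fast</b></p><script>var x = 1;</script></body></html>"
)
doc = map_page("/events/flood/page1.html", page)
print(doc["title"], "|", doc["content"])
```

In a MapReduce job each map task would run a step like this over its own slice of the .html files, so the indexing work is spread across the cluster.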

Page 23

Solr REST API

Solr is accessible through HTTP requests by using Solr’s REST API.

It is hard to create complex queries this way.

Results are returned as strings, which require some form of parsing.

http://preston.dlib.vt.edu:8983/solr/collection1/select?q=Sisters&wt=json&indent=true&hl=true&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C%2Fem%3E
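To make the two points above concrete, the Python sketch below rebuilds the request URL from its parameters and parses a JSON reply by hand. No network call is made. The helper names `select_url` and `titles` are hypothetical, and the sample reply is made up; it only follows the general shape of a Solr JSON response (a `response` object containing a `docs` list).

```python
import json
from urllib.parse import urlencode

BASE = "http://preston.dlib.vt.edu:8983/solr/collection1/select"


def select_url(query, base=BASE):
    """Build a select URL with JSON output and <em> highlighting markers."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("indent", "true"),
        ("hl", "true"),
        ("hl.simple.pre", "<em>"),
        ("hl.simple.post", "</em>"),
    ]
    # urlencode percent-encodes the angle brackets and the slash.
    return base + "?" + urlencode(params)


def titles(response_text):
    """Parse a JSON reply string and pull out the title of each document."""
    data = json.loads(response_text)
    return [doc.get("title") for doc in data["response"]["docs"]]


# Made-up reply used only to exercise the parsing step.
sample = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "1", "title": "sisters.html"}]}}'
)

url = select_url("Sisters")
print(url)
print(titles(sample))
```

Even for this simple query the caller has to know the parameter names, encode the values, and walk the decoded reply, which is the overhead a client library is meant to hide.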

Page 24

Solr REST API (Cont.)

Page 25

Solarium

Solarium is a PHP client for Solr to allow easy communication between PHP programs and the Solr server containing the indexed data.

Solarium provides an object-oriented interface to Solr, which makes it easier for developers to use than Solr’s REST API.

Current version 3.2.0.

Page 26

Why Solarium

Solarium makes it easier to create queries.

It offers an object-oriented interface rather than a URL-based REST interface.
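The difference between an object-oriented interface and a hand-written URL can be shown with a toy query object in Python. This is not Solarium's API, which is PHP; the class `SelectQuery` and its methods are hypothetical and only mimic the general idea of building a query through method calls and letting the object produce the encoded request.

```python
from urllib.parse import urlencode


class SelectQuery:
    """Toy query object: collects parameters and encodes them on demand."""

    def __init__(self, text="*:*"):
        self._params = {"q": text, "wt": "json"}

    def set_rows(self, rows):
        """Limit the number of returned documents."""
        self._params["rows"] = str(rows)
        return self

    def add_filter(self, field, value):
        """Add a filter query (fq) restricting one field to a value."""
        self._params.setdefault("fq", []).append(f'{field}:"{value}"')
        return self

    def to_query_string(self):
        # doseq=True emits one fq=... pair per filter in the list.
        return urlencode(self._params, doseq=True)


query = SelectQuery("Sisters").set_rows(5).add_filter("category", "storm")
print(query.to_query_string())
```

The caller states intent (rows, filters) and never touches percent-encoding or parameter names, so adding another filter is one more method call rather than another round of string editing.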

Page 27

Why Solarium (Cont.)

Solarium makes it easier to get results.

Rather than JSON or XML strings that must be parsed, results are returned as PHP associative arrays.

Page 28

Interface Architecture

[Diagram: the Web Interface (HTML) sends search requests (AJAX) to the Server (PHP). The server uses Solarium to send a Query to the Solr Server, which answers from its Index with a Response (JSON or XML). Solarium hands the response to PHP as an associative array, and the Results go back to the web interface. The PHP server also queries a MySQL DB for Events Information.]

Page 29

Overall Architecture

[Diagram, search path: the Web Interface sends search requests (AJAX) to the PHP Module, which uses Solarium to send a Query to the Solr Server. Solr answers from its Index with a Response (JSON or XML), Solarium hands it to PHP as an associative array, and the Results go back to the web interface. The PHP Module also queries a MySQL DB for Events Information.]

[Diagram, indexing path: WARC Files → Extraction/Filtering Module → .html Files → Hadoop Uploader Module → Indexer Module (Map/Reduce) → Solr Index]

Page 30

Screen Shots

Page 31

Screen Shots (Cont.)

Page 32

Screen Shots (Cont.)

Page 33

Screen Shots (Cont.)

Page 34

Screen Shots (Cont.)

Page 35

Screen Shots (Cont.)

Page 36

Mohammed Farghally

&

Ahmed Elbery