faceted browsing: analysis and implementation of a big data solution using apache solr

Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache

Paolo MalavoltaUniversity of Modena and Reggio Emilia

April 15, 2014

Advisor:Prof. Sonia BergamaschiCo-Advisor:Prof. H.V. JagadishDott. Ing. Francesco Guerra

Big Data visualization

90% of the data in the world today was created in the last 2 years alone. -IBM-

50% of our brains are

involved invisual processing

70% of all of our sensory

receptorsare in our eyes

Why should we be interested in visualization? Because the human visual system is a pattern seeker of enormous power and subtlety. The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated, which is the reason why the words “understanding” and “seeing” are synonymous.

-Colin Ware, 2000-

Apache Solr

Solr (pronounced “solar”) is the most popular and open source enterprise search platform from the Apache Lucene project. Its major features include: Full-text search, Dynamic clustering, Near Real-time indexing, Traditional DB, rich document (e.g., Word, PDF, jpeg) handling, NoSQL DB support, HTTP Restfull requests.

Big Data solution

Jetty features: Tomcat features:• Full-featured and standards-based.• Embeddable and Asynchronous.• Open source and commercially usable.• Dual licensed under Apache and Eclipse.• Flexible and extensible, Enterprise

scalable. Low maintenance cost.• Small and Efficient.

• Famous open source under Apache.• Easier to embed Tomcat in your

applications.• Implements the Servlet 3.0, JSP 2.2 and

JSP-EL 2.2 support.• Strong and widely commercially usable

and use.• Easy integrated with other application

such as Spring.• Flexible and extensible, Enterprise

scalable.• Faster JSP parsing.• Stable.

Tomcat

Big Data solution

Solr Tomcat Apache ZooKeeper:

Sharding Replication

Big Data solution

Solr Tomcat Apache ZooKeeper:

Sharding Replication

Hadoop Distributed File System (HDFS)

Big Data solutionover

Apache Solr

Faceted Browsing

Work Solr over Big Data DONE

Big Data visualization over Solr NOT YET!

WE NEED IT

Faceted Browsing

Faceted Browsing over Solr AND between facets OR within facet

Faceted Browsing

Faceted Browsing concerns: Checkbox Scale Efficiency

CAPCOLOR (79)[x] Brown (23)[x] Gray (45)[ ] Red (11)CAPSURFACE (18)[ ] Scaly (7)[ ] Smooth (11)ODOR (183) [ ] None (34)[x] Pungent (80)[ ] Spicy (65)[ ] Anise (4)

Faceted Browsing

How to make Faceted Browsing over Solr: Analysis and learning of a multi-faceted query Modified Solr front-end Added javascript

TAG MEANING

/solr/collection1/browse? the url that show the Solr browsing web page

&q= tag that identify a main query

&fq= tag that identify a subquery

{!tag=dt}CapColor:brown tag used to select a specific faced field and its facet value

%22OR%22 tag needed when are chosen more facet value of the same facet field

facet=on tag to enable the faceted browsing

&facet.field= tag to choose a facet field when faceted browsing is enabled

{!ex=dt}Bruises tag to select a specific facet field to show when faceted browsing is enabled

http://localhost:8081/solr/collection1/browse? &q=&fq={!tag=dt}CapColor:brown%22OR%22{!tag=dt}CapColor:gray &q=&fq={!tag=dt}Odor:pungent &facet=on &facet.field={!ex=dt}CapColor &facet.field={!ex=dt}CapSurface &facet.field={!ex=dt}Odor

Bayesian Networks

Issues: Huge volume of facet User search experience

How: Bayesian networks Open Markov

Open source GUI available API available

Bayesian Networks

How to: Added javascript Added Servlet Open Markov API modified

Bayesian Networks

Aware User Search

Dynamic Summary

Keywords Search

Keywords Search over Relational Database: Very popular method Allow non-expert user to formulate queries Not needed to how the data is represented inside DB

Current techniques: A-priori instance analysis Construct an index Retrieve information from index

Experimented approach: Learning Bayesian Network from

DB Nodes represents columns Retrieve information from BN

Keywords Search

How it works:

Learning Bayesian NetworkAutomatic Hill Climbing algorithm

Compute probabilities:Variable Elimination algorithm

Print results:Quick Sort algorithm

Keywords Search

Test: Portion of a real DB 2-way path Full outer join:

First 1000 rows

50 runs

2 EVIDENCE 3 EVIDENCE

TOP 1 TOP 3 TOP 1 TOP 3

6/13=0.461 10/13 = 0.769 6/12 = 0.5 9/12 = 0.692

8/13 = 0.615 10/13 = 0.769 8/12 = 0.666 9/12 = 0.692

... ... ... ...

10/13 = 0.769 12/13=0.923 9/12 = 0.692 11/12=0.916

8/13 = 0.615 10/13 = 0.769 7/12=0.583 9/12 = 0.692

MEAN OF ALL THE RESULTS

0.615 0.7228 0.633 0.7832

Execution time: ~ 3.0 sec.

Conclusions and Future work

Conclusions: I have provided over Apache Solr:

Big Data solution Faceted browsing Integrated and ready-to-use solution Aware user search using Bayesian Networks Dynamic summary using Bayesian

Networks

I have provided over Keywords Search:

A different approach for keyword search over DB

A software that allow user to formulate queries

Future work: For Solr I can:

Use Machine Learning algorithm to automate the learning of the facets

For Keywords Search I can: Learn the BN with more Database rows for better

results

Thank for your time.

faceted browsing: analysis and implementation of a big data solution using apache solr

Documents

potential of freely faceted classification for knowledge...

navigating tomorrow's web: from searching and browsing to...

dh11: browsing highly interconnected humanities databases...

browsing the pds image archive with the imaging atlas and...

myocean central information system€¦ · (ogc/csw)...

apache solr + ajax solr

introduction · web view2020. 8. 18. · solr - its major...

ncsu libraries endeca and faceted browsing: giving the user...

tagme! { enhancing social tagging with spatial context · -...

faceted metadata in image search & browsing using words to...

benchmarking faceted browsing capabilities of triple stores

creating a new faceted browsing function for millennium...

faceted metadata for image search and...

fast (faceted application of subject terminology): a...

faceted browsing jan polowinski lehrstuhl für...

1 creating a new faceted browsing function for millennium...

apache solr cookbook - the-eye.euapache solr cookbook iii 4...

supporting multiple paths to objects in information ... ·...

widgets for faceted browsing jan polowinski institute for...

faceted browsing for acl anthology