Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq


Upload: lucidworks

Post on 15-Jul-2015


Arabic Content with Apache Solr Ramzi Alqrainy

Ramzi Alqrainy

•  MSc in Computer Science, University of Jordan, Amman, Jordan

•  Senior Enterprise Search / Data Engineer @ OpenSooq.com

•  Technical Reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”

•  Co-founder of Solr.ar group

•  Built 8 search engines for different models in the last 2 years

•  Active blogger and presenter on Information Retrieval

Agenda

•  Why is Arabic Language Important ?

•  Arabic Language is Complex

•  How we use Apache Solr @ OpenSooq ?

•  Localization Concept with SolrCloud

•  Ranking and Relevancy

•  Apache Solr Implementations @ OpenSooq

Why is Arabic Language Important ?

[Figure: sample Arabic document without dots]

[Figure: sample Arabic document with dots]

•  The Arabic Language is ranked as the fourth top language on the web

•  The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013

Arabic Language is Complex

•  Arabic Orthography and Print

§  Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.

•  Arabic Diacritics

§  Diacritics help disambiguate the meaning of words.

§  For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.

•  Arabic Morphology

§  Arabic words are divided into three main types: nouns, verbs, and particles.

§  Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.

•  Arabic Dialects

§  There are 6 dominant dialects, with many more variations of them, and dozens more less-spoken dialects.

§  E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.

•  Arabizi (Transliteration)

§  Arabic is sometimes written using Latin characters in transliterated form.

§  Arabizi uses numerals to represent Arabic letters.

§  E.g., “2” and “3” represent the letters أ (which sounds like “a” as in “apple”) and ع (E, a guttural “aa”), respectively.
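The numeral convention can be sketched in a few lines of Python. This is only an illustration: the mapping holds just the two numerals named on the slide, the function name is hypothetical, and a real Arabizi converter would also have to transliterate the Latin letters.

```python
# Only the two numerals mentioned on the slide; real Arabizi uses more.
ARABIZI_NUMERALS = {
    "2": "\u0623",  # أ (alef with hamza above)
    "3": "\u0639",  # ع (ain)
}


def expand_arabizi_numerals(text):
    """Replace Arabizi numerals with the Arabic letters they stand for.

    Latin letters are left untouched; this is not a full transliterator.
    """
    return "".join(ARABIZI_NUMERALS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(expand_arabizi_numerals("3ilm"))
```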

How we use Apache Solr @ OpenSooq ?

•  A leading classified ads website in the Middle East and North Africa.

•  Right now: more than 7K concurrent users on average.

•  Activities per second: 240 APS, including:

§  Adding/editing/deleting posts

§  Adding comments

§  Sending messages to buyers/sellers, etc.

•  More than 40K hits on Apache Solr per minute.


•  Arabic Search Engine

Arabic Normalization

•  There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, as a command) would turn to أدرس.

•  Arabic content would be normalized according to the following steps:

§  Remove punctuation

§  Remove diacritics (primarily weak vowels)

§  Remove non-letters

§  Replace آ, إ, and أ with ا (A, alef) when they are the first letter of a word

§  Replace final ى with ي (Ya)

§  Replace final ة with ه (Ha)
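The normalization steps can be sketched in Python as follows. This is a minimal sketch, not the code used at OpenSooq. It makes three assumptions: “non-letters” means anything outside the basic Arabic letter block (so Latin text and digits are dropped too), the alef variants replaced at the start of a word are آ, إ, and أ, and the function name is hypothetical.

```python
import re

# Tanween, short vowels, shadda, sukun, dagger alef, and tatweel.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Anything that is not a basic Arabic letter or whitespace (assumption:
# the text is Arabic-only, so Latin letters and digits are dropped too).
NON_LETTERS = re.compile("[^\u0621-\u064A\\s]")

INITIAL_ALEF_VARIANTS = "\u0622\u0625\u0623"  # آ إ أ
ALEF = "\u0627"         # ا
ALEF_MAKSURA = "\u0649"  # ى
YA = "\u064A"            # ي
TA_MARBUTA = "\u0629"    # ة
HA = "\u0647"            # ه


def normalize_arabic(text):
    """Apply the normalization steps listed above to a string."""
    text = DIACRITICS.sub("", text)    # remove diacritics
    text = NON_LETTERS.sub(" ", text)  # remove punctuation and non-letters
    words = []
    for word in text.split():
        if word[0] in INITIAL_ALEF_VARIANTS:
            word = ALEF + word[1:]     # unify the initial alef
        if word.endswith(ALEF_MAKSURA):
            word = word[:-1] + YA      # final ى becomes ي
        elif word.endswith(TA_MARBUTA):
            word = word[:-1] + HA      # final ة becomes ه
        words.append(word)
    return " ".join(words)
```

Diacritics are deleted outright, while punctuation is replaced by a space so that adjacent words do not get glued together.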

Arabic Light Stemmer

•  A light stemmer is not dictionary driven.

•  This algorithm follows a rule-based mechanism that strips common prefixes and suffixes.

•  The light stemmer, light10, outperformed the other approaches and is becoming widely used in Arabic information retrieval.
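To make the rule-based idea concrete, here is a simplified Python sketch in the spirit of light10. The affix lists follow the commonly published description of light10, and the length checks are an assumption. It strips at most one prefix and one suffix, so it is not equivalent to the stemmer shipped with Lucene/Solr.

```python
# Affix lists as commonly described for light10 (assumption; longest first).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word):
    """Strip at most one prefix and one suffix, with no dictionary lookup."""
    for prefix in PREFIXES:
        # Keep at least two letters; be stricter for the one-letter prefix.
        min_len = 4 if len(prefix) == 1 else len(prefix) + 2
        if word.startswith(prefix) and len(word) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) >= len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["والكتاب", "المدرسه", "ولد"]:
        print(w, "->", light_stem(w))
```

Because no dictionary is consulted, the stemmer can over-strip or under-strip some words, which is why it is usually paired with a list of protected words.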

•  Sometimes a stemmer might not do what you want out of the box.

•  A protected-words list protects words from being modified by stemmers.

Stop words and Synonyms

•  Removing stop words is important to ensure high performance and improve recall: https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt

•  Matching strings of tokens and replacing them with other strings of tokens can improve precision and recall.
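A toy Python sketch of both ideas follows. The stop-word set is a three-word illustrative sample, not the contents of the linked file. The synonym table reuses the dialect forms of “I want” from earlier in the talk, after normalization, and maps them to one arbitrarily chosen canonical form. For brevity it maps single tokens only, whereas Solr synonym rules can also match multi-token sequences.

```python
# Illustrative sample only: في (in), من (from), على (on).
STOP_WORDS = {"في", "من", "على"}

# Dialect variants of "I want", all mapped to one form (arbitrary choice).
SYNONYMS = {
    "عاوز": "بدي",  # Egyptian
    "ابغي": "بدي",  # Gulf, after normalization
    "ابي": "بدي",   # Iraqi, after normalization
}


def filter_tokens(tokens):
    """Drop stop words, then map each remaining token to its canonical form."""
    result = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        result.append(SYNONYMS.get(token, token))
    return result


if __name__ == "__main__":
    print(filter_tokens(["عاوز", "سياره", "في", "عمان"]))
```

With such a table, a query written in one dialect can match ads written in another.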

Apache Solr Schema.xml •  A text field that is appropriate for Arabic
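The field type shown on the slide did not survive in the transcript. As a stand-in, the sketch below builds an Arabic text field type as a Solr Schema API `add-field-type` payload in Python. The analyzer chain mirrors the `text_ar` type from Solr's example schema, with a protected-words filter added before the stemmer. The file names `lang/stopwords_ar.txt` and `protwords.txt` are assumptions, and this is not necessarily the configuration used at OpenSooq. The same factories can be listed as `<filter>` elements inside a `<fieldType>` in schema.xml.

```python
import json

# Assumed analyzer chain, modeled on Solr's stock "text_ar" field type.
arabic_field_type = {
    "add-field-type": {
        "name": "text_ar",
        "class": "solr.TextField",
        "positionIncrementGap": "100",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                {
                    "class": "solr.StopFilterFactory",
                    "ignoreCase": "true",
                    "words": "lang/stopwords_ar.txt",  # assumed file name
                },
                {"class": "solr.ArabicNormalizationFilterFactory"},
                # Must come before the stemmer so protected words are skipped.
                {
                    "class": "solr.KeywordMarkerFilterFactory",
                    "protected": "protwords.txt",  # assumed file name
                },
                {"class": "solr.ArabicStemFilterFactory"},
            ],
        },
    }
}

# JSON body that could be POSTed to a collection's /schema endpoint.
payload = json.dumps(arabic_field_type, indent=2)

if __name__ == "__main__":
    print(payload)
```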

Localization Concept with SolrCloud

Ranking and Relevancy: Boost documents by age

•  Just do a descending sort by age = done?

•  Boost more recent documents and penalize older documents just for being old

•  Recency Boosting

bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5

Tune Solr Recip Function
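Solr's `recip(x,m,a,b)` function computes `a / (m*x + b)`. The Python sketch below evaluates it for a few document ages using the constants from the recency boost on the previous slide (m = 3.16e-11, a = 0.08, b = 0.05). The helper names are hypothetical, and the year length of 365.25 days is an assumption made only to pick sample ages.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.156e10 ms


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=3.16e-11, a=0.08, b=0.05):
    """Boost for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    for years in (0, 0.5, 1, 2, 5):
        boost = recency_boost(years * MS_PER_YEAR)
        print(f"age {years} years -> boost {boost:.4f}")
```

Since m is roughly 1 divided by the number of milliseconds in a year, `m*x` is approximately the document's age in years. A brand-new document gets a/b = 1.6, and the boost decays smoothly as the document ages. Raising a, or lowering b, strengthens the preference for fresh documents.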

Solr Implementations @ OpenSooq ?

§  Anti Spam

§  Checking Relevancy

§  Tags Generations

§  Recommendation System

Thank You

@RamziAlqrainy

https://github.com/Ramzi-Alqrainy

http://solr-enterprise-search-server.blogspot.com/