Arabic Content with Apache Solr


Upload: ramzi-alqrainy

Post on 24-Jun-2015


Category:

Engineering



DESCRIPTION

The Arabic language poses several challenges for Natural Language Processing (NLP), largely because it has a very rich and sophisticated morphological system. This talk covers some of these challenges and how to address them with Solr, and presents the challenges that were handled in OpenSooq’s use case.

TRANSCRIPT

Page 1: Arabic Content with Apache Solr
Page 2: Arabic Content with Apache Solr

Arabic Content with Apache Solr Ramzi Alqrainy

Page 3: Arabic Content with Apache Solr

Ramzi Alqrainy

•  MSc in Computer Science, University of Jordan, Amman, Jordan

•  Senior Enterprise Search / Data Engineer @ OpenSooq.com

•  Technical Reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”

•  Co-founder of Solr.ar group

•  Built 8 search engines for different models in the last 2 years

•  Active blogger and presenter on Information Retrieval

Page 4: Arabic Content with Apache Solr

Agenda

•  Why is Arabic Language Important ?

•  Arabic Language is Complex

•  How we use Apache Solr @ OpenSooq ?

•  Localization Concept with SolrCloud

•  Ranking and Relevancy

•  Apache Solr Implementations @ OpenSooq

Page 5: Arabic Content with Apache Solr

Why is Arabic Language Important ?

Page 6: Arabic Content with Apache Solr

Why is Arabic Language Important ?

Sample Arabic document without dots

Page 7: Arabic Content with Apache Solr

Why is Arabic Language Important ?

Sample Arabic document with dots

Page 8: Arabic Content with Apache Solr

Why is Arabic Language Important ?

•  The Arabic Language is ranked as the fourth top language on the web

•  The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013

Page 9: Arabic Content with Apache Solr

Arabic Language is Complex

•  Arabic Orthography and Print

§  Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.

•  Arabic Diacritics

§  Diacritics help disambiguate the meaning of words.

§  For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.

Page 10: Arabic Content with Apache Solr

Arabic Language is Complex

•  Arabic Morphology

§  Arabic words are divided into three main types: nouns, verbs, and particles.

§  Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.

Page 11: Arabic Content with Apache Solr

Arabic Language is Complex

•  Arabic Dialects

§  There are 6 dominant dialects, with many more variations of them, and dozens more less-spoken dialects.

§  E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.

•  Arabizi (Transliteration)

§  Arabic is sometimes written using Latin characters in transliterated form.

§  Arabizi uses numerals to represent Arabic letters.

§  E.g., “2” and “3” represent the letters أ (which sounds like “a” as in “apple”) and ع (E, a guttural “aa”), respectively.
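As a minimal sketch of the numeral convention described for Arabizi, the snippet below swaps the two digits named on the slide for their Arabic letters. The function name and the idea of applying it character by character are illustrative assumptions; a real transliterator would also need to handle the Latin letters and many more digit mappings.

```python
# Only the two mappings mentioned in the talk are included;
# the table and function names are hypothetical.
ARABIZI_DIGITS = {
    "2": "\u0623",  # أ (alef with hamza above)
    "3": "\u0639",  # ع (ain)
}


def replace_arabizi_digits(text: str) -> str:
    """Replace Arabizi numerals with the Arabic letters they stand for."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Latin letters are left untouched; only the digits change.
    print(replace_arabizi_digits("3arabi"))
```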

Page 12: Arabic Content with Apache Solr

How we use Apache Solr @ OpenSooq ?

•  A leading classified ads website in the Middle East and North Africa.

•  Right now: more than 7K concurrent users on average.

•  Activity-Per-Second: 240 APS.

§  Adding/Editing/Deleting posts

§  Adding comments

§  Sending messages to buyers/sellers, etc.

•  More than 40K hits on Apache Solr per minute.

Page 13: Arabic Content with Apache Solr

How we use Apache Solr @ OpenSooq ?

•  Arabic Search Engine

Page 14: Arabic Content with Apache Solr

Arabic Normalization

•  There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, in command form) would turn into أدرس.

•  Arabic content would be normalized according to the following steps:

§  Remove punctuation

§  Remove diacritics (primarily weak vowels)

§  Remove non-letters

§  Replace ا, إ, and أ with ا (A, alef) at the first letter of each word

§  Replace final ى with ي (Ya)

§  Replace final ة with ه (Ha)
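The normalization steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used at OpenSooq: the function names are made up, diacritics are taken to be the range U+064B to U+0652, and آ is treated as an alef variant alongside the forms listed on the slide.

```python
import re
import unicodedata

# Assumption: "diacritics" means the harakat range fathatan..sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")
# أ and إ come from the slide; آ is added here as an assumption.
ALEF_VARIANTS = "\u0623\u0625\u0622"
ALEF, ALEF_MAKSURA, YA = "\u0627", "\u0649", "\u064A"
TA_MARBUTA, HA = "\u0629", "\u0647"


def normalize_word(word: str) -> str:
    """Apply the slide's normalization steps to a single word."""
    word = DIACRITICS.sub("", word)
    # Drop punctuation, digits and other non-letters (Unicode category L* only).
    word = "".join(ch for ch in word if unicodedata.category(ch).startswith("L"))
    if not word:
        return word
    if word[0] in ALEF_VARIANTS:          # unify the leading alef
        word = ALEF + word[1:]
    if word[-1] == ALEF_MAKSURA:          # final ى -> ي
        word = word[:-1] + YA
    elif word[-1] == TA_MARBUTA:          # final ة -> ه
        word = word[:-1] + HA
    return word


def normalize(text: str) -> str:
    """Normalize whitespace-separated text, dropping tokens that become empty."""
    words = (normalize_word(token) for token in text.split())
    return " ".join(w for w in words if w)


if __name__ == "__main__":
    print(normalize("\u0623\u062F\u0631\u0633"))  # أدرس with hamza
```

With this sketch, both spellings of the verb from the example (with and without the hamza) end up as the same indexed form, which is the point of normalizing before indexing and querying.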

Page 15: Arabic Content with Apache Solr

Arabic Light Stemmer

•  A light stemmer is not dictionary driven.

•  This algorithm follows a rule-based prefix-removal mechanism.
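To make the rule-based idea concrete, here is a small prefix-stripping sketch. The prefix list is the one commonly cited for Arabic light stemming and is an assumption here, as are the minimum-length thresholds and the function name; it is not the Solr implementation, and it covers only the prefix side described on the slide.

```python
# Assumed prefix list, ordered longest first so that "وال" is tried before "و".
PREFIXES = [
    "\u0648\u0627\u0644",  # وال
    "\u0628\u0627\u0644",  # بال
    "\u0643\u0627\u0644",  # كال
    "\u0641\u0627\u0644",  # فال
    "\u0627\u0644",        # ال
    "\u0644\u0644",        # لل
    "\u0648",              # و
]


def strip_prefix(word: str) -> str:
    """Remove at most one known prefix, with no dictionary lookup."""
    for prefix in PREFIXES:
        # Assumption: keep at least 3 letters after a one-letter prefix,
        # and at least 2 letters after longer prefixes.
        min_rest = 3 if len(prefix) == 1 else 2
        if word.startswith(prefix) and len(word) - len(prefix) >= min_rest:
            return word[len(prefix):]
    return word


if __name__ == "__main__":
    print(strip_prefix("\u0627\u0644\u0643\u062A\u0627\u0628"))  # الكتاب
```

The length guard is what keeps short words such as ولد from losing a letter that is part of the word rather than a prefix.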

Page 16: Arabic Content with Apache Solr

Arabic Light Stemmer

•  The light stemmer, light10, outperformed the other approaches. It is becoming widely used in Arabic information retrieval.

Page 17: Arabic Content with Apache Solr

Arabic Light Stemmer

•  Sometimes a stemmer might not do what you want out of the box.

•  A protected-words list keeps selected words from being modified by stemmers.

Stop words and Synonyms

•  Removing stop words is important to ensure high performance and improve recall: https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt

•  Matching strings of tokens and replacing them with other strings of tokens will improve precision and recall.
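A rough sketch of the two token-level steps: dropping stop words and mapping a token to a preferred synonym. The three stop words and the single synonym pair are illustrative only (they are not taken from the linked stop-word file), the names are hypothetical, and the sketch handles single tokens rather than multi-token strings.

```python
# Illustrative data only; a real deployment would load full lists from files.
STOP_WORDS = {
    "\u0641\u064A",        # في (in)
    "\u0645\u0646",        # من (from)
    "\u0639\u0644\u0649",  # على (on)
}
SYNONYMS = {
    # عربية -> سيارة (both used for "car"); an assumed example pair
    "\u0639\u0631\u0628\u064A\u0629": "\u0633\u064A\u0627\u0631\u0629",
}


def filter_tokens(tokens):
    """Drop stop words, then replace each remaining token by its synonym, if any."""
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [SYNONYMS.get(token, token) for token in kept]


if __name__ == "__main__":
    # عربية في عمان
    print(filter_tokens(["\u0639\u0631\u0628\u064A\u0629", "\u0641\u064A", "\u0639\u0645\u0627\u0646"]))
```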

Page 18: Arabic Content with Apache Solr

Apache Solr Schema.xml

•  A text field that is appropriate for Arabic
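The field definition shown on this slide is not part of the transcript. As a stand-in, the sketch below builds a Schema API `add-field-type` payload whose analyzer chain mirrors the stock `text_ar` field type from Solr's sample configuration (standard tokenizer, lowercase, stop words, Arabic normalization, Arabic stemming). The function name and the default stop-word path are assumptions, and the Schema API applies to a managed schema rather than a hand-edited schema.xml.

```python
import json


def arabic_field_type(name="text_ar", stopwords_path="lang/stopwords_ar.txt"):
    """Build a Schema API payload for an Arabic text field type (assumed layout)."""
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    {
                        "class": "solr.StopFilterFactory",
                        "ignoreCase": "true",
                        "words": stopwords_path,
                    },
                    {"class": "solr.ArabicNormalizationFilterFactory"},
                    {"class": "solr.ArabicStemFilterFactory"},
                ],
            },
        }
    }


if __name__ == "__main__":
    # The JSON could be POSTed to a collection's /schema endpoint.
    print(json.dumps(arabic_field_type(), indent=2))
```

Normalization is placed before stemming so that the stemmer sees a single spelling of each word.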

Page 19: Arabic Content with Apache Solr

Localization Concept with SolrCloud

Page 20: Arabic Content with Apache Solr

Ranking and Relevancy: Boost documents by age

•  Just do a descending sort by age = done?

•  Boost more recent documents and penalize older documents just for being old

•  Recency Boosting

bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
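Solr's `recip(x,m,a,b)` function evaluates to `a / (m*x + b)`, and 3.16e-11 is roughly one divided by the number of milliseconds in a year. The sketch below reproduces that arithmetic for the constants on the slide so the decay can be inspected for different ages. The helper names are hypothetical, and the `^5` weight on the boost function is not modeled.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_days: float, m: float = 3.16e-11,
                  a: float = 0.08, b: float = 0.05) -> float:
    """Boost for a document that is age_days old, using the slide's constants."""
    return recip(age_days * MS_PER_DAY, m, a, b)


if __name__ == "__main__":
    for days in (0, 7, 30, 365):
        print(days, round(recency_boost(days), 4))
```

Raising `a` lifts the whole curve, while `b` sets how high the boost is for brand-new documents (a/b at age zero) and how quickly it falls off in the first weeks.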

Page 21: Arabic Content with Apache Solr

Tune Solr Recip Function

Page 22: Arabic Content with Apache Solr

Solr Implementations @ OpenSooq ?

§  Anti Spam

§  Checking Relevancy

§  Tag Generation

§  Recommendation System

Page 23: Arabic Content with Apache Solr

Thank You

@RamziAlqrainy

https://github.com/Ramzi-Alqrainy

http://solr-enterprise-search-server.blogspot.com/