Arabic Content with Apache Solr
DESCRIPTION
The Arabic language poses several challenges for Natural Language Processing (NLP), largely because it has a very rich and sophisticated morphological system. This talk covers some of these challenges and how to solve them with Solr, and presents the challenges that were handled in OpenSooq’s use case.
TRANSCRIPT
Arabic Content with Apache Solr Ramzi Alqrainy
Ramzi Alqrainy
• MSc in Computer Science, University of Jordan, Amman, Jordan
• Senior Enterprise Search / Data Engineer @ OpenSooq.com
• Technical Reviewer for the books “Scaling Apache Solr” and “Apache Solr Search Patterns”
• Co-founder of Solr.ar group
• Built 8 search engines for different models in the last 2 years
• Active blogger and presenter about Information Retrieval
Agenda
• Why is Arabic Language Important ?
• Arabic Language is Complex
• How we use Apache Solr @ OpenSooq ?
• Localization Concept with SolrCloud
• Ranking and Relevancy
• Apache Solr Implementations @ OpenSooq
Why is Arabic Language Important ?
Sample Arabic document without dots
Sample Arabic document with dots
• The Arabic Language is ranked as the fourth top language on the web
• The number of Arab Internet users grew from 65 million in 2011 to 135 million in 2013
Arabic Language is Complex
• Arabic Orthography and Print
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words.
• Arabic Diacritics
§ Diacritics help disambiguate the meaning of words.
§ For example, the two words Alam (عَلَم, meaning “flag”) and Eilm (عِلم, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
• Arabic Morphology
§ Arabic words are divided into three main types: nouns, verbs, and particles.
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
• Arabic Dialects
§ There are 6 dominant dialects, with many more variations of them and dozens of less widely spoken dialects.
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine.
• Arabizi (Transliteration)
§ Arabic is sometimes written using Latin characters in transliterated form.
§ Arabizi uses numerals to represent Arabic letters.
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (a guttural “aa”), respectively.
How we use Apache Solr @ OpenSooq ?
• A leading classified ads website in the Middle East and North Africa.
• Right now: average > 7K concurrent users.
• Activity-Per-Second: 240 APS.
§ Adding/Editing/Deleting Post
§ Adding Comments
§ Sending Message to Buyer/Seller, etc.
• More than 40k hits on Apache Solr per minute.
• Arabic Search Engine
Arabic Normalization
• There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, as a command) would turn into أدرس.
• Arabic content is normalized according to the following steps:
§ Remove punctuation
§ Remove diacritics (primarily weak vowels)
§ Remove non-letters
§ Replace آ, إ, and أ with ا (A, alef) when they are the first letter of a word
§ Replace final ى with ي (Ya)
§ Replace final ة with ه (Ha)
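The normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used at OpenSooq or Solr's own normalization filter. The function name `normalize_arabic` and the exact set of marks stripped are assumptions made for the example.

```python
import re

# Assumption: "diacritics" here means tatweel (U+0640), the tashkeel marks
# U+064B-U+0652, and the superscript alef (U+0670).
MARKS = re.compile("[\u0640\u064B-\u0652\u0670]")

ALEF_VARIANTS = "\u0622\u0623\u0625"  # alef with madda, hamza above, hamza below
ALEF = "\u0627"                        # bare alef
ALEF_MAKSURA, YA = "\u0649", "\u064A"
TA_MARBUTA, HA = "\u0629", "\u0647"


def normalize_arabic(text):
    """Apply the normalization steps listed on the slide to a string."""
    # Remove diacritics.
    text = MARKS.sub("", text)
    # Punctuation and other non-letters become spaces so words stay separated.
    text = "".join(ch if ch.isalpha() else " " for ch in text)
    words = []
    for word in text.split():
        # Unify a word-initial alef variant to bare alef.
        if word[0] in ALEF_VARIANTS:
            word = ALEF + word[1:]
        # Unify the word-final letter.
        if word[-1] == ALEF_MAKSURA:
            word = word[:-1] + YA
        elif word[-1] == TA_MARBUTA:
            word = word[:-1] + HA
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    # The hamza spelling of "study" collapses to the bare-alef spelling.
    print(normalize_arabic("\u0623\u062F\u0631\u0633"))
```

Applying the same function at index time and at query time is what lets both spellings of a word match the same indexed term.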
Arabic Light Stemmer
• A light stemmer is not dictionary driven.
• This algorithm follows a rule-based prefix-removal mechanism.
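A rule-based prefix-removal step of this kind might look like the sketch below. It is a simplified illustration, not the light10 algorithm itself or Solr's implementation. The prefix list, the minimum-length thresholds, and the name `strip_prefix` are assumptions chosen for the example.

```python
# Illustrative prefix list (an assumption), longest prefixes first so that
# "wal" is tried before "al" and before the single-letter "w".
PREFIXES = [
    "\u0648\u0627\u0644",  # wal  (and + the)
    "\u0628\u0627\u0644",  # bal  (with + the)
    "\u0643\u0627\u0644",  # kal  (like + the)
    "\u0641\u0627\u0644",  # fal  (so + the)
    "\u0627\u0644",        # al   (the)
    "\u0644\u0644",        # ll   (for + the)
    "\u0648",              # w    (and)
]


def strip_prefix(word):
    """Remove at most one known prefix, using rules only (no dictionary)."""
    for prefix in PREFIXES:
        # Assumed guard: keep at least 2 letters after a multi-letter prefix
        # and at least 3 after a single-letter one, so short words survive.
        min_rest = 3 if len(prefix) == 1 else 2
        if word.startswith(prefix) and len(word) - len(prefix) >= min_rest:
            return word[len(prefix):]
    return word


if __name__ == "__main__":
    # "alktab" (the book) and "walktab" (and the book) reduce to the same stem.
    print(strip_prefix("\u0627\u0644\u0643\u062A\u0627\u0628"))
    print(strip_prefix("\u0648\u0627\u0644\u0643\u062A\u0627\u0628"))
```

The length guard is what keeps a short word that merely begins with one of these letters from being truncated.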
• The light stemmer, light10, outperformed the other approaches. It is becoming widely used in Arabic information retrieval.
• Sometimes a stemmer might not do what you want out of the box.
• A protected-words list protects words from being modified by stemmers.
Stop words and Synonyms
• Removing stop words is important to ensure high performance and improve recall.
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt
• Matching strings of tokens and replacing them with other strings of tokens will improve precision and recall.
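The two ideas above, dropping stop words and replacing matched token sequences with other token sequences, can be sketched as follows. The stop-word set and the synonym table are hypothetical examples, not the contents of the linked file or of any OpenSooq configuration. The function names are likewise illustrative.

```python
# Hypothetical stop words: "in", "from", "on".
STOP_WORDS = {"في", "من", "على"}

# Hypothetical synonym rules: a tuple of tokens maps to replacement tokens.
# Two dialect forms of "I want" map to one canonical token, and a two-token
# phrase ("mobile phone") maps to a single token.
SYNONYMS = {
    ("عاوز",): ["اريد"],
    ("بدي",): ["اريد"],
    ("هاتف", "محمول"): ["موبايل"],
}
MAX_RULE_LEN = max(len(key) for key in SYNONYMS)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def apply_synonyms(tokens):
    """Greedily replace the longest matching token sequence at each position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_RULE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.extend(SYNONYMS[key])
                i += n
                break
        else:
            # No rule matched here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """Whitespace-tokenize, remove stop words, then apply synonym rules."""
    return apply_synonyms(remove_stop_words(text.split()))


if __name__ == "__main__":
    print(analyze("بدي هاتف محمول من عمان"))
```

Trying the longest rule first means a multi-token phrase is replaced as a unit before any single-token rule can match one of its parts.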
Apache Solr Schema.xml
• A text field that is appropriate for Arabic
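The field definition shown on the original slide is not preserved in this transcript. As a hedged sketch, the Python below builds a JSON payload for Solr's Schema API (`add-field-type`) that chains the analysis steps discussed in the talk: stop words, Arabic normalization, protected words, and the Arabic light stemmer. The factory class names are Solr's standard ones. The field type name `text_ar_custom` and the resource file names are assumptions.

```python
import json


def arabic_field_type(name="text_ar_custom",
                      stopwords="stopwords-ar.txt",
                      protwords="protwords.txt"):
    """Build an add-field-type command for an Arabic text field.

    The name and file names are placeholders chosen for this example.
    """
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    {"class": "solr.StopFilterFactory",
                     "ignoreCase": "true", "words": stopwords},
                    {"class": "solr.ArabicNormalizationFilterFactory"},
                    # Must come before the stemmer so that marked words
                    # are left unstemmed.
                    {"class": "solr.KeywordMarkerFilterFactory",
                     "protected": protwords},
                    {"class": "solr.ArabicStemFilterFactory"},
                ],
            },
        }
    }


if __name__ == "__main__":
    # The payload would be POSTed to /solr/<collection>/schema.
    print(json.dumps(arabic_field_type(), indent=2, ensure_ascii=False))
```

Filter order matters here: the keyword marker has to run before the stemmer, otherwise the protected-words list has no effect.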
Localization Concept with SolrCloud
Ranking and Relevancy: Boost documents by age
• Just do a descending sort by age = done?
• Boost more recent documents and penalize older documents just for being old
• Recency Boosting
bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
Tune Solr Recip Function
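Solr's `recip(x,m,a,b)` function computes a / (m*x + b). With x as the document age in milliseconds and m = 3.16e-11 (roughly 1 divided by the number of milliseconds in a year), m*x is approximately the age in years. The sketch below evaluates the curve for the constants in the boost function above, which is a quick way to see how changing a and b affects it. The helper names and sample ages are chosen for illustration. The trailing `^5` weight from the query is left out.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms, m=3.16e-11, a=0.08, b=0.05):
    """Boost value for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    # Print the boost for a few sample ages to see how quickly it decays.
    for days in (0, 7, 30, 365, 3650):
        print(days, round(recency_boost(days * MS_PER_DAY), 4))
```

Raising a scales the whole curve, while lowering b makes the boost for very recent documents steeper relative to older ones.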
Solr Implementations @ OpenSooq ?
§ Anti Spam
§ Checking Relevancy
§ Tag Generation
§ Recommendation System
Thank You
@RamziAlqrainy
https://github.com/Ramzi-Alqrainy
http://solr-enterprise-search-server.blogspot.com/