Work at TACOLA Lab
Team Members
T.V.Geetha, Ranjani Parthasarathi, Madhan Karky, E.UmaMaheswari, J.Balaji, Subalalitha, Elanchezhiyan.K,
Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani
Dr.T.V.Geetha, Anna University 2
Tamil Language Processing
• Tamil Language Processing: Morphological analyser (Normal Words, Compound Words, Colloquial Words); Parser (Simple, Complex and Compound Sentences); Semantic analysis based on UNL
• Language Technology: Blog Mining; Ontology Based Information Extraction; Personalized Search; Parallelization for NLP Processing; Emotion detection from text
• Carnatic Music Processing: Raga Modelling; Singer, Genre Identification; Music Emotion Recognition
• Tamil Language Oriented Tools: Dictionary; Text Compaction
• UNL Based Work: UNL for semantic representation; Nested UNL; Concept based Search; Bi-lingual Search; Event Processing; Discourse Analysis; Summarization; Question answering; Thirukural Search
• Lyric Oriented Processing: Lyric Mining; Lyrics for Tunes; Pleasantness
Papers for TIC 2011
Tamil Language Oriented Tools
• Agaraadhi: A Novel Online Dictionary Framework
• An Efficient Tamil Text Compaction System (Surukkupai)
• Kuralagam, A Concept Relation Based Search Framework for Thirukural
• Popularity Based Scoring Model for Tamil Word Games
Tamil Language Processing
• Template based Multilingual Summary Generation
• On Emotion detection from Tamil Text
• Tamil Summary Generation for Cricket Match
Lyric Oriented Processing
• Lyric Mining: Word, Rhyme & Concept Co-occurrence Analysis
• Special Indices for LaaLaLaa Lyric Analysis & Generation Framework
AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK
Elanchezhiyan.K
Karthikeyan.S
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky
OBJECTIVES
• Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meanings, analysis and related information.
• The framework incorporates various unique features designed to give the user additional information about the queried word.
INTRODUCTION
INTRODUCTION CONT…
AGARAADHI FRAMEWORK CONT…
AGARAADHI FEATURES
• Morphological Analyser: gives the morphological features of the query word, such as root word, part of speech, gender, tense and number. For example, if the query word is padithaan, the analyser returns padi as the root and reports that the word is in the male gender and the past tense.
• Morphological Generator: the Tamil morphological generator handles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs. It is used to generate the possible morphological variations of the query word.
• Spell Checker: checks the spelling of Tamil words and provides alternative suggestions for wrongly spelt words. If the root word is not in the dictionary, it generates all possible suggestions with minimum variation from the given word.
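The spell checker's fallback of suggesting words with minimum variation can be sketched as follows. This is only an illustration under stated assumptions: "minimum variation" is taken to mean the smallest Levenshtein edit distance, and the transliterated lexicon and the names `edit_distance` and `suggest` are hypothetical, not part of Agaraadhi.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(word: str, lexicon: list[str]) -> list[str]:
    """Return every lexicon entry at the smallest edit distance from `word`."""
    distances = {entry: edit_distance(word, entry) for entry in lexicon}
    best = min(distances.values())
    return sorted(entry for entry, d in distances.items() if d == best)


# Hypothetical transliterated lexicon, for illustration only.
LEXICON = ["padi", "pidi", "mazhai", "pookkal"]
print(suggest("padu", LEXICON))
```

A production checker for Tamil would more likely compare grapheme clusters than raw characters, since one written Tamil letter can span several code points.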
AGARAADHI FEATURES
• Word Suggestions: gives the list of equivalent or related words for the given query word.
• Word Pleasantness Score: the score generator indicates how easy the word is to pronounce.
• Word Popularity Score: shows the word's usage on the web, based on the frequency distribution of the word across popular blogs, news articles, social networks, etc.
• Word Usage Statistics: shows the usage of the word in social networks over the past week.
• Word Usage in Literature: finds the usage of words in popular literature such as Thirukural, Bharathiyar Padalgal, Avvai songs and the lyrics of Tamil movie songs.
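One way a popularity score could be derived from a frequency distribution across sources is sketched below. The slides do not give the actual formula, so the averaging of per-source relative frequencies, the function name `popularity_score`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def popularity_score(word: str, corpora: dict[str, list[str]]) -> float:
    """Average relative frequency of `word` over all source types."""
    shares = []
    for tokens in corpora.values():
        counts = Counter(tokens)
        shares.append(counts[word] / len(tokens) if tokens else 0.0)
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical tokenised samples, one list per source type.
CORPORA = {
    "blogs": ["mazhai", "pookkal", "mazhai", "padi"],
    "news": ["mazhai", "padi"],
    "social": ["pookkal", "padi", "padi", "padi"],
}
print(popularity_score("mazhai", CORPORA))
```

Averaging per source keeps one very large source from dominating the score; a weighted scheme would be an equally plausible choice.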
AGARAADHI FEATURES
• Word of the Day: a rare word is randomly chosen and displayed on the opening page so that users can learn a new word every day.
• Number to Text Converter: converts a number to its Tamil word equivalent as well as to English text. For example, in Tamil we represent oru Arpputham (அற்புதம்) for 100 million, Kumbam (கும்பம்) for 10 billion, and finally up to Anniyam (அந்நியம்) for one zillion.
• Picture Dictionary: pictures, photos or line drawings depicting popular words have been included in the dictionary so that children can learn efficiently with this tool.
RESULTS
Query word: pookkal (பூக்கள்)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7
Query word: mazhai (மழை)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4
Query word: fruit
http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en
FUTURE WORK
Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open up a good platform for the many researchers and developers working in the Tamil computing area.
REFERENCES
1. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
2. Anandan, R. Parthasarathi, and Geetha, Morphological Generator for Tamil. Tamil Inayam, Malaysia, 2001.
3. J. Jai Hari Raju, P. IndhuReka, Dr. Madhan Karky, Statistical Analysis and Visualization of Tamil Usage in Live Text Streams, Tamil Internet Conference, Coimbatore, 2010.
An Efficient Tamil Text Compaction System (Surukkupai)
N.M.Revathi
G.P.Shanthi
Elanchezhiyan.K
T V Geetha
Ranjani Parthasarathi
Madhan Karky
OBJECTIVES
Why Compacting?
• Message length is limited on blog sites, and mobile phones have a tiny user interface.
• Compaction saves online storage space and hence reduces cost.
• The paper proposes a text compaction system for Tamil, the first of its kind.
Idea of compaction
• There is no specific rule for obtaining the shortest word; the main aim is that it can still be understood.
• A compact form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.
FRAMEWORK ARCHITECTURE
FRAMEWORK CONT..
Input Processing
• The morphological analyser removes the suffix (if present) added to the word and delivers the root word (RW).
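The input-processing step can be pictured with a toy suffix stripper. This is a minimal sketch, not the lab's analyser: the suffix table, the transliterated forms and the function name `analyse` are hypothetical, and a real Tamil analyser segments tense and person-number-gender markers separately.

```python
# Hypothetical suffix table: suffix -> grammatical features it signals.
SUFFIXES = {
    "thaan": {"tense": "past", "gender": "male"},
    "thaal": {"tense": "past", "gender": "female"},
}


def analyse(word: str) -> tuple[str, dict[str, str]]:
    """Strip the longest known suffix and return (root word, features)."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], SUFFIXES[suffix]
    return word, {}  # no known suffix: treat the word as the root itself


print(analyse("padithaan"))
```

The stripped suffix has to be remembered, because the output stage re-attaches a suitable suffix to the compact word.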
FRAMEWORK CONT..
Identification of the category & extraction of the compact word
• Words fall into three categories: common Tamil words, abbreviations/acronyms, and numbers.
• A word is identified as an abbreviation/acronym by comparing it with the keys of the hashmap.
• With the help of the hash key and a mapping algorithm, the compact word is retrieved.
• Otherwise the word is either a common Tamil word or a number. For numbers, a numerical analyser performs text-to-number conversion.
Output Processing
• The Tamil Morphological Generator tool adds the suitable suffix so that the output follows the rules of the language.
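The category check and lookup described above can be sketched as a small pipeline. Every table here is a hypothetical stand-in for the real resources, the function name `compact` is invented for illustration, and plain concatenation stands in for the morphological generator.

```python
# All tables below are hypothetical stand-ins for the real resources.
ABBREVIATIONS = {"anna palkalaikazhagam": "AU"}   # full form -> compact form
COMMON_WORDS = {"vanakkam": "vnkm"}               # root word -> compact form
NUMBER_WORDS = {"ondru": 1, "irandu": 2, "moondru": 3}


def compact(root: str, suffix: str = "") -> str:
    """Classify the root word, fetch its compact form, then re-attach the suffix."""
    if root in ABBREVIATIONS:            # abbreviation/acronym: hashmap key match
        short = ABBREVIATIONS[root]
    elif root in NUMBER_WORDS:           # number written as text -> digits
        short = str(NUMBER_WORDS[root])
    else:                                # common Tamil word (unchanged if unknown)
        short = COMMON_WORDS.get(root, root)
    # Stand-in for the morphological generator: plain concatenation.
    return short + suffix


print(compact("irandu"), compact("vanakkam"))
```

Checking the abbreviation map first mirrors the order given on the slide, where a word that matches no hashmap key falls through to the other two categories.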
RESULT AND ANALYSIS
Tested with over 10,000 words.
The final result is reduced to 40% of the original text.
REFERENCES
1. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
2. L. M. Fung, SMS Short Form Identification and Codec. Unpublished master's thesis, National University of Singapore, Singapore, 2005.
3. L. S. Larkey, P. Ogilvie, M. A. Price, and B. Tamilio, Acrophile: a system that automatically searches acronym expansion pairs, 2000.
4. Robert E. Beasley, Short Message Service (SMS) Texting Symbols: A Functional Analysis of 10,000 Cellular Phone Text Messages. Franklin College.
Kuralagam: Concept Relation based Search Engine for Thirukkural
Elanchezhiyan.K
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky
Objectives
• Kuralagam is a conceptual search framework for Thirukkural, based on the UNL framework.
• Searching with keywords in kurals and interpretations
• Concept based search based on CoReX (conceptual indexing based on UNL)
• Bilingual search: English and Tamil
• Showing relationships between the concepts
Kuralagam Framework
Offline Processing
• Web Crawler: a Thirukkural statistics crawler crawls news and blog documents to find the usage of each individual Thirukkural. The usage is recorded to measure the popularity score of each Thirukkural.
• Enconversion: based on UNL
• Indexing: based on the CoReX framework
UNL & Enconversion
• UNL is an intermediate language that processes knowledge across language barriers.
• It captures semantics by converting the natural language terms present in the document to concepts.
• Concepts are connected to other concepts through UNL relations. There are 46 UNL relations, such as plf (place from), plt (place to), tmf (time from), tmt (time to), etc.
• The process of converting a natural language text to a UNL graph is known as enconversion; the reverse process is known as deconversion.
An Example speaks more...
Ex: John was playing in the garden
Concepts: john(iof>person), garden(icl>place), play(icl>action)
Relations: agt(play(icl>action), john(iof>person)); plc(play(icl>action), garden(icl>place))
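The example graph can be written down as a list of labelled edges. This is one possible in-memory representation chosen for illustration; the `Relation` type and the `concepts` helper are hypothetical names, not part of any UNL toolkit.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    label: str    # UNL relation, e.g. "agt" (agent) or "plc" (place)
    source: str   # concept the relation starts from
    target: str   # concept the relation points to


# UNL graph for "John was playing in the garden".
GRAPH = [
    Relation("agt", "play(icl>action)", "john(iof>person)"),
    Relation("plc", "play(icl>action)", "garden(icl>place)"),
]


def concepts(graph: list[Relation]) -> set[str]:
    """Collect every concept that appears as a node in the graph."""
    return {node for rel in graph for node in (rel.source, rel.target)}


print(sorted(concepts(GRAPH)))
```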
Indexer
• The Kuralagam Indexer is designed based on CoReX techniques.
• The Indexer stores and manages the UNL graphs in two different indices: a Concept only index (C index) and a Concept-Relation-Concept index (CRC index).
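The two indices can be pictured as inverted maps from concepts, and from concept-relation-concept triples, to document ids. The sketch below assumes that simple layout; the actual CoReX storage format is not described in the slides, and `build_indices` and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: document id -> list of (concept, relation, concept) triples.
DOCS = {
    1: [("play", "agt", "john"), ("play", "plc", "garden")],
    2: [("play", "plc", "garden")],
}


def build_indices(docs):
    """Build a concept-only (C) index and a concept-relation-concept (CRC) index."""
    c_index = defaultdict(set)     # concept -> ids of documents containing it
    crc_index = defaultdict(set)   # (concept, relation, concept) -> document ids
    for doc_id, triples in docs.items():
        for source, relation, target in triples:
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
            crc_index[(source, relation, target)].add(doc_id)
    return c_index, crc_index


c_index, crc_index = build_indices(DOCS)
print(sorted(c_index["john"]), sorted(crc_index[("play", "plc", "garden")]))
```

A CRC match is more specific than a C match because both concepts and the relation between them have to agree.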
Online Processing
Query Translation and Expansion
• Converts the user query to a UNL graph.
• Uses the CRC (Concept Relation Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list that populate the multi-list data structure.
Search and Ranking
• Fetches the Thirukkural number and its details.
• Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
• The query concept is expanded using related CRC indices pointing to the query concept. This helps in retrieving many Thirukkurals conceptually related to the query, which is not possible with keyword based Thirukkural search engines.
• The ranking is based on: priority to the indices in the order CRC > C; usage score; frequency of occurrence of the query concept.
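The three ranking criteria can be combined in several ways. The sketch below assumes a strict lexicographic order (index priority first, then usage score, then frequency), which the slides do not confirm; the `Hit` record, the `rank` function and the sample hits are hypothetical.

```python
from dataclasses import dataclass

INDEX_PRIORITY = {"CRC": 2, "C": 1}   # CRC matches outrank C matches


@dataclass
class Hit:
    kural_no: int      # Thirukkural number
    index: str         # index that produced the hit: "CRC" or "C"
    usage: float       # usage score recorded by the crawler
    frequency: int     # occurrences of the query concept


def rank(hits: list[Hit]) -> list[int]:
    """Order hits by index priority, then usage score, then frequency."""
    ordered = sorted(
        hits,
        key=lambda h: (INDEX_PRIORITY[h.index], h.usage, h.frequency),
        reverse=True,
    )
    return [h.kural_no for h in ordered]


# Hypothetical hits for one query.
HITS = [Hit(10, "C", 0.9, 3), Hit(391, "CRC", 0.4, 1), Hit(400, "CRC", 0.7, 2)]
print(rank(HITS))
```

With this ordering a C-index hit never overtakes a CRC hit, however high its usage score.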
Tab Layout
Performance Evaluation
• The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
• Concept based search and keyword based search were compared using the Average Precision methodology.
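For reference, average precision and mean average precision can be computed as below. This follows the standard information-retrieval definitions rather than any formula shown in the slides, and the kural ids in the example are made up.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical result list in which k1 and k3 are the relevant kurals.
print(average_precision(["k1", "k2", "k3"], {"k1", "k3"}))
```

Comparing the two search modes then amounts to computing these values once for the concept based result lists and once for the keyword based ones over the same queries.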
Average Precision
References
1. Subalalitha, T V Geetha, Ranjani Parthasarathy and Madhan Karky Vairamuthu. CoReX: A Concept Based Semantic Indexing Technique. In SWM-08. 2008. India.
2. UNL Center, UNDL Foundation. The Universal Networking Language (UNL) Specifications, Version 3, Edition 3. December 2004.
3. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
4. T.Dhanabalan, K.Saravanan, and T.V.Geetha. 2002. Tamil to UNL Enconverter, ICUKL, Goa, India.
5. A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006. Seattle, Washington, USA.
Template Based MultiLingual Summary Generation
Subalalitha C.N
E.Umamaheswari
T V Geetha
Ranjani Parthasarathi
Madhan Karky
Aim
To generate a multilingual summary based on the Universal Networking Language (UNL) framework
The Architecture
Multi Lingual Summary Generation using UNL
Template based Information Extraction
• Seven tourism specific templates have been designed and used
• Templates are filled using the semantic information inherent in UNL input graphs
• Template information is language independent and can be used with any desired language.
Example Templates for Tourism Domain
Template | Semantics inherited from UNL
God | iof>god, iof>goddess, icl>god
Food | icl>food, icl>fruit
Flora and Fauna | icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility | icl>facility
Transport facility | icl>transport
Place | icl>place, iof>place, iof>city, iof>country
Distance | icl>unit, icl>number
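Filling the templates from UNL concepts can be sketched as a lookup on the semantic constraint attached to each concept. The constraint sets below are taken from the table above; the function name `fill_templates`, the concept string format and the sample concepts are assumptions made for illustration.

```python
# Template name -> UNL semantic constraints, taken from the table above.
TEMPLATES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def fill_templates(concepts: list[str]) -> dict[str, list[str]]:
    """Assign each UNL concept such as 'chennai(iof>city)' to matching templates."""
    filled: dict[str, list[str]] = {}
    for concept in concepts:
        headword, _, rest = concept.partition("(")
        constraint = rest.rstrip(")")
        for name, constraints in TEMPLATES.items():
            if constraint in constraints:
                filled.setdefault(name, []).append(headword)
    return filled


# Hypothetical concepts from an enconverted tourism document.
print(fill_templates(["chennai(iof>city)", "mango(icl>fruit)", "walk(icl>action)"]))
```

Concepts whose constraint matches no template, such as the action in the example, are simply left out of the filled templates.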
Summary Generation
• The template information is converted to the target language using the respective UNL-target language dictionaries.
• UNL-target language dictionaries contain root words.
• The natural language term is obtained from the root word using target language information such as case suffixes and language technology tools such as the morphological generator (சென்னை + இல் = சென்னையில்).
• When this converted template information is fitted into target language specific dynamic sentence patterns, a summary is generated.
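The generation step can be illustrated with a toy dictionary lookup, a locative case suffix and one sentence pattern. Everything here is a simplified assumption: the dictionary entries, the pattern, and the two sandhi rules in `add_locative` are hypothetical and cover far less than a real morphological generator.

```python
# Hypothetical UNL-to-Tamil dictionary (root words only) and sentence pattern.
UNL_TO_TAMIL = {"chennai": "சென்னை", "temple": "கோயில்"}
PATTERN = "{place_loc} {attraction} உள்ளது."   # "There is <attraction> in <place>."


def add_locative(root: str) -> str:
    """Toy generator for the locative case suffix இல் (two sandhi rules only)."""
    if root.endswith("ை"):       # final vowel sign ai: insert the glide ய்
        return root + "யில்"
    if root.endswith("்"):       # final consonant: the vowel of இல் merges with it
        return root[:-1] + "ில்"
    raise ValueError(f"no sandhi rule in this sketch for: {root}")


def generate_sentence(place: str, attraction: str) -> str:
    """Fill the sentence pattern with target-language forms of the template values."""
    return PATTERN.format(
        place_loc=add_locative(UNL_TO_TAMIL[place]),
        attraction=UNL_TO_TAMIL[attraction],
    )


print(generate_sentence("chennai", "temple"))
```

Because the template values are language independent, the same filled template could be rendered in another language by swapping the dictionary, the suffix rules and the pattern.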
Performance Evaluation
• Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
• The performance of the proposed methodology has been evaluated using human judgement.
• The generated summaries achieved an accuracy of 90%.
Further Enhancements
• Query specific summary
• Comparing the performance with human generated summaries.
References
[1] Elanchezhiyan K, T V Geetha, Ranjani Parthasarathi & Madhan Karky, CoRe: Concept Based Query Expansion, Tamil Internet Conference, Coimbatore, 2010.
[2] Alkesh Patel, Tanveer Siddiqui, U. S. Tiwary, "A language independent approach to multilingual text summarization", Conference RIAO2007, Pittsburgh PA, U.S.A., May 30-June 1, 2007.
[3] David Kirk Evans, "Identifying Similarity in Text: Multi-Lingual Analysis for Summarization", Doctor of Philosophy thesis, Graduate School of Arts and Sciences, Columbia University, 2005.
[4] Radev, Allison, Blair-Goldensohn et al (2004), MEAD - a platform for multidocument multilingual text summarization.
[5] The Universal Networking Language (UNL) Specifications Version 3 Edition 3, UNL Center, UNDL Foundation, December 2004.
[6] Jagadeesh J, Prasad Pingali, Vasudeva Varma, "Sentence Extraction Based Single Document Summarization", Workshop on Document Summarization, March 2005, IIIT Allahabad.
[7] Naresh Kumar Nagwani, Dr. Shrish Verma, "A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm", International Journal of Computer Applications (0975-8887), Volume 17, No. 2, March 2011.
[8] Prof. R. Nedunchelian, "Centroid Based Summarization of Multiple Documents Implemented using Timestamps", First International Conference on Emerging Trends in Engineering and Technology, IEEE 2008.