
Work at TACOLA Lab

Team Members

T.V.Geetha, Ranjani Parthasarathi, Madhan Karky, E.UmaMaheswari, J.Balaji, Subalalitha, Elanchezhiyan.K, Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani

Dr. T.V.Geetha, Anna University

Tamil Language Processing

Tamil Language Processing
• Morphological analyser: Normal Words, Compound Words, Colloquial Words
• Parser: Simple, Complex and Compound Sentences
• Semantic analysis based on UNL

Language Technology
• Blog Mining
• Ontology Based Information Extraction
• Personalized Search
• Parallelization for NLP Processing
• Emotion detection from text

Carnatic Music Processing
• Raga Modelling
• Singer, Genre Identification
• Music Emotion Recognition

Tamil Language Oriented Tools
• Dictionary
• Text Compaction

UNL Based Work
• UNL for semantic representation
• Nested UNL
• Concept based Search
• Bi-lingual Search
• Event Processing
• Discourse Analysis
• Summarization
• Question answering
• Thirukural Search

Lyric Oriented Processing
• Lyric Mining
• Lyrics for Tunes
• Pleasantness


Papers for TIC 2011

Tamil Language Oriented Tools
• Agaraadhi: A Novel Online Dictionary Framework
• An Efficient Tamil Text Compaction System (Surukkupai)
• Kuralagam, A Concept Relation Based Search Framework for Thirukural
• Popularity Based Scoring Model for Tamil Word Games

Tamil Language Processing
• Template based Multilingual Summary Generation
• On Emotion detection from Tamil Text
• Tamil Summary Generation for Cricket Match

Lyric Oriented Processing
• Lyric Mining: Word, Rhyme & Concept Co-occurrence Analysis
• Special Indices for LaaLaLaa Lyric Analysis & Generation Framework


AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK

Elanchezhiyan.K

Karthikeyan.S

T.V.Geetha

Ranjani Parthasarathi

Madhan Karky

tvg

In general - I think there can be stress on the new features not presented in the last Tamil conference. A slide on applications can be added. A slide on future plans can be added.


OBJECTIVES

Agaraadhi, a dictionary framework for indexing and retrieving Tamil words, their meaning, analysis and related information.

Framework to incorporate various unique features - designed to provide additional information to the user regarding the word that they query about.


INTRODUCTION


INTRODUCTION CONT…


AGARAADHI FRAMEWORK CONT…


AGARAADHI FEATURES

Morphological Analyser
• gives the morphological features of the query word such as root word, part of speech, gender, tense and count.
• If the query word is padithaan, the Morphological Analyser gives padi as the root, reports that the word represents male gender, that it is in past tense, and so on.

Morphological Generator
• The Tamil morphological generator tackles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs.
• The generator is used to generate possible morphological variations of the query word.

Spell Checker
• used to check the spelling of Tamil words and to provide alternative suggestions for wrongly spelt words.
• If the root word is not in the dictionary, it generates all the possible suggestions with minimum variations from the given word.
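One way to picture "suggestions with minimum variations" is to rank dictionary words by their edit distance from the misspelt word. The sketch below is an illustration only: the function names, the toy transliterated word list and the choice of Levenshtein distance are assumptions, not a description of Agaraadhi's actual spell checker.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=1):
    """Return dictionary words within max_distance edits, closest first."""
    scored = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]


# Hypothetical transliterated root words, for illustration only.
toy_dictionary = ["padi", "pada", "nadi", "mazhai"]
suggestions = suggest("padu", toy_dictionary)
```

Raising `max_distance` widens the candidate list; a real Tamil spell checker would work on Tamil script and would also consult the morphological analyser before looking up the root.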


AGARAADHI FEATURES

Word Suggestions
• gives the list of equivalent or related words for the given query word.

Word Pleasantness
• the score generator indicates how easy it is to pronounce the word.

Word Popularity Score
• shows the word usage on the web based on the frequency distribution of the word across popular blogs, news articles, social networks etc.

Word Usage Statistics
• shows the usage of the word in the social network over the past one week.

Word Usage in Literature
• finds the usage of words in popular literature such as Thirukural, Bharathiyar Padalgal, Avvai songs and also lyrics of Tamil movie songs.
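A popularity score based on frequency distribution across sources could, for instance, average the word's relative frequency in each source. This is a rough sketch under that assumption; the slides do not give the actual formula, and the function name and toy token lists are hypothetical.

```python
from collections import Counter


def popularity_score(word, corpora):
    """Average relative frequency of `word` over several token lists."""
    if not corpora:
        return 0.0
    shares = []
    for tokens in corpora.values():
        counts = Counter(tokens)
        shares.append(counts[word] / len(tokens) if tokens else 0.0)
    return sum(shares) / len(shares)


# Hypothetical transliterated tokens standing in for crawled text.
corpora = {
    "blogs": ["mazhai", "pookkal", "mazhai", "kaatru"],
    "news": ["mazhai", "arasu"],
}
score = popularity_score("mazhai", corpora)
```

Averaging per source keeps one very large source from dominating the score; whether the real system weights sources this way is not stated.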


AGARAADHI FEATURES

Word of the Day
• A rare word is randomly chosen and displayed on the opening page to help users learn a new word every day.

Number to Text Converter
• converts a number to its Tamil word equivalent as well as to English text.
• For example in Tamil we represent oru Arpputham (அற்புதம்) for 100 million, Kumbam (கும்பம்) for 10 billion and finally up to Anniyan (அந்நியம்) for one zilli

Picture Dictionary
• Pictures, photos or line drawings depicting popular words have been included in the dictionary to enable efficient learning for children using this tool.
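The large-number naming used by the Number to Text Converter can be pictured as a lookup of named powers of ten. The sketch below uses only the two unit values stated on the slide (Arpputham for 100 million, Kumbam for 10 billion); the function name is hypothetical and the table is deliberately incomplete.

```python
# Unit names taken from the slide (transliterated). The table is deliberately
# incomplete and the function name is hypothetical.
LARGE_UNITS = {
    10**8: "Arpputham",   # 100 million
    10**10: "Kumbam",     # 10 billion
}


def largest_named_unit(n: int):
    """Split n into (count, unit_name, remainder) using the largest unit <= n."""
    for value in sorted(LARGE_UNITS, reverse=True):
        if n >= value:
            count, remainder = divmod(n, value)
            return count, LARGE_UNITS[value], remainder
    return None  # n is smaller than every unit in the table


result = largest_named_unit(250_000_000)
```

A full converter would apply the same split recursively to the count and the remainder and then spell each part out in Tamil words.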


RESULTS

Query word: pookkal (பூக்கள்)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7

Query word: mazhai (மழை)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4

Query word: fruit
http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en


FUTURE WORK

Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open a good platform for researchers and developers working in the Tamil Computing area.


REFERENCE

1. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.

2. Anandan, R. Parthasarathi, and Geetha, Morphological Generator for Tamil. Tamil Inayam, Malaysia, 2001.

3. J. Jai Hari Raju, P. IndhuReka, Dr. Madhan Karky, Statistical Analysis and Visualization of Tamil Usage in Live Text Streams, Tamil Internet Conference, Coimbatore, 2010.


AN EFFICIENT TAMIL TEXT COMPACTION SYSTEM (SURUKKUPAI)

N.M.Revathi

G.P.Shanthi

Elanchezhiyan.K

T V Geetha

Ranjani Parthasarathi

Madhan Karky


OBJECTIVES

Why Compacting?
• limited message length in blog sites and tiny user interface of mobile phones.
• saves online storage space and hence reduces cost.

The paper proposes
• a text compaction system for Tamil, the first of its kind in Tamil.

Idea of compaction
• Getting the shortest word has no specific rule; it is mainly aimed at understanding.
• The short form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.


FRAMEWORK ARCHITECTURE


FRAMEWORK CONT..

Input Processing
• The morphological analyzer removes the suffix (if present) added to the word and delivers the root word (RW).


FRAMEWORK CONT..

Identification of the category & Extraction of compact word
• Three categories of words: common Tamil words, abbreviations/acronyms, numbers.
• Abbreviations/acronyms are identified by comparing the word with the keys of the hashmap.
• With the help of the hash key and a mapping algorithm, the compact word is retrieved.
• Otherwise the word belongs to either the common Tamil words or the numbers.
• If it is a number, the Numerical analyser performs text to number conversion.

Output Processing
• The Tamil Morphological Generator tool adds the suitable suffix to cater to the rules of the language.

tvg

replace with a flowchart??
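The three framework stages (input processing, category identification with compact-word lookup, and output processing) can be sketched as a small pipeline. Everything below is a toy illustration under stated assumptions: the suffix list, the mapping tables, the transliterated words and the function names are hypothetical stand-ins for the Tamil morphological analyser, hashmap, numerical analyser and morphological generator named on the slides.

```python
# All tables below are toy, hypothetical data; the real system uses a Tamil
# morphological analyser/generator and much larger mappings.
SUFFIXES = ["il", "ai"]                      # stand-ins for case suffixes
COMPACT_MAP = {"palkalaikazhagam": "pa.ka"}  # long form -> compact form
NUMBER_WORDS = {"ondru": "1", "irandu": "2"}


def split_suffix(word):
    """Stand-in for the morphological analyser: return (root, suffix)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


def compact_word(word):
    root, suffix = split_suffix(word)
    if root in COMPACT_MAP:            # abbreviation/acronym category
        compact = COMPACT_MAP[root]
    elif root in NUMBER_WORDS:         # number category: text to number
        compact = NUMBER_WORDS[root]
    else:                              # common word: left unchanged here
        compact = root
    # Stand-in for the morphological generator: re-attach the suffix.
    return compact + suffix


def compact_text(text):
    return " ".join(compact_word(w) for w in text.split())


example = compact_text("palkalaikazhagamil irandu maram")
```

In this sketch common words pass through unchanged; the slides indicate that the real system also shortens them by omitting letters or substituting symbols, which is not modelled here.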


RESULT AND ANALYSIS

Tested with over 10,000 words.

The final result is reduced to 40% of the original text.


REFERENCES

Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.

Fung, L. M. (2005). SMS short form identification and codec. Unpublished master’s thesis, National University of Singapore, Singapore.

Acrophile (L. S. Larkey, P. Ogilvie, M. A. Price, B. Tamilio, 2000), a system that automatically searches acronym expansion pairs.

Short Message Service (SMS) Texting Symbols: A Functional Analysis of 10,000 Cellular Phone Text Messages, by Robert E. Beasley, Franklin College.


Kuralagam - Concept Relation based Search Engine for Thirukkural

Elanchezhiyan.K

T.V.Geetha

Ranjani Parthasarathi

Madhan Karky

tvg

Again - the paper does not show the contribution and does not indicate challenges, future work or applications - the PPT suggests everything has been done. Remember TVU will be there - so tell more in future work - or say version 1. In general there are too many sentences in the PPTs - this needs to be avoided and redone.


Objectives
• Kuralagam is a conceptual search framework for Thirukkural - based on the UNL Framework.
• Searching with keywords - in kurals and interpretations
• Concept based search based on CoReX - conceptual indexing based on UNL
• Bilingual search - English and Tamil
• Showing relationships between the concepts.


Kuralagam Framework


Offline Processing

Web Crawler
• A Thirukkural statistics crawler crawls news and blog documents to find the usage of each individual Thirukkural.
• The usage is recorded for measuring the popularity score of each Thirukkural.

Enconversion - based on UNL

Indexing - based on the CoReX Framework


UNL & Enconversion

UNL is an intermediate language
• processes knowledge across language barriers.
• captures semantics by converting natural language terms present in the document to concepts.
• concepts are connected to other concepts through UNL relations - 46 UNL relations: plf (Place From), plt (Place To), tmf (Time From), tmt (Time To) etc.

The process of converting a natural language text to a UNL graph is known as Enconversion; the reverse process is known as Deconversion.


An Example speaks more...

Ex: John was playing in the garden

agt(play(icl>action), john(iof>person))

plc(play(icl>action), garden(icl>place))
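The example graph can be held in code as a list of labelled relations between concepts. This is a minimal sketch; the `Relation` type and `concepts` helper are hypothetical, and the edge direction (from the verb concept to its agent and place) follows the usual UNL convention and is assumed here.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    label: str   # UNL relation, e.g. "agt" (agent) or "plc" (place)
    source: str  # concept the relation starts from
    target: str  # concept the relation points to


# "John was playing in the garden"
graph = [
    Relation("agt", "play(icl>action)", "john(iof>person)"),
    Relation("plc", "play(icl>action)", "garden(icl>place)"),
]


def concepts(graph):
    """All distinct concepts mentioned in the graph, in first-seen order."""
    seen = []
    for rel in graph:
        for node in (rel.source, rel.target):
            if node not in seen:
                seen.append(node)
    return seen


nodes = concepts(graph)
```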


Indexer
• The Kuralagam Indexer is designed based on CoReX techniques.
• The Indexer stores and manages the UNL graphs in two different indices: the Concept only index (C index) and the Concept-Relation-Concept index (CRC index).
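The two indices can be pictured as inverted maps from concepts, and from concept-relation-concept triples, to the documents that contain them. The sketch below is an assumption about the general shape only, not the CoReX implementation; the function name, kural numbers and concepts are hypothetical.

```python
from collections import defaultdict


def build_indices(docs):
    """docs maps a document id to a list of (concept, relation, concept) triples."""
    c_index = defaultdict(set)     # concept -> document ids
    crc_index = defaultdict(set)   # (concept, relation, concept) -> document ids
    for doc_id, triples in docs.items():
        for c1, rel, c2 in triples:
            c_index[c1].add(doc_id)
            c_index[c2].add(doc_id)
            crc_index[(c1, rel, c2)].add(doc_id)
    return c_index, crc_index


# Hypothetical kural numbers and concepts.
docs = {
    1: [("play", "plc", "garden")],
    2: [("play", "agt", "child"), ("sing", "plc", "garden")],
}
c_index, crc_index = build_indices(docs)
```

A lookup in the CRC index is stricter than a lookup in the C index: both documents mention "garden", but only document 1 contains the full triple ("play", "plc", "garden").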


Online Processing

Query Translation and Expansion
• converts the user query to a UNL graph.
• uses CRC (Concept Relation Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list to populate the Multi list Data Structure.

Search and Ranking
• fetches the Thirukkural number and its details.
• Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
• The query concept is expanded using related CRC indices pointing to the query concept.
• This helps in retrieving many Thirukkurals conceptually related to the query - not possible with keyword Thirukkural search engines.
• The ranking is based on: priority to the indices in the order CRC > C; usage score; frequency of occurrence of the query concept.
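The three ranking criteria can be combined in several ways; the slides do not say how. The sketch below assumes a simple lexicographic ordering (index type first, then usage score, then frequency), and the field names and sample hits are hypothetical.

```python
def rank(candidates):
    """Sort hits: CRC matches before C matches, then usage score, then frequency."""
    index_priority = {"CRC": 0, "C": 1}
    return sorted(
        candidates,
        key=lambda hit: (index_priority[hit["index"]], -hit["usage"], -hit["freq"]),
    )


# Hypothetical hits for one query.
hits = [
    {"kural": 10, "index": "C", "usage": 9.0, "freq": 3},
    {"kural": 20, "index": "CRC", "usage": 2.0, "freq": 1},
    {"kural": 30, "index": "CRC", "usage": 5.0, "freq": 1},
]
ranked = [hit["kural"] for hit in rank(hits)]
```

Under this ordering a CRC match always outranks a C match, however popular the C match is; a weighted sum of the three signals would be an equally plausible reading of the slide.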


Tab Layout


Performance Evaluation
• The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
• Concept based search and keyword based search were compared using the Average Precision methodology.
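For reference, average precision and mean average precision can be computed as below. This follows the standard information-retrieval definition; the slides do not spell out which variant was used, so treat it as an assumption, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `[1, 2, 3, 4]` with relevant items `{1, 3}`, precision is 1/1 at rank 1 and 2/3 at rank 3, so the average precision is (1 + 2/3) / 2.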


Average Precision


Reference

1. Subalalitha, T V Geetha, Ranjani Parthasarathy and Madhan Karky Vairamuthu. CoReX: A Concept Based Semantic Indexing Technique. In SWM-08. 2008. India.

2. UNDL Foundation. The Universal Networking Language (UNL) Specifications, Version 3, Edition 3. UNL Center, UNDL Foundation, December 2004.

3. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.

4. T.Dhanabalan, K.Saravanan, and T.V.Geetha. 2002. Tamil to UNL Enconverter, ICUKL, Goa, India.

5. Turpin, A. and F. Scholer. User performance versus precision measures for simple search tasks. In 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006. Seattle, Washington, USA.


Template Based MultiLingual Summary Generation

Subalalitha C.N

E.Umamaheswari

T V Geetha

Ranjani Parthasarathi

Madhan Karky

tvg

Again - the paper does not show the contribution and does not indicate challenges, future work or applications - the PPT suggests everything has been done. Remember TVU will be there - so tell more in future work - or say version 1. In general there are too many sentences in the PPTs - this needs to be avoided and redone.


Aim

To generate a multilingual summary based on the Universal Networking Language (UNL) Framework


The Architecture


Multi Lingual Summary Generation using UNL

Template based Information Extraction
• Seven tourism specific templates have been designed and used
• Templates filled using semantic information inherent in UNL input graphs
• Template information is language independent and can be used with any desired language.


Example Templates for Tourism Domain

Template              Semantics inherited from UNL
God                   iof>god, iof>goddess, icl>god
Food                  icl>food, icl>fruit
Flora and Fauna       icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility     icl>facility
Transport facility    icl>transport
Place                 icl>place, iof>place, iof>city, iof>country
Distance              icl>unit, icl>number
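Template filling from UNL semantics can be sketched as matching each concept's semantic constraint against the constraints listed for each template. The table below copies the seven templates from the slide; the function name, the concept string format handling and the sample concepts are assumptions made for illustration.

```python
# Template -> UNL semantic constraints, following the table on the slide.
TEMPLATES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def fill_templates(concepts):
    """Group headwords under every template whose constraint they carry.

    Each concept is a string such as "mango(icl>fruit)".
    """
    filled = {}
    for concept in concepts:
        headword, _, rest = concept.partition("(")
        constraint = rest.rstrip(")")
        for template, constraints in TEMPLATES.items():
            if constraint in constraints:
                filled.setdefault(template, []).append(headword)
    return filled


# Hypothetical concepts from an enconverted tourism document.
filled = fill_templates(["madurai(iof>city)", "mango(icl>fruit)", "play(icl>action)"])
```

Concepts whose constraint matches no template, such as `play(icl>action)`, are simply left out, which is what keeps the extracted information focused on the tourism domain.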


Summary Generation

• The template information is converted to the target language using the respective UNL-target language dictionaries.
• UNL-target language dictionaries contain root words.
• The natural language term is obtained from the root word using target language information like case suffixes and language technology tools like the morphological generator (சென்னை+இல்=சென்னையில்).
• When the converted template information is fitted into target language specific dynamic sentence patterns, a summary is generated.
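Fitting template information into sentence patterns amounts to slot filling. The sketch below shows the idea in English only; the patterns, the function name and the sample values are hypothetical, and the morphological generation step used for Tamil case suffixes is not modelled.

```python
# Hypothetical English sentence patterns keyed by the templates they need.
PATTERNS = [
    (("Place", "God"), "{Place} is known for its temple of {God}."),
    (("Place", "Food"), "{Place} is famous for {Food}."),
]


def generate_summary(filled):
    """Emit one sentence per pattern whose templates are all filled."""
    sentences = []
    for required, pattern in PATTERNS:
        if all(filled.get(name) for name in required):
            values = {name: ", ".join(filled[name]) for name in required}
            sentences.append(pattern.format(**values))
    return " ".join(sentences)


summary = generate_summary({"Place": ["Madurai"], "Food": ["mango"]})
```

Because the filled templates are language independent, supporting another language would mean swapping the pattern list and the dictionary lookup, not the extraction step.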


Performance Evaluation
• Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
• The performance of the proposed methodology has been evaluated using human judgement.
• The generated summaries achieved 90% accuracy.

Further Enhancements
• Query specific summary
• Comparing the performance with human generated summaries.


References

[1] Elanchezhiyan K, T V Geetha, Ranjani Parthasarathi & Madhan Karky, CoRe – Concept Based Query Expansion, Tamil Internet Conference, Coimbatore, 2010.

[2] Alkesh Patel, Tanveer Siddiqui, U. S. Tiwary, “A language independent approach to multilingual text summarization”, Conference RIAO2007, Pittsburgh PA, U.S.A., May 30-June 1, 2007.

[3] David Kirk Evans, “Identifying Similarity in Text: Multi-Lingual Analysis for Summarization”, Doctor of Philosophy thesis, Graduate School of Arts and Sciences, Columbia University, 2005.

[4] Radev, Allison, Blair-Goldensohn et al (2004), MEAD – a platform for multidocument multilingual text summarization.

[5] The Universal Networking Language (UNL) Specifications Version 3 Edition 3, UNL Center UNDL Foundation, December 2004.

[6] Jagadeesh J, Prasad Pingali, Vasudeva Varma, “Sentence Extraction Based Single Document Summarization”, Workshop on Document Summarization, March 2005, IIIT Allahabad.

[7] Naresh Kumar Nagwani, Dr. Shrish Verma, “A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm”, International Journal of Computer Applications (0975 – 8887), Volume 17, No. 2, March 2011.

[8] Prof. R. Nedunchelian, “Centroid Based Summarization of Multiple Documents Implemented using Timestamps”, First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008.
