Implementing a Corpus for
Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. De Silva
OUTLINE
• Introduction
• Resource Gathering
• Data Storage
• User Interface Design and Implementation
• Application Programming Interface
• Limitations
INTRODUCTION
What is a Language Corpus?
▪ A collection of authentic texts of the
language
▪ Stored electronically
▪ Contains actual usage patterns of the
language
▪ Covers a wide context of the language
▪ Can be used to discover information about
language that may not have been noticed
through intuition alone
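As a rough illustration of the kind of information a corpus can surface, the sketch below counts word frequencies over a few sample sentences. The sentences and the whitespace tokenizer are assumptions made for this example only; they are not taken from the actual corpus.

```python
from collections import Counter

# Hypothetical sample texts; a real corpus would hold many more documents.
TEXTS = [
    "මම ගෙදර යනවා",
    "මම පොත කියවනවා",
]

def word_frequencies(texts):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

freq = word_frequencies(TEXTS)
print(freq.most_common(3))
```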
IDENTIFIED SINHALA RESOURCES
▪ News: Newspapers, News Items
▪ Academic: Textbooks, Religious, Wikipedia, Mahawansa
▪ Creative Writing: Fiction, Blogs, Magazines
▪ Spoken: Subtitles
▪ Gazette: Gazette
DATA REPRESENTATION
▪ Raw data is stored separately as XML-formatted
files and converted to other formats as each
information need requires
▪ Primary Metadata
  ▪ Article Name
  ▪ Author
  ▪ URL
  ▪ Date (Year, Month, Day)
  ▪ Genre
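A minimal sketch of how one raw article and its primary metadata might look as XML, and how it could be read back. The element names (`post`, `name`, `link`, `content` and so on) and the sample values are assumptions for illustration; the slides do not give the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for one article (not the project's real schema).
SAMPLE = """<post>
  <name>Sample article</name>
  <author>Unknown</author>
  <link>http://example.com/article/1</link>
  <date><year>2014</year><month>6</month><day>15</day></date>
  <genre>News</genre>
  <content>මම ගෙදර යනවා</content>
</post>"""

def parse_article(xml_text):
    """Extract the primary metadata and the text of one article."""
    root = ET.fromstring(xml_text)
    date = root.find("date")
    return {
        "name": root.findtext("name"),
        "author": root.findtext("author"),
        "url": root.findtext("link"),
        "date": tuple(int(date.findtext(part)) for part in ("year", "month", "day")),
        "genre": root.findtext("genre"),
        "content": root.findtext("content"),
    }

article = parse_article(SAMPLE)
print(article["name"], article["date"], article["genre"])
```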
DATA STORAGE SYSTEM
Considerations
▪ Relational Databases
  ▪ Oracle DB
  ▪ H2 Database
▪ Indexed File Systems
  ▪ Apache Solr
▪ Column Store Databases
  ▪ Cassandra
▪ Graph Databases
  ▪ Neo4j
EVALUATION CRITERIA
We compared the candidates on data insertion
performance and on retrieval performance for
12 different information needs.
• Cassandra performed better than the
others in most of the scenarios,
and its insertion time increased
linearly.
• So we chose it for implementing
the corpus.
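One way such an insertion comparison could be set up is sketched below: records are inserted in order and the elapsed time is noted at fixed checkpoints, so growth can be compared across stores. `InMemoryStore` is a hypothetical stand-in; a real run would wrap each candidate database behind the same `insert` method. This is not the benchmark used in the project.

```python
import time

class InMemoryStore:
    """Hypothetical stand-in for a database client with an insert() method."""
    def __init__(self):
        self.rows = []

    def insert(self, record):
        self.rows.append(record)

def time_inserts(store, records, checkpoints):
    """Insert records in order; return (count, elapsed seconds) at each checkpoint."""
    timings = []
    source = iter(records)
    inserted = 0
    start = time.perf_counter()
    for target in checkpoints:
        while inserted < target:
            store.insert(next(source))
            inserted += 1
        timings.append((target, time.perf_counter() - start))
    return timings

store = InMemoryStore()
records = ({"id": i, "word": "w%d" % i} for i in range(3000))
timings = time_inserts(store, records, [1000, 2000, 3000])
for count, seconds in timings:
    print(count, round(seconds, 6))
```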
USER INTERFACE DESIGN AND IMPLEMENTATION
● The web interface of Sinmin is designed
for users who prefer a visualised and
summarized view of Sinmin's statistical
data.
● The visual design lets a user with no prior
experience of the interface fulfill their
information requirements with little effort.
APPLICATION PROGRAMMING INTERFACE (API)
• REST API to expose corpus services
• More complex and customizable data retrieval
and filtering than the web interface
• Interface for third-party applications to
consume
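A third-party client could build filtered requests along these lines. The base URL, the `wordFrequency` service name and the parameter names are all hypothetical, since the slides do not list the real endpoints; the sketch only constructs the URL and sends nothing.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real service address is not given in the slides.
BASE_URL = "http://localhost:8080/api"

def build_query_url(service, **filters):
    """Build a GET URL for a corpus service, skipping filters that are None."""
    params = {key: value for key, value in sorted(filters.items()) if value is not None}
    url = f"{BASE_URL}/{service}"
    if params:
        url += "?" + urlencode(params)
    return url

# Hypothetical service and filter names.
url = build_query_url("wordFrequency", value="මම", year=2014, category="News", author=None)
print(url)
```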
LIMITATIONS OF THE CORPUS
● Words are not annotated with Part of
Speech (POS) tags or lemmas.
● If a new information need arises, new
column families may need to be created for
it and the data inserted again.
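The second limitation can be pictured with an in-memory sketch in which each information need gets its own precomputed table, loosely mirroring one column family per query. The sample articles and key layouts are assumptions; this does not reproduce the actual Cassandra schema.

```python
from collections import defaultdict

# Hypothetical sample data standing in for the stored articles.
ARTICLES = [
    {"year": 2013, "genre": "News", "text": "මම ගෙදර යනවා"},
    {"year": 2014, "genre": "News", "text": "මම පොත කියවනවා"},
    {"year": 2014, "genre": "Fiction", "text": "මම ගෙදර යනවා"},
]

def build_table(articles, key_fn):
    """Scan every article and count words under the key this query needs."""
    table = defaultdict(int)
    for article in articles:
        for word in article["text"].split():
            table[key_fn(article, word)] += 1
    return table

# Existing need: word frequency by year.
freq_by_year = build_table(ARTICLES, lambda a, w: (w, a["year"]))

# New need: word frequency by genre. The existing table cannot answer it,
# so all articles are scanned and inserted again into a new table.
freq_by_genre = build_table(ARTICLES, lambda a, w: (w, a["genre"]))

print(freq_by_year[("මම", 2014)], freq_by_genre[("මම", "News")])
```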