
Proceedings of the 8th Workshop on Asian Translation, pages 191–197, Bangkok, Thailand (online), August 5-6, 2021. ©2021 Association for Computational Linguistics


Itihasa: A large-scale corpus for Sanskrit to English translation

Rahul Aralikatte1 Miryam de Lhoneux1 Anoop Kunchukuttan2 Anders Søgaard1

1 University of Copenhagen  2 Microsoft AI and Research

1 {rahul,ml,soegaard}@di.ku.dk  2 [email protected]

Abstract

This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics, viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.1

1 Introduction

Sanskrit is one of the oldest languages in the world, and most Indo-European languages are influenced by it (Beekes, 1995). There are about 30 million pieces of Sanskrit literature available to us today (Goyal et al., 2012), most of which have not been digitized. Among those that have been, few have been translated. The main reason for this is the lack of expertise and funding. An automatic translation system would not only aid and accelerate this process, but it would also help in democratizing the knowledge, history, and culture present in this literature. In this work, we present Itihasa, a large-scale Sanskrit-English translation corpus consisting of more than 93,000 shlokas and their translations.

Itihasa, literally meaning 'it happened this way', is a collection of historical records of important events in Indian history. These bodies of work are mostly composed in the form of verses or shlokas, a poetic form which usually consists of four parts containing eight syllables each (Fig. 1). The most important among these works are The Ramayana

1 The processed and split dataset can be found at https://github.com/rahular/itihasa and a human-readable version can be found at http://rahular.com/itihasa.

मा निषाद प्रतिष्ठां त्वमगमश्शाश्वतीस्समाः। यत्क्रौञ्चमिथुनादेकमवधीः काममोहितम्॥

O fowler, since you have slain one of a pair of Krauñcas, you shall never attain prosperity (respect)!

Figure 1: An introductory shloka from The Ramayana. The four parts with eight syllables each are highlighted with different shades of gray.

and The Mahabharata. The Ramayana, which describes the events in the life of Lord Rama, consists of 24,000 verses. The Mahabharata details the war between cousins of the Kuru dynasty, in 100,000 verses. The Mahabharata is the longest poem ever written, with about 1.8 million words in total, and is roughly ten times the length of the Iliad and the Odyssey combined.

Only two authors have attempted to translate the unabridged versions of both The Ramayana and The Mahabharata to English: Manmatha Nath Dutt in the 1890s and Bibek Debroy in the 2010s. M. N. Dutt was a prolific translator whose works are now in the public domain. These works are published in a shloka-wise format, as shown in Fig. 1, which makes it easy to automatically align shlokas with their translations. Though many of M. N. Dutt's works are freely available, we choose to extract data from The Ramayana (Valmiki and Dutt, 1891) and The Mahabharata (Dwaipayana and Dutt, 1895), mainly due to their size and popularity. To the best of our knowledge, this is the largest Sanskrit-English translation dataset released in the public domain.

We also train and evaluate standard translation systems on this dataset. In both translation directions, we use Moses as an SMT baseline, and Transformer-based seq2seq models as NMT baselines (see §4). We find that models which are generally on par with human performance on other


[Figure 2 panels: Original Page → Invert & Dilate → Detect separators → Split columns]

Figure 2: Pre-processing pipeline. The three steps shown here are: (i) invert the colour scheme of the PDF and dilate every detectable edge, (ii) find the indices of the longest vertical and horizontal lines in the page, and (iii) split the original PDF along the found separator lines.

translation tasks, perform poorly on Itihasa, with the best models scoring between 7 and 8 BLEU points. This indicates the complex nature of the dataset (see §3 for a detailed analysis of the dataset and its vocabulary).

Motivation The main motivation behind this work is to provide an impetus for the Indic NLP community to build better translation systems for Sanskrit. Additionally, since The Ramayana and The Mahabharata are so pervasive in Indian culture, and have been translated into all major Indian languages, there is a possibility of creating an n-way parallel corpus with Sanskrit as the pivot language, similar to the Europarl (Koehn, 2005) and PMIndia (Haddow and Kirefu, 2020) datasets.

The existence of Sanskrit-English parallel data has other advantages as well. Because Sanskrit is morphologically rich, agglutinative, and highly inflected, complex concepts can be expressed in compact forms by combining individual words through Sandhi and Samasa.2 This also enables a speaker to potentially create an infinite number of unique words in Sanskrit. Having a parallel corpus can help us induce word translations through bilingual dictionary induction (Søgaard et al., 2018). It also allows us to use English as a surrogate language for tasks like knowledge base population. Constituency or dependency parsing, NER, and word sense disambiguation can be improved using indirect supervision (Tackstrom, 2013). Essentially, a parallel corpus allows us to apply a plethora of transfer learning techniques to improve

2 Sandhi refers to the concatenation of words, where the edge characters combine to form a new one. Samasa can be thought of as being similar to elliptic constructions in English, where certain phrases are elided since their meaning is obvious from the context.

NLP tools for Sanskrit.
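To make the notion of Sandhi concrete, a single vowel-sandhi rule (savarna-dirgha: a/ā followed by a/ā merges into ā) can be sketched in IAST transliteration. This is a toy illustration, not a real sandhi engine; Sanskrit has many more rules.

```python
def sandhi_join(w1, w2):
    """Toy savarna-dirgha sandhi: a/ā + a/ā -> ā (IAST transliteration).

    Only one of Sanskrit's many sandhi rules is modelled; any other
    boundary is left as plain concatenation.
    """
    if w1 and w2 and w1[-1] in "aā" and w2[0] in "aā":
        return w1[:-1] + "ā" + w2[1:]
    return w1 + w2

# Classic examples of this rule:
#   vidyā + ālayaḥ -> vidyālayaḥ ("school")
#   na + asti      -> nāsti      ("is not")
```

Reversing such merges (sandhi splitting) is exactly what makes Sanskrit tokenization hard, since the joined form hides the original word boundary.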

2 Data Preparation

The translated works of The Ramayana and The Mahabharata were published in four and nine volumes respectively.3 All volumes have a standard two-column format, as shown in Fig. 2. Each page has a header with the chapter name and page number, separated from the main text by a horizontal line. The two columns of text are separated by a vertical line. The process of data preparation can be divided into (i) automatic OCR extraction, and (ii) manual inspection for alignment errors.

Automatic Extraction The OCR systems we experimented with performed poorly on the digitized documents due to their two-column format. They often fail to recognize line breaks, which results in the concatenation of text present in different columns. To mitigate this issue, we use an edge detector4 to find the largest horizontal and vertical lines, and, using the indices of the detected lines, split the original page horizontally and vertically to remove the header and separate the columns (see Fig. 2). We then input the single-column documents to Google Cloud's OCR API5 to extract text from them. To verify the accuracy of the extracted text, one chapter from each volume (13 chapters in total) is manually checked for mistakes. We find that the extracted text is more than 99% and 97% accurate in Sanskrit and English respectively. The surprising accuracy of Devanagari OCR can be attributed to the distinctness of its alphabet. For English, this number decreases as the OCR system often misclassifies similar-looking characters (viz., e and c, i and l, etc.).

3 The digitized (scanned) PDF versions of these books are available at https://hinduscriptures.in

4 We invert the color scheme and apply a small dilation for better edge detection using OpenCV (Bradski, 2000).

5 More information can be found at https://cloud.google.com/vision/docs/pdf

(a) Print error. (b) Input error. (c) Subjective error.

Figure 3: Different types of errors found in the original text while performing manual inspection.
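The page-splitting step of the pipeline can be sketched as follows. This is an illustrative reconstruction using plain NumPy; the authors use OpenCV edge detection with inversion and dilation, and the search windows below are assumptions, not their exact values.

```python
import numpy as np

def split_page(page):
    """Split a scanned two-column page into two header-free column images.

    `page` is a 2-D uint8 array with a white background (255) and black
    ink (0). The longest horizontal rule (header separator) is taken to
    be the row with the largest ink mass in the top quarter of the page;
    the longest vertical rule (column separator) is the analogous column
    near the middle.
    """
    inverted = 255 - page.astype(np.int64)  # ink becomes high-valued

    # Header separator: row with maximum ink, searched in the top quarter.
    header_y = int(np.argmax(inverted.sum(axis=1)[: page.shape[0] // 4]))

    # Column separator: column with maximum ink, searched around the middle.
    mid, w = page.shape[1] // 2, page.shape[1] // 8
    sep_x = mid - w + int(np.argmax(inverted.sum(axis=0)[mid - w : mid + w]))

    body = page[header_y + 1 :, :]          # drop the header strip
    return body[:, :sep_x], body[:, sep_x + 1 :]
```

Each returned column image can then be sent to the OCR API as a single-column document.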

Manual Inspection An important limitation of the OCR system is its misclassification of alignment spaces and line breaks. It sometimes wrongly treats large gaps between words as line breaks, and the rest of the text on the line is moved to the end of the paragraph, which results in translations being misaligned with their shlokas. Therefore, the output of all 13 volumes was manually inspected and such misalignments were corrected.6

Upon manual inspection, other kinds of errors were discovered and corrected where possible.7

These errors can be categorized as follows: (i) print errors: this type of error is caused by occluded or faded text, smudged ink, etc.; an example can be seen in Fig. 3a; (ii) input errors: these are human errors made while typesetting the volumes, which include typos (Fig. 3b), exclusion of words, inclusion of spurious words, etc.; (iii) subjective errors: these are contextual errors in the translation itself; for example, in Fig. 3c, the word dharma is incorrectly translated as 'religion' instead of 'righteousness'; and (iv) OCR errors: these errors arise from the underlying OCR system. An example of such errors is the improper handling of words split across lines in the Devanagari script: if the OCR system encounters a hyphen as the last character of a line, the entire line is ignored. In general, print errors are corrected as much as possible, subjective errors are retained for originality, and other types of errors are corrected when encountered.

6 This was a time-consuming process, and the first author inspected the output manually over the course of one year.

7 It was not feasible for the authors to correct every error, especially the lexical ones. The most common error that exists in the corpus is the swapping of e and c, for example, 'thcir' instead of 'their'. Though these errors can easily be corrected using automated tools like the one proposed by Boyd (2018), this is out of scope for this paper and is left for future work.

                 Train    Dev.    Test    Total

Ramayana
  Chapters         514      42      86      642
  Shlokas       15,834   1,115   2,422   19,371

Mahabharata
  Chapters       1,688     139     283    2,110
  Shlokas       59,327   5,033   9,299   73,659

Overall
  Chapters       2,202     181     369    2,752
  Shlokas       75,161   6,148  11,721   93,030

Table 1: Size of training, development, and test sets.

3 Analysis

In total, we extract 19,371 translation pairs from 642 chapters of The Ramayana and 73,659 translation pairs from 2,110 chapters of The Mahabharata. It should be noted that these numbers do not correspond to the number of shlokas because, in the original volumes, shlokas are sometimes split and often combined to make the English translations flow better. We reserve 80% of the data from each text for training MT systems and use the rest for evaluation. From the evaluation set, 33% is used for development and 67% for testing. The absolute sizes of the split data are shown in Tab. 1.
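The chapter-level split described above can be sketched as follows. This is an illustrative reconstruction; the authors' exact procedure (rounding, shuffling, per-text handling) is not specified in the text.

```python
def split_dataset(chapters):
    """80% train; of the remaining 20%, roughly 33% dev and 67% test.

    `chapters` is any ordered sequence of chapter-level units; the split
    ratios follow the paper, the truncation-based rounding is an assumption.
    """
    n_train = int(0.8 * len(chapters))
    train, evaluation = chapters[:n_train], chapters[n_train:]
    n_dev = int(0.33 * len(evaluation))
    return train, evaluation[:n_dev], evaluation[n_dev:]
```

Applied to the 2,752 chapters overall, these ratios are consistent with the 2,202 / 181 / 369 chapter counts in Tab. 1.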

Due to Sanskrit's agglutinative nature, the dataset is asymmetric in the sense that the number of words required to convey the same information is smaller in Sanskrit than in English. The Ramayana's English translations, on average, have 2.54 words for every word in its shlokas. This value is even larger in The Mahabharata, with 2.82 translated words per shloka word.

This effect is clearly seen when we consider the vocabulary sizes and the percentage of common tokens between the texts. For this, we tokenize the data with two different tokenization schemes: word-level and byte-pair encoding (Sennrich et al., 2016, BPE). For word-level tokenization, the translations of The Ramayana (The Mahabharata) have 16,820 (31,055) unique word tokens, and the shlokas have 66,072 (184,407) tokens. The English vocabularies


Figure 4: Comparison of vocabulary sizes. Sanskrit's morphological and agglutinative nature accounts for the large number of unique tokens in the vocabularies.

have 11,579 common tokens, which is 68.8% of The Ramayana's and 37.3% of The Mahabharata's. But the overlap percentages drop significantly for the Sanskrit vocabularies. In this case, we find 21,635 common tokens, which amounts to an overlap of 32.7% and 11.7% respectively. As shown in Fig. 4, this trend holds for BPE tokenization as well.
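The overlap percentages follow directly from the reported vocabulary sizes; a quick sanity check, with all counts taken from the text above:

```python
# English vocabularies: 11,579 tokens shared between the two texts.
assert round(11_579 / 16_820 * 100, 1) == 68.8   # of The Ramayana's vocabulary
assert round(11_579 / 31_055 * 100, 1) == 37.3   # of The Mahabharata's vocabulary

# Sanskrit vocabularies: 21,635 shared tokens, much lower relative overlap.
assert round(21_635 / 66_072 * 100, 1) == 32.7
assert round(21_635 / 184_407 * 100, 1) == 11.7
```

The roughly halved Sanskrit overlap, despite vocabularies three to six times larger, is what the figure attributes to Sanskrit's morphology and agglutination.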

4 Experiments

We train one SMT and five NMT systems in both directions and report the (i) character n-gram F-score, (ii) token accuracy, (iii) BLEU (Papineni et al., 2002), and (iv) Translation Edit Rate (Snover et al., 2006, TER) scores in Tab. 2. For SMT, we use Moses (Koehn et al., 2007), and for NMT, we use sequence-to-sequence (seq2seq) Transformers (Vaswani et al., 2017). We train the seq2seq models from scratch by initializing the encoders and decoders with standard BERT (B2B) architectures. These Tiny, Mini, Small, Medium, and Base models have 2/128, 4/256, 4/512, 8/512, and 12/768 layers/dimensions respectively. See Turc et al. (2019) for more details. In our early experiments, we also tried initializing the encoders and decoders with weights from pre-trained Indic language models like MuRIL (Khanuja et al., 2021), but they showed poor performance and thus are not reported here.

Implementation Details All models are trained using HuggingFace Transformers (Wolf et al., 2020). Both source and target sequences are truncated at 128 tokens. We train WordPiece tokenizers on our dataset and use them for all models. The Adam optimizer (Kingma and Ba, 2014) with a weight decay of 0.01 and a learning rate of 5 × 10−5 is used. All models are trained for 100 epochs. The learning rate is warmed up over 8,000 steps and decayed later with a linear scheduler. We use a batch size of 128, and use standard cross-entropy loss with no label smoothing. We run into memory errors on bigger models (Medium and Base), but maintain the effective batch size and optimization steps by introducing gradient accumulation and increasing the number of epochs, respectively. Also, to reduce the total training time of bigger models, we stop training if the BLEU score does not improve over 10 epochs. During generation, we use a beam size of 5 and compute all metrics against truncated references.

Model          chrF   Tok. Acc.   BLEU   TER (↓)

English to Sanskrit
Moses          25.9     5.48      0.21    1.53
B2B-Tiny       15.1     8.07      2.85    1.04
B2B-Mini       21.6     8.52      5.66    1.03
B2B-Small      23.4     8.75      6.93    1.03
B2B-Medium     23.7     8.67      6.94    1.02
B2B-Base       24.3     8.89      7.59    1.04

Sanskrit to English
Moses          29.3     8.03      5.67    0.91
B2B-Tiny       24.5     8.61      5.64    0.98
B2B-Mini       29.3     8.58      7.28    0.95
B2B-Small      30.1     8.55      7.49    0.95
B2B-Medium     30.4     8.49      7.48    0.94
B2B-Base       30.5     8.38      7.09    0.93

Table 2: Character F1, token accuracy, BLEU, and TER scores for Moses and Transformer models. Scores marked with (↓) are better if they are lower.

Discussion We see that all models perform poorly, with low token accuracy and high TER. While the English to Sanskrit (E2S) models get better with size, this pattern is not clearly seen in the Sanskrit to English (S2E) models. Surprisingly, for S2E models, the token accuracy progressively decreases as their size increases. Also, Moses has the best TER among S2E models, which suggests that the seq2seq models have not been able to learn even simple co-occurrences between source and target tokens. This leads us to hypothesize that the Sanskrit encoders produce sub-optimal representations. One way to improve them would be to add a Sandhi-splitting step to the tokenization pipeline, thereby decreasing the Sanskrit vocabulary size. Another natural extension to improve the quality of representations would be to initialize the encoders with a pre-trained language model.


EN Hearing the words of Viśvāmitra, Rāghava, together with Lakṣmaṇa, was struck with amazement, and spoke to Viśvāmitra, saying,

SN विश्वामित्रवचः श्रुत्वा राघवः सहलक्ष्मणः। विस्मयं परमं गत्वा विश्वामित्रमथाब्रवीत्॥

Pred. विश्वामित्रवचः श्रुत्वा लक्ष्मणः सहलक्ष्मणः। विश्वामित्रवचः श्रुत्वा विश्वामित्रोऽब्रवीदिदम्॥

Figure 5: A gold sentence and shloka from the test set, and its corresponding small model prediction.

Though it is clear that there is a large scope for improvement, the models are able to learn some interesting features of the dataset. Fig. 5 shows a random gold translation pair and the small model's prediction. Though we see repetitions of phrases and semantic errors, the prediction follows the meter in which the original shlokas are written, i.e., it also consists of four parts containing eight syllables each.
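The meter observation above can be checked mechanically. A rough Devanagari syllable (akshara) counter follows, under the simplifying assumption that each syllable corresponds to either an independent vowel or a consonant whose inherent vowel is not silenced by a virama; candrabindu, Vedic accents, and other edge cases are ignored.

```python
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, suppresses the inherent vowel

def count_syllables(text):
    """Approximate syllable count for Devanagari text.

    Counts independent vowels (U+0904..U+0914) plus consonants
    (U+0915..U+0939) not immediately followed by a virama; dependent
    vowel signs (matras) modify a consonant's vowel but add no syllable.
    """
    count = 0
    for i, ch in enumerate(text):
        if "\u0904" <= ch <= "\u0914":            # independent vowel
            count += 1
        elif "\u0915" <= ch <= "\u0939":          # consonant
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt != VIRAMA:                     # inherent vowel survives
                count += 1
    return count
```

On a well-formed shloka, each of the four parts should count to eight under this approximation.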

5 Related Work

Early translation efforts from Sanskrit to English were limited to the construction of dictionaries by Western Indologists (Muller, 1866; Monier-Williams, 1899). Over the years, though notable translation works like Ganguli (1883) have been published, the lack of digitization has been a bottleneck hindering any meaningful progress towards automatic translation systems. This has changed recently, at least for monolingual data, with the curation of digital libraries like GRETIL8 and DCS9. Currently, the largest freely available repositories of translations are for The Bhagavadgita (Prabhakar et al., 2000) and The Ramayana (Geervani et al., 1989).

However, labeled datasets for other tasks, like the ones proposed in Kulkarni (2013), Bhardwaj et al. (2018), and Krishnan et al. (2020), have resulted in parsers (Krishna et al., 2020, 2021) and sandhi splitters (Aralikatte et al., 2018; Krishnan and Kulkarni, 2020), which are precursors to modular translation systems. Though there have been attempts at building Sanskrit translation tools (Bharati and Kulkarni, 2009), they are mostly rule-based and rely on manual intervention. We hope that the availability of the Itihasa corpus pushes the domain towards end-to-end systems.

8 http://gretil.sub.uni-goettingen.de/gretil.html

9 http://www.sanskrit-linguistics.org/dcs/index.php

6 Conclusion

In this work, we introduce Itihasa, a large-scale dataset containing more than 93,000 pairs of Sanskrit shlokas and their English translations from The Ramayana and The Mahabharata. First, we detail the extraction process, which includes an automated OCR phase and a manual alignment phase. Next, we analyze the dataset to give an intuition of its asymmetric nature and to showcase its complexities. Lastly, we train state-of-the-art translation models, which perform poorly, demonstrating the need for more work in this area.

Acknowledgements

We thank the reviewers for their valuable feedback. Rahul Aralikatte and Anders Søgaard are funded by a Google Focused Research Award.

References

Rahul Aralikatte, Neelamadhav Gantayat, Naveen Panwar, Anush Sankaran, and Senthil Mani. 2018. Sanskrit sandhi splitting using seq2(seq)2. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4909–4914, Brussels, Belgium. Association for Computational Linguistics.

Robert Stephen Paul Beekes. 1995. Comparative Indo-European Linguistics. Benjamins, Amsterdam.

Akshar Bharati and Amba Kulkarni. 2009. Anusaaraka: an accessor cum machine translator. Department of Sanskrit Studies, University of Hyderabad, Hyderabad, pages 1–75.

Shubham Bhardwaj, Neelamadhav Gantayat, Nikhil Chaturvedi, Rahul Garg, and Sumeet Agarwal. 2018. SandhiKosh: A benchmark corpus for evaluating Sanskrit sandhi tools. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84, Brussels, Belgium. Association for Computational Linguistics.

G. Bradski. 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools.

Krishna Dwaipayana and Manmatha Nath Dutt. 1895. Mahabharata. Elysium Press, Calcutta.

Kisari Mohan Ganguli. 1883. The Mahabharata.


P Geervani, K Kamala, and V. V. Subba Rao. 1989. Valmiki Ramayana.

Pawan Goyal, Gerard Huet, Amba Kulkarni, Peter Scharf, and Ralph Bunker. 2012. A distributed platform for Sanskrit processing. In Proceedings of COLING 2012, pages 1011–1028, Mumbai, India. The COLING 2012 Organizing Committee.

Barry Haddow and Faheem Kirefu. 2020. PMIndia – a collection of parallel corpora of languages of India.

Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, and Partha Talukdar. 2021. MuRIL: Multilingual representations for Indian languages.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Amrith Krishna, Ashim Gupta, Deepak Garasangi, Pavankumar Satuluri, and Pawan Goyal. 2020. Keep it surprisingly simple: A simple first order graph based parsing model for joint morphosyntactic parsing in Sanskrit. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4791–4797, Online. Association for Computational Linguistics.

Amrith Krishna, Bishal Santra, Ashim Gupta, Pavankumar Satuluri, and Pawan Goyal. 2021. A graph-based framework for structured prediction tasks in Sanskrit. Computational Linguistics, 46(4):785–845.

Sriram Krishnan and Amba Kulkarni. 2020. Sanskrit segmentation revisited.

Sriram Krishnan, Amba Kulkarni, and Gerard Huet. 2020. Validation and normalization of DCS corpus using Sanskrit Heritage tools to build a tagged gold corpus.

Amba Kulkarni. 2013. A deterministic dependency parser with dynamic programming for Sanskrit. In Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013), pages 157–166, Prague, Czech Republic. Charles University in Prague, Matfyzpress.

Monier Monier-Williams. 1899. A Sanskrit-English Dictionary: Etymologically and Philologically Arranged with Special Reference to Cognate Indo-European Languages. Clarendon Press.

F. M. Muller. 1866. A Sanskrit Grammar for Beginners: In Devanagari and Roman Letters Throughout. Handbooks for the Study of Sanskrit. Longmans, Green, and Company.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

T. V Prabhakar, Ravi Mula, T Archana, Harish Karnick, Nitin Gautam, Murat Dhwaj Singh, Madhu Kumar Dhavala, Rajeev Bhatia, and Saurabh Kumar. 2000. Gita Supersite.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and Ralph Weischedel. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (AMTA 2006).

Anders Søgaard, Sebastian Ruder, and Ivan Vulic. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788, Melbourne, Australia. Association for Computational Linguistics.

Oscar Tackstrom. 2013. Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. Ph.D. thesis, Uppsala University.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Valmiki and Manmatha Nath Dutt. 1891. Ramayana. Deva Press, Calcutta.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.