


Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4717–4726, November 16–20, 2020. © 2020 Association for Computational Linguistics


RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Tatiana Shavrina 1,2, Alena Fenogenova 1, Anton Emelyanov 1,3, Denis Shevelev 1, Ekaterina Artemova 2, Valentin Malykh 4, Vladislav Mikhailov 1,2, Maria Tikhonova 1,2, Andrey Chertok 1 and Andrey Evlampiev 1

1 Sberbank / Moscow, Russia
2 National Research University Higher School of Economics / Moscow, Russia
3 Moscow Institute of Physics and Technology / Moscow, Russia
4 Huawei / Moscow, Russia

Abstract

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark – RussianSuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology (Wang et al., 2019), was developed from scratch for the Russian language. We provide baselines, a human-level evaluation, an open-source framework for evaluating models, and an overall leaderboard of transformer models for the Russian language.

In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer the first steps towards further expanding or assessing state-of-the-art models independently of language.

1 Introduction

With the development of technologies for text processing and, later, deep learning methods for obtaining better text representations, language models have gone through increasingly advanced stages of natural language modelling.

Modern scientific methodology is gradually beginning to explore universal transformers as an independent object of study; moreover, such models show the ability to extract causal relationships from texts (natural language inference), to apply common sense, world knowledge and logic (textual entailment), and to generate coherent and correct texts. The actively developing field of model interpretation designs testing procedures that compare model performance to the human level and even probe the ability to reproduce some mechanisms of human brain function.

NLP is gradually absorbing new areas responsible for the mechanisms of thinking and the theory of artificial intelligence.

Benchmark approaches are being developed that test general intellectual “abilities” in a text format, with complex input content but a simple output format. Most of these benchmarks (for more details see Section 2) make the development of machine intelligence Anglo-centric, while other, less widespread languages, in particular Russian, have their own characteristic linguistic categories to be tested.

In this paper, we expand the linguistic diversity of the testing methodology and present the first benchmark for evaluating universal language models and transformers for the Russian language, together with a portable methodology for collecting and filtering the data for other languages.

The contribution of RussianSuperGLUE is two-fold. First, it provides nine novel datasets for the Russian language covering a wide scope of NLU tasks. The choice of tasks is justified by the design of prior NLU benchmarks (Wang et al., 2018, 2019). Second, we evaluate two widely used deep models to establish baselines.

The remainder of the paper is structured as follows. We overview prior work on developing NLU benchmarks, including those designed for languages other than English, in Section 2. Section 3.1 lists the tasks and novel datasets proposed for Russian NLU. Section 4 presents the baselines established for the tasks, including a human-level baseline. We compare the achieved results to the current state of English NLU in Section 5. We discuss future work directions and emphasize the importance of NLU benchmarks for languages other than English in Section 6. Section 7 concludes.


2 Related Work

Several benchmarks have been developed to evaluate and analyze word and sentence embeddings over the past few years.

SentEval (Conneau and Kiela, 2018) is one of the first frameworks intended to evaluate the quality of sentence embeddings. A twofold set of transfer tasks is used to assess the generalization power of sentence embedding models. The transfer tasks comprise downstream tasks, in which the sentence embedding is used as a feature vector, and probing tasks, which aim to evaluate the capability of sentence embeddings to encode linguistic properties. The choice of downstream tasks is limited to sentiment classification, natural language inference, paraphrase detection and image captioning. The probing tasks are meant to analyse the morphological, syntactic and semantic information encoded in sentence embeddings.

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of tools for evaluating the performance of language models across a diverse set of existing natural language understanding (NLU) tasks adopted from different sources. These tasks are divided into two parts: single-sentence classification tasks and sentence-pair classification tasks, subdivided further into similarity and inference tasks. GLUE also includes a hand-crafted diagnostic test which probes for complex linguistic phenomena, such as the ability of the model to capture lexical semantics and predicate-argument structure, and to employ logical apparatus and knowledge representation. GLUE is recognized as a de-facto standard benchmark for evaluating transformer-derived language models. Last but not least, GLUE reports human baselines for the tasks, so that submitted models are compared not only to each other but also to human performance. SuperGLUE (Wang et al., 2019) follows the GLUE paradigm for language model evaluation based on NLU tasks, providing more complex tasks, some of which require reasoning capabilities and some of which are aimed at detecting ethical biases. A few recent projects reveal that the GLUE tasks may not be sophisticated enough and do not require much task-specific linguistic knowledge (Kovaleva et al., 2019; Warstadt et al., 2019). Thus the SuperGLUE benchmark, being more challenging, becomes much more preferable for the evaluation of language models.

decaNLP (McCann et al., 2018) widens the scope of language model evaluation by introducing ten disparate natural language tasks. These tasks comprise not only text classification problems, but also sequence tagging and sequence transformation problems. The latter include machine translation and text summarization, while the former include semantic parsing and semantic role labelling. Although decaNLP and the associated research direction focus on multi-task learning framed as question answering, it also supports zero-shot evaluation.

To evaluate models for languages other than English, several monolingual benchmarks have been developed, such as FLUE (Le et al., 2019) and CLUE (Liang, 2020), being French and Chinese versions of GLUE. These benchmarks include a variety of tasks, ranging from part-of-speech tagging and syntax parsing to machine reading comprehension and natural language inference.

To the best of our knowledge, LINSPECTOR (Eichler et al., 2019) is the first multilingual benchmark for evaluating the performance of language models. LINSPECTOR offers 22 probing tasks, each analysing a single linguistic feature such as case marking, gender, person, or tense, for 52 languages. Part of these 22 probing tasks are static, i.e. aimed at the evaluation of word embeddings, while the rest are contextual and should be used to evaluate language models. Two multilingual benchmarks released in early 2020, XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020), aim at the evaluation of cross-lingual models. XGLUE includes 11 tasks, which cover both language understanding and language generation problems, for 19 languages. XGLUE provides several multilingual and bilingual corpora that allow cross-lingual model training. As for the Russian language, XGLUE provides four datasets: one for POS tagging, a part of XNLI (Conneau et al., 2018) and two datasets crawled from a commercial news website, used for news classification and news headline generation. XTREME consists of nine tasks which cover classification, sequence labelling, question answering and retrieval problems for 40 languages. Almost half of the datasets were translated from English to the target languages with the help of professional translators. XTREME offers five datasets for the Russian language, including NER and two question-answering datasets. Both XGLUE and XTREME offer tasks that are much simpler than SuperGLUE and are aimed at the evaluation of cross-lingual models rather than at the comparison of monolingual models in similar setups. Thus the need for novel datasets targeted at monolingual model evaluation for languages other than English is still not eliminated.

3 RussianSuperGLUE Overview

We intended to have the same task set in our framework as in SuperGLUE. There is no one-to-one mapping, but the corpora we use can be considered close to the corresponding tasks in the SuperGLUE framework.

We divided the tasks into six groups, covering the general diagnostics of language models and different core tasks: common sense understanding, natural language inference, reasoning, machine reading and world knowledge.

3.1 Tasks

The task descriptions are provided below. Samples from the tasks are presented in Figures 1 to 7.

3.1.1 Diagnostics

LiDiRus: Linguistic Diagnostic for Russian is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing one to evaluate information systems on a simple test of textual entailment recognition. The dataset was translated from English to Russian with the help of professional translators and linguists to ensure that the targeted linguistic phenomena are preserved. It corresponds to the AX-b dataset in the SuperGLUE benchmark.

3.1.2 Common Sense

RUSSE: Word in Context is a binary classification task based on the word sense disambiguation problem. Given two sentences and a polysemous word which occurs in both of them, the task is to determine whether the word is used in the same sense in both sentences. For this task we used the Russian word sense disambiguation dataset RUSSE (Panchenko et al., 2015) and converted it into the WiC dataset format from SuperGLUE.

Figure 1: A sample from the RUSSE dataset.
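For illustration, one converted record could look like the sketch below; the field names follow the SuperGLUE WiC convention and are an assumption here rather than the exact released schema.

```python
# A minimal sketch of one WiC-style record after conversion from RUSSE.
# The field names are assumed to mirror the SuperGLUE WiC format and may
# differ from the exact schema shipped with the benchmark.
sample = {
    "idx": 0,
    "word": "ключ",                                   # polysemous word: "key" / "spring (of water)"
    "sentence1": "Он потерял ключ от квартиры.",       # "He lost the key to the apartment."
    "sentence2": "Из-под камня бил холодный ключ.",    # "A cold spring gushed from under the stone."
    "label": False,                                    # True iff the word is used in the same sense
}

def accuracy(records, predictions):
    """Binary accuracy, the metric reported for RUSSE in Table 2."""
    correct = sum(r["label"] == p for r, p in zip(records, predictions))
    return correct / len(records)
```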

PARus: The Choice of Plausible Alternatives for the Russian language provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning. Each question in PARus is composed of a premise and two alternatives, and the task is to select the alternative that more plausibly has a causal relation with the premise. The correct alternative is randomized, so that the expected performance of random guessing is 50%. PARus is constructed as a translation of the COPA dataset from SuperGLUE, edited by professional editors. The data split from COPA is retained.

Figure 2: A sample from the PARus dataset.

3.1.3 Natural Language Inference

TERRa: Textual Entailment Recognition for Russian is a dataset devoted to capturing textual entailment. Textual entailment has been proposed as a generic task that captures major semantic inference needs across many NLP applications, such as question answering, information retrieval, information extraction and text summarization. The task is to recognize, given two text fragments, whether the meaning of one text is entailed (can be inferred) from the other. The corresponding dataset in SuperGLUE is RTE, which in turn is constructed from the NIST RTE challenge series corpora. To collect TERRa, we filtered the large-scale Russian web corpus Taiga (Shavrina and Shapovalova, 2017) with a number of rules to extract suitable sentence pairs and manually corrected them. The rules had the following structure: there should be a mental verb in the first sentence, and the second sentence should be attached to the first one by a subordinating conjunction. To ensure the literary register of the extracted sentences, we processed only the news and fiction parts of Taiga and made sure that the sentences contain only frequently used words (i.e., instances per million, IPM, higher than 1). The word frequencies were estimated according to the Russian National Corpus1.

1 http://www.ruscorpora.ru/new/en/
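The extraction rules above can be summarized in a small sketch; the verb and conjunction lists, the tokenization and the IPM table are placeholders, not the actual resources used for the benchmark.

```python
# A simplified sketch of the rule-based pair extraction described above.
# MENTAL_VERBS, SUBORDINATING_CONJUNCTIONS and IPM_TABLE are tiny placeholders;
# the real pipeline relies on full lexicons, lemmatization and the Russian
# National Corpus frequency lists, none of which are reproduced here.
MENTAL_VERBS = {"думать", "считать", "полагать"}           # "to think", "to believe", "to suppose"
SUBORDINATING_CONJUNCTIONS = {"что", "чтобы", "будто"}      # "that", "so that", "as if"
IPM_TABLE = {}                                              # token -> instances per million

def ipm(token: str) -> float:
    return IPM_TABLE.get(token, 0.0)

def is_candidate_pair(first_sentence: str, second_sentence: str) -> bool:
    first = first_sentence.lower().split()
    second = second_sentence.lower().split()
    has_mental_verb = any(tok in MENTAL_VERBS for tok in first)
    attached_by_conjunction = bool(second) and second[0] in SUBORDINATING_CONJUNCTIONS
    frequent_enough = all(ipm(tok) > 1 for tok in first + second)
    return has_mental_verb and attached_by_conjunction and frequent_enough
```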

Figure 3: A sample from the TERRa dataset.

RCB: The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment-cancelling operator (question, modal, negation, antecedent of a conditional). Similarly to the design of the TERRa dataset, we filtered Taiga with a number of rules and manually post-processed the extracted passages. The final labelling was conducted by three of the authors. This dataset corresponds to the CommitmentBank dataset.

Figure 4: A sample from the RCB dataset.

3.1.4 Reasoning

RWSD: The Winograd Schema task is devoted to coreference resolution in a specifically designed setting, where the reference can be resolved only using common sense. The Russian Winograd Schema Dataset (RWSD) is constructed as a translation of the Winograd Schema Challenge2.

2 https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html

Figure 5: A sample from the RWSD dataset.

3.1.5 Machine Reading

MuSeRC: Russian Multi-Sentence Reading Comprehension is a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. The dataset is the first to study multi-sentence inference at scale for Russian, with an open-ended set of question types that require reasoning skills. The task itself is binary classification: whether a given answer to the question is correct or not. Each example consists of a numbered passage, a question and candidate answers. Our dataset contains approximately 6,000 questions for more than 800 paragraphs across five domains: 1) elementary school texts, 2) news, 3) fiction stories, 4) fairy tales, 5) brief annotations of TV series and books. First, we collected open-source data from these domains and automatically preprocessed it, keeping only the paragraphs that satisfy the following criteria: 1) paragraph length, 2) number of named entities, 3) number of coreference relations. Afterwards, we verified the sentence splitting and numbered each sentence. Next, in Toloka3 we set up a crowdsourcing task to obtain the following: 1) generated questions, 2) generated answers, 3) a check that a human needs more than one sentence of the text to solve each question. While collecting the dataset, we adhered to the principles of MultiRC (Khashabi et al., 2018): a) we exclude any question that can be answered based on a single sentence from a paragraph; b) answers do not appear verbatim in the text; c) answers to the questions are independent of each other.

Figure 6: A sample from the MuSeRC dataset.

RuCoS: Russian reading comprehension with Commonsense reasoning is a large-scale dataset for machine reading comprehension requiring commonsense reasoning. The dataset construction is based on the ReCoRD methodology (Zhang et al., 2018). RuCoS consists of passages and cloze-style queries automatically generated from Russian news articles, namely Lenta4 and Deutsche Welle5. Each sample from the dev and test sets was validated by crowd workers. The answer to each query is a text span that corresponds to one or more referents of the answer entity in the context. The answer entity may be expressed by an abbreviation, an acronym or a set of surface forms. Hence, the task requires understanding of the rich inflectional morphology and lexical variability of Russian. The goal of RuCoS is to test a machine's ability to infer the answer based on commonsense reasoning and knowledge.

3 https://toloka.yandex.ru
4 https://lenta.ru/
5 https://www.dw.com/ru/

Figure 7: A sample from the RuCoS dataset.
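Since the answer entity may surface in several forms, a natural way to score a prediction is to accept a match with any of the gold surface forms; the sketch below illustrates this idea under an assumed normalization, which may differ from the official scorer.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace (an assumed
    normalization; the official RuCoS scorer may differ in details)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, gold_surface_forms: list) -> bool:
    """A prediction is counted as correct if it matches any surface form
    (full name, inflected form, acronym) of the answer entity."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_surface_forms)

# The acronym and the full name refer to the same entity ("Moscow State University").
print(exact_match("МГУ", ["МГУ", "Московский государственный университет"]))  # True
```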

3.1.6 World Knowledge

DaNetQA: This question-answering corpus follows the BoolQ (Clark et al., 2019) design: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer derived from the paragraph. The task is to take both the question and the paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. DaNetQA was collected in a few steps: 1) we used crowd workers to compose candidate yes/no questions; 2) we used the Google API to retrieve relevant Wikipedia pages by treating each question as a search query; 3) we queried a pre-trained BERT-based model for SQuAD (Kuratov and Arkhipov, 2019) to extract relevant paragraphs from the Wikipedia pages, using the candidate questions; 4) finally, we used crowd workers to evaluate each question-paragraph pair and provide the desired yes/no answers. We ensure the high quality of the dataset by using a high annotation overlap at the last step and a number of gold-standard control questions labelled by two of the authors. The core difference of DaNetQA from BoolQ is that a question may occur multiple times in the dataset, since at step 3) we may retrieve more than one relevant paragraph. To make the dataset more challenging, we admit contradictory answers to a question if these answers are implied by the respective passages.

3.1.7 Statistics for the Tasks

Table 1 below presents the characteristics of the collected datasets: the train/validation/test partitioning of the examples, as well as the total volume in tokens and sentences. As one can see, the size of the RuCoS task significantly exceeds that of the other tasks due to the articles included in it.

Task               Samples             Sents    Tokens
LiDiRus            0/0/1104            2210     3.6 · 10^4
Common Sense
  RUSSE            19845/8508/12151    90862    1.1 · 10^6
  PARus            500/100/400         1000     5.4 · 10^3
NLI
  TERRa            2616/307/3198       13706    2.53 · 10^5
  RCB              438/220/348         2715     3.7 · 10^4
Reasoning
  RWSD             606/204/154         1541     2.3 · 10^3
Machine Reading
  MuSeRC           500/100/322         12805    2.53 · 10^5
  RuCoS            72193/4370/4147     583930   1.2 · 10^7
World Knowledge
  DaNetQA          392/295/295         6231     1.31 · 10^5

Table 1: Cumulative task statistics. The size of the train/validation/test splits is provided in the “Samples” column.

3.2 Scoring

Following (Wang et al., 2019), we calculate scores for each of the tasks based on their individual metrics. All metrics are scaled by 100 (i.e., reported as percentages). These scores are then averaged to get the final score. For tasks with multiple metrics, the metrics are averaged first.
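A minimal sketch of this aggregation, assuming the per-task metrics are given as fractions in [0, 1], is shown below.

```python
def overall_score(task_metrics: dict) -> float:
    """Benchmark score as described above: each metric is scaled to a percentage,
    metrics are averaged within a task, and the per-task scores are then averaged."""
    per_task_scores = []
    for task, metrics in task_metrics.items():
        scaled = [100.0 * metric for metric in metrics]
        per_task_scores.append(sum(scaled) / len(scaled))
    return sum(per_task_scores) / len(per_task_scores)

# Toy example with one single-metric task and one task reporting two metrics (F1 and accuracy).
print(overall_score({"TERRa": [0.639], "RCB": [0.432, 0.468]}))  # 54.45
```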

4 Experiments

4.1 Baselines

In this section, we describe a two-step baseline design. First, we developed a naïve baseline based on a TF-IDF model (Section 4.1.1); then we evaluate state-of-the-art models for the Russian language (Section 4.1.2).

4.1.1 Naïve Baseline

We used the scikit-learn package (Pedregosa et al., 2011) to train a TF-IDF model on a sample of 20 thousand Wikipedia articles, drawn equally from the Russian and English sites, with the vocabulary restricted to the 10 thousand most common words. Then, for each task set, a logistic regression was trained on top of the TF-IDF features to predict the answer.
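A minimal sketch of such a baseline is shown below; the Wikipedia sampling and the per-task input formatting are omitted, and fitting the vectorizer directly on the task texts is a simplification of the setup described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF features with a capped vocabulary feeding a per-task logistic regression.
# In the actual setup the vectorizer is fit on a Wikipedia sample rather than on
# the task data, and each task has its own way of concatenating its text fields.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

train_texts = ["premise one [SEP] hypothesis one", "premise two [SEP] hypothesis two"]
train_labels = ["entailment", "not_entailment"]
baseline.fit(train_texts, train_labels)
print(baseline.predict(["premise three [SEP] hypothesis three"]))
```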


Dataset    Metrics    RuBERT         MultiBERT      TF-IDF         Human
LiDiRus    MCC        0.186          0.157          0.059          0.626
RCB        F1/Acc.    0.432/0.468    0.383/0.429    0.45           0.68/0.702
PARus      Acc.       0.61           0.588          0.48           0.982
MuSeRC     F1/EM      0.656/0.256    0.626/0.253    0.589/0.244    0.806/0.42
TERRa      Acc.       0.639          0.62           0.47           0.92
RUSSE      Acc.       0.894          0.84           0.66           0.747
RWSD       Acc.       0.675          0.675          0.66           0.84
DaNetQA    Acc.       0.749          0.79           0.68           0.879
RuCoS      F1/EM      0.255/0.251    0.371/0.367    0.256/0.251    0.93/0.924
Average               0.546          0.542          0.461          0.802

Table 2: Results of the human benchmark and the baseline models. MCC stands for Matthews Correlation Coefficient; Acc. for accuracy; EM for exact match.

4.1.2 Advanced Baselines

We leverage two BERT-derived models as baselines. Multilingual BERT (MultiBERT), released by Devlin et al. (2019), is a single language model pre-trained on monolingual corpora in 104 languages, with Russian texts being part of the training data. MultiBERT uses a shared vocabulary for all languages. The capabilities of MultiBERT on zero-shot cross-lingual tasks have recently been studied by Pires et al. (2019). Russian BERT (RuBERT) was trained on a large-scale corpus of news and Wikipedia in Russian. To alleviate training, all weights except the sub-word embeddings were borrowed from MultiBERT. The sub-word vocabulary was obtained from the same training corpus, and the new monolingual embeddings were transformed from the multilingual ones. This allowed longer Russian sub-word units to be incorporated into the vocabulary. The model is part of the DeepPavlov framework (Kuratov and Arkhipov, 2019).
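As an illustration only, a sentence-pair task such as TERRa can be approached by fine-tuning such a checkpoint with a standard classification head; the sketch below uses the Hugging Face transformers API and toy data, and is an assumption about the setup rather than the exact training configuration used for the reported baselines.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative fine-tuning step for a sentence-pair (entailment) task.
# "bert-base-multilingual-cased" stands in for MultiBERT; a RuBERT checkpoint
# (e.g. the one distributed with DeepPavlov) could be plugged in the same way.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

premise = "The committee announced its decision yesterday."
hypothesis = "A decision was made."
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
labels = torch.tensor([1])  # toy label mapping: 1 = entailment, 0 = not entailment

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```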

4.2 Human Evaluation

We include human performance estimates for all the provided benchmark tasks, including the diagnostic set. We estimate human performance by hiring crowd workers via the Toloka platform to re-annotate a sample from each task's test set. We use a two-step procedure: 1) a crowd worker is given an instruction and completes a short training phase before proceeding to the annotation phase; 2) a crowd worker who has passed the training phase solves the original test set.

For the annotation phase, we ask crowd workers to annotate the full test sets, except for the RUSSE and RuCoS datasets, for which we randomly sampled 5000 and 1000 examples from the test sets, respectively. For each sample, we collect annotations from three to five crowd workers and take a majority vote to estimate human performance. In the annotation phase we add control questions to prevent the crowd workers from cheating. As a result, we reject the annotations of crowd workers who fail the training phase and do not include the results of those who achieve low performance on the control tasks. The results of the human evaluation are presented in Table 2. An example of a Toloka task is provided in the Appendix.
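The per-item aggregation amounts to a simple majority vote over the collected labels, as in the sketch below; tie-breaking for an even number of votes is not specified in the text and is left open here.

```python
from collections import Counter

def majority_vote(labels: list):
    """Aggregate the 3-5 crowd labels collected for one item into a single answer.
    With an even number of votes a tie is possible; the tie-breaking rule is not
    specified in the text, so Counter's ordering is used as a placeholder."""
    return Counter(labels).most_common(1)[0][0]

# Example: three workers annotated one entailment item.
print(majority_vote(["entailment", "entailment", "not_entailment"]))  # entailment
```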

5 Results

The analysis of Table 2 gives a clear picture of the baseline model performance, which still remains significantly below the human level. Nevertheless, the task of resolving word-sense ambiguity in context (RUSSE) was solved by both the monolingual and the multilingual BERT at a level significantly exceeding the human one (0.89 vs 0.74). Besides, the monolingual model shows slightly higher quality than the multilingual one, prevailing especially on the textual entailment tasks (RCB, TERRa, PARus), word-sense disambiguation (RUSSE) and reading comprehension (MuSeRC). The multilingual model shows its best results on the smallest dataset, the commonsense QA task (DaNetQA), and on the commonsense-related machine reading task (RuCoS).

We hope that our benchmark will help improve the performance of models for the Russian language in the future and will favour achieving comparably high results.

Can the results of a multilingual BERT on Russian and English data be considered analogous? Based on the results of the assessment, MultiBERT in English gets an overall score of 60.8 (Jiant, full SuperGLUE task set), while on the RussianSuperGLUE task set it achieves an overall score of 54.2 – roughly 6 points lower. Note, however, that the English benchmark additionally includes the Winograd Gender Parity dataset (Levesque et al., 2012), on which SOTA models reach 90 to 93% accuracy, added to the overall assessment. In the next section, a detailed comparison of the multilingual model performance is provided.

5.1 Comparison to SuperGLUE

As mentioned in Section 3.1, the diagnostic dataset was obtained by professional translation with preservation of the original linguistic features. This diagnostic data is thus the first of its kind that allows drawing a multilingual analogy between comparable models.

Figure 8: Russian and English diagnostic evaluation of the multilingual transformer, scored using Matthews correlation (MCC).

Procedure: using the original MultiBERT (Devlin et al., 2019), we sequentially trained the model in English and in Russian on the RTE data and then tested it on the diagnostic sets, as the task requires exactly the same format. Predictions were then scored using Matthews correlation (MCC), and the correlation for different linguistic features was computed. The results are presented in Figure 8.
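A sketch of this per-feature scoring is given below, assuming each diagnostic example carries a list of linguistic feature tags alongside the gold and predicted labels; the field names are illustrative.

```python
from collections import defaultdict
from sklearn.metrics import matthews_corrcoef

def mcc_per_feature(examples):
    """Overall and per-feature Matthews correlation for diagnostic examples.
    Each example is assumed to be a dict with 'features', 'gold' and 'pred'
    keys; this schema is illustrative, not the benchmark's exact file format."""
    golds, preds = [], []
    by_feature = defaultdict(lambda: ([], []))
    for example in examples:
        golds.append(example["gold"])
        preds.append(example["pred"])
        for feature in example["features"]:
            by_feature[feature][0].append(example["gold"])
            by_feature[feature][1].append(example["pred"])
    overall = matthews_corrcoef(golds, preds)
    per_feature = {f: matthews_corrcoef(g, p) for f, (g, p) in by_feature.items()}
    return overall, per_feature
```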

First of all, it can be noticed that the English variant of the model performs slightly better, showing a higher overall correlation of 0.2 compared to 0.15 for the Russian variant. This could be due to an asymmetry in the quality of the multilingual model and its better command of the English language in general.

As for the models' performance on different linguistic features, the results generally coincide. For those categories for which the correlation is low in English, the result in Russian is in most cases poor as well (for instance, Redundancy, Nominalization, Intervals/Numbers). However, several categories are solved much better in English than in Russian, such as PA Structure, Ellipsis/Implicits, Genitives/Partitives, Prepositional phrases and Datives – mostly low-level and/or syntactically driven categories. This may indicate that the optimal hyperparameters of the BERT architecture are much better suited to English syntax and may not be linguistically universal. Similarly, we find categories which show an extremely high correlation in Russian and a low correlation in English (Factivity, Coordination scope, Restrictivity and Existential) – high-level logical and semantic categories.

Compared to the English results, these numbers could be explained by the fact that the linguistic features currently included in the diagnostics are not exactly universal across languages and are mostly focused on English (at least the syntactic ones). Thus, for a comprehensive cross-linguistic typological analysis, the set of possible linguistic features should be reviewed.

6 Discussion

We hope that our project will give a start to new research on the application of universal language models and transformers, including multilingual ones. Our analysis of the translated diagnostics shows that even between languages of the same Indo-European family (to which both Russian and English belong), significant differences in the influence of linguistic categories on model performance are possible. As one of the directions of future studies, we consider detailed experiments on the influence of model parameters and of the linguistic categories present in the data on model quality in different languages.


An independent problem for the original English leaderboard is that the gradual improvement in model quality makes it possible to exceed the human performance level on individual tasks, as happened with the T5 model (Raffel et al., 2019). We expect a similar situation to arise soon for Russian data, which means that, even though we start right away with the complex SuperGLUE-style tasks, we will still focus on adding tasks of a higher level of complexity in the future. Such tasks can be those that are clearly inaccessible to current models: the “understanding” of long texts and documents, seq2seq tasks, and tasks that require knowledge graphs.

In the further development of our leaderboard, we also see the possibility of adding an industrial assessment of models: for fair ranking and ease of use, all models could receive an estimate of the required memory resources, an estimate of performance, and so on.

7 Conclusion

In this paper we present the first benchmark for general language understanding evaluation for the Russian language. The benchmark, which includes nine task sets, is aimed at testing BERT-like models for their ability to perform entailment recognition, commonsense reasoning and machine reading, while accounting for various linguistic features introduced at the level of semantics, logical and syntactic structure.

We invite developers, researchers, and AI experts to join our project. Further development of the benchmark includes areas such as the evaluation of the industrial performance of the models on the leaderboard and multilingual diagnostics.

Acknowledgements

Ekaterina Artemova works within the framework of the HSE University Basic Research Program and is funded by the Russian Academic Excellence Project “5-100”.

References

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Max Eichler, Gözde Gül Şahin, and Iryna Gurevych. 2019. LINSPECTOR WEB: A multilingual probing suite for word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 127–132, Hong Kong, China. Association for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365.

Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2019. FlauBERT: Unsupervised language model pre-training for French.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xu Liang. 2020. ChineseGLUE. https://github.com/chineseGLUE/chineseGLUE.

Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. 2020. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Alexander Panchenko, Natalia Loukachevitch, Dmitry Ustalov, Denis Paperno, Christian Meyer, and Natalia Konstantinova. 2015. RUSSE: The first workshop on Russian semantic similarity. Dialogue, 14:89–105.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Tatiana Shavrina and Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: "Taiga" syntax tree corpus and parser. Corpus Linguistics 2017, page 78.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu, Alicia Parrish, et al. 2019. Investigating BERT's knowledge of language: Five analysis methods with NPIs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2870–2880.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension.

A Appendices


B Examples