


Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 20–28, Bangkok, Thailand (Online), August 6, 2021. ©2021 Association for Computational Linguistics


Comprehension Based Question Answering using Bloom’s Taxonomy

Pritish Sahu 1,2 ∗   Michael Cogswell 1 ∗   Sara Rutherford-Quach 1   Ajay Divakaran 1

1 SRI International   2 Rutgers University

Abstract

Current pre-trained language models have lots of knowledge, but a more limited ability to use that knowledge. Bloom's Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to provide proximal context that helps the model answer questions by being relevant to those questions. We show targeting context in this manner improves performance across 4 popular common sense question answering datasets.

1 Introduction

Recent large language models such as GPT-3 (Brown et al., 2020) have made a giant leap forward in knowledge acquisition and even generalize this knowledge to new tasks. But when less narrow tasks are considered they fail to understand as much as these benchmarks suggest. They turn out to be "stochastic parrots" (Bender et al., 2021) or "smart/super parrots" (Dunietz et al., 2020) that just memorize without all of the comprehension we want from a Natural Language Understanding system. We focus on a particular kind of failure mode where the model knows (has memorized) the information it needs, but is not able to apply that information correctly, and we do so in a zero-shot fashion to control for what the model knows.

For example, in Fig. 1 the model is asked if a mixture of grape juice and cranberry juice is safe to drink (Marcus and Davis, 2020). GPT-3 declares that it is a deadly poison, even though it appears to "know" that grape juice and cranberry juice are safe to drink by themselves (Fig. 1, Level 1, dark purple). It even knows that cranberry juice with grape juice is not poisonous, but it still thinks the result is death (Fig. 1, Level 2, light blue). The model has memorized the necessary information from large amounts of text, but does not use its knowledge appropriately.

∗ These two authors contributed equally.

Following Shwartz et al. (2020), we extract this knowledge as explicit language and then feed it back as additional context during inference, forcing the model to use what it already knows, but in our case targeting specifically useful knowledge.

To formalize this distinction we drew inspiration from elementary school classrooms, where teachers (Miller, 2002; Harvey and Goudvis, 2007) have a schema based approach in which they teach children to demonstrate multiple levels of comprehension, making complex inferences and direct recall from memory. They use a hierarchy of comprehension skills called Bloom's Taxonomy (Anderson et al., 2000) (cf. Fig. 1), with memorization at the bottom (requiring children to recall facts), followed by understanding (requiring children to grasp semantics), application (requiring children to solve problems), and more complex skills. For us, these comprehension skills describe ways our language model might fail to use its knowledge.

In this paper we address our failure mode by relying on commonly understood relationships between the skills of Bloom's Taxonomy, which we term proximal context. In order to understand whether the cranberry grape mixture is poisonous, the model needs to remember whether grape juice is poisonous. In order to apply its knowledge to figure out what will happen next, it needs to understand whether the cranberry grape mixture is poisonous or not. In general, the proximal context for a particular task T at level L is given by those tasks implicitly required by T, which are mostly at level L − 1 of the taxonomy. We guide our language model to answer questions more accurately by providing it not just any context, but proximal context.1 In performing zero-shot question answering our language model asks itself additional clarification

1 Proximal context is not defined for level 1 questions, so we only address questions at level 2 or above.



Figure 1: Our approach incorporates context into question answering guided by Bloom’s Taxonomy.

questions, choosing those most likely to result in proximal context.

Our contributions in this paper are:

• We use Bloom’s Taxonomy to choose proximal clarifying context that improves question answering performance using only what the model already knows.

• We show proximal context is better than other levels of context on four different commonsense question answering tasks.

• By observing how different levels of clarification impact our language model we also explain how the model answers questions.

2 Related Works

Question Answering from External Supervision. Several approaches have been proposed to improve question answering by adding external knowledge sources. Recent large pre-trained language models (Peters et al., 2018; Radford et al., 2019; Devlin et al., 2018; Liu et al., 2019; Joshi et al., 2020; Clark et al., 2020) learn general purpose text encoders from a huge text corpus. Petroni et al. (2019) recently used a language model as a knowledge base to unmask a token given an entity and a relation in a predefined template. Shwartz et al. (2020); Bosselut et al. (2019a,b) used pretrained language models to improve zero-shot question answering performance by extracting context from the language model itself, using self-talk or a knowledge graph. We add context via self-talk, with structure provided by Bloom's Taxonomy.

Bloom’s Taxonomy. The original work (Bloom, 1956) defined taxonomies for learning in the cognitive (intellectual), affective (interests, attitudes, values), and psychomotor domains, though the cognitive domain is what we usually refer to today. Almost half a century later the cognitive domain taxonomy was revised (Anderson et al., 2000) to reflect more active thinking and improve usability by adding verbs to describe levels. Teachers use this taxonomy, for example in computer science education (Whalley et al., 2006; Thompson et al., 2008; Oliver et al., 2004), and our inspiration is from this revised version of the cognitive taxonomy. Machine learning has been applied to automatically classify questions (Mohammed and Omar, 2020; Zhang et al., 2021; Nafa et al., 2016) into Bloom's Taxonomy levels, but the taxonomy has not been applied to analyze or improve machine learning models themselves. We use it to help our model think about what it knows.

3 Approach

Our approach builds on the zero-shot question answering approach of Shwartz et al. (2020) to answer questions (Section 3.1) by adding clarifications with self-talk (Section 3.2). We describe this approach, then we use Bloom's Taxonomy to select better clarifications (Section 3.3).

3.1 Question Answering with Language Models

Given a prompt p, a question q, and answer options a_o ∀o ∈ [1, K], we use a pre-trained language model LM to pick the correct answer a_o*.



COPA
  Prefix "What is the definition of" (Level 1)
    Clarification questions: (a) What is the definition of an accident? (b) What is the definition of a flat tire?
    Clarification answers: (a) The definition of an accident is the crash of collusion caused by the vehicle. (b) The definition of a flat tire is that the tire does not hold air.
  Prefix "What is the main purpose of" (Level 2)
    Clarification questions: (a) What is the main purpose of this investigation? (b) What is the main purpose of this post?
    Clarification answers: (a) The purpose of this investigation is to provide information about how and why he was shot. (b) The purpose of this post is to share my thoughts and feelings on his death.

CommonsenseQA
  Prefix "What is the main function of a" (Level 2)
    Clarification questions: (a) What is the main function of a teacher in this area? (b) What is the main function of a farmer?
    Clarification answers: (a) The main function of a teacher in this area is to teach them about life and love. (b) The main function of a farmer is to provide food for his family and the community.
  Prefix "What might have caused" (Level 3)
    Clarification questions: (a) What might have caused this problem? (b) What might have caused the animal to flee?
    Clarification answers: (a) the cause of this problem was that his wife's husband didn't have enough money. (b) The cause of the animal to flee was a predator.

Social IQA
  Prefix "What did [NAME] do?" (Level 1)
    Clarification questions: (a) What did Kendall do? (b) What did Kendall do?
    Clarification answers: (a) What Kendall did was make sure that he wasn't going anywhere else. (b) What Kendall did was so horrible, that it was hard to believe.
  Prefix "How would you describe [NAME]?" (Level 3)
    Clarification questions: (a) How would you describe Riley? (b) How would you describe Riley?
    Clarification answers: (a) Riley is a big brother, he's an awesome dad. (b) Riley is a very sensitive person and has a lot of anxiety.

Winogrande
  Prefix "What are the properties of a" (Level 1)
    Clarification questions: (a) What are the properties of a diet that is not healthy? (b) What are the properties of a home?
    Clarification answers: (a) The property of a diet that is not healthy are that it has high cholesterol (a good idea). (b) The properties of a home are that which makes it comfortable and pleasant for the occupants.
  Prefix "What does it mean to" (Level 2)
    Clarification questions: (a) What does it mean to be an explorer? (b) What does it mean to be sophisticated?
    Clarification answers: (a) Be an explorer means to explore and make sense of things. (b) Be sophisticated means to be classy, elegant and smart.

Table 1: Some of the question prefixes we used for different datasets in our experiments. We assign each prefix a level in Bloom's Taxonomy. We show generated clarification questions and answers for both Distil-GPT2 (a) and GPT-Neo (b) for their corresponding question prefixes.

This approach simply concatenates each (prompt, question, answer) tuple into a single string of text T_o = [p, q, a_o] and feeds this string to the language model to assign each choice a score s_o = LM(T_o). The language model's answer is just the answer with the highest score: o* = argmax_o s_o.
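As a concrete illustration, this scoring step could be implemented roughly as follows with an off-the-shelf causal language model from the transformers library; the helper names and the use of total log-likelihood as the score are our assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# distil-GPT2 is one of the two models used in the paper; GPT-Neo would work the same way.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def lm_score(text: str) -> float:
    """Total log-likelihood of `text` under the language model (higher is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # The returned loss is the mean negative log-likelihood per predicted token.
        mean_nll = model(ids, labels=ids).loss.item()
    return -mean_nll * (ids.shape[1] - 1)

def answer_question(prompt: str, question: str, options: list) -> str:
    """Score T_o = [p, q, a_o] for every option a_o and return the argmax."""
    scores = [lm_score(" ".join([prompt, question, a])) for a in options]
    return options[scores.index(max(scores))]
```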

3.2 Self-talk Clarifications

Self-talk (Shwartz et al., 2020) has a language model ask itself clarifying questions, then answer those questions to generate clarifications.

Stage 1: Ask clarification questions. To produce clarifications we start with a set of clarification question prefixes r_1, ..., r_J that are designed specifically for each question answering dataset. "What happens if" is a sample prefix for the clarifications, shown in Fig. 1, and in Tab. 1 we present examples for all the datasets we use. In this stage the language model completes each of these prefixes, using its generator function LM_G to ask one question R_j = LM_G(r_j) per prefix.

Stage 2: Answer the questions. Next we use the model to answer each of these questions, possibly prompted with an answer prefix b_j corresponding to question prefix r_j. The results are the clarifications c_j = LM_G([R_j, b_j]).

Stage 3: Reconsider with a new prompt. To use the clarifications we pick one from the list then append it to the original prompt. This approach simply considers all combinations of clarification questions and answers T_{j,o} = [p, q, c_j, a_o] ∀o, j, first chooses the clarification which maximizes model score per answer option, then chooses the final answer o* = argmax_o max_j LM(T_{j,o}). This can improve question answering performance on its own, but in the next section we more carefully choose clarifications using our notion of proximal context and Bloom's Taxonomy.
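A minimal sketch of the three stages, reusing `lm_score` from above; the `generate_text` helper and the exact way the clarification string is assembled are our assumptions about one plausible implementation, not the authors' released code.

```python
def generate_text(prefix: str, top_p: float, max_new_tokens: int) -> str:
    """Continue `prefix` with nucleus (top-p) sampling and return the full decoded text."""
    ids = tokenizer(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, top_p=top_p,
                         max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def self_talk_answer(prompt, question, options, question_prefixes, answer_prefixes):
    clarifications = []
    for r_j, b_j in zip(question_prefixes, answer_prefixes):
        # Stage 1: complete the question prefix r_j into a clarification question R_j.
        R_j = generate_text(r_j, top_p=0.2, max_new_tokens=6)
        # Stage 2: answer the clarification question, prompted by the answer prefix b_j.
        c_j = generate_text(" ".join([R_j, b_j]), top_p=0.5, max_new_tokens=10)
        clarifications.append(c_j)
    # Stage 3: score T_{j,o} = [p, q, c_j, a_o] for all pairs; for each option keep its
    # best clarification, then return the option with the highest score overall.
    best_option, best_score = None, float("-inf")
    for a_o in options:
        score = max(lm_score(" ".join([prompt, question, c_j, a_o]))
                    for c_j in clarifications)
        if score > best_score:
            best_option, best_score = a_o, score
    return best_option
```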

3.3 Using Bloom’s Taxonomy to Choose Clarifications with Proximal Context

To test our idea of proximal context we consider the level L of task given by each dataset, then allow only proximal clarifications of level L − 1. We label each question prefix with the level of Bloom's Taxonomy that it falls into, and then force the model to choose from the set C_L of clarifications of level L. This results in a final choice for each level o*_L = argmax_o max_{j in C_L} LM(T_{j,o}). We also provide a Choice Baseline that allows the model to choose any level of clarification, to show the model would have difficulty choosing proximal clarifications itself.
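Restricting the choice to a given taxonomy level is then a small change to Stage 3; the helper below is illustrative only, and the per-prefix level labels it expects are the ones listed in Tab. 3.

```python
def answer_with_level(prompt, question, options, clarifications, clar_levels, level):
    """Like Stage 3, but only clarifications whose prefix was labeled `level` are allowed.

    `clarifications` and `clar_levels` are parallel lists; for proximal context,
    `level` is one below the dataset's question level (e.g. 2 for a level-3 dataset).
    """
    allowed = [c for c, l in zip(clarifications, clar_levels) if l == level]
    best_option, best_score = None, float("-inf")
    for a_o in options:
        score = max(lm_score(" ".join([prompt, question, c, a_o])) for c in allowed)
        if score > best_score:
            best_option, best_score = a_o, score
    return best_option
```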



Note that the annotation of questions along Bloom's Taxonomy requires special skills typically found only among educators. While a layperson can be trained to annotate such questions, our experience was that it takes much more time than we could afford for a preliminary study such as this one. We therefore relied on our co-author, Sara Rutherford-Quach, who is a researcher at SRI's Education Division and has also worked as a teacher at the kindergarten-elementary level, to provide us the annotations. Two other co-authors, Sahu and Cogswell, went through those annotations and made sure that each label had a three-way consensus among Rutherford-Quach, Sahu and Cogswell. There might be some ambiguity about which level a particular prefix fits into, but this is also true of other applications of the taxonomy (Thompson et al., 2008). In future work, we plan to carry out a more rigorous annotation with more than one skilled annotator so we can measure inter-annotator agreement through measures such as Kappa scores.

4 Experiments

4.1 Datasets

We evaluate our approach on four datasets that can each be thought of in terms of multiple choice question answering, all measuring some kind of common sense: COPA (Roemmele et al., 2011) measures common sense causal reasoning, CommonSenseQA (Talmor et al., 2019) asks questions that require prior knowledge, Social IQA (Sap et al., 2019) asks about social common sense, and WinoGrande (Sakaguchi et al., 2020) adversarially measures semantic common sense. Perhaps surprisingly, all of the datasets we used asked questions that fell into just one level of the taxonomy (Tab. 2). These datasets do focus on very specific problems, but the result is still disappointing because it would be more useful to see variations in both task and clarification level. It may be interesting to develop datasets that can better express the range of abilities described by Bloom's Taxonomy.

4.2 Language Model

We use distil-GPT2 (Sanh et al., 2019) and the publicly released GPT-Neo 2.7B (Black et al., 2021) (based on EleutherAI's replication of the GPT-3 architecture) as the language models throughout our experiments. Our clarification question prefixes and hyperparameter settings for both models are taken from Shwartz et al. (2020).

Table 2: Question answering accuracy and std. dev. using different levels of clarification over multiple clarification samples. Results on the dev sets of each dataset. (* = level of proximal context w.r.t. the dataset)

Winogrande (1267 total) (2: Understand)
  Distil-GPT2 (235±5 valid)
    0A: Choice Baseline   53.2 ± 1.8
    1A: Remember*         54.7 ± 3.6
    2A: Understand        52.5 ± 3.1
  GPT-Neo (1230±7 valid)
    0A: Choice Baseline   54.62 ± 0.5
    1A: Remember*         54.77 ± 0.5
    2A: Understand        54.76 ± 0.3

SocialIQA (1954 total) (3: Apply)
  Distil-GPT2 (58±5 valid)
    0B: Choice Baseline   44.5 ± 0.1
    1B: Remember          43.7 ± 2.1
    2B: Understand*       48.0 ± 1.1
    3B: Apply             44.4 ± 1.8
  GPT-Neo (1334±9 valid)
    0B: Choice Baseline   48.74 ± 0.4
    1B: Remember          47.31 ± 0.1
    2B: Understand*       48.44 ± 0.5
    3B: Apply             48.1 ± 0.1

COPA (100 total) (3: Apply)
  Distil-GPT2 (11±2 valid)
    0C: Choice Baseline   54.9 ± 0.9
    1C: Remember          46.0 ± 14.7
    2C: Understand*       53.1 ± 12.5
    3C: Apply             40.8 ± 15.2
  GPT-Neo (96±0 valid)
    0C: Choice Baseline   70.83 ± 0.0
    1C: Remember          65.62 ± 0.0
    2C: Understand*       70.83 ± 1.4
    3C: Apply             70.83 ± 0.0

CommonsenseQA (1221 total) (3: Apply)
  Distil-GPT2 (68±1 valid)
    0D: Choice Baseline   29.9 ± 2.7
    1D: Remember          26.5 ± 3.3
    2D: Understand*       28.1 ± 1.2
    3D: Apply             25.6 ± 3.4
  GPT-Neo (1118±4 valid)
    0D: Choice Baseline   40.59 ± 3.6
    1D: Remember          38.00 ± 6.0
    2D: Understand*       43.19 ± 0.2
    3D: Apply             42.30 ± 0.8

For each question prefix, we generate 5 clarification questions using nucleus sampling threshold probability p = 0.2 and adding at most 6 words to the clarification question prefix. We then generate 10 answers to each clarification question using p = 0.5 and maximum answer length 10. Some changes were necessary to accurately measure the impact of clarification level. Instead of always including no clarification as a choice, we do not allow this option, as it defeats our goal of measuring clarification level impact. Furthermore, we do not use the clarification questions which were manually completed without input from the model (as in COPA and Winogrande).
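For reference, these sampling settings could be collected in one place as below; the values are the ones stated above, while the dictionary layout and the tokens-for-words approximation are our own.

```python
SELF_TALK_GENERATION = {
    "clarification_question": {
        "num_samples": 5,      # 5 clarification questions per prefix
        "top_p": 0.2,          # nucleus sampling threshold
        "max_new_tokens": 6,   # stands in for "at most 6 words added to the prefix"
    },
    "clarification_answer": {
        "num_samples": 10,     # 10 answers per clarification question
        "top_p": 0.5,
        "max_new_tokens": 10,  # maximum answer length 10
    },
}
```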

In order to compare performance across different levels of clarifications we only consider examples where the model was able to generate at least one clarification from each level. To increase the number of viable examples we found it necessary to remove some restrictions relative to the implementation of Shwartz et al. (2020). In particular, we kept all clarifications that had no overlapping words with the context and did not allow the model to choose the "no clarification" option. Even with these constraints it was still often the case that distil-GPT2 could not generate a short clarification sentence that was plausible enough to use, whereas GPT-Neo was able to generate clarifications for almost the entire dataset. This indicates larger scale models may be more able to take advantage of clarifying questions. The number of examples with valid clarifications for all levels is indicated for each model in column 2 of Tab. 2. These changes help us more accurately measure the impact of Bloom's Taxonomy, but mean our approach is not directly comparable to Shwartz et al. (2020).

4.3 Results

Table 2 reports the performance of our Bloom's Taxonomy infused zero-shot question answering method. Each row shows question answering accuracy for a particular dataset and level of clarification. If our hypothesis is correct then the level of available clarifications should matter, and clarifications that provide proximal context (one level below the dataset level) should be most helpful.

Clarification Level Makes a Difference. All levels of clarification questions and answers provide some amount of extra information that changes how a language model processes the entire string it is presented with. This is often helpful information, but it may be that all levels of Bloom's Taxonomy provide equally useful information. We find that is not the case. Different levels of clarification help more or less, as evidenced by the large gap between minimum and maximum accuracy for each dataset. Furthermore, when the model can choose any clarification (rows 0A/B/C/D) it either does a worse job than proximal context or its performance is similar to proximal context, so enforcing a particular kind of context should be helpful.

Proximal Context Helps Most. Proximal context, as we've defined it with respect to Bloom's Taxonomy, is context from the clarification level directly below the dataset question level. The proximal clarification level for each dataset is marked by a * in Tab. 2. In all cases proximal clarifications are better than using clarifications of a lower level. For the datasets that ask level 3 questions the proximal (level 2) clarifications also outperform level 1 clarifications (2B/C/D greater than 1B/C/D). Proximal clarifications are also about as good as or better than using clarifications of a higher level. You can see this for Winogrande by noting row 1A is greater than 2A, and for the other datasets by noting rows 2B/C/D usually have greater performance than 3B/C/D. Overall, proximal context is most consistent in efficacy.

4.4 Qualitative Results

In Tab. 1 we show samples of question answer pairs generated for each model, and in Tab. 4 of the appendix we show complete examples (with context and choices) for each model and dataset. GPT-Neo is much larger than distil-GPT2 and is expected to generalize to slightly new tasks like the clarification generation task better than the smaller model. This expectation is clearly met by the observed quality of clarifications. Distil-GPT2 clarification questions and answers often do not have meaningful semantics, are not correct, or are not relevant. GPT-Neo is much more likely to generate questions and answers which are meaningful, correct, and relevant. This suggests the greater number of valid clarifications generated by GPT-Neo may be due to an increase in clarification quality. Furthermore, it fails in an intuitive fashion: when it fails to generate meaningful answers it often has also failed to generate a meaningful clarification question in the first place.

Also note that the performance differences observed for distil-GPT2 occur despite its relatively poor interpretability. This indicates that context which is somewhat relevant to the topic, even if it does not precisely make sense, can still be useful.

5 Conclusion

Large pre-trained language models sometimes have the right information, but they just do not know how to use it. We used Bloom's Taxonomy to pick questions with the right amount of proximal context. This helped the language models use their knowledge to more effectively answer questions. In the future we would like to extend our work to tasks that present a wide range of questions that fall under different levels of the taxonomy. Similarly, we also would like to study and improve upon the current limited set of prefix questions used.

Acknowledgments

The authors thank Yunye Gong, Stephanie Nunn, and the anonymous reviewers for the helpful discussions and comments.

References

L. Anderson, D. Krathwohl, and B. Bloom. 2000. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives.

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.

Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow.

B. Bloom. 1956. Taxonomy of educational objectives: The classification of educational goals.

Antoine Bosselut, Ronan Le Bras, and Yejin Choi. 2019a. Dynamic neuro-symbolic knowledge graph construction for zero-shot commonsense question answering. arXiv preprint arXiv:1911.03876.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019b. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.

T. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, J. Clark, Christopher Berner, Sam McCandlish, A. Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jesse Dunietz, Greg Burnham, Akash Bharadwaj, Jennifer Chu-Carroll, Owen Rambow, and D. Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In ACL.

Stephanie Harvey and Anne Goudvis. 2007. Strategies that work: Teaching comprehension for understanding and engagement. Stenhouse Publishers.

Mandar Joshi, Kenton Lee, Yi Luan, and Kristina Toutanova. 2020. Contextualized representations using textual encyclopedic knowledge. arXiv preprint arXiv:2004.12006.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Gary Marcus and Ernest Davis. 2020. GPT-3, bloviator: OpenAI's language generator has no idea what it's talking about. MIT Technology Review.

Debbie Miller. 2002. Reading with meaning: Teaching comprehension in the primary grades. Stenhouse Publishers.

Manal Mohammed and N. Omar. 2020. Question classification based on Bloom's taxonomy cognitive domain using modified TF-IDF and word2vec. PLoS ONE, 15.

Fatema Nafa, S. Othman, and J. Khan. 2016. Automatic concepts classification based on Bloom's taxonomy using text analysis and the naïve Bayes classifier method. In CSEDU.

D. Oliver, Tony Dobele, Myles Greber, and Tim S. Roberts. 2004. This course has a Bloom rating of 3.9. In ACE.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34-05, pages 8732–8740.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.


Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4453–4463.

Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4615–4629.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.

E. Thompson, Andrew Luxton-Reilly, J. Whalley, M. Hu, and Phil Robbins. 2008. Bloom's taxonomy for CS assessment. In ACE '08.

J. Whalley, R. Lister, E. Thompson, T. Clear, Phil Robbins, P. Kumar, and C. Prasad. 2006. An Australasian study of reading and comprehension skills in novice programmers, using the Bloom and SOLO taxonomies.

James Zhang, Casey Wong, Nasser Giacaman, and Andrew Luxton-Reilly. 2021. Automated classification of computing education questions using Bloom's taxonomy. Australasian Computing Education Conference.

A Prefixes and Examples

In the appendix we provide more details about the question prefixes we used in Tab. 3 and provide more examples of outputs from our models in Tab. 4.



Table 3: All the prefix questions with their corresponding taxonomy levels used in our zero-shot question answering evaluation.

CommonsenseQA & COPA (Question Prefix | Answer Prefix | Level)
  What is the definition of | The definition of is | 1
  What is the main purpose of | The purpose of is to | 2
  What is the main function of a | The main function of a is | 2
  What are the properties of a | The properties of a are that | 1
  What is a | is | 1
  What happened as a result of | As a result of , | 3
  What might have caused | The cause of was | 3

SocialIQA (Question Prefix | Answer Prefix | Level)
  What will [NAME] want to do next? | [NAME] wanted | 3
  What will [NAME] want to do after? | [NAME] wanted | 3
  How would [NAME] feel afterwards? | [NAME] felt | 3
  How would [NAME] feel as a result? | [NAME] felt | 3
  How would [NAME] feel after? | [NAME] felt | 3
  How would you describe [NAME]? | [NAME] is a | 2
  What kind of person is [NAME]? | [NAME] is a | 2
  How would you describe [NAME] as a person? | [NAME] is a | 2
  Why did [NAME] do that? | [NAME] did this because they wanted | 3
  Why did [NAME] do this? | [NAME] did this because they wanted | 3
  Why did [NAME] want to do this? | [NAME] did this because they wanted | 3
  What does [NAME] need to do beforehand? | Before doing that, [NAME] first had to | 2
  What does [NAME] need to do before? | Before doing that, [NAME] first had to | 2
  What does [NAME] need to do before this? | Before doing that, [NAME] first had to | 2
  What did [NAME] need to do before this? | Before doing that, [NAME] first had to | 2
  What will happen to [NAME]? | [NAME] | 3
  What will happen to [NAME] next? | [NAME] | 3
  What will [NAME] do next? | [NAME] | 3
  What did [NAME] do? | What [NAME] did was | 1

Winogrande (Question Prefix | Answer Prefix | Level)
  What is the definition of | The definition of is | 1
  What is the main purpose of | The purpose of is to | 2
  What is the main function of a | The main function of a is | 2
  What are the properties of a | The properties of a are that | 1
  What is | is | 1
  What does it mean to | means | 2



Table 4: Example contexts, questions, choices, clarification questions and clarification answers for each dataset. We present results for both Distil-GPT2 and GPT-Neo.

COPA Generated Clarifications

Context: The man turned on the faucet. | Question: effect | Choices: [The toilet filled with water., Water flowed from the spout.]
  Distil-GPT2:
    Q: What might have caused it to be so hot?
    A: The cause of it to be so hot was that he would have a big fire in his face and then immediately had an explosion, which killed all but one person.
  GPT-Neo:
    Q: What happened as a result of that?
    A: As a result of that, he was forced to drink water.

Context: The woman retired. | Question: effect | Choices: [She received her pension., She paid off her mortgage.]
  Distil-GPT2:
    Q: What happened as a result of this?
    A: As a result of this, the government has announced that it will pay Rs 7 crore to her family for compensation and legal expenses related thereto as part-time payments.
  GPT-Neo:
    Q: What happened as a result of her retirement?
    A: As a result of her retirement, she had been receiving a pension for life.

Context: I wanted to conserve energy. | Question: effect | Choices: [I swept the floor in the unoccupied room., I shut off the light in the unoccupied room.]
  Distil-GPT2:
    Q: What is the main function of a plant?
    A: The main function of a plant is the growth and maintenance of nutrients, so that we can increase our food intake by 10% or more per year, he said.
  GPT-Neo:
    Q: What might have caused this?
    A: The cause of this was the lack of electricity in my home.

Winogrande Generated Clarifications

Sentence: The home that my parents had when I was in school was a lot nicer than my house now because the _ was sophisticated. | Context: home | Choices: [home, house]
  Distil-GPT2:
    Q: What is the main function of a building?
    A: The main function of a building is to keep people happy and healthy.
  GPT-Neo:
    Q: What are the properties of a home?
    A: The properties of a home are that which makes it comfortable and pleasant for the occupants.

Sentence: He had enough time between classes to go to a cafe or to the library. He went to the _ because his paper could wait. | Context: classes | Choices: [cafe, library]
  Distil-GPT2:
    Q: What is the definition of a book?
    A: The definition of a book is an article written in order for it be read.
  GPT-Neo:
    Q: What is the definition of cafe?
    A: cafe is defined as a place where people gather for refreshment.

Sentence: They had to eat a lot to gain the strength they had lost and be able to work, the _ was too much. | Context: strength | Choices: [work, strength]
  Distil-GPT2:
    Q: What is the main function of a good diet?
    A: The main function of a good diet is not for food or health but also on exercise.
  GPT-Neo:
    Q: What is the definition of strength?
    A: strength is defined as : having power or ability to do something.

SocialIQA Generated Clarifications

Context: Kendall got a new sports car and could not wait to show friends. | Question: What will Kendall want to do next? | Choices: [drive that sports car, show off his new sports car, clean and wax her legs]
  Distil-GPT2:
    Q: Why did Kendall want to do this?
    A: Kendall did this because they wanted her, too.
  GPT-Neo:
    Q: What will Kendall want to do next?
    A: Kendall wanted to impress them with his new ride."

Context: Kendall was really into house plants and often pour water in to jars so the fluoride would evaporate out. | Question: What will Kendall want to do next? | Choices: [watered a few of the plants, gather jars, watered a few of her pets]
  Distil-GPT2:
    Q: Why did Kendall do that?
    A: Kendall did this because they wanted him at home, he said.
  GPT-Neo:
    Q: Why did Kendall do this?
    A: Kendall did this because they wanted to get a lot of water for their plants.

Context: Remy gave birth to a baby girl on a Saturday afternoon. | Question: What does Remy need to do before this? | Choices: [leave her husband, did this to start a family, go to the hospital]
  Distil-GPT2:
    Q: What did Remy need to do before this?
    A: Before doing that, Remy first had to tell her he was too old.
  GPT-Neo:
    Q: What does Remy need to do before?
    A: Before doing that, Remy first had to go to the hospital to deliver her baby.

CommonsenseQA Generated Clarifications

Context: In what Spanish speaking North American country can you get a great cup of coffee? | Question: cup of coffee | Choices: [mildred's coffee shop, mexico, diner, kitchen, canteen]
  Distil-GPT2:
    Q: What is the definition of coffee?
    A: The definition of coffee is the word that comes from Latin for espresso.
  GPT-Neo:
    Q: What is the definition of mexico?
    A: mexico is defined as the country that produces the best coffee.

Context: Unlike a spider and his many sight seers, people only have what? | Question: people | Choices: [tongues, names, brains, feelings, two eyes]
  Distil-GPT2:
    Q: What is the main purpose of this article?
    A: The purpose of this article is to explain the basic characteristics that are characteristic of spiders.
  GPT-Neo:
    Q: What is the definition of two eyes?
    A: two eyes is defined as the organs by which we see.

Context: The artist was sitting quietly pondering, then suddenly he began to paint when what struck him? | Question: sitting quietly | Choices: [sadness, anxiety, inspiration, discomfort, insights]
  Distil-GPT2:
    Q: What is a man who can do that?
    A: Man who can do that is a really great painter.
  GPT-Neo:
    Q: What is the definition of inspiration?
    A: inspiration is defined as a sudden flash of inspiration that comes from within.