
MULTIMODAL LEARNING FOR CLASSROOM ACTIVITY DETECTION

Hang Li, Yu Kang, Wenbiao Ding, Song Yang, Songfan Yang, Gale Yan Huang, Zitao Liu∗

TAL AI Lab, TAL Education Group, Beijing, China

ABSTRACT

Classroom activity detection (CAD) focuses on accurately classifying whether the teacher or a student is speaking and recording the length of individual utterances during a class. A CAD solution helps teachers get instant feedback on their pedagogical instruction. This improves educators’ teaching skills and hence contributes to students’ achievement. However, CAD is very challenging because (1) the CAD model needs to generalize well across different teachers and students; (2) data from the vocal and language modalities has to be fused wisely so that they complement each other; and (3) the solution shouldn’t heavily rely on additional recording devices. In this paper, we address the above challenges with a novel attention-based neural framework. Our framework not only extracts both speech and language information, but also utilizes an attention mechanism to capture long-term semantic dependence. Our framework is device-free and is able to take any classroom recording as input. The proposed CAD learning framework is evaluated in two real-world education applications. The experimental results demonstrate the benefits of our approach in learning an attention-based neural network from classroom data with different modalities, and show that our approach outperforms state-of-the-art baselines in terms of various evaluation metrics.

Index Terms— Multimodal Learning, Classroom Activity Detection, K-12 Education

1. INTRODUCTION

Teacher-student interaction analysis in live classrooms, with the goal of accurately quantifying classroom activities such as lecturing and discussion, is crucial for student achievement [1, 2, 3, 4, 5, 6, 7]. It not only provides students the opportunity to work through their understanding and learn from the ideas of others, but also gives teachers epistemic feedback on their instruction, which is important for crafting their teaching skills [8, 9, 10, 11]. Such analysis usually takes a classroom recording as input (e.g., as collected by an audio or video recorder) and outputs pedagogical annotations of classroom activities.

∗The corresponding author. Email: [email protected].

The majority of current practices of classroom dialogic analysis are logistically complex and expensive, requiring observer rubrics, observer training, and continuous assessment to maintain a pool of qualified observers [12, 10]. Even with performance support tools developed by Nystrand and colleagues, including the live coding software CLASS 4.25 [1, 13], it still requires approximately 4 hours of coding time per 1 hour of classroom observation. This is an unsustainable task for scalable research, let alone for providing day-to-day feedback for teacher professional development.

In this work, we focus on the fundamental and classic classroom activity detection (CAD) problem and aim to automatically distinguish whether the teacher or a student is speaking, and to record both the length of individual utterances and the total amount of talk during a class. An example annotation trace for a class is illustrated in Figure 1. The CAD results on identified activity patterns during lessons give valuable information about the quantity and distribution of classroom talk, and therefore help teachers improve their interactions with students so as to improve student achievement.

Fig. 1. A graphical illustration of CAD results in a sample class (teacher talk, student talk, no voice or noise). The x-axis represents time within the one-class recording, in minutes (0–40).

A large spectrum of models has been developed and successfully applied to CAD problems [14, 2, 15, 8]. However, CAD in real-world scenarios poses numerous challenges. First, vocal information alone is usually not enough for the CAD task due to the multimodal classroom environment. The teacher’s voice might be very close to some student’s voice, which poses a hard modeling problem, since existing well-developed approaches either focus on identifying each individual speaker in a clean environment or rely on extra recording devices for such activity detection. Second, teacher-student conversations in real-world classrooms are very casual and open-ended; it is difficult to capture the latent semantic information, and how to model the long-term sentence-level dependence remains a big concern. Third, the CAD solution should be flexible and should not rely on additional recording devices, such as portable microphones used solely to collect teacher audio.

In this paper we study and develop a novel solution to CAD problems that is applicable to, and can learn neural network models from, the real-world multimodal classroom environment. More specifically, we present an attention-based multimodal learning framework which (1) fuses the multimodal information by attention-based networks such that the teachers’ or students’ semantic ambiguities can be alleviated by vocal attention scores; and (2) directly learns from and predicts on classroom recordings, without relying on any additional recording device for teachers.

2. RELATED WORK

There is a long research history on the use of audio (and video) to study instructional practices and student behaviors in live classrooms, and many approaches and schemes have been designed for CAD annotations for different purposes [14, 2, 15, 8]. For example, Owens et al. develop Decibel Analysis for Research in Teaching (DART), which analyzes the volume and variance of classroom recordings to predict the quantity of time spent on single-voice (e.g., lecture), multiple-voice (e.g., pair discussion), and no-voice (e.g., clicker question thinking) activities [2]. Cosbey et al. improve the DART performance by using deep and recurrent neural network (RNN) architectures [15]; a comprehensive experimental comparison of deep neural networks (DNNs), gated recurrent units (GRUs), and RNNs is presented. Mu et al. present the Automatic Classification of Online Discussions with Extracted Attributes (ACODEA) framework for fully automatic segmentation and classification of online discussions [14]. ACODEA focuses on learners’ acquisition of argumentation knowledge and categorizes content based on micro-argumentation dimensions such as Claim, Ground, Warrant, Inadequate Claim, etc. [16]. Donnelly et al. aim to provide teachers with formative feedback on their instruction by training Naive Bayes models to identify occurrences of some key instructional segments, such as Question & Answer, Procedures and Directions, Supervised Seatwork, etc. [8].

The closest related work is the research by Wang et al. [4], who conduct CAD using the LENA system [5] and identify three discourse activities: teacher lecturing, class discussion, and student group work. Our work differs from Wang et al. in that we develop a novel attention-based multimodal neural framework to conduct CAD tasks in a real-world, device-free environment, whereas Wang et al. need teachers to wear the LENA system during the entire teaching process and use differences in volume and pitch to assess when teachers or students are speaking.

Please note that CAD is different from classic speaker verification [17, 18, 19] and speaker diarization [20] in that (1) there is no enrollment-verification two-stage process in CAD tasks; and (2) not every speaker needs to be identified.

3. OUR APPROACH

3.1. Problem Statement

Let $S$ be the sequence of segments of a classroom recording, i.e., $S = \{s_i\}_{i=1}^{N}$, where $s_i$ denotes the $i$th segment and $N$ is the total number of segments. Let $Y$ be the corresponding label sequence, i.e., $Y = \{y_i\}_{i=1}^{N}$, where each $y_i$ represents the classroom activity type, i.e., whether the segment is spoken by a student or a teacher. For each segment $s_i$, we extract both the acoustic feature $x_i^a \in \mathbb{R}^{d_a \times 1}$ and the text feature $x_i^t \in \mathbb{R}^{d_t \times 1}$, where $d_a$ and $d_t$ are the dimensionalities of $x_i^a$ and $x_i^t$. Let $X^a$ and $X^t$ be the acoustic and text feature matrices of sequence $S$, i.e., $X^a \in \mathbb{R}^{N \times d_a}$ and $X^t \in \mathbb{R}^{N \times d_t}$. With the aforementioned notations and definitions, we can now formally define the CAD problem as a sequence labeling problem:

Given a classroom recording segment sequence $S$ and the corresponding acoustic and text feature matrices $X^a$ and $X^t$, the goal is to find the most probable classroom activity type sequence $\hat{Y}$ as follows:

$\hat{Y} = \arg\max_{Y \in \mathcal{Y}} P(Y \mid X^a, X^t)$

where $\mathcal{Y}$ is the collection of all possible labeling sequences and $\hat{Y}$ is the predicted classroom activity type sequence.
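For concreteness, here is a minimal sketch of the data layout this formulation implies for a single class recording (the array names and dimensionalities below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# One class recording with N segments; d_a and d_t depend on the chosen
# pre-trained acoustic and text encoders (illustrative sizes only).
N, d_a, d_t = 700, 256, 300

X_a = np.zeros((N, d_a), dtype=np.float32)  # acoustic features, one row per segment
X_t = np.zeros((N, d_t), dtype=np.float32)  # text features, one row per segment
Y   = np.zeros(N, dtype=np.int64)           # activity labels, e.g., 0 = Student, 1 = Teacher

# CAD as sequence labeling: a model maps (X_a, X_t) to a length-N label sequence,
# i.e., Y_hat = argmax_Y P(Y | X^a, X^t).
```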

3.2. The Proposed Framework

3.2.1. Multimodal Attention Layer

In order to capture information from both the vocal and language modalities in the classroom environment, we design a novel multimodal attention layer that is able to alleviate language ambiguity by using a voice attention mechanism. The majority of classroom conversations are open-ended, and it is very difficult to distinguish a segment’s activity type when considering only the sentence itself. Furthermore, not every contextual segment contributes equally to the labeling task, especially when the context is a mix of segments from teachers and students. Therefore, we use acoustic information as a complementary resource. More specifically, for each segment $s_i$, we utilize not only its own acoustic and language information but also the contextual information within the entire classroom recording. Moreover, the contextual segments are automatically weighted by the voice attention scores, which aim to cluster segments from the same subject (teacher or student), as illustrated in Figure 2(a).

As shown in Figure 2(a), each segment is first treated as a query, and we compute attention scores between the query and all the remaining segments using the acoustic features. We choose the standard scaled dot-product as our attention function [21]. The scaled dot-product scores can be

Fig. 2. (a) Multimodal attention layer. (b) The proposed neural framework.

viewed as a substitute for the cosine similarities between voice embedding vectors, which have been commonly used as a calibration for the acoustic similarity between different speakers’ utterances [17, 20]. After that, we compute the multimodal representation by applying the attention scores to the contextual language features. The above process can be concisely expressed in matrix form as $Q = K = X^a$ and $V = X^t$, where $Q$, $K$ and $V$ are the query, key and value matrices in the standard attention framework [21].

Both $X^a$ and $X^t$ come from pre-trained models. The acoustic feature $X^a$ is obtained from the state-of-the-art LSTM-based speaker embedding network proposed by Wan et al. [17]. The d-vector generated by this network has been proven effective in representing the voice characteristics of different speakers in speaker verification and speaker diarization problems [20]. The language feature $X^t$ comes from the word embedding network proposed by Mikolov et al. [22], which is widely used in many neural language understanding models [21, 23]. In practice, to achieve better performance on the classroom-specific datasets, we also fine-tune the pre-trained representations with linear transformation operators, i.e., $Q = K = X^a W^a$ and $V = X^t W^t$, where $W^a \in \mathbb{R}^{d_a \times d_q}$ and $W^t \in \mathbb{R}^{d_t \times d_v}$ are the linear projection matrices.

The attention score matrix $A \in \mathbb{R}^{N \times N}$ is computed through the dot product of $Q$ and $K$, and the softmax function is then used to normalize the score values. Finally, the output embedding matrix $H^a$ is calculated as the dot product of $A$ and the value matrix $V$. The complete equation is as follows:

$H^a = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_q}}\right)V$

With our multimodal attention layer, the acoustic features $X^a$ serve as a bridge that connects the scattered semantic features $X^t$ across different segments.
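A minimal PyTorch-style sketch of this multimodal attention layer, assuming row-wise segment features and the linear projections $W^a$ and $W^t$ described above (module and variable names are ours, not from the authors' code):

```python
import math
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    """Acoustic features define the query/key space; text features supply the values."""
    def __init__(self, d_a, d_t, d_q, d_v):
        super().__init__()
        self.W_a = nn.Linear(d_a, d_q, bias=False)   # fine-tuning projection for X^a -> Q = K
        self.W_t = nn.Linear(d_t, d_v, bias=False)   # fine-tuning projection for X^t -> V
        self.d_q = d_q

    def forward(self, X_a, X_t):
        # X_a: (N, d_a), X_t: (N, d_t) for one class recording
        Q = K = self.W_a(X_a)                        # (N, d_q)
        V = self.W_t(X_t)                            # (N, d_v)
        scores = Q @ K.t() / math.sqrt(self.d_q)     # scaled dot-product, (N, N)
        A = torch.softmax(scores, dim=-1)            # voice attention scores
        H_a = A @ V                                  # fused multimodal embeddings, (N, d_v)
        return H_a, A                                # A is reused later by the regularizer R_alpha
```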

3.2.2. The Overall Architecture

By utilizing the above multimodal attention layer, we are able to learn the fused multimodal embeddings for each segment. Similar to [24], we add a residual connection in the multimodal attention block by concatenating $h_i^a$ with $x_i^t$, i.e., $h_i^c = [h_i^a; x_i^t]$, and $H^c \in \mathbb{R}^{N \times 2d_v}$ is the matrix form of all $h_i^c$'s. Furthermore, to better capture the contextually sequential information within the entire classroom recording, we integrate the position information of the sequence by using a bi-directional LSTM (BiLSTM) layer after the residual layer [23]. We denote the hidden representation of the BiLSTM as $H^b$, where $H^b = \text{BiLSTM}(H^c) \in \mathbb{R}^{N \times 2d_b}$ and $2d_b$ is the size of the BiLSTM hidden vector. Finally, we use a two-layer fully-connected position-wise feed-forward network to produce the final predictions, i.e., $\hat{Y} = \text{softmax}(\text{FCN}(H^b))$, where $\text{softmax}(\cdot)$ denotes the softmax function and $\text{FCN}(\cdot)$ denotes the fully-connected network. The entire framework is illustrated in Figure 2(b).
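Continuing the sketch above, the overall architecture can be assembled roughly as follows, with layer sizes mirroring Section 4.1 (this is our reconstruction under the stated assumptions, not the authors' implementation; the BiLSTM input size is written as $d_v + d_t$ to cover the general case of the concatenation):

```python
import torch
import torch.nn as nn
# Reuses MultimodalAttention from the previous sketch.

class CADModel(nn.Module):
    def __init__(self, d_a, d_t, d_q=64, d_v=64, d_b=100, n_classes=2):
        super().__init__()
        self.attn = MultimodalAttention(d_a, d_t, d_q, d_v)
        self.bilstm = nn.LSTM(input_size=d_v + d_t, hidden_size=d_b,
                              bidirectional=True, batch_first=True)
        # Two-layer position-wise feed-forward network (128 units, then n_classes).
        self.ffn = nn.Sequential(nn.Linear(2 * d_b, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, X_a, X_t):
        H_a, A = self.attn(X_a, X_t)              # (N, d_v), (N, N)
        H_c = torch.cat([H_a, X_t], dim=-1)       # residual-style concat h_i^c = [h_i^a; x_i^t]
        H_b, _ = self.bilstm(H_c.unsqueeze(0))    # (1, N, 2*d_b)
        logits = self.ffn(H_b.squeeze(0))         # (N, n_classes); softmax is applied in the loss
        return logits, A
```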

In our multimodal learning framework, we use the binary cross-entropy loss $L_c$ to optimize prediction accuracy, defined as $L_c = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$, where $p_i$ is the prediction probability for segment $s_i$.

Besides that, we also want to optimize the multimodal attention scores with the existing label information. Therefore, we introduce an attention score regularization term $R_\alpha$, which penalizes the similarity scores between two segments when they come from different activity types. $R_\alpha$ is defined as $R_\alpha = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{ij} \cdot \mathbb{1}\{y_i \neq y_j\}$, where $\alpha_{ij}$ is the $(i, j)$th element of $A$ and represents the attention score between $s_i$ and $s_j$, and $\mathbb{1}\{\cdot\}$ is the indicator function.

Therefore, the final loss function $L$ in our multimodal learning framework is $L = L_c + \beta R_\alpha$, where $\beta$ is a hyperparameter selected (in all experiments) by internal cross-validation while optimizing the models' predictive performance.
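A hypothetical training objective under the notation above; the attention score matrix returned by the model is penalized exactly as $R_\alpha$ prescribes (this is a sketch, not the authors' code):

```python
import torch
import torch.nn.functional as F

def cad_loss(logits, A, y, beta=10.0):
    """L = L_c + beta * R_alpha for one class recording.
    logits: (N, n_classes), A: (N, N) attention scores, y: (N,) labels in {0, 1}."""
    # Cross-entropy term L_c (the binary loss, written via the 2-class softmax output).
    L_c = F.cross_entropy(logits, y, reduction='sum')
    # Attention regularizer: penalize attention between segments of different activity types.
    diff = (y.unsqueeze(0) != y.unsqueeze(1)).float()   # indicator 1{y_i != y_j}, shape (N, N)
    R_alpha = (A * diff).sum()
    return L_c + beta * R_alpha
```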

4. EXPERIMENTS

4.1. Experimental Setup

In the following experiments, we first feed each classroom recording to a publicly available third-party online ASR transcription service to (1) generate the unlabeled segment sequence; (2) filter out silence or noise segments; and (3) obtain the raw sentence transcriptions. Then, similar to [20], each segment-level audio signal is transformed into frames of width 25 ms with a step of 10 ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame. After that, we build sliding windows of a fixed length (240 ms) and step (120 ms) on these frames. We run the pre-trained acoustic neural network on each window and compute the acoustic features $x_i^a$ by averaging these window-level embeddings. Similarly, the text features $x_i^t$ are generated from a pre-trained word embedding network.
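A rough sketch of this segment-level feature pipeline; librosa is our choice for the filterbank step, and `acoustic_encoder` / `word_embedding` are placeholders for the pre-trained networks, with simple mean pooling over ASR tokens as one assumed way to form a segment-level text feature:

```python
import numpy as np
import librosa

def segment_acoustic_feature(wav, sr, acoustic_encoder):
    # 25 ms frames with a 10 ms step, 40-dim log-mel-filterbank energies per frame.
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), n_mels=40)
    logmel = np.log(mel + 1e-6).T                        # (num_frames, 40)
    # Sliding windows of 240 ms (24 frames) with a 120 ms step (12 frames).
    windows = [logmel[s:s + 24] for s in range(0, len(logmel) - 23, 12)]
    # Average the window-level embeddings produced by the pre-trained acoustic network.
    embeddings = [acoustic_encoder(w) for w in windows]  # each -> (d_a,) d-vector-style embedding
    return np.mean(embeddings, axis=0)

def segment_text_feature(tokens, word_embedding):
    # Placeholder pooling: average pre-trained word embeddings over the ASR transcript tokens.
    return np.mean([word_embedding(t) for t in tokens], axis=0)
```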

The projected dimension $d_q$ and the number of BiLSTM hidden units $d_b$ are set to 64 and 100, respectively. The numbers of neurons in the final two-layer fully-connected network are 128 and 2. We use ReLU as the activation function. We set $\beta$ to 10 and use the ADAM optimizer with a learning rate of 0.001. We set the batch size and the number of training epochs to 64 and 20, respectively.

We compare the different methods by accuracy and F1 score. Different from standard classification evaluation, in which all examples carry equal weight, in CAD tasks the lengths of different segments vary a lot. Therefore, the evaluation results are weighted by the time span of each segment. The weight is computed as the proportion of each segment's duration over the sum of all segment durations in the class recording.
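A small sketch of this duration-weighted evaluation, assuming per-segment durations in seconds; scikit-learn's `sample_weight` argument supports this kind of weighting directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def weighted_cad_metrics(y_true, y_pred, durations, teacher_label=1, student_label=0):
    # Weight each segment by its share of the total segment duration in the recording.
    w = np.asarray(durations, dtype=float)
    w = w / w.sum()
    acc  = accuracy_score(y_true, y_pred, sample_weight=w)
    f1_t = f1_score(y_true, y_pred, pos_label=teacher_label, sample_weight=w)
    f1_s = f1_score(y_true, y_pred, pos_label=student_label, sample_weight=w)
    return acc, f1_t, f1_s
```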

4.2. Baselines

We carefully choose the following state-of-the-art CAD-related approaches as our baselines: (1) BiLSTM with acoustic features (BiLSTMa): we train the BiLSTM model with acoustic features only; the text features are completely ignored and there is no multimodal attention fusion; (2) BiLSTM with text features (BiLSTMt): similar to BiLSTMa, but only the text features are used; (3) attention-based BiLSTM with acoustic features (Attn-BiLSTMa): a self-attention layer is added before the BiLSTM model, and, similar to BiLSTMa, only acoustic features are used; (4) attention-based BiLSTM with text features (Attn-BiLSTMt): similar to Attn-BiLSTMa, but only the text features are used; (5) BiLSTM with concatenated features (BiLSTMc): both the acoustic and text features are used, and their concatenation is fed into the BiLSTM model directly without multimodal attention fusion; (6) spectral clustering with d-vectors (Spectral): spectral clustering on speaker-discriminative embeddings (a.k.a. d-vectors) [20]; it first extracts d-vectors from the acoustic features and then applies spectral clustering to group all the segments, after which a classifier takes both acoustic and text features and predicts the final activity type; (7) unbounded interleaved-state recurrent neural networks (UIS-RNN): similar to Spectral, but instead of spectral clustering, UIS-RNN uses a distance-dependent Chinese restaurant process [25].

4.3. Datasets

To assess the proposed framework, we conduct several experiments on two real-world K-12 education datasets.
Online Classroom Data ("Online"): We collect 400 online class recordings from a third-party online education platform. The data is recorded by webcams via live streaming. After generating segment sequences according to the steps in Section 4.1, we label each segment as either Teacher or Student. For segments that contain both teacher and student speech, we use the dominant activity type in the segment as the label. The average duration of the classroom recordings is 60 minutes, and the average length of the segment sequences is 700. We train our model with 350 recordings and use the remaining 50 recordings as the test set.
Offline Classroom Data ("Offline"): We also collect another 50 recordings from an offline classroom environment as an additional test set. The data is obtained by indoor cameras installed on the ceiling of the classrooms.

4.4. Results & Analysis

The results show that our approach outperforms all other methods on both the Online and Offline datasets. Specifically, from Table 1, we find the following: (1) when comparing acoustic-feature-only models (BiLSTMa, Attn-BiLSTMa) to text-feature-only models (BiLSTMt, Attn-BiLSTMt), we can see that models based on acoustic features generally perform worse than models based on text features; we believe this is because two segments with similar voices may belong to different activity types, while their spoken content still differs and remains discriminative; (2) comparing BiLSTMa with Attn-BiLSTMa, and BiLSTMt with Attn-BiLSTMt, blindly incorporating an attention mechanism cannot improve performance on CAD tasks, due to the fact that the sequence is a mix of teacher-spoken and student-spoken segments; (3) performance on Teacher is generally better than on Student; this is because the majority of segments in a classroom recording are spoken by teachers: the percentage of student talk time is 22% on both the Online and Offline datasets.

Table 1. Experimental results on the Online and Offline datasets. F1_T and F1_S indicate the F1 scores for activity types Teacher and Student.

                       Online                 Offline
               Acc    F1_T   F1_S      Acc    F1_T   F1_S
BiLSTMa        0.76   0.84   0.54      0.72   0.82   0.35
Attn-BiLSTMa   0.73   0.80   0.58      0.77   0.86   0.28
BiLSTMt        0.83   0.90   0.48      0.78   0.87   0.28
Attn-BiLSTMt   0.78   0.87   0.13      0.78   0.87   0.03
BiLSTMc        0.81   0.88   0.61      0.79   0.87   0.24
Spectral       0.78   0.86   0.53      0.73   0.83   0.37
UIS-RNN        0.81   0.88   0.58      0.71   0.81   0.35
Our            0.92   0.95   0.81      0.80   0.87   0.48

5. CONCLUSION

In this paper, we presented an attention-based multimodal learning framework for CAD problems. Our approach is able to fuse data from different modalities and capture long-term semantic dependence without any additional recording device. Experimental results demonstrate that our approach outperforms other state-of-the-art CAD learning approaches in terms of accuracy and F1 score.

6. REFERENCES

[1] Martin Nystrand, "Research on the role of classroom discourse as it affects reading comprehension," Research in the Teaching of English, pp. 392–412, 2006.

[2] Melinda T Owens, Shannon B Seidel, Mike Wong, Travis E Bejines, Susanne Lietz, Joseph R Perez, Shangheng Sit, Zahur-Saleh Subedar, Gigi N Acker, Susan F Akana, et al., "Classroom sound can be used to classify teaching practices in college science courses," Proceedings of the National Academy of Sciences, vol. 114, no. 12, pp. 3085–3090, 2017.

[3] Sidney K D'Mello, Andrew M Olney, Nathan Blanchard, Borhan Samei, Xiaoyi Sun, Brooke Ward, and Sean Kelly, "Multimodal capture of teacher-student interactions for automated dialogic analysis in live classrooms," in ICMI. ACM, 2015, pp. 557–566.

[4] Zuowei Wang, Xingyu Pan, Kevin F Miller, and Kai S Cortina, "Automatic classification of activities in classroom discourse," Computers & Education, vol. 78, pp. 115–123, 2014.

[5] Hillary Ganek and Alice Eriks-Brophy, "The language environment analysis (LENA) system: A literature review," in Proceedings of the Joint Workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umea, 16th November 2016. Linkoping University Electronic Press, 2016, number 130, pp. 24–32.

[6] Jiahao Chen, Hang Li, Wenxin Wang, Wenbiao Ding, Gale Yan Huang, and Zitao Liu, "A multimodal alerting system for online class quality assurance," in International Conference on Artificial Intelligence in Education. Springer, 2019, pp. 381–385.

[7] Wenbiao Ding, Guowei Xu, Tianqiao Liu, Weiping Fu, Yujia Song, Chaoyou Guo, Cong Kong, Songfan Yang, Gale Yan Huang, and Zitao Liu, "Dolphin: A spoken language proficiency assessment system for elementary education," in Proceedings of The Web Conference 2020, 2020.

[8] Patrick J Donnelly, Nathan Blanchard, Borhan Samei, Andrew M Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K D'Mello, "Automatic teacher modeling from live classroom audio," in Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. ACM, 2016, pp. 45–53.

[9] Patrick J Donnelly, Nathaniel Blanchard, Borhan Samei, Andrew M Olney, Xiaoyi Sun, Brooke Ward, Sean Kelly, Martin Nystrand, and Sidney K D'Mello, "Multi-sensor modeling of teacher instructional segments in live classrooms," in ICMI. ACM, 2016, pp. 177–184.

[10] Andrew M Olney, Patrick J Donnelly, Borhan Samei, and Sidney K D'Mello, "Assessing the dialogic properties of classroom discourse: Proportion models for imbalanced classes," International Educational Data Mining Society, 2017.

[11] Tianqiao Liu, Wenbiao Ding, Zhiwei Wang, Jiliang Tang, Gale Yan Huang, and Zitao Liu, "Automatic short answer grading via multiway attention networks," in International Conference on Artificial Intelligence in Education. Springer, 2019, pp. 169–173.

[12] Jeff Archer, Steven Cantrell, Steven L Holtzman, Jilliam N Joe, Cynthia M Tocci, and Jess Wood, Better Feedback for Better Teaching: A Practical Guide to Improving Classroom Observations, John Wiley & Sons, 2016.

[13] Martin Nystrand, "CLASS: A Windows Laptop Computer System for the In-Class Analysis of Classroom Discourse," https://dept.english.wisc.edu/nystrand/class.html, [Online; accessed 10-October-2019].

[14] Jin Mu, Karsten Stegmann, Elijah Mayfield, Carolyn Rose, and Frank Fischer, "The ACODEA framework: Developing segmentation and classification schemes for fully automatic analysis of online discussions," International Journal of Computer-Supported Collaborative Learning, vol. 7, no. 2, pp. 285–305, 2012.

[15] Robin Cosbey, Allison Wusterbarth, and Brian Hutchinson, "Deep learning for classroom activity detection from audio," in ICASSP. IEEE, 2019, pp. 3727–3731.

[16] Armin Weinberger and Frank Fischer, "A framework to analyze argumentative knowledge construction in computer-supported collaborative learning," Computers & Education, vol. 46, no. 1, pp. 71–95, 2006.

[17] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879–4883.

[18] F A Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, "Attention-based models for text-dependent speaker verification," in ICASSP. IEEE, 2018, pp. 5359–5363.

[19] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.

[20] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, "Speaker diarization with LSTM," in ICASSP. IEEE, 2018, pp. 5239–5243.

[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998–6008.

[22] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

[23] Zhiheng Huang, Wei Xu, and Kai Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[25] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, "Fully supervised speaker diarization," in ICASSP. IEEE, 2019, pp. 6301–6305.