Lecture2notes: Summarizing Lecture
Videos by Classifying Slides and
Analyzing Text with Machine Learning
Hayden Housen
https://[email protected]: @HHousenMentor: Dhiraj Joshi @ IMB
All Images Without Citations Are My OwnBackground: https://unsplash.com/photos/ieic5Tq8YMk
Natural Language Processing • Computer Vision • Machine Learning
Phone
Presentation
Goal/Focus
AI Model
Lecture Video Detailed Notes
Goal/Focus
Slide ClassifierAI Summarization
Models
Website
Gap/Hypothesis
End-to-End Process
Presenter SlideSlide
Lecture Video
Video
Audio
Summarization Algorithm
Detailed Notes
Brief Methodology
Combination Algorithm
StructuredContent
Transcript
Render
Computer Vision
“Teaching Computers To See”•Face
Recognition Emotion Analysis
Crowd Analytics
Sports: draw lines on field& highlights
Medical Imaging
Movie Special Effects
Self Driving Cars
Computer Vision
“Teaching Computers To See”
•Face Recognition
Emotion Analysis
Crowd Analytics
Sports: draw lines on field& highlights
Medical Imaging
Movie Special Effects
Self Driving Cars
75%
Humans generate massive amounts of video (Video data
accounted for 75% of the total internet traffic in 2017) [23]
Literature Review – Note Taking
Note taking is almost a universal activity among students.[27]
Students’ notes are generally incomplete, and thus not adequate for reviewing the material.[28]
Students prefer guided notes[29] and course final exam performance higher for guided notes.[30]
Estimated Effects
Previewing: [31] suggests that summarized lecture slides reduce the amount of previewing time required without impacting quiz scores
MOOCs: Summaries enable quick skimming of the main points.
Automated summaries will
Decrease time spent creating notes
Increase quiz scores (content knowledge)
Enable faster learning
Literature Review – Lecture Summarization
TalkMiner22 – Web scraping and OCR used to index online lecture videos. We implement the
duplicate alignment algorithm suggested by TalkMiner.
Summary of TalkMiner [22] Summary of [20]
“Lecture Video Indexing20 – Inspired our slide structure analysis algorithm. We use the same
title identification and text classification thresholds.
Literature Review – Lecture Summarization
BERT Text Summarization Lectures [23] is most related since it applies deep learning to lecture summarization.
Approach is limited because:
It only tests the BERT model
Does not consider the information on the slides
Does not fine-tune BERT and merely uses it for embeddings
Has no STT component
Gap/HypothesisMain Components
Slide ClassifierAI Summarization
Models
WebsiteEnd-to-End
Process
Presenter SlideSlide
Lecture Video DatasetAI
Summarization Models
WebsiteEnd-to-End
Process
Main Components
AI Slide Classifier Model
1. Slide Classification Model
Classify frames of input video into 9 classes
Allows system to identify images containing slides for additional processing
Means system can ignore useless frames (containing only the presenter or audience)
SortingAuto Sort (AI Classification)
Manual Sorting Repairs
1.1. Lecture Video Dataset – Diagram
WebsiteScraper
Video Downloader
Frame Extractor
YouTubeScraperHuman
Interaction
Compile Data
Videos CSV
Train Slide
Classifier
Bootstrapping
MITOpenCourseWare
Final dataset contains 15599 images extracted
from 78 videos manually classified into 9
categories.
1.2. Slide Classifier – Dataset Modifications
Train-test Train-test-three CV
1.2. Slide Classifier – Dataset Modifications
Train-test
• Split dataset evenly (by category) into train (80%) & test (20%) sets.
• Average deviation = 0.0929.• Random subset would cause
class imbalance.
Train-test-three
• Only the “slide” and “presenter_slide” classes used in end-to-end process, so create dataset with only those classes.
• “Other” class contains remaining7 classes
• Average deviation = 0.001104.
1.3. Slide Classifier – Squished Images
After hyperparameters found (by training 262 models in CV using optimization framework), trainfinal-general and three-category models.
Frames from videos have 16:9 aspect ratio but ImageNet CNNs accept images with 1:1 aspect ratio.
Center cropping and scaling can remove important data.
Train a model by simply rescaling.Misclassified as “slide”
because presenter cropped out.
1.4. Slide Classifier – Results
Model Dataset Accuracy Accuracy (train) F-score Precision Recall
Final-general train-test 82.62 98.58 87.44 97.73 82.62
Three-category train-test-three 89.97 99.72 93.82 99.95 89.97
train-test 83.86 97.16 88.16 97.72 83.86
train-test-three 87.21 100 91.57 99.8 87.21Squished-image
For each model configuration, trained 11 models and report the average metrics.
In the final pipeline, the three-category model is used.
Slide classifier can be used in other research to easily identify slides.
1.4 Slide Classifier – Results
Final-generalMedian (by accuracy)
Three-categoryMedian (by accuracy)
1.5 Slide Classifier – Discussion
Incorrectly classifying “slide” frames as “presenter_slide” will have minimal impact on final summary will be ignored during processing.
Incorrectly classifying “presenter_slide” frames as “slide” will impact final summary will not be cropped.
2.1 End-to-End Approach – Overview
ffmpeg Slide Classifier that uses the ML model from
previous slide trained on
the dataset
Crop slides captured by
video camera
Extract Frames Classify Frames Perspective Crop Cluster Slides
Segment or Normal
2.1 End-to-End Approach – Overview
Tesseract OCRStroke Width
Title IdentificationBold Identification
Segment or DeepSpeechCMU Sphinx
VoskGoogle Cloud Speech API
CombinationModifications
ExtractiveAbstractive
Slide Structure Analysis & OCR Figure Extraction Transcribe Audio SummarizationCluster Slides
Text Detection w/EAST
Color DetectionShannon Entropy
2.2. Perspective Cropping Algorithm
Frames Unique SlidesAlgorithm
Feature Matching
No Yes
RANSAC Transform
Remove Duplicates
Match Features
Corner Crop Transform
2.2. Perspective Cropping Algorithm
Frames
SlidesScreen
Capture
Video Camera
Classifier
Cluster Together
Unique Slides
Camera Motion
Feature Matching
No Yes
RANSAC Transform
Remove Duplicates
Camera Motion
Match Features
Corner Crop
Transform
2.3. Perspective Cropping – CCT
CCT
Contours
Hough Lines
Feature Matching
No Yes
RANSAC Transform
Remove Duplicates
Camera Motion
Match Features
Corner Crop
Transform
2.3. Perspective Cropping – Feature Match
Iterates through the “screen capture” and “video camera” slides in chronological order.
When this category switches, begins matching using ORB & FLANN until number of matched features below threshold.
Goal: find “video camera” within “screen capture”
2.3. Perspective Cropping – Feature Match
If number of matches is above a threshold then the slides are considered identical.
2.3. Perspective Cropping – Feature Match
2.4. Perspective Cropping – More Content?
124240 content units addedAlgorithmThe slide with the most content is kept.
2.4. Slide Structure Analysis (SSA)
Title: 2.4. Perspective Cropping – Feature MatchBullet 1: **Iterates** through the “screen capture” and “video camera” slides in **chronological** order.Bullet 2: When this **category switches**, begins matching using ORB & FLANN until number of matched features below threshold.Bullet 3: **Goal:** find “video camera” within “screen capture”No footer text
Goal: Copy and paste formatted text from slides while not having access to the text, only an image of the text.
2.4. Slide Structure Analysis – 4 Algorithms
Stroke Width1. Distance Transform2. Peak Local Max – Find peaks in color
Identify Title [20]
Criteria for line to be a title:1. In upper third of image2. In first block and first
paragraph3. Stroke width > 1
standard deviation from average stroke width
4. Line height > average5. x position of top-left
corner < image_width*0.77
6. > 3 characters
Categorize Text [20]
Based on stroke width and height of text bounding boxes.• Bold text• Small text• Normal text
Figure Extraction1. Detects objects that are
close together & meet a size and aspect ratiothreshold
2. Checks that each potential figure contains a minimal amount of text and is color
3. Removes overlappingfigures
2.5. Figure Extraction
2.5. Figure Extraction
Detects objects that are close together &
meet a size and aspect ratio
threshold
Checks that each potential figure
contains a minimal amount of text and is
color
Removes overlappingfigures
2.6. Transcribe Audio – Speech-To-Text
Tested performance of STT models on 43 lecture videos from slide classifier dataset.
Chunking by Voice Activity: Uses WebRTC Voice Activity Detector (VAD) – increases the speed of STT by reducing the amount of audio without speech.
STT ModelsDeepSpeech
SphinxVosk small-0.3
Vosk daanzu-20200905Vosk aspire-0.2
Google’s paid STT API
2.6. Transcribe Audio – Results
Model WER MER WIL LS TC WER Processing Time
DeepSpeech (chunking) 43.01 41.82 59.04 5.97 ≈4 hours
DeepSpeech 44.44 42.98 59.99 5.97 ≈20 hours
Google STT ("default" model) 34.43 33.14 49.05 12.23 ≈20 minutes
Vosk small-0.3 35.43 33.64 50.84 15.34 ≈8.5 hours
Vosk daanzu-20200905 33.67 31.87 48.28 7.08 ≈5.5 hours
Vosk aspire-0.2 41.38 38.44 56.35 13.64 ≈19 hours
Vosk aspire-0.2 (chunking) 41.45 38.65 56.56 13.64 ≈19 hours
"LS TC WER" stands for LibriSpeech test-clean WER (Word Error Rate)Sphinx was not tested because it is between 6x-18x slower than other models.
Extract Frames Classify Frames
Perspective Crop Cluster Slides
Slide Structure Analysis &
OCR
Figure Extraction
Transcribe AudioSummarization
3.1. End-to-End Approach – Slides/Videos
= Novel ML Models= Novel Approaches
3.1. Text Summarization
Extractive SummarizationIdentifies important sections of the text and generates them verbatim producing a subset of the sentences from the originaltext
Abstractive SummarizationReproduces important material in a new way after interpretationand examination of the textusing advanced naturallanguage
Summarization Model
3.1. Text Summarization
Slides Transcript
Audio Transcript
Combination Algorithm
Combination Algorithm• only_asr – only uses the audio transcript (deletes the
slide transcript)• only_slides – reverse of only_asr• concat – appends audio transcript to slide transcript• full_sents – audio transcript appended to only the
complete sentences of the slide transcript• keyword_based (most advanced) – selects a certain
percentage of sentences from the audio transcript based on keywords found in the slides transcript
3.1. Text Summarization
Slides Transcript
Audio Transcript
Summarization Model
Combination Algorithm
Summarization Model
3.1. Text Summarization
Slides Transcript
Audio Transcript
Combination Algorithm
3.1. Text Summarization
Slides Transcript
Audio Transcript
Combination Algorithm
Summarization Model
Summarization Model1. Extractive Summarization
• Cluster – groups the lecture transcript into categories by topic and summarizes each topic using “generic”
• Generic (non-neural) – standard summarization heuristics
2. Abstractive Summarization• PreSumm or BART or TransformerSum or
PEGASUS
3.1. Text Summarization – SOTA Models
PreSumm: Applied BERT to text summarization
BART: standard seq2seq architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT)
PEGASUS: pretraining task is intentionally similar to summarization
[24]
3.2. TransformerSum – Overview
Summarization training and inference library created by Hayden Housen
Improves the architecture of BertSum(extractive component of PreSumm)
TransformerSum was used by a company that serves industry-leading enterprises, which engage with millions of customers daily.
3.2. TransformerSum – Diagram
S = SentenceT = TokenE = EmbeddingR = Sentence Ranking
Tokenizer
T 1T 2T 3T 4…
Word EmbeddingModel
TE 1TE 2TE 3TE 4
…
Pooling
SE 1SE 2
…
Input Document
S 1S 2…
Classifier
R 1R 2…
3.2. TransformerSum – Pooling Modes
Test all three pooling modes (mean_tokens, sentence_rep_tokens, and max_tokens) using DistilBERT and DistilRoBERTa
Across all datasets and models, pooling mode has no significant impacton the final ROUGE scores.
Model Pooling Method CNN/DM WikiHow ArXiv/PubMed
sent_rep 42.71/19.91/39.18 30.69/08.65/28.58 34.93/12.21/31.00
mean 42.70/19.88/39.16 30.48/08.56/28.42 34.48/11.89/30.61
max 42.74/19.90/39.17 30.51/08.62/28.43 34.50/11.91/30.62
sent_rep 42.87/20.02/39.31 31.07/08.96/28.95 34.70/12.16/30.82
mean 43.00/20.08/39.42 30.96/08.93/28.86 34.24/11.82/30.42
max 42.91/20.04/39.33 30.93/08.92/28.82 34.28/11.82/30.44
distilbert-base-uncased
distilroberta-base
3.2. TransformerSum – Classifiers
Test all four classification layers.
No significant difference in ROUGE scores between classifiers.
Model Classifier CNN/DM WikiHow ArXiv/PubMed
simple_linear 42.71/19.91/39.18 30.69/08.65/28.58 34.93/12.21/31.00
linear 42.70/19.84/39.14 30.67/08.62/28.56 34.87/12.20/30.96
transformer 42.78/19.93/39.22 30.66/08.69/28.57 34.94/12.22/31.03
transformer_linear 42.78/19.93/39.22 30.64/08.64/28.54 34.97/12.22/31.02
simple_linear 42.87/20.02/39.31 31.07/08.96/28.95 34.70/12.16/30.82
linear 43.18/20.26/39.61 31.08/08.98/28.96 34.77/12.17/30.88
transformer 42.94/20.03/39.37 31.05/08.97/28.93 34.77/12.17/30.87
transformer_linear 42.90/20.00/39.34 31.13/09.01/29.02 34.77/12.18/30.88
distilbert-base-uncased
distilroberta-base
3.2. TransformerSum – Results
R1/R2/RL-Sum CNN/DM WikiHow ArXiv/PubMed
distilbert-base-uncased 42.71/19.91/39.18 30.69/08.65/28.58 34.93/12.21/31.00
distilroberta-base 42.87/20.02/39.31 31.07/08.96/28.95 34.70/12.16/30.82
bert-base-uncased 42.78/19.83/39.18 30.68/08.67/28.59 34.80/12.26/30.92
roberta-base 43.24/20.36/39.65 31.26/09.09/29.14 34.81/12.26/30.91
mobilebert-uncased 42.01/19.31/38.53 30.72/08.78/28.59 33.97/11.74/30.19
BertSumExt 43.25/20.24/39.63 None None
BertSumExt-large 43.85/20.34/39.90 None None
MatchSum bert-base 44.22/20.62/40.38 31.85/08.98/29.58 None
MatchSum roberta-base 44.41/20.86/40.55 None None
PEGASUS-base 41.79/18.81/38.93 36.58/15.64/30.01 37.39/12.66/23.87
PEGASUS-large HN 44.17/21.47/41.11 41.35/18.51/33.42 44.88/18.37/26.58
PEGASUS-large C4 43.90/21.20/40.76 43.06/19.71/34.80 45.10/18.59/26.75
Smaller: Mobilebert-uncased-ext-sum model achieves 96.59% of the performance of BertSumwhile containing 4.45x fewer parameters.
Faster: Distilroberta-base-ext-sum model takes 74x less time to train than MatchSum.
My
Res
ults
4. Lecture2Notes – Website
Summary – Five-Fold Contribution
Lecture Video
Dataset
• Appropriate categories
• Large variety
Slide Classifier
• Identify important frames in slide presentations
•Summarization Models
• Improve state-of-the-art in text summarization
• Novel approaches
•End-To-End
• Multimodal approach
• Convert lecture videos to notes
Online Web Service
• Allows people to easily use the created lecture summarizer to create notes based on their own lectures
https://unsplash.com/photos/OyCl7Y4y0Bk https://www.advectas.com/en/blog/what-is-machine-learning/
Machine LearningAssess Impact on Education
Future Research
Summaries of lecture slide sets enhanced likelihood students would preview material & sometimes led
to better quiz scores [21]
Slide classifier (makes future slide analysis easier), dataset of lecture
videos, TransformerSum
Acknowledgements
Dr. Dhiraj Joshi for giving me advice on methodology
Gillian Rinaldo for guiding me
Classmates/Peers for supporting me
Parents for helpful suggestions
References
[1] Szeliski, R. (2010). Computer Vision: Algorithms and Applications. Springer.
[2] Boden, M. (2008). Mind as machine (p. 781). Oxford: Clarendon Press.
[3] Dragonette, L. (2018, April 19). Artificial Intelligence - AI. Retrieved December 6, 2018, from https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp
[4] Machine learning. (2018, December 04). Retrieved from https://en.wikipedia.org/wiki/Machine_learning
[5] Feature (computer vision). (2019). Retrieved from https://en.wikipedia.org/wiki/Feature_(computer_vision)
[6] Ripley, B. (2009). Pattern recognition and neural networks (p. 354). Cambridge: Cambridge University Press.
[7] What is the difference between labeled and unlabeled data?. (2019). Retrieved from https://stackoverflow.com/questions/19170603/what-is-the-difference-between-labeled-and-unlabeled-data/19172720#19172720
[8] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. doi: 10.1007/bf00994018
[9] A Beginner's Guide to Neural Networks and Deep Learning. (2019). Retrieved from https://skymind.ai/wiki/neural-network
[10] Le, J. (2018). A Tour of The Top 10 Algorithms for Machine Learning Newbies. Retrieved from https://towardsdatascience.com/a-tour-of-the-top-10-algorithms-for-machine-learning-newbies-dde4edffae11
[11] Brownlee, J. (2017). What is the Difference Between Test and Validation Datasets?. Retrieved from https://machinelearningmastery.com/difference-test-validation-datasets/
[12] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 1097-1105. Retrieved February 5, 2019.
[13] Image Classification. (n.d.). Retrieved February 5, 2019, from https://cs231n.github.io/classification/
[14] Seif, G. (2018, December 13). How to do everything in Computer Vision. Retrieved February 3, 2019, from https://towardsdatascience.com/how-to-do-everything-in-computer-vision-2b442c469928
[15] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2016.91
[16] Chablani, M. (2017, August 21). YOLO - You only look once, real time object detection explained. Retrieved February 5, 2019, from https://towardsdatascience.com/yolo-you-only-look-once-real-time-object-detection-explained-492dc9230006
[17] Gandhi, R. (2018, July 09). R-CNN, Fast R-CNN, Faster R-CNN, YOLO - Object Detection Algorithms. Retrieved February 5, 2019, from https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e
References
[18] Krizhevsky, A. (n.d.). Google Scholar. Retrieved February 5, 2019, from https://scholar.google.com/citations?user=xegzhJcAAAAJ
[19] Mahasseni, B., Lam, M., & Todorovic, S. (2017). Unsupervised Video Summarization with Adversarial LSTM Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.318
[20] Yang, H., Siebert, M., Luhne, P., Sack, H., & Meinel, C. (2012). Lecture Video Indexing and Analysis Using Video OCR Technology. Journal of Multimedia Processing and Technologies, 2(4), 176-196. doi:10.1109/sitis.2011.20
[21] Shimada, A., Okubo, F., Yin, C., & Ogata, H. (2017). Automatic Summarization of Lecture Slides for Enhanced Student Preview Technical Report and User Study. IEEE Transactions on Learning Technologies, 11(2), 165-178. doi:10.1109/tlt.2017.2682086
[22] Adcock, J., Cooper, M., Denoue, L., Pirsiavash, H., & Rowe, L. A. (2010). TalkMiner. MM'10 - Proceedings of the ACM Multimedia 2010 International Conference, 241-250. doi:10.1145/1873951.1873986
[23] “Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper.” VNI Global Fixed and Mobile Internet Traffic Forecasts. Cisco, 27 February 2019, https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html
[24] Liu, Y., & Lapata, M. (2019). Text Summarization with Pretrained Encoders. Retrieved from https://arxiv.org/abs/1908.08345
[25] Sciforce. (2019, January 23). Towards Automatic Text Summarization: Extractive Methods. Retrieved December 27, 2019, from https://medium.com/sciforce/towards-automatic-text-summarization-extractive-methods-e8439cd54715.
[26] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi: 10.1109/cvpr.2016.90
[27] Pauline A. Nye, Terence J. Crooks, Melanie Powley, and Gail Tripp. Student note-taking related to university examination performance. Higher Education, 13(1):85–97, February 1984. ISSN 0018-1560, 1573-174X. doi: 10.1007/BF00136532. URL http://link.springer.com/10.1007/BF00136532.
[28] Kenneth A. Kiewra. Providing the Instructor’s Notes: An Effective Addition to Student Notetaking. Educational Psychologist, 20(1):33–39, January 1985. ISSN 0046-1520, 1532-6985. doi: 10.1207/s15326985ep2001_5. URL https://www.tandfonline.com/doi/full/10.1207/s15326985ep2001_5.
[29] Jennifer L Austin, Matthew D Thibeault, and Jon S Bailey. Effects of Guided Notes on University Students’ Responding and Recall of Information. Journal of Behavioral Education, page 12, 2002.
[30] Anneh Mohammad Gharravi. Impact of instructor-provided notes on the learning and exam performance of medical students in an organ system-based medical curriculum. Advances in Medical Education and Practice, Volume 9:665–672, September 2018. ISSN 1179-7258. doi: 10.2147/AMEP.S172345. URL https://www.dovepress.com/impact-of-instructor-provided-notes-on-the-learning-and-exam-performan-peer-reviewed-article-AMEP.
[31] Atsushi Shimada, Fumiya Okubo, Chengjiu Yin, and Hiroaki Ogata. Automatic Summarization of Lecture Slides for Enhanced Student PreviewTechnical Report and User Study. IEEE Transactions on Learning Technologies, 11(2):165–178, April 2018. ISSN 1939-1382, 2372-0050. doi: 10.1109/TLT.2017.2682086. URL https://ieeexplore.ieee.org/document/7879355/.
Lecture2notes: Summarizing Lecture
Videos by Classifying Slides and
Analyzing Text with Machine Learning
Hayden Housen
https://[email protected]: @HHousenMentor: Dhiraj Joshi @ IMB
All Images Without Citations Are My OwnBackground: https://unsplash.com/photos/ieic5Tq8YMk
Natural Language Processing • Computer Vision • Machine Learning