Edx Log Data Analysis
A Research & Development Report
Submitted in partial fulfillment of requirements for the degree of
Master of Technology
by
Rajeev Kumar Gautam
Roll No: 13305R007
under the guidance of
Prof. Deepak B. Phatak
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
May, 2015
Contents
1 Introduction
2 Experimental Setup
  2.1 Tools Used
  2.2 Data Used
3 Question wise analysis of students
  3.1 Quiz-2
  3.2 Quiz-3
  3.3 Quiz-4
  3.4 Quiz-5
  3.5 Quiz-final
  3.6 Quiz-final assignment
4 Attempt wise analysis of students
  4.1 Quiz-2
  4.2 Quiz-3
  4.3 Quiz-4
  4.4 Quiz-5
  4.5 Quiz-final
5 Video played in exam
  5.1
  5.2
  5.3
6 Enrollment and Unenrollment in week-2
  6.1
7 Code and Commands
  7.1 Code
  7.2 Commands
    7.2.1 For analysing the Data
    7.2.2 For plotting the Data
    7.2.3 For Running Hadoop and Hive Server
8 Analysis Results
  8.1 Question wise analysis of students
  8.2 Attempt wise analysis of students
  8.3 Video played in exam
  8.4 Enrollment and Unenrollment in week-2
9 Future Work
Chapter 1
Introduction
Data analysis is important for improving the quality of online education. In a traditional classroom a teacher can keep track of a small class directly; in online teaching, log records are how we keep track of students. These log records capture all the activity of students while studying, taking exams, and so on. To get valuable information from the log records, we have to analyse them.
We are going to analyse the log data of the students of CS101.2x, a course offered by IIT Bombay on the edX platform. The course ran for 6 weeks; we do not have the first week's data, so we analyse the data of the last 5 weeks. In that data we focus mainly on the quizzes and the final exam. In the analysis we find the number of students who answered a specific question correctly or incorrectly. We also look at how many times a video was played during an examination, when a student had entered a wrong answer or had forgotten a concept. This helps to distinguish the weak students from the good ones, so that the instructor can help weak students in time on the specific topics where they are lagging. As another type of analysis, we look at how many students left the course and how many joined it.
Hadoop is used for data storage, and queries are run through the Hive interface. RStudio is used to plot the data in graphical form. Python is used to parse the log data and convert it into CSV format so that Hive queries can be run on it easily. sed is used to clean the parsed data.
Chapter 2
Experimental Setup
2.1 Tools Used
Hadoop
Hive
RStudio
Python
SED (stream editor)
LaTeX
2.2 Data Used
The data used in the analysis were the edX log data of students, in JSON format. We decode the data into a more useful format, CSV. We split the data into tables according to the query, and perform the desired operations with the help of Hadoop and Hive: we run MapReduce operations on Hadoop and Hive queries on the data stored in Hadoop.
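As a minimal sketch of the decoding step, the following Python turns one JSON log record into one CSV row. The sample record is hypothetical (invented for illustration, not taken from the course data); only the field names follow those read by the script in Section 7.1.

```python
import csv
import io
import json

# Hypothetical sample record, invented for illustration. Only the field
# names (username, context, event) follow the script in Section 7.1.
SAMPLE = json.dumps({
    "username": "student1",
    "context": {"course_id": "CS101.2x",
                "module": {"display_name": "Q15"}},
    "event": {"success": "incorrect", "attempts": 2},
})


def record_to_row(line):
    """Return [username, question, status, attempts, course_id], or None."""
    data = json.loads(line)
    ctx = data.get("context", {})
    event = data.get("event")
    # Assumption: some records may carry "event" as a string rather than
    # an object; only object events with both keys describe an answer.
    if not isinstance(event, dict) or "module" not in ctx:
        return None
    if "success" not in event or "attempts" not in event:
        return None
    return [data["username"], ctx["module"]["display_name"],
            event["success"], event["attempts"], ctx["course_id"]]


out = io.StringIO()
writer = csv.writer(out)
row = record_to_row(SAMPLE)
if row is not None:
    writer.writerow(row)
csv_text = out.getvalue()
```

Using the csv module instead of joining strings with commas keeps a row intact if a field, such as a question's display name, itself contains a comma.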
Chapter 3
Question wise analysis of students
3.1 Quiz-2
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.1: quiz-2
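The counts behind the question-wise figures in this chapter can be sketched in Python as follows. This is an illustration only, not the Hive queries the report ran (those are listed in Section 7.2.1); the rows are made-up sample data in the (name, question, status) layout of the csq tables.

```python
from collections import defaultdict

# Made-up sample rows: (name, question, status).
rows = [
    ("s1", "Q1", "incorrect"),
    ("s1", "Q1", "correct"),
    ("s2", "Q1", "incorrect"),
    ("s2", "Q2", "correct"),
    ("s3", "Q2", "correct"),
    ("s3", "Q2", "correct"),  # repeated submission, counted once
]


def question_wise_counts(rows):
    """Map question -> {status: number of distinct students}."""
    students = defaultdict(lambda: defaultdict(set))
    for name, question, status in rows:
        students[question][status].add(name)
    return {question: {status: len(names) for status, names in by_status.items()}
            for question, by_status in students.items()}


counts = question_wise_counts(rows)
```

A student who first answers wrongly and later answers correctly is counted in both groups, which mirrors running one DISTINCT query per status.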
3.2 Quiz-3
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.2: quiz-3
3.3 Quiz-4
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.3: quiz-4
3.4 Quiz-5
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.4: quiz-5
3.5 Quiz-final
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.5: quiz-final
3.6 Quiz-final assignment
For each question, the number of students who answered it correctly in any attempt and the number of students who made a wrong attempt.
Figure 3.6: quiz-final assignment
Chapter 4
Attempt wise analysis of students
4.1 Quiz-2
For each question, the number of students who answered it correctly in 1, 2 and 3 attempts.
Figure 4.1: quiz-2
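The attempt-wise counts in this chapter can be sketched in Python as follows. This is an illustration under assumptions, not the report's own Hive queries: the rows are made-up sample data, with the attempt number stored as the fourth field (the `atmpt` column used in Section 7.2.1).

```python
from collections import Counter

# Made-up sample rows: (name, question, status, attempts).
rows = [
    ("s1", "Q1", "incorrect", 1),
    ("s1", "Q1", "correct", 2),
    ("s2", "Q1", "correct", 1),
    ("s3", "Q1", "correct", 1),
    ("s3", "Q1", "correct", 1),  # repeated log line, counted once
    ("s2", "Q2", "incorrect", 1),
    ("s2", "Q2", "incorrect", 2),
    ("s2", "Q2", "correct", 3),
]


def attempt_wise_counts(rows):
    """Map (question, attempt) -> distinct students correct at that attempt."""
    correct = {(name, question, attempts)
               for name, question, status, attempts in rows
               if status == "correct"}
    return Counter((question, attempts) for _name, question, attempts in correct)


by_attempt = attempt_wise_counts(rows)
```

Collecting the (name, question, attempt) triples in a set before counting plays the role of DISTINCT in the Hive queries.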
4.2 Quiz-3
For each question, the number of students who answered it correctly in 1, 2 and 3 attempts.
Figure 4.2: quiz-3
4.3 Quiz-4
For each question, the number of students who answered it correctly in 1, 2 and 3 attempts.
Figure 4.3: quiz-4
4.4 Quiz-5
For each question, the number of students who answered it correctly in 1, 2 and 3 attempts.
Figure 4.4: quiz-5
4.5 Quiz-final
For each question, the number of students who answered it correctly in 1, 2 and 3 attempts.
Figure 4.5: quiz-final
Chapter 5
Video played in exam
5.1
Number of students who attempted the exam and number of students who played a video on that day.
Figure 5.1: Play Video
5.2
Number of students who gave a wrong answer and watched a video while attempting the quiz.
Figure 5.2: Play Video
5.3
Number of videos played per student during the exam.
Figure 5.3: Play Video
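A per-student figure of this kind can be sketched in Python as follows. The events are made-up sample data, and the sketch assumes that exam submissions are logged with the event type `problem_check` and video plays with `play_video`; the report itself only filters on the string "play" (Section 7.2.1), so the exact definition used for the figure may differ.

```python
from collections import Counter

# Made-up events logged on an exam day: (username, event_type).
events = [
    ("s1", "problem_check"),
    ("s1", "play_video"),
    ("s1", "play_video"),
    ("s2", "problem_check"),
    ("s3", "problem_check"),
    ("s3", "play_video"),
]

# Students who submitted at least one answer on that day.
exam_takers = {user for user, etype in events if etype == "problem_check"}

# Video plays per student, restricted to students who took the exam.
plays = Counter(user for user, etype in events
                if etype == "play_video" and user in exam_takers)

watchers = set(plays)
plays_per_student = sum(plays.values()) / len(exam_takers)
```

Dividing by the number of exam takers, rather than by the number of students who watched something, keeps students who played no video in the average.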
Chapter 6
Enrollment and Unenrollment in week-2
6.1
Number of students who enrolled and unenrolled in week-2.
Figure 6.1: enrollment
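The enrollment counts can be sketched in Python as follows. The rows are made-up sample data in the (username, event type, course id) layout written by the second script in Section 7.1. The sketch assumes the log marks enrollment and unenrollment with the event types `edx.course.enrollment.activated` and `edx.course.enrollment.deactivated`; the report does not state which event types it used.

```python
# Assumed event type names; the report does not state which ones it used.
ENROLL = "edx.course.enrollment.activated"
UNENROLL = "edx.course.enrollment.deactivated"

# Made-up sample rows: (username, event_type, course_id).
rows = [
    ("s1", ENROLL, "CS101.2x"),
    ("s2", ENROLL, "CS101.2x"),
    ("s2", UNENROLL, "CS101.2x"),
    ("s3", "play_video", "CS101.2x"),
]


def distinct_users(rows, event_type):
    """Set of usernames with at least one event of the given type."""
    return {name for name, etype, _course in rows if etype == event_type}


enrolled = distinct_users(rows, ENROLL)
unenrolled = distinct_users(rows, UNENROLL)
net_change = len(enrolled) - len(unenrolled)
```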
Chapter 7
Code and Commands
7.1 Code
These are the Python scripts that we used to parse and clean the data.
import json

# Parse the edX log (one JSON record per line) and write one CSV row per
# graded answer: username, question name, status, attempts, course id.
with open('tmpQopRMw') as f, open('rfile', 'w') as fo:
    for line in f:
        data = json.loads(line)
        ct = data["context"]
        event = data["event"]
        # Some records may carry "event" as a string rather than a dict; skip those.
        if "module" in ct and isinstance(event, dict):
            if "success" in event and "attempts" in event:
                s = (data["username"] + ','
                     + str(ct["module"]["display_name"]) + ','
                     + str(event["success"]) + ','
                     + str(event["attempts"]) + ','
                     + str(ct["course_id"]) + '\n')
                fo.write(s)
import json

# Second script: write username, event type and course id for every
# record that has an event type.
with open('tmpQopRMw') as f, open('rfile', 'w') as fo:
    for line in f:
        data = json.loads(line)
        ct = data["context"]
        if "event_type" in data:
            s = (data["username"] + ',' + str(data["event_type"]) + ','
                 + str(ct["course_id"]) + '\n')
            fo.write(s)
7.2 Commands
7.2.1 For analysing the Data
create table csq2 (name string, qs string, status string, course_id string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
select DISTINCT name from csq2 where qs = "Q15" and status = "incorrect";
select DISTINCT name from csq5 where qs = "Q20" and status = "correct" and atmpt = 1;
select count(DISTINCT name) from csq5 where qs = "Q1" and status = "correct" and atmpt = 1;
LOAD DATA LOCAL INPATH '/home/rajeev/Downloads/work/csq4' OVERWRITE INTO TABLE csq3;
sed -n '/play/p' ./csp > csv
cat csv | awk '!seen[$0]++' > csq5
sort -u -t ',' -k1,1 csp > total
awk -F "," '{print $1}' total > a1
grep -f a1 a2 | wc -l
grep -f a1 a2 > v1
grep -f c1 v1 | wc -l
grep incorrect total > c1
grep -f i1 v1 | wc -l
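For readers less familiar with these shell tools, the following Python sketch mirrors three of the steps above: dropping repeated lines (awk '!seen[$0]++'), taking the first column (awk -F "," '{print $1}'), and keeping lines that match any entry of a pattern list (grep -f). The report does not describe what the intermediate files contain, so the sample lines here are invented. The sketch also treats patterns as plain substrings, whereas grep -f treats them as regular expressions by default; for simple usernames the two behave the same.

```python
def dedupe(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


def first_fields(lines, sep=","):
    """Return the first column of each line."""
    return [line.split(sep, 1)[0] for line in lines]


def grep_fixed(patterns, lines):
    """Keep the lines that contain at least one pattern as a substring."""
    return [line for line in lines if any(p in line for p in patterns)]


# Invented sample data standing in for the intermediate files.
csp = ["s1,play_video,CS101.2x",
       "s1,play_video,CS101.2x",
       "s2,play_video,CS101.2x"]
a2 = ["s1,Q1,incorrect", "s3,Q1,correct", "s2,Q2,correct"]

a1 = first_fields(dedupe(csp))   # students who played a video
v1 = grep_fixed(a1, a2)          # their quiz rows
match_count = len(v1)            # what "| wc -l" would report
```

Substring matching has the same pitfall as the grep commands: a username such as s1 also matches a line for s10, so comparing the first field exactly would be safer on real data.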
7.2.2 For plotting the Data
plot <- read.csv("~/Downloads/work/plot", sep=",", header=TRUE)
barplot(as.matrix(plot))
barplot(as.matrix(plot), beside=TRUE)
barplot(as.matrix(plot), beside=TRUE, xlab="NUMBER OF QUESTIONS", ylab="NUMBER OF STUDENTS", col=c("darkblue","red"))
legend("topright", col=c("darkblue","red"), legend=c("incorrect", "correct"))
# Final version: draw the grouped bar chart first, then add the legend as a separate call.
barplot(as.matrix(plot), beside=TRUE, main="Quiz - FINAL Analysis", xlab="NUMBER OF QUESTIONS", ylab="NUMBER OF STUDENTS", col=c("darkblue","red"))
legend("topright", lty=1, col=c("darkblue","red"), legend=c("incorrect", "correct"))
7.2.3 For Running Hadoop and Hive Server
/usr/local/hadoop/sbin/start-all.sh
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin
/usr/local/hive/bin/hive
Chapter 8
Analysis Results
8.1 Question wise analysis of students
From the figures in Chapter 3 we can see that, as the course progresses, the number of wrong attempts in exams also increases. A possible reason is that students become careless or do not focus well on the course.
8.2 Attempt wise analysis of students
From the figures in Chapter 4 we can see that, as the course progresses, the number of students who answer correctly in one attempt decreases.
8.3 Video played in exam
From the figures in Chapter 5 we can see that, as the course progresses, the number of videos played per student during exams decreases. The number of wrong answers given after watching a video also decreases.
8.4 Enrollment and Unenrollment in week-2
Analysis of Enrollment and Unenrollment in week-2.
Chapter 9
Future Work
We can perform real-time analysis using HUE (Hadoop User Experience). It supports a wider variety of queries, so that some decisions can be taken as early as possible.