===================
**Acknowledgment**
===================
We, the summer interns team of edX-Data Analytics, acknowledge with humility and gratitude our debt to all those who have helped us take our ideas and assigned work well above the level of simplicity and turn them into something concrete.
We wholeheartedly thank **Prof. Deepak Phatak** for selecting us as a part of his valuable project, constantly motivating us to do better, and showing complete confidence in our work.
We are indebted to our Project Manager, **Mr. Parag Tiwari**, for his constant help and support over the course of this project. We are very thankful to our mentors **Mr. Pushpak Burange**, **Ms. Mitali Nayak** and **Mr. Praveen Pal** for their valuable suggestions, guidance and encouragement. In the process, we also learnt a lot of other technical and non-technical things from them.
We would like to thank **Mr. Mayank Paliwal** and **Mr. Rahul Kharat** for making our stay here as summer interns comfortable and for all their administrative help.
Finally, we would also like to thank all the other colleagues working on different projects under Prof. Phatak for helping us with small problems as well as at critical junctures.
----------------
**Team Members**
----------------
Sachin Vilas Sable
Akansha
B. Pallavi Reddy
Rounak Nandanwar
Shubham Utwal
Oshin Prem
-----------
**Mentors**
-----------
Mr. Parag Tiwari - Project Manager at IITB
Mr. Pushpak Burange
Mr. Praveen Pal - Jr. Software Engineer at IITB
Ms. Mitali Nayak - Jr. Software Engineer at IITB
================
**Declaration**
================
We declare that this written submission represents our ideas in our own words and that, where
others' ideas or words have been included, we have adequately cited and referenced the
original sources. We also declare that we have adhered to all principles of academic honesty
and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source
in our submission. We understand that any violation of the above will be cause for
disciplinary action by the Institute and can also evoke penal action from the sources which
have thus not been properly cited or from whom proper permission has not been taken when
needed.
=============
**Abstract**
=============
IIT Bombay now has its contributions (in the form of two courses) on the prestigious edX platform. EdX is a massive open online course (MOOC) destination site and open-source online learning platform, founded by the Massachusetts Institute of Technology and Harvard University. Though it is an ITS (Intelligent Tutoring System), students find one way or another to play around with the features of the system, through which they reach the final (correct) answer without even thinking deeply about the question asked. To avoid such instances, we need to determine which students are gaming the system (behaviour aimed at obtaining correct answers and advancing within the tutoring curriculum by systematically taking advantage of regularities in the software's feedback and help). For this, we analyzed the log files generated by student activities in the ITS, i.e., the edX course site, and extracted some features relevant to finding out which students are gaming the system. Determining such students will help us in the future to intervene in the learning process, so as to prevent such students from gaming and help them in learning the course.
We have also written queries to extract information which will be useful for data analytics. Using this data, the ITS designer and the course instructors can get useful information about how positively the students are responding to the system and how popular and effective their courses are. To aid them, a visualized interface has been designed so that it becomes easier and graphically legible to infer such aspects of a student's interest.
=================
**INTRODUCTION**
=================
Data analytics is the discovery and communication of meaningful patterns in data. It makes extensive use of mathematics and statistics, descriptive techniques and predictive models to gain valuable knowledge from data. The insights from data are used to recommend action or to guide decision making rooted in a business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology. From our project's point of view, we are concerned with 'EdX Analytics': we have to figure out the different patterns hidden in the generated log files.
The log files have entries for every activity ever made by a student on the edX platform. For every click on a tab, question attempted or video played, an entry is created in the log file. From this log file, we parse out the relevant information and find patterns indicating that a student is gaming the system. Gaming, as described above, is any behaviour aimed at obtaining correct answers and advancing within the tutoring curriculum by systematically taking advantage of regularities in the software's feedback and help. Thus, classifying the gaming students is our priority. This involves careful feature extraction, data analytics and machine learning on the features for classification.
Feedback to the ITS designer and course tutor is equally important. It is essential to know whether the system is doing a good job or not. This involves calculating a student's response based on various categories. A visualized form of such information is useful to them for deciding what reforms they need to apply, so as to make the course even more compelling and interesting to the students.
============
**PURPOSE**
============
The main purpose of our project is to determine which students are gaming the system so that we can apply interventions in their tutor system. For example, if a question has a hint option, a student can repeatedly use that hint to reach the correct answer. This kind of behaviour, however, hampers the student's learning process. So, our main aim is to filter out the students who engage in such off-task behaviours and implement interventions only in those students' tutor system, which would curb their gaming and indirectly motivate them to learn and attempt the quizzes whole-heartedly. To implement such a system, relevant feature extraction from the log files is a necessity. Then comes the use of machine learning to classify a student as gaming or not gaming based on the features extracted.
Secondly, data analytics also involves a thorough study of the database containing all the demographic and activity information of a user. From this data, one can infer which categories of students are most interested in learning the course. This requires queries to be written on that data from which we can extract relevant information. A visualized form of such information needs to be created as feedback to the course designers. Looking at this, they can decide what reforms they need to apply, so as to make the course seem even more compelling and interesting to the students.
==========
**SCOPE**
==========
Implementing the ideas described above will make the tutoring system much more reliable while grading a student. Any normal ITS, without provisions for detecting a student who is gaming and implementing interventions in their system, will award a certificate to any student who has completed the course. But this makes the system incompetent, since it also grades undeserving students as successful. Our approach would thus aid the ITS in fairly classifying students into gaming and not gaming and accordingly award certificates to the deserving students. In addition, the tutoring system will be able to pick out the 'gaming' students and intervene in their learning process so that their learning also matches that of the regular students, thereby widening the scope of the efficiency of the edX course site.
=======================
**Overall description**
=======================
EdX-data analyzer uses data generated by edX in the form of log entries and the database it creates. Data containing information related to the students is stored in the database 'edxapp'. Data from edxapp is used for analysis purposes such as the number of dropouts according to education level, location and gender, and the number of students of each education level enrolled in a particular course.
Log entries generated by the server help us find out whether a student is gaming the system or not. EdX-data analyzer constantly looks for new entries in the log files and parses and processes them to find whether students are gaming or not. EdX-data analyzer parses the log file entries and stores them using Hive on the Hadoop distributed file system.
=====================
**Product Functions**
=====================
EdX-data analyzer serves the main purpose of determining whether a student is gaming the system or not. It can be used to interrupt students who are trying to game the system and to adjust the tutor system so that it becomes difficult for the student to game it.
===============
**Constraints**
===============
Analysing the log data to find whether a student is gaming the system or not involves many constraints, such as the difficulty of predicting the state of mind of a person just by looking at their interaction with the system. It is not possible to determine if a student is sleeping while watching a video or whether a student is paying proper attention. Along with the difficulty of predicting the state of mind of the student, it is also not possible to record each and every action or interaction of the user with his/her system, due to privacy policies. An instance of this difficulty is the situation where a student who pauses a video might be indulging in some other off-task behaviour, or might be getting confused while watching the video and now be trying to understand the concept elsewhere on the internet. As seen in this example, it is difficult to predict the exact state of mind of the student. To determine whether the student is gaming or not, detailed analysis is required. Even if we are able to determine whether the student is gaming or not, we need to take some action to prevent the student from gaming. But the question of which steps should be taken to stop the student from gaming depends on the reason behind the gaming, which is a further part of the analysis not covered here.
================================
**Assumptions and Dependencies**
================================
EdX-data analyzer assumes that the log entries generated by the EdX ITS server are error free, since the only source of input to EdX-data analyzer is the data provided by the EdX ITS server. EdX-data analyzer depends entirely on the EdX ITS for its data.
======================
**Technologies used:**
======================
1. Hadoop:
----------
Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Apache Hadoop has two main subprojects:
MapReduce - The framework that understands and assigns work to the nodes in a cluster.
HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
2. Hive:
--------
Hive is a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate.
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. HQL is limited in the commands it understands, but it is still useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
3. Sqoop:
---------
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free form SQL query as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database.
===========================
**Functional Requirements**
===========================
1. The system shall analyse the data based on various parameters of the student, such as location, age group, gender, etc.
2. The instructor shall be able to choose the comparison parameters and input valid entries to be queried. The instructor shall be able to input the subject for the data to be queried.
3. The data shall be represented in a visual format to be understood by the instructor. The visual formats may include pie charts, bar charts, line charts, etc.
4. The system shall parse the log data and store the parsed data into relevant event related tables.
5. The system shall extract the relevant and useful data from the parsed data.
6. The system shall tell whether a student is gaming the system or not. The system shall do this after analysing the various actions performed by the student while taking the test.
============================
**Performance Requirements**
============================
1. The edX analytics shall support courses having a large number of students (in the thousands). There shall be minimal delay in retrieving the data.
2. The analysis shall be done only on data which has not yet been processed, i.e., only new data shall be considered for analysis. This avoids reading unnecessary data again and again, and is referred to as incremental implementation of queries.
===============================
**Non Functional Requirements**
===============================
1. The visual diagrams displaying the analysis against various parameters of the student shall be in an easily understandable form, so that they can be understood by each and every instructor, including those from a non-mathematical background.
====================
**Project Overview**
====================
Hadoop Installation
-------------------
| Since the main aim of our project is to determine which student is gaming, we achieve this by parsing the log files generated at the server side when a student is taking a course on edX. Log entries are generated on a massive scale, so we cannot process them with a normal DBMS. For that we need to install Hadoop, which is a MapReduce system. The approach taken by MapReduce may seem like a brute-force approach: the premise is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
| Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. We installed Hadoop in two different scenarios:
:Standalone:
* By default, Hadoop is configured to run in a non-distributed or standalone mode
* HDFS is not utilized in this mode.
* Local file system is used for input and output
* There are no daemons running and everything runs in a single JVM instance.
* No custom configuration is required in the three Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
* Standalone mode is much faster than Pseudo-distributed mode.
:Multinode:
* This is the production mode.
* Data are used and distributed across many nodes.
* Different Nodes will be used as Master Node / Data Node / Job Tracker / Task Tracker
* The Hadoop daemons run on the local machines, and the cluster spans all the machines included in it.
Prerequisites
The prerequisite for Hadoop installation is a working Java 1.5+ (a.k.a. Java 5) installation. For the Hadoop installation procedure, refer to the above topics.
HIVE installation
------------------
| Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on huge volumes of data. Hive is used by many organizations as a general-purpose, scalable data processing platform. SQL is the lingua franca in business intelligence tools (ODBC is a common bridge, for example), so Hive is well placed to integrate with these products. Hive runs on our workstation and converts a SQL query into a series of MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which provide a means of attaching structure to data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore.
| Installation of Hive is straightforward. Java 6 is a prerequisite. You also need to have the same version of Hadoop installed locally that your cluster is running. Of course, we may choose to run Hadoop locally, either in standalone or multinode mode, while getting started with Hive. The proper procedure for installing Hive is mentioned in the above topics.
| Hive provides several services that you can run using the hive command. Out of those, we have used the hiveserver service of Hive to connect our Java program to Hive using the JDBC driver.
Sqoop Installation
------------------
| Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. This processing can be done with MapReduce programs or other higher-level tools such as Hive. When the final results of an analytic pipeline are available, Sqoop can export these results back to the database for consumption by other clients. The proper procedure for the installation of Sqoop is mentioned above.
| We have basically used Sqoop for importing data from SQL into Hive. We have used two different commands for import: one to import a single table and the other to import the whole database.
.. image:: pik2.png
Research paper study
----------------------
After all the installation procedures, we studied the research paper, “Detecting Student Misuse of Intelligent Tutoring Systems” authored by Ryan Shaun Baker, Albert T. Corbett and Kenneth R. Koedinger, and gave a presentation on it. Their study says that students who frequently engage in such ‘gaming the system’ behaviour (behaviour aimed at obtaining correct answers and advancing within the tutoring curriculum by systematically taking advantage of regularities in the software’s feedback and help) learn only two-thirds as much as similar students who do not engage in such behaviours. They came up with a machine-learned latent response model that can identify whether a student is gaming the system or not. Based on these predictions, the tutor can be re-designed for such students to make their learning process effective.
Baker and his colleagues found that a student’s frequency of gaming was strongly negatively correlated with learning. According to them, understanding why students game the system will be essential in deciding how the system should respond. Ultimately, though, whatever remediation approach is chosen, it is likely to have costs as well as benefits. For instance, preventive approaches, such as changing interface widgets to make them more difficult to game or delaying successive levels of help to prevent rapid-fire usage, may reduce gaming, but at the cost of making the tutor more frustrating and less time-efficient for other students. Since many students use help effectively and seldom or never game the system, the costs of using such an approach indiscriminately may be higher than the rewards. Whichever approach we take to remediating gaming the system, the success of that approach is likely to depend on accurately and automatically detecting which students are gaming the system and which are not.
The LRM they suggested takes 24 features as input or data source, together with the predetermined 'gaming or not' label for a training set of 70 students. It then uses forward selection for model selection and finally implements iterative gradient descent to find the best model parameters. The best-fitting model had 4 parameters, and no model considered had more than 6 parameters. They also used a cross-validation technique, LOOCV (Leave One Out Cross Validation). Finally, with the ROC (Receiver Operating Characteristic) curve, they classified the students as gaming or not gaming. Based on this result, they applied the interventions in the ITS.
===============
**Log Parsing**
===============
Log files provide information about the event data that is delivered in data packages. Events are emitted by the server or the browser to capture information about interactions with the courseware and the Instructor Dashboard in the LMS, and are stored in JSON documents. In the data package, event data is delivered in a log file. The JSON documents that include event data are delivered in a machine-readable format that is difficult to read. A sample is the following: ::
{"agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/30.0.1599.101 Safari/537.36", "context": {"course_id": "edx/AN101/2014_T1",
"module": {"display_name": "Multiple Choice Questions"}, "org_id": "edx", "user_id":
9999999}, "event": {"answers": {"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1":
"yellow", "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": ["choice_0", "choice_2"]},
"attempts": 1, "correct_map": {"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1":
{"correctness": "incorrect", "hint": "", "hintmode": null, "msg": "", "npoints": null,
"queuestate": null}, "i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1":
{"correctness": "correct", "hint": "", "hintmode": null, "msg": "", "npoints": null,
"queuestate": null}}, "grade": 2, "max_grade": 3, "problem_id": "i4x://edx/AN101/problem/
a0effb954cca4759994f1ac9e9434bf4", "state": {"correct_map": {}, "done": null, "input_state":
{"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {}, "i4x-edx-AN101-problem-
a0effb954cca4759994f1ac9e9434bf4_4_1": {}}, "seed": 1, "student_answers": {}}, "submission":
{"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {"answer": "yellow", "correct":
false, "input_type": "optioninput", "question": "What color is the open ocean on a sunny day?",
"response_type": "optionresponse", "variant": ""}, "i4x-edx-AN101-problem-
a0effb954cca4759994f1ac9e9434bf4_4_1": {"answer": ["a piano", "a guitar"], "correct": true,
"input_type": "checkboxgroup", "question": "Which of the following are musical instruments?",
"response_type": "choiceresponse", "variant": ""}}, "success": "incorrect"}, "event_source":
"server", "event_type": "problem_check", "host": "precise64", "ip": "NN.N.N.N", "page": "x_module",
"time": 2014-03-03T16:19:05.584523+00:00", "username": "AAAAAAAAAA"}
We can use pretty print (via the jq '.' command) to see this information in a readable format, which is shown below: ::
{
"agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
"context": {
"course_id": "edx/AN101/2014_T1",
"module": {
"display_name": "Multiple Choice Questions"
},
"org_id": "edx",
"user_id": 9999999
},
"event": {
"answers": {
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": "yellow",
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": [
"choice_0",
"choice_2"
]
},
"attempts": 1,
"correct_map": {
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {
"correctness": "incorrect",
"hint": "",
"hintmode": null,
"msg": "",
"npoints": null,
"queuestate": null
},
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": {
"correctness": "correct",
"hint": "",
"hintmode": null,
"msg": "",
"npoints": null,
"queuestate": null
}
},
"grade": 2,
"max_grade": 3,
"problem_id": "i4x://edx/AN101/problem/a0effb954cca4759994f1ac9e9434bf4",
"state": {
"correct_map": {},
"done": null,
"input_state": {
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {},
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": {}
},
"seed": 1,
"student_answers": {}
},
"submission": {
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_2_1": {
"answer": "yellow",
"correct": false,
"input_type": "optioninput",
"question": "What color is the open ocean on a sunny day?",
"response_type": "optionresponse",
"variant": ""
},
"i4x-edx-AN101-problem-a0effb954cca4759994f1ac9e9434bf4_4_1": {
"answer": [
"a piano",
"a guitar"
],
"correct": true,
"input_type": "checkboxgroup",
"question": "Which of the following are musical instruments?",
"response_type": "choiceresponse",
"variant": ""
}
},
"success": "incorrect"
},
"event_source": "server",
"event_type": "problem_check",
"host": "precise64",
"ip": "NN.N.N.N",
"page": "x_module",
"time": "2014-03-03T16:19:05.584523+00:00",
"username": "AAAAAAAAAA"
}
The JSON object 'event_source' is the most important feature we are dealing with. All the entries in the tracking.log file get parsed into the log table of our database. There is an entry corresponding to every event generated in the log files. Different types of events are generated. The common fields of those events are as follows:
:Agent:
| Browser agent string of the user who triggered the event
:Context:
| For all events, this field includes member fields that identify:
* The course_id of the course that generated the event.
* The org_id of the organization that lists the course.
* The user_id of the individual who is performing the action.
:Event:
| This field includes member fields that identify specifics of each triggered event.
:event_source:
| Specifies whether the triggered event originated in the browser or on the server.
:event_type:
| The type of event triggered.
:Host:
| The site visited by the user, for example, courses.edx.org.
:ip:
| IP address of the user who triggered the event.
:page:
| The URL of the page the user was visiting when the event was emitted.
:session:
| This 32-character value is a key that identifies the user's session.
:time:
| Gives the UTC time at which the event was emitted, in 'YYYY-MM-DDThh:mm:ss.xxxxxx' format.
:username:
| The username of the user who caused the event to be emitted. This string is empty for anonymous events, such as when the user is not logged in.
| Each and every event has its own attributes. Various types of events generated which we have used are as follows:-
* problem_check, generated at two sources: browser and server.
* problem_show
* load_video
* play_video
* speed_change_video
* seek_video
* pause_video
Procedure
----------
| Entries of the log files are parsed and then stored in the database.
| For each log entry, general data such as the IP address, event_type and event_source is stored in the table log.
| Data related to a specific event_type is stored in corresponding tables such as play_video and pause_video.
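As an illustration of this second step, the following is a minimal sketch (an assumption, not one of the project's actual classes) of how an event-specific entry could be stored, here for the seek_video event whose code, old_time and new_time fields are used by the seek-time feature later in this document; the helper Database.insertseekdata() and the class name are hypothetical: ::

import org.json.JSONObject;

public class SeekVideoParser { // illustrative class name, not part of the project
    public void parseline(String line, int logId) {
        try {
            JSONObject rootObject = new JSONObject(line);
            // the "event" member of a seek_video entry carries the seek details
            JSONObject event = new JSONObject(rootObject.get("event").toString());
            String code = event.get("code").toString();                        // 11-character video code
            int oldTime = (int) Double.parseDouble(event.get("old_time").toString());
            int newTime = (int) Double.parseDouble(event.get("new_time").toString());
            String username = rootObject.get("username").toString();
            Database db = new Database();
            // hypothetical helper that inserts one row into the seek_video table
            db.insertseekdata(logId, code, username, oldTime, newTime);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}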
The following jar files are used in the Java program to parse the log files:
* java-json.jar
* apache-logging-log4j.jar
* commons-httpclient-3.0.1.jar
* commons-logging-1.1.3.jar
* hadoop-core-1.2.0.jar
* hive-cli-0.13.0.jar
* hive-common-0.13.0.jar
* hive-exec-0.13.0.jar
* hive-jdbc-0.13.0.jar
* hive-metastore-0.13.0.jar
* hive-service-0.13.0.jar
* hive.txt
* libfb303-0.9.0.jar
* libthrift-0.9.0.jar
* log4j-1.2.16.jar
* slf4j-api-1.7.7.jar
* slf4j-jdk14-1.7.7.jar
Let old_size be the size of the file which has already been parsed and whose value has been stored in the status table. Let new_size be the size of the log file at present.
There are three cases for the log files, which are as follows:
| 1. old_size < new_size
This means that new entries have been added to the log file. The number of lines that have already been parsed is stored in the status table. That value is retrieved, that many lines are skipped, and the rest of the lines are parsed. Log entries are in the form of JSON. The following is a snippet for this case: ::
public class LineParser {
public int parseline(String line)
{
int success=0;
Database db = new Database();
JSONObject rootObject;
try {
rootObject = new JSONObject(line);
Log log = new Log();
JSONObject context = rootObject.getJSONObject("context");
log.setCourse_id(context.get("course_id").toString());
log.setOrg_id(context.get("org_id").toString());
try
{
log.setUser_id((Integer)context.get("user_id"));
}
catch(Exception e)
{
log.setUser_id(0);
}
try
{
JSONObject module = context.getJSONObject("module");
log.setModule(module.get("display_name").toString()); //3rd field
}
catch (Exception e)
{
log.setModule("");
}
log.setEvent_source(rootObject.get("event_source").toString()); //7th field
log.setEvent_type(rootObject.get("event_type").toString()); //6th and 8th field
log.setHost(rootObject.get("host").toString()); //9th field
log.setIp(rootObject.get("ip").toString()); //10th field
log.setPage(rootObject.get("page").toString()); //11th field
String time = rootObject.get("time").toString();
String time2=time.substring(0, 10);
time2=time2.concat(" ");
//system.out.println("********"+time2+"***********");
time2=time2.concat(time.substring(11,19));
log.setTime(time2); //12th field
log.setUsername(rootObject.get("username").toString()); //13th field
log.setEvent(rootObject.get("event").toString());
try
{
log.setSession(rootObject.get("session").toString());
}
catch(Exception e)
{
log.setSession("");
}
success = db.insertlogdata(log);
}
catch (JSONException e1)
{
e1.printStackTrace();
}
return success;
}
}
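The snippet above parses a single line; the driver that skips the already-parsed lines for this case is not shown. The following is a minimal sketch of such a driver, assuming the Database helpers getlinenum(), insertlinenum() and setsize() used elsewhere in this document (the class name is illustrative): ::

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class NewEntriesHandler { // illustrative name, not the project's actual class
    public void handlefile(File logfile) {
        try {
            Database db = new Database();
            LineParser lp = new LineParser();
            int linesparsed = db.getlinenum();   // lines already processed in a previous run
            BufferedReader buffered = new BufferedReader(new FileReader(logfile));
            String line;
            int count = 0;
            // skip the lines that have already been parsed
            while (count < linesparsed && buffered.readLine() != null) {
                count++;
            }
            // parse only the newly appended JSON entries
            while ((line = buffered.readLine()) != null) {
                if (line.startsWith("{")) {
                    lp.parseline(line);
                }
                count++;
            }
            buffered.close();
            db.insertlinenum(count);             // remember how many lines are parsed now
            db.setsize((int) logfile.length());  // remember the new size of the file
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}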
| 2. old_size=new_size
| The log file has not changed and no new entries added.
| 3. old_size > new_size
| This means that the log file has been archived. So all those files whose modification time is greater than the time stored in the status table (which is the modification time of the log file just processed) are retrieved and extracted.
| It is possible that the file which the program was parsing was modified (i.e., new log entries were added to it) and has now been archived. So we need to find the archived file corresponding to the file which was being parsed, and parse and process any new entries that were added to it.
| The file whose modification date is the oldest among those files is the file which was archived while the program was parsing it. The new entries which were added to it are parsed with the help of values such as linesparsed and the file size stored in the status table. The rest of the files are processed from beginning to end.
The following is a code snippet for this case: ::
public class FileHandler { // class name assumed; it is missing in the original snippet
public void handlefile()
{
System.out.println("Entered into filehandler");
File mydirectory = new File("/home/sachin/workspace/json/src/json/log");
System.out.println(mydirectory.isDirectory());
File names[] = mydirectory.listFiles();
Database db = new Database();
//getting filelastmodified date
long lastmdate=db.getfilelastmodified();
//creating arraylist to store new tar.gz files
ArrayList<File> left = new ArrayList<File>();
//adding new files into the arraylist
for(int i=0;i<names.length;i++)
{
if(names[i].lastModified()>lastmdate&&names[i].toString().matches(".*gz$"))
{
left.add(names[i]);
System.out.println("got match: "+names[i].lastModified()+" "+names[i]);
}
}
long table[][] = new long[left.size()][2];
//intializing two dimensional array
for(int i=0;i<left.size();i++)
{
table[i][0] = i;
table[i][1] = left.get(i).lastModified();
System.out.println(table[i][0]+" "+table[i][1]);
}
//sorting according to the date
for (int c = 0; c < left.size(); c++)
{
for (int d = 0; d < left.size() - c - 1; d++)
{
if(table[d][1] > table[d+1][1])
{
long swap = table[d][0];
table[d][0]=table[d+1][0];
table[d+1][0]=swap;
swap = table[d][1];
table[d][1]=table[d+1][1];
table[d+1][1]=swap;
}
}
}
InputStream is=null;
boolean first=true;
String line;
LineParser lp = new LineParser();
for(int i=0;i<left.size();i++)
{
try
{
if(first)
{
System.out.println("started reading file "+left.get((int) table[i][0]));
int linenum = db.getlinenum();
System.out.println("line count in file"+linenum);
is = new GZIPInputStream(new FileInputStream(left.get((int) table[i][0])));
BufferedReader buffered = new BufferedReader(new InputStreamReader(is));
int j=0;
// skip the lines that have already been parsed
while(j<linenum)
{
if((line=buffered.readLine())!=null)
{
j++;
System.out.println("The value of the j is "+j);
}
}
while((line=buffered.readLine())!=null)
{
System.out.println(line);
if(line.startsWith("{"))
{
lp.parseline(line);
}
}
buffered.close();
first=false;
db.setfilelastmodified(left.get((int) table[i][0]).lastModified());
}
else
{
System.out.println("started reading file "+left.get((int) table[i][0]));
is = new GZIPInputStream(new FileInputStream(left.get((int) table[i][0])));
Reader decoder = new InputStreamReader(is, "UTF-8");
BufferedReader buffered = new BufferedReader(decoder);
while((line=buffered.readLine())!=null)
{
if(line.startsWith("{"))
{
lp.parseline(line);
}
System.out.println(line);
}
buffered.close();
db.setfilelastmodified(left.get((int) table[i][0]).lastModified());
}
}
catch(Exception e)
{
System.out.println("error while reading tar.gz files"+left.get((int) table[i][0]));
}
}
db.setsize(0);
db.insertlinenum(0);
}
}
And the flowchart for the same is:
.. image:: diagram.png
.. image:: log.png
After the log entries have been classified into their event tables, queries were written on the tables to extract the features required for implementing machine learning.
Importing data from mysql to hive:
----------------------------------
The database schema defined above was first created in MySQL. It was then imported into Hive using Sqoop. Later on, however, we came up with a method by which we could connect to Hive directly from the Java program.
Connection of Hive with Java
----------------------------
| Since we are using Java programs for feature extraction on the Hive tables, we need to connect to Hive from the Java program. We can run Hive queries from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application using the Hive JDBC driver. The Hive JDBC driver allows you to access Hive from a Java program that uses JDBC to communicate with database products.
| Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application can connect to a Hive server running at the specified host and port. The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings. Before connecting to Hive from a Java program, we need to run hiveserver on a particular port. The default port number for hiveserver is 10000. We need to mention this port number in our Java program.
The project which is connecting to Hive should have the following jar files included:
* apache-logging-log4j.jar
* commons-httpclient-3.0.1.jar
* commons-logging-1.1.3.jar
* hadoop-core-1.2.0.jar
* hive-cli-0.13.0.jar
* hive-common-0.13.0.jar
* hive-exec-0.13.0.jar
* hive-jdbc-0.13.0.jar
* hive-metastore-0.13.0.jar
* hive-service-0.13.0.jar
* hive.txt
* libfb303-0.9.0.jar
* libthrift-0.9.0.jar
* log4j-1.2.16.jar
* slf4j-api-1.7.7.jar
* slf4j-jdk14-1.7.7.jar
The Connect class code snippet looks like: ::
Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection connect = DriverManager.getConnection("jdbc:hive://localhost:10000/exptnew", "", "");
System.out.println("Connected successfully");
return(connect);
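For completeness, the snippet above can be wrapped into a small self-contained class. The following is a minimal sketch, assuming hiveserver is running on the default port 10000 and using the database name 'exptnew' from the snippet above: ::

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Connect {
    public static Connection getConnect() throws ClassNotFoundException, SQLException {
        // register the Hive Type 4 (pure Java) JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        // jdbc:hive://host:port/dbname ; user name and password are left empty
        Connection connect = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/exptnew", "", "");
        System.out.println("Connected successfully");
        return connect;
    }
}

A feature-extraction class can then obtain a Statement with Connect.getConnect().createStatement() and run the HQL statements shown in the following sections.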
Feature Extraction:
-------------------
In order to determine if a student is gaming the system or not, we have written some queries on the parsed log data (the database we created for the entries from the log file). The features we came up with are :-
| 1. To calculate the minimum number of attempts a student takes in correctly answering a question
| 2. To calculate the amount of time a student is seeking (skipping) within a video
| 3. To calculate the difficulty level of each question in every course
| 4. To calculate the activity level of a student per day.
Feature 1
~~~~~~~~~
In order to find the difficulty level of a question, it is necessary to know the minimum number of attempts in which a student was able to answer the question correctly. EdX allows a user to answer a question any number of times, even after the student has already given the correct answer. This is why we need to consider only the first attempt in which the student correctly answered the question: analyzing the attempts after the user has already answered correctly is futile, because the student already knew the answer and those extra attempts only mean revision for the student, or that the student is just playing around with the system.
The Java class written for this purpose is Feature_no_of_attempts. The following snippet fulfills the above described functionality: ::
statement.executeQuery("insert overwrite table temp select problem_id,username,min(attempts) from problem_check_server where log_id>0 and log_id<=2000 and correctness='correct' group by problem_id,username");
//statement.executeQuery("insert overwrite table temp select problem_id,username,min(attempts) from problem_check_server where log_id>0 and log_id<=1000 and correctness='correct' group by problem_id,username");
System.out.println("first and last "+start+" "+end);
statement.executeQuery("insert into table attempts select p.problem_id,p.username,0 from temp p where not exists(select * from attempts where problem_id=p.problem_id and username=p.username)");
resultset=statement.executeQuery("select * from temp");
while(resultset.next())
{
a=resultset.getString(1);
b=resultset.getString(2);
c=resultset.getInt(3);
statement.executeQuery("insert overwrite table attempts select problem_id,username,case when problem_id='"+a+"' and username='"+b+"' then "+c+" else attempts end as attempts from attempts");
System.out.println(resultset.getString(1)+resultset.getString(2)+resultset.getString(3));
}
Finding the minimum number of attempts per user per question is done in three steps, using an incremental approach. The first step is to identify the users appearing in the newly generated log entries and store them in a temporary table along with the minimum number of attempts per question in the corresponding log entries. The second step is to insert dummy entries into the attempts table for users not already present in it. The third step is to insert the values extracted into the temporary table into the attempts table.
Feature 2
~~~~~~~~~
One of those features is to calculate the amount of time a user seeked within a video. For this, the details of each video are also required. As such information wasn't provided, a project named 'Download' (package name 'fetch_video_information') was written to extract the details of a video.
The table created for this purpose was 'video_information'. The steps involved in this program are:
1. Fetch the video code from the parsed log database ('load_video' table) in the Database class.
2. Pass this video code into the URL (in the Down class): http://gdata.youtube.com/feeds/api/videos/"+video_code+"?v=2&alt=jsonc Example - http://gdata.youtube.com/feeds/api/videos/dXb3Tx8V4hU?v=2&alt=jsonc opens the following :- ::
{"apiVersion":"2.1","data":{"id":"dXb3Tx8V4hU","uploaded":"2013-02-23T11:16:41.000Z","updated":"2013-02-23T11:16:41.000Z","uploader":"aakashlab","category":"People","title":"Android UI and Layouts part 2","description":"","thumbnail":{"sqDefault":"http://i1.ytimg.com/vi/dXb3Tx8V4hU/default.jpg","hqDefault":"http://i1.ytimg.com/vi/dXb3Tx8V4hU/hqdefault.jpg"},"player":{"default":"http://www.youtube.com/watch?v=dXb3Tx8V4hU&feature=youtube_gdata_player","mobile":"http://m.youtube.com/details?v=dXb3Tx8V4hU"},"content":{"5":"http://www.youtube.com/v/dXb3Tx8V4hU?version=3&f=videos&app=youtube_gdata","1":"rtsp://r2---sn-a5m7zu7z.c.youtube.com/CiILENy73wIaGQkV4hUfT_d2dRMYDSANFEgGUgZ2aWRlb3MM/0/0/0/video.3gp","6":"rtsp://r2---sn-a5m7zu7z.c.youtube.com/CiILENy73wIaGQkV4hUfT_d2dRMYESARFEgGUgZ2aWRlb3MM/0/0/0/video.3gp"},"duration":308,"viewCount":371,"favoriteCount":0,"commentCount":0,"accessControl":{"comment":"allowed","commentVote":"allowed","videoRespond":"moderated","rate":"allowed","embed":"allowed","list":"allowed","autoPlay":"allowed","syndicate":"allowed"}}}
3. The above URL opens a page containing the JSON object about that video. So, next we downloaded this piece of information into a file.
The code snippet for the same looks like : ::
URL url = new URL("http://gdata.youtube.com/feeds/api/videos/"+video_code+"?v=2&alt=jsonc");
Scanner s = new Scanner(url.openStream());
String line;
while(s.hasNext())
{
line=s.nextLine();
File file = new File("/home/dell/workspace/Download/src/videoJson.json");
// if the file doesn't exist, then create it
if (!file.exists())
{
file.createNewFile();
}
FileWriter fw = new FileWriter(file.getAbsoluteFile());
BufferedWriter bw = new BufferedWriter(fw);
bw.write(line);
bw.close();
System.out.println(line);
JsonParser.parseJson(video_code);
}
4. Then, we parsed out the objects, title and duration from this JSON object (in the JsonParser class).
The code snippet for the same looks like : ::
JSONObject obj = new JSONObject(jsonStr);
String title = obj.getJSONObject("data").getString("title");
System.out.println(title);
JSONObject obj2 = new JSONObject(jsonStr);
int duration = (int) obj2.getJSONObject("data").get("duration");
System.out.println(duration);
Database.putdata(video_code,title,duration);
5. Finally this information was stored back in the table 'video_information' (in the Database class).
This process was repeated for all video codes. (Run the Download project only when new entries are required.)
Connection with Hive was made using the Connect class.
After the video_information table is ready, the main queries for extraction of seek time can be implemented.
:Extracting the amount of time a student seeked (or skipped) a video:
This feature is concerned with extracting the amount of time a student has skipped within the videos. If a student is seeking a video for more than the amount of his/her viewed time, then the student is likely not interested in the course. (It is also possible that a student is skipping one video only because he/she already has some knowledge about that topic. This is difficult to track because we cannot estimate a student's knowledge of a specific topic. However, it is very rare to find a student who seeks through almost all the videos in a course because he/she already knows the topic; in that case the student would not have selected the course at all. As we keep track of the seek time of all videos in a course for each student, the case of a student seeking just one or two odd videos because of previous knowledge is handled in the mapping function described in a further topic.)
The Java class written for this purpose is Feature_seek_time. The following snippet fulfills the above described functionality: ::
statement.executeQuery("drop table temp0");
statement.executeQuery("create table temp0(code string,username string,seek int)");
statement.executeQuery("create table seek_time_total(code string,username string,seek int,title string,duration int)");
statement.executeQuery("insert into table temp0 select sv.code,sv.username,sum(sv.new_time-sv.old_time) from seek_video sv where log_id>="+start+" and log_id<"+end+" and not exists(select * from temp0 t2 where sv.code=t2.code and sv.username=t2.username) group by sv.code,sv.username ");
resultset=statement.executeQuery("select * from temp0");
statement.executeQuery("insert overwrite table seek_time_total select a.code,a.username,a.seek,b.title,b.duration from temp0 a join video_information b on a.code=b.code");
resultset=statement.executeQuery("select * from seek_time_total");
while(resultset.next())
{
System.out.println(resultset.getString(1)+"\t"+resultset.getString(2)+"\t"+resultset.getString(3)+"\t"+resultset.getString(4)+"\t"+resultset.getString(5)+"\t");
}
statement.executeQuery("insert overwrite table status select name, instring, case when name='seek_time_processed' then "+end+" else inint end as inint from status");
The sample output looks like:
=========== ======== ==== ============================= ========
code username seek title duration
=========== ======== ==== ============================= ========
RU2qJTO0Gms ak 764 IntroductionToAndroidPart1 927
RU2qJTO0Gms sachin 696 IntroductionToAndroidPart1 927
KdX4DaFRAKU ak 440 Android UI and Layouts part-1 415
KdX4DaFRAKU oshin 26 Android UI and Layouts part-1 415
2E_KTtnbzVU sachin 269 Android UI and Layouts part 3 395
aI1uMZMmnY8 sachin 181 Android UI and Layouts part 5 351
d45uLZEU5U0 oshin 758 Introduction To Android Part2 782
=========== ======== ==== ============================= ========
To execute this query, we used an intermediate table named temp0. The table temp0 holds the 11-digit code of the video, the username and the seek time of a user whose entry for that particular video code is not yet present in the table. The seek time is calculated as the difference between the new_time and old_time fields in the seek_video table. Only those entries which have not been processed yet are considered during query execution. This is taken care of by the status table entry 'seek_time_processed', which contains the log_id of the entry that was processed last. This is yet another example of an incremental query, which uses the two variables 'start' and 'end' to implement this concept.
The table 'seek_time_total' then finally contains the code, username, seek time (in seconds), title and duration of the video (in seconds). The video_information table gives the details about the title and duration of the video in seconds.
:Points to be noted:
1. The statements that drop table temp0 and create temp0 and seek_time_total should be executed the first time only; comment these lines out once done.
2. Also, if a student has watched a video completely, only a pause event is generated; no special event indicating whether he/she has finished watching the video is generated. Thus there will be a problem when a student has almost finished the video, and also when the video is watched multiple times.
Feature 3
~~~~~~~~~
| This is a measure of how active each user of each course is per day. For this, the entries of the log tables have been taken. Two types of events have been considered:
:play_video:
| This activity gives an estimate of how active a user is in watching videos. Though it does not give an exact measure, it gives an approximate idea of how many times the user plays and pauses the videos.
:problem_check:
This activity gives an estimate of how many problems a user has attempted. The final answer of this query is stored in the table activity_per_day.
| This table has the following columns:-
* username
* day
* course_id
* video_acts
* quiz_acts
| For implementing runtime queries, we make use of three intermediate tables, namely:
:video_activity:
* username
* day
* course_id
* count
:quiz_activity:
* username
* day
* course_id
* count
| These two tables contain, for a particular course, username and day, how many videos were played and how many questions were attempted, respectively. A third table, which is a join of these two tables on username, course_id and day, contains the actual count of videos played and questions attempted by each user on each day of every course. This table contains the entries which need to be put into the table activity_per_day: entries which already exist there are updated, and new entries are created for those which are not present. The structure of the temp table is as follows:
:temp:
* username
* day
* course_id
* count_video
* count_quiz
| Since we need to create runtime queries, we should consider only those entries in the log tables which are new. So we keep track of the log_id entries that have already been processed in the status table; let this be old_val. The maximum log_id is taken as new_val. Only those entries are considered whose log_id is greater than old_val and less than or equal to new_val. After that, the status table is updated with new_val as the entry for the log_id already processed.
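The queries for this feature are not listed above; the following is a hedged sketch of how they could look, using the table and column names described in this section (the status entry name 'activity_processed' and the class name are assumptions), with the Connect class sketched earlier supplying the JDBC connection: ::

import java.sql.Connection;
import java.sql.Statement;

public class Feature_activity_per_day { // illustrative class name
    public void extract(long old_val, long new_val) throws Exception {
        Connection con = Connect.getConnect();
        Statement statement = con.createStatement();
        // count play_video events per user, day and course from the new log entries only
        statement.executeQuery("insert overwrite table video_activity "
            + "select username,to_date(time),course_id,count(*) from log "
            + "where event_type='play_video' and log_id>" + old_val + " and log_id<=" + new_val
            + " group by username,to_date(time),course_id");
        // count problem_check events in the same way
        statement.executeQuery("insert overwrite table quiz_activity "
            + "select username,to_date(time),course_id,count(*) from log "
            + "where event_type='problem_check' and log_id>" + old_val + " and log_id<=" + new_val
            + " group by username,to_date(time),course_id");
        // join the two counts into the temp table described above
        statement.executeQuery("insert overwrite table temp "
            + "select v.username,v.day,v.course_id,v.`count`,q.`count` "
            + "from video_activity v join quiz_activity q "
            + "on v.username=q.username and v.day=q.day and v.course_id=q.course_id");
        // remember the last processed log_id in the status table
        statement.executeQuery("insert overwrite table status select name,instring,"
            + "case when name='activity_processed' then " + new_val
            + " else inint end as inint from status");
    }
}

The rows collected in temp are then merged into activity_per_day as described above, updating the existing (username, day, course_id) entries and inserting new ones.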
Feature 4
~~~~~~~~~~
The next feature deals with the difficulty level of each question. This difficulty level is based on how many students have attempted the question and in how many attempts. The final result of this query is stored in the following table:
:diff_level:
* problem_id
* attempts
* no_of_users
* level
| Since we need to make the query runtime, we keep a note of the log_id which has been processed till now in the status table. We fetch that value and parse only those entries from the table problem_check_server whose log_id is greater than that value. The result of the query, i.e., which questions have been attempted by how many students and in how many attempts, is stored in a temporary table:
:record:
* problem_id
* username
* attempts
| From this table we can calculate how many students attempted a particular question and in how many attempts in total. We can use that value to update the difficulty level of the questions whose entries already exist in the table difficulty_level. For the entries which do not exist, dummy entries are created and then updated. The update works as follows.
| Let x be the number of attempts recorded against a question and y be the number of users involved in them. The difficulty level of the question is then updated as:
| new_level = (old_level * no_of_users + x) / (no_of_users + y)
| and the no_of_users column is updated as
| no_of_users = no_of_users + y
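The corresponding update query is not reproduced above; the following is a hedged sketch of how these formulas could be applied from the record table, using the diff_level columns listed in this section (the class name is illustrative, and the dummy-entry step follows the pattern used elsewhere in this document): ::

import java.sql.Connection;
import java.sql.Statement;

public class Feature_difficulty_level { // illustrative class name
    public void update() throws Exception {
        Connection con = Connect.getConnect();
        Statement statement = con.createStatement();
        // create dummy rows for problems that are not yet present in diff_level
        statement.executeQuery("insert into table diff_level "
            + "select distinct r.problem_id,0,0,0 from record r "
            + "where not exists(select * from diff_level d where d.problem_id=r.problem_id)");
        // x = new attempts per problem, y = new users per problem, then
        // level = (level*no_of_users + x)/(no_of_users + y) and no_of_users = no_of_users + y
        statement.executeQuery("insert overwrite table diff_level "
            + "select d.problem_id,"
            + "d.attempts+coalesce(n.x,0),"
            + "d.no_of_users+coalesce(n.y,0),"
            + "case when n.problem_id is null then d.level "
            + "else (d.level*d.no_of_users+n.x)/(d.no_of_users+n.y) end "
            + "from diff_level d left outer join "
            + "(select problem_id,sum(attempts) as x,count(distinct username) as y "
            + "from record group by problem_id) n "
            + "on d.problem_id=n.problem_id");
    }
}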
Machine Learning:
==================
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.
The main aim of our project is to classify the students as gaming or not gaming. Accordingly, the ITS will intervene in the learning process of the gaming students and make their learning process effective. In reference to the research paper, **“Detecting Student Misuse of Intelligent Tutoring Systems” authored by Ryan Shaun Baker, Albert T. Corbett, Kenneth R. Koedinger**, their study says that students who frequently engage in such 'gaming the system' behaviour (behaviour aimed at obtaining correct answers and advancing within the tutoring curriculum by systematically taking advantage of regularities in the software’s feedback and help) learn only two-thirds as much as similar students who do not engage in such behaviours. They came up with a machine-learned latent response model that can identify whether a student is gaming the system or not. Based on these predictions, the tutor can be re-designed for such students to make their learning process effective.
Baker and his colleagues found that a student’s frequency of gaming was strongly negatively correlated with learning. According to them, understanding why students game the system will be essential to deciding how the system should respond. Ultimately, though, whatever remediation approach is chosen, it is likely to have costs as well as benefits. For instance, preventive approaches, such as changing interface widgets to make them more difficult to game or delaying successive levels of help to prevent rapid-fire usage, may reduce gaming, but at the cost of making the tutor more frustrating and less time-efficient for other students. Since many students use help effectively and seldom or never game the system, the costs of using such an approach indiscriminately may be higher than the rewards. Whichever approach we take to remediating gaming the system, the success of that approach is likely to depend on accurately and automatically detecting which students are gaming the system and which are not.
The LRM they suggested takes 24 features as input or data source, together with the predetermined 'gaming or not' label for a training set of 70 students. It then uses forward selection for model selection and finally implements iterative gradient descent to find the best model parameters. The best-fitting model had 4 parameters, and no model considered had more than 6 parameters. They also used a cross-validation technique, LOOCV (Leave One Out Cross Validation). Finally, with the ROC (Receiver Operating Characteristic) curve, they classified the students as gaming or not gaming. Based on this result, they applied the interventions in the ITS.
From our project's point of view, the machine learning system is trained on each student's respective 3 features, so as to make it learn to distinguish between students who are gaming the system and students who are not. After learning, it can be used to classify whether a student is gaming the system or not.
In order to implement the machine learning algorithms on the features extracted by the Hive queries, we have to convert them into a proper form (numerical values) suitable for implementation. To achieve this, we have mapped the query results into the following form:
username  feature1  feature2  feature3  result
This data has been stored in the table feature. For each user there is just one entry in this table, and the field 'result' stores the precoded data, i.e., whether the student is gaming or not.
Feature 1:
----------
This mapping feature deals with the question-solving ability of a person. It not only calculates how many questions have been solved by a user in each course, but also deals with the difficulty level of each question solved by the user. For this, we have utilised the information from the two tables record and difficulty_level:
:diff_level:
* problem_id
* attempts
* no_of_users
* level
:record:
* problem_id
* username
* attempts
From these two tables we decided upon a numerical value depending on the extracted information of how many questions were solved by each user and of what difficulty. This information is stored in the following intermediate table:
:assign:
* username
* value
| This value has been calculated on the following basis:
new_value = old_value + summation(difficulty_level / attempts) / summation(no_of_questions)
| For those users whose entries do not exist in the final table feature (whose schema is explained below), entries are created with dummy values, and those values are finally updated. This feature counts as feature 1, so its value is stored in the f1 column against the particular user in the feature table:
:feature:
* username
* f1
* f2
* f3
* result
| Following is the query for the same ::
stmt.executeQuery("insert overwrite table assign select b.username,sum(a.diff*b.attempts)/count(*) from diff a join record b group by username");
res=stmt.executeQuery("select * from assign");
while(res.next())
{
//System.out.println(res.getString(1)+"\t"+res.getString(2));//+res.getString(3)+"\t"+res.getString(4)+res.getString(5)+"\t");
}
stmt.executeQuery("insert into table feature select a.username,0,0,0,0 from assign a where not exists(select * from feature where username=a.username) ");
res=stmt.executeQuery("select * from assign");
float g;
while(res.next())
{
a=res.getString(1);
g=res.getFloat(2);
//System.out.println(res.getString(1)+"\t"+res.getString(2));//+"\t"+res.getString(3)+"\t"+res.getString(4)+"\t"+res.getString(4)+"\t");
stmt.executeQuery("insert overwrite table feature select username,case when username='"+a+"' then "+g+" else f1 end as f1,f2,f3,result from feature ");
}
Feature 2:
----------
:Mapping the seek_time feature:
We have written a java class Map_feature_seek_time. In this for each user, we have calculated :
[sum{(duration/(duration+seek))*10}]/number of videos seeked
(Say, d = duration and s = seek time.)
i.e., the sum of the fractions (d/(d+s)) multiplied by 10 (so that the range of a student's score remains within 10), divided by the total number of videos he/she has seeked.
If the computed grade is close to 10, the student is regular and seeks little; otherwise, the student is seeking through most of the videos.
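For instance, taking the first sample row of the seek_time_total table shown earlier (user 'ak' on video RU2qJTO0Gms, with duration d = 927 and seek time s = 764), that single video contributes (927 / (927 + 764)) * 10 ≈ 5.5 towards the average, indicating heavy seeking; a video that is barely seeked contributes a value close to 10.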
The code snippet for the above is: ::
statement.executeQuery("insert into table feature_seek select username,sum((duration/(duration+seek))*10)/count(*) from seek_time_total group by username");
statement.executeQuery("insert into table feature select a.username,0,0,0,0 from feature_seek a where not exists(select * from feature where username=a.username) ");
Feature 3:
----------
This feature depends on the activity level of the user, i.e., how much the user is interacting with the system. It combines the results obtained after processing the log file and storing the per-day activity of each user in the table activity_per_day into a single value per user. It is clear that students not interested in the course will have a minimal activity level. Also, students who are trying to game the system will have high activity levels, as they will constantly seek and pause videos and, while taking a test, answer the questions without contemplating them.
The Java class written for this purpose is Feature_seek_time. The following snippet fulfills the above described functionality: ::
stmt.executeQuery("insert into table feature select apd.username,apd.course_id,0,0,0,0 from activity_per_day as apd where not exists (select username,course from feature as fe where fe.username=apd.username and fe.course=apd.course_id)");
stmt.executeQuery("insert overwrite table tmp_feature_attempt select username,course_id,(10-abs((sum(video_act)-"+avg+")/("+avg+"*count(*))*10)) as value from activity_per_day group by username,course_id");
stmt.executeQuery("insert overwrite table feature select f.username,f.course,f.f1,f.f2,case when f.username=apd.username and f.course=apd.course_id then apd.value else f.f3 end as f3,result from tmp_feature_attempt as apd join feature as f on apd.username=f.username and apd.course_id=f.course");
This is an incremental query, i.e., it only processes the log entries which were not processed earlier. To accomplish the task of extracting this feature, three steps are required. The first involves inserting dummy entries for the users which were newly added in the log table. The second step involves calculating the level of activity and storing the activity-level values in the intermediate table. The value of activity is calculated such that users whose activity level is at the average level of all users are awarded the highest score, i.e., 10, and as a student's activity level deviates from the average activity level of all users, the score decreases towards the lowest possible score of 0. In the third step, the scores which were calculated for each student are added into the table feature.