user2code2vec: Embeddings for Profiling Students Based on Distributional Representations of Source Code
Full research paper presented at the Learning Analytics & Knowledge 2019 Conference (LAK 2019) in AZ, USA.
Please consider citing this paper if you use any of the work.
@inproceedings{azcona2019user2code2vec,
  title={user2code2vec: Embeddings for Profiling Students Based on Distributional Representations of Source Code},
  author={Azcona, David and Arora, Piyush and Hsiao, I-Han and Smeaton, Alan},
  booktitle={Proceedings of the 9th International Learning Analytics \& Knowledge Conference (LAK’19)},
  year={2019},
  organization={ACM}
}
In this work, we propose a new methodology to profile individual Computer Science students based on their programming design, using embeddings. We investigate different approaches to analyzing user source code submissions in the Python language. We compare the performance of different source code vectorization techniques at predicting the correctness of a code submission. In addition, we propose a new mechanism to represent students based on their code submissions for a given set of laboratory tasks on a particular course. This way, we can make deeper recommendations for programming solutions and pathways to effectively support student learning and progression in computer programming modules at a Higher Education Institution. Recent work using Deep Learning tends to perform better as more data is provided. However, in Learning Analytics, the number of students in a course is an unavoidable limit: we cannot simply generate more data as is done in other domains such as FinTech or Social Network Analysis. Our findings indicate a need to develop better mechanisms to extract and learn effective data features from students in order to analyze their progression and performance effectively.
- Code Vectorization: representations of programming submissions obtained by tokenizing the code:
  - Code as Word Vectors: words are extracted from the code solutions by splitting only on space, tab (\t) and newline (\n) characters
  - Code as Token Vectors: Python's tokenize module for source code analysis, which provides a lexical scanner for Python code
  - Code as Abstract Syntax Tree Vectors: a tree representation of the abstract syntactic structure of the code, which preserves the structure of the source code in a submission

For example, given the following program:
#!/usr/bin/env python
def say_hello():
    print("Hello, World!")

say_hello()
- Code as Word Vectors
['def', 'say_hello():', 'print("Hello,', 'World!")', 'say_hello()']
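A minimal sketch of this word-level splitting; the helper name `to_word_vector` is ours for illustration:

```python
import re

def to_word_vector(source):
    """Split source code into word tokens using only space,
    tab and newline characters as delimiters."""
    return [word for word in re.split(r"[ \t\n]+", source) if word]

code = '''def say_hello():
    print("Hello, World!")

say_hello()
'''

print(to_word_vector(code))
# ['def', 'say_hello():', 'print("Hello,', 'World!")', 'say_hello()']
```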
- Code as Token Vectors: Categories or IDs
Characters Category Token
1,0-1,3: NAME 'def'
1,4-1,13: NAME 'say_hello'
1,13-1,14: OP '('
1,14-1,15: OP ')'
1,15-1,16: OP ':'
1,16-1,17: NEWLINE '\n'
2,0-2,4: INDENT '    '
2,4-2,9: NAME 'print'
2,9-2,10: OP '('
2,10-2,25: STRING '"Hello, World!"'
2,25-2,26: OP ')'
2,26-2,27: NEWLINE '\n'
3,0-3,1: NL '\n'
4,0-4,0: DEDENT ''
4,0-4,9: NAME 'say_hello'
4,9-4,10: OP '('
4,10-4,11: OP ')'
4,11-4,12: NEWLINE '\n'
5,0-5,0: ENDMARKER ''
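These pairs come straight from Python's lexical scanner. A minimal sketch with the standard tokenize module; the helper name `to_token_vector` is ours:

```python
import io
import tokenize

def to_token_vector(source):
    """Run Python's lexical scanner over the source and return
    (category, token) pairs, as in the table above."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]

code = '''def say_hello():
    print("Hello, World!")

say_hello()
'''

for category, token in to_token_vector(code):
    print(category, repr(token))
```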
- Code as Abstract Syntax Tree Vectors
Parent Node Child Node
'Module' 'FunctionDef'
'Module' 'Expr'
'FunctionDef' 'say_hello'
'FunctionDef' 'arguments'
'FunctionDef' 'Print'
'Expr' 'Call'
'Print' 'Str'
'Print' 'bool'
'Call' 'Name'
'Str' 'Hello\tWorld!'
'Name' 'say_hello'
'Name' 'Load'
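The table above was generated under Python 2, whose grammar has a dedicated Print node; under Python 3 the call to print appears as an ordinary Call node. Below is a minimal sketch of collecting (parent, child) pairs with the standard ast module; it records node types only, whereas the full representation above also keeps selected attribute values such as identifiers and string literals:

```python
import ast

def ast_edges(source):
    """Parse the source into an AST and collect
    (parent node type, child node type) pairs."""
    tree = ast.parse(source)
    return [(type(parent).__name__, type(child).__name__)
            for parent in ast.walk(tree)
            for child in ast.iter_child_nodes(parent)]

code = '''def say_hello():
    print("Hello, World!")

say_hello()
'''

for parent, child in ast_edges(code):
    print(parent, '->', child)
```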
We transform student code submissions into meaningful vectors using either bag-of-words or embeddings. For each technique, we leverage our code vectorization approaches: Words, Python Token Categories, Python Token Words and AST Nodes.
- Code Bag-of-Words: an orderless representation of source code as the bag (multiset) of its words; see the sketch after this list. Further details: BOW (bag-of-words) Part I, BOW Part II
- Code Embeddings: code submissions are mapped to vectors of real numbers in a continuous vector space; a sketch follows the figures below. Further details: Embeddings Part I, Embeddings Part II
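A minimal sketch of the bag-of-words representation; the `bag_of_words` helper is ours for illustration, and any of the four vectorization approaches can serve as the tokenizer:

```python
from collections import Counter

def bag_of_words(submissions, tokenizer):
    """Orderless bag-of-words: fix a vocabulary over all submissions
    and represent each submission as a vector of token counts."""
    vocab = sorted({tok for code in submissions for tok in tokenizer(code)})
    vectors = []
    for code in submissions:
        counts = Counter(tokenizer(code))
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

# Example with whitespace word splitting, as in the word tokenizer above.
submissions = ['print("hi")', 'x = 1\nprint(x)']
vocab, vectors = bag_of_words(submissions, str.split)
print(vocab)    # ['1', '=', 'print("hi")', 'print(x)', 'x']
print(vectors)  # [[0, 0, 1, 0, 0], [1, 1, 0, 1, 1]]
```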
Embeddings for the Top 20 Most Common Words
Embeddings for the Top 20 Most Common Token Words
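One way such embeddings can be learned is by training a small neural network to predict submission correctness and then reading off the embedding layer's weights. The sketch below uses Keras; the architecture, vocabulary size and embedding dimension are illustrative assumptions rather than the exact configuration from the paper, and the two dummy sequences stand in for real token-ID data:

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 5000   # assumed vocabulary size
EMBED_DIM = 50      # assumed embedding dimension
MAX_LENGTH = 50     # sequence padding limit, as used for the course data

# Dummy stand-ins for token-ID sequences and correctness labels.
sequences = [[12, 7, 93, 4], [12, 7, 93, 4, 4, 18]]
X = pad_sequences(sequences, maxlen=MAX_LENGTH)
y = np.array([1, 0])  # 1 = correct submission, 0 = incorrect

model = models.Sequential([
    layers.Input(shape=(MAX_LENGTH,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, name='code_embedding'),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=2, verbose=0)

# The learned embedding matrix: one EMBED_DIM vector per token ID.
embeddings = model.get_layer('code_embedding').get_weights()[0]
```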
For each course and academic year, a User Representation Matrix is constructed for each student from the code vectors of their submissions to the labsheets proposed by the lecturer. Having a vector representation of code submissions allows researchers to generate a higher-level representation of each student. This User Representation Matrix is built by vectorizing the latest submission for each task using either:
- Word Tokenizer
- Token Word Python Tokenizer
- user2code2vec: vectorization of the User Representation Matrix of shape (number_tasks, MAX_LENGTH). The User Representation Matrix for each student is flattened into a single long vector. PCA is then leveraged as the dimensionality reduction technique to project the 100-dimension user embeddings into two dimensions for visualization. In short, a student is represented as a vector of her submissions.
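A minimal sketch of this pipeline under stated assumptions: `vectorize` and the toy `students` dictionary are hypothetical stand-ins for a real token-ID vectorizer and the students' latest submissions per task:

```python
import numpy as np
from sklearn.decomposition import PCA

NUM_TASKS = 276   # tasks in Programming I
MAX_LENGTH = 50   # sequence padding limit

def user_matrix(latest_submissions, vectorize):
    """User Representation Matrix: one padded code vector per task,
    built from the student's latest submission for that task."""
    matrix = np.zeros((NUM_TASKS, MAX_LENGTH))
    for task_id, code in latest_submissions.items():
        vec = vectorize(code)[:MAX_LENGTH]    # truncate long sequences
        matrix[task_id, :len(vec)] = vec      # zero-pad short ones
    return matrix

# Hypothetical stand-ins: a hash-based token-ID vectorizer and two
# students with one submission each (real data covers all 276 tasks).
vectorize = lambda code: [hash(tok) % 5000 for tok in code.split()]
students = {
    'student_a': {0: 'def say_hello(): print("Hello")'},
    'student_b': {0: 'print("hi")'},
}

# user2code2vec: flatten each matrix into one long vector per student
# (276 x 50 = 13,800 dimensions), then project to 2-D with PCA.
user_vectors = np.array([user_matrix(subs, vectorize).flatten()
                         for subs in students.values()])
points = PCA(n_components=2).fit_transform(user_vectors)
```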
User Raw Representations Using Word Tokens
User Learned Embeddings Using Word Tokens
Each dot in the graphs is a student, represented by the vector of her submissions. For the course analysed, Programming I, these student vectors have 13,800 dimensions: 276 tasks with a sequence padding limit of 50. The colors show each student's average performance in the course, and it is very hard to cluster students due to the curse of dimensionality.
In a high-dimensional feature space where each feature has a range of possible values, an enormous amount of training data is typically required to ensure that there are several samples with each combination of values. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation; with 13,800 dimensions that would call for around 69,000 students, so roughly a hundred students is not sufficient. Thus, instead of representing each student as the concatenation of all the submissions they made in a course, it would be better to identify important features from each submission and concatenate those key features across code submissions. In short, keep the number of features low to learn effectively from constrained data. These user2code2vec representations can then be used to identify student neighbours for programming recommendations.
- Viz an AST: An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of source code written in a programming language. In Python, we leverage the ast module to process trees of the Python abstract syntax grammar. Visualizations are made using a great notebook developed by H. Chase Stevens.
- BOW (bag-of-words): The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR) in which a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. To extract these words from Python source code we leverage the tokenize module, the tokenizer for Python source, which provides a lexical scanner for Python source code.
- Word Embeddings: Word embeddings are a feature learning technique in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, this involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. A common method to develop this mapping is fitting a neural network. We use PCA to reduce the dimensionality of the embeddings and plot them in 2D, as in the sketch below.
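A minimal sketch of that 2-D projection; the embedding matrix and token labels here are dummy stand-ins for the learned embedding weights and the top-20 most common tokens:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Dummy stand-ins: in practice, slice the learned embedding matrix to
# the rows of the 20 most common tokens in the vocabulary.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 50))      # 20 tokens, 50-D embeddings
tokens = ['tok_%d' % i for i in range(20)]  # hypothetical token labels

# Project the 50-D embeddings down to 2-D for plotting.
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), tok in zip(coords, tokens):
    plt.annotate(tok, (x, y))
plt.title('Top 20 tokens: embeddings projected to 2-D with PCA')
plt.show()
```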