ACM CODS-COMAD 2024 Data Challenge - Vid Pre-req edge detection

Team - JeS0 - Rank 3

Summary of our approach

  • Final submission file:
    • ./src/submissions/83_GNN_submission.csv
    • Weighted F1 (Local): 0.704
      • Path to (best) model checkpoint used to test (& extra files): ./src/submissions/models/83/0.704-F1_ep156_EdgeConvGNNClassifier.pt
    • Public LB: 0.44912
  • Code path: ./src/
    • models.py -> Contains the GNN model.
    • siam_models.py -> Contains the Siamese model. (Not used in final submission)
    • train.py -> Script to train the model.
    • test.py -> Script to test the model on public test set.
    • gen_sub.py -> Script to generate the final submission file in the required format.
    • eda.ipynb -> For Exploratory Data Analysis.
    • pre_embs.ipynb -> To generate node embeddings from given data.
    • ind_lbls.ipynb -> To generate indices & labels for all data.
    • lm_embs.ipynb -> To test & generate LM embeddings for the transcripts.
    • lm_emb_norm.ipynb -> To test normalizations for the LM embeddings.
  • Additional data files generated by us (preprocessed data etc.) are available in ./data/kagdata/, along with the original data files.

For this competition, we modeled the problem as a link prediction task on a graph. The nodes are the videos (transcripts) and the edges are the pre-requisite relationships between them, so the task is to predict whether an edge exists between two nodes. Our best model is a GNN-based classifier that uses dynamic edge convolutions. The model was implemented using torch-geometric and trained on a single A100 GPU (80 GB). See ./src/models.py for the model implementation and ./src/train.py for training details. We used the Adam optimizer (LR = 0.0001) with BCEWithLogitsLoss as the loss function.
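To make the setup above concrete, here is a minimal sketch of a link classifier built on a dynamic edge convolution and trained with Adam (LR = 0.0001) and BCEWithLogitsLoss. It is written in plain PyTorch rather than torch-geometric so that it stays self-contained, and it is not the code in ./src/models.py. The class names, layer sizes, `k`, and the random toy data are all assumptions made for illustration.

```python
import torch
from torch import nn


class DynamicEdgeConvSketch(nn.Module):
    """EdgeConv over a k-nearest-neighbour graph rebuilt from the current features."""

    def __init__(self, in_dim: int, out_dim: int, k: int):
        super().__init__()
        self.k = k  # assumed neighbourhood size; must be smaller than the node count
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim). Neighbour selection is not differentiable.
        with torch.no_grad():
            dist = torch.cdist(x, x)
            dist.fill_diagonal_(float("inf"))  # a node is not its own neighbour here
            idx = dist.topk(self.k, largest=False).indices  # (num_nodes, k)
        x_i = x.unsqueeze(1).expand(-1, self.k, -1)  # centre node, repeated k times
        x_j = x[idx]                                 # its k nearest neighbours
        messages = self.mlp(torch.cat([x_i, x_j - x_i], dim=-1))
        return messages.max(dim=1).values            # max-aggregate over neighbours


class EdgeConvLinkClassifierSketch(nn.Module):
    """Scores candidate (source, target) node pairs; hypothetical architecture."""

    def __init__(self, in_dim: int, hidden_dim: int = 64, k: int = 4):
        super().__init__()
        self.conv = DynamicEdgeConvSketch(in_dim, hidden_dim, k)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x: torch.Tensor, pairs: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)
        pair_repr = torch.cat([h[pairs[:, 0]], h[pairs[:, 1]]], dim=-1)
        return self.scorer(pair_repr).squeeze(-1)    # one raw logit per pair


def train_step(model, optimizer, criterion, x, pairs, labels) -> float:
    """Run one optimisation step and return the loss as a Python float."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x, pairs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy data (random, for illustration only): 20 nodes with 8 features, 32 labelled pairs.
torch.manual_seed(0)
x = torch.randn(20, 8)
pairs = torch.randint(0, 20, (32, 2))
labels = torch.randint(0, 2, (32,)).float()

model = EdgeConvLinkClassifierSketch(in_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()
loss_value = train_step(model, optimizer, criterion, x, pairs, labels)
```

Because BCEWithLogitsLoss applies the sigmoid internally, the model returns raw logits; at inference time, `torch.sigmoid(model(x, pairs))` would give edge probabilities.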

Since no node features were available, we bootstrapped them from the edge features with a simple combination strategy. Because a few edges are bidirectional, we ignored edge directionality (dropping an earlier attempt to augment the features using the provided labels) and combined all edge features incident to each node using simple operations such as averaging or summing. In our tests, summing led to better F1 scores. We also rescaled the resulting node features; standard scaling produced the best results.
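The aggregation described above can be sketched roughly as follows, assuming the edges are given as an array of (source, target) node indices with one feature vector per edge. The function names and the toy arrays are hypothetical; the actual preprocessing lives in pre_embs.ipynb.

```python
import numpy as np


def sum_edge_features_per_node(edges, edge_feats, num_nodes):
    """Sum the features of every edge touching a node, ignoring direction."""
    node_feats = np.zeros((num_nodes, edge_feats.shape[1]), dtype=float)
    # np.add.at accumulates correctly even when a node index repeats.
    np.add.at(node_feats, edges[:, 0], edge_feats)  # contribution as source
    np.add.at(node_feats, edges[:, 1], edge_feats)  # contribution as target
    return node_feats


def standard_scale(x, eps=1e-12):
    """Rescale each column to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / np.where(std < eps, 1.0, std)  # leave constant columns at 0


# Toy example: 3 nodes, 3 edges (1 -> 2 and 2 -> 1 form a bidirectional pair).
edges = np.array([[0, 1], [1, 2], [2, 1]])
edge_feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

node_feats = sum_edge_features_per_node(edges, edge_feats, num_nodes=3)
scaled_node_feats = standard_scale(node_feats)
```

Replacing the sum with a mean would only require dividing each row by the node's edge count; per the results above, summing worked better for us.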

After producing these features, we turned our attention to generating new features from the transcripts provided in metadata.csv. We first preprocessed the transcripts with the usual NLP techniques (lemmatization, stopword removal, etc.) and then used K-12BERT, a language model (LM) pre-trained on an Indian K-12 corpus, to generate the embeddings (see lm_embs.ipynb for details). These per-node embeddings were then rescaled (standard scaling) and combined with the ones generated previously ("pre_embeddings"). The combined embeddings were used to train the GNN-based classifier.
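As a rough illustration of this combination step, the sketch below standard-scales a matrix of LM embeddings and joins it with the "pre_embeddings". It assumes that "combined" means column-wise concatenation, which is our reading for this example rather than something the notebooks are guaranteed to do. The random arrays stand in for real embeddings, and the small dimensions and function names are hypothetical.

```python
import numpy as np


def standard_scale(x, eps=1e-12):
    """Rescale each column to zero mean and unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / np.where(std < eps, 1.0, std)


def combine_embeddings(pre_embs, lm_embs):
    """Concatenate already-scaled pre_embeddings with freshly scaled LM embeddings."""
    if pre_embs.shape[0] != lm_embs.shape[0]:
        raise ValueError("both matrices need one row per node")
    return np.concatenate([pre_embs, standard_scale(lm_embs)], axis=1)


# Stand-in data: 5 nodes, 3 pre_embedding dims, 4 LM dims (real LM vectors are larger).
rng = np.random.default_rng(0)
pre_embs = standard_scale(rng.normal(size=(5, 3)))
lm_embs = rng.normal(loc=2.0, scale=10.0, size=(5, 4))

node_features = combine_embeddings(pre_embs, lm_embs)
```

Scaling both blocks before joining them keeps either source from dominating purely because of its numeric range.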

A natural next step would be to try a standard GCN classifier, treating the LM embeddings as node features and the given features as edge features. As a side note, we also tried fine-tuning K-12BERT and related LMs on the preprocessed transcripts to train a classifier, but the results were not satisfactory. We think they could be improved by mixing these embeddings with the "pre_embeddings" or by using the "pre_embeddings" as starting features. We also tried fine-tuning the LM as part of a Siamese architecture, but due to practical limitations we could not get good results. The code for the Siamese model is available in ./src/siam_models.py.