tabGPT

An adaptation of Andrej Karpathy's minGPT for tabular data.

Procedure:

  • create one embedding per column:
    • sentence embedding of the column name/description (forward pass through GPT2 with its language-pre-trained weights, then take the mean of the last hidden states over all tokens in the sequence)
    • for categorical values: for each row, embed the value with the same procedure as the column name, then add the two embeddings
    • for numerical values: multiply each element of the column name embedding by the numerical value
  • concatenate the embeddings of the different columns along the sequence dimension
  • run adjusted minGPT
    • instead of tokenization and learned embeddings, take the column embeddings as input
    • without positional encoding
    • instead of next-token prediction, use a classification or regression head (same as GPT2ForSequenceClassification in Hugging Face's transformers)
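
The embedding steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `embed_text` is a hypothetical stand-in that replaces the GPT2 forward pass with deterministic pseudo-random token vectors so the example runs without downloading weights, and `embed_row`, `EMB_DIM`, and the example column names are likewise assumptions made for illustration.

```python
import hashlib

import numpy as np

EMB_DIM = 8  # assumption: tiny size for readability (GPT2 small uses 768)


def embed_text(text: str) -> np.ndarray:
    """Hypothetical stand-in for the GPT2 sentence embedding.

    Builds one deterministic pseudo-random vector per whitespace-separated
    token and mean-pools them, mimicking "mean of the last hidden states".
    """
    token_vectors = []
    for token in text.split():
        # Seed a generator from a hash of the token so results are repeatable.
        seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:4], "little")
        token_vectors.append(np.random.default_rng(seed).standard_normal(EMB_DIM))
    return np.mean(token_vectors, axis=0)


def embed_row(row: dict, categorical: list, numerical: list) -> np.ndarray:
    """Return an array of shape (n_columns, EMB_DIM), one embedding per column."""
    embeddings = []
    for col in categorical:
        # Categorical: column-name embedding plus embedding of the value.
        embeddings.append(embed_text(col) + embed_text(str(row[col])))
    for col in numerical:
        # Numerical: column-name embedding scaled by the value.
        embeddings.append(embed_text(col) * float(row[col]))
    # Stack along the sequence dimension; no positional encoding is added.
    return np.stack(embeddings, axis=0)


row = {"occupation": "software engineer", "age": 0.5}
x = embed_row(row, categorical=["occupation"], numerical=["age"])
print(x.shape)
```

The resulting array plays the role that token embeddings play in a standard GPT: each row of the table becomes a short sequence with one position per column, which is then passed to the transformer blocks and the classification or regression head.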

License

MIT