Persistent homology of context vectors computed by individual attention heads in Hebrew language models
In the notebooks we compute the persistent homology of the context vectors of individual attention heads in order to analyze each head's linguistic properties. We find that some attention heads represent certain collocations and multiword expressions in topologically interesting ways. We also find that, when we restrict the persistent homology to the subset of tokens corresponding to a collocation or multiword expression, some models preserve the topology from one context to another better than others. Clear trends in this behavior emerge, especially when using the Wasserstein (Kantorovich) distance between persistence diagrams. We hypothesize that maintaining the topological structure of a collocation or multiword expression may be important to model performance on tasks like collocation and multiword expression extraction. If the topological structure is in fact important, including it in the training objective of the model could improve performance on tasks related to collocations and multiword expressions, such as machine translation. We note that the one multilingual model that we study, bert-base-multilingual-cased, is the worst at this compared to the monolingual models. We also note that onlplab/alephbert-base, the newest model of the three with the largest vocabulary, performs the best of the three, suggesting that vocabulary size is a factor in how well a model preserves the topology of the context vectors.
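As a rough illustration of what is being measured, the sketch below computes context vectors for a single head. It assumes that a context vector is the attention-weighted sum of value vectors, softmax(QK^T / sqrt(d)) V, as in standard scaled dot-product attention. The matrices Q, K, and V are toy values rather than activations from any of the models, and the function names are ours for illustration only. The resulting pairwise distance matrix is the kind of input that a persistent homology computation (for example, over a Vietoris-Rips filtration) would take.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_context_vectors(Q, K, V):
    """Return one context vector per token for a single attention head.

    Q, K, V are lists of per-token query, key, and value vectors.
    """
    d = len(Q[0])
    dim_v = len(V[0])
    contexts = []
    for q in Q:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Context vector: attention-weighted sum of the value vectors.
        contexts.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(dim_v)]
        )
    return contexts


def pairwise_distances(points):
    """Euclidean distance matrix of a point cloud."""
    return [[math.dist(p, q) for q in points] for p in points]


# Toy head with three tokens and two-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

contexts = head_context_vectors(Q, K, V)
distances = pairwise_distances(contexts)
```

Restricting the rows of `contexts` to the tokens of one collocation gives the smaller point cloud whose topology is compared across contexts.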
It is important to note that both the tokenizer and the model matter in this analysis. We use HeBERT (avichr/heBERT_sentiment_analysis), AlephBERT (onlplab/alephbert-base), and bert-base-multilingual-cased in our experiments with Hebrew. We also include alephbertgimmel,
which is a more recent Hebrew model with a larger vocabulary than the other three, suggesting that it will be better at preserving the topology of the context vectors and at capturing collocations and multiword expressions. Our preliminary findings seem to confirm this. Including persistent homology in the objective of the model, that is, training the model to preserve the topology of the context vectors, may be a good way to improve models for languages considered "morphologically rich" and "low-resource". In our analysis, tokenization into fewer subwords seems to make it easier for the model to preserve topological structure. This makes sense: the topology becomes more complicated as we add context vectors, and is therefore more difficult to preserve. At the same time, smaller collections of words sometimes do not seem to present a clear trend, whereas larger collections do. We also expect that, when clustering the tokens using DBSCAN, n-token collocations and multiword expressions will correspond to n-simplices that have relatively stable persistent homology across different contexts, and that larger values of n will provide a fairer comparison. Preliminary experiments seem to confirm this.
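The DBSCAN step can be pictured with the following minimal sketch. It assumes that the goal is to check whether the context vectors of the tokens in one expression fall into the same density-based cluster; that reading, the toy points, the parameter values, and the helper `same_cluster` are all our assumptions for illustration. In practice a library implementation of DBSCAN would normally be used instead of this hand-written one.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(points)
    # Each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def same_cluster(labels, token_indices):
    """True if all given tokens share one cluster and none is noise."""
    ids = {labels[i] for i in token_indices}
    return len(ids) == 1 and -1 not in ids


# Hypothetical 2-D "context vectors": two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_samples=2)
```

With these toy values, `same_cluster(labels, [0, 1, 2])` holds, while a set of tokens that straddles the two groups does not share a cluster.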
Note also that, since the morphemes of a language can have a hierarchical structure, and persistent homology is a generalized form of hierarchical clustering (see simplex trees), this may provide an alternative perspective on morphological segmentation, and a self-supervised one at that. See Morphological Segmentation Inside-Out for an explanation of the hierarchical structure of morphological segmentation. See also Word-level Morpheme segmentation using Transformer neural network for an application of a character-level transformer to this task. Given the information contained in Persistent Topology of Syntax, it would also be interesting to have a model that understands the hierarchical structure from the character level to the word level and up to the sentence level, that is, a model with an understanding of morpheme trees as well as sentence-level parse trees.
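To make the link with hierarchical clustering concrete: in dimension 0, the persistence of a Vietoris-Rips filtration records the scales at which connected components merge, and these are exactly the merge heights of single-linkage clustering. The minimal sketch below computes those bars with a union-find structure on toy points. The function name `h0_persistence` is ours, and this is an illustration rather than the implementation used in the notebooks.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """0-dimensional Vietoris-Rips persistence bars as (birth, death) pairs."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges from shortest to longest, as in single-linkage clustering.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            # Two components merge at this scale, so one of them dies here.
            bars.append((0.0, length))
    # The last surviving component never dies.
    bars.append((0.0, math.inf))
    return bars


# Two pairs of nearby points, with the pairs far apart from each other.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
bars = h0_persistence(points)
```

The two short bars correspond to the tight pairs merging at scale 1, and the longer bar to the two pairs merging at scale 4, which is the same hierarchy a single-linkage dendrogram would show.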
A Few Things to Note When Comparing Barcode Diagrams for Key Phrases, Collocations, and MWEs of Different Models
- The Wasserstein (a.k.a. Kantorovich) distance gives a more informative comparison of barcode diagrams, and clear trends can be found when using it, whereas the bottleneck distance at times does not show any clear trend.
- A more revealing and accurate comparison can be made when the collocations or multiword expressions are long and contain many words. Preserving the topology of a larger number of tokens is a more difficult task, and clear trends in the Wasserstein distance between barcodes can be seen for larger collections of tokens making up a key phrase, collocation, or multiword expression.
- Models with a larger vocabulary tend to preserve the topology better than models with smaller vocabularies. Again, this is because preserving the topology of more tokens is a more difficult task, so less "word splitting" (fewer subword tokens per expression) makes the task easier.
- It is unclear at present how context volume affects the persistent homology of collocations and multiword expressions, that is, what happens when the contexts have significantly different lengths.
- It may be important to compare different heads to one another as well, as can be done with. Is there a clear reason why some heads capture certain collocations, in terms of the DBSCAN clustering or persistent homology of their context vectors, while others do not?
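One possible reason the Wasserstein distance shows trends that the bottleneck distance hides is that the bottleneck distance only records the single largest mismatch between two diagrams, while the Wasserstein distance accumulates all of them. The brute-force sketch below illustrates this on hypothetical toy diagrams. It assumes the L-infinity ground metric and finite bars only, and it tries every matching, so it is only usable for a handful of points; a dedicated persistent homology library would be needed for real diagrams.

```python
from itertools import permutations


def _cost_matrix(dgm_a, dgm_b):
    """Matching costs between two diagrams, each padded with diagonal slots."""
    n, m = len(dgm_a), len(dgm_b)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                # Point matched to point: L-infinity distance.
                (b1, d1), (b2, d2) = dgm_a[i], dgm_b[j]
                cost[i][j] = max(abs(b1 - b2), abs(d1 - d2))
            elif i < n:
                # Point of dgm_a matched to the diagonal.
                b, d = dgm_a[i]
                cost[i][j] = (d - b) / 2
            elif j < m:
                # Point of dgm_b matched to the diagonal.
                b, d = dgm_b[j]
                cost[i][j] = (d - b) / 2
            # Diagonal matched to diagonal costs nothing.
    return cost


def diagram_distances(dgm_a, dgm_b, q=1):
    """Return (bottleneck, q-Wasserstein) distances by exhaustive search."""
    cost = _cost_matrix(dgm_a, dgm_b)
    size = len(cost)
    best_bottleneck = float("inf")
    best_wasserstein = float("inf")
    for perm in permutations(range(size)):
        matched = [cost[i][perm[i]] for i in range(size)]
        best_bottleneck = min(best_bottleneck, max(matched, default=0.0))
        best_wasserstein = min(best_wasserstein, sum(c ** q for c in matched))
    return best_bottleneck, best_wasserstein ** (1 / q)


# Hypothetical barcodes of one expression in a reference context and two others.
reference = [(0.0, 4.0), (0.0, 1.0)]
context_a = [(0.0, 3.0), (0.0, 1.0)]  # only the long bar changes
context_b = [(0.0, 3.0), (0.0, 2.0)]  # both bars change

dist_a = diagram_distances(reference, context_a)
dist_b = diagram_distances(reference, context_b)
```

Here both contexts are at bottleneck distance 1.0 from the reference, but context B, which also distorts the shorter bar, is at Wasserstein distance 2.0 versus 1.0 for context A.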
It would be very interesting to train a Low-Rank Adaptation (LoRA) for each of the models with the objective of preserving the persistent homology of key phrases, collocations, and multiword expressions from some dictionary, and then to test whether the models perform better on information-extraction-like tasks involving key phrases, collocations, and multiword expressions not contained in the dictionary.