NLP Results and CCT size #69
There are two things we should note from our experiments here that I think are important.

LLMs are a vastly different domain and are accomplishing different goals. While there may be some shared tasks, the purpose of these works is quite different. Here we are trying to bring the utility of transformers to small datasets, which enables an average user or scientist (who doesn't have a large compute budget) to train their networks from scratch. LLMs, and other large networks (including ViTs), attempt to achieve the highest performance on tasks without regard for compute budgets. Both of these goals are important, but they are different.

As for the decoder structure: we were focused on classification, so it only made sense to use an encoder-style transformer; we didn't need cross-attention. You're welcome to incorporate that if you wish to extend this to other types of problems, and we'd love to see the results. I hope this helps.
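To make the encoder/decoder distinction above concrete, here is a minimal numpy sketch (my own illustration, not code from this repository): an encoder-only model like CCT uses self-attention, where queries, keys, and values all come from the same token sequence, while a decoder would add cross-attention, where queries come from the decoder sequence but keys and values come from the encoder output.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
enc = rng.normal(size=(8, 16))   # 8 encoder tokens, embedding dim 16
dec = rng.normal(size=(5, 16))   # 5 decoder tokens, embedding dim 16

# Encoder self-attention (what CCT uses): q, k, v from the same sequence.
self_out = attention(enc, enc, enc)    # shape (8, 16)

# Decoder cross-attention (what an encoder/decoder variant would add):
# queries from the decoder, keys/values from the encoder output.
cross_out = attention(dec, enc, enc)   # shape (5, 16)
```

The only structural difference is where the queries come from; in a real model, q, k, and v would each first pass through learned linear projections, which are omitted here for brevity.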
Thank you very much for your generous reply. I found CCT while looking for a transformer that allows experimentation on a consumer GPU. It is great to be able to explore transformers with limited resources! My concern with the NLP performance not improving with larger CCT models was mainly whether insights gained while working with CCT will scale to much larger transformers. My assumption is that many insights will, but I wonder whether the NLP tasks were indicating some limitation. For experimenting with transformer architecture, there could be advantages to having both the encoder and the decoder. I'm unsure what a realistic benchmark would be for a transformer the size of CCT with a decoder. I assume image generation would be the right task to focus on, perhaps a GAN building on CCT in a benchmark like https://paperswithcode.com/sota/image-generation-on-cifar-10. Thanks again for your work.
Thanks for making transformers much more approachable! The downside of this may be stupid questions from beginners like me (still, I hope this is not one). In the NLP results, the five different datasets reached their best accuracy with five different CCT models. The Transformer, ViT-Lite, and CVT models have accuracy almost inversely correlated with size. My "intuition" is that bigger models would be better (for example, LLMs often give the best results). Maybe the small size of the datasets means larger models can't be trained as well, or maybe the embedding is not optimized for transformers. Could you please offer some insight into this?
The CCT is an encoder architecture. Are there small transformers that demonstrate an encoder/decoder or decoder-only architecture? How would you expect a decoder implementation of CCT to perform on generative tasks?