Mesh completion task #71
Comments
The autoencoder's task is to translate a page of "words" into a mesh, where a single triangle is 6 words/tokens.
The language between the encoder and the decoder has 16k different words; however, much as in an English sentence, the order of the words matters. This raises the complexity of the language but allows a huge number of different meanings for each word. So the task of the autoencoder is to create a language (i.e. a codebook) that both sides can understand: the encoder makes up the language and the decoder tries to understand it. The model learns by being penalized when the decoder understands the codes badly and cannot output a correctly reconstructed mesh.
In short: No :) In inference mode, it can only decode the codes that are provided.
The transformer's task is to take text or tokens as input, e.g. 1/10 of a page, and predict what words the rest of the page needs to have.
As in the previous analogy, the transformer predicts what a page (i.e. the full mesh) should look like. There is some merit in using 'partials', e.g. a sliding window, but I believe this would require a lot more training data since the model would need to generalize even better.
Since the meshes it trains on should be ordered from bottom to top, the first chunk of codes from the autoencoder should already be the bottom of the mesh. The idea with the prompt is to provide the lower portions of the mesh, since this is how the page of words/tokens is constructed: the bottom part of the mesh is always at the start of the page. Depending on how good your model is, it can be enough to provide just 2-10 tokens and it will generate the mesh by itself. The transformer is very good at being guided by the text to generate the mesh, but not at the start of the sequence. Thinking about it, I might insert the same meaningless token at the start of each sequence to get stronger text guidance from the start of the sequence. Sorry about the wall of text; sometimes I write too much since I also don't fully understand it, and writing helps me learn and think my thoughts through.
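To make the codebook analogy a bit more concrete, here is a minimal sketch of a nearest-neighbour codebook lookup. This is only an illustration of the general idea and not the code used in this repository: the tiny 2-D `codebook`, and the helper names `quantize` and `dequantize`, are made up for the example (the real codebook has 16k entries and is learned during training).

```python
import math

def quantize(vector, codebook):
    """Encoder side: return the index ("word") of the closest codebook entry."""
    distances = [math.dist(vector, entry) for entry in codebook]
    return distances.index(min(distances))

def dequantize(index, codebook):
    """Decoder side: look the "word" up in the shared codebook."""
    return codebook[index]

# Hypothetical toy codebook with 4 entries (assumption for illustration only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

code = quantize((0.9, 0.2), codebook)       # index of the nearest entry
reconstructed = dequantize(code, codebook)  # what the decoder gets back
print(code, reconstructed)
```

The decoder never sees the original vector, only the index, which is why it can only decode the codes it is given.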
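The ordering and prompting idea above can be sketched roughly as follows. This is a hedged illustration in plain Python, not the library's own implementation: `order_faces_bottom_to_top` and `make_prompt` are hypothetical helpers, the choice of `up_axis=1` (y as the height axis) is an assumption, and `fraction` plays the role of the prompt percentage.

```python
def order_faces_bottom_to_top(vertices, faces, up_axis=1):
    """Sort faces by the height of their lowest vertex (assumed ordering rule)."""
    def lowest_height(face):
        return min(vertices[i][up_axis] for i in face)
    return sorted(faces, key=lowest_height)

def make_prompt(codes, fraction, tokens_per_face=6):
    """Keep the first `fraction` of the token sequence, cut at a whole triangle."""
    n_faces = len(codes) // tokens_per_face
    n_prompt_faces = int(n_faces * fraction)
    return codes[: n_prompt_faces * tokens_per_face]

# Toy mesh: two triangles stacked along the y axis.
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 2, 0)]
faces = [(2, 3, 4), (0, 1, 2)]
ordered = order_faces_bottom_to_top(vertices, faces)

# Stand-in for the autoencoder codes of a 10-triangle mesh (6 tokens each).
codes = list(range(60))
prompt = make_prompt(codes, fraction=0.5)
print(ordered, len(prompt))
```

Cutting the prompt at a multiple of 6 tokens keeps it aligned with whole triangles, so the model continues from a complete face rather than from the middle of one.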
Thanks a lot for your insights. 6 tokens per triangle stands for 3 vertices per triangle and 2 quantizers per vertex, is that correct?
I first trained the autoencoder to learn the codebook on my own dataset (human anatomy 3D meshes) with very decent results both qualitatively and quantitatively (<0.3 loss for ~60K meshes after augmentations, with a batch size of 32). This is very encouraging! The main issue I am facing is the hardware limitation, since I am using 2500 triangles per mesh to get a decent representation of the dense anatomic mesh. This is very different from the "artist created" meshes, as the paper calls them, which have few triangles and sharp edges. Nevertheless, I managed to train and freeze the autoencoder on dense meshes with satisfying results.
Now I want to train the transformer to perform mesh completion. The autoencoder was trained to overfit a single human anatomy model taken from different individuals, so I sadly do not really have any text to guide the transformer aside from the gender associated with each mesh. What I mean is that I cannot say "table", "chair" or "lamp", so I am using condition_on_text=False for the transformer. I am using your code snippet to perform mesh completion by prompting the model with the first tokens. In my use case, I am using token_length_procent = 0.9 to provide almost all the tokens (the first, large part of my ordered mesh) and complete a small missing part of the anatomy. My hope is that the huge number of provided tokens will be able to compensate for the fact that I am not providing any text to guide the completion.
However, I needed to reduce the batch size to 1 for the transformer because of the 2500 triangles per mesh (max_seq_length=15000). I am also thinking of reducing the number of triangles per mesh to be able to increase the batch size and the training performance of the transformer. I could then post-process the mesh by subdividing the triangles to get a good representation of the anatomic model. Another option would be to reduce the size of some layers, and thus the model size, since I am overfitting just one type of model and not different kinds of objects.
Do you think the MeshTransformer can perform well on mesh completion with a batch size of 1 and 90% of the first tokens provided during inference? Also, based on your experiments with the transformer, which loss value would you recommend reaching to get good results?
Sorry for my long text as well, it's just that I think this MeshGPT model is worth a try. Thanks for the discussion!
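For reference, the sequence-length arithmetic mentioned above can be written out as a small sketch. It only relies on the 6 tokens per triangle stated earlier in the thread; the helper `sequence_length` and the candidate triangle counts other than 2500 are illustrative assumptions, not settings anyone has tested here.

```python
TOKENS_PER_TRIANGLE = 6  # from the thread: a single triangle is 6 tokens

def sequence_length(num_triangles, tokens_per_triangle=TOKENS_PER_TRIANGLE):
    """Number of tokens the transformer has to handle for one mesh."""
    return num_triangles * tokens_per_triangle

# 2500 is the count from the thread; the smaller counts are hypothetical.
for num_triangles in (2500, 1250, 625):
    print(num_triangles, "triangles ->", sequence_length(num_triangles), "tokens")
```

Halving the triangle count halves the sequence length, which is the lever being considered here for fitting a larger batch size in memory.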
Hello,
Mesh/shape completion is mentioned as an application in the paper (see Figure 9). I was wondering how to use this kind of feature, and I have three questions:
Thanks in advance for your clarifications!