
Mesh completion task #71

Open
jrmgr opened this issue Mar 15, 2024 · 2 comments

@jrmgr commented Mar 15, 2024

Hello,

Mesh/shape completion is mentioned as an application in the paper (see Figure 9). I was wondering how to apply this kind of feature, and I have three questions:

  • Once the autoencoder has been trained, would it be possible to only use it for mesh completion? In other words, what are the advantages/strengths of the transformer compared to the autoencoder for this task, since the output of both the autoencoder and the transformer is the same, i.e. a mesh? Can the autoencoder perform this task on its own?
  • I have trained the autoencoder on my own mesh dataset. Now, if I want to use the transformer to perform mesh completion, do I need to train it for this specific task, i.e. should I pass a partial mesh with its associated complete mesh as ground truth during training, or should I just train the transformer "normally", i.e. passing the full meshes in all batches so that the model learns by itself and can perform mesh completion at inference time only?
  • I am not very familiar with transformers and tokens. My understanding is that the triangles (words) of the mesh (sentence) are first converted to tokens by the autoencoder (is that correct?). During inference, let's imagine we want to complete a table top given only its legs: should I feed the transformer a partial mesh (the table legs) and call the transformer's generate function with all the tokens of that partial mesh, or should I feed the complete mesh (legs and top) with only the first tokens corresponding to the table legs? In a nutshell, how should I construct the prompt for the shape completion task?
    [attached image: meshcompletion]

Thanks in advance for your clarifications!

@MarcusLoppe (Contributor) commented:

> Once the autoencoder has been trained, would it be possible to only use it for mesh completion? In other words, what are the advantages/strengths of the transformer compared to the autoencoder for this task, since the output of both the autoencoder and the transformer is the same, i.e. a mesh? Can the autoencoder perform this task on its own?

The autoencoder's task is to translate a page of "words" into a mesh, where a single triangle is 6 words/tokens.
So the autoencoder consists of two parts:

  • Encoder: Creates the language for describing a mesh; it takes the mesh and encodes each triangle into a code of 6 words.
  • Decoder: Takes the codes (the language) and reconstructs all the triangles back to their original coordinates.

The language shared between the encoder and decoder has 16k different words, but, as with a sentence in English, the order of the words matters. This raises the complexity of the language, but it allows a huge number of different meanings for each word.

So the task of the autoencoder is to create a language (i.e. a codebook) which both parts can understand: the encoder makes up the language, and the decoder tries to understand it. The model learns by being penalized whenever the decoder understands the codes badly and cannot output a correctly reconstructed mesh.

In short: no :) In inference mode the autoencoder can only decode the codes it is provided. The transformer's task is to take text or tokens, e.g. 1/10 of a page, and then predict what words the rest of the page needs to have.
Using text as a guide helps this process; I've trained both with and without text, and the text-conditioned version outperforms the non-text one.
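
To make that concrete, here is a rough sketch of what the autoencoder alone gives you. It assumes a trained `autoencoder` and one `item` from a dataset like in my snippet further down; the decode method name is from memory and may differ in your version:

```python
# Rough sketch: the autoencoder can only round-trip a mesh it is given.
# It encodes triangles into codebook tokens and decodes tokens back into
# triangles; it never predicts tokens that were not provided.

codes = autoencoder.tokenize(
    vertices = item['vertices'],
    faces = item['faces'],
    face_edges = item['face_edges']
)  # each triangle becomes 6 "words" from the learned codebook

# Decoding just reconstructs the same triangles from those codes
# (method name from memory; it may also expect a batch dimension):
face_coords, face_mask = autoencoder.decode_from_codes_to_faces(codes)

# Nothing here can invent the missing part of a mesh - that is what the
# transformer's next-token prediction is for.
```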

> I have trained the autoencoder on my own mesh dataset. Now, if I want to use the transformer to perform mesh completion, do I need to train it for this specific task, i.e. should I pass a partial mesh with its associated complete mesh as ground truth during training, or should I just train the transformer "normally", i.e. passing the full meshes in all batches so that the model learns by itself and can perform mesh completion at inference time only?

As in the previous analogy, the transformer predicts how a page (i.e. a full mesh) should look.
During training, the transformer starts at code/token 1 and guesses token number 2, then uses tokens 1 & 2 to guess the 3rd token, and so on.

There is some merit to using "partials", e.g. a sliding window, but I believe this would require a lot more training data since the model needs to generalize even better.
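
So in short: train it "normally" on full meshes, and you get completion for free at inference. A minimal training sketch along the lines of the repo README (the dataloader, optimizer and hyperparameters here are just placeholders):

```python
from meshgpt_pytorch import MeshTransformer

# Wrap the already-trained (frozen) autoencoder; hyperparameters are placeholders.
transformer = MeshTransformer(
    autoencoder,
    dim = 512,
    max_seq_len = 768
)

# Train on full meshes every batch - plain next-token prediction.
for vertices, faces in dataloader:          # placeholder dataloader
    loss = transformer(
        vertices = vertices,
        faces = faces
        # texts = texts,                    # only if condition_on_text = True
    )
    loss.backward()
    optimizer.step()                        # placeholder optimizer
    optimizer.zero_grad()
```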

> I am not very familiar with transformers and tokens. My understanding is that the triangles (words) of the mesh (sentence) are first converted to tokens by the autoencoder (is that correct?). During inference, let's imagine we want to complete a table top given only its legs: should I feed the transformer a partial mesh (the table legs) and call the transformer's generate function with all the tokens of that partial mesh, or should I feed the complete mesh (legs and top) with only the first tokens corresponding to the table legs? In a nutshell, **how should I construct the prompt for the shape completion task**?

Since the meshes it trains on should be ordered from bottom to top, the first chunk of codes from the autoencoder is already the bottom of the mesh.
You can see how I generate with prompt tokens in the code below, or check out the demo notebook in my fork to see the entire code.

The idea with the prompt is to provide it with the lower portion of the mesh, since this is how the page of words/tokens is constructed: the bottom part of the mesh is always at the start of the page.
One small detail: the mesh should be ordered in YXZ, so it's more like the page starts with the lowest, furthest left and most forward triangles.
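
Roughly what I mean by that ordering, as an illustrative sketch only (not the exact preprocessing code from my fork):

```python
import numpy as np

def order_faces_yxz(vertices, faces):
    # Sort faces by their centroid: lowest y first, then x, then z, so the
    # "page" of tokens starts at the bottom of the mesh.
    centroids = vertices[faces].mean(axis = 1)   # (num_faces, 3) xyz centroids
    order = np.lexsort((centroids[:, 2], centroids[:, 0], centroids[:, 1]))
    return faces[order]
```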

Depending on how good your model is, it can be enough to provide it just 2-10 tokens and it can generate the rest of the mesh by itself.

The transformer is very good at being guided by the text to generate the mesh, but not at the start of the sequence.
I've talked to lucidrains about this before; maybe it's because the transformer never trains with 0 tokens, i.e. being provided just the text and nothing else.
During training it is always provided with at least one token of the sequence, since that is how transformers are trained.

But thinking about it, I might insert the same meaningless token at the start of each sequence to get stronger text guidance from the very start of the sequence.

Sorry about the wall of text; sometimes I write too much because I also don't fully understand it, and writing helps me learn and fully think through my thoughts.

```python
texts, codes = [], []          # one text label and one token prefix per mesh
token_length_procent = 0.20    # fraction of each mesh's tokens used as the prompt

for label in target_labels:
    for item in dataset.data:
        if item['texts'] == label:
            # Encode the full mesh into codebook tokens (6 per triangle)
            code = autoencoder.tokenize(
                vertices = item['vertices'],
                faces = item['faces'],
                face_edges = item['face_edges']
            )
            # Keep only the first 20% of the token sequence as the prompt
            num_tokens = int(code.shape[0] * token_length_procent)
            texts.append(item['texts'])
            codes.append(code.flatten()[:num_tokens].unsqueeze(0))
            break

coords = []
for text, prompt in zip(texts, codes):
    print(f"Generating {text} with {prompt.shape[1]} tokens")
    # Text + token prefix: the transformer completes the rest of the mesh
    faces_coordinates = transformer.generate(texts = [text], prompt = prompt, temperature = 0)
    coords.append(faces_coordinates)

combind_mesh(f'{folder}/text+prompt_all.obj', coords)
```

@jrmgr (Author) commented Mar 18, 2024

Thanks a lot for your insights.

6 tokens per triangle stands for 3 vertices per triangle and 2 quantizers per vertex. Is that correct?

Actually, I first trained the autoencoder to learn the codebook on my own dataset (human anatomy 3D meshes) with very decent results, both qualitatively and quantitatively (<0.3 loss for ~60K meshes after augmentations and a batch size of 32). This is very encouraging! The main issue I am facing is the hardware limitation, since I am using 2500 triangles per mesh to have a decent representation of the dense anatomic mesh. This is very different from the "artist-created" meshes, as they are called in the paper, which have few triangles and sharp edges. Nevertheless, I managed to freeze the autoencoder on dense meshes with satisfying results.

Now I want to train the transformer to perform mesh completion. The autoencoder was trained to overfit one single human anatomy model from different individuals, so I sadly do not really have text to guide the transformer aside from the gender of each mesh. Since I cannot say "table", "chair" or "lamp", I am using condition_on_text=False for the transformer.

I am using your code snippet to perform mesh completion by prompting the model with the first tokens. For my use case I am rather using token_length_procent = 0.9 to provide almost all the tokens (the first big part of my ordered mesh) and complete a small missing part of the anatomy. By doing that, I am providing a big first part of the mesh and trying to reconstruct the small missing part. My hope is that the huge number of provided tokens will compensate for the fact that I am not providing any text to guide the completion.

However, I needed to reduce the batch size to 1 for the transformer because of the 2500 triangles per mesh (max_seq_length=15000). I am also thinking of reducing the number of triangles per mesh to be able to increase the batch size and the training performance of the transformer; I could then post-process the mesh by subdividing the triangles to get a good representation of the anatomic model. Another option would be to reduce some layers' sizes, and thus the model size, since I am overfitting just one type of model and not different kinds of objects.

Do you think the MeshTransformer can perform well on mesh completion with a batch size of 1 and 90% of the first tokens provided during inference? Also, from your experiments with the transformer, which loss value would you recommend reaching to get good results?
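
Concretely, this is roughly the setup I have in mind (a sketch only; the parameter names reflect my understanding of the library, and the sizes are my own numbers):

```python
from meshgpt_pytorch import MeshTransformer

# 2500 triangles per mesh, 6 tokens per triangle -> sequences of 15000 tokens
max_seq_len = 2500 * 6

transformer = MeshTransformer(
    autoencoder,                  # my autoencoder trained/frozen on anatomy meshes
    dim = 512,                    # could be reduced since I overfit one object type
    max_seq_len = max_seq_len,
    condition_on_text = False     # no meaningful text labels besides gender
)

# At inference: prompt with ~90% of the ordered tokens and let the
# transformer complete the small missing part.
token_length_procent = 0.9
```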

Sorry for my long text as well; it's just that I think this MeshGPT model is worth giving a try. Thanks for the discussion!
