This repo intends to be an open discussion & implementation platform for the technical reproduction of video generative models with quality on par with sora.
OpenAI sora technical report summary:
- The overall framework is similar to WALT but with many missing details.
- The main improvements seem to come from model scaling. 16x compute model shows significantly better spatial & temporal quality
- The video compression network could be similar to MAGVIT2 but with a higher temporal compression ratio. The number of frames could be very high, as suggested by a previous Google paper.
- It is probable that casual attention is used for temporal modeling since it supports both image and video data.
- learning at high resolution could benefit the model performance also.
- re-captioning is important for text understanding.