Preprocessing Text: Normalization -> Tokenization [Pre-Tokenization -> Tokenizer Model -> Post-processing] -> Tokens to IDs (lookup table, hashing)

Preprocessing Images:

Preprocessing Videos: Decode frames -> Sample frames -> Resize -> Scale, normalize
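The text pipeline above can be sketched end to end. This is a minimal illustration with hypothetical function names, not a real tokenizer library: the "tokenizer model" stage is a trivial identity (a real one would be BPE, WordPiece, or Unigram), and unknown tokens fall back to hashing into a fixed bucket range (the "hashing trick").

```python
import re
import zlib

def normalize(text: str) -> str:
    # Normalization: lowercase and collapse whitespace.
    return re.sub(r"\s+", " ", text.lower()).strip()

def pre_tokenize(text: str) -> list[str]:
    # Pre-tokenization: split into word and punctuation pieces.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenizer_model(pieces: list[str]) -> list[str]:
    # Tokenizer model: identity here; real models (BPE, WordPiece,
    # Unigram) would further split pieces into subword units.
    return pieces

def post_process(tokens: list[str]) -> list[str]:
    # Post-processing: add special boundary tokens.
    return ["<s>"] + tokens + ["</s>"]

# Toy lookup table (assumed vocabulary, for illustration only).
VOCAB = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3, "!": 4}
HASH_BUCKETS = 100

def tokens_to_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Tokens to IDs: lookup table first; out-of-vocabulary tokens are
    # hashed (stable CRC32) into buckets past the vocabulary.
    return [
        vocab.get(t, len(vocab) + zlib.crc32(t.encode()) % HASH_BUCKETS)
        for t in tokens
    ]

ids = tokens_to_ids(
    post_process(tokenizer_model(pre_tokenize(normalize("Hello, world!")))),
    VOCAB,
)
```

Here "Hello, world!" becomes `["<s>", "hello", ",", "world", "!", "</s>"]`; every token maps through the table except the comma, which takes the hashing fallback.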
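The video pipeline can be sketched the same way. Frame decoding is stubbed out with random frames (a real pipeline would use a decoder such as PyAV or ffmpeg), resizing uses nearest-neighbor index mapping rather than a library's bilinear interpolation, and the per-channel mean/std are the commonly used ImageNet statistics, assumed here as a default:

```python
import numpy as np

def decode_frames(num_frames=300, h=360, w=640):
    # Decode frames: stand-in producing uint8 RGB frames (T, H, W, C).
    rng = np.random.default_rng(0)
    return rng.integers(0, 256, size=(num_frames, h, w, 3), dtype=np.uint8)

def sample_frames(frames, num_samples=8):
    # Sample frames: take num_samples uniformly spaced frames.
    idx = np.linspace(0, len(frames) - 1, num_samples).astype(int)
    return frames[idx]

def resize(frames, size=224):
    # Resize: nearest-neighbor via index mapping (libraries typically
    # use bilinear interpolation here).
    _, h, w, _ = frames.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return frames[:, ys][:, :, xs]

def scale_normalize(frames,
                    mean=(0.485, 0.456, 0.406),
                    std=(0.229, 0.224, 0.225)):
    # Scale to [0, 1], then normalize each channel (ImageNet stats
    # shown as an assumed default).
    x = frames.astype(np.float32) / 255.0
    return (x - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)

clip = scale_normalize(resize(sample_frames(decode_frames())))
# clip.shape == (8, 224, 224, 3)
```

Uniform temporal sampling is one common choice; alternatives include fixed-stride or random sampling per training clip.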