Integrate the audio modality in CoCa #94

Draft: wants to merge 129 commits into main

Conversation

manasMauryax (Collaborator):

These commits essentially bring in two things:

  • The Conformer audio encoder:
    The Conformer architecture is readily available via torchaudio, so only a few additional modules had to be written (a rough wrapper sketch follows below this list).

  • Changes to the CoCa code that allow the Conformer encoder and the audio modality to be used with the CoCa architecture:
    These changes include renaming and introducing a few variables and defining their usage, as well as slightly modifying the forward-pass logic.
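
For orientation, here is a minimal sketch of how torchaudio's Conformer could be wrapped as an audio encoder that outputs features at the CoCa embedding width. The class name, hyperparameters, and the output projection are illustrative assumptions and not taken from this PR.

import torch
from torch import nn
from torchaudio.models import Conformer

class AudioEncoderSketch(nn.Module):
    # Hypothetical wrapper: Conformer blocks over log-mel frames, followed by a
    # projection to the embedding width used by the rest of the model.
    def __init__(self, input_dim: int = 80, n_embd: int = 512,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.conformer = Conformer(
            input_dim=input_dim,
            num_heads=num_heads,
            ffn_dim=4 * input_dim,
            num_layers=num_layers,
            depthwise_conv_kernel_size=31,
        )
        self.proj = nn.Linear(input_dim, n_embd)

    def forward(self, feats: torch.Tensor, lengths: torch.Tensor):
        # feats: (batch, time, input_dim); lengths: (batch,) valid frame counts
        out, out_lengths = self.conformer(feats, lengths)
        return self.proj(out), out_lengths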

@manasMauryax manasMauryax marked this pull request as ready for review April 8, 2024 07:55
@manasMauryax manasMauryax self-assigned this Apr 16, 2024
dropout=pre_conformer_dropout,
)

self.conformer = Conformer(
Member:

Can we remove the dependency on Conformer and build it with components from the vision transformer? Maybe we want to change the Conformer architecture in the future.
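
For reference, a rough sketch of what assembling a Conformer block from generic PyTorch components (macaron feed-forwards, self-attention, a depthwise convolution module) could look like. All names, hyperparameters, and layer choices below are assumptions for illustration, not code from this PR or from the vision transformer.

import torch
from torch import nn

class ConformerBlockSketch(nn.Module):
    # Illustrative Conformer block assembled from standard PyTorch modules.
    def __init__(self, d_model: int = 256, num_heads: int = 4,
                 kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.ff1 = self._feed_forward(d_model, dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),       # pointwise expansion
            nn.GLU(dim=1),                                        # back to d_model channels
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise conv
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),           # pointwise projection
            nn.Dropout(dropout),
        )
        self.ff2 = self._feed_forward(d_model, dropout)
        self.out_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _feed_forward(d_model: int, dropout: float) -> nn.Sequential:
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)                           # first macaron half-step
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # self-attention
        c = self.conv_norm(x).transpose(1, 2)               # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)                # convolution module
        x = x + 0.5 * self.ff2(x)                           # second macaron half-step
        return self.out_norm(x)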

super().__init__()
self.sample_key = sample_key
self.prediction_key = prediction_key
self.pre_conformer = PreConformer(
Member:

Is this a tokenization of the input audio? Maybe choose a better name.

Collaborator Author:

This is not tokenization, just a reduction of the input's frame rate. I will come up with a better name.

self.post_conformer = nn.Sequential(
nn.Linear(
input_dims,
n_embd,
Member:

Why do we need to project from input_dims to n_embd? Is input_dims != n_embd?

Collaborator Author:

Yup, precisely: input_dims != n_embd.

Collaborator Author:

In the Conformer implementation I am working on now, this will not be needed. I will project the input at the very beginning (before any computation occurs in the Conformer blocks).
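
A tiny sketch of that plan, with hypothetical names: the features are projected to n_embd once at the input, so every Conformer block already runs at the target width and no post-Conformer linear layer is needed.

from torch import nn

class FrontProjectedEncoderSketch(nn.Module):
    # Hypothetical layout: project the raw features before any Conformer block runs.
    def __init__(self, input_dims: int, n_embd: int, blocks: list[nn.Module]):
        super().__init__()
        self.input_proj = nn.Linear(input_dims, n_embd)  # projection happens up front
        self.blocks = nn.ModuleList(blocks)              # each block keeps the width at n_embd

    def forward(self, x):
        # x: (batch, time, input_dims)
        x = self.input_proj(x)
        for block in self.blocks:
            x = block(x)
        return x  # (batch, time, n_embd); no output projection required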

nn.Conv1d(
in_channels=n_input_dims,
out_channels=n_input_dims,
kernel_size=2,
Member:

Two Conv1d layers? Is this common? I assumed we would apply ViT-style patching with a Conv2d over the spectrogram.

Collaborator Author:

Yup, sub-sampling like the one performed here is common in speech.
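
To illustrate the effect, here is a minimal sketch of the kind of convolutional sub-sampling that speech front-ends commonly use. The PR only shows kernel_size=2; the stride-2 setting and the resulting 4x frame-rate reduction below are assumptions for illustration.

import torch
from torch import nn

class ConvSubsamplingSketch(nn.Module):
    # Two strided Conv1d layers halve the frame rate twice (roughly 4x overall).
    def __init__(self, n_input_dims: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(n_input_dims, n_input_dims, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_input_dims, n_input_dims, kernel_size=2, stride=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)   # Conv1d expects (batch, channels, time)
        x = self.layers(x)      # the time dimension shrinks, channels stay the same
        return x.transpose(1, 2)

feats = torch.randn(2, 400, 80)                  # 400 spectrogram frames, 80 mel bins
print(ConvSubsamplingSketch(80)(feats).shape)    # torch.Size([2, 100, 80])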

text_cls_prediction_key: str
vision_encoder_config: VisionTransformerConfig
modality_encoder_config: AudioTransformerConfig | VisionTransformerConfig | AVConfig
Member:

Here we should have separate vision and audio configs, each defaulting to None. If a config is set, the corresponding encoder is created. With both set to None we should end up with a normal language model.
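
A sketch of the suggested config layout, assuming a pydantic-style config class; the placeholder inner configs below stand in for the real VisionTransformerConfig and AudioTransformerConfig.

from pydantic import BaseModel

class VisionTransformerConfig(BaseModel):   # placeholder for the real config class
    n_embd: int = 512

class AudioTransformerConfig(BaseModel):    # placeholder for the real config class
    n_embd: int = 512

class CoCaConfigSketch(BaseModel):
    text_cls_prediction_key: str
    vision_encoder_config: VisionTransformerConfig | None = None
    audio_encoder_config: AudioTransformerConfig | None = None

cfg = CoCaConfigSketch(text_cls_prediction_key="cls")
# Neither encoder config is set, so neither encoder would be built and the
# model would fall back to a plain language model:
assert cfg.vision_encoder_config is None and cfg.audio_encoder_config is None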

def _init_modality(self, encoder_class, encoder_config, n_queries):
encoder = encoder_class(**dict(encoder_config))
queries = nn.Parameter(torch.randn(n_queries + 1, encoder_config.n_embd))
attn_pool = AttentionPooling(
Member:

The attention pooling layer should attend to the combination of the audio and vision encoder output tokens if both are activated.

Collaborator Author:

Maybe this is something for the future, since we currently don't have parallel data across all modalities.
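
For later reference, a hypothetical sketch of the reviewer's idea: pool over the concatenation of both encoders' output tokens with one set of learned queries. This stands in for the PR's AttentionPooling module, whose exact interface is not shown here, so all names and shapes are assumptions.

import torch
from torch import nn

class JointAttentionPoolingSketch(nn.Module):
    # One learned query set attends over vision tokens, audio tokens, or both.
    def __init__(self, n_embd: int = 512, n_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, n_embd))
        self.attn = nn.MultiheadAttention(n_embd, num_heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor,
                audio_tokens: torch.Tensor | None = None) -> torch.Tensor:
        tokens = vision_tokens
        if audio_tokens is not None:
            # Attend to the union of both encoders' output tokens.
            tokens = torch.cat([vision_tokens, audio_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens, need_weights=False)
        return pooled  # (batch, n_queries, n_embd)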

vision_embd, vision_cls_token = self._forward_encode_vision(inputs)
# TODO: The "modality_key" needs to be implemented.
if inputs[self.modality_key][0] == self.AUDIO:
modality_embd, modality_cls_token = self._forward_encode_audio(inputs)
Member:

Apply this only if the audio encoder exists. I'm not sure whether we also want to check that audio data is in the inputs. Checking explicitly would maybe help with training on only two modalities at a time.

Collaborator Author:

Again, for the same reason as mentioned above, we can currently only train on two modalities at a time.
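
A small sketch of the explicit check the reviewer suggests, with hypothetical key names: a modality is encoded only when its encoder exists and its data is present in the batch, which makes "any two modalities at a time" a property of the data rather than of the forward pass.

from typing import Callable, Optional, Tuple

def encode_if_available(
    batch: dict,
    sample_key: str,
    encode_fn: Optional[Callable[[dict], Tuple]],
) -> Optional[Tuple]:
    # Skip the modality if its encoder was never built or its data is missing.
    if encode_fn is None or sample_key not in batch:
        return None
    return encode_fn(batch)

# Possible usage inside the CoCa forward pass (key names are placeholders):
#   vision_out = encode_if_available(inputs, "images", self._forward_encode_vision)
#   audio_out  = encode_if_available(inputs, "audio", self._forward_encode_audio)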
