
channel error #10

Open
jianmanLin opened this issue Jul 19, 2023 · 20 comments

Comments

@jianmanLin

    elif cond_class == "audio":
        if self.cond_stage_forward is None:
            bs = c.shape[0] # 20
            c = c.reshape(-1,16,29) # [20, 16, 29]
            c = self.cond_stage_model_for_audio(c) # [20, 64]
            c = c.reshape(bs, 8, -1) # [20, 8, 8]
            c = self.cond_stage_model_for_audio_smooth(c)

When processing the audio, the network expects input of shape (B, 16, 29); the c.reshape(-1, 16, 29) step also confirms that expected input shape. My audio features match it, but at c = self.cond_stage_model_for_audio_smooth(c) I get: RuntimeError: Given groups=1, weight of size [16, 32, 3], expected input[20, 8, 8] to have 32 channels, but got 8 channels instead
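The failing path can be reproduced with a shape-only sketch. Here numpy stands in for the real sub-networks: `cond_stage_model_for_audio` is replaced by a random projection to 64 dims, which is an assumption about its output size based on the comments in the snippet, not the repo's actual weights:

```python
import numpy as np

def cond_stage_model_for_audio(c):
    # Stand-in for the repo's audio encoder: flattens each (16, 29)
    # feature and projects it to 64 dims with random weights
    return c.reshape(c.shape[0], -1) @ np.random.randn(16 * 29, 64)

c = np.random.randn(20, 16, 29)    # one (16, 29) feature per frame
bs = c.shape[0]                    # 20
c = c.reshape(-1, 16, 29)          # still (20, 16, 29) for this input
c = cond_stage_model_for_audio(c)  # (20, 64)
c = c.reshape(bs, 8, -1)           # (20, 8, 8)
print(c.shape)
# The smooth conv's weight has shape (16, 32, 3): it expects 32 input
# channels, but dim 1 of this tensor is 8, hence the RuntimeError.
```

The mismatch shows the reshape arithmetic only works out if each frame carries more than a single (16, 29) feature.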

@979277 commented Jul 19, 2023

I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.

@jianmanLin (Author)

(screenshot of the audio feature shapes)

After DeepSpeech extraction and windowing, the input audio comes out as (-1, 16, 29), which is the shape the author's code accepts. It passes through self.cond_stage_model_for_audio successfully, but fails in self.cond_stage_model_for_audio_smooth. One can guess that self.cond_stage_model_for_audio_smooth outputs (-1, 32), so to get the whole pipeline running I randomly initialized a (-1, 32) tensor as the audio output and continued the inference, but I still ran into shape mismatches later on.

@jianmanLin (Author)

> I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.

Hi, how did you process the audio features? Following the VOCA paper the author cites, I extracted audio features of shape (N, 16, 29).

@979277 commented Jul 19, 2023

I did it the same way as you, and the error occurs at the same point.

@jianmanLin (Author)

> I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.

What puzzles me about this is the c = c.reshape(-1, 16, 29) step: it assumes the input shape is (-1, 16, 29).

@jianmanLin (Author)

> > I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.
>
> What puzzles me about this is the c = c.reshape(-1, 16, 29) step: it assumes the input shape is (-1, 16, 29).

So the following layers shouldn't raise an error.

@979277 commented Jul 19, 2023

We can only wait for the author to fix it; maybe a parameter was filled in wrong in one of the steps.

@jianmanLin (Author)

> We can only wait for the author to fix it; maybe a parameter was filled in wrong in one of the steps.

My advisor asked me to reproduce this paper's baseline soon, and now I don't know how to proceed.

@jianmanLin (Author)

> I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.

It must be our audio processing that is wrong: the audio branch's final output should be (B, 64), and then the whole model runs through.

@979277 commented Jul 19, 2023

> > We can only wait for the author to fix it; maybe a parameter was filled in wrong in one of the steps.
>
> My advisor asked me to reproduce this paper's baseline soon, and now I don't know how to proceed.

We just discussed it: seq_len for attnet is set to 8 in the code, so the author may have taken 8 features of 16×29 as the audio feature for one video frame.
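If that reading is right, the forward path would look roughly like this. This is only a shape sketch under that assumption; a random projection stands in for the actual audio encoder:

```python
import numpy as np

B, seq_len = 4, 8                        # seq_len = 8, as set for attnet
c = np.random.randn(B, seq_len, 16, 29)  # 8 windows of (16, 29) per frame

proj = np.random.randn(16 * 29, 64)      # stand-in for the audio encoder
c = c.reshape(B * seq_len, -1) @ proj    # (32, 64): one 64-dim code per window
c = c.reshape(B, seq_len, 64)            # (4, 8, 64): a length-8 token sequence
print(c.shape)
```

With 8 windows per frame, the reshape back to (B, seq_len, -1) yields a sequence the attention net can consume, instead of the degenerate (20, 8, 8) above.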

@jianmanLin (Author)

> > > We can only wait for the author to fix it; maybe a parameter was filled in wrong in one of the steps.
> >
> > My advisor asked me to reproduce this paper's baseline soon, and now I don't know how to proceed.
>
> We just discussed it: seq_len for attnet is set to 8 in the code, so the author may have taken 8 features of 16×29 as the audio feature for one video frame.

You're right, thanks for the explanation.

@sstzal (Owner) commented Jul 19, 2023

> We just discussed it: seq_len for attnet is set to 8 in the code, so the author may have taken 8 features of 16×29 as the audio feature for one video frame.

You are right. Sorry for missing the details. The dim of each audio feature '0_0.npy' corresponding to each frame is 8×16×29.

@Bebaam commented Jul 21, 2023

Hello,
I have the same error.
How do you get the 8×16×29 feature for each frame? When I use the DeepSpeech extraction (with video_fps=25 instead of the original 60), I get (1, 16, 29) for each frame. How do you extend it to (8, 16, 29)?

@jianmanLin (Author) commented Jul 21, 2023 via email

@Bebaam commented Jul 21, 2023

Thanks for your response. Yeah, I've seen that paper; I'll try it and share if I'm successful :)

@jianmanLin (Author)

> I ran into the same error. I'm not sure whether it's caused by the way I process the audio features.

Hello, did you successfully reproduce this paper? In my training the inpainting area keeps shaking, and the training loss drops rapidly at the beginning and then oscillates within a small range. I feel very troubled.

@979277 commented Jul 31, 2023

Hi, how is your reproduction going? Would you mind exchanging WeChat so we can discuss each other's results?

@jianmanLin (Author) commented Jul 31, 2023 via email

@979277 commented Jul 31, 2023 via email

@sstzal (Owner) commented Dec 27, 2023

> Hello, I have the same error. How do you get the 8×16×29 feature for each frame? When I use the DeepSpeech extraction (with video_fps=25 instead of the original 60), I get (1, 16, 29) for each frame. How do you extend it to (8, 16, 29)?

Yes, it's (1, 16, 29) for each frame after DeepSpeech. Then you can concatenate the audio features of its 8 neighbouring frames into an [8, 16, 29] audio feature for each frame.

For example, for the 10th frame, we use the audio features of frames [7th, 8th, ..., 10th, ..., 14th] as the smooth audio feature of the 10th frame.
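The neighbour-window gathering described above can be sketched like this. The [i-3, i+4] window and the clamping at sequence boundaries are my reading of the example, not code from the repo:

```python
import numpy as np

feats = np.random.randn(100, 16, 29)  # one DeepSpeech feature per frame

def smooth_audio_feature(feats, i):
    # Stack the 8 frames [i-3, ..., i+4] around frame i, clamping
    # indices at the start and end of the sequence
    idx = np.clip(np.arange(i - 3, i + 5), 0, len(feats) - 1)
    return feats[idx]                  # (8, 16, 29)

f10 = smooth_audio_feature(feats, 10)  # built from frames 7..14
print(f10.shape)
```

Saving one such (8, 16, 29) array per frame (e.g. as '0_0.npy') would match the per-frame feature dimension the owner describes.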
