
How to fix top1_err being almost always 100 #60

Open
YJforgithub opened this issue Dec 21, 2023 · 10 comments

YJforgithub commented Dec 21, 2023

I initialize from the b16_f8 checkpoint in the Model Zoo, so in principle the error rate should be fairly low from the start, and the K400 dataset I use is the one provided by the author. Could the author help me figure out where I might have changed something incorrectly?
The exact config is:
TRAIN:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 32
  EVAL_PERIOD: 1
  CHECKPOINT_PERIOD: 5
  AUTO_RESUME: True
DATA:
  USE_OFFSET_SAMPLING: True
  DECODING_BACKEND: decord
  NUM_FRAMES: 8
  SAMPLING_RATE: 16
  TRAIN_JITTER_SCALES: [256, 320]
  TRAIN_CROP_SIZE: 224
  TEST_CROP_SIZE: 224
  INPUT_CHANNEL_NUM: [3]
  PATH_TO_DATA_DIR: /home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400
  PATH_LABEL_SEPARATOR: ','
  PATH_PREFIX: /home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400/videos_320
  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]
  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]
UNIFORMERV2:
  BACKBONE: 'uniformerv2_b16'
  N_LAYERS: 4
  N_DIM: 768
  N_HEAD: 12
  MLP_FACTOR: 4.0
  BACKBONE_DROP_PATH_RATE: 0.
  DROP_PATH_RATE: 0.
  MLP_DROPOUT: [0.5, 0.5, 0.5, 0.5]
  CLS_DROPOUT: 0.5
  RETURN_LIST: [8, 9, 10, 11]
  NO_LMHRA: True
  TEMPORAL_DOWNSAMPLE: False
  PRETRAIN: 'k400/k400_b16_f8x224/k400_uniformerv2_b16_8x224.pyth'
AUG:
  NUM_SAMPLE: 1
  ENABLE: True
  COLOR_JITTER: 0.4
  AA_TYPE: rand-m7-n4-mstd0.5-inc1
  INTERPOLATION: bicubic
  RE_PROB: 0.
  RE_MODE: pixel
  RE_COUNT: 1
  RE_SPLIT: False
BN:
  USE_PRECISE_STATS: False
  NUM_BATCHES_PRECISE: 200
SOLVER:
  ZERO_WD_1D_PARAM: True
  BASE_LR_SCALE_NUM_SHARDS: True
  BASE_LR: 4e-4
  COSINE_AFTER_WARMUP: True
  COSINE_END_LR: 1e-6
  WARMUP_START_LR: 1e-6
  WARMUP_EPOCHS: 0.
  LR_POLICY: cosine
  MAX_EPOCH: 50
  MOMENTUM: 0.9
  WEIGHT_DECAY: 0.05
  OPTIMIZING_METHOD: adamw
  COSINE_AFTER_WARMUP: True
MODEL:
  NUM_CLASSES: 400
  ARCH: uniformerv2
  MODEL_NAME: Uniformerv2
  LOSS_FUNC: cross_entropy
  DROPOUT_RATE: 0.5
  USE_CHECKPOINT: False
  CHECKPOINT_NUM: [0]
TEST:
  ENABLE: True
  DATASET: kinetics
  BATCH_SIZE: 256
  NUM_SPATIAL_CROPS: 1
  NUM_ENSEMBLE_VIEWS: 1
DATA_LOADER:
  NUM_WORKERS: 8
  PIN_MEMORY: True
TENSORBOARD:
  ENABLE: False
NUM_GPUS: 1
NUM_SHARDS: 1
RNG_SEED: 0
OUTPUT_DIR: .

The log is as follows:
[training log screenshot]

@YJforgithub (Author)

The image above is rather small; below is a larger screenshot, at epoch 1.
[larger training log screenshot]

@YJforgithub (Author)

When I reduce base_lr, the top1_err at the very first iteration of epoch 1 is already 25. What could be the reason for this?

@Andy1621 (Collaborator)

Does the result match the reported accuracy if you run testing directly?

@YJforgithub (Author)

With the default lr of 0.0004 the training goes to NaN; I haven't tried direct testing yet.
I am using the test.csv and val.csv provided by the author, and found that their contents are identical. Is the accuracy reported in the paper measured on a separate test set, or directly on the val.csv / test.csv you provide?

@Andy1621 (Collaborator)

You can first test on K400 to check whether the results match. If you switch to a different dataset, an error of 100 at the beginning is normal and it converges later. For CLIP-based models, the learning rate needs to be reduced.
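
To make the learning-rate point concrete, here is a rough sketch of a cosine-with-warmup schedule written in terms of the SOLVER keys from the config above. This is only the generic policy, not this repo's exact code, and the 1e-5 value is purely illustrative:

import math

def lr_at_epoch(cur_epoch, base_lr=4e-4, warmup_epochs=0.0,
                warmup_start_lr=1e-6, cosine_end_lr=1e-6, max_epoch=50):
    # Linear warmup followed by cosine decay, mirroring the SOLVER keys above.
    if cur_epoch < warmup_epochs:
        alpha = cur_epoch / warmup_epochs
        return warmup_start_lr + alpha * (base_lr - warmup_start_lr)
    progress = (cur_epoch - warmup_epochs) / (max_epoch - warmup_epochs)
    return cosine_end_lr + 0.5 * (base_lr - cosine_end_lr) * (1.0 + math.cos(math.pi * progress))

# WARMUP_EPOCHS: 0. means the very first step already runs at the full BASE_LR (4e-4),
# which can be too aggressive for a CLIP-initialized backbone.
print(lr_at_epoch(0.0))                                   # 4e-4
print(lr_at_epoch(0.0, base_lr=1e-5, warmup_epochs=5.0))  # 1e-6, still warming up

So a smaller BASE_LR together with a few warmup epochs is the usual remedy when fine-tuning diverges to NaN.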

@YJforgithub (Author)

I am using the K400 checkpoint from the author's Model Zoo. If I initialize the model with it and then train on K400 again, the training error on the training set should not be 100 at the very beginning, since the dataset has not been changed.

@Andy1621 (Collaborator)

I see. Then you could first run evaluation on the test set to check whether the results match.

@YJforgithub (Author)

Hello, my training ran for a while and was then unexpectedly interrupted, but checkpoints were saved. However, it seems that only the weights can be restored; the epoch and optimizer state do not appear to be recoverable.

@Andy1621 (Collaborator)

The optimizer state is saved in the checkpoint.
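
A quick way to verify this is to inspect the saved file. A minimal sketch, assuming the usual PySlowFast-style checkpoint layout and a hypothetical file name:

import torch

# Hypothetical checkpoint path; adjust to your OUTPUT_DIR.
ckpt = torch.load("checkpoints/checkpoint_epoch_00005.pyth", map_location="cpu")

# PySlowFast-style checkpoints usually store the model weights, the optimizer
# state, and the epoch index; the key names below follow that convention and
# should be checked against your own file.
print(list(ckpt.keys()))
print("epoch:", ckpt.get("epoch"))
print("has optimizer state:", "optimizer_state" in ckpt)

If those keys are present, resuming with AUTO_RESUME: True (already set in the config above) should restore the epoch and optimizer state as well as the weights.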

@Aaron198513

PATH_TO_DATA_DIR: /home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400
PATH_LABEL_SEPARATOR: ','
PATH_PREFIX: /home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400/videos_320

Could you explain what the difference is between these two settings?
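
For context, the usual convention in PySlowFast-derived loaders (a sketch under that assumption; please confirm against this repo's dataset code) is that PATH_TO_DATA_DIR points to the folder holding the train.csv/val.csv/test.csv annotation files, while PATH_PREFIX is prepended to every video path listed inside those files:

import os

PATH_TO_DATA_DIR = "/home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400"        # where the csv lists live
PATH_PREFIX = "/home/fzus/kyj/Dataset_all/Kinetics_400/kinetics_400/videos_320"  # prepended to each csv entry
PATH_LABEL_SEPARATOR = ","

# Read the annotation list and resolve the first video's full path.
with open(os.path.join(PATH_TO_DATA_DIR, "train.csv")) as f:
    rel_path, label = f.readline().strip().split(PATH_LABEL_SEPARATOR)
print(os.path.join(PATH_PREFIX, rel_path), int(label))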
