This repository contains the code to run the model designed in 'Adaptive Vision Transformers for Efficient Processing of Video Data in Automotive Applications'. Below you'll find details about the code structure, how to set up the environment, run inference, and interpret the benchmarking results.
model contains the primary implementation, extending the mmengine and mmsegmentation frameworks.
- encoder-decoder: Modified encoder-decoder implementation.
- token reducing vision transformer: Token-reducing Vision Transformer module.
- setup: Notebook to install model weights and dependencies.
- example.ipynb: Example notebook demonstrating inference with the model (see the inference sketch after this list).
- benchamrking: Benchmarking results and analysis.
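
For orientation, a minimal inference sketch along the lines of example.ipynb might look as follows, assuming mmsegmentation >= 1.0 with its mmengine-based API. The config, checkpoint, and image paths are placeholders, and the exact registration of the custom modules may differ; example.ipynb is the authoritative reference.

```python
# Minimal inference sketch (hypothetical paths; see example.ipynb for the real workflow).
from mmseg.apis import init_model, inference_model, show_result_pyplot

config_file = 'configs/adaptive_vit.py'           # hypothetical config name
checkpoint_file = 'checkpoints/adaptive_vit.pth'  # hypothetical checkpoint name

# Build the segmentor; the custom encoder-decoder and token-reducing ViT must be
# registered (e.g. via custom_imports in the config or by importing the model package).
model = init_model(config_file, checkpoint_file, device='cuda:0')

# Run inference on a single frame and save the overlaid prediction.
result = inference_model(model, 'demo/frame_0001.jpg')
show_result_pyplot(model, 'demo/frame_0001.jpg', result, out_file='result.png', show=False)
```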
benchamrking contains NumPy files with the benchmarking results; a loading example follows the list below.
- Files starting with 'encode_times': Time taken to run the encoder, in seconds.
- Files starting with 'pixel_wise_acc': Pixel-wise accuracy loss compared to the original model, in %.
- Files starting with 'reduced_tokens_heatmap': Heatmaps of the locations where tokens are pruned most often.
- Files starting with 'pruned_tokens': Counts of pruned tokens, in absolute numbers.
- Files ending with '0.xx': Fixed threshold with a standard reduction interval of 8.
- Files containing 'lin': Linear threshold.
- Files containing 'all_layers': Reduction interval set to 1.
- Files containing 'int_x': Varying reduction intervals.
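
As a rough illustration, the result files can be inspected with NumPy as sketched below; the file names are hypothetical examples assembled from the prefixes and suffixes described above and may not match the actual files verbatim.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file names built from the naming scheme above; check the
# benchamrking directory for the exact files.
encode_times = np.load('benchamrking/encode_times_0.05.npy')    # encoder run times, seconds
acc_loss = np.load('benchamrking/pixel_wise_acc_0.05.npy')      # accuracy loss vs. original, %
heatmap = np.load('benchamrking/reduced_tokens_heatmap_0.05.npy')

print(f'mean encoder time: {encode_times.mean():.4f} s')
print(f'mean pixel-wise accuracy loss: {acc_loss.mean():.2f} %')

# Visualize where tokens are pruned most often.
plt.imshow(heatmap, cmap='viridis')
plt.colorbar(label='times pruned')
plt.title('Most pruned token locations')
plt.savefig('reduced_tokens_heatmap.png')
```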