Segment Anything Video (SA-V), consists of 51K diverse videos and 643K high-quality spatio-temporal segmentation masks (i.e., masklets). The dataset is released under the CC by 4.0 license. Browse the dataset here.
Visit here to download SA-V including the training, val and test sets.
Num Videos | Num Masklets | |
---|---|---|
SA-V train | 50,583 | 642,036 (auto 451,720 and manual 190,316) |
SA-V val | 155 | 293 |
SA-V test | 150 | 278 |
To load and visualize the SA-V training set annotations, refer to the example sav_visualization_example.ipynb notebook.
For SA-V training set we release the mp4 videos and store the masklet annotations per video as json files . Automatic masklets and manual masklets are stored separately as two json files: {video_id}_auto.json
and {video_id}_manual.json
. They can be loaded as dictionaries in python in the format below.
{
"video_id" : str; video id
"video_duration" : float64; the duration in seconds of this video
"video_frame_count" : float64; the number of frames in the video
"video_height" : float64; the height of the video
"video_width" : float64; the width of the video
"video_resolution" : float64; video_height $\times$ video_width
"video_environment" : List[str]; "Indoor" or "Outdoor"
"video_split" : str; "train" for training set
"masklet" : List[List[Dict]]; masklet annotations in list of list of RLEs.
The outer list is over frames in the video and the inner list
is over objects in the video.
"masklet_id" : List[int]; the masklet ids
"masklet_size_rel" : List[float]; the average mask area normalized by resolution
across all the frames where the object is visible
"masklet_size_abs" : List[float]; the average mask area (in pixels)
across all the frames where the object is visible
"masklet_size_bucket" : List[str]; "small": $1$ <= masklet_size_abs < $32^2$,
"medium": $32^2$ <= masklet_size_abs < $96^2$,
and "large": masklet_size_abs > $96^2$
"masklet_visibility_changes" : List[int]; the number of times where the visibility changes
after the first appearance (e.g., invisible -> visible
or visible -> invisible)
"masklet_first_appeared_frame" : List[int]; the index of the frame where the object appears
the first time in the video. Always 0 for auto masklets.
"masklet_frame_count" : List[int]; the number of frames being annotated. Note that
videos are annotated at 6 fps (annotated every 4 frames)
while the videos are at 24 fps.
"masklet_edited_frame_count" : List[int]; the number of frames being edited by human annotators.
Always 0 for auto masklets.
"masklet_type" : List[str]; "auto" or "manual"
"masklet_stability_score" : Optional[List[List[float]]]; per-mask stability scores. Auto annotation only.
"masklet_num" : int; the number of manual/auto masklets in the video
}
Note that in SA-V train, there are in total 50,583 videos where all of them have manual annotations. Among the 50,583 videos there are 48,436 videos that also have automatic annotations.
For SA-V val and test sets, we release the extracted frames as jpeg files, and the masks as png files with the following directory structure:
sav_val(sav_test)
├── sav_val.txt (sav_test.txt): a list of video ids in the split
├── JPEGImages_24fps # videos are extracted at 24 fps
│ ├── {video_id}
│ │ ├── 00000.jpg # video frame
│ │ ├── 00001.jpg # video frame
│ │ ├── 00002.jpg # video frame
│ │ ├── 00003.jpg # video frame
│ │ └── ...
│ ├── {video_id}
│ ├── {video_id}
│ └── ...
└── Annotations_6fps # videos are annotated at 6 fps
├── {video_id}
│ ├── 000 # obj 000
│ │ ├── 00000.png # mask for object 000 in 00000.jpg
│ │ ├── 00004.png # mask for object 000 in 00004.jpg
│ │ ├── 00008.png # mask for object 000 in 00008.jpg
│ │ ├── 00012.png # mask for object 000 in 00012.jpg
│ │ └── ...
│ ├── 001 # obj 001
│ ├── 002 # obj 002
│ └── ...
├── {video_id}
├── {video_id}
└── ...
All masklets in val and test sets are manually annotated in every frame by annotators. For each annotated object in a video, we store the annotated masks in a single png. This is because the annotated objects may overlap, e.g., it is possible in our SA-V dataset for there to be a mask for the whole person as well as a separate mask for their hands.
We provide an evaluator to compute the common J and F metrics on SA-V val and test sets. To run the evaluation, we need to first install a few dependencies as follows:
pip install -r requirements.txt
Then we can evaluate the predictions as follows:
python sav_evaluator.py --gt_root {GT_ROOT} --pred_root {PRED_ROOT}
or run
python sav_evaluator.py --help
to print a complete help message.
The evaluator expects the GT_ROOT
to be one of the following folder structures, and GT_ROOT
and PRED_ROOT
to have the same structure.
- Same as SA-V val and test directory structure
{GT_ROOT} # gt root folder
├── {video_id}
│ ├── 000 # all masks associated with obj 000
│ │ ├── 00000.png # mask for object 000 in frame 00000 (binary mask)
│ │ └── ...
│ ├── 001 # all masks associated with obj 001
│ ├── 002 # all masks associated with obj 002
│ └── ...
├── {video_id}
├── {video_id}
└── ...
In the paper for the experiments on SA-V val and test, we run inference on the 24 fps videos, and evaluate on the subset of frames where we have ground truth annotations (first and last annotated frames dropped). The evaluator will ignore the masks in frames where we don't have ground truth annotations.
- Same as DAVIS directory structure
{GT_ROOT} # gt root folder
├── {video_id}
│ ├── 00000.png # annotations in frame 00000 (may contain multiple objects)
│ └── ...
├── {video_id}
├── {video_id}
└── ...
The evaluation code is licensed under the BSD 3 license. Please refer to the paper for more details on the models. The videos and annotations in SA-V Dataset are released under CC BY 4.0.
Third-party code: the evaluation software is heavily adapted from VOS-Benchmark
and DAVIS
(with their licenses in LICENSE_DAVIS
and LICENSE_VOS_BENCHMARK
).