From da17e15ea81c1f321177823325d2d8ca54749845 Mon Sep 17 00:00:00 2001
From: ermu2001
Date: Mon, 29 Apr 2024 23:19:53 +0800
Subject: [PATCH] update readme: extending recaptioning

---
 DATA.md   | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 README.md |  6 +++++
 2 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/DATA.md b/DATA.md
index d61673e..cc87045 100644
--- a/DATA.md
+++ b/DATA.md
@@ -32,6 +32,70 @@ Also other fantastic repo intergrating these benchmarks are helpful in the proce
 - [VideoLlava](https://github.com/PKU-YuanGroup/Video-LLaVA/tree/main/videollava)
 - [IG-VLM](https://github.com/imagegridworth/IG-VLM/tree/main)
 
-### Inter4k
-This is a dataset with 1000 samples of high resolution videos. We prepare the data folloing the instructions from their [official website](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html)
\ No newline at end of file
+
+### Recaptioning
+#### Inter4K
+
+This is a dataset of 1,000 high-resolution video samples. We prepare the data following the instructions on their [official website](https://alexandrosstergiou.github.io/datasets/Inter4K/index.html).
+
+#### Extending Recaptioning
+The recaptioning part is designed to be extensible.
+
+The inference script [tasks/eval/recaption/pllava_recaption.py](tasks/eval/recaption/pllava_recaption.py) uses the dataset class [RecaptionDataset](tasks/eval/recaption/__init__.py#L197). The details of each dataset are kept in its data_list_info attribute:
+```python
+data_list_info = OrderedDict({
+    # "Panda70M": OrderedDict(
+    #     json_relpath="Panda70M/annotations.json",
+    #     prefix="DATAS/Recaption/Panda70M/videos",
+    #     data_type="video",
+    #     bound=False,
+    #     key_rename_map={
+    #         # 'caption': 'hint',
+    #     },
+    #     name_key='video_name',
+    #     postfix=('mp4', 'mkv', 'webm'),
+    #     recaption_type=RecaptionSample,
+    # ),  # doesn't have start & end
+    "Inter4K": OrderedDict(
+        json_relpath="Inter4K/annotations.json",
+        prefix="DATAS/Recaption/Inter4K/60fps/UHD",
+        data_type="video",
+        bound=False,
+        key_rename_map={
+            # 'caption': 'hint',
+        },
+        name_key='video_name',
+        postfix=('mp4', 'mkv', 'webm'),
+        recaption_type=CaptionSample,
+    ),  # doesn't have start & end
+})
+```
+It contains the relative path to an annotation JSON file, which holds a list where each item is one sample to be captioned. For example, Inter4K/annotations.json looks like:
+```json
+[
+    {
+        "video_name": "973"
+    },
+    ...
+]
+```
+and the directory DATAS/Recaption/Inter4K/60fps/UHD looks like:
+```
+$ ls DATAS/Recaption/Inter4K/60fps/UHD
+1.mp4 134.mp4 170.mp4 ...
+```
+
+Since only the video itself is needed when captioning directly, the annotation file only has to list the name of each video under the "prefix" directory.
+
+Extending a dataset for captioning consists of the following steps:
+1. have all the videos downloaded
+2. construct an annotations.json file in the specific format (a sketch script is given below)
+3. configure the recaption dataset [here](tasks/eval/recaption/__init__.py#L197), where you need to determine:
+    - json_relpath: the relative path of the annotation file
+    - prefix: the root directory of the videos
+    - postfix: a list containing all the file extensions of these videos
+
+The other options are experimental, so stick with the default settings as in Inter4K. The recommended video length is around 5-20 seconds.
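+
+For step 2, a minimal sketch of generating annotations.json from a directory of videos is given below. The dataset name and paths are hypothetical placeholders, not part of this repo; adjust them to your own data:
+```python
+import json
+from pathlib import Path
+
+# Hypothetical locations; they should mirror the prefix/postfix you set in data_list_info.
+prefix = Path("DATAS/Recaption/MyDataset/videos")
+postfix = ("mp4", "mkv", "webm")
+
+# One entry per video under the prefix directory; the key matches name_key, and the
+# value is the file name without its extension, as in Inter4K/annotations.json.
+samples = [
+    {"video_name": path.stem}
+    for path in sorted(prefix.iterdir())
+    if path.suffix.lstrip(".") in postfix
+]
+
+with open("DATAS/Recaption/MyDataset/annotations.json", "w") as f:
+    json.dump(samples, f, indent=4)
+```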
+
+P.S. "bound" is meant to ensure the video passed to the model contains no scene transitions. This part wasn't tested, so set bound to false and make sure each original video file is a single continuous clip. But always feel free to explore and contribute to PLLaVA!
\ No newline at end of file
diff --git a/README.md b/README.md
index 70c0e33..b07bc8e 100644
--- a/README.md
+++ b/README.md
@@ -340,6 +340,12 @@ bash scripts/gallery.sh
 Feel free to use the compare version to compare differnt models' results or use the single gallery version to check out one model's results. They are basically the same. Check out this [script](scripts/gallery.sh) for more details
 
+#### For Captioning and Recaptioning
+Follow the instructions in [DATA.md](DATA.md#extending-recaptioning) to extend the recaptioning data in a few steps.
+
+Feel free to point us to high-quality video datasets; we would proceed with captioning them.
+
+
 # :page_facing_up: Citation
 If you find this project useful in your research, please consider cite: