diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..c4b818b Binary files /dev/null and b/.DS_Store differ diff --git a/images/algo_bench.pdf b/images/algo_bench.pdf new file mode 100644 index 0000000..412d9be Binary files /dev/null and b/images/algo_bench.pdf differ diff --git a/images/arch6.pdf b/images/arch6.pdf new file mode 100644 index 0000000..4c6d405 Binary files /dev/null and b/images/arch6.pdf differ diff --git a/images/car_bbox.jpg b/images/car_bbox.jpg new file mode 100644 index 0000000..621c4e0 Binary files /dev/null and b/images/car_bbox.jpg differ diff --git a/images/compare1.png b/images/compare1.png new file mode 100644 index 0000000..5b3cbf3 Binary files /dev/null and b/images/compare1.png differ diff --git a/images/compare2.png b/images/compare2.png new file mode 100644 index 0000000..89cfd28 Binary files /dev/null and b/images/compare2.png differ diff --git a/images/compare3.png b/images/compare3.png new file mode 100644 index 0000000..1141725 Binary files /dev/null and b/images/compare3.png differ diff --git a/images/data_creation.png b/images/data_creation.png new file mode 100644 index 0000000..c29dde0 Binary files /dev/null and b/images/data_creation.png differ diff --git a/images/hallucination.png b/images/hallucination.png new file mode 100644 index 0000000..29aac4d Binary files /dev/null and b/images/hallucination.png differ diff --git a/images/llava_g_arch.pdf b/images/llava_g_arch.pdf new file mode 100644 index 0000000..04372e5 Binary files /dev/null and b/images/llava_g_arch.pdf differ diff --git a/images/llavag_arch.png b/images/llavag_arch.png new file mode 100644 index 0000000..6bab8ce Binary files /dev/null and b/images/llavag_arch.png differ diff --git a/images/mark.png b/images/mark.png new file mode 100644 index 0000000..3801299 Binary files /dev/null and b/images/mark.png differ diff --git a/images/overview.pdf b/images/overview.pdf new file mode 100644 index 0000000..fe7b54c Binary files /dev/null and b/images/overview.pdf differ diff --git a/images/teasor.png b/images/teasor.png new file mode 100644 index 0000000..d4c9312 Binary files /dev/null and b/images/teasor.png differ diff --git a/images/teasor_v5.pdf b/images/teasor_v5.pdf new file mode 100644 index 0000000..7024980 Binary files /dev/null and b/images/teasor_v5.pdf differ diff --git a/images/vis.png b/images/vis.png new file mode 100644 index 0000000..e3879ef Binary files /dev/null and b/images/vis.png differ diff --git a/images/visual_prompt.png b/images/visual_prompt.png new file mode 100644 index 0000000..01d2756 Binary files /dev/null and b/images/visual_prompt.png differ diff --git a/index.html b/index.html new file mode 100644 index 0000000..eaff843 --- /dev/null +++ b/index.html @@ -0,0 +1,661 @@ + + + + + + + + + LLaVA-Plus + + + + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+

🌋 LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

+ + +
+ + +
+
+ HKUST + SCUT + Microsoft Research, Redmond + IDEA Research + University of Wisconsin-Madison + Tsinghua University + CUHK +
+ +
* Equal Contribution     Equal Advisory Contribution     🚩 Directional Lead
+
+ + + +
+
+
+
+
+ + + +
+
+ +
+
+ +
+

+ + + +

+ +
+
+
+ +
+
+ +
+
+ +
+
+ +
+
+ +
+
+

Highlights

+
+

LLaVA-Grounding makes the following key contributions:

    +
  1. New grounded visual chat data. We introduce a data annotation pipeline to label high-quality Grounded Visual Chat (GVC) data. Leveraging human-labeled object detection data and harnessing the robust matching capability of GPT-4, we have labeled 150K GVC instances using the LLaVA instruction tuning dataset.
  2. 🌋 LLaVA-Grounding Model. We present an end-to-end model that connects a Large Multimodal Model (LMM) with a grounding model to facilitate grounded visual chat. Our model supports both object- and pixel-level grounding and accommodates various visual prompts such as mark, click, box, and scribble, offering a broader range of input and output prompt types than other LMMs.
  3. Grounding Bench. We establish Grounding Bench for evaluating grounded visual chat and propose an auto-evaluation pipeline aided by GPT-4 (a loose sketch of such a scoring call is given after this list). This benchmark assesses grounded visual chat capabilities and provides performance metrics for other state-of-the-art methods.
  4. Performance. Our empirical study validates the effectiveness of LLaVA-Grounding, which achieves the best overall performance on our Grounding Bench and competitive performance on traditional grounding tasks such as RefCOCO and Flickr30K.
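As a loose illustration of what a GPT-4-aided auto-evaluation call could look like: this is a hypothetical sketch for illustration only, not the paper's actual prompt or scoring protocol, and the helper names (score_grounded_answer, gt_objects with "phrase"/"box" keys) and the 1-10 scale are assumptions.

    # Hypothetical sketch of a GPT-4-aided scoring call (assumed prompt and scale).
    from openai import OpenAI

    client = OpenAI()  # requires OPENAI_API_KEY in the environment

    def score_grounded_answer(question: str, grounded_answer: str, gt_objects: list[dict]) -> str:
        # Format the ground-truth phrase/box pairs for the judge prompt.
        gt_text = "\n".join(f"- {o['phrase']}: box {o['box']}" for o in gt_objects)
        prompt = (
            "You are evaluating a grounded visual chat answer.\n"
            f"Question: {question}\n"
            f"Model answer (with grounded boxes): {grounded_answer}\n"
            f"Ground-truth objects:\n{gt_text}\n"
            "Rate the answer's correctness and grounding precision from 1 to 10, "
            "then briefly justify the score."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content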
+

+ +
+
+
+ +
+
+ + + + + + +
+ +
+
+

🌋 LLaVA-Grounding Network Architecture

+
+
+ + +
+ +
+
+
+ + +
+ +
+
LLaVA-Grounding adds support for grounding and visual prompts through two additional modules.

+ Prompt encoder. + +

    For an input image \(X_{\texttt{v}}\) and a visual prompt \(X_{\texttt{p}}\), we employ the pre-trained Semantic-SAM as the prompt encoder. This encoder extracts visual features based on the input image and visual prompts, denoted as \(Z_{\texttt{p}}=h(X_{\texttt{v}},X_{\texttt{p}})\). To convert these prompt features into language embedding tokens \(H_{\texttt{p}}\) of the same dimensionality as the word embedding space in the language model, we use a simple linear layer with a trainable projection matrix \(W_{\texttt{p}}\): + + \begin{equation} + H_{\texttt{p}}=W_{\texttt{p}} \cdot Z_{\texttt{p}}, \text{ where } Z_{\texttt{p}}=h\left(X_{\texttt{v}},X_{\texttt{p}}\right) + \end{equation}
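To make the projection step concrete, here is a minimal PyTorch-style sketch of the prompt-feature projection; the Semantic-SAM encoder call and all dimensions are illustrative assumptions, not the released API.

    import torch
    import torch.nn as nn

    class PromptProjector(nn.Module):
        """Maps prompt features Z_p into the LM word-embedding space: H_p = W_p . Z_p."""
        def __init__(self, prompt_dim: int = 256, lm_hidden_dim: int = 4096):
            super().__init__()
            # Trainable projection matrix W_p, a single linear layer with no bias.
            self.w_p = nn.Linear(prompt_dim, lm_hidden_dim, bias=False)

        def forward(self, z_p: torch.Tensor) -> torch.Tensor:
            # z_p: [num_prompts, prompt_dim] -> h_p: [num_prompts, lm_hidden_dim]
            return self.w_p(z_p)

    # Illustrative usage (semantic_sam_prompt_encoder is a hypothetical wrapper for h):
    # z_p = semantic_sam_prompt_encoder(x_v, x_p)   # Z_p = h(X_v, X_p)
    # h_p = PromptProjector()(z_p)                  # language embedding tokens H_p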
+ Grounding model. + +
    In addition to the language response \(X_{\texttt{a}}\), our model also produces features \(X_{\texttt{g}}\) for grounding. We employ a pretrained OpenSeeD model as the grounding model to generate bounding boxes \(\mathbf{B}\) and masks \(\mathbf{M}\). This process is defined as follows:

    \begin{equation}
    \mathbf{B, M}=s\left(X_{\texttt{v}},W_{\texttt{g}} \cdot X_{\texttt{g}}\right)
    \end{equation}
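As a rough sketch of the grounding path, again in PyTorch style: the grounder argument stands in for the pre-trained OpenSeeD interface, which is not specified here, and the dimensions are assumptions.

    import torch
    import torch.nn as nn

    class GroundingHead(nn.Module):
        """Projects LM grounding features X_g and queries the grounding model:
        B, M = s(X_v, W_g . X_g)."""
        def __init__(self, lm_hidden_dim: int, query_dim: int, grounder: nn.Module):
            super().__init__()
            # Trainable projection W_g from the LM hidden space to the grounder's query space.
            self.w_g = nn.Linear(lm_hidden_dim, query_dim, bias=False)
            self.grounder = grounder  # pre-trained OpenSeeD-style model (hypothetical wrapper)

        def forward(self, x_v: torch.Tensor, x_g: torch.Tensor):
            queries = self.w_g(x_g)                      # W_g . X_g
            boxes, masks = self.grounder(x_v, queries)   # B, M = s(X_v, W_g . X_g)
            return boxes, masks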
+
+ + +

+ + +
+ + +
+
+ +
+ + + + + +
+ +
+
+

Comparison with other LMMs: Grounded detailed description

+ + +
+ +
+ + + + +
+ +
+
1 / 11
+ +
Example 1: A real-life image.
+
+ +
+
2 / 11
+ +
Example 2: An open-set concept "dragon".
+
+ +
+
3 / 11
+ +
Example 3: A real-life image.
+
+ +
+ +
+
+
+ + + + + +
+
+

BibTeX

+

+@misc{zhang2023llavagrounding,
+title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
+author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
+year={2023},
+booktitle={arXiv}
+}
+  
+
+
+ +
+
+

Acknowledgement

+

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.

+ +

+Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

+ +

Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]

+
+
+ + + + + + diff --git a/static/css/index.css b/static/css/index.css new file mode 100644 index 0000000..a0cc196 --- /dev/null +++ b/static/css/index.css @@ -0,0 +1,250 @@ +body { + font-family: 'Noto Sans', sans-serif; + } + + + .footer .icon-link { + font-size: 25px; + color: #000; + } + + .link-block a { + margin-top: 5px; + margin-bottom: 5px; + } + + .dnerf { + font-variant: small-caps; + } + + + .teaser .hero-body { + padding-top: 0; + padding-bottom: 3rem; + } + + .teaser { + font-family: 'Google Sans', sans-serif; + } + + + .publication-title { + } + + .publication-banner { + max-height: parent; + + } + + .publication-banner video { + position: relative; + left: auto; + top: auto; + transform: none; + object-fit: fit; + } + + .publication-header .hero-body { + } + + .publication-title { + font-family: 'Google Sans', sans-serif; + } + + .publication-authors { + font-family: 'Google Sans', sans-serif; + } + + .publication-venue { + color: #555; + width: fit-content; + font-weight: bold; + } + + .publication-awards { + color: #ff3860; + /* width: fit-content; */ + font-weight: bolder; + } + + .title + .publication-authors, + .subtitle + .publication-authors { + margin-top: -1.25rem; + } + + .publication-authors a { + color: hsl(204, 86%, 53%) !important; + } + + .publication-authors a:hover { + text-decoration: underline; + } + + .author-block { + display: inline-block; + } + + .publication-banner img { + } + + .publication-authors { + /*color: #4286f4;*/ + } + + .publication-video { + position: relative; + width: 100%; + height: 0; + padding-bottom: 56.25%; + + overflow: hidden; + border-radius: 10px !important; + } + + .publication-video iframe { + position: absolute; + top: 0; + left: 0; + width: 100%; + height: 100%; + } + + .publication-body img { + } + + .results-carousel { + overflow: hidden; + } + + .results-carousel .item { + margin: 5px; + overflow: hidden; + border: 1px solid #bbb; + border-radius: 10px; + padding: 0; + font-size: 0; + } + + .results-carousel video { + margin: 0; + } + + + .interpolation-panel { + background: #f5f5f5; + border-radius: 10px; + } + + .interpolation-panel .interpolation-image { + width: 100%; + border-radius: 5px; + } + + .interpolation-video-column { + } + + .interpolation-panel .slider { + margin: 0 !important; + } + + .interpolation-panel .slider { + margin: 0 !important; + } + + #interpolation-image-wrapper { + width: 100%; + } + #interpolation-image-wrapper img { + border-radius: 5px; + } + + + + * {box-sizing:border-box} + + /* Slideshow container */ + .slideshow-container { + max-width: 1000px; + position: relative; + margin: auto; + } + + /* Hide the images by default */ + .mySlides { + display: none; + } + + /* Next & previous buttons */ + .prev, .next { + cursor: pointer; + position: absolute; + top: 50%; + width: auto; + margin-top: -22px; + padding: 16px; + color: white; + font-weight: bold; + font-size: 18px; + transition: 0.6s ease; + border-radius: 0 3px 3px 0; + user-select: none; + } + + /* Position the "next button" to the right */ + .next { + right: 0; + border-radius: 3px 0 0 3px; + } + + /* On hover, add a black background color with a little bit see-through */ + .prev:hover, .next:hover { + background-color: rgba(0,0,0,0.8); + } + + /* Caption text */ + .text { + color: #bbb; + font-size: 18px; + padding: 8px 12px; + position: absolute; + bottom: -80px; + width: 100%; + text-align: center; + } + + /* Number text (1/3 etc) */ + .numbertext { + color: #f2f2f2; + font-size: 12px; + padding: 8px 12px; + position: absolute; + 
top: 0; + } + + /* The dots/bullets/indicators */ + .dot { + cursor: pointer; + height: 15px; + width: 15px; + margin: 0 2px; + background-color: #bbb; + border-radius: 50%; + display: inline-block; + transition: background-color 0.6s ease; + } + + .active, .dot:hover { + background-color: #717171; + } + + /* Fading animation */ + .fade { + animation-name: fade; + animation-duration: 3.5s; + } + + @keyframes fade { + from {opacity: .4} + to {opacity: 1} + } + \ No newline at end of file