Merge pull request #293 from mohammad-gh009/patch-4
fixing a typo !
merveenoyan authored May 2, 2024
2 parents 313e295 + 0bdde2e commit 0261016
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions chapters/en/unit4/multimodal-models/tasks-models-part1.mdx
@@ -5,7 +5,7 @@ In this section, we will briefly look at the different multimodal tasks involvin
## Examples of Tasks
Before looking into specific models, it's crucial to understand the diverse range of tasks involving image and text. These tasks include but are not limited to:

- **Visual Question Anwering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better!
- **Visual Question Answering (VQA) and Visual Reasoning:** Imagine a machine that looks at a picture and understands your questions about it. Visual Question Answering (VQA) is just that! It trains computers to extract meaning from images and answer questions like "Who's driving the car?" while Visual Reasoning is the secret sauce, enabling the machine to go beyond simple recognition and infer relationships, compare objects, and understand scene context to give accurate answers. It's like asking a detective to read the clues in a picture, only much faster and better!

- **Document Visual Question Answering (DocVQA):** Imagine a computer understanding both the text and layout of a document, like a map or contract, and then answering questions about it directly from the image. That's Document Visual Question Answering (DocVQA) in a nutshell. It combines computer vision for processing image elements and natural language processing to interpret text, allowing machines to "read" and answer questions about documents just like humans do. Think of it as supercharging document search with AI to unlock all the information trapped within those images.
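As an aside to the VQA and DocVQA descriptions in the changed file above, a minimal sketch of how these tasks might be tried with off-the-shelf 🤗 Transformers pipelines is shown below. The pipeline task names are real, but the specific checkpoints (`dandelin/vilt-b32-finetuned-vqa`, `impira/layoutlm-document-qa`), the COCO example image URL, and the local `invoice.png` file are illustrative assumptions, not anything this commit touches.

```python
# A minimal sketch (not part of this commit): trying the two tasks described
# above with off-the-shelf Hugging Face pipelines. The checkpoints, the example
# image URL, and the local "invoice.png" file are illustrative assumptions.
import requests
from PIL import Image
from transformers import pipeline

# --- Visual Question Answering (VQA) ---
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # a photo of two cats
image = Image.open(requests.get(url, stream=True).raw)
for pred in vqa(image=image, question="How many cats are in the picture?", top_k=3):
    print(f"{pred['answer']}: {pred['score']:.3f}")

# --- Document Visual Question Answering (DocVQA) ---
# Needs an OCR backend (pytesseract plus the Tesseract binary) installed.
docvqa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(docvqa(image="invoice.png", question="What is the total amount?"))
```

The model choices here are only examples; any checkpoint trained for the corresponding task can be swapped in.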

@@ -392,4 +392,4 @@ Congratulations! you made it till the end. Now on to the next section for more o
## References
1. [Vision-Language Pre-training: Basics, Recent Advances, and Future Trends](https://arxiv.org/abs/2210.09263)<a id="pretraining-paper"></a>
2. [Document Collection Visual Question Answering](https://arxiv.org/abs/2104.14336)<a id="doc-vqa-paper"></a>
3. [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)<a id="grounding-dino"></a>
