From 4f5aef0ff70b7ca0bfd58168f2aa4369695cdf11 Mon Sep 17 00:00:00 2001
From: psetinek
Date: Tue, 30 Apr 2024 23:15:56 +0200
Subject: [PATCH 1/8] test html render

---
 chapters/en/unit8/terminologies/camera-models.mdx | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx
index 9ba9d8db5..d27e5e0de 100644
--- a/chapters/en/unit8/terminologies/camera-models.mdx
+++ b/chapters/en/unit8/terminologies/camera-models.mdx
@@ -1,8 +1,10 @@
 # Camera models
 
 ## Pinhole Cameras
-
-![Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png)
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png" alt="Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg">
+    <p>Example text</p>
+</div>
The simplest kind of camera - perhaps one that you have made yourself - consists of a lightproof box, with a small hole made in one side and a screen or a photographic film on the other. Light rays passing through the hole generate an inverted image on the rear wall of the box. This simple model for a camera is commonly used in 3D graphics applications. ### Camera axes conventions From 97d728a0e83890a9514e5a692214bbed3bff6aa9 Mon Sep 17 00:00:00 2001 From: psetinek Date: Tue, 30 Apr 2024 23:31:13 +0200 Subject: [PATCH 2/8] test html --- chapters/en/unit8/terminologies/camera-models.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx index d27e5e0de..728375ade 100644 --- a/chapters/en/unit8/terminologies/camera-models.mdx +++ b/chapters/en/unit8/terminologies/camera-models.mdx @@ -1,7 +1,7 @@ # Camera models ## Pinhole Cameras -
+
Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg

Example text

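To put a number on the description above: in the idealised pinhole model, a point at camera coordinates (X, Y, Z) lands on the image plane by similar triangles, and the sign flip is the image inversion mentioned in the text. A minimal sketch (the focal length and the 3D point are made-up values):

```python
import numpy as np

def pinhole_project(point, focal_length):
    """Project a 3D point (X, Y, Z), in camera coordinates, onto the image plane.

    The image forms behind the pinhole, so both coordinates are negated;
    this is the inverted image described above.
    """
    X, Y, Z = point
    return np.array([-focal_length * X / Z, -focal_length * Y / Z])

# A made-up point 2 m in front of a pinhole "camera" with a 25 mm focal length.
print(pinhole_project(np.array([0.5, 0.3, 2.0]), focal_length=0.025))
# -> roughly [-0.00625, -0.00375] (metres on the rear wall of the box)
```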
From 567efbefe8fa2e18060e5ec0645b56a0af50e921 Mon Sep 17 00:00:00 2001 From: psetinek Date: Tue, 30 Apr 2024 23:35:07 +0200 Subject: [PATCH 3/8] test html v2 --- chapters/en/unit8/terminologies/camera-models.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx index 728375ade..818917633 100644 --- a/chapters/en/unit8/terminologies/camera-models.mdx +++ b/chapters/en/unit8/terminologies/camera-models.mdx @@ -2,8 +2,8 @@ ## Pinhole Cameras
- Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg -

Example text

+ Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg +

Text underneath the image

The simplest kind of camera - perhaps one that you have made yourself - consists of a lightproof box, with a small hole made in one side and a screen or a photographic film on the other. Light rays passing through the hole generate an inverted image on the rear wall of the box. This simple model for a camera is commonly used in 3D graphics applications. From 8c1af0d2602dd3dda7ef68b3aa3b5f8e991785ef Mon Sep 17 00:00:00 2001 From: psetinek Date: Wed, 1 May 2024 00:04:35 +0200 Subject: [PATCH 4/8] centered all images that needed it and their imgsubtitles --- chapters/en/unit8/3d-vision/nvs.mdx | 22 +++++--- .../unit8/3d_measurements_stereo_vision.mdx | 50 +++++++++++++------ chapters/en/unit8/nerf.mdx | 7 +-- .../en/unit8/terminologies/camera-models.mdx | 4 +- .../en/unit8/terminologies/linear-algebra.mdx | 16 ++++-- 5 files changed, 68 insertions(+), 31 deletions(-) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index 798adc397..293ef232e 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -28,6 +28,10 @@ Unlike the original NeRF, which trains a MLP which takes spatial points to a den ![PixelNeRF diagram](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png) image from https://alexyu.net/pixelnerf +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png" alt="PixelNeRF diagram">
+    <p>image from https://alexyu.net/pixelnerf</p>
+</div>
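In code, this conditioning amounts to projecting each query point into the input view and sampling an image feature there, which is then fed to the MLP alongside the point and viewing direction. A rough, self-contained sketch (the helper names, feature shapes, and camera values below are made up for illustration):

```python
import numpy as np

def project_to_image(point_world, K, R, t):
    """Pinhole projection of a 3D point into the input view: x_img ~ K (R X + t)."""
    x_cam = R @ point_world + t
    u, v, w = K @ x_cam
    return np.array([u / w, v / w])

def bilinear_sample(feature_map, uv):
    """Bilinearly interpolate an (H, W, C) feature map at continuous pixel coords (u, v)."""
    h, w, _ = feature_map.shape
    u = np.clip(uv[0], 0, w - 1.001)
    v = np.clip(uv[1], 0, h - 1.001)
    u0, v0 = int(u), int(v)
    du, dv = u - u0, v - v0
    top = (1 - du) * feature_map[v0, u0] + du * feature_map[v0, u0 + 1]
    bottom = (1 - du) * feature_map[v0 + 1, u0] + du * feature_map[v0 + 1, u0 + 1]
    return (1 - dv) * top + dv * bottom

# Hypothetical inputs: encoder features of the input image and one query point on a ray.
features = np.random.rand(64, 64, 512)     # stand-in for upsampled CNN features
K = np.array([[60.0, 0.0, 32.0], [0.0, 60.0, 32.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
query_point = np.array([0.1, -0.2, 1.5])
view_dir = np.array([0.0, 0.0, 1.0])

uv = project_to_image(query_point, K, R, t)
conditioning = bilinear_sample(features, uv)
mlp_input = np.concatenate([query_point, view_dir, conditioning])  # what the conditioned MLP sees
print(mlp_input.shape)  # (518,) = 3 + 3 + 512
```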
The method first passes the input images through a convolutional neural network (ResNet34), bilinearly upsampling features from multiple layers to the same resolution as the input images. As in a standard NeRF, the new view is generated by volume rendering. @@ -46,11 +50,14 @@ A model was trained separately on each class of object (e.g. planes, benches, ca ### Results (from the PixelNeRF website) -![Input image of a chair](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png) -![Rotating gif animation of rendered novel views](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif) - -image from https://alexyu.net/pixelnerf +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png" alt="Input image of a chair">
+</div>
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif" alt="Rotating gif animation of rendered novel views">
+    <p>image from https://alexyu.net/pixelnerf</p>
+</div>
The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf) @@ -76,9 +83,10 @@ The model actually starts with the weights from [Stable Diffusion Image Variatio However, here these CLIP image embeddings are concatenated with the relative viewpoint transformation between the input and novel views. (This viewpoint change is represented in terms of spherical polar coordinates.) -![Zero123](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png) -image from https://zero123.cs.columbia.edu - +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png" alt="Zero123">
+    <p>image from https://zero123.cs.columbia.edu</p>
+</div>
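As a rough illustration of this conditioning (the shapes, names, and the exact angle encoding below are assumptions made for the sketch, not the paper's exact interface): the CLIP image embedding is concatenated with the relative viewpoint change, and the input image's latent is stacked channel-wise with the noisy latent that is being denoised.

```python
import numpy as np

def viewpoint_conditioning(clip_embedding, d_elevation, d_azimuth, d_radius):
    """Concatenate the image embedding with a relative-viewpoint vector.

    Encoding the azimuth change with sin/cos lets it wrap around smoothly;
    this particular encoding is an assumption for the sketch.
    """
    pose = np.array([d_elevation, np.sin(d_azimuth), np.cos(d_azimuth), d_radius])
    return np.concatenate([clip_embedding, pose])

# Hypothetical tensors standing in for the real pipeline.
clip_embedding = np.random.rand(768)       # CLIP image embedding of the input view
cond = viewpoint_conditioning(clip_embedding, d_elevation=0.2, d_azimuth=np.pi / 6, d_radius=0.0)

input_latent = np.random.rand(4, 32, 32)   # VAE latent of the input image
noisy_latent = np.random.rand(4, 32, 32)   # latent of the novel view being denoised
unet_input = np.concatenate([noisy_latent, input_latent], axis=0)  # channel-wise concatenation

print(cond.shape, unet_input.shape)        # (772,) (8, 32, 32)
```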
The rest of the architecture is the same as Stable Diffusion. However, the latent representation of the input image is concatenated channel-wise with the noisy latents before being input into the denoising U-Net. diff --git a/chapters/en/unit8/3d_measurements_stereo_vision.mdx b/chapters/en/unit8/3d_measurements_stereo_vision.mdx index ac287651b..e9dbdbfe1 100644 --- a/chapters/en/unit8/3d_measurements_stereo_vision.mdx +++ b/chapters/en/unit8/3d_measurements_stereo_vision.mdx @@ -8,9 +8,10 @@ Now, let's say we are given this 2D image and the location of the pixel coordina We aim to solve the problem of determining the 3D structure of objects. In our problem statement, we can represent an object in 3D as a set of 3D points. Finding the 3D coordinates of each of these points helps us determine the 3D structure of the object. -![Figure 1: Image formation using single camera](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true) - -Figure 1: Image formation using single camera +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true" alt="Figure 1: Image formation using single camera">
+    <p>Figure 1: Image formation using single camera</p>
+</div>
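The loss of depth is easy to see numerically: every point along the ray through a pixel projects to the same (u, v). A small sketch (with made-up intrinsics) of that ambiguity:

```python
import numpy as np

f, cx, cy = 500.0, 320.0, 240.0  # made-up focal length (pixels) and principal point

def project(point):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel coordinates."""
    X, Y, Z = point
    return np.array([f * X / Z + cx, f * Y / Z + cy])

# The same direction scaled to different depths along one ray...
for depth in [1.0, 2.0, 5.0]:
    print(project(np.array([0.2, -0.1, 1.0]) * depth))
# ...always lands on the same pixel, so a single image cannot recover Z.
```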
## Solution Let's assume we are given the following information: @@ -42,9 +43,10 @@ Therefore, using 2 images of the same scene point P, known positions and orienta ## Simplified Solution Since there are many different positions and orientations for the camera locations which can be selected, we can select a location that makes the math simpler, less complex, and reduces computational processing when running on a computer or an embedded device. One configuration that is popular and generally used is shown in Figure 2. We use 2 cameras in this configuration, which is equivalent to a single camera for capturing 2 images from 2 different locations. -![Figure 2: Image formation using 2 cameras](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true) - -Figure 2: Image formation using 2 cameras +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true" alt="Figure 2: Image formation using 2 cameras">
+    <p>Figure 2: Image formation using 2 cameras</p>
+</div>
1. Origin of the coordinate system is placed at the pinhole of the first camera which is usually the left camera. 2. Z axis of the coordinate system is defined perpendicular to the image plane. @@ -98,13 +100,19 @@ We'll work through an example, capture some images, and perform some calculation The left and right cameras in OAK-D Lite are oriented similarly to the geometry of the simplified solution detailed above. The baseline distance between the left and right cameras is 7.5cm. Left and right images of a scene captured using this device are shown below. The figure also shows these images stacked horizontally with a red line drawn at a constant height (i.e. at a constant v value ). We'll refer to the horizontal x-axis as u and the vertical y-axis as v. Raw Left Image -![Raw Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true" alt="Raw Left Image">
+</div>
Raw Right Image -![Raw Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true" alt="Raw Right Image">
+</div>
-![Raw Stacked Left and Right Images ](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true) Raw Stacked Left and Right Images +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true" alt="Raw Stacked Left and Right Images">
+</div>
Let's focus on a single point - the top left corner of the laptop. As per equation 3 above, \\(v\_left = v\_right\\) for the same point in the left and right images. However, notice that the red line, which is at a constant v value, touches the top-left corner of the laptop in the left image but misses this point by a few pixels in the right image. There are two main reasons for this discrepancy: @@ -115,27 +123,39 @@ Let's focus on a single point - the top left corner of the laptop. As per equati We can perform image rectification/post-processing to correct for differences in intrinsic parameters and orientations of the left and right cameras. This process involves performing 3x3 matrix transformations. In the OAK-D Lite API, a stereo node performs these calculations and outputs the rectified left and right images. Details and source code can be viewed [here](https://github.com/luxonis/depthai-experiments/blob/master/gen2-stereo-on-host/main.py). In this specific implementation, correction for intrinsic parameters is performed using intrinsic camera matrices, and correction for orientation is performed using rotation matrices(part of calibration parameters) for the left and right cameras. The rectified left image is transformed as if the left camera had the same intrinsic parameters as the right one. Therefore, in all our following calculations, we'll use the intrinsic parameters for the right camera i.e. focal length of 452.9 and principal point at (298.85, 245.52). In the rectified and stacked images below, notice that the red line at constant v touches the top-left corner of the laptop in both the left and right images. Rectified Left Image -![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true" alt="Rectified Left Image">
+</div>
Rectified Right Image -![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true" alt="Rectified Right Image">
+</div>
Rectified and Stacked Left and Right Images -![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true" alt="Rectified and Stacked Left and Right Images">
+</div>
Let's also overlap the rectified left and right images to see the difference. We can see that the v values for different points remain mostly constant in the left and right images. However, the u values change, and this difference in the u values helps us find the depth information for different points in the scene, as shown in Equation 6 above. This difference in 'u' values \\(u\_left - u\_right\\) is called disparity, and we can notice that the disparity for points near the camera is greater compared to points further away. Depth z and disparity \\(u\_left - u\_right\\) are inversely proportional, as shown in equation 6. Rectified and Overlapped Left and Right Images -![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true" alt="Rectified and Overlapped Left and Right Images">
+</div>
### Annotated Left and Right Rectified Images Let's find the 3D coordinates for some points in the scene. A few points are selected and manually annotated with their (u,v) values, as shown in the figures below. Instead of manual annotations, we can also use template-based matching, feature detection algorithms like SIFT, etc for finding corresponding points in left and right images. Annotated Left Image -![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true" alt="Annotated Left Image">
+</div>
Annotated Right Image -![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true" alt="Annotated Right Image">
+</div>
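The next section tabulates twelve such points; for a single correspondence the computation looks like this. A minimal sketch, assuming the simplified rectified geometry and equations 4-6 above, the right camera's intrinsics quoted earlier (focal length 452.9 pixels, principal point (298.85, 245.52)), and the 7.5 cm baseline; the pixel coordinates used here are hypothetical:

```python
import numpy as np

f = 452.9                 # focal length in pixels (right camera, after rectification)
cx, cy = 298.85, 245.52   # principal point in pixels
b = 0.075                 # baseline in metres

def point_from_stereo(u_left, v_left, u_right):
    """Recover (x, y, z) of a scene point from a rectified left/right match."""
    disparity = u_left - u_right   # pixels; larger for nearby points
    z = f * b / disparity          # depth from disparity
    x = (u_left - cx) * z / f      # positive to the right of the left pinhole
    y = (v_left - cy) * z / f      # positive below the pinhole (image-style convention)
    return np.array([x, y, z])

# Hypothetical match: a point at u=350 in the left image and u=300 in the right image.
print(point_from_stereo(u_left=350.0, v_left=200.0, u_right=300.0))
# disparity = 50 px  ->  z = 452.9 * 0.075 / 50 = roughly 0.68 m
```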
### 3D Coordinate Calculations Twelve points are selected in the scene, and their (u,v) values in the left and right images are tabulated below. Using equations 4, 5, and 6, (x,y,z) coordinates for these points are also calculated and tabulated below. X and Y coordinates concerning the left camera, and the origin is at the left camera's pinhole (or optical center of the lens). Therefore, 3D points left and above the pinhole have negative X and Y values, respectively. diff --git a/chapters/en/unit8/nerf.mdx b/chapters/en/unit8/nerf.mdx index fdc553682..7b7693e21 100644 --- a/chapters/en/unit8/nerf.mdx +++ b/chapters/en/unit8/nerf.mdx @@ -30,9 +30,10 @@ As neural networks can serve as universal function approximators, we can approxi A simple NeRF pipeline can be summarized with the following picture: -![nerf_pipeline](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png) - -Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png" alt="nerf_pipeline">
+    <p>Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf)</p>
+</div>
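The rendering step that turns the network's outputs into a pixel can also be written in a few lines. A minimal sketch of the compositing used in volume rendering (simplified to a single ray with uniform spacing, and random values standing in for the network's densities and colours):

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite per-sample densities and colours along one ray.

    densities: (N,) non-negative sigma values predicted by the network
    colors:    (N, 3) RGB values predicted by the network
    deltas:    (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-densities * deltas)                             # opacity of each segment
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]  # light surviving to each sample
    weights = transmittance * alphas                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                          # final RGB for this ray

# Toy example: 64 uniform samples along one ray.
n = 64
rgb = composite_ray(np.random.rand(n), np.random.rand(n, 3), np.full(n, 1.0 / n))
print(rgb)
```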
**(a)** Sample points and viewing directions along camera rays and pass them through the network. diff --git a/chapters/en/unit8/terminologies/camera-models.mdx b/chapters/en/unit8/terminologies/camera-models.mdx index 818917633..440ea7414 100644 --- a/chapters/en/unit8/terminologies/camera-models.mdx +++ b/chapters/en/unit8/terminologies/camera-models.mdx @@ -2,9 +2,9 @@ ## Pinhole Cameras
- Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg -

Text underneath the image

+ Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg
+ The simplest kind of camera - perhaps one that you have made yourself - consists of a lightproof box, with a small hole made in one side and a screen or a photographic film on the other. Light rays passing through the hole generate an inverted image on the rear wall of the box. This simple model for a camera is commonly used in 3D graphics applications. ### Camera axes conventions diff --git a/chapters/en/unit8/terminologies/linear-algebra.mdx b/chapters/en/unit8/terminologies/linear-algebra.mdx index 9ba844ffd..6f59970ed 100644 --- a/chapters/en/unit8/terminologies/linear-algebra.mdx +++ b/chapters/en/unit8/terminologies/linear-algebra.mdx @@ -99,7 +99,9 @@ plot_cube(ax, translated_cube, label="Translated", color="red") The output should look something like this: -![output_translation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png" alt="output_translation">
+</div>
### Scaling @@ -129,7 +131,9 @@ plot_cube(ax, scaled_cube, label="Scaled", color="green") The output should look something like this: -![output_scaling](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png" alt="output_scaling">
+</div>
### Rotations @@ -168,7 +172,9 @@ plot_cube(ax, rotated_cube, label="Rotated", color="orange") The output should look something like this: -![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png" alt="output_rotation">
+</div>
- Rotation around the Y-axis @@ -207,4 +213,6 @@ plot_cube(ax, final_result, label="Combined", color="violet") The output should look something like the following. -![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png) +
+<div align="center">
+    <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png" alt="output_combined">
+</div>
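The three operations above can also be folded into a single matrix. A short sketch using homogeneous 4x4 transforms (this is an extra illustration and does not reuse the cube variables from the snippets above; the vertices below are stand-ins):

```python
import numpy as np

def translation(tx, ty, tz):
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def scaling(sx, sy, sz):
    return np.diag([sx, sy, sz, 1.0])

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

# Compose once (applied right to left: rotate, then scale, then translate)...
M = translation(1.0, 0.0, 0.0) @ scaling(2.0, 2.0, 2.0) @ rotation_z(np.pi / 4)

# ...then apply the single matrix to every vertex.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]], dtype=float)
homogeneous = np.hstack([vertices, np.ones((len(vertices), 1))])   # append w = 1
transformed = (M @ homogeneous.T).T[:, :3]
print(transformed)
```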
\ No newline at end of file From e39514882ef9f090a9ed455eb278dec731ff3eaa Mon Sep 17 00:00:00 2001 From: psetinek Date: Wed, 1 May 2024 00:09:25 +0200 Subject: [PATCH 5/8] test hyperlink --- chapters/en/unit8/nerf.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/nerf.mdx b/chapters/en/unit8/nerf.mdx index 7b7693e21..86a770bf1 100644 --- a/chapters/en/unit8/nerf.mdx +++ b/chapters/en/unit8/nerf.mdx @@ -32,7 +32,7 @@ A simple NeRF pipeline can be summarized with the following picture:
     <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png" alt="nerf_pipeline">
-    <p>Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf)</p>
+    <p>Image from: <a href="https://www.matthewtancik.com/nerf">Mildenhall et al. (2020)</a></p>
 </div>

**(a)** Sample points and viewing directions along camera rays and pass them through the network. From 6a99e689cd54ac8cf9f1b3bf95f884aa882d4d99 Mon Sep 17 00:00:00 2001 From: psetinek Date: Wed, 1 May 2024 00:14:28 +0200 Subject: [PATCH 6/8] fix hyperlinks --- chapters/en/unit8/3d-vision/nvs.mdx | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index 293ef232e..1e3dc18a1 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -25,12 +25,9 @@ PixelNeRF is a method that directly generates the parameters of a NeRF from one In other words, it conditions the NeRF on the input images. Unlike the original NeRF, which trains a MLP which takes spatial points to a density and color, PixelNeRF uses spatial features generated from the input images. -![PixelNeRF diagram](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png) -image from https://alexyu.net/pixelnerf -
     <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png" alt="PixelNeRF diagram">
-    <p>image from https://alexyu.net/pixelnerf</p>
+    <p>Image from: <a href="https://alexyu.net/pixelnerf">PixelNeRF</a></p>
 </div>

The method first passes the input images through a convolutional neural network (ResNet34), bilinearly upsampling features from multiple layers to the same resolution as the input images. From d761c38497126bc4d5816571acd85d5270177ebf Mon Sep 17 00:00:00 2001 From: psetinek Date: Wed, 1 May 2024 00:17:34 +0200 Subject: [PATCH 7/8] small fixes --- chapters/en/unit8/3d-vision/nvs.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index 1e3dc18a1..823f167b0 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -53,7 +53,7 @@ A model was trained separately on each class of object (e.g. planes, benches, ca
     <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif" alt="Rotating gif animation of rendered novel views">
-    <p>image from https://alexyu.net/pixelnerf</p>
+    <p>Image from: <a href="https://alexyu.net/pixelnerf">PixelNeRF</a></p>
 </div>

The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf) From 5cf951a165fade4e3eb4f8e0fdef21a49a9bae2d Mon Sep 17 00:00:00 2001 From: psetinek Date: Wed, 1 May 2024 00:18:45 +0200 Subject: [PATCH 8/8] small fixes final --- chapters/en/unit8/3d-vision/nvs.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/chapters/en/unit8/3d-vision/nvs.mdx b/chapters/en/unit8/3d-vision/nvs.mdx index 823f167b0..7b06e890c 100644 --- a/chapters/en/unit8/3d-vision/nvs.mdx +++ b/chapters/en/unit8/3d-vision/nvs.mdx @@ -82,7 +82,7 @@ However, here these CLIP image embeddings are concatenated with the relative vie
     <img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png" alt="Zero123">
-    <p>image from https://zero123.cs.columbia.edu</p>
+    <p>Image from: https://zero123.cs.columbia.edu</p>
 </div>

The rest of the architecture is the same as Stable Diffusion.