Merge pull request #286 from psetinek/unit8_fixes
center images and text using html
ATaylorAerospace authored May 1, 2024
2 parents e03f0d4 + 5cf951a commit 313e295
Showing 5 changed files with 69 additions and 33 deletions.
25 changes: 15 additions & 10 deletions chapters/en/unit8/3d-vision/nvs.mdx
@@ -25,9 +25,10 @@ PixelNeRF is a method that directly generates the parameters of a NeRF from one
In other words, it conditions the NeRF on the input images.
Unlike the original NeRF, which trains an MLP that maps spatial points to a density and color, PixelNeRF uses spatial features generated from the input images.

![PixelNeRF diagram](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png)
image from https://alexyu.net/pixelnerf

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_pipeline.png" alt="PixelNeRF diagram" />
<p>Image from: <a href="https://alexyu.net/pixelnerf">PixelNeRF</a></p>
</div>

The method first passes the input images through a convolutional neural network (ResNet34), bilinearly upsampling features from multiple layers to the same resolution as the input images.
As in a standard NeRF, the new view is generated by volume rendering.
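
To make the conditioning above concrete, here is a minimal sketch (not the actual PixelNeRF code) of extracting a per-pixel feature volume with a ResNet34 backbone and bilinearly sampling a feature for one projected 3D point; the image size and sampled coordinate are made up for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet34

image = torch.randn(1, 3, 128, 128)                # one input view (hypothetical size)
backbone = resnet34(weights=None)

# Run the early ResNet stages and keep the intermediate feature maps.
x = backbone.relu(backbone.bn1(backbone.conv1(image)))
feats = [x]
x = backbone.maxpool(x)
for layer in (backbone.layer1, backbone.layer2, backbone.layer3):
    x = layer(x)
    feats.append(x)

# Bilinearly upsample every feature map to the input resolution and stack them
# into a single per-pixel feature volume.
feature_volume = torch.cat(
    [F.interpolate(f, size=image.shape[-2:], mode="bilinear", align_corners=False) for f in feats],
    dim=1,
)

# To condition the NeRF MLP on a 3D point, project it into the image (here a
# made-up location in normalized [-1, 1] coordinates) and sample its feature.
uv = torch.tensor([[[[0.1, -0.2]]]])               # shape (1, 1, 1, 2)
point_feature = F.grid_sample(feature_volume, uv, align_corners=False)  # (1, C, 1, 1)
```
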
@@ -46,11 +47,14 @@ A model was trained separately on each class of object (e.g. planes, benches, ca

### Results (from the PixelNeRF website)

![Input image of a chair](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png)
![Rotating gif animation of rendered novel views](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif)

image from https://alexyu.net/pixelnerf
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_input.png" alt="Input image of a chair" />
</div>

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/PixelNeRF_output.gif" alt="Rotating gif animation of rendered novel views" />
<p>Image from: <a href="https://alexyu.net/pixelnerf">PixelNeRF</a></p>
</div>

The PixelNeRF code can be found on [GitHub](https://github.com/sxyu/pixel-nerf).

@@ -76,9 +80,10 @@ The model actually starts with the weights from [Stable Diffusion Image Variatio
However, here these CLIP image embeddings are concatenated with the relative viewpoint transformation between the input and novel views.
(This viewpoint change is represented in terms of spherical polar coordinates.)

![Zero123](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png)
image from https://zero123.cs.columbia.edu

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Zero123.png" alt="Zero123" />
<p>Image from: <a href="https://zero123.cs.columbia.edu">https://zero123.cs.columbia.edu</a></p>
</div>

The rest of the architecture is the same as Stable Diffusion.
However, the latent representation of the input image is concatenated channel-wise with the noisy latents before being input into the denoising U-Net.
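
As a rough illustration of the two conditioning paths described above (shapes and values are made up, and the actual Zero123 code may embed the viewpoint differently), the idea can be sketched as:

```python
import torch

# Cross-attention conditioning: the CLIP image embedding of the input view plus
# the relative (theta, phi, radius) viewpoint change in spherical polar coordinates.
clip_embedding = torch.randn(1, 768)                     # hypothetical CLIP embedding
viewpoint = torch.tensor([[0.3, 1.2, 0.0]])              # hypothetical relative viewpoint
conditioning = torch.cat([clip_embedding, viewpoint], dim=-1)

# U-Net input: channel-wise concatenation of the noisy latents with the latent
# representation of the input image.
input_latent = torch.randn(1, 4, 32, 32)                 # latents of the input view
noisy_latent = torch.randn(1, 4, 32, 32)                 # latents being denoised
unet_input = torch.cat([noisy_latent, input_latent], dim=1)  # (1, 8, 32, 32)
```
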
50 changes: 35 additions & 15 deletions chapters/en/unit8/3d_measurements_stereo_vision.mdx
@@ -8,9 +8,10 @@ Now, let's say we are given this 2D image and the location of the pixel coordina

We aim to solve the problem of determining the 3D structure of objects. In our problem statement, we can represent an object in 3D as a set of 3D points. Finding the 3D coordinates of each of these points helps us determine the 3D structure of the object.

![Figure 1: Image formation using single camera](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true)

Figure 1: Image formation using single camera
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_single_camera.png?download=true" alt="Figure 1: Image formation using single camera" />
<p>Figure 1: Image formation using single camera</p>
</div>

## Solution
Let's assume we are given the following information:
@@ -42,9 +43,10 @@ Therefore, using 2 images of the same scene point P, known positions and orienta
## Simplified Solution
Since many different camera positions and orientations can be chosen, we can pick a configuration that simplifies the math and reduces the computation needed when running on a computer or an embedded device. One popular and widely used configuration is shown in Figure 2. We use 2 cameras in this configuration, which is equivalent to a single camera capturing 2 images from 2 different locations.

![Figure 2: Image formation using 2 cameras](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true)

Figure 2: Image formation using 2 cameras
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/image_formation_simple_stereo.jpg?download=true" alt="Figure 2: Image formation using 2 cameras" />
<p>Figure 2: Image formation using 2 cameras</p>
</div>

1. The origin of the coordinate system is placed at the pinhole of the first camera, which is usually the left camera.
2. The Z axis of the coordinate system is defined perpendicular to the image plane.
@@ -98,13 +100,19 @@ We'll work through an example, capture some images, and perform some calculation
The left and right cameras in OAK-D Lite are oriented similarly to the geometry of the simplified solution detailed above. The baseline distance between the left and right cameras is 7.5 cm. Left and right images of a scene captured using this device are shown below. The figure also shows these images stacked horizontally with a red line drawn at a constant height (i.e., at a constant v value). We'll refer to the horizontal x-axis as u and the vertical y-axis as v.

Raw Left Image
![Raw Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_left_frame.jpg?download=true" alt="Raw Left Image" />
</div>

Raw Right Image
![Raw Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_right_frame.jpg?download=true" alt="Raw Right Image" />
</div>

![Raw Stacked Left and Right Images ](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true)
Raw Stacked Left and Right Images
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/unrectified_stacked_frames.jpg?download=true" alt="Raw Stacked Left and Right Images" />
</div>

Let's focus on a single point - the top-left corner of the laptop. As per equation 3 above, \\(v\_left = v\_right\\) for the same point in the left and right images. However, notice that the red line, which is at a constant v value, touches the top-left corner of the laptop in the left image but misses this point by a few pixels in the right image. There are two main reasons for this discrepancy:

@@ -115,27 +123,39 @@ Let's focus on a single point - the top left corner of the laptop. As per equati
We can perform image rectification/post-processing to correct for differences in the intrinsic parameters and orientations of the left and right cameras. This process involves performing 3x3 matrix transformations. In the OAK-D Lite API, a stereo node performs these calculations and outputs the rectified left and right images. Details and source code can be viewed [here](https://github.com/luxonis/depthai-experiments/blob/master/gen2-stereo-on-host/main.py). In this specific implementation, correction for intrinsic parameters is performed using the intrinsic camera matrices, and correction for orientation is performed using the rotation matrices (part of the calibration parameters) for the left and right cameras. The rectified left image is transformed as if the left camera had the same intrinsic parameters as the right one. Therefore, in all our following calculations, we'll use the intrinsic parameters of the right camera, i.e., a focal length of 452.9 and a principal point at (298.85, 245.52). In the rectified and stacked images below, notice that the red line at constant v touches the top-left corner of the laptop in both the left and right images.
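
As a hedged sketch of the idea (the calibration values and rotation matrices below are placeholders, and the linked `main.py` should be treated as the reference implementation), rectifying each image with a 3x3 homography built from the intrinsics and a rectifying rotation might look like this:

```python
import cv2
import numpy as np

# Placeholder calibration data -- the real values come from the device's calibration.
K_left = np.array([[451.0, 0.0, 301.0], [0.0, 451.0, 248.0], [0.0, 0.0, 1.0]])
K_right = np.array([[452.9, 0.0, 298.85], [0.0, 452.9, 245.52], [0.0, 0.0, 1.0]])
R_left = np.eye(3)    # rotation aligning the left camera with the rectified frame
R_right = np.eye(3)   # rotation aligning the right camera with the rectified frame

# Warp each image into the rectified frame, re-projecting with the right camera's
# intrinsics so that both rectified images share the same intrinsic parameters.
H_left = K_right @ R_left @ np.linalg.inv(K_left)
H_right = K_right @ R_right @ np.linalg.inv(K_right)

left = cv2.imread("unrectified_left_frame.jpg")   # hypothetical file names
right = cv2.imread("unrectified_right_frame.jpg")
h, w = left.shape[:2]
rectified_left = cv2.warpPerspective(left, H_left, (w, h))
rectified_right = cv2.warpPerspective(right, H_right, (w, h))
```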

Rectified Left Image
![Rectified Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_left_frame.jpg?download=true" alt="Rectified Left Image" />
</div>

Rectified Right Image
![Rectified Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_right_frame.jpg?download=true" alt="Rectified Right Image" />
</div>

Rectified and Stacked Left and Right Images
![Rectified and Stacked Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_stacked_frames.jpg?download=true" alt="Rectified and Stacked Left and Right Images" />
</div>

Let's also overlap the rectified left and right images to see the difference. We can see that the v values for different points remain mostly constant between the left and right images. However, the u values change, and this difference in u values helps us find the depth of different points in the scene, as shown in Equation 6 above. This difference in u values, \\(u\_left - u\_right\\), is called disparity, and we can notice that the disparity for points near the camera is greater than for points farther away. Depth z and disparity \\(u\_left - u\_right\\) are inversely proportional, as shown in equation 6.
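
As a quick numeric illustration (the disparity value here is made up, and we assume the standard relation \\(z = f \cdot b / (u\_left - u\_right)\\) that equation 6 expresses for this configuration):

```python
f = 452.9          # focal length in pixels (right camera, used after rectification)
baseline = 7.5     # baseline between the cameras, in cm
disparity = 30.0   # hypothetical u_left - u_right for some point, in pixels

z = f * baseline / disparity
print(z)           # ~113.2 cm; a larger disparity would give a smaller depth
```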

Rectified and Overlapped Left and Right Images
![Rectified and Overlapped Left and Right Images](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/rectified_overlapping_frames.jpg?download=true" alt="Rectified and Overlapped Left and Right Images" />
</div>

### Annotated Left and Right Rectified Images
Let's find the 3D coordinates of some points in the scene. A few points are selected and manually annotated with their (u,v) values, as shown in the figures below. Instead of manual annotation, we can also use template-based matching, feature detection algorithms like SIFT, etc., to find corresponding points in the left and right images.
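
For example, a sketch of finding correspondences automatically with SIFT in OpenCV might look like the following (file names are placeholders; the row-difference filter uses the fact that rectified correspondences share nearly the same v value):

```python
import cv2

left = cv2.imread("rectified_left_frame.jpg", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rectified_right_frame.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_l, des_l = sift.detectAndCompute(left, None)
kp_r, des_r = sift.detectAndCompute(right, None)

# Match descriptors and keep only matches whose keypoints lie on (almost) the
# same row, since rectification makes corresponding points share the same v.
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = matcher.match(des_l, des_r)
correspondences = [
    (kp_l[m.queryIdx].pt, kp_r[m.trainIdx].pt)
    for m in matches
    if abs(kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1]) < 2.0
]
```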

Annotated Left Image
![Annotated Left Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_left_img.jpg?download=true" alt="Annotated Left Image" />
</div>

Annotated Right Image
![Annotated Right Image](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/3d_stereo_vision_images/annotated_right_img.jpg?download=true" alt="Annotated Right Image" />
</div>

### 3D Coordinate Calculations
Twelve points are selected in the scene, and their (u,v) values in the left and right images are tabulated below. Using equations 4, 5, and 6, the (x,y,z) coordinates of these points are also calculated and tabulated below. The X and Y coordinates are measured with respect to the left camera, with the origin at the left camera's pinhole (or the optical center of the lens). Therefore, 3D points to the left of and above the pinhole have negative X and Y values, respectively.
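
A small helper along these lines (a sketch assuming the usual pinhole back-projection relations that equations 4-6 express for this configuration, with the right camera's intrinsics from above) could be:

```python
def triangulate(u_left, v, u_right, f=452.9, cx=298.85, cy=245.52, baseline=7.5):
    """Return (x, y, z) in the same units as the baseline (here cm)."""
    disparity = u_left - u_right
    z = f * baseline / disparity      # equation 6: depth from disparity
    x = (u_left - cx) * z / f         # equation 4: back-project u
    y = (v - cy) * z / f              # equation 5: back-project v
    return x, y, z

# Hypothetical pixel coordinates for one annotated point:
print(triangulate(u_left=350.0, v=200.0, u_right=320.0))
```
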
7 changes: 4 additions & 3 deletions chapters/en/unit8/nerf.mdx
@@ -30,9 +30,10 @@ As neural networks can serve as universal function approximators, we can approxi

A simple NeRF pipeline can be summarized with the following picture:

![nerf_pipeline](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png)

Image from: [Mildenhall et al. (2020)](https://www.matthewtancik.com/nerf)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/nerf_pipeline.png" alt="nerf_pipeline" />
<p>Image from: <a href="https://www.matthewtancik.com/nerf">Mildenhall et al. (2020)</a></p>
</div>

**(a)** Sample points and viewing directions along camera rays and pass them through the network.
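
A minimal sketch of step (a), assuming made-up near/far bounds and a single pixel's ray:

```python
import torch

origin = torch.tensor([0.0, 0.0, 0.0])          # camera center
direction = torch.tensor([0.0, 0.0, 1.0])       # unit viewing direction for one pixel
near, far, n_samples = 2.0, 6.0, 64             # hypothetical sampling bounds

t = torch.linspace(near, far, n_samples)         # depths along the ray
points = origin + t[:, None] * direction         # (n_samples, 3) sample locations
dirs = direction.expand(n_samples, 3)            # same viewing direction for each sample
# `points` and `dirs` would then be positionally encoded and passed through the MLP.
```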

4 changes: 3 additions & 1 deletion chapters/en/unit8/terminologies/camera-models.mdx
@@ -1,8 +1,10 @@
# Camera models

## Pinhole Cameras
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png" alt="Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg" />
</div>

![Pinhole camera from https://commons.wikimedia.org/wiki/File:Pinhole-camera.svg](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/Pinhole-camera.png)
The simplest kind of camera - perhaps one that you have made yourself - consists of a lightproof box, with a small hole made in one side and a screen or a photographic film on the other. Light rays passing through the hole generate an inverted image on the rear wall of the box. This simple model for a camera is commonly used in 3D graphics applications.
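
As a tiny worked example of the projection this model implies (all numbers are made up), a point at depth Z in front of the hole lands at roughly (-f·X/Z, -f·Y/Z) on the rear wall, with the minus signs capturing the inversion:

```python
f = 0.05                      # distance from the hole to the image plane, in metres
X, Y, Z = 0.2, 0.1, 2.0       # a point in front of the camera, in metres

u = -f * X / Z                # image coordinates on the rear wall
v = -f * Y / Z
print(u, v)                   # -0.005, -0.0025
```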

### Camera axes conventions
16 changes: 12 additions & 4 deletions chapters/en/unit8/terminologies/linear-algebra.mdx
@@ -99,7 +99,9 @@ plot_cube(ax, translated_cube, label="Translated", color="red")

The output should look something like this:

![output_translation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/translation.png" alt="output_translation" />
</div>

### Scaling

@@ -129,7 +131,9 @@ plot_cube(ax, scaled_cube, label="Scaled", color="green")

The output should look something like this:

![output_scaling](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/scaling.png" alt="output_scaling" />
</div>

### Rotations

@@ -168,7 +172,9 @@ plot_cube(ax, rotated_cube, label="Rotated", color="orange")

The output should look something like this:

![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/rotation.png" alt="output_rotation" />
</div>

- Rotation around the Y-axis

@@ -207,4 +213,6 @@ plot_cube(ax, final_result, label="Combined", color="violet")

The output should look something like the following.

![output_rotation](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png)
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/combined.png" alt="output_combined" />
</div>
