How to make output consistent over video? #6

Open
JonathonLuiten opened this issue Oct 30, 2024 · 8 comments

@JonathonLuiten

JonathonLuiten commented Oct 30, 2024

The website shows video results where the scale and shift are consistent across the video.

If I just run the method per frame, it is very obvious that the results are not consistent.

Is there code to make it consistent?

@EasternJournalist
Collaborator

EasternJournalist commented Oct 31, 2024

Thank you for your interest in our project. We are currently in the process of cleaning up the code and will release it once it's ready. Below is a brief description of our implementation:

  • We maintain a list of point maps registered in world space.
  • For each frame in the video sequence:
    • Estimate the camera-space point map of the current frame using MoGe.
    • Select a few previous frames as references and compute dense image matching using PDCNet between the current frame and the reference frames. (Alternative robust image matching methods can also be used if you have more options)
    • [KEY] Solve for the point-set rigid transformation (1-DoF scale, 3-DoF rotation, and 3-DoF translation) from the matching results, using RANSAC to remove outliers. A code snippet for this step is given in a later comment in this issue (#6).
    • Apply the calculated rigid transformation to the current frame's point map and append it to the registered point maps list.
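
The steps above can be sketched as the loop below. This is a minimal illustration, not the released implementation: `estimate_points`, `match_pixels`, and `solve_similarity` are hypothetical callables standing in for MoGe inference, dense matching (PDCNet in the description above), and the RANSAC similarity solver, and their signatures are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# Hypothetical interfaces (assumptions for this sketch, not MoGe's API):
#   estimate_points(frame) -> (H, W, 3) camera-space point map
#   match_pixels(cur_idx, ref_idx) -> two (N, 2) integer arrays of matched
#       (row, col) pixels in the current and the reference frame
#   solve_similarity(p, q) -> (s, R, t) such that q ~= s * p @ R.T + t
Similarity = Tuple[float, np.ndarray, np.ndarray]


def register_video(
    frames: Sequence,
    estimate_points: Callable[[object], np.ndarray],
    match_pixels: Callable[[int, int], Tuple[np.ndarray, np.ndarray]],
    solve_similarity: Callable[[np.ndarray, np.ndarray], Similarity],
    num_refs: int = 2,
) -> List[np.ndarray]:
    """Register per-frame point maps into the first frame's coordinate system."""
    registered: List[np.ndarray] = []  # world-space point maps, one per frame
    for i, frame in enumerate(frames):
        points = estimate_points(frame)  # (H, W, 3), camera space
        if i == 0:
            registered.append(points)  # the first frame defines world space
            continue
        # Gather 3D-3D correspondences against a few previous frames.
        p_list, q_list = [], []
        for j in range(max(0, i - num_refs), i):
            cur_px, ref_px = match_pixels(i, j)
            p_list.append(points[cur_px[:, 0], cur_px[:, 1]])
            q_list.append(registered[j][ref_px[:, 0], ref_px[:, 1]])
        p, q = np.concatenate(p_list), np.concatenate(q_list)
        # Solve for scale, rotation, and translation, then move the whole
        # point map of the current frame into world space.
        s, R, t = solve_similarity(p, q)
        registered.append(s * points @ R.T + t)
    return registered
```

Because each frame is registered against already-registered frames, errors can accumulate over long sequences, which is consistent with the note below that further optimization would be needed for better consistency.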

Please note that achieving video consistency is NOT one of the primary objectives of MoGe. This application is only meant to demonstrate the potential for video reconstruction through a simplified implementation that considers only rigid registration. To enhance consistency, additional optimization techniques would be necessary.

See also a related issue (multi-view reconstruction application).

@JonathonLuiten
Author

Thanks for your answer!

This makes lots of sense!

One follow up question: when you say 'rigid transformation' I assume you mean rotation + translation. Does that mean that we assume that the 'scale' is automatically globally consistent that comes out of the MoGe network? Is there a reason it should be consistent?

@EasternJournalist
Collaborator

EasternJournalist commented Nov 1, 2024

The rigid transformation here includes scale, rotation and translation. The raw output scale of MoGe is unconstrained and not consistent across video frames, since it has been trained to be scale-invariant for single images.

Our implementation for RANSAC rigid (similarity) registration is quite simple.

  • $p_i$: the current frame camera-space point;
  • $q_i$: matched reference frame world-space point.
  • $w_i$: the weight of match $i$, inversely proportional to the point's depth.
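
As one possible way to realize this weighting, the sketch below sets $w_i = 1 / z_i$, where $z_i$ is taken to be the z-coordinate of the camera-space point $p_i$. Which point's depth is used, the proportionality constant, and the `min_depth` guard are all assumptions of this example rather than details confirmed in the thread.

```python
import numpy as np


def inverse_depth_weights(p: np.ndarray, min_depth: float = 1e-6) -> np.ndarray:
    """Weights inversely proportional to the depth (z) of camera-space points.

    p: (N, 3) camera-space points. `min_depth` is a hypothetical guard
    against division by zero for degenerate points.
    """
    depth = np.clip(p[:, 2], min_depth, None)
    return 1.0 / depth


points = np.array([[0.0, 0.0, 1.0], [0.5, 0.2, 2.0], [1.0, -1.0, 4.0]])
print(inverse_depth_weights(points))
```

With weights of this form, nearby points count more than distant ones, and a residual multiplied by $w_i$ behaves like an error relative to depth rather than an absolute distance.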

The following code snippet solves for the transformation $(s, \mathbf R, \mathbf t)$ given two sets of 3D points $\{p_i\}_{i=1}^N,\{q_i\}_{i=1}^N$ and weights $\{w_i\}_{i=1}^N$.

$$ \min_{s,\mathbf R,\mathbf t}\sum_{i=1}^N w_i\Vert s\mathbf R\mathbf p_i+\mathbf t-\mathbf q_i\Vert_2^2 $$
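
For reference, this weighted objective has a well-known closed-form solution (a weighted form of Umeyama's method). The symbols $\bar{\mathbf p}$, $\tilde{\mathbf p}_i$, $\mathbf H$, $\boldsymbol\Sigma$, and $\mathbf D$ are introduced here for illustration only, and the weights are assumed to be normalized so that $\sum_i w_i = 1$:

$$ \bar{\mathbf p}=\sum_{i=1}^N w_i\mathbf p_i,\quad \bar{\mathbf q}=\sum_{i=1}^N w_i\mathbf q_i,\quad \tilde{\mathbf p}_i=\mathbf p_i-\bar{\mathbf p},\quad \tilde{\mathbf q}_i=\mathbf q_i-\bar{\mathbf q} $$

$$ \mathbf H=\sum_{i=1}^N w_i\tilde{\mathbf p}_i\tilde{\mathbf q}_i^\top=\mathbf U\boldsymbol\Sigma\mathbf V^\top,\quad \mathbf D=\operatorname{diag}\left(1,1,\det(\mathbf V\mathbf U^\top)\right) $$

$$ \mathbf R=\mathbf V\mathbf D\mathbf U^\top,\quad s=\frac{\operatorname{tr}(\mathbf D\boldsymbol\Sigma)}{\sum_{i=1}^N w_i\Vert\tilde{\mathbf p}_i\Vert_2^2},\quad \mathbf t=\bar{\mathbf q}-s\mathbf R\bar{\mathbf p} $$

The matrix $\mathbf D$ only matters when the unconstrained optimum would be a reflection; it flips the axis associated with the smallest singular value so that $\mathbf R$ is a proper rotation.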

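import numpy as np
from typing import Optional, Tuple


def rigid_registration(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    eps: float = 1e-12
) -> Tuple[float, np.ndarray, np.ndarray]:
    """Solve min over (s, R, t) of sum_i w_i * ||s * R @ p_i + t - q_i||^2.

    p, q: (N, 3) matched points; w: (N,) non-negative weights.
    """
    if w is None:
        w = np.ones(p.shape[0])
    # Normalize the weights, then take the weighted centroids of both sets.
    w = w / (np.sum(w) + eps)
    centroid_p = np.sum(w[:, None] * p, axis=0)
    centroid_q = np.sum(w[:, None] * q, axis=0)

    p_centered = p - centroid_p
    q_centered = q - centroid_q

    # Weighted cross-covariance; its SVD gives the optimal rotation.
    cov = (w[:, None] * p_centered).T @ q_centered
    U, S, Vh = np.linalg.svd(cov)
    R = Vh.T @ U.T
    if np.linalg.det(R) < 0:
        # Reflection case: flip the axis of the smallest singular value and
        # use the matching sign when summing the singular values for the scale.
        Vh[2, :] *= -1
        S[2] *= -1
        R = Vh.T @ U.T
    scale = np.sum(S) / np.trace((w[:, None] * p_centered).T @ p_centered)
    t = centroid_q - scale * (centroid_p @ R.T)
    return scale, R, t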


def rigid_registration_ransac(
    p: np.ndarray,
    q: np.ndarray,
    w: Optional[np.ndarray] = None,
    max_iters: int = 20,
    hypothetical_size: int = 10,
    inlier_thresh: float = 0.02
) -> Tuple[Tuple[float, np.ndarray, np.ndarray], np.ndarray]:
    """Return ((s, R, t), inlier_mask) for matched (N, 3) point sets p and q."""
    n = p.shape[0]
    if w is None:
        w = np.ones(n)

    best_score, best_inliers = 0., np.zeros(n, dtype=bool)
    best_solution = (1.0, np.eye(3), np.zeros(3))

    for _ in range(max_iters):
        # Fit a hypothesis on a small random subset of the matches.
        maybe_inliers = np.random.choice(n, size=min(hypothetical_size, n), replace=False)
        try:
            s, R, t = rigid_registration(p[maybe_inliers], q[maybe_inliers], w[maybe_inliers])
        except np.linalg.LinAlgError:
            continue
        # Weighted residual of every match under this hypothesis.
        transformed_p = s * p @ R.T + t
        errors = w * np.linalg.norm(transformed_p - q, axis=1)
        inliers = errors < inlier_thresh

        # Truncated-error score: each match contributes at most inlier_thresh.
        score = inlier_thresh * n - np.clip(errors, None, inlier_thresh).sum()
        if score > best_score and inliers.sum() >= 3:
            # Refit on all inliers of the best hypothesis so far.
            try:
                refined = rigid_registration(p[inliers], q[inliers], w[inliers])
            except np.linalg.LinAlgError:
                continue
            best_score, best_inliers, best_solution = score, inliers, refined

    return best_solution, best_inliers
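
To get a feel for the `max_iters` and `hypothetical_size` defaults, the standard RANSAC estimate of how many iterations are needed to draw at least one all-inlier sample can be computed as below. This is textbook RANSAC analysis, not something taken from the MoGe code; `inlier_ratio` and `confidence` are illustrative parameters.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Iterations needed so that, with probability `confidence`, at least one
    random sample of `sample_size` matches contains only inliers."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1
    if p_good_sample <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    # log1p keeps the denominator accurate when p_good_sample is tiny.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_good_sample))


for ratio in (0.9, 0.8, 0.6):
    print(ratio, ransac_iterations(ratio, sample_size=10))
```

Under this model, 20 iterations with 10-point samples is sufficient only when the large majority of matches are inliers. Since a similarity transform is determined by as few as 3 non-collinear correspondences, a smaller `hypothetical_size` or a larger `max_iters` would be natural things to try with noisier matches.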

@guangkaixu

@EasternJournalist Hi, thanks for your great contribution! You mentioned that the scale, rotation, and translation can be solved from the matching results. Is there a recommended library or GitHub repo that implements the RANSAC algorithm? Also, as far as I know, traditional methods seldom optimize the "depth scale" of monocular depth. How can I treat it as an optimizable variable in an existing code base? Thanks in advance for your reply!

@EasternJournalist
Collaborator

EasternJournalist commented Nov 7, 2024

@guangkaixu Hi. The previous comment has been updated. The code for RANSAC is now shared in the code snippet.

EasternJournalist pinned this issue Nov 7, 2024

@guangkaixu

Thanks! This will be helpful, and I'll give it a try.

By the way, once I have computed depth, camera intrinsics, and poses, is there an appropriate method to perform RGB-D fusion? I tried tsdf-fusion, but its huge GPU memory requirement and the holes in the fused mesh make it less than satisfactory.

@EasternJournalist
Collaborator

EasternJournalist commented Nov 19, 2024

That is quite tricky. TSDF fusion is apparently not applicable to large-scale SLAM, especially with dynamic scenes. I am also working on finding a convenient alternative.

@kexul

kexul commented Dec 11, 2024

Hi @EasternJournalist, thanks for your great work! Will the code for processing video be released anytime soon?
