Potential to track more points in live_demo.py? #71
Indeed, hard to answer without more specifics on your goals. For RoboTAP we tracked 128 points on 480p videos on an RTX 3090 and ran the entire controller at about 10 fps. Tracking fewer points at lower resolution would make that faster. Whether you can fit in network latency and get a speedup depends heavily on the network. Doing the above would require latency of about 5 ms, which is pretty unlikely unless your datacenter is quite close. If you're tracking thousands of points, though, maybe the latency is worth it.
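As a rough sketch of that tradeoff (the numbers below are placeholder assumptions for the arithmetic, not RoboTAP measurements):

```python
# Back-of-envelope budget for offloading tracking to a remote GPU.
# All numbers are placeholder assumptions, not measurements.
frame_budget_ms = 1000 / 10      # a 10 fps controller leaves ~100 ms per frame
local_compute_ms = 95            # assume local tracking nearly fills that budget
remote_compute_ms = 50           # assume a faster remote GPU halves the compute

# To hold the frame rate, the network round trip must fit in the slack left
# after remote compute; to actually beat local compute, it must be smaller
# than the time the remote GPU saves.
max_round_trip_ms = frame_budget_ms - remote_compute_ms
break_even_round_trip_ms = local_compute_ms - remote_compute_ms

print(f"round trip must stay under ~{max_round_trip_ms:.0f} ms to hold 10 fps")
print(f"and under ~{break_even_round_trip_ms:.0f} ms to beat local compute")
```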
Now I have a more specific goal and want to estimate what GPU would be required! I want to track 20 points locally on real-time 1080p (1920x1080) video at 10 fps. According to #69,
My reasoning is as follows. According to https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-vs-Nvidia-Quadro-RTX-4000/4136vsm716215,

Update: I tried to verify my hypothesis with the example you gave in the previous reply. If linear scaling holds, then
With BootsTAPIR we don't really see benefits from tracking at resolutions higher than 512x512 (we haven't trained at such high resolutions, and the larger search space tends to result in false positives). The only reason this might change is if the textures aren't visible at 512x512, but in that case you might want to try custom resolutions. Actually getting improvements will be an empirical effort. Have you verified that it actually works better at 1080p?

Parallelizing the live demo across two GPUs would not be trivial: you would need to shard the feature computation across space, and then shard the PIPs updates across the point axis, and even then the communication overhead might outweigh the computational speedup. We have internally made PartIR work with this codebase and I can give some guidance, but it's a bit painful.

However, an RTX 4090 is substantially more powerful than a Quadro RTX 4000 (a 5-year-old mobile GPU), so you probably don't need it. I would be very surprised if you can't do 20 points on 512x512 at well over 10 fps on that hardware. Scaling is probably somewhat better than linear: the PIPs updates dominated the computation in our 128-point setup. Is there anything preventing you from simply trying it on a single 4090?
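For reference, a minimal, purely illustrative jax.sharding sketch of what sharding a feature computation across space could look like (the function and variable names are made up, this is not tapnet code, and it ignores the harder part of sharding the PIPs updates across points):

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes two visible GPUs; shard a (B, H, W, C) feature grid along H.
devices = jax.devices()[:2]
mesh = Mesh(devices, axis_names=("space",))
feat_sharding = NamedSharding(mesh, P(None, "space", None, None))

@jax.jit
def compute_features(frames):
    # Stand-in for a real feature backbone: any per-pixel, conv-like op.
    return jnp.tanh(frames * 0.5)

frames = jax.device_put(jnp.zeros((1, 512, 512, 3)), feat_sharding)
feats = compute_features(frames)  # runs with each GPU holding half the rows
print(feats.sharding)
```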
Really appreciate your prompt response; let me elaborate on the current situation. I want to use tapnet for real-time tracking of several identical-looking black spheres against a clear (i.e., white) or patterned (i.e., alternating black and white) background with a scientific CMOS camera. I have tested with some recorded videos through your Colab demo and it worked decently. Now I am working towards running

The 1080p resolution is chosen based on the CMOS camera resolution and the field of view needed for my application. I haven't tried different resolutions yet since I am still working on the camera issue and figuring out the right GPU to use or buy.
Regarding the above statement, I have a few follow-ups:
Regarding the GPU, I currently have an RTX 2080, and that will be the first GPU I try after I solve the camera issue. I was trying to estimate the GPU requirement for my case and the difficulty of stacking GPUs; I would then adjust the number of points and the input resolution accordingly, since I have a limited hardware budget and zero experience with GPU parallelization. After reading your previous reply, it seems I should set parallelization aside and treat a single RTX 4090 as my compute upper limit. With your better-than-linear scaling guess and my previous hypothesis as quoted below:
Do you think it is a conservative estimate that a single RTX 4090 can manage tracking 10 points at 10 fps on real-time 1080p video in a local setup?

The last question is about scaling.
I tried to verify my hypothesis with the example you gave above. If linear scaling holds, then
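Purely for illustration, here is one way a naive linear-scaling estimate could be written out, taking the 128-point / 480p / ~10 fps figure from the earlier reply as the baseline and assuming cost scales with both point count and pixel count (this is just the hypothesis spelled out, not a benchmark, and the replies above suggest the real scaling is more favorable):

```python
# Illustrative linear-scaling estimate (a hypothesis, not a benchmark).
# Baseline from the earlier reply: 128 points, 480p, ~10 fps on an RTX 3090.
base_points, base_pixels, base_fps = 128, 854 * 480, 10.0
target_points, target_pixels = 20, 1920 * 1080

# If cost were proportional to (points * pixels), the predicted fps would be:
cost_ratio = (target_points / base_points) * (target_pixels / base_pixels)
predicted_fps = base_fps / cost_ratio
print(f"predicted ~{predicted_fps:.1f} fps under a naive linear model")
```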
Sorry for the slow response, there's a lot here.
Yes. It's a convolutional architecture, so it can run at a different resolution than what it was trained on, but whether it generalizes depends on boundary effects. Empirically generalization to higher resolution seems to be OK, but it doesn't really improve things: the boundary effects seem to cancel out the extra benefits from higher resolution.
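As a tiny illustration of that point (generic JAX, unrelated to tapnet's actual backbone): a convolution kernel has no fixed input resolution, so the same weights run on different-sized inputs and only the output grid changes.

```python
import jax
import jax.numpy as jnp

# A single conv kernel applied at two different input resolutions.
key = jax.random.PRNGKey(0)
kernel = jax.random.normal(key, (3, 3, 3, 8))   # HWIO: 3x3 conv, 3 -> 8 channels

def conv(x):
    return jax.lax.conv_general_dilated(
        x, kernel, window_strides=(1, 1), padding="SAME",
        dimension_numbers=("NHWC", "HWIO", "NHWC"))

print(conv(jnp.zeros((1, 256, 256, 3))).shape)   # (1, 256, 256, 8)
print(conv(jnp.zeros((1, 512, 512, 3))).shape)   # (1, 512, 512, 8)
```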
Again, this is an empirical question. live_demo.py hasn't been re-tested with bootstapir, and was really just the result of an afternoon of me playing with my webcam. The closest experiments we ran to your suggested setup are with the current bootstapir non-causal checkpoint, where we found the best performance at 512x512, which is the highest resolution that bootstapir was trained on, and higher resolutions performed about the same. I expect the same would be true for the causal checkpoint, but I haven't tested. I would encourage you to just record some data, dump it to mp4, and run causal tapir on whatever data you have. You can then see if it works well enough even if it doesn't work in real time.
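For the record-and-dump step, a minimal sketch assuming the camera frames arrive as uint8 RGB numpy arrays (OpenCV is just one common option here):

```python
import cv2
import numpy as np

# Write a list of RGB uint8 frames (H, W, 3) to an mp4 for offline testing.
def dump_to_mp4(frames, path="capture.mp4", fps=10):
    h, w = frames[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(path, fourcc, fps, (w, h))
    for frame in frames:
        # OpenCV expects BGR ordering, so convert from RGB before writing.
        writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
    writer.release()

# Example with dummy frames; replace with real camera captures.
dummy = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(30)]
dump_to_mp4(dummy)
```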
Lower bound is 256x256; I can't imagine it would work well at lower resolution. Whether it works best at 512x512 or higher resolution depends on the application.
It was TAPIR for RoboTAP. However, for 20 points I wouldn't expect the difference to be large. Unfortunately I can't guess what the performance would be. The RTX 4090 has about 9x the peak FLOPS of the 2080, so that's an upper bound; however, a non-trivial amount of time in TAPIR is spent on gathers. I wouldn't be surprised if you see 5x the framerate on a 4090 vs a 2080. What framerate do you get with your desired setup on the 2080? TBH, the right answer may be to just find a cloud provider that lets you rent a 4090 so you can get an actual benchmark.
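A hedged sketch of how that framerate could be measured on the 2080 (predict_frame is a placeholder for whatever per-frame call the demo ends up making; the block_until_ready is needed because JAX dispatches asynchronously):

```python
import time
import jax

def benchmark_fps(predict_frame, frames, warmup=5):
    # Warm-up calls trigger jit compilation so it isn't counted in the timing.
    for frame in frames[:warmup]:
        jax.block_until_ready(predict_frame(frame))

    start = time.perf_counter()
    for frame in frames[warmup:]:
        jax.block_until_ready(predict_frame(frame))
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed

# fps = benchmark_fps(predict_frame, recorded_frames)
# print(f"~{fps:.1f} fps on this GPU")
```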
Hi Carl, thanks again for your detailed responses! They are always clear and insightful. Please see below for my follow-ups.
I have already tested with some recorded videos. Currently there are two issues. The first is reading RGB data from our CMOS camera through Python, which is not accessible with

Considering the installation and compatibility issues for JAX on Windows, would it be possible to provide a live demo in PyTorch, if it doesn't cost too much, as requested in #102? It would be super helpful for non-CS research teams who use Windows as their primary system to explore your work and check the potential for collaboration. I am from one of those teams in applied physics and am currently working on a demo for our setting (as quoted below) to persuade my PI to pursue an official collaboration.
Our project could use not only TAPIR/BootsTAP for real-time tracking, which is only a starting point, but also RoboTAP for subsequent applications. Thus, I would really appreciate it if #102 could be addressed to help accelerate the implementation of your work and potential collaboration!
Checking #49, it seems the issue is that
Hi Carl, I have some updates! I studied the code base and managed to run the live demo on Windows. Now I have some questions about it:
Appreciate any comments and thanks in advance!
Great to hear that it's working on Windows!

Regarding 1: yes, the block_until_ready must be a remnant from an earlier version of the code. It's None at that point, so I guess it's accomplishing nothing. Blocking on _ might make sense, though, since it will force JAX to compile the online_init code.

Regarding 2: setting causal_state to None might save a tiny bit of computation the first time the function is called, but the result is that causal_state will have a different size every time you add another point (up to 8 points). That means online_model_predict will need to be recompiled every time you click a new point, which can take minutes.
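To make the recompilation point concrete, a small generic JAX illustration (not tapnet code): a jitted function is retraced whenever an input shape changes, so a causal state that grows each time a point is clicked triggers a fresh compile, while a pre-allocated fixed-size state compiles once.

```python
import jax
import jax.numpy as jnp

@jax.jit
def update(causal_state, frame_feature):
    # This print runs only when JAX (re)traces the function, i.e. on each
    # new input shape, not on every call.
    print("tracing update for state shape", causal_state.shape)
    return causal_state + frame_feature.mean()

# A state that grows by one slot per clicked point changes shape each time,
# so each call below retraces (and recompiles) the function.
for num_points in (1, 2, 3):
    state = jnp.zeros((num_points, 4))
    update(state, jnp.ones((8, 8)))

# A pre-allocated fixed-size state (e.g. 8 slots) keeps the shape constant,
# so the function is traced once and later calls reuse the compiled version.
fixed_state = jnp.zeros((8, 4))
for _ in range(3):
    fixed_state = update(fixed_state, jnp.ones((8, 8)))
```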
Following up on 1, I would like to clarify that the first use of

Following up on 2, I am confused why (see lines 169 to 218 in a9ef766, and tapnet/tapnet/models/tapir_model.py, lines 1157 to 1188 in a9ef766).

Another three questions are as follows:
I am curious about the potential to run live_demo.py with a better GPU in order to track more points in real time.
Have you done any testing running it in the cloud? If not, what would be the bottleneck? One thing I can think of is the streaming delay between local and cloud, but I am not sure whether it's a big problem.
Maybe switching to or stacking better GPUs locally would be more straightforward?
The high-level question is: have you thought about ways to scale the model to track more points?
Although it might be hard to answer without any experiments, it's always good to have some discussion in advance! Appreciate any comments and feedback!