
End-To-End TensoRF connection #108

Open · wants to merge 27 commits into master

Conversation

SimonDaKappa

This PR covers the implementation of TensoRF as the nerf-worker container in Docker. The nerf-worker is implemented similarly to colmap's sfm-worker, consuming nerf-in to place files in data/inputs and producing nerf-out to send files from data/outputs.

Features:
TensoRF is now the nerf-worker container in docker
Communication between all workers and the web-server via RabbitMQ
web-server now publishes finished sfm jobs to the nerf-worker
nerf-worker now consumes sfm jobs, runs the TensoRF pipeline, and produces the generated model/video
web-server now consumes finished nerf jobs, saves them to MongoDB, and responds to web-app GET requests for the nerf video (a minimal messaging sketch follows this list)
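To make the message flow concrete, here is a minimal sketch of the nerf-in / nerf-out hand-off using pika. The queue names come from this PR; the host, credentials, and message shape are placeholders (the real values live in the .env file and queue_service.py):

```python
import json
import pika

# Placeholder connection settings; the real host/credentials come from the .env file.
params = pika.ConnectionParameters(
    host="rabbitmq",
    credentials=pika.PlainCredentials("guest", "guest"))

def publish_finished_sfm(job: dict) -> None:
    """web-server side: publish a finished sfm job to nerf-in."""
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="nerf-in")
    channel.basic_publish(exchange="", routing_key="nerf-in", body=json.dumps(job))
    connection.close()

def consume_nerf_in(on_job) -> None:
    """nerf-worker side: hand each sfm job to the TensoRF pipeline, then ack."""
    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="nerf-in")

    def callback(ch, method, properties, body):
        on_job(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="nerf-in", on_message_callback=callback)
    channel.start_consuming()
```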

SimonDaKappa and others added 15 commits February 23, 2024 15:08
…d application performance

Contains: A LOT of debug code, will be removed in next few commits
Not Tracked: web-app frontend changes

Features:
web-server/queue_service.py: digest_finished_XXXX functions now have access to RabbitMQ
web-server/queue_service.py: digest_finished_sfms now creates a job and publishes it to nerf-in
TensoRF: Now pulled into vidtonerf. This allows a Docker container to be created for the nerf-worker
TensoRF/Dockerfile: simple python-slim instance with basic CV2 dependencies
docker-compose.yaml: Created the nerf-worker service for TensoRF and future backends (gaussian); depends on web-server and rabbitmq, and uses port 5200
TensoRF/main.py: Created a basic Flask app with a single GET endpoint for the rendered video (see the endpoint sketch after this list). Consumes nerf-in and publishes to nerf-out. Runs TensoRF based on the config file defined in the Dockerfile CMD [...]
web-server/controller.py: Finished the endpoint to serve the rendered nerf video to the frontend
web-server/scene_service.py: Unified function return types with what controller.py expects
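As an illustration of the single GET endpoint mentioned above, a minimal Flask sketch; the route, folder layout, and file name are hypothetical, and only port 5200 comes from the docker-compose entry:

```python
from flask import Flask, send_from_directory

app = Flask(__name__)

# Hypothetical route and output layout; the real paths live in TensoRF/main.py.
@app.route("/video/<job_id>", methods=["GET"])
def get_rendered_video(job_id: str):
    # The training/render step is assumed to write the video under data/nerf_data/<job_id>/.
    return send_from_directory(f"data/nerf_data/{job_id}", "render.mp4")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5200)
```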
…ug code and general cleanup

RabbitMQ .env credentials and connection retry mechanism

Modified:
TensoRF/Dockerfile: no longer needs ~1GB of ffmpeg dependencies, by using opencv-headless instead of opencv
TensoRF/main.py: Unified the input and output file structure to data/sfm_data and data/nerf_data, simplified process_nerf_job(), and added a retry mechanism for the RabbitMQ connection (see the retry sketch after this list)
TensoRF/requirements.txt: replaced opencv with the headless version, added shutils for file manipulation and python-dotenv for the environment-variable RabbitMQ connection
colmap/main.py: added the RabbitMQ retry mechanism
web-server/controller.py: debug code removal
web-server/services/queue_service.py: debug code removal
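A sketch of the connection retry mechanism, assuming the credentials are read from the .env file via python-dotenv; the variable names, retry count, and delay are placeholders:

```python
import os
import time

import pika
from dotenv import load_dotenv

load_dotenv()  # pull RabbitMQ credentials from the .env file

def connect_with_retry(retries: int = 10, delay: float = 5.0) -> pika.BlockingConnection:
    """Keep retrying until RabbitMQ is reachable (the broker may come up after the workers)."""
    params = pika.ConnectionParameters(
        host=os.getenv("RABBITMQ_HOST", "rabbitmq"),
        credentials=pika.PlainCredentials(
            os.getenv("RABBITMQ_USER", "guest"),
            os.getenv("RABBITMQ_PASSWORD", "guest")))
    for attempt in range(1, retries + 1):
        try:
            return pika.BlockingConnection(params)
        except pika.exceptions.AMQPConnectionError:
            print(f"RabbitMQ not ready (attempt {attempt}/{retries}), retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError("Could not connect to RabbitMQ after all retries")
```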
nerf worker gpu docker image and cuda support
Contributor

@rougejaw rougejaw left a comment


At the moment, any significant training of the model results in the process being halted: the heartbeat Pika sends every 300 seconds is blocked by the training process, which kills the connection and the process.

@SimonDaKappa
Author

Pika and GPU fixes for nerf-worker are now in. This is handled in the following way:

OLD:
TensoRF/main.py forks main into
| Flask process
| nerf_worker Pika / TensoRF training process

NEW:
TensoRF/main.py forks main into
| Flask process

Using torch.multiprocessing,
TensoRF/main.py spawns from main into
| nerf_worker Pika process. On nerf-in consume:
|   create a thread to run training and rendering, and use a thread-safe callback to ack and publish the video to nerf-out
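A minimal sketch of that consume path: training runs on a worker thread so the Pika I/O loop keeps answering heartbeats, and the ack/publish is marshalled back to the connection thread via add_callback_threadsafe. run_tensorf is a stand-in for the actual training + render call, and the message shape is a placeholder:

```python
import json
import threading

def run_tensorf(job: dict) -> str:
    """Stand-in for the actual TensoRF training + render call."""
    return f"data/nerf_data/{job['id']}/render.mp4"

def on_nerf_in(ch, method, properties, body):
    """nerf-in callback: start training on a thread so the Pika loop keeps heartbeating."""
    job = json.loads(body)

    def train_and_publish():
        video_path = run_tensorf(job)  # long-running training + rendering

        def ack_and_publish():
            # Runs on the connection's own thread, so using the channel here is safe.
            ch.basic_publish(exchange="", routing_key="nerf-out",
                             body=json.dumps({"id": job["id"], "video": video_path}))
            ch.basic_ack(delivery_tag=method.delivery_tag)

        ch.connection.add_callback_threadsafe(ack_and_publish)

    threading.Thread(target=train_and_publish, daemon=True).start()
```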

Additionally, Aidan has integrated logging into sfm-worker, web-server, and nerf-worker, with some minor tweaks I made to work with my TensoRF changes.

Something to note: colmap seems to output images in a different order than we would get by splitting the video into frames, so the TensoRF training path is very unsmooth. Each frame can rotate +/- 180 degrees around the scene center relative to the previous one, and the final reconstructed video follows the same non-smooth path.

Reasoning:

  1. The nerf Flask server can only send GET requests, not receive them, if its process is spawned. Could
     not figure out why.
  2. Since the research code is baaaad, it loads tensors into CUDA device memory by redeclaring the CUDA
     device every single time. CUDA has a known limitation where forked processes (I assume something to do
     with copy-on-write of CUDA objects not working) cannot initialize the CUDA device more than once.
     PyTorch has a wrapper around python.multiprocessing's spawn that starts a fresh process and handles
     the transfer of tensors into that process. This works with the research code and the heartbeat fix for
     Pika (see the spawn sketch below).
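A toy sketch of the spawn pattern from point 2, using torch.multiprocessing: the worker runs in a freshly spawned process with a clean CUDA context, while the parent stays free for the Flask app (the sleep stands in for app.run(...)):

```python
import time

import torch
import torch.multiprocessing as mp

def nerf_worker(rank: int) -> None:
    # Runs in a freshly spawned process, so CUDA initializes cleanly here
    # regardless of what the parent process has done with the GPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.ones(3, device=device)
    print(f"worker {rank} allocated a tensor on {x.device}")

if __name__ == "__main__":
    # join=False leaves the parent process free to run the Flask app.
    ctx = mp.spawn(nerf_worker, nprocs=1, join=False)
    time.sleep(1)  # stand-in for app.run(host="0.0.0.0", port=5200)
    ctx.join()
```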

…bilties. Add GPU support up to CUDA ISA SM_86 (RTX 3000 series). Merge Aidan's logging
@SimonDaKappa
Author

Also added a .sh script, empty_data.sh, to clear out all the worker/web-server input and output data folders while preserving the file structure.
