# Finalize feluda operator system requirement #29
@aatmanvaidya @duggalsu

---

I had an issue setting up feluda on my machine. Flagging this as something that might trip us up: Python 3.9.18 on Ubuntu 20.04.2 LTS, using the c35079 …

---

This shouldn't have happened, but it is happening because we have not upgraded boto3 (and other packages) to the latest versions in feluda core. So there are dependency mismatches when generating …
---

A note on trying to push the operator to its limit.

---

A neat thing: I eventually got the operator to run on this 1-hour video without running into an out-of-memory error! I figured the cause of the out-of-memory error was this:

```python
# v is an OpenCV VideoCapture; Image is PIL.Image
def extract_frames(self, v):
    # print("extracting frames")
    images = []
    for i in range(self.n_frames):
        success, image = v.read()
        if image is None:
            continue
        else:
            if i % self.sampling_rate == 0:
                images.append(Image.fromarray(image))
    # print("extracted frames")
    return images
```

Every sampled frame of the video is appended to the `images` list, so memory grows with the length of the video. I tried a rudimentary trick to convert this into a generator and return 100 frames at a time:

```python
def extract_frames(self, v):
    # print("extracting frames")
    # process the video in chunks of 100 frames so memory stays bounded
    for _ in range(0, self.n_frames, 100):
        images = []
        # the original inner loop reused i, shadowing the chunk index;
        # renamed to j for clarity
        for j in range(100):
            success, image = v.read()
            if image is None:
                continue
            else:
                if j % self.sampling_rate == 0:
                    images.append(Image.fromarray(image))
        yield images
    # print("extracted frames")
```

and the corresponding change in the `analyze` method:

```python
def analyze(self, video):
    # print("analyzing video")
    for frames in self.extract_frames(video):
        feature_matrix = self.extract_features(frames)
        # proof of concept: each chunk overwrites the previous chunk's
        # keyframe indices and features
        self.keyframe_indices = self.find_keyframes(feature_matrix)
        self.keyframe_features = feature_matrix[:, self.keyframe_indices]
    # print("analysed video")
```

Result: the function took 625.5453 seconds to run.
---

Current status: we know RAM usage depends on the length of the video file. Given my proof of concept above, it looks like we can process long files by chunking the processing of frames, which gives us a decent upper limit on RAM consumption.

For the next milestone our priority is to support processing of video files that are a few minutes long, and right now we don't want to support really long files anyway, so we can assume the file length, and hence the RAM usage, to be bounded.

In today's call Aurora mentioned that, looking at the code we use for inference, trying out a GPU won't be worth it either. So we are parking all GPU-related tests for later as well. This leaves us with compute-optimized EC2s as the category of instances to try.

One more thing we can check: since our cores and memory aren't used at full capacity, Kubernetes should be able to schedule multiple pods on the same node, getting us more value for money from every node we provision.
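A rough back-of-envelope for why chunking bounds RAM; all numbers below are illustrative assumptions, not measurements:

```python
# Illustrative upper bound on raw frames held per chunk (assumed numbers).
frame_bytes = 1920 * 1080 * 3   # one uncompressed 1080p RGB frame, ~6.2 MB
chunk_size = 100                # frames read per generator chunk
sampling_rate = 10              # assumed; the real value may differ
frames_kept = chunk_size // sampling_rate
print(f"~{frames_kept * frame_bytes / 1e6:.0f} MB of sampled frames per chunk")
# This bound is independent of total video length, since each chunk's
# frames are released before the next chunk is read.
```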
---

Documentation on memory and CPU profiling is here: https://github.com/tattle-made/feluda/wiki/Optimization

I've selected some EC2 instances for the first round of tests. I've included the hourly and daily cost because we might scale the nodes up and down, and might not need a large node to stay on throughout.
---

@aatmanvaidya @duggalsu When we deploy the container to Kubernetes, we can specify the command it should run when launched. For the sake of this test I was thinking we can create scripts inside the container. So our benchmark scripts could be something like this:

`script1.sh`

```sh
python test.py
tail -f /dev/null
```

`script2.sh`

```sh
python3 -m memray run -o vid_vec_rep_resnet.bin vid_vec_rep_resnet.py
tail -f /dev/null
```

So let's create appropriate scripts like these. Then we can deploy the container, change the command that's executed on container start, and run these tests in the cluster.
---

Sharing the Kubernetes deployment file for reference. We'll simply change the replica count and command to run different containers.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feluda-operator-vidvec
  labels:
    app.kubernetes.io/name: feluda-operator-vidvec
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: feluda-operator-vidvec
  template:
    metadata:
      labels:
        app.kubernetes.io/name: feluda-operator-vidvec
    spec:
      containers:
        - name: feluda-operator-vidvec
          image: tattletech/feluda-operator-vid-vec:f6bb56c
          imagePullPolicy: Always
          command: ["python"]
          args: ["test.py"]
          # per-pod resource requests and limits
          resources:
            requests:
              cpu: "1000m"
              memory: "4000Mi"
            limits:
              cpu: "4000m"
              memory: "8000Mi"
```
---

We'll rely on GitHub Actions to push new Docker images of our operators to Docker Hub. Reference implementation: https://github.com/tattle-made/feluda/blob/9f425587f93e02005554b496c059144c90e19f74/.github/workflows/prod-deploy.yml#L44-L50
---

Workflow:

We are charged hourly for EC2 instance usage, so once an instance is spun up we have no reason to shut it down immediately. We can run a few tests in one go within that hour and learn all we need before shutting it down. We then repeat these steps for every EC2 instance we care to test on.
---

A trivial bit of feedback on using …

I also notice that the largest thing in the Docker image is the torch library, at around 800 MB. It doesn't seem like there's much we can do to reduce it. What is your opinion?
Further optimized Dockerfiles
---

Scenario planning:

Goal: offer an acceptable response time (let's assume < 5 minutes for now) for every possible scenario.

Scenarios:
---

Question to focus on: Why do we get slow performance on multicore Intel machines (the c7i* family) when we increase the number of pod replicas (containers), especially when cores > 4?
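One angle worth checking, purely a hypothesis on my part and not a confirmed cause: PyTorch defaults to using all visible cores per process, so several replicas on one node can over-subscribe threads and contend with each other. A sketch of capping threads per pod:

```python
import torch

# Hypothesis, not a confirmed cause: cap intra-op threads so that multiple
# replicas on one node don't all try to use every core at once.
torch.set_num_threads(2)           # e.g. roughly match the pod's CPU request
torch.set_num_interop_threads(1)   # must be set before inter-op work starts
print(torch.get_num_threads())     # verify the cap took effect
```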