Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Docker image file for CLI #168

Merged
merged 15 commits into from
Nov 19, 2024
Merged

Add Docker image file for CLI #168

merged 15 commits into from
Nov 19, 2024

Conversation

mhmohona
Copy link
Member

Contributor checklist


Description

Related issue

Fixes - #150

Copy link

github-actions bot commented Jul 10, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@mhmohona
Copy link
Member Author

So here the next step is pushing docker image to the registry. From which docker id shall I push?

@andrewtavis andrewtavis requested a review from wkyoshida July 10, 2024 13:22
@andrewtavis
Copy link
Member

I'll send this one over to @wkyoshida as he's the expert here :) :)

@wkyoshida
Copy link
Member

So here the next step is pushing docker image to the registry. From which docker id shall I push?

hmm I guess we have some options. We could create a repository in Docker Hub for us or we could use GitHub's registry.

The latter could be cool since I believe it'd prominently show up under Packages in the repo - though this option does have a limit of 500MB for free use. If you build the image from the Dockerfile, @mhmohona , about how large is the resulting image?

The former would likely allow us to save several versions of the Docker image though, since we can go beyond 500MB

Copy link
Member

@wkyoshida wkyoshida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀
Adding some thoughts that came to mind..

ENV PYTHONPATH=/app/src

# Set the entry point to the main CLI script
ENTRYPOINT ["python", "src/scribe_data/cli/main.py"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a version of the Dockerfile where scribe-data gets installed instead, as opposed to running python ...? Running via python still works tbh though, just wondering if we could make the Dockerfile this way.

ENV PYTHONPATH=/app/src

# Set the entry point to the main CLI script
ENTRYPOINT ["python", "src/scribe_data/cli/main.py"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior btw with the saved output files? They just make me wonder we might need a Docker volume to store and retain the output results after the container exits. Are they getting saved inside the container atm?

@andrewtavis
Copy link
Member

What's the plan for this one here? @mhmohona, are we ready to merge? Just checking in as you were saying it's hard for you to test this, so do you need support finishing this up?

@mhmohona
Copy link
Member Author

So I was able to successfully test it in Github codespace. I think I can do it till the end.

@wkyoshida
Copy link
Member

wkyoshida commented Aug 26, 2024

Nice!!
Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?

RUN apt-get update && apt-get install -y \
    build-essential \
    pkg-config \
    libicu-dev \
    && rm -rf /var/lib/apt/lists/*

For context @andrewtavis , building with the Dockerfile locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with the Brewfile that we already have, i.e. brew bundle install --file=Brewfile

@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"

@mhmohona
Copy link
Member Author

Nice!! Are you able to see how large it ended up in GH Codespace? Also, curious if that was still including the below command?

RUN apt-get update && apt-get install -y \
    build-essential \
    pkg-config \
    libicu-dev \
    && rm -rf /var/lib/apt/lists/*

For context @andrewtavis , building with the Dockerfile locally was bloating up. @mhmohona mentioned that the command above takes some time to complete. We discussed the idea of trying to install PyICU stuff with the Brewfile that we already have, i.e. brew bundle install --file=Brewfile

@mhmohona might have to run something like the below actually - now that I look at it again. Didn't explicitly try it myself though

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"

So its the storage status -
image

I got error if i add Brewfile, keeping original file gave me following output -

image

and its the image in docker hub - https://hub.docker.com/repository/docker/mhmohona/scribe-data-cli
I created it as test, will delete once everything is okay.

image
trying to pull, and then will run it locally.

@wkyoshida, what do you think about it?

@mhmohona
Copy link
Member Author

image

and its the CLI in action -

image

I am not sure yet how to make only scribe-data list work instead of docker run mhmohona/scribe-data-cli list

@mhmohona mhmohona requested a review from wkyoshida August 28, 2024 00:54
@wkyoshida
Copy link
Member

Hmm - it seems that the image ends up pretty big..
What is the error with trying to use the Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.

@wkyoshida
Copy link
Member

With the docker cli, you could also try inspecting details of the image and looking into the layer details to find out which layers are the large ones, i.e. responsible for the enlarged image size.

With that information, we can go from there and modify the Dockerfile to address the responsible ones.

@mhmohona
Copy link
Member Author

mhmohona commented Aug 29, 2024

Hmm - it seems that the image ends up pretty big.. What is the error with trying to use the Brewfile? I'm mostly curious if we can get that working, as I suspect using it alternatively could help with image size perhaps.

got following error -
image

Its the output of docker history -

@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker images
REPOSITORY                 TAG       IMAGE ID       CREATED      SIZE
scribe-data-cli            latest    39ac5a96734c   2 days ago   8.59GB
mhmohona/scribe-data-cli   latest    39ac5a96734c   2 days ago   8.59GB

@mhmohona ➜ /workspaces/Scribe-Data (docker) $ docker history  39ac5a96734c
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
39ac5a96734c   2 days ago    ENTRYPOINT ["python" "src/scribe_data/cli/ma…   0B        buildkit.dockerfile.v0
<missing>      2 days ago    ENV PYTHONPATH=/app/src                         0B        buildkit.dockerfile.v0
<missing>      2 days ago    COPY /app /app # buildkit                       440MB     buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/bin /usr/local/bin # buildkit   28.8MB    buildkit.dockerfile.v0
<missing>      2 days ago    COPY /usr/local/lib/python3.9 /usr/local/lib…   7.99GB    buildkit.dockerfile.v0
<missing>      2 days ago    WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   10MB      buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_SHA256=6fb7b781206356f45a…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_GET_PIP_URL=https://github.com/py…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_SETUPTOOLS_VERSION=58.1.0            0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_PIP_VERSION=23.0.1                   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  for src in idle3 p…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;   savedAptMark="$(a…   31.3MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PYTHON_VERSION=3.9.19                       0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV GPG_KEY=E3FF2839C048B25C084DEBE9B26995E3…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   RUN /bin/sh -c set -eux;  apt-get update;  a…   9.21MB    buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV LANG=C.UTF-8                                0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   ENV PATH=/usr/local/bin:/usr/local/sbin:/usr…   0B        buildkit.dockerfile.v0
<missing>      4 weeks ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B        
<missing>      4 weeks ago   /bin/sh -c #(nop) ADD file:3d9897cfe027ecc7c…   74.8MB    ```

@wkyoshida
Copy link
Member

<missing> 2 days ago COPY /usr/local/lib/python3.9 /usr/local/lib… 7.99GB buildkit.dockerfile.v0
hmm.. yeah this seems to be the big culprit (7.99GB). This would be the layer on L25 of the Dockerfile, and could be due to what is getting pip installed on L15. It would be interesting to see how large the layer for the pip install is. Notice though that we can't really see easily the layers from before L21, which is when the second stage of the Dockerfile starts. The deepest layer that we see in this output is of the layer on L22:
<missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0

One thing we could try for now is to actually remove the multi-stage aspect of the Dockerfile and just use a single stage. We'll be able to more easily analyze which layers are heavy in the builder stage.


got following error -

Yeah - what I had first sent you in Matrix is slightly off. Notice that it's using things like GITHUB_ENV AND GITHUB_PATH. Those are from when we need to install PyICU in the GitHub workflow (here).

The below might be closer to what we need, but I haven't tested it myself:

COPY Brewfile /app/
RUN brew bundle install --file=Brewfile && \
   export PATH="/opt/homebrew/opt/icu4c/bin:/opt/homebrew/opt/icu4c/sbin:$PATH" && \
   export PKG_CONFIG_PATH="$PKG_CONFIG_PATH:/opt/homebrew/opt/icu4c/lib/pkgconfig"

NOTE: This will likely need a step before to install brew though

@wkyoshida
Copy link
Member

Just a quick note for clarification -

All the layers further than
<missing> 2 days ago WORKDIR /app 0B buildkit.dockerfile.v0
are most likely those of the python:3.9-slim image itself that the Dockerfile is using as the base image.

@wkyoshida
Copy link
Member

Alrighty -

So in tweaking the Dockerfile and building the image myself, I'm seeing that the big culprit here really appears to be the pip install step in fact (this layer alone ends up at over 8GB in size 😱). More specifically - in particular the tensorflow and torch requirements as they bring in pretty large dependencies. As I am understanding though..

  • tensorflow: we're probably good to just scratch it off the requirements.txt. We are no longer making use of it, is this correct?
  • torch: we are still using or at least did still use it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from the requirements.txt as well soon?
  • bonus: as we're already having to look at the requirements.txt, are there any other dependencies listed that we no longer make use of that we could scratch off too?

CC: @andrewtavis

@mhmohona
Copy link
Member Author

Thank you for looking into it @wkyoshida.

@wkyoshida
Copy link
Member

I did also decide to actually go with removing the multi-stage aspect of the Dockerfile @mhmohona 😁
The second stage seemed that it was already pretty much copying over most of what was being done in the builder stage anyways, so it made sense to me to simplify the Dockerfile in this way

@andrewtavis
Copy link
Member

Points on the above:

  • tensorflow: we're probably good to just scratch it off the requirements.txt. We are no longer making use of it, is this correct?
  • Ya we can remove it :)
  • torch: we are still using or at least did still use it for the translations that were generated recently. However, as the idea anyways is to transition to Wiktionary-based translations, we are perhaps good with removing it from the requirements.txt as well soon?
  • Yes, we'll be removing it as well. Maybe what I can do is do the data update for Scribe-iOS post fixing formatting to include lexeme IDs, and then from there I can remove the old translation process completely except for a reference in the CLI that says that it's WIP, and then I'll remove all of the above from the dependencies? This would take another few weeks, but then from there we'd be ready to do the pip and Dockerized releases 😊
  • bonus: as we're already having to look at the requirements.txt, are there any other dependencies listed that we no longer make use of that we could scratch off too?
  • pyarrow, sentencepiece, tabulate and transformers are currently not used or are being used in translation. We need to make a decision on how the "currently available data table" is being made, as I believe that that's why tabulate was in there. Specifically we're planning on hosting the PNG on Srcribe-Server, but are we generating it with Scribe-Data? How are we making that table PNG, anyway 🤔

@wkyoshida
Copy link
Member

Ya we can remove it :)

🚀 - removed with cb29d7b

.. and then from there I can remove the old translation process ..

Ya that sounds good!

For references, I tested out building the image after removing the soon-to-scrapped dependencies. The pip install layer is currently at ~8GB. After removing..

  • tensorflow: went from ~8GB to ~6GB
  • torch: went from ~6GB to ~1GB
  • pyarrow, sentencepiece, tabulate and transformers: went from ~1GB to ~800MB

So yea - I think once we're able to scrap torch (in particular), we'll be in better shape

How are we making that table PNG, anyway 🤔

I was thinking of leaving it entirely to Scribe-Server actually; so eventually we could even scrap tabulate too. I'd say there's no rush for it though. We can continue to use tabulate for this and keep it around until then. Its footprint is fairly small; so no concern with Docker if it's still around.

@andrewtavis
Copy link
Member

tensorflow has been removed in a recent commit. I'll get rid of the rest after I do the final data update.

@andrewtavis
Copy link
Member

Note that this can be updated and merged once #292 has been finished as then all of the above dependencies will be gone :)

@andrewtavis
Copy link
Member

Alright the old translation process has been removed, so we should be good to work on this again without excessively large images :)

@andrewtavis
Copy link
Member

How are we doing here now? Do we need to include some documentation on how to get the container up and running? :)

@mhmohona
Copy link
Member Author

So now it took 48 minute.

image

and it's docker history output -

(.venv) (base) mona@mhm:~/Desktop/github/Scribe-Data$ docker images
REPOSITORY        TAG       IMAGE ID       CREATED          SIZE
scribe-data-cli   latest    5dd362f0a3ec   10 minutes ago   3.09GB
(.venv) (base) mona@mhm:~/Desktop/github/Scribe-Data$ docker history 5dd362f0a3ec
IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
5dd362f0a3ec   10 minutes ago   ENTRYPOINT ["python" "src/scribe_data/cli/ma…   0B        buildkit.dockerfile.v0
<missing>      10 minutes ago   ENV PYTHONPATH=/app/src                         0B        buildkit.dockerfile.v0
<missing>      10 minutes ago   COPY . /app # buildkit                          267MB     buildkit.dockerfile.v0
<missing>      10 minutes ago   RUN /bin/sh -c pip install --no-cache-dir -r…   2.27GB    buildkit.dockerfile.v0
<missing>      53 minutes ago   COPY requirements.txt /app/ # buildkit          339B      buildkit.dockerfile.v0
<missing>      53 minutes ago   RUN /bin/sh -c apt-get update && apt-get ins…   437MB     buildkit.dockerfile.v0
<missing>      58 minutes ago   WORKDIR /app                                    0B        buildkit.dockerfile.v0
<missing>      6 days ago       CMD ["python3"]                                 0B        buildkit.dockerfile.v0
<missing>      6 days ago       RUN /bin/sh -c set -eux;  for src in idle3 p…   36B       buildkit.dockerfile.v0
<missing>      6 days ago       RUN /bin/sh -c set -eux;   savedAptMark="$(a…   36.4MB    buildkit.dockerfile.v0
<missing>      6 days ago       ENV PYTHON_VERSION=3.13.0                       0B        buildkit.dockerfile.v0
<missing>      6 days ago       ENV GPG_KEY=7169605F62C751356D054A26A821E680…   0B        buildkit.dockerfile.v0
<missing>      6 days ago       RUN /bin/sh -c set -eux;  apt-get update;  a…   9.24MB    buildkit.dockerfile.v0
<missing>      6 days ago       ENV PATH=/usr/local/bin:/usr/local/sbin:/usr…   0B        buildkit.dockerfile.v0
<missing>      2 weeks ago      /bin/sh -c #(nop)  CMD ["bash"]                 0B        
<missing>      2 weeks ago      /bin/sh -c #(nop) ADD file:a9a95cfab16803be0…   74.8MB    

It's so much better than before. What do you think @wkyoshida?

@andrewtavis
Copy link
Member

Are we able to expand the readme and docs a bit with instructions to deploy the Docker setup, @mhmohona?

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that changes to this file have been taken care of in the overall discussion here :) Let's look into any changes that are needed post uploading it to DockerHub later this week 🚀

Thanks for the persistence here, @mhmohona, and also for the review, @wkyoshida!

@andrewtavis andrewtavis merged commit 48de65a into scribe-org:main Nov 19, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants