Python 3.11 Support #156

Open

ksarge opened this issue Jul 26, 2024 · 0 comments

ksarge commented Jul 26, 2024

I created a new Python 3.11 & Spark 3.5 Dockerfile locally and successfully built it using the Makefile. However, after using the image in some processing jobs, I'm noticing that only algo-1 seems to do any work; the other algos drop below 10% utilization. Previously, with the Python 3.9 & Spark 3.5 image, the work was distributed nicely across all algos.

I haven't seen much of a push to upgrade the image beyond Python 3.9. Could that be related to the issues I'm seeing with my processing jobs?
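
For what it's worth, a quick way to check whether executors are actually being scheduled on all hosts is to have each task report its hostname. This is a minimal PySpark sketch, not from the original report; the app name and partition count are arbitrary placeholders.

import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-host-check").getOrCreate()
sc = spark.sparkContext

# Spread enough tasks around that every executor should run at least one,
# then collect the distinct hostnames the tasks ran on. A healthy
# multi-instance job should report algo-1 through algo-N, not just algo-1.
hosts = (
    sc.parallelize(range(1000), 100)
      .map(lambda _: socket.gethostname())
      .distinct()
      .collect()
)
print("Executor hosts:", sorted(hosts))
print("Default parallelism:", sc.defaultParallelism)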

Here is the Dockerfile I used to build the image:

FROM public.ecr.aws/amazonlinux/amazonlinux:2023
ARG REGION
ENV AWS_REGION ${REGION}

RUN rpm -q system-release --qf '%{VERSION}'

RUN dnf clean all \
    && dnf update -y \
    && dnf install -y awscli vim gcc gzip unzip zip tar wget liblapack* libblas* libopenblas* \
    && dnf install -y openssl openssl-devel \
    && dnf install -y kernel kernel-headers kernel-devel \
    && dnf install -y bzip2-devel libffi-devel sqlite-devel xz-devel \
    && dnf install -y ncurses ncurses-compat-libs binutils \
    && dnf install -y nss-softokn-freebl avahi-libs avahi dbus dbus-libs \
    && dnf install -y python-pillow

# Install python 3.11
ARG PYTHON_BASE_VERSION=3.11
ARG PYTHON_WITH_BASE_VERSION=python${PYTHON_BASE_VERSION}
ARG PIP_WITH_BASE_VERSION=pip${PYTHON_BASE_VERSION}
ARG PYTHON_VERSION=${PYTHON_BASE_VERSION}.9
RUN dnf groupinstall -y 'Development Tools' \
    && wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
    && tar xzf Python-${PYTHON_VERSION}.tgz \
    && cd Python-*/ \
    && ./configure --enable-optimizations \
    && make altinstall \
    && echo -e 'alias python3=python3.11\nalias pip3=pip3.11' >> ~/.bashrc \
    && ln -s $(which ${PYTHON_WITH_BASE_VERSION}) /usr/local/bin/python3 \
    && ln -s $(which ${PIP_WITH_BASE_VERSION}) /usr/local/bin/pip3 \
    && cd .. \
    && rm Python-${PYTHON_VERSION}.tgz \
    && rm -rf Python-${PYTHON_VERSION}

# Amazon Linux 2023 uses dnf instead of yum as its package management tool: https://docs.aws.amazon.com/linux/al2023/ug/package-management.html

# Copied from EMR: https://tiny.amazon.com/kycbidpc/codeamazpackAwsCblob51c8src
RUN dnf install -y java-1.8.0-amazon-corretto-devel nginx python3-virtualenv \
    && dnf -y clean all && rm -rf /var/cache/dnf

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# http://blog.stuart.axelbrooke.com/python-3-on-spark-return-of-the-pythonhashseed
ENV PYTHONHASHSEED 0
ENV PYTHONIOENCODING UTF-8
ENV PIP_DISABLE_PIP_VERSION_CHECK 1

# Install EMR Spark/Hadoop
ENV HADOOP_HOME /usr/lib/hadoop
ENV HADOOP_CONF_DIR /usr/lib/hadoop/etc/hadoop
ENV SPARK_HOME /usr/lib/spark

COPY yum/emr-apps.repo /etc/yum.repos.d/emr-apps.repo

# Install hadoop / spark dependencies from EMR's yum repository for Spark optimizations.
# replace placeholder with region in repository URL
RUN sed -i "s/REGION/${AWS_REGION}/g" /etc/yum.repos.d/emr-apps.repo
RUN ls /etc/yum.repos.d/emr-apps.repo
RUN cat /etc/yum.repos.d/emr-apps.repo
RUN adduser -N hadoop

# These packages are a subset of what EMR installs in a cluster with the
# "hadoop", "spark", and "hive" applications.
# They include EMR-optimized libraries and extras.
RUN dnf install -y aws-hm-client \
    aws-java-sdk \
    emr-goodies \
    emr-scripts \
    emr-s3-select \
    emrfs \
    hadoop \
    hadoop-client \
    hadoop-hdfs \
    hadoop-hdfs-datanode \
    hadoop-hdfs-namenode \
    hadoop-httpfs \
    hadoop-kms \
    hadoop-lzo \
    hadoop-mapreduce \
    hadoop-yarn \
    hadoop-yarn-nodemanager \
    hadoop-yarn-proxyserver \
    hadoop-yarn-resourcemanager \
    hadoop-yarn-timelineserver \
    hive \
    hive-hcatalog \
    hive-hcatalog-server \
    hive-jdbc \
    hive-server2 \
    s3-dist-cp \
    spark-core \
    spark-datanucleus \
    spark-history-server \
    spark-python \
    && dnf -y clean all \
    && rm -rf /var/cache/dnf /var/lib/dnf/* /etc/yum.repos.d/emr-*

# Point Spark at the correct Python binary
ENV PYSPARK_PYTHON=/usr/local/bin/python3


# Setup Spark/Yarn/HDFS user as root
ENV PATH="/usr/bin:/opt/program:${PATH}"
ENV YARN_RESOURCEMANAGER_USER="root"
ENV YARN_NODEMANAGER_USER="root"
ENV HDFS_NAMENODE_USER="root"
ENV HDFS_DATANODE_USER="root"
ENV HDFS_SECONDARYNAMENODE_USER="root"

# Set up bootstrapping program and Spark configuration
COPY hadoop-config /opt/hadoop-config
COPY nginx-config /opt/nginx-config
# COPY aws-config /opt/aws-config
COPY Pipfile Pipfile.lock setup.py *.whl /opt/program/
ENV PIPENV_PIPFILE=/opt/program/Pipfile
# Use the --system flag so that packages are installed into the system Python
# rather than into a virtualenv; Docker containers do not need virtualenvs.
# Note: pipenv > 2022.4.8 fails to build smspark, hence the version pin below.
RUN /usr/local/bin/python3.11 -m pip --version
RUN /usr/local/bin/python3.11 -m pip install --upgrade pip setuptools wheel


RUN /usr/local/bin/python3.11 -m pip install pipenv==2022.4.8 \
    && pipenv install --system \
    && /usr/local/bin/python3.11 -m pip install /opt/program/*.whl

# Setup container bootstrapper
COPY container-bootstrap-config /opt/container-bootstrap-config
RUN chmod +x /opt/container-bootstrap-config/bootstrap.sh \
    && /opt/container-bootstrap-config/bootstrap.sh

# With SPARK_NO_DAEMONIZE set, the Spark history server runs in the foreground
# instead of daemonizing; otherwise no process would keep running and the
# container would terminate immediately.
ENV SPARK_NO_DAEMONIZE TRUE

WORKDIR $SPARK_HOME

# Install the sagemaker feature store spark connector
# https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html
# The feature store connector library currently does not support Spark 3.4, so this line is commented out
# RUN /usr/local/bin/python3.11 -m pip install sagemaker-feature-store-pyspark-3.3==1.1.2 --no-binary :all:

ENTRYPOINT ["smspark-submit"]
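
For context on the processing jobs mentioned above, a custom image like this is usually wired into a SageMaker processing job roughly as sketched below. This is an assumption about the setup rather than something taken from the issue; the image URI, role ARN, script name, and instance settings are placeholders, and parameter names can vary between SageMaker Python SDK versions.

from sagemaker.spark.processing import PySparkProcessor

# Placeholder values throughout -- substitute your own account, region,
# repository, role, and script.
processor = PySparkProcessor(
    base_job_name="spark-py311",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-spark-custom:py311-spark3.5",
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    instance_count=5,                  # multiple hosts: algo-1 ... algo-5
    instance_type="ml.m5.4xlarge",
)

# submit_app is the PySpark script submitted to the Spark cluster inside the job.
processor.run(
    submit_app="processing_script.py",
    arguments=["--input", "s3://<bucket>/input/", "--output", "s3://<bucket>/output/"],
)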