Interactive docker image #709

Open · wants to merge 13 commits into base: main
19 changes: 8 additions & 11 deletions docker/images/interactive/build.sh
@@ -178,15 +178,6 @@ build_dotnet_spark_base_interactive() {
local image_name="dotnet-spark-base-interactive:${dotnet_spark_version}"

cd dotnet-spark-base
cp --recursive templates/HelloSpark ./HelloSpark

replace_text_in_file HelloSpark/HelloSpark.csproj "<TargetFramework><\/TargetFramework>" "<TargetFramework>netcoreapp${dotnet_core_version}<\/TargetFramework>"
replace_text_in_file HelloSpark/HelloSpark.csproj "PackageReference Include=\"Microsoft.Spark\" Version=\"\"" "PackageReference Include=\"Microsoft.Spark\" Version=\"${dotnet_spark_version}\""

replace_text_in_file HelloSpark/README.txt "netcoreappX.X" "netcoreapp${dotnet_core_version}"
replace_text_in_file HelloSpark/README.txt "spark-X.X.X" "spark-${apache_spark_short_version}.x"
replace_text_in_file HelloSpark/README.txt "microsoft-spark-${apache_spark_short_version}.x-X.X.X.jar" "${dotnet_spark_jar}"

build_image "${image_name}"
cd ~-
}
@@ -202,6 +193,14 @@ build_dotnet_spark_interactive() {

cd dotnet-spark
cp --recursive templates/scripts ./bin
cp --recursive templates/HelloSpark ./HelloSpark

replace_text_in_file HelloSpark/HelloSpark.csproj "<TargetFramework><\/TargetFramework>" "<TargetFramework>netcoreapp${dotnet_core_version}<\/TargetFramework>"
replace_text_in_file HelloSpark/HelloSpark.csproj "PackageReference Include=\"Microsoft.Spark\" Version=\"\"" "PackageReference Include=\"Microsoft.Spark\" Version=\"${dotnet_spark_version}\""

replace_text_in_file HelloSpark/README.txt "netcoreappX.X" "netcoreapp${dotnet_core_version}"
replace_text_in_file HelloSpark/README.txt "spark-X.X.X" "spark-${apache_spark_short_version}.x"
replace_text_in_file HelloSpark/README.txt "microsoft-spark-${apache_spark_short_version}.x-X.X.X.jar" "${dotnet_spark_jar}"

replace_text_in_file bin/start-spark-debug.sh "microsoft-spark-X.X.X.jar" "${dotnet_spark_jar}"
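For context, start-spark-debug.sh itself is not shown in this diff. In dotnet/spark, debug mode is started by submitting the microsoft-spark jar through spark-submit; a hypothetical sketch of what the script wraps (paths, jar name, and port are assumptions, not taken from this PR):

    #!/usr/bin/env bash
    # Launch the .NET backend in debug mode so a notebook session can attach.
    # 5567 matches the DOTNETBACKEND_PORT env var set in the dotnet-spark Dockerfile.
    spark-submit \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      /dotnet/Debug/netcoreapp3.1/microsoft-spark-3-0_2.12-1.0.0.jar \
      debug 5567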

@@ -218,8 +217,6 @@ cleanup()
{
cd dotnet-spark
rm --recursive --force bin
cd ~-
cd dotnet-spark-base
rm --recursive --force HelloSpark
cd ~-
}
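Two helpers in the script above are worth noting: cd ~- returns to the previous working directory ($OLDPWD), and replace_text_in_file performs an in-place substitution. A minimal sketch of such a helper, assuming a sed-based implementation (the actual definition in build.sh may differ):

    replace_text_in_file() {
      local filename="$1" search="$2" replace="$3"
      # In-place substitution; the pre-escaped slashes at the call sites
      # above suggest the patterns are fed straight to sed.
      sed -i "s/${search}/${replace}/g" "${filename}"
    }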
17 changes: 5 additions & 12 deletions docker/images/interactive/dotnet-interactive/Dockerfile
@@ -2,31 +2,24 @@ FROM jupyter/base-notebook:ubuntu-18.04

ARG DOTNET_CORE_VERSION=3.1

ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION
ENV PATH="${PATH}:${HOME}/.dotnet/tools"

ENV DOTNET_RUNNING_IN_CONTAINER=true \
ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \
Member commented:
Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. cross dependencies)

PATH="${PATH}:${HOME}/.dotnet/tools" \
DOTNET_RUNNING_IN_CONTAINER=true \
DOTNET_USE_POLLING_FILE_WATCHER=true \
NUGET_XMLDOC_MODE=skip \
DOTNET_TRY_CLI_TELEMETRY_OPTOUT=true
NUGET_XMLDOC_MODE=skip

USER root

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
apt-utils \
Member commented:
What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.

Author replied:
This should be cleaned up now. Java obviously is required by Spark.

bash \
dialog \
libc6 \
libgcc1 \
libgssapi-krb5-2 \
libicu60 \
libssl1.1 \
libstdc++6 zlib1g \
openjdk-8-jdk \
software-properties-common \
unzip \
&& wget -q --show-progress --progress=bar:force:noscroll https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
&& wget -q https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& add-apt-repository universe \
&& apt-get install -y apt-transport-https \
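Following the exchange above, the dependency list was reduced; a sketch of the trimmed install, assuming only the packages implied by the author's reply remain (Java is needed by Spark itself; the apt cache cleanup is shown as common practice, not necessarily part of the PR):

    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
            openjdk-8-jdk \
            software-properties-common \
            unzip \
        && rm -rf /var/lib/apt/lists/*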
21 changes: 5 additions & 16 deletions docker/images/interactive/dotnet-spark-base/Dockerfile
@@ -2,24 +2,13 @@ ARG DOTNET_CORE_VERSION=3.1
FROM dotnet-interactive:$DOTNET_CORE_VERSION

ARG DOTNET_SPARK_VERSION=1.0.0
ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION
ENV DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}
ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION \
DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}

USER root

RUN mkdir -p /dotnet/HelloSpark \
&& mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
&& wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& mv Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION} /dotnet/ \
&& cp /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.exe \
RUN mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
&& wget -q https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz --directory /dotnet \
&& chmod 755 /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker \
&& rm Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz

COPY HelloSpark /dotnet/HelloSpark

RUN cd /dotnet/HelloSpark \
&& dotnet build \
&& cp /dotnet/HelloSpark/bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION}/


29 changes: 16 additions & 13 deletions docker/images/interactive/dotnet-spark/Dockerfile
@@ -4,28 +4,31 @@ FROM dotnet-spark-base-interactive:$DOTNET_SPARK_VERSION
ARG SPARK_VERSION=3.0.1
Member commented:

Having version numbers like this hard coded gives me pause. Is this done so that the Dockerfile as it is checked in is buildable without having to specify any args? The problem that introduces is a maintenance burden of keeping it up-to-date.

Author replied:

This is related to my earlier point about the purpose of the Dockerfile(s). The intention was to have a buildable Dockerfile even if the build script is not used. I agree with your observation about maintenance. Maybe @rapoth has a view on that.
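To illustrate the trade-off being discussed: the hard-coded defaults keep a plain docker build working, while build.sh can still pin versions explicitly via build args (image tag hypothetical):

    docker build \
      --build-arg SPARK_VERSION=3.0.1 \
      --build-arg DOTNET_SPARK_VERSION=1.0.0 \
      --tag dotnet-spark:1.0.0-3.0.1 \
      .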

ARG DOTNET_SPARK_JAR="microsoft-spark-3-0_2.12-$DOTNET_SPARK_VERSION"

ENV DAEMON_RUN=true
ENV SPARK_VERSION=$SPARK_VERSION
ENV SPARK_HOME=/spark

ENV HADOOP_VERSION=2.7
ENV PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"
ENV DOTNETBACKEND_PORT=5567
ENV JUPYTER_ENABLE_LAB=true
ENV DAEMON_RUN=true \
DOTNETBACKEND_PORT=5567 \
HADOOP_VERSION=2.7 \
JUPYTER_ENABLE_LAB=true \
SPARK_VERSION=$SPARK_VERSION \
SPARK_HOME=/spark \
PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"

USER root

COPY bin/* /usr/local/bin/

COPY *.ipynb ${HOME}/dotnet.spark/examples/

RUN cd / \
&& wget -q --show-progress --progress=bar:force:noscroll https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
COPY HelloSpark /dotnet/HelloSpark

RUN cd /dotnet/HelloSpark \
&& dotnet build \
&& cp /dotnet/HelloSpark/bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
&& rm -rf /dotnet/HelloSpark \
Member commented:

The unfortunate consequence of this pattern is that HelloSpark remains in the image as a result of obtaining it via COPY. This is not desirable. Is there a way this can be generated during the Docker build, or can it be a published tarball, so that it can get copied and deleted within a single Dockerfile instruction?

Author replied:

Thanks again @MichaelSimons for your great feedback! Just creating a dummy project during the build process now.
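A sketch of that dummy-project approach: scaffold, build, harvest the jar, and delete the sources within a single RUN instruction, so no COPY layer retains the project (project name and paths are hypothetical; the Microsoft.Spark NuGet package places the matching microsoft-spark jar in the build output):

    RUN dotnet new console --output /tmp/DummySpark \
        && cd /tmp/DummySpark \
        && dotnet add package Microsoft.Spark --version ${DOTNET_SPARK_VERSION} \
        && dotnet build \
        && cp bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
        && rm -rf /tmp/DummySpark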

&& cd / \
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
&& wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
Member commented:

You can extract to the spark directory with a single instruction which would eliminate the need for mv

Author replied:

I assume you mean to use tar with --directory. But wouldn't that require that the directory exist already? In that case I'd have to add a mkdir first.

Member (@MichaelSimons) replied on Oct 21, 2020:

You're correct, I missed what was happening here. Please ignore my comment.
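For reference, the single-instruction extraction discussed above does require the target directory to exist first, but with GNU tar it can still fold into one step and drop the mv (a sketch, not what the PR ended up doing):

    RUN mkdir -p /spark \
        && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
               --directory /spark --strip-components=1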

&& mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark \
&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& chmod 755 /usr/local/bin/start-spark-debug.sh \
&& cp /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION}/${DOTNET_SPARK_JAR} ${HOME}/ \
&& chown -R ${NB_UID} ${HOME}

USER ${NB_USER}