Interactive docker image #709

indy-3rdman · 2020-10-05T17:49:44Z

This PR contains a build script, Dockerfile(s), README.md and supporting files to create a docker image that can run .NET for Apache Spark in an interactive Jupyter notebook.

An initial description for the interactive image, along with the folder structure can be found here: https://github.com/indy-3rdman/spark/tree/interactive-docker-image/docker/images/interactive

MichaelSimons

I mainly looked at the dotnet-interactive/Dockerfile. I left a few suggestions/questions that may help simplify the Dockerfile.

MichaelSimons · 2020-10-14T15:37:13Z

docker/images/interactive/dotnet-interactive/Dockerfile

@@ -0,0 +1,46 @@
+FROM jupyter/base-notebook:ubuntu-18.04
+
+ARG NB_USER=jovyan


In what scenario would this ARG be specified when building? I don't see how this would work if a different user were specified, because I don't see the user being defined/added anywhere. The base image defines the "jovyan" user, so why would you want to define something different? The same applies to NB_UID.

MichaelSimons · 2020-10-14T15:38:58Z

docker/images/interactive/dotnet-interactive/Dockerfile

+ARG NB_USER=jovyan
+ARG NB_UID=1000
+ARG DOTNET_CORE_VERSION=3.1
+ARG DEBIAN_FRONTEND=noninteractive


Is this necessary? The base image is already defining ENV DEBIAN_FRONTEND=noninteractive

MichaelSimons · 2020-10-14T15:40:06Z

docker/images/interactive/dotnet-interactive/Dockerfile

+ARG DEBIAN_FRONTEND=noninteractive
+
+ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION
+ENV USER ${NB_USER}


Related to a previous comment: are these USER-related ENVs necessary, since they are already defined in the base image?

MichaelSimons · 2020-10-14T15:42:44Z

docker/images/interactive/dotnet-interactive/Dockerfile

+    DOTNET_RUNNING_IN_CONTAINER=true \
+    DOTNET_USE_POLLING_FILE_WATCHER=true \
+    NUGET_XMLDOC_MODE=skip \
+    DOTNET_TRY_CLI_TELEMETRY_OPTOUT=true


Since this is being proposed as part of the .NET project, telemetry should remain enabled.

MichaelSimons · 2020-10-14T15:49:48Z

docker/images/interactive/dotnet-interactive/Dockerfile

+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends \
+        dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \


What is requiring all of these dependencies? Ones that are already provided by the base image don't seem necessary to include.

MichaelSimons · 2020-10-14T16:01:50Z

docker/images/interactive/dotnet-interactive/Dockerfile

+    && apt-get install -y apt-transport-https \
+    && apt-get update \
+    && apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
+    && apt-get autoremove -y --purge \


Does this remove anything?

MichaelSimons · 2020-10-14T16:02:15Z

docker/images/interactive/dotnet-interactive/Dockerfile

+    && apt-get update \
+    && apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
+    && apt-get autoremove -y --purge \
+    && apt-get clean \


This shouldn't be necessary because of the next line.

MichaelSimons · 2020-10-14T16:07:40Z

docker/images/interactive/apache-spark/Dockerfile

+    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
+    && chmod 755 /usr/local/bin/start-spark-debug.sh \
+    && chown -R ${NB_UID} ${HOME} \
+    && cd ${HOME}/nb


This doesn't seem necessary.

MichaelSimons · 2020-10-14T16:10:52Z

docker/images/interactive/dotnet-interactive/Dockerfile

+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends \
+        dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \


Per the Dockerfile best practices, it helps the readability to break this apart one package per line and alphabetize them.
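
As a small illustration of that layout, the shell sketch below takes the package list from the RUN line above and prints it sorted, one package per line, with line continuations. The function name `format_packages` is hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: print packages one per line, sorted and de-duplicated,
# indented and followed by a trailing backslash so the output can be pasted
# into a Dockerfile RUN instruction.
format_packages() {
    printf '%s\n' "$@" | LC_ALL=C sort -u | sed 's/$/ \\/; s/^/        /'
}

format_packages dialog apt-utils wget ca-certificates openjdk-8-jdk bash \
    software-properties-common supervisor unzip socat net-tools vim
```

The last emitted line also ends in a backslash, which is what is wanted when the `apt-get install` is followed by further `&&` steps.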

indy-3rdman · 2020-10-21T12:35:08Z

@MichaelSimons, thank you very much for reviewing the Dockerfile and your valuable comments. I've just updated the PR to reflect the changes required for dotnet.spark version 1.0.0. This should also include an updated version of the Dockerfile, addressing your comments.

MichaelSimons

@indy-3rdman, Thanks for addressing my comments. I took another look and had a few more suggestions and questions.

MichaelSimons · 2020-10-21T13:25:31Z

docker/images/interactive/dotnet-interactive/Dockerfile

@@ -21,24 +14,30 @@ USER root

 RUN apt-get update \
    && apt-get install -y --no-install-recommends \
-        dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
-        libc6 libgcc1 libgssapi-krb5-2 libicu60 libssl1.1 libstdc++6 zlib1g \
+        apt-utils \


What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.

This should be cleaned up now. Java obviously is required by spark.

MichaelSimons · 2020-10-21T13:26:12Z

docker/images/interactive/dotnet-interactive/Dockerfile

+        libgssapi-krb5-2 \
+        libicu60 \
+        libssl1.1 \
+        libstdc++6 zlib1g \


Multiple packages are listed together on one line; they should be split apart so that zlib1g is not overlooked.

Should be in a separate line as well now.

MichaelSimons · 2020-10-21T13:29:31Z

docker/images/interactive/dotnet-interactive/Dockerfile

+        libstdc++6 zlib1g \
+        openjdk-8-jdk \
+        software-properties-common \
+        unzip \
    && wget -q --show-progress --progress=bar:force:noscroll https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \


Consider: I've typically seen Dockerfiles avoid using --show-progress, as it does have a perf impact.

That raises an interesting point about the purpose of the Dockerfile(s). As far as I am aware, the focus at the moment is to enable users to build the image(s) themselves, rather than to automate the image build process. That's why I thought it would be useful to show the download progress. For small downloads that doesn't really matter, so I have removed it. However, I have added the following line to the dotnet-spark/Dockerfile

&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \

as the spark download can take a while. Does that make sense?
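
One portability note on that echo line: whether `echo` expands `\n` depends on the shell. dash, the default `/bin/sh` on Ubuntu, does, while bash's builtin does not by default. A sketch using `printf`, which behaves the same in both, is shown below. The function name `announce_download` is hypothetical, and the version values are illustrative ones taken from elsewhere in this review.

```shell
#!/bin/sh
# Illustrative values; the Dockerfile in this PR supplies these via ARG/ENV.
SPARK_VERSION="3.0.1"
HADOOP_VERSION="2.7"

# Hypothetical helper: print a blank line, then the download notice.
announce_download() {
    printf '\nDownloading %s ...\n' "$1"
}

announce_download "spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz"
```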

MichaelSimons · 2020-10-21T13:31:07Z

docker/images/interactive/dotnet-spark-base/Dockerfile

+
+RUN mkdir -p /dotnet/HelloSpark \
+    && mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
+    && wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \


This should at some point perform checksum validation. @JeremyLikness, are checksums published for the Spark artifacts?
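
A sketch of what such validation could look like in the RUN step, assuming a SHA-512 digest is available for the artifact (whether one is published is exactly the open question above). The function name `verify_sha512` and the demo file are hypothetical; `sha512sum` is the GNU coreutils tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file in $1 hashes to the
# hex SHA-512 digest given in $2.
verify_sha512() {
    # sha512sum --check reads "<digest>  <path>" lines; "-" means stdin.
    echo "$2  $1" | sha512sum --check --status -
}

# Demo with a local stand-in file instead of the real download.
demo_file="$(mktemp)"
printf 'stand-in for the worker tarball\n' > "${demo_file}"
known_digest="$(sha512sum "${demo_file}" | cut -d ' ' -f 1)"

if verify_sha512 "${demo_file}" "${known_digest}"; then
    echo "checksum OK"
fi
```

In a Dockerfile the expected digest would come from a trusted source (for example an ARG), and a mismatch would fail the `&&` chain and therefore the build.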

MichaelSimons · 2020-10-21T13:38:16Z

docker/images/interactive/dotnet-spark-base/Dockerfile

+RUN mkdir -p /dotnet/HelloSpark \
+    && mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
+    && wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
+    && tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \


Have you considered extracting the tarball to the /dotnet folder instead of extracting it to the working dir and then immediately mv it?

Should be changed in the latest commit.
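
For reference, extracting straight into the target directory can be sketched with tar's `-C` (`--directory`) option. The archive and directory names below are stand-ins created in a temporary directory so the snippet runs on its own; they are not the real worker tarball.

```shell
#!/bin/sh
work="$(mktemp -d)"

# Build a small stand-in archive containing one top-level folder.
mkdir -p "${work}/src/Microsoft.Spark.Worker-1.0.0"
echo "placeholder" > "${work}/src/Microsoft.Spark.Worker-1.0.0/Microsoft.Spark.Worker"
tar -czf "${work}/worker.tar.gz" -C "${work}/src" Microsoft.Spark.Worker-1.0.0

# Extract directly into the destination; no intermediate mv is needed.
mkdir -p "${work}/dotnet"
tar -xzf "${work}/worker.tar.gz" -C "${work}/dotnet"

ls "${work}/dotnet"
```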

MichaelSimons · 2020-10-21T13:39:39Z

docker/images/interactive/dotnet-spark-base/Dockerfile

+    && wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
+    && tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
+    && mv Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION} /dotnet/ \
+    && cp /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.exe \


I'm curious why this is necessary?

Sorry, this is just a leftover from another file. Removed.

MichaelSimons · 2020-10-21T13:42:54Z

docker/images/interactive/dotnet-spark-base/Dockerfile

+    && chmod 755 /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker \
+    && rm Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz
+
+COPY HelloSpark /dotnet/HelloSpark


I see the HelloSpark project is used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode. Is it necessary to have the project in the resulting image? It feels a little strange to have a sample project in a "base" image.

Moved from dotnet-spark-base to dotnet-spark Dockerfile. Additionally this is now removed, after the jar file has been copied.

MichaelSimons · 2020-10-21T13:50:35Z

docker/images/interactive/dotnet-spark/Dockerfile


-USER root
+ENV DAEMON_RUN=true


Have you considered using the multi-line ENV format? I find that it can help the readability of the Dockerfiles, in that it helps the reader easily scan the Dockerfile. It makes it more obvious these are all ENV instructions.

ENV DAEMON_RUN=true \
    DOTNETBACKEND_PORT=5567 \
    HADOOP_VERSION=2.7 \
    JUPYTER_ENABLE_LAB=true \
    SPARK_VERSION=$SPARK_VERSION \
    SPARK_HOME=/spark \
    PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"

Updated the Dockerfiles.

MichaelSimons · 2020-10-21T13:56:37Z

docker/images/interactive/dotnet-spark/Dockerfile

-ARG DOTNET_SPARK_VERSION=0.12.1
-ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION
-ENV DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}
+ARG SPARK_VERSION=3.0.1


Having version numbers like this hard coded gives me pause. Is this done so that the Dockerfile as it is checked in is buildable without having to specify any args? The problem that introduces is a maintenance burden of keeping it up-to-date.

This is related to my earlier point about the purpose of the Dockerfile(s). The intention was to have a build-able Dockerfile even if the build script is not used. I agree with your observation about maintenance. Maybe @rapoth has a view on that.

MichaelSimons · 2020-10-21T14:04:46Z

docker/images/interactive/build.sh

@@ -8,12 +8,17 @@ set -o nounset   # abort on unbound variable
 set -o pipefail  # don't hide errors within pipes

 readonly image_repository='3rdman'
-readonly supported_apache_spark_versions=("2.3.3" "2.3.4" "2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6")
-readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")
+readonly supported_apache_spark_versions=(


This may be a question for Spark team. Thoughts on how to keep this version list up-to-date and other versions included in this script up-to-date? It feels like there should be long term plans for getting this updated "automatically" as part of the release process. Without this they will become stale and/or be a maintenance burden.
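
For context, a list like this is typically used to reject unsupported versions before building. The bash sketch below shows one way such a check might look. The function name `is_supported_version` is hypothetical, and the array holds only the old values visible in the removed line above, not the updated list.

```shell
#!/usr/bin/env bash
readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")

# Hypothetical helper: return 0 if $1 matches any of the remaining arguments.
is_supported_version() {
    local wanted="$1"; shift
    local candidate
    for candidate in "$@"; do
        [[ "${candidate}" == "${wanted}" ]] && return 0
    done
    return 1
}

if is_supported_version "0.12.1" "${supported_dotnet_spark_versions[@]}"; then
    echo "0.12.1 is supported"
fi
```

Every new release means editing the array by hand, which is the maintenance burden raised in the comment.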

MichaelSimons · 2020-10-21T16:30:18Z

docker/images/interactive/dotnet-interactive/Dockerfile

-ENV PATH="${PATH}:${HOME}/.dotnet/tools"
-
-ENV DOTNET_RUNNING_IN_CONTAINER=true \
+ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \


Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. cross dependencies)

MichaelSimons · 2020-10-21T16:34:31Z

docker/images/interactive/dotnet-spark/Dockerfile

+    && rm -rf /dotnet/HelloSpark \
+    && cd / \
+    && echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
+    && wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \


You can extract to the spark directory with a single instruction which would eliminate the need for mv

I assume you mean to use tar with --directory. But wouldn't that require that the directory already exists? In that case I'd have to add a mkdir first.

You're correct, I missed what was happening here. Please ignore my comment.
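
To make the trade-off concrete: `tar --directory` does need the target to exist, so the alternative to `mv` is a `mkdir -p` followed by an extract that drops the archive's top-level folder via `--strip-components=1` (a GNU tar option). The sketch below uses a stand-in archive in a temporary directory; all paths are illustrative.

```shell
#!/bin/sh
work="$(mktemp -d)"

# Stand-in for the Spark tarball: one top-level folder with a bin/ directory.
mkdir -p "${work}/src/spark-3.0.1-bin-hadoop2.7/bin"
echo "placeholder" > "${work}/src/spark-3.0.1-bin-hadoop2.7/bin/spark-submit"
tar -czf "${work}/spark.tgz" -C "${work}/src" spark-3.0.1-bin-hadoop2.7

# Create the target first, then extract while stripping the top-level folder.
mkdir -p "${work}/spark"
tar -xzf "${work}/spark.tgz" -C "${work}/spark" --strip-components=1

ls "${work}/spark/bin"
```

Either way it is two commands, so this is no shorter than extract-then-mv.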

MichaelSimons · 2020-10-21T16:39:22Z

docker/images/interactive/dotnet-spark/Dockerfile

+RUN cd /dotnet/HelloSpark \
+    && dotnet build \
+    && cp /dotnet/HelloSpark/bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
+    && rm -rf /dotnet/HelloSpark \


The unfortunate consequence of this pattern is that HelloSpark remains in the image as a result of obtaining it via COPY: the files stay in the COPY layer even though a later RUN deletes them. This is not desirable. Is there a way this can be generated during the Docker build, or can it be a published tarball so that it can get copied and deleted within a single Dockerfile instruction?

Thanks again @MichaelSimons for your great feedback! Just creating a dummy project during the build process now.

indy added 4 commits October 5, 2020 12:41

initial docker images

d2d621a

Merge branch 'master' of https://github.com/dotnet/spark into docker-…

13b4283

…init

initial interactive notebook docker image files

f805a97

Merge branch 'master' of https://github.com/dotnet/spark into interac…

53c4c54

…tive-docker-image

indy-3rdman mentioned this pull request Oct 5, 2020

[FEATURE REQUEST]: Docker images for runtime, interactive and dev #710

Open

4 tasks

indy added 3 commits October 6, 2020 08:53

Dockerfile update to fix Dialog issue

3e2de92

removed Microsoft.dotnet-interactive version specification

7f85c97

change default notebook directory

873fa34

MichaelSimons reviewed Oct 14, 2020

indy added 2 commits October 21, 2020 14:16

Merge branch 'master' of https://github.com/dotnet/spark into interac…

d00509b

…tive-docker-image

updated for dotnet-spark version 1.0.0

afd854e

MichaelSimons reviewed Oct 21, 2020

Dockerfile(s) cleanup

33dc9f4

MichaelSimons reviewed Oct 21, 2020

indy added 3 commits October 21, 2020 20:00

Removed copy of HelloSpark project

dbef269

Merge branch 'master' of https://github.com/dotnet/spark into interac…

7fb707f

…tive-docker-image

Dockerfile cleanup

3e434b6

Base automatically changed from master to main March 18, 2021 16:48
