An accessible infrastructure for artificial intelligence using a Docker-based JupyterLab in Galaxy

This article has been reviewed by the following groups


Abstract

Artificial intelligence (AI) programs that train on large amounts of data require powerful compute infrastructure. The JupyterLab notebook provides an excellent framework for developing AI programs, but it needs to be hosted on such an infrastructure to enable AI programs to train on large datasets. An open-source, Docker-based, and GPU-enabled JupyterLab notebook infrastructure has been developed that runs on the public compute infrastructure of Galaxy Europe for rapid prototyping and developing end-to-end AI projects. Using such a notebook, long-running AI model training programs can be executed remotely. Trained models, represented in the standard Open Neural Network Exchange (ONNX) format, and other resulting datasets are created in Galaxy. Other features include GPU support for faster training, Git integration for version control, the option of creating and executing pipelines of notebooks, and the availability of multiple dashboards for monitoring compute resources. These features make the JupyterLab notebook highly suitable for creating and managing AI projects. A recent scientific publication that predicts infected regions in COVID-19 CT scan images is reproduced using multiple features of this notebook. In addition, ColabFold, a faster implementation of AlphaFold2, can also be accessed in this notebook to predict the 3D structure of protein sequences. The JupyterLab notebook is accessible in two ways: first as an interactive Galaxy tool, and second by running the underlying Docker container. In both cases, long-running training can be executed on Galaxy's compute infrastructure. The scripts to create the Docker container are available under the MIT license at https://github.com/anuprulez/ml-jupyter-notebook .

Contact

kumara@informatik.uni-freiburg.de

anup.rulez@gmail.com

Article activity feed

  1. Abstract

    Reviewer 2: Milot Mirdita

    Kumar et al. present a Docker-based integration of Jupyter Notebooks in the Galaxy workflow system that can utilize GPUs. This notebook is also available in the Galaxy Europe instance.

    I was able to create a Galaxy Europe account, find the newly introduced Galaxy tool, and submit a job. However, it remained stuck with the message "This job is waiting to run" and the job info "Stopped" for multiple hours. I was able to download the Docker image and run it on a local server with multiple NVIDIA GPUs. This resulted in a running JupyterLab; however, running the GPU-based examples resulted in driver mismatch errors/warnings (pynvml.nvml.NVMLError_LibRmVersionMismatch: RM has detected an NVML/RM version mismatch; kernel version 470.141.3 does not match DSO version 515.65.1 -- cannot find working devices in this configuration). Thus, the examples ran on CPU only. I did not try to resolve this issue and only repeated some examples.
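    For readers attempting the same local run, a minimal sketch of the commands involved, assuming the NVIDIA Container Toolkit is installed on the host (the image name and tag below are illustrative placeholders, not the authors' published ones):

    ```shell
    # Pull the image and start JupyterLab with GPU passthrough
    # (image name/tag are placeholders).
    docker pull anuprulez/ml-jupyter-notebook:latest
    docker run --rm --gpus all -p 8888:8888 anuprulez/ml-jupyter-notebook:latest

    # Sanity check from inside the container: the kernel driver and the
    # NVML userspace library must report matching versions, otherwise
    # frameworks fall back to CPU as described above.
    nvidia-smi
    ```

    The version-mismatch error quoted above typically means the host kernel driver is older than the NVML library shipped in the container; updating the host driver, or using a container built against the host's CUDA version, usually resolves it.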

    The authors show two use-cases for the GPU Jupyter Docker and provide a step-by-step tutorial for usage on Galaxy Europe. Shipping machine learning applications that utilize GPUs as Jupyter Notebooks has become popular recently and supporting these through well-known and freely accessible Galaxy servers, such as Galaxy Europe, would be of clear benefit to users. Additionally, it would be very valuable for method developers like me to easily deploy GPU-based methods to Galaxy servers.

    Major:

    • As mentioned before, I had issues getting a running Jupyter Lab on the Galaxy Europe server. Is this due to a limited number of GPUs or was this due to an error?
    • Our ColabFold Multiple Sequence Alignment server currently processes about 10-20k MSAs per day. We do not know how many of these are running on Google Colab or on users' local machines. However, a substantial number of predictions are running inside Google Colab. The authors claim that Google Colab's and Kaggle's resources are scarce. However, generally, users (with either free or pro accounts) are given an instance nearly immediately on Colab. I recognize that it is extremely difficult to compete with these commercial platform providers. However, providing a long-term, freely available, and securely funded platform with ML accelerators would be extremely beneficial for the whole community. I would like to see a discussion on what GPU resources are currently available to users of Galaxy Europe (and the whole Galaxy Project) and what plans exist to expand these in the future.
    • The size of the docker container (compressed ~10GB, uncompressed ~22GB) seems difficult to sustain. Both keeping up an up-to-date Docker image and ensuring the availability of older images for reproducibility looks difficult to me, especially with such fast moving dependencies such as machine learning frameworks. How do the authors plan to deal with this issue?
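    One way to hedge against tag drift and image deletion, sketched here with an illustrative image name, is to record and pull images by content digest rather than by mutable tag:

    ```shell
    # After pulling, record the immutable content digest of the image
    # (image name is a placeholder):
    docker inspect --format '{{index .RepoDigests 0}}' anuprulez/ml-jupyter-notebook:latest

    # Reproduce later by pulling that exact build, even if the tag has
    # since been rebuilt or removed (digest value is a placeholder):
    docker pull anuprulez/ml-jupyter-notebook@sha256:<digest>
    ```

    Publishing the digest alongside each release, or mirroring images to an archival registry, would complement whatever versioning strategy the authors choose.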

    Minor:

    • Please highlight the tutorial (https://training.galaxyproject.org/training-material/topics/statistics/tutorials/gpu_jupyter_lab/tutorial.html) on GitHub and inside the container readme (home_page.ipynb). It is very easy to overlook. I also nearly overlooked the example notebook repository (https://github.com/anuprulez/gpu_jupyterlab_ct_image_segmentation). I found it confusing that I could not find the two shown example use-cases inside the Docker container. I only later figured out that I had to clone the example repository into the running container.
    • The manuscript highlights various workflow methods (Elyra, Kubeflow, Airflow); however, it needs to clarify how the Galaxy workflow integration works. I saw that it is possible to pass the output of another Galaxy tool as input to this tool. I would appreciate a tutorial on how to make the GPU Jupyter Docker part of a Galaxy workflow with multiple tools running. I think the above-mentioned tutorials can be expanded to show how the output can be given to the next tool.
    • Docker Hub has introduced many business-model changes, such as deleting container images that are rarely used, which poses a challenge for reproducibility. I know that Dr Grüning is involved in the BioContainers project. I would recommend investigating whether it is possible to combine these efforts to make this GPU container and derived containers available long-term.
    • The Docker container is explicitly running as a root user, while the manuscript highlights the security benefits of Docker. The cited report by Baset et al. highlights the security benefits and the many security challenges that Docker containers pose. I suggest checking what security best practices for Docker containers are possible to implement, while still allowing GPUs to be exposed to users.
    • I recommend revising the manuscript for conciseness, with an additional focus on capitalization of words.
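    On the root-user point, a minimal Dockerfile sketch of running the notebook as an unprivileged user (base image, names, and IDs are illustrative): GPU access is granted by the NVIDIA runtime via `docker run --gpus all` and does not require root inside the container.

    ```dockerfile
    # Illustrative base image; match it to the host's CUDA/driver setup.
    FROM nvidia/cuda:11.6.2-cudnn8-runtime-ubuntu20.04

    # Create an unprivileged user instead of running as root.
    RUN groupadd --gid 1000 jupyter \
        && useradd --uid 1000 --gid 1000 --create-home jupyter

    USER jupyter
    WORKDIR /home/jupyter
    # ... install JupyterLab and launch it as this user ...
    ```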
  2. Abstract

    This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad028), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Philippe Boileau

    This manuscript introduces the new Docker-based JupyterLab framework in Galaxy, describing its core components and demonstrating its use in the reproduction of two analyses. The proposed framework is also thoroughly compared to competitors, like Google's Colab and Amazon's SageMaker. This tool is bound to have an impact on the life sciences: it democratizes computational analyses and facilitates reproducibility. I thank the authors for their important work. However, I think that this technical note should be reviewed for grammatical errors and faulty punctuation. I've identified some such issues in the comments below but wasn't able to address all of them. Included in the comments are other remarks which, if addressed, could strengthen some key takeaways.

    • The first sentence of the abstract states that AI programs require "powerful compute infrastructure" when applied to large datasets. I think readers would like to know how you qualify an infrastructure as "powerful". A brief definition could be included in the second sentence instead of repeating ". . . hosted on a powerful infrastructure . . . ".
    • Is it "JupyterLab" or "jupyterlab notebook"? The Project Jupyter site seems to use the former. Based on the documentation, JupyterLab is a web-based user interface that can open Jupyter notebooks (.ipynb files).
    • The statement "Artificial intelligence (AI) approaches such as machine learning (ML) and deep learning (DL) . . . " implies that ML and DL are distinct aspects of AI. This distinction is insinuated throughout the rest of the text. Isn't DL a subset of ML? I suggest replacing "ML and DL algorithms" with "ML algorithms" and specifying "DL algorithms" only as needed.
    • I believe there's a missing comma between "ecosystems" and "enabling" in the first sentence of the Docker container section.
    • Consider reformatting "A container runs . . . of the running software." to "A container runs an isolated environment with minimal interactions between it and the host OS. Running software in a container is more secure."
    • Related to the suggestion above: can you explain why this increased security is necessary? An example might help emphasize the importance of a secure container.
    • I think "Docker container inherits . . . " should be "The Docker container inherits . . . ". The same goes for "Docker container is decoupled . . . ".
    • Consider reformatting "Moreover, it can easily be extended by installing suitable packages only by adding their appropriate package names in its dockerfile." to "Moreover, the Docker container is easily extended: additional software packages can be installed by adding their names to the dockerfile."
    • Consider replacing "some of the popular ones are" with "including".
    • I believe there's an unneeded comma between ". . . platform for both" and "rapid prototyping. . . ".
    • I believe a word is missing in the last sentence of the Features of jupyterlab and notebook infrastructure section: ". . . an H5 file."
    • "google" and "amazon" should be capitalized.
    • Consider removing "and non-ideal" from the Related infrastructure section.
    • I believe the comma in ". . . but they come at a price, . . . " should be replaced by a colon.
    • I believe there's a missing comma between ". . . free of charge" and "similar to colab . . . ".
    • Why is sharing a session's resources across multiple notebooks more useful than operating each notebook in a separate session? Isn't the latter preferable when a notebook causes a session to crash?
    • "deep learning" in the Implementation section should be replaced by "DL" for consistency.
    • I think that readers would find a link to your tool on Galaxy Europe useful: https://usegalaxy.eu/root?tool_id=interactive_tool_ml_jupyter_notebook. The same is true for your tutorial: I think readers would find a URL in the text more easily than in the references. However, the tool failed to execute on usegalaxy.eu with the following error message: "This tool is restricted to authorized users". I was unable to follow the tutorial. Was this a one-off issue with the Galaxy servers?