Azure ML Pipeline Estimator GPU support broken?

  • Question

  • Hi There,

    Today I created a Standard_NC6 VM and deployed a GPU-enabled Docker image (the official MS image: https://github.com/Azure/AzureML-Containers/tree/master/base/gpu).

    I am running TensorFlow installed both via the TensorFlow estimator and via Environment.python.conda_dependencies, so I tested both approaches.

    To enable Docker and GPU support I used the deprecated 'use_docker' and 'use_gpu' booleans as well as the new standard Environment configuration. Both lead to the same result.

    My pipeline step breaks on loading libcuda.so.1 during the Python import of tensorflow (1.2, 1.3).

    I have studied this and understand the following.

    1. The VM has GPU support.
    2. The docker engine has GPU support.
    3. The docker image has GPU support.
    4. Azure ML starts the container without passing on the GPU resource (this is the problem).

    When I start the container manually on the VM without any extra docker run options, I can reproduce the error I get in the pipeline when performing a Python import of tensorflow.

    When I instead run 'docker run --gpus all <image>', I can successfully import tensorflow within the Docker container on the NC6 VM.

    I inspected the docker run command performed by Azure ML within the pipeline step and indeed it does not define the '--gpus' option. This leads me to the conclusion that Azure Machine Learning Pipeline GPU support is broken.

    As far as I can see there is no way to alter the docker run command generated by the Azure ML pipeline in any transparent way.

    Where do I report this issue, as it is part of the Azure ML platform rather than any individual SDK?

    Has anyone else worked with GPUs in the context of Azure ML pipelines?

    Monday, May 4, 2020 8:30 PM

Answers

  • Hi Rohit,

    I gathered some additional confirmation of the bug today, and I also confirmed a fix.

    Under the modern way of defining an estimator with an Environment (which has a DockerSection), I add DockerSection.arguments, like so:

    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.environment import Environment

    # Extra docker run arguments so the container is started with GPU access.
    DOCKER_ARGUMENTS = ["--gpus", "all"]

    TENSORFLOW_ENVIRONMENT = Environment(name="tensorflow-env")
    TENSORFLOW_ENVIRONMENT.docker.enabled = True
    TENSORFLOW_ENVIRONMENT.docker.gpu_support = True  # deprecated but ok
    TENSORFLOW_ENVIRONMENT.docker.arguments = DOCKER_ARGUMENTS
    TENSORFLOW_ENVIRONMENT.docker.base_image = (
        "mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04"
    )

    TENSORFLOW_CONDA_DEP = CondaDependencies()
    TENSORFLOW_CONDA_DEP.add_pip_package("tensorflow-gpu==1.13.2")
    TENSORFLOW_ENVIRONMENT.python.conda_dependencies = TENSORFLOW_CONDA_DEP

    With these settings the Docker containers on the NC6 VM have access to the GPU, and my TensorFlow code can create a TensorFlow GPU device.
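
    For reference, this is roughly the check I run inside the training script to confirm the device is visible (standard TF 1.x calls; the exact output is just what I expect on an NC6):

    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # True once libcuda.so.1 loads and the GPU is exposed to the container.
    print("GPU available:", tf.test.is_gpu_available())
    # On an NC6 I expect a '/device:GPU:0' entry in this list.
    print([d.name for d in device_lib.list_local_devices()])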

    This confirms that there is a bug in the GPU support of Azure ML pipelines on AmlCompute remote compute with Docker containers. (As users we are forced to use Docker containers on such a compute_target, but that is a different story.)

    It also confirms my suggested fix (docker run --gpus all). I feel that adding these specific docker run arguments is not something end users can be expected to do under normal use of the Azure ML pipeline with EstimatorStep and Environments. The same applies to users of the legacy TensorFlow estimator with the "use_gpu" boolean, as well as to users of PythonScriptStep, since these too run within containers and might want to access the GPU.
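
    For the PythonScriptStep case, a minimal sketch of how the same Environment (including the Docker arguments) could be attached through a RunConfiguration; the step name, script name, source directory and compute_target are placeholders for illustration:

    from azureml.core.runconfig import RunConfiguration
    from azureml.pipeline.steps import PythonScriptStep

    # Attach the GPU-enabled environment defined above to a run configuration.
    run_config = RunConfiguration()
    run_config.environment = TENSORFLOW_ENVIRONMENT

    # Hypothetical step; compute_target is assumed to be the NC6 target retrieved earlier.
    train_step = PythonScriptStep(
        name="train",
        script_name="train.py",
        source_directory="./src",
        compute_target=compute_target,
        runconfig=run_config,
    )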

    Wednesday, May 6, 2020 11:22 AM

All replies

  • Hello Luuk,

    Could you please let us know what type of compute_target you have defined in your estimator? Is it local?

    It would be great if you could share the estimator configuration and also let us know the error, so we can check with our product engineering team for more details.

    -Rohit

    Tuesday, May 5, 2020 1:33 PM

  • Thanks Rohit,

    The compute target is a Standard_NC6 with Docker containers. I create the compute target in the Azure ML portal and retrieve it based on the workspace and its name with "ComputeTarget(workspace, name)". The pipeline is executed remotely on the NC6 within Docker containers after I submit it from another machine. I verified that the Docker image used by Azure ML to build a derivative image is an official GPU-enabled image (https://github.com/Azure/AzureML-Containers/tree/master/base/gpu). The Docker engine also has the NVIDIA extension.
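
    For reference, the retrieval looks roughly like this (a sketch; "nc6-gpu" stands in for the actual compute name):

    from azureml.core import Workspace
    from azureml.core.compute import ComputeTarget

    # Load the workspace from a local config.json and fetch the existing compute target by name.
    workspace = Workspace.from_config()
    compute_target = ComputeTarget(workspace, "nc6-gpu")  # placeholder name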

    I am using both legacy and new ways of configuring the estimator as a step in the pipeline.

    Legacy:

    https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py

    compute_target = ComputeTarget(workspace, name)

    estimator = TensorFlow(source_directory, compute_target=compute_target,
                           script_params=script_params, use_gpu=True, use_docker=True,
                           framework_version="1.13", inputs=[my_datastore],
                           outputs=[my_pipeline_data])



    Modern:

    https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py

    tensorflow_environment = Environment(name="myenv")
    tensorflow_environment.docker.enabled = True
    tensorflow_environment.docker.gpu_support = True  # deprecated and superfluous, but ok
    tensorflow_environment.docker.base_image = "mcr.microsoft.com/azureml/base-gpu:intelmpi2018.3-cuda10.0-cudnn7-ubuntu16.04"
    conda_dep = CondaDependencies()
    conda_dep.add_pip_package("tensorflow-gpu==1.13.2")
    tensorflow_environment.python.conda_dependencies = conda_dep

    compute_target = ComputeTarget(workspace, name)
    estimator = Estimator(source_directory, compute_target=compute_target,
                          script_params=script_params,
                          environment_definition=tensorflow_environment,
                          inputs=[my_datastore], outputs=[my_pipeline_data])


    After defining an estimator in one of these two ways, I add it to a pipeline:

    pipeline = Pipeline(workspace=workspace, steps=[estimator])
    pipeline_run = experiment.submit(pipeline)
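
    For completeness, a sketch of the equivalent submission with the estimator wrapped explicitly in an EstimatorStep; the step and experiment names are placeholders:

    from azureml.core import Experiment
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import EstimatorStep

    # Wrap the estimator defined above in an explicit pipeline step.
    train_step = EstimatorStep(
        name="train",
        estimator=estimator,
        estimator_entry_script_arguments=[],
        compute_target=compute_target,
    )

    pipeline = Pipeline(workspace=workspace, steps=[train_step])
    experiment = Experiment(workspace, "gpu-pipeline-test")  # placeholder experiment name
    pipeline_run = experiment.submit(pipeline)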

    The error that the estimator generates in the pipeline step output is the following:

    """

     File "/usr/lib/python3.6/imp.py", line xxx, in load_dynamic
        return _load(spec)
    ImportError: libcuda.so.1: cannot open shared object file: No such file or directory


    Failed to load the native TensorFlow runtime.

    """

    I can reproduce this error by ssh-ing into the NC6 VM and launching a container from the image that Azure ML built within the pipeline: "docker run -it <image_name> bash". I then start Python within the container and try to "import tensorflow" -> same error.

    I can see what the root cause is and can work around the error by adding an extra parameter to the docker run on the NC6 VM: with "docker run --gpus all -it <image_name> bash" and then starting Python, I can successfully "import tensorflow".

    Having found this explanation (Azure ML does not pass on the GPU resource to the Docker container running on the NC6 VM), I inspect the run logs of the estimator pipeline step and look at the docker run command that is executed. I cannot find any trace of "--gpus" or other GPU-related parameters. For now I am satisfied that this is the error.
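
    In case it helps, the step logs can also be pulled with the SDK instead of the portal; a rough sketch assuming the pipeline_run handle from the submit above:

    # Download the logs of every step run; the azureml-logs contain the docker run command line.
    for step_run in pipeline_run.get_steps():
        print(step_run.id, step_run.get_status())
        step_run.get_all_logs(destination="./logs")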

    Tuesday, May 5, 2020 3:12 PM
  • Hello Luuk,

    Thanks for your detailed explanation of the issue along with the steps. We are checking with our internal team on this for a confirmation. We shall get back to you soon.

    -Rohit

    Wednesday, May 6, 2020 1:26 PM
  • Hi Rohit,

    Another update: the Docker-arguments workaround described above appears to be needed on Compute Instances but not on the Compute Cluster. I can confirm that the GPU is enabled successfully by default on the Compute Cluster.
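
    For reference, a Compute Cluster with the same NC6 SKU can be provisioned roughly like this (a sketch; the cluster name is a placeholder):

    from azureml.core.compute import AmlCompute, ComputeTarget

    # Provision a small GPU cluster; on this target the GPU was available without extra docker arguments.
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6",
        min_nodes=0,
        max_nodes=1,
    )
    gpu_cluster = ComputeTarget.create(workspace, "nc6-cluster", provisioning_config)
    gpu_cluster.wait_for_completion(show_output=True)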

    The difference is that on the Compute Instance the plain "docker" client is used, while on the Compute Cluster the "nvidia-docker" client is used. This might be the root of the issue: the "nvidia-docker" client is not being selected on the Compute Instance for GPU-enabled workloads.

    I read that Microsoft has recently switched to auto-detecting GPU-enabled workloads, so I expect the issue lies in that area, specifically for the Compute Instance. See "gpu_support": https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.dockersection?view=azure-ml-py#variables

    Tuesday, May 19, 2020 11:07 AM