none
MS ML Server 9.4.7 fails to start from azure CLI on Windows Server 2019 with HTTPS configured for Kestrel section RRS feed

  • Question

  • Hi support,

    I installed MS ML Server 9.4.7 on Windows Server 2019 for POC purpose. 

    For default Kestrel configuration bootstrap and diagnostic commands were executed successfully. All tests were green.

    I created self signed certificate and provided read private key privileges for NETWORK SERVICE user, Subject matches host name. I used Enable SSL or TLS for Connection Security in Machine Learning Server article.

    Kestrel section as below:

        "Kestrel": {
            "Host": "localhost",
            "HttpsCertificate": {
                "Description": "Enable this section if you want to enable SSL on Kestrel",
                "Enabled": true,
                "StoreLocation": "LocalMachine",
                "StoreName": "My",
                "SubjectName": "CN=localhost",
                "Thumbprint": "4d6b475a84b7c426715d7f45d9bcc3dfd441d071"
            },
            "HttpsEnabled": true,
            "MaxRequestBodySizeInMB": 500,
            "Port": 12800
        }, 

    when I execute azure CLI command 

    az mlserver admin node start

    1. powershell session hangs (no messages there like node started, failed, etc)

    2. dotnet.exe process eats memory (just one is available for webnode)

    3. python.exe eats memory (just one for webnode is running)

    4. There are no python processes and dotnet.exe for computenode.

    5. Webnode log full of errors like: 2020-01-21 00:00:52.224 -05:00 [Error] {"CorrelationId":"b83b60a9-6a21-467c-a85a-ec2c645ba7ab","Subject":{"uri":"http://localhost:12805","details":-1}}

    6. on the netstat -ao there are crazy amount of tcp connections (that constantly increases), example below.

      TCP    127.0.0.1:12800        WIN-CFLK382QOO1:59332  TIME_WAIT       0
      TCP    127.0.0.1:12800        WIN-CFLK382QOO1:59333  TIME_WAIT       0
      TCP    127.0.0.1:12800        WIN-CFLK382QOO1:59334  TIME_WAIT       0

    7. No errors in the compute node logs. Even with Trace level.

    8. when I execute stop command from another powershell session I got message below. Which states HTTP for some reason.

    pass Web Node stopped. (PID: 7740, Listening on URI: http://localhost:12800/)
    fail Compute Node was unable to stop.
    Compute Node was not able to stop because it is not running.

    Could you please help me to resolve the issue? 

    P.S. during my investigating I installed dotnet sdk 2.1.3 and run command below from command line(compute node first and then webnode) I got message that they both successfully started and some information messages written (as logging level is trace). netstat -ao does not show some weird activity.

    dotnet Microsoft.MLServer.ComputeNode.dll

    info: default[0]
          {"CorrelationId":null,"Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":null,"Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    info: default[0]
          {"CorrelationId":null,"Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":null,"Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    info: default[0]
          {"CorrelationId":null,"Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":null,"Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    info: default[0]
          {"CorrelationId":null,"Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":null,"Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    info: default[0]
          {"CorrelationId":null,"Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":null,"Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    Hosting environment: Production
    Content root path: C:\Program Files\Microsoft\ML Server\R_SERVER\o16n\Microsoft.MLServer.ComputeNode
    Now listening on: http://0.0.0.0:12805
    Application started. Press Ctrl+C to shut down.
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":"o16NPythonDirectory Python"}
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":"pythonInterpreterPath C:\\Program Files\\Microsoft\\ML Server\\PYTHON_SERVER\\python.exe"}
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":{"ActiveShellCount":1,"CurrentPoolSize":5,"RuntimeType":"Python"}}
    info: default[0]

    dotnet Microsoft.MLServer.WebNode.dll
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":"Compute nodes used: http://localhost:12805/"}
    info: default[0]
          {"CorrelationId":"0edf6d31-8259-4106-a775-9bdee84222fc","Subject":"Found 0 service with dedicated pool"}
    Hosting environment: Production
    Content root path: C:\Program Files\Microsoft\ML Server\R_SERVER\o16n\Microsoft.MLServer.WebNode
    Now listening on: https://0.0.0.0:12800
    Application started. Press Ctrl+C to shut down.
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":"Compute nodes used: http://localhost:12805/"}
    info: default[0]
          {"CorrelationId":"0edf6d31-8259-4106-a775-9bdee84222fc","Subject":"Found 0 service with dedicated pool"}
    info: default[0]
          {"CorrelationId":"b1062468-49e2-46d9-a10a-da773d6ded9b","Subject":"Compute nodes used: http://localhost:12805/"}

    Also I ran az command with --debug and everything hanged after python process that executed web start python script was killed:

    az mlserver admin node start -w --debug
    Command arguments: ['mlserver', 'admin', 'node', 'start', '-w', '--debug']
    Event: Cli.PreExecute []
    Event: CommandParser.OnGlobalArgumentsCreate [<function CLILogging.on_global_arguments at 0x03C66C00>, <function OutputProducer.on_global_arguments at 0x03CEFC48>, <function CLIQuery.on_global_arguments at 0x03D138A0>]
    Event: CommandInvoker.OnPreCommandTableCreate [<function CLILogging.remove_logger_flags at 0x03C66C48>]
    Installed command modules ['acr', 'acs', 'advisor', 'appservice', 'backup', 'batch', 'batchai', 'billing', 'cdn', 'cloud', 'cognitiveservices', 'configure', 'consumption', 'container', 'cosmosdb', 'dla', 'dls', 'eventgrid', 'extension', 'feedback', 'find', 'interactive', 'iot', 'keyvault', 'lab', 'monitor', 'network', 'profile', 'rdbms', 'redis', 'reservations', 'resource', 'role', 'servicefabric', 'sql', 'storage', 'vm']
    Loaded module 'acr' in 0.004 seconds.
    Loaded module 'acs' in 0.003 seconds.
    Loaded module 'advisor' in 0.002 seconds.
    Loaded module 'appservice' in 0.005 seconds.
    Loaded module 'backup' in 0.003 seconds.
    Event: CommandLoader.OnLoadCommandTable []
    Loaded module 'batch' in 0.009 seconds.
    Loaded module 'batchai' in 0.003 seconds.
    Loaded module 'billing' in 0.002 seconds.
    Event: CommandLoader.OnLoadCommandTable []
    Loaded module 'cdn' in 0.003 seconds.
    Loaded module 'cloud' in 0.195 seconds.
    Loaded module 'cognitiveservices' in 0.002 seconds.
    Loaded module 'configure' in 0.001 seconds.
    Loaded module 'consumption' in 0.003 seconds.
    Loaded module 'container' in 0.002 seconds.
    Loaded module 'cosmosdb' in 0.004 seconds.
    Loaded module 'dla' in 0.065 seconds.
    Loaded module 'dls' in 0.003 seconds.
    Loaded module 'eventgrid' in 0.002 seconds.
    Loaded module 'extension' in 0.001 seconds.
    Loaded module 'feedback' in 0.001 seconds.
    Loaded module 'find' in 0.001 seconds.
    Loaded module 'interactive' in 0.001 seconds.
    Loaded module 'iot' in 0.005 seconds.
    Loaded module 'keyvault' in 0.005 seconds.
    Loaded module 'lab' in 0.112 seconds.
    Loaded module 'monitor' in 0.004 seconds.
    Loaded module 'network' in 0.018 seconds.
    Loaded module 'profile' in 0.004 seconds.
    Loaded module 'rdbms' in 0.004 seconds.
    Loaded module 'redis' in 0.003 seconds.
    Loaded module 'reservations' in 0.003 seconds.
    Loaded module 'resource' in 0.006 seconds.
    Loaded module 'role' in 0.003 seconds.
    Loaded module 'servicefabric' in 0.002 seconds.
    Loaded module 'sql' in 0.005 seconds.
    Event: CommandLoader.OnLoadCommandTable []
    Loaded module 'storage' in 0.191 seconds.
    Loaded module 'vm' in 0.014 seconds.
    Loaded all modules in 0.697 seconds. (note: there's always an overhead with the first module loaded)
    Extensions directory: 'C:\Users\yup\.azure\cliextensions'
    Found 1 extensions: ['azure-ml-admin-cli']
    ADMIN_CLI_SSL_VERIFY True
    ADMIN_CLI_SSL_CA_CERT None
    ADMIN_CLI_CERT_FILE None
    ADMIN_CLI_KEY_FILE None
    ADMIN_CLI_PROXY None
    ADMIN_CLI_PROFILE_DIR None
    None
    mls-admin-profile base location: C:\Users\yup\.azure
    mls-admin-profile location: C:\Users\yup\.azure\mls-admin-profile.json
    Loaded extension 'azure-ml-admin-cli' in 0.067 seconds.
    Event: CommandInvoker.OnPostCommandTableCreate [<function add_id_parameters at 0x03F2A738>]
    Event: CommandInvoker.OnCommandTableLoaded []
    Event: CommandInvoker.OnPreParseArgs [<function _documentdb_deprecate at 0x047687C8>]
    Event: CommandInvoker.OnPostParseArgs [<function OutputProducer.handle_output_argument at 0x03CEFC90>, <function CLIQuery.handle_query_parameter at 0x03D138E8>]
    {'webnode': True, 'computenode': None}
    attempting to read file C:\Users\yup\.azure\mls-admin-profile.json as utf-8-sig
    =======================================================
    Process command: "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\python.exe" "C:/Users/yup/.azure/cliextensions/azure-ml-admin-cli/azext_admin/directives/scripts/python/windows_start_web_node.py"
    =======================================================
    ===========================================================
    Command: "C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\python.exe" "C:/Users/yup/.azure/cliextensions/azure-ml-admin-cli/azext_admin/directives/scripts/python/windows_start_web_node.py"
    stdout: {"returnCode": 0, "stdout": {"pid": 6772}, "success": true}

    stderr:
    returncode: 0
    pid: 1900
    ===========================================================
    Process forced killed: 1900


    Tuesday, January 21, 2020 11:16 AM