Spark timeout on remote SSH connection

  • Question

  • I'm using Machine Learning Server 9.2 on Hadoop (deployed as a parcel on Cloudera Hadoop) and attempting to bring up a Spark compute context on Spark 2.2.0 from a remote client via SSH.

    I'm able to establish the initial connection, set the remote execution context, and run functions that interact with HDFS (e.g. rxHadoopListFiles), but attempting to start a Spark application always times out. Note that the same timeout occurs immediately upon connection when using rxSparkConnect instead of RxSpark (presumably because the former proactively starts a Spark application). It's probably also worth noting that this timeout occurs whether I'm remote or on the edge node (which is assigned the MLS server role).

    Here's an example scenario:

    sc <- RxSpark(
        sshUsername = ssh.username,
        sshHostname = ssh.hostname,
        sshSwitches = ssh.switches,
        sshProfileScript = ssh.profile.script,
        consoleOutput = TRUE,
        hdfsShareDir = hdfs.share.dir,
        shareDir = local.share.dir,
        reset = TRUE
    )
    rxSetComputeContext(sc)
    rxHadoopCopyFromClient("/opt/cloudera/parcels/MLServer/libraries/RServer/RevoScaleR/SampleData/AirlineDemoSmall.csv", "/share/SampleData")
    airDS <- RxTextData(file = "/share/SampleData/AirlineDemoSmall.csv", missingValueString = "M", 
                        fileSystem = RxHdfsFileSystem())
    # Timeout here
    adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS)
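
    For comparison, here is how the rxSparkConnect path mentioned above might look. This is a minimal sketch, not a known fix: all parameter values are placeholders, and the explicit numExecutors/executorCores/executorMem settings are assumptions worth checking against the cluster's YARN container limits, since a Spark application whose resource request YARN cannot satisfy will sit pending until the start timeout fires.

    ```r
    # Sketch only -- adjust every value to your environment.
    cc <- rxSparkConnect(
        sshUsername      = ssh.username,
        sshHostname      = ssh.hostname,
        sshSwitches      = ssh.switches,
        sshProfileScript = ssh.profile.script,
        hdfsShareDir     = hdfs.share.dir,
        shareDir         = local.share.dir,
        consoleOutput    = TRUE,  # surface the spark-submit/YARN output to diagnose the hang
        numExecutors     = 2,     # assumption: keep the request small enough for YARN to grant
        executorCores    = 2,
        executorMem      = "4g",
        reset            = TRUE
    )
    rxSparkDisconnect(cc)  # tear down the Spark application when finished
    ```

    With consoleOutput = TRUE, the spark-submit output should show whether the application is stuck in the YARN ACCEPTED state (a resource problem) or failing outright before the timeout is reported.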

    Tuesday, February 13, 2018 1:11 AM