rxNeuralNet struggling with large datasets on a GPU

  • Question

  • Hello,
I got a new Surface Book 2 with a GTX 1050 GPU, and I'm looking to convert some existing code from CPU to GPU.
    The code worked, albeit slowly, on a CPU.  But it's not working on a GPU.
    I was able to run the sample code here with no issues: 
https://blogs.msdn.microsoft.com/microsoftrservertigerteam/2017/03/10/get-started-with-microsoftmls-rxneuralnet-with-gpu-acceleration/  (It took some help from other forums to rename the CUDA DLLs to match v6.5...)

But my code still didn't work.  I think I've found the problem... my data is much bigger than the sample's.  I've modified the sample code to mimic my issue.  In the code below, I run rxNeuralNet with both SSE and GPU acceleration.  SSE works fine, but the GPU call hangs when trying to initialize.

I'm wondering if the dataset is too large for the GPU and rxNeuralNet isn't handling the error gracefully.  If so, how do I determine the limits?  (A rough size check follows the code below.)

    Thanks,
    Mike

    library(MicrosoftML)
    set.seed(1, kind = "Mersenne-Twister")

    # Simulate a large training set: 400,000 rows x 250 standard-normal features
    # plus a 12-class label
    r <- 400000
    c <- 250
    normData <- data.frame(matrix(rnorm(r * c, 0, 1), r, c))
    label <- sample(1:12, r, replace = TRUE)
    normData <- cbind(normData, label = as.factor(label))

    # This works
    model_NN_sse <- rxNeuralNet(formula = label ~ ., data = normData, type = "multiClass",
                                acceleration = "sse", optimizer = adaDeltaSgd(),
                                numIterations = 5)

    # R Studio hangs here
    model_NN_gpu <- rxNeuralNet(formula = label ~ ., data = normData, type = "multiClass",
                                acceleration = "gpu", optimizer = adaDeltaSgd(),
                                numIterations = 5, miniBatchSize = 256)
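
    For scale (a rough sketch only; rxNeuralNet itself doesn't report a device memory limit): the feature matrix alone is about 400,000 × 250 × 8 bytes ≈ 760 MB of doubles on the host, before anything is copied to the GPU.  This can be confirmed with object.size():

    # Rough size check: how big is the training data on the host?
    # (~760 MB for 400,000 rows x 250 double columns, plus the label)
    print(object.size(normData), units = "MB")
    cat(sprintf("Analytical estimate: %.0f MB\n", r * c * 8 / 1024^2))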


    Sunday, August 12, 2018 6:18 AM

All replies

  • I'm suffering from the same issue.

    But in my case, even if I reduce the number of columns to 1, the process never starts.

    :(

    Tuesday, June 18, 2019 1:41 PM
  • I am not sure what your problem is, but I got it to work on my GPU using RStudio and Microsoft Machine Learning Server 9.3, following the blog article you referenced above.  It worked for me on a Windows Hyper-V VM with only 4 GB of available memory.

    I tried the modified example code you posted above and got the following output:

    Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
    Beginning processing data.
    Rows Read: 400000, Read Time: 0.001, Transform Time: 0
    Beginning processing data.
    Beginning processing data.
    Rows Read: 400000, Read Time: 0.002, Transform Time: 0
    Beginning processing data.
    Beginning processing data.
    Rows Read: 400000, Read Time: 0.001, Transform Time: 0
    Beginning processing data.
    Using: GPU CUDA Math on Quadro K1200(0000:01:00), Compute Capability: 5.0, 4 SMs, Global Memory: 4,096 MB, Theoretical B/W: 80.16 GB/sec
    
    ***** Net definition *****
      input Data [250];
      hidden H [100] sigmoid { // Depth 1
        from Data all;
      }
      output Result [12] softmax { // Depth 0
        from H all;
      }
    ***** End net definition *****
    Input count: 250
    Output count: 12
    Output Function: SoftMax
    Loss Function: LogLoss
    PreTrainer: NoPreTrainer
    ___________________________________________________________________
    Starting training...
    Using Adadelta with decay 0.950000 and conditioner 0.000001
    InitWtsDiameter: 0.100000
    Mini-Batch Size: 256
    ___________________________________________________________________
    Initializing 1 Hidden Layers, 26312 Weights...
    Estimated Pre-training MeanError = 3.448843
    Iter:1/5, MeanErr=3.448398(-0.01%%), 3326.77M WeightUpdates/sec
    Iter:2/5, MeanErr=3.443378(-0.15%%), 3350.66M WeightUpdates/sec
    Iter:3/5, MeanErr=3.441491(-0.05%%), 3226.54M WeightUpdates/sec
    Iter:4/5, MeanErr=3.440017(-0.04%%), 3228.51M WeightUpdates/sec
    Iter:5/5, MeanErr=3.438047(-0.06%%), 3355.68M WeightUpdates/sec
    Done!
    Estimated Post-training MeanError = 3.433511
    ___________________________________________________________________
    Not training a calibrator because it is not needed.
    Elapsed time: 00:00:30.3114851
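
    As a sanity check on the log above (just a sketch of the arithmetic): the reported 26,312 weights match the printed net definition, i.e. one 100-unit hidden layer with a bias per unit, fully connected between 250 inputs and 12 softmax outputs:

    # (250 inputs + 1 bias) * 100 hidden + (100 hidden + 1 bias) * 12 outputs
    inputs <- 250; hidden <- 100; outputs <- 12
    (inputs + 1) * hidden + (hidden + 1) * outputs   # 25100 + 1212 = 26312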

     
    Wednesday, June 19, 2019 11:54 PM
  • What kind of GPU are you using?

My guess all along has been that rxNeuralNet isn't handling the case when the GPU runs out of memory.  Your GPU may have more memory than mine...
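
    One way to test that theory (a sketch, assuming the nvidia-smi utility that ships with the NVIDIA driver is on the PATH; MicrosoftML doesn't expose GPU memory itself) is to query device memory from R just before the rxNeuralNet call:

    # Report total/used/free device memory via nvidia-smi. If free memory is
    # well below the size of the training data, an unhandled out-of-memory
    # condition in the GPU path is a plausible cause of the hang.
    gpu_mem <- system2("nvidia-smi",
                       args = c("--query-gpu=memory.total,memory.used,memory.free",
                                "--format=csv,noheader"),
                       stdout = TRUE)
    print(gpu_mem)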

    Thursday, June 20, 2019 3:35 PM
  • I would be careful with licensing here. I might be wrong, but technically "Microsoft Machine Learning Server" is licensed under the SQL Server license model.

    Full parallelism (for production use) is available only on the Enterprise edition of SQL Server, with volume / per-core licenses (sold in packs of 2 cores).

    Therefore, running compute workloads in production on additional cores would require those cores to be covered by a license. CUDA/GPU cores aren't multi-threaded in the CPU sense and have much lower clock frequency, per-core performance, and cache (although better memory bandwidth), which makes buying a license for them an extremely expensive choice: a regular high-end CPU turns out to be much cheaper in terms of SQL Server license cost per unit of performance.

    Even Xeon Phi (with its AVX-512 vector/float-tuned performance) isn't such a good option (yet) from this perspective (SQL Server licensing costs), simply because it is roughly twice as slow as a Xeon Platinum (which has only 2 threads per core, but about twice the cache and almost twice the frequency), although Xeon Phi's 4 threads per core give it quite a boost on a per-core (not per-thread) cost basis.

    Short story: check with your license manager whether this is allowed at all, and whether the nature of the workload and the performance gain justify the per-core / volume costs.

    Thursday, June 20, 2019 7:33 PM