Error when using rxLogit: RevoScaleR cannot be used in this R session anymore

  • Question

  • I ran into a curious error today while working with rxLogit. It appears that when handling NA values with the built-in intercept term excluded, rxLogit only scans the first 1000 rows of a dataset and then crashes if all of those rows end up excluded due to NA values.

    To reproduce:

    test.data <- data.frame(y = rbinom(1002, 1, 0.5), x = c(rep(NA, 1000), rnorm(2)), intercept = rep(1, 1002))
    test.data$x[1000] <- 1
    rxLogit(as.formula("y ~ -1 + intercept + x"), test.data)
    # calculates just fine
    
    test.data$x[1000] <- NA
    rxLogit(as.formula("y ~ x"), test.data)
    # also calculates just fine
    
    rxLogit(as.formula("y ~ -1 + intercept + x"), test.data)
    # Error in doTryCatch(return(expr), name, parentenv, handler) : 
    #  fatal error: RevoScaleR cannot be used in this R session anymore, if possible restart R session
    # error code -1073740791, detailed error message might be found in: (standard output unavailable) and (standard error output unavailable)
    
    # this also happens if we filter out the bad rows with a rowSelection statement!
    rxLogit(as.formula("y ~ -1 + intercept + x"), test.data, rowSelection = parse(text="!is.na(x)"))

    From a quick Google search, error code -1073740791 appears to correspond to the generic Windows "Stack Buffer Overrun" error: https://windows-hexerror.linestarve.com/q/so47060066-Read_Ncol-exit-with-error-code-1073740791
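For readers who want to check that mapping themselves, here is a minimal base R sketch that reinterprets the signed 32-bit code from the error message as an unsigned value and prints it in hexadecimal. The variable names are illustrative only; the arithmetic is done on doubles because the unsigned value does not fit in R's signed 32-bit integer type.

```r
# Signed 32-bit exit code copied from the rxLogit error message.
code <- -1073740791

# Reinterpret as unsigned by adding 2^32 (kept as a double on purpose).
unsigned <- code + 2^32

# Split into two 16-bit halves so each half fits in an R integer.
hi <- as.integer(unsigned %/% 65536)
lo <- as.integer(unsigned %% 65536)

hex <- sprintf("%04X%04X", hi, lo)
print(hex)  # "C0000409"
```

0xC0000409 is the Windows status code STATUS_STACK_BUFFER_OVERRUN, which matches the search result above.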

    A few points regarding workarounds:

    • We use rxLogit within a software tool we develop, as opposed to a manual data science workflow, so there are cases where datasets get passed in with lots of NA values that we want handled automatically. We figured that rowSelection would be a reasonable way to filter out those rows so that the actual rxLogit calculation logic wouldn't see them, but that appears not to be the case here.
    • We include our own intercept term for convenience so that we can control the naming of that term. The code works if we use the default intercept term instead, so that could be a temporary workaround, but it still seems strange that this would prevent the error from occurring.

    This was using Microsoft R Server version 9.3.0 ("Microsoft R Server version 9.3.0.2135 (2018-02-11 00:56:50 UTC)"). A colleague also reproduced it on 9.0.1, albeit with a different error message (but similar behavior, including a corrupted R session).
    Wednesday, November 21, 2018 5:37 PM

All replies

  • Hi John,

    Thanks for the bug report. My suggestion would be that you try fitting the same model with 'rxGlm()' and set the family='binomial'.  This will fit the same model as rxLogit(), but the rxGlm() function is newer and more robust than rxLogit().  Give it a try and let me know if that works for your scenario.

    In the meantime thanks for reporting the buggy rxLogit() behavior.
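A minimal sketch of that suggestion, reusing the test data from the question. The seed and the guard around the RevoScaleR call are additions made here so the snippet also runs (and simply skips the fit) on an R installation without RevoScaleR; whether rxGlm avoids the crash on a given version is something to verify locally.

```r
# Rebuild the test data from the question (seed added for repeatability).
set.seed(1)
test.data <- data.frame(
  y = rbinom(1002, 1, 0.5),
  x = c(rep(NA, 1000), rnorm(2)),
  intercept = rep(1, 1002)
)

# rxGlm() ships with RevoScaleR, so only attempt the fit when it is installed.
fit <- NULL
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  # Same formula that crashed rxLogit, fitted as a binomial GLM instead.
  fit <- rxGlm(y ~ -1 + intercept + x, data = test.data, family = binomial())
  print(summary(fit))
}
```

With only two non-NA rows the fit itself is not meaningful; the point is only to check that the call returns instead of taking down the session.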


    Thursday, November 22, 2018 12:00 AM
  • Thanks, Steve!

    It does look like rxGlm avoids this issue, including when adding in an RxSqlServerData that has to be read in multiple chunks.

    However, rxGlm does appear to be slower than rxLogit, based on a couple of quick local tests and the documentation at https://docs.microsoft.com/en-us/machine-learning-server/r/how-to-revoscaler-logistic-regression (which describes rxLogit as optimized). We'll likely just avoid this issue by ensuring NA values are removed through other means while sticking with rxLogit.
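One way "other means" could look for an in-memory data frame is to drop incomplete rows in base R before the data ever reaches rxLogit. The sketch below is an assumption about that approach, not the poster's actual code; `drop_incomplete_rows` is a hypothetical helper name, and data read from a source such as RxSqlServerData would need the filtering done elsewhere (for example, in the query).

```r
# Hypothetical helper: keep only rows with no NA in the model variables.
drop_incomplete_rows <- function(data, vars) {
  keep <- complete.cases(data[, vars, drop = FALSE])
  data[keep, , drop = FALSE]
}

# Test data from the question (seed added for repeatability).
set.seed(1)
test.data <- data.frame(
  y = rbinom(1002, 1, 0.5),
  x = c(rep(NA, 1000), rnorm(2)),
  intercept = rep(1, 1002)
)

clean.data <- drop_incomplete_rows(test.data, c("y", "x", "intercept"))
print(nrow(clean.data))  # 2
```

The cleaned frame would then be passed to rxLogit in place of the original, so the first block of rows it scans is never made up entirely of NA values.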

    Monday, November 26, 2018 5:48 PM