In rxPredict, the number of parameters does not match the number of variables

  • Question

  • I'm trying to fit a linear model to data from a SQL Server source and predict with rxPredict. I have already found out that the error "In rxPredict, the number of parameters does not match the number of variables ..." isn't specific to SQL Server data sources, so I'm using a data.frame here to illustrate the problem more easily:

    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    # lm() and predict() don't have a problem with missing factor levels:
    facNum <- rep(c("one", "two", "three"), times = 2)
    facChr <- rep(c("a", "b", "c"), each = 2)
    val <- c(1, 2, 6, 2, 6, 9)
    trainingData <- data.frame(facNum, facChr, val, stringsAsFactors = TRUE)
    lmModel <- lm(val ~ facNum + facChr, data = trainingData)
    print(summary(lmModel))
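    # stringsAsFactors = TRUE is explicit here because it is no longer the default since R 4.0.
    predictionData <- data.frame(facNum = c("one", "three", "three", "one"), facChr = c("b", "b", "a", "a"), stringsAsFactors = TRUE)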
    lmPred <- predict(lmModel, newdata = predictionData)
    lmPred
    # The result is OK:
    # 1 2 3 4
    # 2 6 5 1

    # rxLinMod() and rxPredict() behave differently:
    rxModel <- rxLinMod(val ~ facNum + facChr, data = trainingData)
    rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
    # The following errors are thrown:
    # ERROR:order of factor levels in the data are inconsistent with
    # the order of the model coefficients:facChr = a versus facNum = two. Set checkFactorLevels = FALSE to ignore.
    # ERROR:order of factor levels in the data are inconsistent with
    # the order of the model coefficients:facChr = a versus facNum = two. Set checkFactorLevels = FALSE to ignore.
    # ERROR:order of factor levels in the data are inconsistent with
    # the order of the model coefficients:facChr = a versus facNum = two. Set checkFactorLevels = FALSE to ignore.
    # ERROR:order of factor levels in the data are inconsistent with
    # the order of the model coefficients:facChr = a versus facNum = two. Set checkFactorLevels = FALSE to ignore.
    rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE, checkFactorLevels = FALSE)
    # The following errors are thrown:
    # INTERNAL ERROR:In rxPredict, the number of parameters does not match the number of variables:5 vs. 7.
    # INTERNAL ERROR:In rxPredict, the number of parameters does not match the number of variables:5 vs. 7.
    levels(predictionData$facNum) <- c("two", "three", "one")
    levels(predictionData$facChr) <- c("c", "b", "a")
    rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE, checkFactorLevels = FALSE)
    rxPred
    # val_Pred    facNum  facChr
    # 2           two     b
    # 6           three   b
    # 5           three   c
    # 1           two     c
    # This looks suspicious at best. The predicted values still match the lm() results row by row,
    # but the model variables written back are messed up.

    allData <- data.frame(lineNumber = 1:9, facNum = rep(c("one", "two", "three"), times = 3), facChr = rep(c("a", "b", "c"), each = 3), val = c(1, 2, 3, 2, 4, 6, 3, 6, 9), stringsAsFactors = TRUE)
    lmPredAll <- predict(lmModel, newdata = allData)
    lmPredAll
    # 1 2 3 4 5 6 7 8 9
    # 1 2 5 2 3 6 5 6 9

    levels(allData$facNum) <- c("two", "three", "one")
    levels(allData$facChr) <- c("c", "b", "a")
    rxPredAll <- rxPredict(rxModel, data = allData, writeModelVars = TRUE, checkFactorLevels = FALSE, extraVarsToWrite = "lineNumber")
    rxPredAll
    # val_Pred  lineNumber  val     facNum  facChr
    # 1         1           1       two     c
    # 2         2           2       one     c
    # 5         3           3       three   c
    # 2         4           2       two     b
    # 3         5           4       one     b
    # 6         6           6       three   b
    # 5         7           3       two     a
    # 6         8           6       one     a
    # 9         9           9       three   a

    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    The example above is deliberately as simple as possible. In my SQL Server scenario there is one factor with about 10,000 levels (known only when the model is created) and several more factors with about 5 levels each (all known). Given arbitrary data for prediction, it's impossible to specify a "correct" order for all the levels in each factor. It seems that:

    * the levels for rxPredict must be the same as those in the model created by rxLinMod, and in addition checkFactorLevels must be set to FALSE to avoid any errors. However, this messes up the model variables.
    * the predictions are still correct. In my example I use lineNumber as a sort of primary key for the predictions.
    * the messed-up model (factor) variables can simply be ignored.
    * checkFactorLevels only checks the order of the factor levels rather than the levels themselves.

    Is that correct? Does anybody know of a better way to work around the issue?
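
    A note on the mechanics, as a base-R sketch that does not involve rxPredict at all: `levels(x) <- value` renames the existing levels by position and appends any extra ones; it does not reorder them. Assuming the columns of predictionData are factors (the data.frame default before R 4.0), that would explain why rows that held "one" are written back as "two" in the output above.

    ```r
    # Same facNum values as predictionData in the question.
    facNum <- factor(c("one", "three", "three", "one"))
    levels(facNum)                                  # "one" "three" (alphabetical)

    # Assigning levels renames by position: "one" becomes "two",
    # "three" stays "three", and "one" is appended as an unused level.
    relabeled <- facNum
    levels(relabeled) <- c("two", "three", "one")
    as.character(relabeled)                         # "two" "three" "three" "two"
    ```

    The underlying integer codes are unchanged by the assignment, which is consistent with the predictions staying the same while the labels change.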

    Wednesday, September 28, 2016 10:33 AM

Answers

  • The factor levels in the prediction dataset need to match those in the training set.  To do so:

    > levels(predictionData$facChr) <- levels(trainingData$facChr)
    > levels(predictionData$facNum) <- levels(trainingData$facNum)
    > rxPred <- rxPredict(rxModel, data = predictionData)
    Rows Read: 4, Total Rows Processed: 4, Total Chunk Time: 0.001 seconds
    > rxPred
      val_Pred
    1        2
    2        6
    3        5
    4        1
    >
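
    One caveat, shown as a base-R sketch with made-up values rather than anything rxPredict-specific: the `levels<-` assignment above relabels by position. It gives the intended result here because the levels present in predictionData are, in alphabetical order, the first levels of the training factors. When an earlier level is missing from the new data, matching by label with `factor(x, levels = ...)` appears to be the safer choice.

    ```r
    trainLevels <- c("a", "b", "c")
    x <- factor(c("b", "c", "b"))          # level "a" does not occur in the new data

    # Positional assignment silently renames b -> a and c -> b.
    byPosition <- x
    levels(byPosition) <- trainLevels
    as.character(byPosition)               # "a" "b" "a"

    # Rebuilding the factor matches by label and keeps the values intact.
    byLabel <- factor(x, levels = trainLevels)
    as.character(byLabel)                  # "b" "c" "b"
    levels(byLabel)                        # "a" "b" "c"
    ```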

    • Marked as answer by SLSvenR Thursday, September 29, 2016 7:48 AM
    Wednesday, September 28, 2016 10:05 PM

All replies

  • Thank you. Setting the levels to those of the training set also makes it possible to omit checkFactorLevels = FALSE in rxPredict:

    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


    almostAllData <- data.frame(lineNumber = 1:9, facNum = rep(c("two", "three", "one"), times = 3), facChr = rep(c("a", "c", "b"), each = 3), val = c(2, 3, 1, 6, 9, 3, 4, 6, 2), stringsAsFactors = TRUE)
    # Remove one line to make setting factor levels a necessity.
    almostAllData <- subset(almostAllData, lineNumber != 6)
    almostAllData
    #   lineNumber facNum facChr val
    # 1          1    two      a   2
    # 2          2  three      a   3
    # 3          3    one      a   1
    # 4          4    two      c   6
    # 5          5  three      c   9
    # 7          7    two      b   4
    # 8          8  three      b   6
    # 9          9    one      b   2

    lmPredAll <- predict(lmModel, newdata = almostAllData)
    lmPredAll
    #1 2 3 4 5 7 8 9
    #2 5 1 6 9 3 6 2

    levels(almostAllData$facChr) <- levels(trainingData$facChr)
    levels(almostAllData$facNum) <- levels(trainingData$facNum)
    rxPredAll <- rxPredict(rxModel, data = almostAllData, writeModelVars = TRUE, extraVarsToWrite = "lineNumber")
    rxPredAll
    #   val_Pred lineNumber val facNum facChr
    # 1        2          1   2    two      a
    # 2        5          2   3  three      a
    # 3        1          3   1    one      a
    # 4        6          4   6    two      c
    # 5        9          5   9  three      c
    # 6        3          7   4    two      b
    # 7        6          8   6  three      b
    # 8        2          9   2    one      b

    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    It seems even the model variables can be trusted again this way.
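
    For data with many factor columns, the same idea could be wrapped in a small helper. The sketch below is base R only; `alignFactorLevels` and `newData` are hypothetical names, and it recodes by label (via `factor(x, levels = ...)`) rather than by position, warning about values that were not seen in training. It assumes both data sets fit in memory as data frames.

    ```r
    # Hypothetical helper: recode every factor column of newData against the
    # levels stored in trainData, matching by label rather than by position.
    alignFactorLevels <- function(newData, trainData) {
      factorCols <- names(trainData)[vapply(trainData, is.factor, logical(1))]
      for (col in intersect(factorCols, names(newData))) {
        values <- as.character(newData[[col]])
        unseen <- setdiff(unique(values[!is.na(values)]), levels(trainData[[col]]))
        if (length(unseen) > 0) {
          warning(sprintf("%s: values not seen in training become NA: %s",
                          col, paste(unseen, collapse = ", ")))
        }
        newData[[col]] <- factor(values, levels = levels(trainData[[col]]))
      }
      newData
    }

    # Reduced version of the training data from the question (factors only).
    trainingData <- data.frame(
      facNum = rep(c("one", "two", "three"), times = 2),
      facChr = rep(c("a", "b", "c"), each = 2),
      stringsAsFactors = TRUE
    )
    newData <- data.frame(facNum = c("three", "two"), facChr = c("c", "b"))
    aligned <- alignFactorLevels(newData, trainingData)
    levels(aligned$facNum)          # "one" "three" "two", same as trainingData
    as.character(aligned$facChr)    # "c" "b", values unchanged
    ```

    In the data.frame case the aligned result could then be passed to rxPredict as before; how to supply the levels for a SQL Server data source is outside the scope of this sketch.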

    Thursday, September 29, 2016 7:48 AM