Information variables: rxIsTestChunk and rxIsPrediction

  • Question

  • rxIsTestChunk is an information variable that indicates if the chunk being processed is a test sample of the data. rxIsPrediction indicates if the chunk is used for a prediction rather than a model estimation. 

    How are the values of these variables set? How do we select chunks from an XDF file for training, testing, or prediction? Can we use some blocks for training, others for testing, and others for prediction while working on the same XDF file (i.e., without creating a new XDF file to hold the training subset, another to hold the testing subset, etc.)?


    • Edited by ansiwen Monday, November 21, 2016 11:54 AM
    Monday, November 21, 2016 11:54 AM

All replies

  • @ansiwen,

    I think you can essentially get bite-sized chunks from your single XDF file if you treat the XDF file as a database via RxOdbcData and then create variables for your different chunks via T-SQL. You can apply T-SQL to the data as it comes into RevoR by passing a SQL query to RxOdbcData. A sample would look like:

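    # connectionString is assumed to be defined beforehand (an ODBC connection string for your data source)
    testXDF <- RxOdbcData(sqlQuery = "SELECT TOP 100 * FROM xdf_database",
                          connectionString = connectionString)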

    Another option is to use rxDataStep to read the data in the XDF file into a data frame. A data frame must be held in memory, so this may not be an option if you have extremely large .xdf files. You could then use the sqldf package, which lets you run SQL SELECT statements on data frames.
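
    A minimal sketch of the sqldf step, assuming the sqldf package is installed. The small flights data frame and its id and ArrDelay columns are made up for illustration and stand in for whatever data frame rxDataStep returns; the 7/3 split is likewise only an example.

```r
library(sqldf)  # assumption: the sqldf package is installed

# Hypothetical stand-in for the data frame read from the XDF file;
# the column names and values are invented for this sketch.
flights <- data.frame(
  id       = 1:10,
  ArrDelay = c(5, -3, 42, 0, 17, 88, -10, 3, 61, 12)
)

# Select subsets with plain SQL; sqldf looks up the data frame by name.
trainDF <- sqldf("SELECT * FROM flights WHERE id <= 7")  # rows used for training
testDF  <- sqldf("SELECT * FROM flights WHERE id > 7")   # rows held out for testing
```

    Both results are ordinary data frames, so they can be passed to modeling or prediction functions that accept in-memory data.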

    SOTATS, Inc.

    Wednesday, November 23, 2016 1:25 AM
  • Another option is to add an indicator variable for train/test to the main dataset and then use the rowSelection option to select the desired subset.
    Wednesday, December 7, 2016 5:49 PM
  • Thank you!

    I use this method, but my question is: how are the values of the information variables set?

    Tuesday, December 20, 2016 8:09 AM
  • Here's some sample code to give you an idea of how to add an indicator variable and use it to filter the rows used in both modeling and prediction without explicitly splitting the input object.

    airDS <- RxXdfData(file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"))
    tmpDS <- RxXdfData(tempfile(fileext='.xdf'))

    # add a logical train/test indicator: TRUE (about 70% of rows) marks training rows

    rxDataStep(airDS, tmpDS, overwrite=TRUE, transforms=list(train=runif(.rxNumRows)<0.7))
    rxGetInfo(tmpDS, getVarInfo=TRUE)

    # train using the train subset

    linMod <- rxLinMod(ArrDelay~CRSDepTime+DayOfWeek, data=tmpDS, rowSelection=train)

    # predict on the test subset
    #   note: this works locally. You'll need an intermediate XDF if running distributed.

    pred <- rxPredict(linMod, data=rxDataStep(tmpDS, rowSelection=!train), writeModelVars=TRUE)
    rxGetInfo(pred, getVarInfo=TRUE)

    Wednesday, December 21, 2016 6:12 PM