Calculate NULL percentage of variable in an XDF

  • Question

  • Hi, 

    Context: I have 20 CSV files of 10 GB each (20 million records * 100 variables). I'm using rxImport() to convert these CSV files into a single XDF file.

    Problem: I need to calculate the percentage of null values in each of these variables. I can't use a data frame for this because I have ~400 million records.

    What I did: To calculate the null percentage on the XDF, I'm using rxDataStep() with a transformFunc, as shown below.

    transformFunc <- function(dataFile) {
      nullPerc <- 0.0             # placeholder of type double
      nullCols <- 'dummy'         # placeholder of type character
      nullList <- data.frame()    # placeholder data frame

      # Fraction of empty, NA, or "NA" values in each column of the chunk
      nullPerc <- apply(dataFile, 2, function(x) length(which(x == "" | is.na(x) | x == "NA")) / length(x))
      nullCols <- colnames(dataFile)
      nullList <- cbind(as.data.frame(nullPerc), as.data.frame(nullCols))
      return(nullList)
    }

    outFile <- rxDataStep(
      inData = fileXDF,           # fileXDF = name of my collated XDF file
      transformFunc = transformFunc,
      overwrite = TRUE
    )

    I'm expecting to get the output as a data frame with column1 = <all 100 column names of data> and column2 = <respective percentage of null values>.

    Instead, I'm getting this error:

    ERROR: The sample data set for the analysis has no variables.

    Caught exception in file: /builddir/ExaRoot/ExaCore/CxAnalysis.cpp, line: 3848. ThreadID: -447668096 Rethrowing.

    Caught exception in file: /builddir/ExaRoot/ExaCore/CxAnalysis.cpp, line: 5375. ThreadID: -447668096 Rethrowing.

    Error in doTryCatch(return(expr), name, parentenv, handler): ERROR: The sample data set for the analysis has no variables.

    Logically, since XDF files are processed in chunks, I can't get the percentage of null data per variable for the whole dataset unless I'm working with data frames. But since the data volume is huge, I can't use data frames (SIGPIPE error, RAM constraints, etc.).

    Any workarounds or input would be greatly appreciated.

    Thanks

    Wednesday, September 12, 2018 9:54 AM

All replies

  • I think you will need to write your own RevoScaleR chunking function.

    I would take a look at the examples below on writing a chunking algorithm and on 'using internal variables in a transformation function':

    (Has section on using internal variables in a transformation)

    https://docs.microsoft.com/en-us/machine-learning-server/r/how-to-revoscaler-data-transform

    (Information on writing your own chunking algorithm)

    https://docs.microsoft.com/en-us/machine-learning-server/r/how-to-developer-write-chunking-algorithms

    Hope this helps.

    Thursday, September 13, 2018 4:54 PM
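
A minimal sketch of the accumulate-across-chunks approach suggested in the reply, assuming rxDataStep() is run with returnTransformObjects = TRUE so that running totals updated through .rxGet()/.rxSet() are returned instead of a data set. The names accumulateMissing, summarizeMissing, toMissing and toRows are hypothetical, and AirlineDemoSmall.xdf (the sample file shipped with RevoScaleR) stands in for fileXDF. It is worth checking on a small file with a known answer first, including whether any test pass over the first rows affects the totals.

```r
# Sketch only: add this chunk's per-column missing counts to running totals.
# toMissing and toRows are hypothetical names for those totals.
accumulateMissing <- function(dataList) {
  # dataList is one chunk: a named list of equal-length vectors
  chunkMissing <- vapply(dataList, function(x) {
    s <- as.character(x)                  # treat factors and numbers alike
    sum(is.na(s) | s == "" | s == "NA")   # same three tests as in the question
  }, numeric(1))
  .rxSet("toMissing", .rxGet("toMissing") + chunkMissing)
  .rxSet("toRows", .rxGet("toRows") + length(dataList[[1]]))
  NULL  # no data is written; only the totals matter
}

# Turn the final totals into one row per variable.
summarizeMissing <- function(totals) {
  data.frame(nullCols = names(totals$toMissing),
             nullPerc = 100 * unname(totals$toMissing) / totals$toRows,
             stringsAsFactors = FALSE)
}

# Runs only where RevoScaleR is installed.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  # Sample file shipped with RevoScaleR; swap in fileXDF for the real data
  xdf <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")
  totals <- rxDataStep(inData = xdf,
                       transformFunc = accumulateMissing,
                       transformVars = rxGetVarNames(xdf),  # pass every column
                       transformObjects = list(toMissing = 0, toRows = 0),
                       returnTransformObjects = TRUE)
  print(summarizeMissing(totals))
}
```

Because only one count per column and a row total are kept between chunks, memory use does not grow with the number of rows; the percentages are computed once at the end from the totals.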