Exporting data from R to SQL Server

  • Question

  • Hi all,

    I am cross-posting this from the R Open forum where it went unanswered.

    I am using the RevoScaleR function rxImport to pull data from SQL Server to an R memory frame via ODBC.

    Apparently, however, there is no way to go in the other direction, i.e., push an R memory frame into a SQL Server database.

    Does anyone know what is the reason for this asymmetry?

If I want to INSERT data into a SQL Server table from R, do I need to resort to an external package such as RODBC, or is there some workaround to do it within RevoScaleR?

    Thank you in advance!

    Gianluca


    • Edited by GL Torta Saturday, December 17, 2016 9:19 PM
    Saturday, December 17, 2016 9:17 PM

Answers

  • Gianluca,

It's possible to save data back using rxDataStep with outFile pointing to an RxSqlServerData(table="myTable", ....).  I struggled to figure out how to save data back as well, and I didn't find the relevant documentation easily.

If you have a small dataset, this should work fine.  The performance of saving data this way is similar to RODBC (meaning kind of slow).  If you use sp_execute_external_script with an output data frame and push that result set into a table, I believe you'll likely get better write speed.  And if speed is really important and you're saving millions of rows, you might need to think about writing a data frame to a flat file and then using BCP.exe to push the data back to SQL Server.

    # https://msdn.microsoft.com/en-us/library/mt637381.aspx (Create New SQL Server Table using rxDataStep (Data Science Deep Dive))
    outDataSource <- RxSqlServerData(table="tableToSaveDataTo", connectionString=conString)
    rxDataStep(inData=inData, outFile=outDataSource, append=...)
    # append="rows" will not create a new table
    # append="none" will create a new table (but check column types that are picked for you especially if you have strings in your dataframe!)
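For the sp_execute_external_script route mentioned above, a minimal sketch of pushing the output result set into a table (the table, columns, and script here are hypothetical; the INSERT column list must match the output data frame):

    -- hypothetical target table; column types must match the output data frame
    CREATE TABLE dbo.myResults (a FLOAT, b INT);
    GO
    INSERT INTO dbo.myResults (a, b)
    EXECUTE sp_execute_external_script
        @language = N'R'
        , @script = N'OutputDataSet <- data.frame(a = rnorm(10), b = 1L)';
    GO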

    Bob

    • Marked as answer by GL Torta Monday, December 19, 2016 11:21 AM
    Sunday, December 18, 2016 3:09 PM
  • Yes, we seem to be lacking general product documentation on use of the "write" option with both RxSqlServerData and RxOdbcData, and we will address this.  Per the other response on this thread, the way to do it in both cases is through rxDataStep, specifying a table name in the output data source definition.
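    For the RxOdbcData case, the same pattern should apply; a minimal sketch (the DSN, table name, and data frame here are hypothetical):

    odbcConnString <- "DSN=myOdbcDsn"
    odbcOutSource <- RxOdbcData(table = "tableToSaveDataTo", connectionString = odbcConnString)
    rxDataStep(inData = myDataFrame, outFile = odbcOutSource, overwrite = TRUE)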

    • Edited by rkittler (MSFT) Monday, December 19, 2016 4:38 AM
    • Marked as answer by GL Torta Monday, December 19, 2016 11:21 AM
    Monday, December 19, 2016 4:37 AM

All replies

  • Hi, are there any plans at MS to change the implementation of the rx* functions so that writing data to SQL Server tables is done in bulk rather than row by row? Currently, inserting a few million rows as the output of an rx* function (e.g., rxKmeans or rxDataStep) takes a very long time. This feature would be very much needed in a high-performance computing environment.
    Tuesday, December 20, 2016 12:02 PM
  • Hi GK_79,

Although we don't use bulk upload, today writes to SQL data sources are generally done in batches of 100K rows at a time. Can you turn SQL Profiler on and see how many rows are getting inserted in a single call in your scenario? Also, can you share with us the output of "Revo.version" from your R console?
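    If chunk size is a factor in your testing, note that the rowsPerRead argument on the data source controls how many rows RevoScaleR reads and writes per chunk; a minimal sketch (the table, connection string, and data frame are hypothetical):

    outSource <- RxSqlServerData(table = "myTable", connectionString = conString, rowsPerRead = 100000)
    rxDataStep(inData = myDataFrame, outFile = outSource, overwrite = TRUE)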

    Thanks

    Wednesday, December 21, 2016 7:04 PM
  • Vijay,

I ran the following command and traced in SQL Server Profiler what was happening on the DB side:

    clustersLogInitialPD <-  rxKmeans(formula = ~LogInitialPD
                         ,data = inDataSource
                         ,algorithm = "Lloyd" 
                         ,centers = start_c
                         ,maxIterations = 1
                         ,outFile = sqlLogPDClustersDS
                         ,outColName = "ClusterNo"
                         ,overwrite = TRUE
                         ,writeModelVars = TRUE
                         ,extraVarsToWrite = c("LoadsetId", "ExposureId")
                         ,reportProgress = 0
    )

The outFile parameter points to a table in the database. The table is dropped, then recreated as a heap, and the cluster assignments produced by rxKmeans are then inserted into the table row by row.

SQL Server Profiler shows a long series of the following statements:

INSERT INTO ifrs9.logPDClusters VALUES (@P1, @P2, @P3, @P4)

The execution plan shows a Table Insert operator with Estimated Number of Rows = 1.

    Below is the output from "Revo.version":

    > Revo.version
    
    platform           x86_64-w64-mingw32                                        
    arch               x86_64                                                    
    os                 mingw32                                                   
    system             x86_64, mingw32                                           
    status                                                                       
    major              8                                                         
    minor              0.3                                                       
    year               2016                                                      
    month              09                                                        
    day                07                                                        
    language           R                                                         
    version.string     Microsoft R Server version 8.0.3 (2016-09-07 05:49:27 UTC)
    nickname           Fire Safety                                               
    BuildID            Trunk-RevoR-Win-15703-497                                 
    RevoScaleR BuildID Trunk-RevoScaleR_Release-Win-16131-511 

    Do you see what might be preventing R Server from inserting rows in larger batches?

    Thank you

    Wednesday, December 28, 2016 11:03 AM
  • GK_79,

This isn't a full answer, but from testing I did this evening, saving data appears to be significantly faster on R Server 9.0.1 (SqlBindR required / https://support.microsoft.com/en-us/kb/3210262 installed) than on R Server 8.0.3 (SQL Server RTM).

I wasn't able in a couple of minutes to determine exactly what changed to improve performance (~10x faster going from 8.0.3 to 9.0.1).  If you figure out the difference using SQL Profiler, or if you can explain the difference, Vijay, I'd be curious to know.  For larger datasets, performance is still a good deal slower than BCP (https://stackoverflow.com/questions/41227844/sql-server-r-services-outputting-data-to-database-table-performance/41234389#41234389), but I'm happy to see this improvement.
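    In case it's useful, a rough sketch of the flat-file + bcp.exe route from R (the file path, table, server, and data frame names are hypothetical; -T assumes Windows authentication):

    # write the data frame as a tab-delimited flat file, then shell out to bcp.exe
    write.table(myDataFrame, "C:\\temp\\bulk_out.txt", sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
    system(paste0('bcp dbo.myTable in "C:\\temp\\bulk_out.txt" ',
                  '-S myServer -d myDatabase -T -c -t "\\t"'))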

    Bob

    SET STATISTICS TIME ON
    GO
    
    EXECUTE sp_execute_external_script
    	@language = N'R'
    	, @script = N'
    		sqlConnString <- "SERVER=.;DATABASE=master;UID=#UID#;PWD=#PWD#;"
    		numRows <- 100000
    		bbbb <- data.frame(a=rnorm(numRows), b=1)
    
    		sqlDataSource <- RxSqlServerData(table = "#tmp", connectionString = sqlConnString)
    		rxDataStep(inData = bbbb, outFile = sqlDataSource, overwrite=TRUE)
    		cat("done!\n")
    		';
    GO
    


    Tuesday, January 3, 2017 3:33 AM
  • Bob,

    Thanks for your comments.

    I upgraded Microsoft R Client, R Server and SQL R Services to the most recent version available.

    Revo.version now shows:

    platform       x86_64-w64-mingw32                                             
    arch           x86_64                                                         
    os             mingw32                                                        
    system         x86_64, mingw32                                                
    status                                                                        
    major          9                                                              
    minor          0.1                                                            
    year           2016                                                           
    month          11                                                             
    day            30                                                             
    language       R                                                              
    version.string Microsoft R Server version 9.0.1.2396 (2016-11-30 02:03:45 UTC)

    SQL Server version is: 13.0.4202.2

I then reran my previous code (from RStudio, setting the compute context to SQL Server) and then your code (from SSMS), watching what was going on in the database through SQL Server Profiler.

Unfortunately, I still see the same thing: row-by-row inserts. Additionally, from my previous code I am now getting a warning:

    Warning message:
    In `[<-`(`*tmp*`, "outputDataSource", value = list(inputDataQuery = "\n  SELECT\n    <columns>\n  FROM\n    <table>\n  WHERE\n    <filter predicate>\n",
    :  implicit list embedding of S4 objects is deprecated

What is the execution time of your script in your environment? On my VM I get the following:

    STDOUT message(s) from external script:
    Rows Read: 100000,
    Total Rows Processed: 100000
    Total Rows written: 100000,
    Total time: 18.407,
    Total Chunk Time: 19.117 seconds done!
    SQL Server Execution Times:
    CPU time = 0 ms,
    elapsed time = 21324 ms.
    SQL Server Execution Times:
    CPU time = 16 ms,
    elapsed time = 21327 ms.

    /Grzegorz

    Tuesday, January 3, 2017 11:13 AM
  • Grzegorz,

I see the same behavior as well: the insert is happening row-by-row.  I thought I was seeing improved performance between 8.0.3 and 9.0.1 (it's still row-by-row and slow), but I'm currently not sure if the performance difference I see is due to a difference between my two servers or due to MRS versions.

    For performance, I see:
    MRS 9.0.1 server
    ----------------
    SQL Server parse and compile time: 
       CPU time = 0 ms, elapsed time = 0 ms.
    SQL Server parse and compile time: 
       CPU time = 0 ms, elapsed time = 0 ms.
    STDOUT message(s) from external script: 
    Rows Read: 100000, Total Rows Processed: 100000
    Total Rows written: 100000, Total time: 4.232
    , Total Chunk Time: 4.250 seconds 
    done!
    
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 5156 ms.
    
     SQL Server Execution Times:
       CPU time = 0 ms,  elapsed time = 5156 ms.
    
Regarding the warning you now see, I believe that

Warning message:
In `[<-`(`*tmp*`, "outputDataSource", value = list(inputDataQuery = "\n  SELECT\n    <columns>\n  FROM\n    <table>\n  WHERE\n    <filter predicate>\n",
:  implicit list embedding of S4 objects is deprecated

is a warning that now appears in MRS 9.0.1 (R 3.3.2) because of a change introduced in R 3.3.0 (MRS 8.x uses R 3.2.2).  See https://github.com/richarddmorey/BayesFactor/issues/73 for how a different R project fixed this.  I get this warning using MRS 9 but not MRS 8 (the insert works in both versions).
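    For illustration, a minimal snippet (unrelated to RevoScaleR; the class is made up) that reproduces the same class of warning under R >= 3.3.0:

    setClass("Foo", representation(x = "numeric"))
    l <- list()
    l["a"] <- new("Foo", x = 1)        # warns: implicit list embedding of S4 objects is deprecated
    l["a"] <- list(new("Foo", x = 1))  # explicitly wrapping in list() avoids the warning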

    @Vijay (or someone else at Microsoft) -- would you like me to email someone about the warning that now appears in MRS 9.0.1 when saving data using a SQL Server compute context?
    Tuesday, January 3, 2017 11:10 PM
  • Bob,

Thanks, I hope someone at Microsoft looks into this issue and provides us with some comments/advice.

    /Grzegorz

    Wednesday, January 4, 2017 8:36 AM
  • Vijay,

Do you think you could provide any further comments regarding the observed row-by-row mode of inserting data into SQL tables? I checked this with the most recent version of SQL Server R Services (see my earlier posts in this thread), and it looks like the behaviour is the same.

    Thanks,

    Grzegorz

    Tuesday, January 10, 2017 1:51 PM
  • Great question!  Since INSERT INTO ... VALUES is limited to 1000 rows per statement, you can use dbBulkCopy from the rsqlserver package.

dbBulkCopy is a DBI extension that interfaces with bcp, Microsoft SQL Server's popular command-line bulk-copy utility, to quickly bulk copy large files into a table. For example:

    url = "Server=localhost;Database=TEST_RSQLSERVER;Trusted_Connection=True;"
    conn <- dbConnect('SqlServer',url=url)
## I assume the table already exists
    dbBulkCopy(conn,name='T_BULKCOPY',value=df,overwrite=TRUE)
    dbDisconnect(conn)




    OR


By writing the data to a CSV locally and then using a BULK INSERT (not readily available as a prebuilt function akin to sqlSave), you can write the data to MS SQL Server very quickly.

    toSQL = data.frame(...);
    write.table(toSQL,"C:\\export\\filename.txt",quote=FALSE,sep=",",row.names=FALSE,col.names=FALSE,append=FALSE);
        sqlQuery(channel,"BULK
                    INSERT Yada.dbo.yada
                    FROM '\\\\<server-that-SQL-server-can-see>\\export\\filename.txt'
                    WITH
                    (
                    FIELDTERMINATOR = ',',
                    ROWTERMINATOR = '\\n'
                    )");



SQL Server must have permission to access the network folder holding the CSV file, or else this process will not work. While it takes some setup with various permissions (access to the network folder and BULK ADMIN privileges), the reward in speed is well worth it.
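    For example, if the R session connects under a dedicated login, the bulk-load right can be granted like this (the login name is hypothetical):

    -- grant bulk-load rights to the login used from R
    GRANT ADMINISTER BULK OPERATIONS TO [rUserLogin];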



    Sunday, January 15, 2017 4:32 PM
  • Hi, thank you for your comments. However, I am still trying to find a solution that does the following:

    - invokes an rx* function in the SQL Server compute context (e.g., rxKmeans)

    - tells this function to store its output (a large number of rows) directly in a table, without going through temporary files (the "outFile" parameter pointing to a table)

    - has rows inserted into the table _fast_ (the currently observed row-by-row mode is unacceptable for performance reasons)

    This would be the most elegant solution, without the need to resort to external files and bulk copy. I am afraid it requires some changes to the implementation of the rx* functions on the side of MS. Or maybe I am missing something/doing something wrong?

    Tuesday, January 17, 2017 3:39 PM
  • This is NOT an answer.  You do not specify what inData is.  In my case, inData is a data.frame generated by my R code:

    ##establish compute context
    sqlServerConnString <- "Server=.;Database=External;Trusted_Connection=true"
    sqlServerCC <- RxInSqlServer(connectionString=sqlServerConnString)
    rxSetComputeContext(sqlServerCC)

    bbbb <- data.frame(a=rnorm(5),b=1)
    outDataSource <- RxSqlServerData(table="imp.USA_Cities_Map",connectionString=sqlServerConnString)
    rxDataStep(inData=bbbb,outFile=outDataSource,append="none")

    and this fails with "inData must be an RxSqlServerData data source for this compute context"

Well, if it were a SQL Server data source, then I wouldn't need to insert it into SQL in the first place.  So far, I am unable to take the output of my R code and insert it into a SQL Server table for each iteration of my for loop.



    Tuesday, August 8, 2017 7:27 PM
  • I had a similar problem ("inData must be an RxSqlServerData data source for this compute context").  I solved it by changing the compute context to local, since I was writing to a database from my machine.

    rxSetComputeContext("local")

    sqlServerConnString <- "Server=.;Database=External;Trusted_Connection=true"

    outDataSource <- RxSqlServerData(table="imp.USA_Cities_Map",connectionString=sqlServerConnString)

    rxDataStep(inData=bbbb,outFile=outDataSource,append="none")

    The above might work. 

    Tuesday, December 19, 2017 6:23 AM