How best to update every row in a massive table?

  • Question

  • Hi All,

    We've added 4 new columns to a fact table with 130,000,000 rows.  That fact table is clustered on an identity(1, 1) integer column.  I now need to update those 4 new columns by joining to other tables.  I can perform a single update statement that joins to 3 other tables or I can perform three separate updates with each joining to a single table.  I opted for the latter.

    So far, I've used a method of updating in batches of various sizes.  (Script below.)  It seems like a batch size of 200,000 rows gave the best performance when testing against smaller tables of 1 million and 10 million rows, so I'm using the same size batch when executing against the full table.

    The problem is that it's updating very slowly.  I let it run overnight and in 9 hours it only updated 2,250,000 rows.  I'm obviously doing something wrong.

    I disabled all destination table indexes except for the one SQL recommends when displaying an estimated execution plan.  It's faster, but I expect it would run well over 6 hours for this update.
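    For reference, a sketch of disabling a nonclustered index up front and rebuilding it once after the load (the index name is taken from the commented-out line in the script below; adjust names to the actual table):

    ```sql
    -- Disable the nonclustered index before the mass update ...
    ALTER INDEX ncix_TestTable_MemberDWID ON dbo.TestTable DISABLE;

    -- ... run the batched updates here ...

    -- ... then rebuild it once at the end, rather than once per batch.
    ALTER INDEX ncix_TestTable_MemberDWID ON dbo.TestTable REBUILD;
    ```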

    This database is on SQL 2012 and uses the simple recovery model.  The transaction log is enormous and, using dbcc sqlperf(logspace), I've yet to see it use over 2% while this loop executes.  This is why I thought to raise the batch size to 500,000, but that didn't improve performance.

    Should I be rebuilding that index within each loop iteration?  Is the sort order of my clustered index contributing to the slow performance?  Should I keep my batch size below 5000, as recommended at

    Is there a better way?  I was considering creating a temp table #1 that only includes my PK integer, varchar value to look up, and the empty column to update, all clustered on the value to look up.  Then create a temp table #2 of only 2 columns of my lookup table, clustered on the value to look up and the other column being the update value.  Then perform update of temp table #1 using temp table #2.  Then re-cluster temp table #1 on the PK integer.  Then update the destination table using temp table #1, which should execute faster because they're both clustered on the identity integer column.
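    A minimal sketch of that staging idea, with hypothetical table and column names modeled on the description above (the real key, lookup, and value columns are not named in the question):

    ```sql
    -- Temp table #1: PK, lookup value, and the empty column, clustered on the lookup value.
    CREATE TABLE #dest (RowID int NOT NULL, LookupVal varchar(50) NOT NULL, MemberDWID int NULL);
    CREATE CLUSTERED INDEX cix1 ON #dest (LookupVal);

    INSERT INTO #dest (RowID, LookupVal)
    SELECT RowID, LookupVal FROM dbo.DestTable;

    -- Temp table #2: two columns of the lookup table, clustered the same way.
    CREATE TABLE #lookup (LookupVal varchar(50) NOT NULL, MemberDWID int NOT NULL);
    CREATE CLUSTERED INDEX cix2 ON #lookup (LookupVal);

    INSERT INTO #lookup (LookupVal, MemberDWID)
    SELECT LookupVal, MemberDWID FROM dbo.MemberEnrollment;

    -- Update #1 from #2; both sides are ordered on LookupVal.
    UPDATE d
    SET    d.MemberDWID = l.MemberDWID
    FROM   #dest AS d
           INNER JOIN #lookup AS l ON d.LookupVal = l.LookupVal;

    -- Re-cluster #1 on the PK before the final update of the destination table.
    CREATE CLUSTERED INDEX cix1 ON #dest (RowID) WITH (DROP_EXISTING = ON);

    UPDATE t
    SET    t.MemberDWID = d.MemberDWID
    FROM   dbo.DestTable AS t
           INNER JOIN #dest AS d ON t.ID = d.RowID;
    ```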

    I could try this 100 different ways for 50 combined hours to find the most performant method, but I might be missing some critical concepts.

    Eric B.

    SET NOCOUNT ON;

    DECLARE @batchSize   INT = 200000;
    DECLARE @rowsUpdated INT = @batchSize,
            @totalRows   INT = 0;

    WHILE (@batchSize = @rowsUpdated)
    BEGIN
        UPDATE TOP (@batchSize) t
        SET    t.MemberDWID = enroll.MemberDWID
        FROM   dbo.DestTable AS t
               INNER JOIN dbo.MemberEnrollment AS enroll
                   ON t.EnrollID = enroll.MemberEnrollID
        WHERE  t.MemberDWID IS NULL;

        SET @rowsUpdated = @@ROWCOUNT;
        SET @totalRows = @totalRows + @rowsUpdated;

        --ALTER INDEX ncix_TestTable_MemberDWID ON dbo.TestTable REBUILD;
    END;

    CHECKPOINT;

    • Edited by SQL Server dude Friday, June 14, 2019 9:56 PM obfuscated table name
    Friday, June 14, 2019 9:55 PM

All replies

  • The reason this is slow is that each iteration needs a full table scan to find the rows where MemberDWID is NULL. Well, not if you added an index on that column, but then you pay the cost of maintaining that index throughout the update.

    Use this pattern instead:

    CREATE OR ALTER PROCEDURE UpdateAbsolute2 @chunksize int AS

       DECLARE @minID int,
               @maxID int

       SELECT @minID = MIN(OrderID) FROM BigDetails WHERE NewCol IS NULL

       WHILE @minID IS NOT NULL
       BEGIN
          SELECT @maxID = MAX(OrderID)
          FROM   (SELECT TOP(@chunksize) OrderID
                  FROM   BigDetails
                  WHERE  OrderID >= @minID
                  ORDER  BY OrderID ASC) AS B

          EXEC sp_executesql
               N'UPDATE dbo.BigDetails
                 SET    NewCol = ''Some value''
                 WHERE  OrderID BETWEEN @minID AND @maxID',
               N'@minID int, @maxID int', @minID, @maxID

          SELECT @minID = MIN(OrderID) FROM BigDetails WHERE OrderID > @maxID
       END

    Here I am iterating over the clustered index to avoid the expensive scan. The point of the dynamic SQL is that it permits the optimizer to sniff the parameter values and produce a better plan. Rather than using dynamic SQL, you can simply tack OPTION (RECOMPILE) onto the end of the UPDATE.
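    A sketch of that static-SQL variant of the chunk update, with the hint appended:

    ```sql
    UPDATE dbo.BigDetails
    SET    NewCol = 'Some value'
    WHERE  OrderID BETWEEN @minID AND @maxID
    OPTION (RECOMPILE);
    ```

    OPTION (RECOMPILE) compiles a fresh plan for each chunk using the current values of @minID and @maxID, which serves the same purpose as the parameter sniffing in the dynamic-SQL version.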

    It's probably better to run a single loop for all three columns at once.
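    Applied to the script in the question, one chunked statement could fill all the new columns at once. The second and third lookup tables and their join keys below are placeholders, since the question does not name them:

    ```sql
    UPDATE t
    SET    t.MemberDWID = enroll.MemberDWID,
           t.NewCol2    = l2.SomeValue,    -- placeholder table/column names
           t.NewCol3    = l3.OtherValue    -- placeholder table/column names
    FROM   dbo.DestTable AS t
           INNER JOIN dbo.MemberEnrollment AS enroll ON t.EnrollID = enroll.MemberEnrollID
           INNER JOIN dbo.Lookup2 AS l2 ON t.Lookup2ID = l2.ID
           INNER JOIN dbo.Lookup3 AS l3 ON t.Lookup3ID = l3.ID
    WHERE  t.ID BETWEEN @minID AND @maxID
    OPTION (RECOMPILE);
    ```

    If a lookup table may not match every row, use LEFT JOIN instead, so that rows without a match are not silently skipped.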

    This is from a presentation that I have, entitled "Don't Bite Off More Than You Can Chew - Take It in Chunks". The next time I will give this presentation is at SQLSaturday in Oslo on Aug 31st.

    Erland Sommarskog, SQL Server MVP,

    Saturday, June 15, 2019 8:31 AM