UPDATE logic with poor performance - SQL Server 2012

    Question

  • Hi,

    in order to update a fact table for a massive test, I'm using this logic:

    DECLARE @ind as int = 1
    DECLARE @RecCount as int = (SELECT COUNT(*) FROM myFACTS)
    DECLARE @Subsegment as varchar(3) = null
    DECLARE @Rand_Num as smallint
    
    WHILE @ind <= @RecCount
      begin
        -- random Row_Num between 1 and 49 (the cast truncates rand()*49 to 0-48)
        set @Rand_Num = cast((rand() * 49) as smallint) + 1
    
        SELECT @Subsegment = CODE FROM SUBSEGMENTS
            WHERE Row_Num = @Rand_Num
    
        update myFACTS
            set Subsegment = @Subsegment
            where Row_Number = @ind
    
        set @ind = @ind + 1
      end

    The SUBSEGMENTS table has 49 rows, so by generating a random number I pick a subsegment to set in the myFACTS table.

    I'm creating test data.

    The myFACTS table has more than 3 million rows. This logic had been running for 45-50 minutes when I stopped the execution.

    I've also created some non-clustered indexes suggested by the Tuning Advisor, but performance doesn't improve.

    Any suggestions, please?

    Friday, July 26, 2013 8:10 PM

Answers

  • 8 and 1/2 hours seems awfully slow, even given that you are doing this in a loop instead of a set-based update.  You might want to check whether something is blocking your update.  Also make sure you have the correct indexes: you will definitely want an index on the Row_Number column in myFACTS, and also on the Row_Num column in SUBSEGMENTS.
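
    One quick way to check for blocking while the loop runs is a standard DMV query; and the two supporting indexes would look something like this (the index names are just examples):

    -- sessions currently blocked by another session
    select session_id, blocking_session_id, wait_type, wait_time
    from sys.dm_exec_requests
    where blocking_session_id <> 0;

    -- supporting indexes for the two lookups in the loop
    create index IX_myFACTS_Row_Number on myFACTS(Row_Number);
    create index IX_SUBSEGMENTS_Row_Num on SUBSEGMENTS(Row_Num);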

    The other thing I would suggest is don't try the whole table until you know whatever you are doing works with small numbers of rows.  Test it with 100 rows first.  When that's working, test with 10,000 rows.  When 10,000 rows is working and you have an idea how long that takes, you will have an idea how long 3,000,000 rows will take (almost certainly 300 times as long as the 10,000 rows took, and maybe longer).
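
    For instance, to cap the loop version for a timing test you could just change the @RecCount initialization (a sketch, assuming Row_Number values start at 1):

    -- time a 10,000-row slice instead of the full table
    DECLARE @RecCount as int = 10000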

    Sorry, I didn't read what you were doing closely enough.  But I would still want to do this as a set based solution, not a while loop that ran through 3,000,000 rows. 

    I would do something like the following.  I assume your myFACTS table has a primary key.  The following code assumes it is an integer named Row_Number; if it is something else, just replace that name in the code below.

    create table #random(Row_Number int primary key, rownbr int);
    
    -- give every fact row a random number from 1 to 49
    insert #random(Row_Number, rownbr)
    select Row_Number, abs(cast(newid() as binary(6)) % 49) + 1
    from myFACTS;
    
    -- translate each random number into the matching subsegment code
    merge myFACTS f
    using (select r.Row_Number, s.CODE
           from #random r
           inner join SUBSEGMENTS s on r.rownbr = s.Row_Num) as u
    on f.Row_Number = u.Row_Number
    when matched then update set Subsegment = u.CODE;
    
    drop table #random;

    The above code creates a temp table, loads it with the Row_Number values from myFACTS, and assigns each row a random number from 1 to 49.  It then uses that table to assign to each row in myFACTS the corresponding value from the CODE column in SUBSEGMENTS.
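
    As a quick sanity check afterwards, something like this should show a roughly even spread across the 49 codes:

    select Subsegment, count(*) as cnt
    from myFACTS
    group by Subsegment
    order by Subsegment;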

    Tom

    • Marked as answer by pscorca Saturday, July 27, 2013 2:04 PM
    Saturday, July 27, 2013 6:19 AM

All replies

  • Fill the random numbers into a temp table and update myFACTS by joining to that temp table.

    Thanks and Regards, Prajesh

    Friday, July 26, 2013 8:55 PM
  • Should I fill the random numbers into a temp table having the same number of rows as myFACTS, that is, more than 3 million rows?! Why would I implement this solution? It doesn't seem very sound.

    Thanks

    Friday, July 26, 2013 9:49 PM
  • OK, Thanks

    Thanks and Regards, Prajesh

    Friday, July 26, 2013 10:09 PM
  • The NEWID() function generates a random GUID for each row.  You can use that to generate a random integer between 1 and 49 with

    abs(cast(newid() as binary(6)) % 49) + 1

    So instead of doing this with a loop, you can do it with a single update statement, which will be much faster.

    update myFACTS
    set Subsegment = abs(cast(newid() as binary(6)) % 49) + 1

    Tom

    • Proposed as answer by Prajesh Friday, July 26, 2013 11:14 PM
    • Unproposed as answer by pscorca Saturday, July 27, 2013 4:22 AM
    Friday, July 26, 2013 11:06 PM
  • Hi Tom,

    I think it's an interesting topic and you may want to create a TechNet WiKi article out of it and participate in T-SQL Guru Content (see the sticky thread in this forum).

    Saturday, July 27, 2013 12:38 AM
  • Hi Naomi,

    Thanks for the suggestion, but I'm not a WiKi writer.  And the idea is definitely not original with me.  I'm not sure how far back it goes, but at least to Pinal Dave's blog in 2007 at http://blog.sqlauthority.com/2007/04/29/sql-server-random-number-generator-script-sql-query/

    Tom

    P.S.  It's nice seeing you back here in the forum answering questions.

    Saturday, July 27, 2013 3:24 AM
  • Hi Tom, thanks for your suggestion, but the goal is to randomly pick the code value of a subsegment, not to assign a random number as the subsegment in the myFACTS table.

    However, I had already tried to write the UPDATE statement for myFACTS without a loop; the problem is that each myFACTS row must be able to get a different subsegment.

    Meanwhile, my logic is still executing after 8 hours and 30 minutes, to update more than 3 million rows!

    Thanks


    • Edited by pscorca Saturday, July 27, 2013 4:44 AM errata corrige
    Saturday, July 27, 2013 4:29 AM
  • Thanks Tom, I've created the indexes suggested by the Tuning Advisor: two non-clustered indexes on myFACTS (one on Subsegment and one on Row_Number) and one non-clustered index on Row_Num in the SUBSEGMENTS table.

    The myFACTS table doesn't have a primary key: the Row_Number column was added afterwards and populated with a separate UPDATE statement. I want to create a primary key on the Row_Number column of myFACTS and then run my logic again.

    However, the execution completed in almost ten hours!

    Saturday, July 27, 2013 9:45 AM
  • Hi, I've added Row_Number as a primary key for myFACTS.

    The update against 50,000 rows takes almost 8 minutes; against 100,000 rows it takes almost 16 minutes, so the expected time for 3 million rows is almost 8 hours (30 × 16 minutes = 480 minutes). Too much time!

    Saturday, July 27, 2013 10:59 AM
  • Hi Tom, I've followed your code, but using a table variable instead of a temp table. The complete update takes 1 minute versus almost ten hours! How is that possible?! The difference is enormous! It's absurd!
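
    The table-variable variant looks something like this (the same logic, with @random in place of #random):

    declare @random table (Row_Number int primary key, rownbr int);

    insert @random(Row_Number, rownbr)
    select Row_Number, abs(cast(newid() as binary(6)) % 49) + 1
    from myFACTS;

    merge myFACTS f
    using (select r.Row_Number, s.CODE
           from @random r
           inner join SUBSEGMENTS s on r.rownbr = s.Row_Num) as u
    on f.Row_Number = u.Row_Number
    when matched then update set Subsegment = u.CODE;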

    Now, does the RAND() function have poor performance?

    Is the MERGE statement so much faster than UPDATE?

    I can't believe what I'm seeing!

    Thanks

    Saturday, July 27, 2013 11:45 AM
  • It's not that RAND() has poor performance.  I used NewID() instead of RAND() in my method because if you try in a single statement to apply RAND() to multiple rows, RAND() is only called once.  But you want something that is called once per row.  If you run

    select name, rand() from sys.objects;
    select name, newid() from sys.objects;

    you will see that with the RAND() function, every row gets the same value (because it is only called one time).  But with the NEWID() function, you get a new random value for each row (because it is called once per row). 
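
    If you do want RAND()'s 0-to-1 style output once per row, a common workaround (essentially the idiom from the Pinal Dave post linked earlier in this thread) is to seed RAND() from NEWID():

    -- seeding from NEWID() forces a fresh RAND() value on every row
    select name, rand(checksum(newid())) as per_row_rand
    from sys.objects;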

    MERGE is not necessarily more efficient than UPDATE.  If you rewrote that MERGE as the equivalent update statement, it should run essentially as fast as the MERGE.
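
    For example, the equivalent set-based UPDATE (same tables and columns as in my earlier code) would look something like:

    update f
    set Subsegment = s.CODE
    from myFACTS f
    inner join #random r on f.Row_Number = r.Row_Number
    inner join SUBSEGMENTS s on r.rownbr = s.Row_Num;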

    The cause of the difference here is writing the code as a set-based operation instead of a loop.  Microsoft has put a great deal of effort into optimizing SQL for set-based operations and, as you have seen, the difference in performance can be very dramatic.  I've seen bigger differences than this: 10 hours is about 600 times longer than 1 minute, and I've seen cases where rewriting the code as a set-based operation made it run several thousand times faster.

    For completeness' sake, I should note that there are a few things that can only be done with a loop or a cursor doing row-by-row operations, and a few things that can be done either as a set-based operation or row by row where the row-by-row method is faster.  But these cases are very few.  There used to be more, but with recent enhancements to SQL, particularly the windowing functions, there are a lot fewer cases today.

    Tom

    Saturday, July 27, 2013 1:19 PM
  • Hi Tom, thanks for your explanation! The performance difference is really large, and perhaps (I say perhaps) there could be an additional reason that explains it. In my experience I've seen performance improvements of 20% to 50-70%, not improvements this dramatic. When I use a loop or a cursor, performance doesn't usually worsen as much as it did in this case.

    Finally, in your sample the different behaviour of the RAND() and NEWID() functions isn't a good thing in my opinion. They should behave the same way: either return a value for each row, or return a single value once for all the data.

    Thanks for your help.

    Saturday, July 27, 2013 2:04 PM
  • Finally, in your sample the different behaviour of the RAND() and NEWID() functions isn't a good thing in my opinion. They should behave the same way: either return a value for each row, or return a single value once for all the data.


    In general, if you do a statement like

    select o1.name, sysdatetime() as mytime from sys.columns o1 cross join sys.columns o2

    which gets the result of a scalar function for multiple rows, that function is only called once.  This is done for efficiency.  So in the above statement mytime will have the exact same value for every row even though the query takes several seconds to run. 

    NEWID() is an exception to this rule.  The reason is that the original purpose of NEWID() was to provide unique values: it was developed to provide values for uniqueidentifier (GUID) columns where the GUID is a primary key.  So you can use code like

    Create Table #Foo(MyKey uniqueidentifier primary key, <other columns>);
    Insert #Foo(MyKey, <other columns>)
    Select NewID(), <other columns> From MyTable;

    and get a unique value in MyKey for each row.

    Tom

    Saturday, July 27, 2013 3:35 PM
  • Hi Tom,

    if I create a scalar function that, for example, adds a fixed number to an amount, and I run

    SELECT dbo.fn_SUM(Amount) FROM myTable

    the scalar function is executed row by row.

    It seems that the RAND() function is the exception, not NEWID(). The behaviour of the NEWID() function is the same as my scalar function ...
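
    To make the example concrete, a minimal sketch of such a function (the fixed increment here is just a placeholder):

    create function dbo.fn_SUM (@Amount money)
    returns money
    as
    begin
        -- add a fixed number to the amount
        return @Amount + 10;
    end;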

    Monday, July 29, 2013 7:18 AM
  • I was trying to explain why RAND() and NEWID() act differently, not to give a complete description of when and how many times a function is called.  So, yes, I would agree my answer was incomplete.  The following may be a little more complete, but it isn't the complete story either.

    First, no matter what the best way for RAND() to work is (and I would agree, it probably would have been better if it had been defined to return a new random value for each row), it wasn't done that way at the beginning.  And it would be difficult for Microsoft to change the behavior now, as it might break existing code for some users.

    When and how many times a function gets called is intentionally not fully defined.  SQL does this so that it is allowed to do things as efficiently as possible.  The only rule is that the function must do what the documentation says it does.  So if SQL can figure out a way to call it only once, SQL is allowed to do that.  If SQL decides to call it once per row, SQL is allowed to do that.

    Obviously, in general, if a function takes a parameter and that parameter comes from a column in the database, SQL will call the function for each row.  But even that is not guaranteed; SQL just has to return the correct value.  So in

    SELECT MyColumn, LOG(MyColumn), GETDATE(), dbo.fn_SUM(Amount), 2 * dbo.fn_SUM(Amount) FROM ...

    the LOG function will probably be called for each row, the GETDATE function will be called once, and the dbo.fn_SUM function will probably be called either once per row or twice per row.  So one release of SQL might call it twice per row, and a newer release might figure out that it is more efficient to call it only once per row and then use that value and 2 times that value.  The fact that SQL can do this with user-defined functions is the reason there are so many restrictions on what a function can do (can't update tables, can't call stored procedures, etc.).

    In any case, RAND() works one way and NEWID() works the other.  You could, however, use RAND() with a seed value that comes from a column, for example RAND(Row_Number).  That forces RAND() to be called once per row.  It has a side effect that for your purposes might be a "feature", might be a "bug", or might not matter: for a given value of Row_Number, you will always get the same result.  And, at least if Row_Number is an integer, two different values of Row_Number that are nearly the same (say 1 and 8) will tend to get RAND values that are close together.  For example

    RAND(8) - RAND(1) = 0.000130430798807368
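
    A quick way to see both effects (one call per row, but repeatable values, with nearby seeds giving nearby values), assuming the SUBSEGMENTS table from earlier in this thread:

    -- each row gets its own RAND() call, but re-running the query
    -- returns exactly the same values
    select Row_Num, rand(Row_Num) as seeded_rand
    from SUBSEGMENTS
    order by Row_Num;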

    Tom

    Monday, July 29, 2013 2:12 PM