100 million row update - avoid table scan?

  • Question

  • Hello everyone.

    I need to perform a one-off update as follows:

     

    update f
    set f.propertykey = isnull(p.primarykey,1),
    	f.tenancykey = isnull(dt.primarykey,1),
    	f.householdkey = isnull(dt.householdkey,1)
    from dw.fact_job as f
    	inner join prestage.ext_serv_site_4 as s on s.serv_site = f.servsite
    	inner join dw.dim_date as d on f.CallLogDateKey = d.DateKey
    	left join dw.dim_property as p on s.client_reference = p.propref and d.Datecalendar >= p.DateFrom and (p.DateTo > d.datecalendar or p.dateto is null)
    	left join prestage.ext_tenagree as t on t.prop_ref = p.propref and d.Datecalendar >= t.cot and (d.Datecalendar <= t.eot or t.eot = '1 Jan 1900')
    	left join dw.dim_tenancy as dt on t.tag_ref = dt.tenagreeref and d.Datecalendar >= dt.DateFrom and (d.Datecalendar < dt.DateTo or dt.DateTo is null)
    where f.propertykey != p.primarykey
    	or f.tenancykey != dt.primarykey

     

    The problem is that the dw.fact_job table is over 100 million rows and the columns I'm joining onto are not indexed, so the above query is taking forever to run.  I'm a bit of a novice when it comes to execution plans, but I can see that the query causes a table scan on dw.fact_job.  There's also a table scan on serv_site_4, but that's a much smaller table, only 17,000 rows.  The other tables are slightly bigger, maybe 50,000 rows, but they are indexed and the indexes are being used by the query (clustered index seeks).

    I'm out of my depth here, so I'm looking for any advice on how I can break this huge task up into manageable chunks.  Is there some kind of intermediate result I can write out to another table, index, and then use to perform the update afterwards?   Does anyone need any more information?

    I'm on SQL Server 2005 Standard Edition.

    Any advice or guidance much appreciated.

    Thanks!

     

     

     

    Wednesday, January 5, 2011 10:07 PM

Answers

  • Check a prior thread on this topic on how to do batch updates. Still, even with your explanation it's not entirely clear to me, and looking at the query I'm still confused.

    One possibility is to select the needed info into a temp table (PK, new values for columns) and then perform batch updates based on this temp table. That's as much as I can suggest so far.
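
    A minimal sketch of that temp-table approach, assuming dw.fact_job has a single-column primary key (called fact_job_key here purely for illustration - substitute the real key column(s), and check first that the joins return exactly one row per fact key):

    -- 1. Work out the corrected surrogate keys once and keep only the rows
    --    that actually need to change.
    select f.fact_job_key,                           -- hypothetical fact PK column
    	isnull(p.primarykey, 1) as new_propertykey,
    	isnull(dt.primarykey, 1) as new_tenancykey,
    	isnull(dt.householdkey, 1) as new_householdkey
    into #fix
    from dw.fact_job as f
    	inner join prestage.ext_serv_site_4 as s on s.serv_site = f.servsite
    	inner join dw.dim_date as d on f.CallLogDateKey = d.DateKey
    	left join dw.dim_property as p on s.client_reference = p.propref and d.Datecalendar >= p.DateFrom and (p.DateTo > d.Datecalendar or p.DateTo is null)
    	left join prestage.ext_tenagree as t on t.prop_ref = p.propref and d.Datecalendar >= t.cot and (d.Datecalendar <= t.eot or t.eot = '1 Jan 1900')
    	left join dw.dim_tenancy as dt on t.tag_ref = dt.tenagreeref and d.Datecalendar >= dt.DateFrom and (d.Datecalendar < dt.DateTo or dt.DateTo is null)
    where f.propertykey != isnull(p.primarykey, 1)
    	or f.tenancykey != isnull(dt.primarykey, 1)
    	or f.householdkey != isnull(dt.householdkey, 1)

    -- 2. Index the work table so the batched update can seek on the fact key.
    create clustered index IX_fix on #fix (fact_job_key)

    -- 3. Apply the corrections 10,000 rows at a time.  The WHERE clause keeps
    --    excluding rows that already hold the corrected values, so each pass
    --    makes progress and the loop stops once nothing is left to update.
    --    (Rows whose key columns are currently NULL are missed by the != test,
    --    just as they are by the WHERE clause in the original statement.)
    while 1 = 1
    begin
    	update top (10000) f
    	set f.propertykey = x.new_propertykey,
    		f.tenancykey = x.new_tenancykey,
    		f.householdkey = x.new_householdkey
    	from dw.fact_job as f
    		inner join #fix as x on x.fact_job_key = f.fact_job_key
    	where f.propertykey != x.new_propertykey
    		or f.tenancykey != x.new_tenancykey
    		or f.householdkey != x.new_householdkey

    	if @@rowcount = 0
    		break
    end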


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    • Marked as answer by Ai-hua Qiu Thursday, January 13, 2011 9:05 AM
    Wednesday, January 5, 2011 10:45 PM
    Moderator

All replies

  • Are you at liberty to add indexes? If yes, you need indexes on the columns you JOIN on. Also, the WHERE condition looks very strange to me, as do the LEFT JOIN conditions.
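
    For example, something along these lines (the index names are made up, the exact column choices would need checking against the actual plan, and building an index on a 100-million-row table is itself a long-running operation):

    -- Illustrative only: index the columns the big query joins on.
    create index IX_fact_job_servsite on dw.fact_job (servsite)
    create index IX_ext_serv_site_4_serv_site on prestage.ext_serv_site_4 (serv_site)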

    You will also need to change this update statement into a loop of 10K changes at a time - it should perform better than one huge update. But the whole query already looks strange to me.

    Can you add comments to the query code first to explain what you want to achieve with it?

    If, for now, you change the UPDATE into a SELECT and add the PK of the fact table, does it return duplicates (e.g. multiple records per fact table PK)?


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Wednesday, January 5, 2011 10:18 PM
    Moderator
  • Hello Naom,

    thanks for your quick response; I'll try to answer as best I can.

    Re the WHERE clause: with it, I counted that there would be 39 million updates instead of the whole table, which is 104 million rows.  It seemed to make no difference to the execution plan, so I left it in, thinking that fewer updates would be required.  Correct my thinking if this clause serves no purpose.

    I agree the left join conditions are horrible. I'm essentially trying to update all the surrogate keys in my fact table, dw.fact_job.  The surrogate keys are the property, tenancy and household keys.

    In the first left join, I need the column called primarykey from dw.dim_property.  This is an OLAP dimension table where propref is the business key and datefrom and dateto mark the validity period for the slowly changing dimension. So there are multiple versions of each propref, each with its own attributes, uniquely indexed by primarykey.

    The second and third left joins do a similar thing to get the tenancy and household keys.  In the second left join, I need to grab the correct tenancy record based on the log date of the job record; this join gets me the tag_ref.  The third left join uses this tenancy reference and joins onto the tenancy dimension, which is also slowly changing, hence the conditions on datefrom and dateto again, like the join on the property dimension.
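
    To make that concrete, this is roughly how the property key lookup behaves for a single job; the propref and date values below are made up for illustration:

    -- Illustration only: resolve the property surrogate key for one job whose
    -- log date falls inside a dimension row's validity window.
    select p.primarykey
    from dw.dim_property as p
    where p.propref = 'PROP0001'              -- hypothetical business key
    	and p.DateFrom <= '20100615'          -- hypothetical job log date
    	and (p.DateTo > '20100615' or p.DateTo is null)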

    I'm doing this one-off update because I realised that in my ETL, a lot of the surrogate keys have been populated incorrectly.

    ****

    I'm wondering how I can change the query into a loop of 10k changes at a time.  Could you elaborate a little?

    Thanks!

    Wednesday, January 5, 2011 10:36 PM