Best Practice for big data queries

    Question

  • Hi there,

    I have a big source table (110M rows, 8 GB) that I would like to process into another big destination table. As it turns out, time-wise this approach is a lot slower than dividing the destination table up into several smaller destination tables. But that approach leads to a lot of SQL code that I would like to avoid.

    The queries all follow a similar pattern:

        insert into Temp_Table_1
        select col1, col2, col3, col4, col5, sum(col6) as Loss
        from BigTable
        where id in ('1')
        group by col1, col2, col3, col4, col5
        ;
    
        insert into Temp_Table_2
        select col1, col2, col3, col4, col5, sum(col6) as Loss
        from BigTable
        where id in ('1', '2')
        group by col1, col2, col3, col4, col5
        ;
    
     -- 10 more queries
    

    I tried several approaches, like creating an index on BigTable or creating a separate table that holds an index for each {col1, col2, col3, col4, col5} grouping. But they all take too much time to process.

    Do any of the SQL Server Zens have some common practices that I could profit from?

    The target machine is an 8-core machine with 32 GB of RAM. The SQL Server version is 2008 R2.

    Thanks ahead,

    Christian

    23 March 2012 15:16


All Replies

  • Hi, just sharing some thoughts... I would try the following:

    If all queries look like those, use an intermediate table with the results for each id (that you can use to populate the other tables):

    insert into Temp_Table
    select id, col1, col2, col3, col4, col5, sum(col6) as Loss
    from BigTable
    where id in ('1', '2', '3') -- and all other possible ids
    group by id, col1, col2, col3, col4, col5

    I would also try creating a non-clustered index on columns id, col1, col2, col3, col4, and col5, with col6 as an included column.

    But again, I would have to try it to see the actual performance.

    David.

    23 March 2012 15:29
  • Christian,

    110M rows is a big table, but not really big enough to cause severe performance issues if you have good indexing. Do you have an index on the "id" column used in the where clause? Is it a covering index? Does the execution plan show that the index is being used appropriately?
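
    For example (a quick sketch reusing the first query from the question), you can turn on I/O statistics and capture the actual execution plan to see whether the id index is actually used:

        set statistics io on;

        select col1, col2, col3, col4, col5, sum(col6) as Loss
        from BigTable
        where id in ('1')
        group by col1, col2, col3, col4, col5;

        set statistics io off;
        -- In SSMS, enable "Include Actual Execution Plan" (Ctrl+M) before running;
        -- the plan should show a seek on the id index rather than a full scan, and the
        -- logical reads reported above should drop once a covering index is in place.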

    You have multiple queries that all select from the same BigTable and put rows into different staging tables based on different where clauses. You might want to consolidate this process into an SSIS package with one select statement whose filter is wide enough to incorporate all the individual filters, and then use a combination of Multicast and Conditional Split transformations to populate the different staging tables as per your filters.
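
    The data-flow source for such a package could be a single query like this (just a sketch; the id list is illustrative and would need to cover every staging table's filter):

        select id, col1, col2, col3, col4, col5, col6
        from BigTable
        where id in ('1', '2', '3')  -- ...plus all other ids used by the staging tables

    The Multicast/Conditional Split transformations can then route each row to every staging table whose filter it matches, so BigTable is read only once.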

    Finally, you might want to consider table partitioning, if possible, to limit the number of rows your index has to seek/scan through.

    http://msdn.microsoft.com/en-us/library/ms190787.aspx
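
    A minimal sketch of what partitioning by id might look like (this assumes id is a character column, as the quoted literals in the question suggest, and that all partitions can live on the PRIMARY filegroup; the boundary values are illustrative only):

        create partition function pf_BigTable_id (varchar(10))
        as range left for values ('1', '2', '3', '4', '5');

        create partition scheme ps_BigTable_id
        as partition pf_BigTable_id all to ([PRIMARY]);

        -- Rebuilding the clustered index on the partition scheme places each id range in its
        -- own partition, so per-id queries only touch the relevant partition(s), e.g.:
        -- create clustered index cix_BigTable_id on BigTable (id) on ps_BigTable_id (id);
        -- (use WITH (DROP_EXISTING = ON) if the table already has a clustered index)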

    Hope this helps!



    Sanil Mhatre | Database Developer | MCTS | If you find my reply useful in any way, please vote it as helpful. If it has helped answer your question, please mark it as Answer.


    23 March 2012 15:45
  • Thanks David, a couple of questions:

    * What do you mean by "intermediate" table? Is that an actual table or a variable?

    * What's the syntax for creating such a non-clustered index? I got this:

    create index p1 on BigTable(col1,col2,col3,col4,col5) include col6

    Christian

    23 March 2012 15:49
  • I mean an actual table that will hold the results for each id (Temp_Table in my sample code). Then you can use this table to populate the other tables, e.g.:

    insert into Temp_Table_2
    select col1, col2, col3, col4, col5, sum(col6) as Loss
    from Temp_Table
    where id in ('1', '2')
    group by col1, col2, col3, col4, col5

    This should be much faster since this table should be much smaller.

    Regarding the index (don't forget id column):

    create index p1 on BigTable (id,col1,col2,col3,col4,col5) include (col6)

    David.



    23 March 2012 15:56
  • Thanks Sanil, I'll do some more testing next week.

    23 March 2012 23:11
  • Hi Chenning,

    If you simply want to insert rows into another table, then I suggest using the Import utility of SQL Server; if you are trying to do some processing in between, then the best approach is to do it in batches.
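
    A minimal sketch of batched processing, assuming BigTable has an ever-increasing integer key column and a destination table already exists (pk and Dest_Table below are purely illustrative names; the thread does not show the real schema):

        declare @BatchSize int = 500000;
        declare @LastId int = 0;
        declare @MaxId int = (select max(pk) from BigTable);

        while @LastId < @MaxId
        begin
            insert into Dest_Table (col1, col2, col3, col4, col5, col6)
            select col1, col2, col3, col4, col5, col6
            from BigTable
            where pk > @LastId and pk <= @LastId + @BatchSize;

            set @LastId = @LastId + @BatchSize;
        end

    Keeping each insert to a few hundred thousand rows keeps individual transactions, locks, and log growth small.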

    The best approach for both is to use SSIS.

    Hope it will help you.


    ------------------ Please mark my post as an answer if I helped you to resolve the issue, or vote it up if I helped you. Thanks. Regards and good wishes, Maqbool.

    24 March 2012 5:24