Best Practice for big data queries
-
23 Maret 2012 15:16
Hi there,
I have a big source table ( 110M rows, 8GB ) that I would like to process into another big destination table. As it turns out timewise such approach is a lot slower than dividing up the destination table into several smaller dest. tables. But this approach leads to a lot of SQL code that I would like to avoid.
The queries follow a similar pattern like:
insert into Temp_Table_1 select col1, col2, col3, col4, col5, sum(col6) as Loss from BigTable where id in ('1') group by col1, col2, col3, col4, col5 ; insert into Temp_Table_2 select col1, col2, col3, col4, col5, sum(col6) as Loss from BigTable where id in ('1', '2') group by col1, col2, col3, col4, col5 ; -- 10 more queries
I tried several approaches like creating an index for BigTable or creating a separate table that holds an index for each {col1, col2, col3, col4, col5} groupings. But they all take too much time to process.
Does any of the SQL Server Zens have some common practices that I could profit from?
The target machine is an 8 core machine with 32GB of RAM. SQL Server version is 2008 R2.
Thanks ahead,
Christian
Semua Balasan
-
23 Maret 2012 15:29
Hi, just sharing some thoughts.. I would try the following:
If all queries look like those, use an intermediate table with the results for each id (that you can use to populate the other tables):
insert into Temp_Table select id, col1, col2, col3, col4, col5, sum(col6) as Loss from BigTable where id in ('1', '2', '3') -- and all other possible ids group by id, col1, col2, col3, col4, col5I would also try creating a non-clustered index on columns id, col1, col2, col3, col4 and col5 and col6 as included column.
But again, I would have to try it to see the actual performance.
David.
-
23 Maret 2012 15:45
Christian,
110M rows is a big table, but not really too big to cause severe performance issues if you gave good indexing. do you have an index on the column "id" that is used in the where clause , is it a covering index ? does the execution plan show that the index is being used appropriately ?
you have multiple queries that seem to be selecting from the same bigtable and putting rows in different staging tables based on different where clauses. You might want to consolidate this process into an SSIS package with one select statement that has a wide enough filter to incorporate all the filters and then use a combination of multicase and conditional split to populate the different staging tables as per your filters.
finally, you might want to consider table partitions if possible to limit the number of rows your index has to seek/scan through.
http://msdn.microsoft.com/en-us/library/ms190787.aspx
hope this helps !
Sanil Mhatre | Database Developer | MCTS | If you find my reply useful in any way, please vote it as helpful. If it has helped answer your question, please mark it as Answer.
- Diedit oleh Sanil Mhatre 23 Maret 2012 15:47 added link
-
23 Maret 2012 15:49
Thanks David, a couple of questions:
* What do you mean by "intermediate" table? Is that an actual table or a variable?
* What's the syntax for creating such a non-clustered index? I got this
create index p1 on BigTable(col1,col2,col3,col4,col5) include col6
Christian
-
23 Maret 2012 15:56
I mean an actual table that will hold the results for each id (Temp_Table in my sample code). Then you can use this table to populate the other tables, e.g.:
insert into Temp_Table_2 select col1, col2, col3, col4, col5, sum(col6) as Loss from Temp_Table where id in ('1', '2') group by col1, col2, col3, col4, col5This should be much faster since this table should be much smaller.
Regarding the index (don't forget id column):
create index p1 on BigTable (id,col1,col2,col3,col4,col5) include (col6)
David.
- Ditandai sebagai Jawaban oleh amber zhangModerator 30 Maret 2012 3:23
-
23 Maret 2012 23:11
Thanks Sanil, I'll do some more testing next week.
-
24 Maret 2012 5:24
Hi Chenning,
If you want to simply insert rows in other table then I suggest to use Import Utility of SQL Server, and If you are trying to do some processing in between then the best approach is to do it in batches.
Best approach for both is using SSIS.
Hope it will help you.
------------------ Please mark my post as an answer if I helped you to resolve the issue or vote up if I helped you.Thanks. Regards and good Wishes, Maqbool.
- Ditandai sebagai Jawaban oleh amber zhangModerator 30 Maret 2012 3:23