Saturday, June 9, 2012 04:17
We are working on a process that will modify two relational tables: the parent has close to a billion records and the child over a billion and a half, and both tables have almost 3,500 partitions. A small subset of about 200K records, spread across the partitions, will be modified. The logic has cases with over 1,000 or so WHEN statements (a mainframe conversion), so plain SQL does not perform well. I wrote a simple C# function (three, actually) that does the job with no CPU increase, and now I am getting pushback saying SSIS needs to be doing the job. I am concerned about pulling data out to a different server: since the update may take a long time, when the data is bulk inserted back into SQL Server we might overwrite somebody else's changes.
What's your take on processing data that is already inside SQL Server by pulling it out to a different server, bulk inserting it back into SQL Server, indexing the temp table, and then updating the tables? Do you think the processing cost would be much different on the SQL Server side? Would you be concerned about data integrity in this situation? Do you have any other concerns or recommendations about either method?
All Replies
Saturday, June 9, 2012 05:29
Gokhan, is this then the same case you were working on a few days ago, where you finished your CLR routine, it *works*, and your management won't take "yes!" for an answer?
Or is it that even with your function, the join and apply are still too slow?
I'm going to guess that you have some people who are pushing to eliminate SQL Server and go with Hadoop "big data" systems instead. And who knows, maybe they are right. A billion-row query to update 200K rows may or may not work out on SQL Server, even on the biggest hardware available today; it depends on a lot of things, but mostly the selectivity of the keys. That it spans so many partitions is bad news.
Saturday, June 9, 2012 12:31
The CLR solution does not have any drawbacks and is transactionally short: there is an initial read from the two source tables into a small temp table, during which locks are held on the two tables that will be modified; then the temp table is indexed, source tables one and two are modified in two UPDATE statements, and the transaction is committed, which releases the locks.
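The pattern described above can be sketched in T-SQL roughly as follows. This is a hypothetical illustration, not the actual code: the table, column, and function names (`Parent`, `Child`, `BusinessKey`, `fn_ConvertValue`, the `NeedsConversion` flag) are placeholders standing in for the real schema and the CLR conversion function.

```sql
BEGIN TRANSACTION;

-- 1. Read the ~200K affected rows into a small temp table, taking update
--    locks on the source rows so they cannot change mid-process.
SELECT p.BusinessKey, p.SomeColumn, c.OtherColumn
INTO   #work
FROM   dbo.Parent p WITH (UPDLOCK)
JOIN   dbo.Child  c WITH (UPDLOCK) ON c.ParentKey = p.BusinessKey
WHERE  p.NeedsConversion = 1;

-- 2. Apply the complex conversion logic (the CLR function stands in for
--    the 1,000+ WHEN-statement mainframe logic) against the temp table.
UPDATE #work SET SomeColumn = dbo.fn_ConvertValue(SomeColumn);

-- 3. Index the temp table so the joins back to the sources are cheap.
CREATE CLUSTERED INDEX ix_work ON #work (BusinessKey);

-- 4. Two set-based updates back into the source tables.
UPDATE p SET p.SomeColumn = w.SomeColumn
FROM dbo.Parent p JOIN #work w ON w.BusinessKey = p.BusinessKey;

UPDATE c SET c.OtherColumn = w.OtherColumn
FROM dbo.Child c JOIN #work w ON w.BusinessKey = c.ParentKey;

COMMIT TRANSACTION;  -- releases the locks held since step 1
```

Because everything stays inside one SQL Server transaction, there is no window for another session's changes to be overwritten, which is the integrity risk the extract-to-another-server approach introduces.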
As far as I can tell, the SSIS solution will be far more complex to write. My C# CLR is currently totally scriptable: all SQL Server objects can be queried for columns, parameters, dependencies, etc., so they are also easily scriptable for usage. Writing an SSIS package, by contrast, is mouse clicks: merge joins have to be created manually, parameters have to be created manually, so at the end of the day development is manual. It also cannot deal with locking, while the overall duration of the process is longer and heavier, plus there is the network traffic.
My initial question is, and remains: from your experience, what are the other drawbacks of turning a couple of simple table updates in SQL into an SSIS package?
Monday, June 11, 2012 20:33
What is the nature of the modification? You may benefit from using $PARTITION to identify all the partitions that need modification, make those changes, and then iteratively update the next partition; this way you don't cross partitions (which is when it can be slow). Also, have you considered bulk loading all the data into an empty table on the same filegroup, updating the data, and switching the old and new partitions out? There are a lot of ways to skin this cat; I just don't know enough about what you are trying to do, or your DDL constraints.
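Both ideas can be sketched like this. Again a hypothetical illustration: `pf_Parent` is a placeholder partition function name, `PartKey` a placeholder partitioning column, and `NeedsConversion` a stand-in predicate for identifying affected rows.

```sql
-- Find only the partitions that actually contain rows to be modified,
-- so the ~200K rows are processed partition by partition.
SELECT DISTINCT $PARTITION.pf_Parent(PartKey) AS PartitionNumber
FROM   dbo.Parent
WHERE  NeedsConversion = 1;

-- Per-partition update: the $PARTITION predicate lets the optimizer
-- eliminate all but one of the ~3,500 partitions for each pass.
UPDATE dbo.Parent
SET    SomeColumn = dbo.fn_ConvertValue(SomeColumn)
WHERE  $PARTITION.pf_Parent(PartKey) = @PartitionNumber
AND    NeedsConversion = 1;

-- Alternative: switch one partition out to an empty staging table on the
-- same filegroup (matching schema and indexes, plus a check constraint
-- on PartKey), modify it there, and switch it back in. Both switches are
-- metadata-only operations.
ALTER TABLE dbo.Parent SWITCH PARTITION @PartitionNumber
    TO dbo.Parent_Staging;
-- ... update dbo.Parent_Staging here ...
ALTER TABLE dbo.Parent_Staging
    SWITCH TO dbo.Parent PARTITION @PartitionNumber;
```

The switch approach keeps the heavy modification off the live partition entirely, at the cost of the staging table having to mirror the source table's schema, indexes, and constraints exactly.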
Monday, June 11, 2012 20:36
<<My initial question is, and remains: from your experience, what are the other drawbacks of turning a couple of simple table updates in SQL into an SSIS package?
You will face the same challenges in SSIS. An update is not going to be faster in SSIS than in SQL, and I think the cost will be greater to extract the data, modify it, and then load it back into the source. There is probably a better way to handle this type of update; ultimately, you are only updating 200K rows.