# Generating Simulated Skewed Data

• ### Question

• Hello dear friends!

For the purpose of several tests I wish to make, I need to simulate a collection of skewed data (non-uniform distributed random data).

The final requirement is to reach a state where my data is so skewed that it hurts the statistics in such a way that even after performing UPDATE STATISTICS WITH FULLSCAN, I can still have a situation where I SELECT on the indexed column, and the "estimated number of rows" is profoundly different than "actual number of rows".

For example, I'd have an index value where the "actual" and "estimated" rows are identical or very close:

SELECT *
FROM MyTable
WHERE SomeID = 123

And another value (with the same query) will have a huge difference between "actual" and "estimated" number of rows:

SELECT *
FROM MyTable
WHERE SomeID = 456

This task is a lot more challenging than it looks!

I couldn't find good material on this on the internet, but I managed to create a partial solution (more or less) using Gaussian-distributed random data (thanks to this article http://www.sqlservercentral.com/articles/SQL+Uniform+Random+Numbers/91103/)

However, the data still wasn't skewed enough, so I was hoping perhaps someone on this honored forum could fare better than me.

Anyone up to the challenge?

Sunday, August 5, 2012 9:56 AM

• Skew in itself not very important. Here is an example with a perectly uniform distribution, over values 0, 10, etc to 19990. Querying for something that does not end in 0 gives an incorrect estimate.

SELECT (abs(checksum(newid())) % 2000 ) * 10 AS r
INTO   #random
FROM   sys.objects a
CROSS  JOIN sys.objects b
CROSS  JOIN sys.objects c
go
CREATE STATISTICS lies ON #random(r) WITH FULLSCAN
go
SELECT * FROM #random WHERE r = 11401
go
DROP TABLE #random

Reaching the other situation, where the actual is >> the estimate is more difficult, since SQL Server will build the histogram around the common values. The histogram has (at most) 200 steps, and if the data is skewed, this will also be reflected in the histogram.

Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
• Proposed as answer by Wednesday, August 8, 2012 8:25 AM
• Marked as answer by Wednesday, August 8, 2012 5:28 PM
Sunday, August 5, 2012 6:57 PM