locked
Generating Simulated Skewed Data RRS feed

  • Question

  • Hello dear friends!

    For the purpose of several tests I wish to make, I need to simulate a collection of skewed data (non-uniform distributed random data).

    The final requirement is to reach a state where my data is so skewed that it hurts the statistics in such a way that even after performing UPDATE STATISTICS WITH FULLSCAN, I can still have a situation where I SELECT on the indexed column, and the "estimated number of rows" is profoundly different than "actual number of rows".

    For example, I'd have an index value where the "actual" and "estimated" rows are identical or very close:

    SELECT *
    FROM MyTable
    WHERE SomeID = 123

    And another value (with the same query) will have a huge difference between "actual" and "estimated" number of rows:

    SELECT *
    FROM MyTable
    WHERE SomeID = 456

    This task is a lot more challenging than it looks!

    I couldn't find good material on this on the internet, but I managed to create a partial solution (more or less) using Gaussian-distributed random data (thanks to this article http://www.sqlservercentral.com/articles/SQL+Uniform+Random+Numbers/91103/)

    However, the data still wasn't skewed enough, so I was hoping perhaps someone on this honored forum could fare better than me.

    Anyone up to the challenge?


    Eitan Blumin; SQL Server Consultant - Madeira SQL Server Services; http://www.madeira.co.il/author/eitan/

    Sunday, August 5, 2012 9:56 AM

Answers

  • Skew in itself not very important. Here is an example with a perectly uniform distribution, over values 0, 10, etc to 19990. Querying for something that does not end in 0 gives an incorrect estimate.

    SELECT (abs(checksum(newid())) % 2000 ) * 10 AS r
    INTO   #random
    FROM   sys.objects a
    CROSS  JOIN sys.objects b
    CROSS  JOIN sys.objects c
    go
    CREATE STATISTICS lies ON #random(r) WITH FULLSCAN
    go
    SELECT * FROM #random WHERE r = 11401
    go
    DROP TABLE #random

    Reaching the other situation, where the actual is >> the estimate is more difficult, since SQL Server will build the histogram around the common values. The histogram has (at most) 200 steps, and if the data is skewed, this will also be reflected in the histogram.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Proposed as answer by Iric Wen Wednesday, August 8, 2012 8:25 AM
    • Marked as answer by EitanBlumin Wednesday, August 8, 2012 5:28 PM
    Sunday, August 5, 2012 6:57 PM