Using Statistics with Columnstore Indexes

Using Statistics with Columnstore Indexes

The query optimizer uses statistics to estimate the cardinality (number of rows) in the output from an operator in a query plan.  Statistics play an important role in getting a good query plan, so it's worth some effort to ensure that the optimizer has the right information when choosing a query plan.

1) Keep statistics up to date
Keeping statistics up to date is important so that the query optimizer has accurate information on which to base its cardinality estimates when evaluating various query plan alternatives.  Up to date statistics are important whether or not you use columnstore indexes, but the effects of stale statistics can be especially noticeable when you have a columnstore index because the performance difference can be quite large between a good query plan that uses columnstore indexes and batch mode processing and a not-so-good query plan that either does not use an available columnstore index when it should or underestimates the size of a memory grant request, causing a hash join to spill to disk.  If you add data with ascending data values, read this blog post about keep statistics updated in the presence of ascending keys: Ascending Keys and Auto Quick Corrected Statistics.

2) Columnstore indexes do not have their own statistics
Statistics are not created automatically when you create a columnstore index.  The optimizer will use statistics that exist on the base table, whether those statistics were created automatically or manually.  Hence, you will generally want to create statistics on all the columns you expect to query or leave auto create statistics turned on.

When you create a columnstore index, a statistics object is created, but the values reported by DBCC SHOW_STATISTICS are all NULL.  The statistics object associated with a columnstore index is used only for database cloning.  A database clone is a statistics-only copy of the database, used for investigating query plan issues.  For more information about database cloning, see http://blogs.msdn.com/b/psssql/archive/2009/07/08/attach-of-the-clone-databases.aspx

3) Create multicolumn statistics on correlated columns
The optimizer estimates how much each filter and join in a query plan will reduce the amount of data to be processed by the next operator.  When filters are on columns with highly correlated data, the optimizer can overestimate the data-reducing effect of multiple filters and thus underestimate the size of a join.  Creating multicolumn statistics on columns that you know, from your design, are correlated will help the optimizer make better cardinality estimations.  This recommendation is especially important for the fact tables in your design.  Better cardinality estimates for joins can result in more accurate memory grants and avoid spills to tempdb.  When designing multicolumn statistics, consider what data are correlated and which columns will have predicates in the same query.  The optimizer can use a multicolumn statistic only if the columns in the predicates match a prefix of the columns in the statistics definition.  For example, if you create a multicolumn statistic on (A, B, C), the statistics can be used for predicates on A, B, and C, predicates on A and B, or a predicate on A only.  The statistic cannot be used for queries with predicates only on A and C or predicates only on B and C.

For more information on using statistics, see:
Using Statistics to Improve Query Performance
Statistics Used by the Query Optimizer in Microsoft SQL Server



Return to main SQL Server columnstore index tuning page.
Sort by: Published Date | Most Recent | Most Useful
Comments
  • Susan, thanks for this article.

    Are up-to-date statistics also relevant for creating/rebuilding columnstore indexes for memory grant request estimate to avoid infamous 701 "There is insufficient system memory in resource pool 'default' to run this query." error? `This error occurs even if there is enough RAM available on the server and even if the resource governor setting REQUEST_MAX_MEMORY_GRANT_PERCENT is set to provide 50% of MAXMEM of 175GB (Memory grant request rule-of-thumb estimate is about 24GB for the index)?

    Our server successfully rebuilt the index today in the morning using a MAXDOP of all 24 CPU cores. Then after truncating the 14 mio rows (200 cols) table and bulk inserting 14 mio rows (+ current month data of about 60,000 rows), the statistics were out of date, creating the columnstore index failed with all 24 cores, it failed with 12 cores but it finally succeded with 10 cores - with roughly the same data (app. 60,000 difference) as in the morning.

    We are often suffering from this error and do not know why we would greatly appreciate your ideas!

  • Does any of this change with clustered columnstore index in SQL 2014?

Page 1 of 1 (2 items)