SQL Server Columnstore Index FAQ

SQL Server Columnstore Index FAQ

The SQL Server in-memory columnstore index (formerly called xVelocity) stores data by columns instead of by rows, similar to a column-oriented DBMS. The columnstore index speeds up data warehouse query processing in SQL Server 2012 and SQL Server 2014, in many cases by a factor of 10 to 100. We'll be posting answers to frequently asked questions here.

SQL Server 2012 introduced nonclustered columnstore indexes. For more information, see the 2012 version of Columnstore Indexes on MSDN. 

SQL Server 2014 has both clustered and nonclustered columnstore indexes.  For more information, see the 2014 version of Columnstore Indexes Described, Using Nonclustered Columnstore Indexes, and Using Clustered Columnstore Indexes

For both SQL Server 2012 and SQL Server 2014, see the wiki article SQL Server Columnstore Performance Tuning on Technet. 

1. Overview

What are Microsoft's in-memory technologies?

Microsoft SQL Server has a family of in-memory technologies. These are all next-generation technologies built for extreme speed on modern hardware systems with large memories and many cores. The in-memory technologies include in-memory analytics engine (used in PowerPivot and Analysis Services), and the in-memory columnstore index (used in the SQL Server database).

SQL Server 2012, SQL Server 2014, and SQL Server PDW all use in-memory technologies to accelerate common data warehouse queries. SQL Server 2012 introduced two new features: a nonclustered columnstore index and a vector-based query execution capability that processes data in units called "batches." SQL Server 2014 introduced updateable clustered columnstore indexes.

What is a columnstore?

A columnstore is data that is logically organized as a table with rows and columns, and physically stored in a columnar data format. Relational database management systems traditionally store data in row-wise fashion. The values comprising one row are stored contiguously on a page. We sometimes refer to data stored in row-wise fashion as a rowstore.

What is a columnstore index?

A columnstore index is a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore. The data is compressed, stored, and managed as a collection of partial columns, called column segments. You can use a columnstore index to answer a query just like data in any other type of index.

A columnstore index appears as an index on a table when examining catalog views or the Object Explorer in Management Studio. The query optimizer considers the columnstore index as a data source for accessing data just like it considers other indexes when creating a query plan.

What do I have to do to use a columnstore index?

For nonclustered columnstore indexes, all you have to do is create a nonclustered columnstore index on one or more tables in your database. The query optimizer will decide when to use the columnstore index and when to use other types of indexes. The query optimizer will also choose when to use the new batch execution mode and when to use row execution mode.

For clustered columnstore indexes, you need to first create a table as a heap or clustered index, and then use the CREATE CLUSTERED COLUMNSTORE INDEX statement to convert the existing table to a clustered columnstore index.  If your existing table has indexes, you need to drop all indexes, except for the clustered index, before creating a clustered columnstore index.  Since the clustered columnstore index is the data storage mechanism for the entire table, the clustered columnstore index is the only index allowed on the table.

Are columnstore indexes available in SQL Azure?

No, not yet.

[TOP]

2. Creating a Columnstore Index

How do I create a nonclustered columnstore index?

You can create a nonclustered columnstore index by using a slight variation on existing syntax for creating indexes. To create an index named mycolumnstoreindex on a table named mytable with three columns, named col1, col2, and col3, you would use the following syntax:

CREATE NONCLUSTERED COLUMNSTORE INDEX mycolumnstoreindex ON mytable (col1, col2, col3);

To avoid typing the names of all the columns in the table, you can use the Object Explorer in Management Studio to create the index as follows:

  1. Expand the tree structure for the table and then right click on the Indexes icon.
  2. Select New Index and then Nonclustered columnstore index
  3. Click Add in the wizard and it will give you a list of columns with check boxes.
  4. You can either choose columns individually or click the box next to Name at the top, which will put checks next to all the columns. Click OK.
  5. Click OK.

How do I create a clustered columnstore index?

When you create a clustered columnstore index, there is no need to specify columns since all columns in the table are included in the index.  This example converts a clustered index called myindex into a clustered columnstore index.

CREATE CLUSTERED COLUMNSTORE INDEX myindex ON mytable WITH (DROP_EXISTING = ON);


Does it matter what order I use when listing the columns in the CREATE NONCLUSTERED COLUMNSTORE INDEX statement?

No. When the nonclustered columnstore index is created, it uses a proprietary algorithm to organize and compress the data.

Does the columnstore index have a primary key?

No. There is no notion of a primary key for a columnstore index.

How many columns should I put in my columnstore index?

Typically, you will put all the columns in a table in the columnstore index, although it is not necessary to include all the columns. The limit on the number of columns is the same as for other indexes (1024 columns). If you have a column that has a data type that is not supported for columnstore indexes, you must omit that column from the columnstore index.

What data types can be used with columnstore indexes?

A columnstore index can include columns with the following data types: int, big int, small int, tiny int, money, smallmoney, bit, float, real, char(n), varchar(n), nchar(n), nvarchar(n), date, datetime, datetime2, small datetime, time, datetimeoffset with precision <=2, decimal or numeric with precision <= 18.

What data types cannot be used in a columnstore index?

The following data types cannot be used in a columnstore index: decimal or numeric with precision > 18, datetimeoffset with precision > 2, binary, varbinary, image, text, ntext, varchar(max), nvarchar(max), cursor, hierarchyid, timestamp, uniqueidentifier, sqlvariant, xml.

How long does it take to create a columnstore index? Is creating a columnstore index a parallel operation?

Creating a columnstore index is a parallel operation, subject to the limitations on the number of CPUs available and any restrictions set on MaxDOP. Creating a columnstore index takes on the order of 1.5 times as long as building a B-tree on the same columns.

My MAXDOP is greater than one but the columnstore index was created with DOP = 1.  Why it was not created using parallelism?
If your table has less than one million rows, SQL Server will use only one thread to create the columnstore index.  Creating the index in parallel requires more memory than creating the index serially.  If your table has more than one million rows, but SQL Server cannot get a large enough memory grant to create the index using MAXDOP, SQL Server will automatically decrease DOP as needed to fit into the available memory grant.  In some cases, DOP must be decreased to one in order to build the index under constrained memory.

How much memory is needed to create a columnstore index?

The memory required for creating a columnstore index depends on the number of columns, the number of string columns, the degree of parallelism (DOP), and the characteristics of the data. SQL Server will request a memory grant before trying to create the index. If not enough memory is available to create the index in parallel with the current max DOP, SQL Server will reduce the DOP as needed to get an adequate memory grant. If SQL Server cannot get a memory grant to build the index with DOP = 1, the index creation will fail.

A rule of thumb for estimating the memory grant that will be requested for creating a columnstore index is:

Memory grant request in MB = [(4.2 *Number of columns in the CS index) + 68]*DOP + (Number of string cols * 34)

What can I do if I do not have enough memory to build the columnstore index?

It's possible for creation of a columnstore index to fail either at the very beginning of execution if it can't get the necessary initial memory grant, or later during execution if supplemental grants can't be obtained. If the initial grant fails, you'll see error 8657 or 8658. You may get error 701 or 802 if memory runs out later during execution. If out-of-memory error 8657 or 8658 occur at the beginning of columnstore index creation, first, check your resource governor settings. The default setting for resource governor limits a query in the default pool to 25% of available memory even if the server is otherwise inactive. This is true even if you have not enabled resource governor.  Consider changing the resource governor settings to allow the create index statement to access more memory. You can do this using TSQL:

ALTER WORKLOAD GROUP [DEFAULT] WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT=X)
ALTER RESOURCE GOVERNOR RECONFIGURE
GO

where X is the percent, say 50.

If you get error 701 or 802 later during the index build, that means that the initial estimate of memory usage was too low, and additional memory was consumed during index build execution and memory ran out. The only viable way to work around these errors in this case is to explicitly reduce DOP when you create the index, reduce query concurrency, or add more memory.

For all these error conditions (701, 802, 8657, and 8658), adding more memory to your system may help.

See SQL Server Books Online for ALTER WORKLOAD GROUP for additional information.

Another way to deal with out-of-memory conditions during columnstore index build is to vertically partition a wide table into two or more tables so that each table has fewer columns. If a query touches both tables, the table will have to be joined, which will affect query performance. If you use this option, you will want to allocate columns to the different tables carefully so that queries will usually touch only one of the tables. This option would also affect any existing queries and loading scripts. Another option is to omit some columns from the columnstore index. Good candidates are columns that are infrequently touched by queries that require scanning large amounts of data.

In some cases, you may not be able to create a columnstore index due to insufficient memory soon after the server starts up, but later on it may work. This is because SQL Server, by default, gradually requests memory from the operating system as it needs it. So it may not have enough memory available to satisfy a large memory grant request soon after startup. If this happens, you can make the system grab more memory by running a query like "select count(*) from t" where t is a large table. Or, you can set both the min server memory and max server memory to the same value using sp_configure, which will force SQL Server to immediately grab the maximum amount of memory it will use from the operating system when it starts up.

Can I create a columnstore index on a compressed table?

Yes. The base table can have PAGE compression, ROW compression, or no compression. The columnstore index will have its own compression, which cannot be specified by the user.

I tried to create a columnstore index with SQL Server Management Studio using the Indexes->New Index menu and it timed out after 20 minutes. How can I work around this?

Run a CREATE NONCLUSTERED COLUMNSTORE INDEX statement manually in a T-SQL window instead of using the graphical interface. This will avoid the timeout imposed by the Management Studio graphical user interface.

[TOP]

3. Limitations on Creating a Columnstore Index

Can I create a filtered columnstore index?

No. A columnstore index must contain data from all the rows in the table.

Can I create a columnstore index on a computed column?

No. A computed column cannot be part of a columnstore index.

Can I create a columnstore index on a sparse column?

No. A sparse column cannot be part of a columnstore index.

Can I create a columnstore index on an indexed view?

No. A columnstore index cannot be created on an indexed view. You also cannot use a columnstore index to materialize a view.

Can I create multiple columnstore indexes?

No. You can only create one columnstore index on a table. The columnstore index can contain data from all, or some, of the columns in a table. Since the columns can be accessed independently from one another, you will usually want all the columns in the table to be part of the columnstore index.

[TOP]

4. More Details on Columnstore Technology

What are the advantages and disadvantages of row stores and column stores?

When data is stored in column-wise fashion, the data can often be compressed more effectively than when stored in row-wise fashion. Typically there is more redundancy within a column than within a row, which usually means the data can be compressed to a greater degree. When data is more compressed, less IO is required to fetch the data into memory. In addition, a larger fraction of the data can reside in a given size of memory. Reducing IO can significantly speed up query response time.  Retaining more of your working set of data in memory will speed up response time for subsequent queries that access the same data.

When data is stored column-wise, it is possible to access the column individually. If a query only references a few of the columns in the table, it is only necessary for a subset of the columns to be fetched from disk into memory. For example, if a query references five columns from a table with 50 columns (i.e. 10% of the columns), IO is reduced by 90% (in addition to any benefits from compression).

On the other hand, storing columns in independent structures means that the data must be recombined to return the data as a row.  When a query touches only one (or a few) rows, having all the data for one row stored together can be an advantage if the row can be quickly located with a B-tree index.  Row stores may offer better query performance for very selective queries, such as queries that lookup a single row or a small range of rows.  Updating data is also simpler in a row store.

What is the difference between a pure column store and a hybrid column store?

SQL Server columnstore indexes are pure column stores. That means that the data is stored and compressed in column-wise fashion and individual columns can be accessed separately from other columns.  A hybrid columnstore stores a set of rows together, but within that set of rows, data is organized and compressed in column-wise fashion. A hybrid column store can achieve good compression from a column-wise organization within the set of rows, but when data is fetched from disk, the pages being fetched contain data from all the columns in each row.  Even if a query references only 10% of the columns in a table, all the columns must be fetched from disk, and unused columns also take up space in main memory. SQL Server columnstore indexes require less I/O and give better main-memory buffer pool hit rates than a hybrid columnstore.

Is a columnstore index better than a covering index that has exactly the columns I need for a query

The answer depends on the data and the query. Most likely the columnstore index will be compressed more than a covering row store index.  If the query is not too selective, so that the query optimizer will choose an index scan and not an index seek, scanning the columnstore index will be faster than scanning the row store covering index. In addition, depending on the nature of the query, you can get batch mode processing when the query uses a columnstore index.  Batch mode processing can substantially speed up operations on the data in addition to the speed up from a reduction in IO.  If there is no columnstore index used in the query plan, you will not get batch mode processing. On the other hand, if the query is very selective, doing a single lookup, or a few lookups, in a row store covering index might be faster than scanning the columnstore index.

Another advantage of the columnstore index is that you can spend less time designing indexes.  A row store index works well when it covers all the columns needed by a query. Changing a query by adding one more column to the select list can render the covering index ineffective.  Building one columnstore index on all the columns in the table can be much simpler than designing multiple covering indexes.

Is the columnstore index the same as a set of covering indexes, one for each column?

No. Although the data for individual columns can be accessed independently, the columnstore index is a single object; the data from all the columns is organized and compressed as an entity.  While the amount of compression achieved is dependent on the characteristics of the data, a columnstore index will most likely be much more compressed than a set of covering indexes, resulting in less IO to read the data into memory and the opportunity for more of the data to reside in memory across multiple queries.  In addition, queries using columnstore indexes can benefit from batch mode processing, whereas a query using covering indexes for each column would not use batch mode processing.

Is columnstore index data still compressed after it is read into memory?

Yes. Column segments are compressed on disk and remain compressed when cached in memory.

Do columnstore indexes use bitmap indexes?

No. Columnstore indexes use a proprietary data representation based on Vertipaq. It’s not the same as a bitmap index and doesn’t use one. But it has some similar benefits to bitmap indexes, such as  reducing the time it takes to filter on a column with a small number of distinct values.

I want to show other people how cool SQL Server columnstore indexes are. What can I show them?
OR
Where can I find more information (including documents and videos) about SQL Server columnstore indexes?

White paper:
http://download.microsoft.com/download/8/C/1/8C1CE06B-DE2F-40D1-9C5C-3EE521C25CE9/Columnstore%20Indexes%20for%20Fast%20DW%20QP%20SQL%20Server%2011.pdf

Product documentation:
http://msdn.microsoft.com/en-us/library/gg492088(SQL.110).aspx

SQL Server Columnstore FAQ:
http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx

SQL Server Columnstore Performance Tuning Guide:
http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-performance-tuning.aspx

The Coming In-Memory Tipping Point, by David Campbell
http://blogs.technet.com/b/dataplatforminsider/archive/2012/04/09/the-coming-in-memory-database-tipping-point.aspx 

Microsoft Virtual Academy talk video, 47 minutes, March 2012:
http://technet.microsoft.com/en-us/edge/Video/hh859842

TechEd 2011 talk video, Columnstore Indexes Unveiled, 1 hour, 9 minutes:
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2011/DBI312

TechEd 2012 talk video, SQL Server Columnstore Performance Tuning, 1 hour, 15 minutes:
http://channel9.msdn.com/Events/TechEd/NorthAmerica/2012/DBI409

Columnstore performance and partition switching demo video, 9 minutes:
http://channel9.msdn.com/posts/SQL11UPD02-REC-02

Columnstore performance demo video, 4 minutes:
http://www.youtube.com/watch?v=vPN8_PCsJm4 

ACM SIGMOD 2011 paper on SQL Server columnstore indexes:
http://dl.acm.org/citation.cfm?doid=1989323.1989448

IEEE Data Engineering Bulletin Paper on SQL Server columnstore indexes, March 2012:
http://sites.computer.org/debull/A12mar/apollo.pdf

VertiPaq vs ColumnStore: Performance Analysis of the xVelocity Engine, v1.0, rev 2, Aug 3, 2012.
http://www.sqlbi.com/wp-content/uploads/Vertipaq-vs-ColumnStore1.pdf

Microsoft SQL Server 2012 Columnstore for Real Time Reporting in Manufacturing Automation (COPA-DATA zenon Analyzer), 2012.
http://www.kreatron.ro/news/newsdetail_65.html

Case Study (bwin.party):
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/bwin.party/Company-Cuts-Reporting-Time-by-up-to-99-Percent-to-3-Seconds-and-Boosts-Scalability/710000000087

Case Study (Motricity: Migration from Sybase IQ to xVelocity columnstore index):
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/Motricity/Mobile-Advertiser-Makes-Gains-with-Easy-Migration-of-Sybase-Database-to-Microsoft/710000000170

Case Study (MS People):
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/Microsoft-Information-Technology-Group-MSIT/Microsoft-Cuts-Costs-and-Improves-Access-to-Information-with-Enhanced-Data-Warehouse/4000011545

Case Study (Columnstore Indexes to Speed ETL):
http://prologika.com/CS/blogs/blog/archive/2011/12/07/columnstore-indexes-to-speed-etl.aspx

Case Study (Mediterranean Shipping Company):
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/Mediterranean-Shipping-Company-MSC/Shipper-Supports-Expansion-by-Boosting-Speed-Control-and-Savings-with-Microsoft/4000011460

Case Study (Beth Israel Deaconess Medical Center):
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012/Beth-Israel-Deaconess-Medical-Center/Hospital-Improves-Availability-and-Speeds-Performance-to-Deliver-High-Quality-Care/5000000011

Case Study (Belgacom)
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/BICS/Telecom-Performs-Database-Queries-Five-Times-Faster-Gains-Ability-to-Sustain-Growth/710000000579

Case Study (BNZ - New Zealand Bank)
http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000000356

Case Study (RHI - Refractory Materials Manufacturer)
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/RHI/Manufacturer-Speeds-Queries-and-Improves-Business-Decisions-with-New-BI-Solution/710000001276

Case Study (Recall -- Records Management Firm)
http://www.microsoft.com/casestudies/Microsoft-SQL-Server-2012-Enterprise/Recall/Records-Management-Firm-Saves-1-Million-Gains-Faster-Data-Access-with-Microsoft-BI/710000001279

Slide deck on CDR (Telecom) application design loading 100M rows per day with 3 year retention
http://sqlug.be/media/p/1238.aspx

Internal Microsoft Columnstore Benchmark:
http://download.microsoft.com/download/7/2/E/72E63D2D-9F73-42BB-890F-C1CA0931511C/SQL_Server_2012_xVelocityBenchmark_DatasheetMar2012.pdf

SQL Server Column-Store available for all major SAP BW releases
http://blogs.msdn.com/b/saponsqlserver/archive/2012/10/29/sql-server-column-store-generally-available-for-sap-bw.aspx

SQL Server 2012 and Tableau -- speeding things up
http://random-thunks.com/2012/11/23/sql-server-2012-and-tableau-speeding-things-up/

What determines how many segments there will be?

Each physical partition of a columnstore index is broken into one-million-row chunks called segments (a.k.a. row groups). The index build process creates as many full segments as possible. Because multiple threads work to build an index in parallel, there may be a few small segments (typically equal to the number of threads) at the end of each partition with the remainder of the data after creating full segments. That's because each thread might hit the end of its input at different times. Non-partitioned tables have one physical partition.

[TOP]

5. Using Columnstore Indexes

How do I know whether the columnstore index is being used for my query?

You can tell whether a columnstore index is being used by looking at showplan.  In graphical showplan, there is a new icon for columnstore index scans.  In addition, columnstore index scans have a new property, storage, with the value ColumnStore.

How can I force the query to use a columnstore index?

Existing hints work with columnstore indexes.  If you have a nonclustered columnstore index named mycsindex on a table named mytable you could use a table hint such as

 … FROM mytable WITH (INDEX (mycsindex)) …

How can I prevent the use of a columnstore index in my query?

You can either use a table hint to force the use of a different index, or you can use a new query hint: IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX.  This new hint will prevent the use of any nonclustered columnstore indexes in the query.  Below is an example of using the hint to prevent use of any nonclustered columnstore index in a query:
SELECT DISTINCT (SalesTerritoryKey)
FROM dbo.FactResellerSales
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);

Are columnstore indexes an in-memory database technology?

SQL Server columnstores provide the performance benefits of a pure in-memory system with the convenience and economics of a system that stores data on disk and caches recently used data in memory. Columnstores hold data in memory in a different format than is kept on disk. This in-memory representation is highly optimized to support fast query execution on modern processors. Not all data has to fit in memory with a SQL Server columnstore index. But if all columnstore data does fit in memory, SQL Server provides pure-in-memory levels of performance.

Why require all data to fit in memory (capping your database size or demanding a large budget to purchase memory, and demanding slow system startup times) if you can get the best of both worlds, that is, state-of-the-art query performance on economical hardware?

Does all the data have to fit in memory when I use a columnstore index?

No, a columnstore index is persisted on disk just like any other index. It is read into memory when needed just like other types of indexes. The columnstore index is divided into units called segments, which are the unit of transfer. A segment is stored as a LOB, and can consist of multiple pages. We elected to bring columnstore index data into memory on demand rather than require that all data fits in memory so customers can access databases much bigger than will fit in main memory. If all your data fits in memory, you'll get reduced I/O and the fastest possible query performance. But it's not necessary for all data to fit in memory, and that's a plus.

What determines whether the columnstore index is stored in memory?

A columnstore index is read into memory when needed just like other types of indexes.

Can I force a whole columnstore index to be loaded into memory?

You cannot force the columnstore index to be loaded, or kept, in memory but you can warm the cache by running a query that will cause the columnstore data to be read into memory.

When should I build a columnstore index?

Columnstore indexes are designed to accelerate data warehouse queries, not OLTP workloads.  Use columnstore indexes when your query workload entails scanning and aggregating large amounts of data or joining multiple tables, especially in a star join pattern.  The restrictions on how you update the data will also affect your choice.  Columnstore indexes will be easiest to manage if you have a read-mostly workload and if partition switching to update the data will fit into your workflow.  Partition switching for handling updates is easier if most updates consist of appending new data to the existing table and can be placed in a staging table that can be switched into the table during periodic load cycles.

Typically you will want to build a columnstore index on large fact tables and maybe on large dimension tables as well.  You can build a columnstore index on very small tables, but the performance advantage is less noticeable when the table is small.  If you frequently update your dimension tables, and they are not too large, you may find the maintenance effort outweighs the benefit of a columnstore index.

When should I not build a columnstore index?

If you frequently update the data in a table, or if you need to update a large table but partition switching does not fit your workflow, you might not want to create a columnstore index.  If most of your queries are small lookup queries, seeking into a B-tree index may be faster and you may not find a columnstore index to be beneficial.  If you test a columnstore index and it does not benefit your workload, you can drop or disable the index.

Can you do trickle load and real-time query with a columnstore index?

Yes. Even though tables with a columnstore index are read-only, you can maintain two tables, the one with the columnstore, and a second table with the same schema structured as a B-tree or heap. The second table, called a differential file, holds newly inserted rows. You query the combined table by modifying your queries to aggregate results from the two tables separately, and combine them. This is called local-global aggregation. Periodically, (say during a nightly batch window) you move data from the row-structured table to the columnstore table. See here for details and an example on how to do trickle load.


[TOP]


6. Managing Columnstore Indexes

Do columnstore indexes work with Transparent Data Encryption?

Yes.

Can I compress the columnstore index?

The columnstore index is compressed when it is created. You cannot apply PAGE or ROW compression to a columnstore index. When a columnstore index is created, it uses the VertiPaqTM compression algorithms, which compress the data more than either PAGE or ROW compression. There is no user control over compression of the columnstore index.

What is the difference in storage space used between the base table and the columnstore index?

Based on our experiments with a variety of different data sets, columnstore indexes are about 4X to 15X smaller than an uncompressed heap or clustered B-tree index, depending on the data.

Do columnstore indexes work on partitioned tables?

Yes, you can create a columnstore index on a partitioned table. The columnstore index must be partition-aligned with the base table. If you do not specify a partition scheme when you create the columnstore index, the index will be automatically created using the same partition scheme as the base table. You can switch a partition in and out of a partitioned table with the same requirements regarding matching indexes as exist for other types of clustered and nonclustered indexes.

Can I partition a columnstore index?

Yes, you can partition a columnstore index, but the base table must also be partitioned and the columnstore index must be partition-aligned with the base table.

How do I add to, or modify, the data in a table with a columnstore index?

Once you create a columnstore index on a table, you cannot directly modify the data in that table. A query with INSERT, UPDATE, DELETE, or MERGE will fail and return an error message. To add or modify the data in the table, you can do one of the following:

  • Disable or drop the columnstore index. You can then update the data in the table. If you disable the columnstore index, you can rebuild the columnstore index when you finish updating the data. For example,

    ALTER INDEX mycolumnstoreindex ON mytable DISABLE;

    -- update the data --

    ALTER INDEX mycolumnstoreindex ON mytable REBUILD;

    Now the columnstore index is ready to use again.

  • Load data into a staging table that does not have a columnstore index. Build a columnstore index on the staging table. Switch the staging table into an empty partition of the main table.
  • Switch a partition from the table with the columnstore index into an empty staging table. If there is a columnstore index on the staging table, disable the columnstore index. Perform any updates. Build (or rebuild) the columnstore index. Switch the staging table back into the (now empty) partition of the main table.

See also the question about trickle load.

What happens if I try to update a table that has a columnstore index?

The update will fail and return an error message.

Can I disable and rebuild the index on a single partition?

No. You can only disable or rebuild a columnstore index on the entire table. If you want to rebuild only one partition, you should switch the partition into an empty staging table, disable/rebuild the index on the staging table, and switch the staging table back into the main table. There is no need to rebuild the index except when you want to modify the data in the table.

How can I tell whether there is a columnstore index on my table?

There are two ways to determine whether a columnstore exists on a table. In Management Studio, you can look at the Object Explorer. Each table has an entry for Indexes. Columnstore indexes are included in the list of indexes and have their own icon and description. You can also look at various catalog tables. In sys.indexes, a columnstore index has type = 6 and type_desc = “NONCLUSTERED COLUMNSTORE.” A new catalog table, sys.column_store_index_stats, has one row for each columnstore index.

How can I find out more about my columnstore indexes? Is there metadata?

There are two new catalog tables with data about columnstore indexes:    

  • sys.column_store_segments
  • sys.column_store_dictionaries

VIEW DEFINITIONS permission on a table is required to see information in the catalog tables about a columnstore index on that table. In addition, a user must have SELECT permission on the table to see data in the following columns:

sys.column_store_segments:
    has_nulls, base_id, magnitude, min_data_id, max_data_id, null_value, data_ptr

sys.column_store_dictionaries:
    last_id, entry_count, data_ptr

A user who does not have SELECT permission on a table will see NULL as the value in the columns listed above.

Does the columnstore compression algorithm compress each partition separately?

Yes, each partition is compressed separately. Each partition has its own dictionaries. All segments within a partition share dictionaries. Dictionaries for different partitions are independent. This allows partition switching to be a metadata-only operation.

How big are my columnstore indexes?

You can use the new catalog tables or sys.dm_db_partition_stats to determine how big the columnstore indexes are on disk. A relatively simple query to get the size of one columnstore index is:

   SELECT SUM(s.used_page_count) / 128.0 on_disk_size_MB 
    FROM sys.indexes AS i 
    JOIN sys.dm_db_partition_stats AS S 
        ON i.object_id = S.object_id 
                        and I.index_id = S.index_id 
    WHERE i.object_id = object_id('<tablename>') 
    AND i.type_desc = 'NONCLUSTERED COLUMNSTORE'

Here are some other queries that total up column store component sizes.

-- total size
with total_segment_size as (
SELECT
SUM (css.on_disk_size)/1024/1024 AS segment_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_segments AS css
ON p.hobt_id = css.hobt_id
)
,
total_dictionary_size as (
SELECT SUM (csd.on_disk_size)/1024/1024 AS dictionary_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_dictionaries AS csd
ON p.hobt_id = csd.hobt_id
)
select
segment_size_mb,
dictionary_size_mb,
segment_size_mb + isnull(dictionary_size_mb, 0) as total_size_mb
from total_segment_size
left outer join total_dictionary_size
on 1 = 1
go
-- size per index
with segment_size_by_index AS (
SELECT
p.object_id as table_id,
p.index_id as index_id,
SUM (css.on_disk_size)/1024/1024 AS segment_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_segments AS css
ON p.hobt_id = css.hobt_id
group by p.object_id, p.index_id
) ,
dictionary_size_by_index AS (
SELECT
p.object_id as table_id,
p.index_id as index_id,
SUM (csd.on_disk_size)/1024/1024 AS dictionary_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_dictionaries AS csd
ON p.hobt_id = csd.hobt_id
group by p.object_id, p.index_id
)
select
object_name(s.table_id) table_name,
i.name as index_name,
s.segment_size_mb,
d.dictionary_size_mb,
s.segment_size_mb + isnull(d.dictionary_size_mb, 0) as total_size_mb
from segment_size_by_index s
JOIN sys.indexes AS i
ON i.object_id = s.table_id
and i.index_id = s.index_id
left outer join dictionary_size_by_index d
on s.table_id = s.table_id
and s.index_id = d.index_id
order by total_size_mb desc
go
-- size per table
with segment_size_by_table AS (
SELECT
p.object_id as table_id,
SUM (css.on_disk_size)/1024/1024 AS segment_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_segments AS css
ON p.hobt_id = css.hobt_id
group by p.object_id
) ,
dictionary_size_by_table AS (
SELECT
p.object_id AS table_id,
SUM (csd.on_disk_size)/1024/1024 AS dictionary_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_dictionaries AS csd
ON p.hobt_id = csd.hobt_id
group by p.object_id
)
select
t.name AS table_name,
s.segment_size_mb,
d.dictionary_size_mb,
s.segment_size_mb + isnull(d.dictionary_size_mb, 0) as total_size_mb
from dictionary_size_by_table d
JOIN sys.tables AS t
ON t.object_id = d.table_id
left outer join segment_size_by_table s
on d.table_id = s.table_id
order by total_size_mb desc
go
-- size per column
with segment_size_by_column as (
SELECT
p.object_id as table_id,
css.column_id,
SUM (css.on_disk_size)/1024/1024.0 AS segment_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_segments AS css
ON p.hobt_id = css.hobt_id
GROUP BY p.object_id, css.column_id
),
dictionary_size_by_column as (
SELECT
p.object_id as table_id,
csd.column_id,
SUM (csd.on_disk_size)/1024/1024.0 AS dictionary_size_mb
FROM sys.partitions AS p
JOIN sys.column_store_dictionaries AS csd
ON p.hobt_id = csd.hobt_id
GROUP BY p.object_id, csd.column_id
)
-- It may be that not all the columns in a table will be or can be included
-- in a nonclustered columnstore index,
-- so we need to join to the sys.index_columns to get the correct column id.
Select Object_Name(s.table_id) as table_name, C.column_id,
col_name(S.table_id, C.column_id) as column_name, s.segment_size_mb,
d.dictionary_size_mb, s.segment_size_mb + isnull(d.dictionary_size_mb, 0) total_size_mb
from segment_size_by_column s
join
sys.indexes I -- Join to Indexes system table
ON I.object_id = s.table_id
join
sys.index_columns c --Join to Index columns
ON c.object_id = s.table_id
And I.index_id = C.index_id
and c.index_column_Id = s.column_id --Need to join to the index_column_id with the column_id
left outer join
dictionary_size_by_column d
on s.table_id = d.table_id
and s.column_id = d.column_id
Where I.type_desc = 'NONCLUSTERED COLUMNSTORE'
order by total_size_mb desc
go 

Why is a columnstore index built from a heap larger than a columnstore index built on the same data from a clustered B-tree?

The columnstore index has to store an extra bookmark column (containing the record id, or rid, for the row) when the base table is a heap. The bookmark is 8 bytes long and unique. Hence, if you have 1 million rows, that's an extra 8MB to store, since the columnstore index cannot compress distinct values. So, please keep that in mind when you build a columnstore index directly on top of a heap.  If compression is a high priority, consider building a clustered index before you build a nonclustered columnstore index.


Are there statistics for columnstore indexes?

The query optimizer uses table statistics to help choose query plans. Tables with a columnstore index can have statistics. The statistics are gathered from the underlying B-tree or heap on the table with the columnstore, not from the columnstore itself. No statistics are created as a byproduct of creating a columnstore index. This is different from creation of a B-tree, where statistics are created for the B-tree key. See here for additional information about statistics and columnstore indexes.

Is there a best practice about putting columnstore indexes on filegroups?

For columnstore indexes in large data warehouses, we recommend you use the same best practices for file group management as for clustered indexes for large fact tables described in the Fast Track 3.0 guidelines here: http://msdn.microsoft.com/en-us/library/gg605238.aspx. As the Fast Track guidelines evolve, we expect to provide explicit guidance for filegroup placement of columnstore indexes.

Can columnstore indexes be used with FILESTREAM?

Yes. Although a FILESTREAM column can't be included in a columnstore index, other columns of the table can.

I am running out of space in my PRIMARY file group with columnstores. How can I avoid this?

Metadata for each row group is kept in the primary file group in a set of internal system tables, even if your tables are kept in other file groups. Every time a new row group is created, a little more space is used in the primary file group. A row group typically contains about one million rows, although smaller row groups can be created under certain conditions.

Each row in the column segment system table is 96 bytes. Total space for a rowgroup = Number of columns * 96 bytes.
Each row in the dictionary system table is 64 bytes. Total space per rowgroup = Number of dictionaries (primary + secondary) in the HoBt * 64.

Query sys.column_store_dictionaries and sys.column_store_segments to see how much row group metadata you have.

Make sure to provide enough space in your primary file group to accommodate this metadata. For example, a 300 column table could use close to 50,000 bytes per row group. If this table has ten billion rows it will have about ten thousand row groups. This could take up to 500MB for the row group metadata in the primary file group. Provision plenty of space in advance for the primary file group, or leave autogrow on and provide enough raw disk space to accommodate the growth.

[TOP]

7. Batch Mode Processing

What is batch mode processing?

Batch mode processing uses a new iterator model for processing data a-batch-at-a-time instead of a-row-at-a-time.  A batch typically represents about 1000 rows of data. Each column within a batch is stored as a vector in a separate area of memory, so batch mode processing is vector-based. Batch mode processing also uses algorithms that are optimized for the multicore CPUs and increased memory throughput that are found on modern hardware.  Batch mode processing spreads metadata access costs and other types of overhead over all the rows in a batch, rather than paying the cost for each row.  Batch mode processing operates on compressed data when possible and eliminates some of the exchange operators used by row mode processing.  The result is better parallelism and faster performance.

How do I know whether batch mode processing is being used for my query?

Batch mode processing is only available for certain operators. Most queries that use batch mode processing will have part of the query plan executed in row mode and part in batch mode. You can tell whether batch mode processing is being used for an operator by looking at showplan. If you look at the properties for a scan or other operator in the Actual Execution Plan, you will see two new properties: EstimatedExecutionMode and ActualExecutionMode. Only EstimatedExecutionMode is displayed in the Estimated Execution Plan. The values for these two properties can be either row or batch. There is also a new operator for hash joins when they are being executed in batch mode. The BatchHashTableBuild operator appears in graphical showplan and has a new icon.

Can EstimatedExecutionMode and ActualExecutionMode be different? When and why?

The query optimizer chooses whether to use batch mode processing when it formulates the query plan. Most of the time, EstimatedExecutionMode and ActualExecutionMode will have the same value, either batch or row. At run time, two things can cause a query plan to be executed in row mode instead of batch mode: not enough memory or not enough threads. The most common reason for the ActualExecutionMode to be row when the EstimatedExecutionMode was batch is that there was a large hash join and all the hash tables could not fit in memory. Batch mode processing uses special in-memory hash tables. If the hash tables do not fit in memory, execution of the query reverts to using row mode and traditional hash tables that can spill to disk. The other reason for changing to row mode is when not enough threads are available for parallel execution. Serial execution always occurs in row mode. You can tell that a fall back to serial execution occurred if the estimated query plan shows parallel execution but the actual query plan is executed serially.

If the query executes in parallel but falls back to row mode processing, you can infer that memory was the problem. There is also an xevent (batch_hash_table_build_bailout) that is fired when there is not enough memory during hash join and the query falls back to row mode processing. If this happens, incorrect cardinality estimation may have contributed to the problem. Check the cardinality estimation and consider updating statistics on the table.

Is a parallel query plan required to get batch mode processing?
Yes.  Batch mode processing occurs only for parallel query execution.  If the cost of the query plan is small, the optimizer may choose a serial plan that is "good enough."  When experimenting with columnstore indexes you may need a large data set to see the effects of batch mode processing.  Check the degree of parallelism if you see that a query was executed in row mode when you expected batch mode.

Can I get batch mode processing even if I don’t have a columnstore index?

No. Batch mode processing only occurs when a columnstore index is being used in the query.

What query execution plan operators are supported in batch mode in Denali?

Filter
Project
Scan
Local hash (partial) aggregation
Hash inner join
(Batch) hash table build

What about the parallelism operators in batch mode hash joins?  Why are they always in row mode?

Some of the parallelism operators in query plans for batch mode hash joins are not needed in batch mode.  Although the operator appears in the query plan, the number of rows for the operator is zero and the query does not incur the cost of redistributing rows among different threads.  The operator remains in the query plan because, if the hash join must spill to disk (if all the hash tables do not fit into the memory allotted for the query), the query reverts to row mode when it spills to disk.  The parallelism operators are required for executing the query in row mode.  If the hash join spills to disk you will see the warning "Operator used tempdb to spill data during execution."  If you look at the properties for the parallelism operators (Repartition Streams), you will see that the actual number of rows is greater than zero if the hash join has spilled.

[TOP]

Sort by: Published Date | Most Recent | Most Useful
Comments