The SQL Server in-memory columnstore index (formerly called xVelocity) stores data by columns instead of by rows, similar to a column-oriented DBMS. The columnstore index speeds up data warehouse query processing in SQL Server 2012 and SQL Server 2014, in
many cases by a factor of 10 to 100. We'll be posting answers to frequently asked questions here.
SQL Server 2012 introduced nonclustered columnstore indexes. For more information, see the 2012 version of Columnstore
Indexes on MSDN.
SQL Server 2014 has both clustered and nonclustered columnstore indexes. For more information, see the 2014 version of
Columnstore Indexes Described,
Using Nonclustered Columnstore Indexes, and
Using Clustered Columnstore Indexes.
For both SQL Server 2012 and SQL Server 2014, see the wiki article SQL Server Columnstore Performance Tuning
What are Microsoft's in-memory technologies?
Microsoft SQL Server has a family of in-memory technologies. These are all next-generation technologies built for extreme speed on modern hardware systems with large memories and many cores. The in-memory technologies include in-memory analytics engine
(used in PowerPivot and Analysis Services), and the in-memory columnstore index (used in the SQL Server database).
SQL Server 2012, SQL Server 2014, and SQL Server PDW all use in-memory technologies to accelerate common data warehouse queries. SQL Server 2012 introduced two new features: a nonclustered columnstore index and a vector-based query execution capability that
processes data in units called "batches." SQL Server 2014 introduced updateable clustered columnstore indexes.
What is a columnstore?
A columnstore is data that is logically organized as a table with rows and columns, and physically stored in a columnar data format. Relational database management systems traditionally store data in row-wise fashion. The values comprising one row are stored
contiguously on a page. We sometimes refer to data stored in row-wise fashion as a rowstore.
What is a columnstore index?
A columnstore index is a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore. The data is compressed, stored, and managed as a collection of partial columns, called column segments. You can use a columnstore
index to answer a query just like data in any other type of index.
A columnstore index appears as an index on a table when examining catalog views or the Object Explorer in Management Studio. The query optimizer considers the columnstore index as a data source for accessing data just like it considers other indexes when creating
a query plan.
What do I have to do to use a columnstore index?
For nonclustered columnstore indexes, all you have to do is create a nonclustered columnstore index on one or more tables in your database. The query optimizer will decide when to use the columnstore index and when to use other types of indexes. The query
optimizer will also choose when to use the new batch execution mode and when to use row execution mode.
For clustered columnstore indexes, you need to first create a table as a heap or clustered index, and then use the CREATE CLUSTERED COLUMNSTORE INDEX statement to convert the existing table to a clustered columnstore index. If your existing table has indexes,
you need to drop all indexes, except for the clustered index, before creating a clustered columnstore index. Since the clustered columnstore index is the data storage mechanism for the entire table, the clustered columnstore index is the only index allowed
on the table.
Are columnstore indexes available in SQL Azure?
No, not yet.
How do I create a nonclustered columnstore index?
You can create a nonclustered columnstore index by using a slight variation on existing syntax for creating indexes. To create an index named mycolumnstoreindex on a table named mytable with three columns, named col1, col2, and col3, you would use the following
How do I create a clustered columnstore index?
When you create a clustered columnstore index, there is no need to specify columns since all columns in the table are included in the index. This example converts a clustered index called myindex into a clustered columnstore index.
No. When the nonclustered columnstore index is created, it uses a proprietary algorithm to organize and compress the data.
Does the columnstore index have a primary key?
No. There is no notion of a primary key for a columnstore index.
How many columns should I put in my columnstore index?
Typically, you will put all the columns in a table in the columnstore index, although it is not necessary to include all the columns. The limit on the number of columns is the same as for other indexes (1024 columns). If you have a column that has a data
type that is not supported for columnstore indexes, you must omit that column from the columnstore index.
What data types can be used with columnstore indexes?
A columnstore index can include columns with the following data types: int, big int, small int, tiny int, money, smallmoney, bit, float, real, char(n), varchar(n), nchar(n), nvarchar(n), date, datetime, datetime2, small datetime, time, datetimeoffset with
precision <=2, decimal or numeric with precision <= 18.
What data types cannot be used in a columnstore index?
The following data types cannot be used in a columnstore index: decimal or numeric with precision > 18, datetimeoffset with precision > 2, binary, varbinary, image, text, ntext, varchar(max), nvarchar(max), cursor, hierarchyid, timestamp, uniqueidentifier,
How long does it take to create a columnstore index? Is creating a columnstore index a parallel operation?
Creating a columnstore index is a parallel operation, subject to the limitations on the number of CPUs available and any restrictions set on MaxDOP. Creating a columnstore index takes on the order of 1.5 times as long as building a B-tree on the same columns.
My MAXDOP is greater than one but the columnstore index was created with DOP = 1. Why it was not created using parallelism?
If your table has less than one million rows, SQL Server will use only one thread to create the columnstore index. Creating the index in parallel requires more memory than creating the index serially. If your table has more than one million rows,
but SQL Server cannot get a large enough memory grant to create the index using MAXDOP, SQL Server will automatically decrease DOP as needed to fit into the available memory grant. In some cases, DOP must be decreased to one in order to build the index under
How much memory is needed to create a columnstore index?
The memory required for creating a columnstore index depends on the number of columns, the number of string columns, the degree of parallelism (DOP), and the characteristics of the data. SQL Server will request a memory grant before trying to create the
index. If not enough memory is available to create the index in parallel with the current max DOP, SQL Server will reduce the DOP as needed to get an adequate memory grant. If SQL Server cannot get a memory grant to build the index with DOP = 1, the index
creation will fail.
A rule of thumb for estimating the memory grant that will be requested for creating a columnstore index is:
Memory grant request in MB = [(4.2 *Number of columns in the CS index) + 68]*DOP + (Number of string cols * 34)
What can I do if I do not have enough memory to build the columnstore index?
It's possible for creation of a columnstore index to fail either at the very beginning of execution if it can't get the necessary initial memory grant, or later during execution if supplemental grants can't be obtained. If the initial grant fails, you'll
see error 8657 or 8658. You may get error 701 or 802 if memory runs out later during execution. If out-of-memory error 8657 or 8658 occur at the beginning of columnstore index creation,
first, check your resource governor settings. The default setting for resource governor limits a query in the default pool to 25% of available memory even if the server is otherwise inactive. This is true even if you have not enabled resource governor. Consider
changing the resource governor settings to allow the create index statement to access more memory. You can do this using TSQL:
Can I create a columnstore index on a compressed table?
Yes. The base table can have PAGE compression, ROW compression, or no compression. The columnstore index will have its own compression, which cannot be specified by the user.
I tried to create a columnstore index with SQL Server Management Studio using the Indexes->New Index menu and it timed out after 20 minutes. How can I work around this?
Run a CREATE NONCLUSTERED COLUMNSTORE INDEX statement manually in a T-SQL window instead of using the graphical interface. This will avoid the timeout imposed by the Management Studio graphical user interface.
Can I create a filtered columnstore index?
No. A columnstore index must contain data from all the rows in the table.
Can I create a columnstore index on a computed column?
Can I create a columnstore index on a sparse column?
Can I create a columnstore index on an indexed view?
Can I create multiple columnstore indexes?
What are the advantages and disadvantages of row stores and column stores?
When data is stored in column-wise fashion, the data can often be compressed more effectively than when stored in row-wise fashion. Typically there is more redundancy within a column than within a row, which usually means the data can be compressed to a
greater degree. When data is more compressed, less IO is required to fetch the data into memory. In addition, a larger fraction of the data can reside in a given size of memory. Reducing IO can significantly speed up query response time. Retaining more of
your working set of data in memory will speed up response time for subsequent queries that access the same data.
When data is stored column-wise, it is possible to access the column individually. If a query only references a few of the columns in the table, it is only necessary for a subset of the columns to be fetched from disk into memory. For example, if a query
references five columns from a table with 50 columns (i.e. 10% of the columns), IO is reduced by 90% (in addition to any benefits from compression).
On the other hand, storing columns in independent structures means that the data must be recombined to return the data as a row. When a query touches only one (or a few) rows, having all the data for one row stored together can be an advantage if the row
can be quickly located with a B-tree index. Row stores may offer better query performance for very selective queries, such as queries that lookup a single row or a small range of rows. Updating data is also simpler in a row store.
What is the difference between a pure column store and a hybrid column store?
SQL Server columnstore indexes are pure column stores. That means that the data is stored and compressed in column-wise fashion and individual columns can be accessed separately from other columns. A hybrid columnstore stores a set of rows together, but
within that set of rows, data is organized and compressed in column-wise fashion. A hybrid column store can achieve good compression from a column-wise organization within the set of rows, but when data is fetched from disk, the pages being fetched contain
data from all the columns in each row. Even if a query references only 10% of the columns in a table, all the columns must be fetched from disk, and unused columns also take up space in main memory. SQL Server columnstore indexes require less I/O and give
better main-memory buffer pool hit rates than a hybrid columnstore.
Is a columnstore index better than a covering index that has exactly the columns I need for a query
The answer depends on the data and the query. Most likely the columnstore index will be compressed more than a covering row store index. If the query is not too selective, so that the query optimizer will choose an index scan and not an index seek, scanning
the columnstore index will be faster than scanning the row store covering index. In addition, depending on the nature of the query, you can get batch mode processing when the query uses a columnstore index. Batch mode processing can substantially speed up
operations on the data in addition to the speed up from a reduction in IO. If there is no columnstore index used in the query plan, you will not get batch mode processing. On the other hand, if the query is very selective, doing a single lookup, or a few
lookups, in a row store covering index might be faster than scanning the columnstore index.
Another advantage of the columnstore index is that you can spend less time designing indexes. A row store index works well when it covers all the columns needed by a query. Changing a query by adding one more column to the select list can render the covering
index ineffective. Building one columnstore index on all the columns in the table can be much simpler than designing multiple covering indexes.
Is the columnstore index the same as a set of covering indexes, one for each column?
No. Although the data for individual columns can be accessed independently, the columnstore index is a single object; the data from all the columns is organized and compressed as an entity. While the amount of compression achieved is dependent on the characteristics
of the data, a columnstore index will most likely be much more compressed than a set of covering indexes, resulting in less IO to read the data into memory and the opportunity for more of the data to reside in memory across multiple queries. In addition,
queries using columnstore indexes can benefit from batch mode processing, whereas a query using covering indexes for each column would not use batch mode processing.
Is columnstore index data still compressed after it is read into memory?
Yes. Column segments are compressed on disk and remain compressed when cached in memory.
Do columnstore indexes use bitmap indexes?
No. Columnstore indexes use a proprietary data representation based on Vertipaq. It’s not the same as a bitmap index and doesn’t use one. But it has some similar benefits to bitmap indexes, such
as reducing the time it takes to filter on a column with a small number of distinct values.
I want to show other people how cool SQL Server columnstore indexes are. What can I show them?
Where can I find more information (including documents and videos) about SQL Server columnstore indexes?
SQL Server Columnstore FAQ:
SQL Server Columnstore Performance Tuning Guide:
The Coming In-Memory Tipping Point, by David Campbell
Microsoft Virtual Academy talk video, 47 minutes, March 2012:
TechEd 2011 talk video, Columnstore Indexes Unveiled, 1 hour, 9 minutes:
TechEd 2012 talk video, SQL Server Columnstore Performance Tuning, 1 hour, 15 minutes:
Columnstore performance and partition switching demo video, 9 minutes:
Columnstore performance demo video, 4 minutes:
ACM SIGMOD 2011 paper on SQL Server columnstore indexes:
IEEE Data Engineering Bulletin Paper on SQL Server columnstore indexes, March 2012:
VertiPaq vs ColumnStore: Performance Analysis of the xVelocity Engine, v1.0, rev 2, Aug 3, 2012.
Microsoft SQL Server 2012 Columnstore for Real Time Reporting in Manufacturing Automation (COPA-DATA zenon Analyzer), 2012.
Case Study (bwin.party):
Case Study (Motricity: Migration from Sybase IQ to xVelocity columnstore index):
Case Study (MS People):
Case Study (Columnstore Indexes to Speed ETL):
Case Study (Mediterranean Shipping Company):
Case Study (Beth Israel Deaconess Medical Center):
Case Study (Belgacom)
Case Study (BNZ - New Zealand Bank)
Case Study (RHI - Refractory Materials Manufacturer)
Case Study (Recall -- Records Management Firm)
Slide deck on CDR (Telecom) application design loading 100M rows per day with 3 year retention
Internal Microsoft Columnstore Benchmark:
SQL Server Column-Store available for all major SAP BW releases
SQL Server 2012 and Tableau -- speeding things up
What determines how many segments there will be?
Each physical partition of a columnstore index is broken into one-million-row chunks called segments (a.k.a. row groups). The index build process creates as many full segments as possible. Because multiple threads work to build an index in parallel, there
may be a few small segments (typically equal to the number of threads) at the end of each partition with the remainder of the data after creating full segments. That's because each thread might hit the end of its input at different times. Non-partitioned tables
have one physical partition.
How do I know whether the columnstore index is being used for my query?
How can I force the query to use a columnstore index?
How can I prevent the use of a columnstore index in my query?
Are columnstore indexes an in-memory database technology?
SQL Server columnstores provide the performance benefits of a pure in-memory system with the convenience and economics of a system that stores data on disk and caches recently used data in memory. Columnstores hold data in memory in a different format than is
kept on disk. This in-memory representation is highly optimized to support fast query execution on modern processors. Not all data has to fit in memory with a SQL Server columnstore index. But if all columnstore data does fit in memory, SQL Server provides
pure-in-memory levels of performance.
Why require all data to fit in memory (capping your database size or demanding a large budget to purchase memory, and demanding slow system startup times) if you can get the best of both worlds, that is, state-of-the-art query performance on economical hardware?
Does all the data have to fit in memory when I use a columnstore index?
No, a columnstore index is persisted on disk just like any other index. It is read into memory when needed just like other types of indexes. The columnstore index is divided into units called segments, which are the unit of transfer. A segment is stored
as a LOB, and can consist of multiple pages. We elected to bring columnstore index data into memory on demand rather than require that all data fits in memory so customers can access databases much bigger than will fit in main memory. If all your data fits
in memory, you'll get reduced I/O and the fastest possible query performance. But it's not necessary for all data to fit in memory, and that's a plus.
What determines whether the columnstore index is stored in memory?
Can I force a whole columnstore index to be loaded into memory?
You cannot force the columnstore index to be loaded, or kept, in memory but you can warm the cache by running a query that will cause the columnstore data to be read into memory.
When should I build a columnstore index?
Columnstore indexes are designed to accelerate data warehouse queries, not OLTP workloads. Use columnstore indexes when your query workload entails scanning and aggregating large amounts of data or joining multiple tables, especially in a star join pattern.
The restrictions on how you update the data will also affect your choice. Columnstore indexes will be easiest to manage if you have a read-mostly workload and if partition switching to update the data will fit into your workflow. Partition switching for
handling updates is easier if most updates consist of appending new data to the existing table and can be placed in a staging table that can be switched into the table during periodic load cycles.
Typically you will want to build a columnstore index on large fact tables and maybe on large dimension tables as well. You can build a columnstore index on very small tables, but the performance advantage is less noticeable when the table is small. If
you frequently update your dimension tables, and they are not too large, you may find the maintenance effort outweighs the benefit of a columnstore index.
When should I not build a columnstore index?
Do columnstore indexes work with Transparent Data Encryption?
Can I compress the columnstore index?
The columnstore index is compressed when it is created. You cannot apply PAGE or ROW compression to a columnstore index. When a columnstore index is created, it uses the VertiPaqTM compression algorithms, which compress the data more than either
PAGE or ROW compression. There is no user control over compression of the columnstore index.
What is the difference in storage space used between the base table and the columnstore index?
Based on our experiments with a variety of different data sets, columnstore indexes are about 4X to 15X smaller than an uncompressed heap or clustered B-tree index, depending on the data.
Do columnstore indexes work on partitioned tables?
Yes, you can create a columnstore index on a partitioned table. The columnstore index must be partition-aligned with the base table. If you do not specify a partition scheme when you create the columnstore index, the index will be automatically created using
the same partition scheme as the base table. You can switch a partition in and out of a partitioned table with the same requirements regarding matching indexes as exist for other types of clustered and nonclustered indexes.
Can I partition a columnstore index?
Yes, you can partition a columnstore index, but the base table must also be partitioned and the columnstore index must be partition-aligned with the base table.
How do I add to, or modify, the data in a table with a columnstore index?
Once you create a columnstore index on a table, you cannot directly modify the data in that table. A query with INSERT, UPDATE, DELETE, or MERGE will fail and return an error message. To add or modify the data in the table, you can do one of the following:
-- update the data --
Now the columnstore index is ready to use again.
See also the question about
What happens if I try to update a table that has a columnstore index?
The update will fail and return an error message.
Can I disable and rebuild the index on a single partition?
No. You can only disable or rebuild a columnstore index on the entire table. If you want to rebuild only one partition, you should switch the partition into an empty staging table, disable/rebuild the index on the staging table, and switch the staging table
back into the main table. There is no need to rebuild the index except when you want to modify the data in the table.
How can I tell whether there is a columnstore index on my table?
There are two ways to determine whether a columnstore exists on a table. In Management Studio, you can look at the Object Explorer. Each table has an entry for Indexes. Columnstore indexes are included in the list of indexes and have their own icon and description.
You can also look at various catalog tables. In sys.indexes, a columnstore index has type = 6 and type_desc = “NONCLUSTERED COLUMNSTORE.” A new catalog table,
sys.column_store_index_stats, has one row for each columnstore index.
How can I find out more about my columnstore indexes? Is there metadata?
There are two new catalog tables with data about columnstore indexes:
VIEW DEFINITIONS permission on a table is required to see information in the catalog tables about a columnstore index on that table. In addition, a user must have SELECT permission on the table to see data in the following columns:
has_nulls, base_id, magnitude, min_data_id, max_data_id, null_value, data_ptr
last_id, entry_count, data_ptr
A user who does not have SELECT permission on a table will see NULL as the value in the columns listed above.
Does the columnstore compression algorithm compress each partition separately?
Yes, each partition is compressed separately. Each partition has its own dictionaries. All segments within a partition share dictionaries. Dictionaries for different partitions are independent. This allows partition switching to be a metadata-only operation.
How big are my columnstore indexes?
You can use the new catalog tables or sys.dm_db_partition_stats to determine how big the columnstore indexes are on disk. A relatively simple query to get the size of one columnstore index is:
Here are some other queries that total up column store component sizes.
-- size per index
-- size per table
-- size per column
-- It may be that not all the columns in a table will be or can be included
-- in a nonclustered columnstore index,
-- so we need to join to the sys.index_columns to get the correct column id.
I -- Join to Indexes system table
c --Join to Index columns
--Need to join to the index_column_id with the column_id
= 'NONCLUSTERED COLUMNSTORE'
What is batch mode processing?
Batch mode processing uses a new iterator model for processing data a-batch-at-a-time instead of a-row-at-a-time. A batch typically represents about 1000 rows of data. Each column within a batch is stored as a vector in a separate area of memory, so batch
mode processing is vector-based. Batch mode processing also uses algorithms that are optimized for the multicore CPUs and increased memory throughput that are found on modern hardware. Batch mode processing spreads metadata access costs and other types of
overhead over all the rows in a batch, rather than paying the cost for each row. Batch mode processing operates on compressed data when possible and eliminates some of the exchange operators used by row mode processing. The result is better parallelism and
How do I know whether batch mode processing is being used for my query?
Batch mode processing is only available for certain operators. Most queries that use batch mode processing will have part of the query plan executed in row mode and part in batch mode. You can tell whether batch mode processing is being used for an operator
by looking at showplan. If you look at the properties for a scan or other operator in the Actual Execution Plan, you will see two new properties: EstimatedExecutionMode and ActualExecutionMode. Only EstimatedExecutionMode is displayed in the Estimated Execution
Plan. The values for these two properties can be either row or batch. There is also a new operator for hash joins when they are being executed in batch mode. The
BatchHashTableBuild operator appears in graphical showplan and has a new icon.
Can EstimatedExecutionMode and ActualExecutionMode be different? When and why?
The query optimizer chooses whether to use batch mode processing when it formulates the query plan. Most of the time, EstimatedExecutionMode and ActualExecutionMode will have the same value, either
batch or row. At run time, two things can cause a query plan to be executed in row mode instead of batch mode: not enough memory or not enough threads. The most common reason for the ActualExecutionMode to be row when the EstimatedExecutionMode
was batch is that there was a large hash join and all the hash tables could not fit in memory. Batch mode processing uses special in-memory hash tables. If the hash tables do not fit in memory, execution of the query reverts to using row mode and traditional
hash tables that can spill to disk. The other reason for changing to row mode is when not enough threads are available for parallel execution. Serial execution always occurs in row mode. You can tell that a fall back to serial execution occurred if the estimated
query plan shows parallel execution but the actual query plan is executed serially.
If the query executes in parallel but falls back to row mode processing, you can infer that memory was the problem. There is also an xevent (batch_hash_table_build_bailout) that is fired when there is not enough memory during hash join and the query
falls back to row mode processing. If this happens, incorrect cardinality estimation may have contributed to the problem. Check the cardinality estimation and consider updating statistics on the table.
Is a parallel query plan required to get batch mode processing?
Yes. Batch mode processing occurs only for parallel query execution. If the cost of the query plan is small, the optimizer may choose a serial plan that is "good enough." When experimenting with columnstore indexes you may need a large data set
to see the effects of batch mode processing. Check the degree of parallelism if you see that a query was executed in row mode when you expected batch mode.
Can I get batch mode processing even if I don’t have a columnstore index?
No. Batch mode processing only occurs when a columnstore index is being used in the query.