Product Targeted

MS SQL Server 2012

Summary

In SQL server, few specific data handling scenarios require Unicode source data to be moved to non-Unicode destination format like ASCII format. During data conversion, it can be observed that the Unicode characters getting converted to ‘?’ instead of retaining its Unicode data meaning in the destination.

Unicode data in SQL Server or flat file source if ported to SQL server table in a VARCHAR column, using SSIS, BCP or any other mean will show similar behavior.

Problem Description

MS SQL Server by default supports Code Page 1252 (ISO Character Set - SQL_Latin1_General_CP1_CI_AS) but it differs in how it implements other code pages related to other ISO standards. Below query will give details on all the supported code pages in SQL Server...

SELECT

 [name]

 ,COLLATIONPROPERTY([name], 'CodePage') AS [CodePage]

 ,COLLATIONPROPERTY([name], 'LCID') AS [LCID]

 ,COLLATIONPROPERTY([name], 'ComparisonStyle') AS [ComparisonStyle]

 ,COLLATIONPROPERTY([name], 'Version') AS [Version]

 ,[description]

FROM ::fn_helpcollations() ORDER BY CODEPAGE

Little tricky part is SQL Server after 2005 to SQL Server 2008 R2 does not support code page 65001 (UTF-8 encoding) and making it difficult to move data between UTF16 – UTF8 encoding without losing data. Given above using SQL Server versions mentioned above Due to this limitation, it is impossible to port Unicode data from NVARCHAR column to VARCHAR column without losing special characters.

Good news is SQL Server 2012 supports porting of data from UTF16 – UTF 8 encoding.  i.e. it is possible to export Unicode data to file with code page 65001 (Down convert) and also import back in to column with VARCHAR data type in SQL table (Up convert). This can be achieved by first exporting Unicode data to a file with code page 65001 (Down convert) and then import back in to VARXHAR column in SQL server table(Up convert)

Solution Approach

In SQL Server 2012 there is a support for code page 65001, so one can use import export wizard quickly to export data from SQL table to non-Unicode format (also can save resulting SSIS package for further use) and import that back to SQL Server table in table with VARCHAR column. Only difference is the data in VARCHAR column will look like garbage because of native data character set conversion, but will retain original value based on collation setting for the object.

To obtain the original Unicode value one can up convert this VARCHAR data back into NVARCHAR data using specific code page and importing exported flat file back into Unicode column.

Steps

1.     Use SQL import export wizard (?) to export Unicode data to UTF8, using code page 65001 to flat file destination

2.   Create new table with similar schema but replace Unicode column data type with non-Unicode data column (i.e. separate table with VARCHAR column for NVARCHAR column data)

3.   Import the exported flat file in to new table and it should be successful but original Data will look different like shown below based on the column collation.

4.   If there is a need to up convert the data from UTF-8 to UTF -16 i.e. from VARCHAR to NVARCHAR back to original value , just need to export it back again to flat file and import again in NARCHAR column and it will retain the original value

5.   If data is being used through the web application then up conversion can be handled by client application using appropriate encoding.