sqlsrv driver, database collation and UTF-8

    Question

  • Good day,

    when trying out UTF-8 with the sqlsrv driver version 2.0.1802.200 in PHP 5.3.6 with SQL Server 2005, I noticed that if the 'CharacterSet' connection parameter is set to 'UTF-8', characters are stored correctly in both varchar(100) and nvarchar(100) columns, i.e. characters such as 'ľščťžýáíéúäôň' show up correctly in both SQL Server Management Studio and the web application. However, according to the information on this page:

    http://msdn.microsoft.com/en-us/library/cc626307.aspx

    it would seem that unless the database column data type is nchar or nvarchar the data are unlikely to be saved correctly. The database collation is Slovak_CI_AS.

    What I'd like to know is whether an automatic conversion is being performed in the background (by the driver or by SQL Server?) and whether it's safe to presume that this won't change.

    Also, I have been under the impression that this query:

    SELECT
    	*
    FROM
    	test_table
    WHERE
    	left(test_column, 6) = left(test_column2, 6)
    ;
    

    should not return the same results when test_column is a varchar column and test_column2 is an nvarchar column as it does when both columns are of the same data type.

    Could this: "Objects that use nchar or nvarchar are assigned the default collation of the database unless a specific collation is assigned using the COLLATE clause" be responsible for that happening (automatic conversion) in the query? Does this, somehow, relate to the conversion from/to UTF-8?

    I quoted this documentation article above: http://msdn.microsoft.com/en-us/library/ms186939%28v=SQL.90%29.aspx .

    Thanks in advance.

    Tuesday, May 31, 2011 9:27 AM

Answers

  • Hi Slapo,

    In SQL Server, the collation controls the code page that is used to store character data in non-Unicode data types like char and varchar. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters. Characters in non-Unicode data types without an associated code point are mapped either to an alternate character or to the “?” character.

    nchar and nvarchar are Unicode data types. Unicode data types use Unicode character representation. Code pages do not apply to these data types.

    If ľščťžýáíéúäôň is a Slovak word, you can store it in a non-Unicode data type with a Slovak collation or store it in a Unicode data type.
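
    One rough way to see which characters a given code page can hold is to try a round-trip conversion in PHP. The sketch below is an illustration only: fitsCodePage() is a hypothetical helper (not part of the sqlsrv driver), and it assumes the iconv extension is loaded, that the script file is saved as UTF-8, and that Slovak_CI_AS corresponds to code page 1250 while a Latin1 collation such as SQL_Latin1_General_CP1_CI_AS corresponds to code page 1252.

```php
<?php
// Sketch only: fitsCodePage() is a hypothetical helper for illustration.
// Assumes the iconv extension and a UTF-8 encoded source file.

/**
 * Returns true if every character of a UTF-8 string can be represented
 * in the given code page (checked with a round-trip conversion).
 */
function fitsCodePage($utf8, $codePage)
{
    // iconv() returns false (and raises a notice) for characters it cannot convert.
    $encoded = @iconv('UTF-8', $codePage, $utf8);
    if ($encoded === false) {
        return false;
    }
    // Guard against iconv builds that substitute characters instead of failing.
    return iconv($codePage, 'UTF-8', $encoded) === $utf8;
}

// Assumed mapping: Slovak_CI_AS -> code page 1250.
var_dump(fitsCodePage('ľščťžýáíéúäôň', 'CP1250'));
// Assumed mapping: SQL_Latin1_General_CP1_CI_AS -> code page 1252.
var_dump(fitsCodePage('ľščťžýáíéúäôň', 'CP1252'));
```

    Under those assumptions the first check should report true and the second false, which is consistent with the Slovak text surviving in a varchar column of a Slovak-collated database. Whether the driver or the server performs the conversion is not something this sketch can show.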

    For more information, please see:

    http://msdn.microsoft.com/en-us/library/ms144260.aspx 

    http://weblogs.sqlteam.com/dang/archive/2009/07/26/Collation-Hell-Part-1.aspx


    • Marked as answer by KJian_ Tuesday, June 07, 2011 5:50 AM
    Friday, June 03, 2011 3:54 AM

All replies

  • Hello sir, I don't know if this is the right place to ask about Unicode/UTF-8 with SQL Server 2008, PHP 5.3.10 and sqlsrv 2.0 (both in the NTS VC9 installation).

    The web application is built in SharePoint. One of its features is a Malay dictionary with phonetics that can be entered through the web application. I now have an Excel file with a lot of phonetics that I need to load onto the server. We don't want to enter all 42,000 phonetics manually, so I am using PHP to do it for us.

    I created a small script that connects to our SQL Server using sqlsrv 2.0 without problems; we can read and write data. In PHP the phonetics display exactly as they appear in the Excel file, but once stored in SQL Server they come out different.

    The table uses the collation "SQL_Latin1_General_CP1_CI_AS" and the phonetic column is nvarchar. The Excel file and PHP both display the correct phonetic when PHP sends "header( 'Content-Type: text/html; charset=UTF-8' );"

    Here is the sample phonetic in Excel:  [mᴧŋasaˀ]

    PHP displays the phonetic like this without the charset:  [mᴧŋasaË€]

    With the charset set to UTF-8, PHP displays this:  [mᴧŋasaˀ]

    Then I proceed to store it in SQL Server using sqlsrv 2.0.

    But when viewed in our web application, and also in SQL Server Management Studio, the phonetic looks like what PHP shows without the charset set to UTF-8: [mᴧŋasaË€]
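
    The "Ë€" seen in the output without a charset is what the two UTF-8 bytes of "ˀ" (U+02C0) look like when they are read as Windows-1252. The sketch below reproduces that effect outside the database. It is an illustration only: misreadAsCp1252() is a hypothetical helper, it assumes the iconv extension, and it does not by itself show where in the pipeline the misreading happens.

```php
<?php
// Sketch only: misreadAsCp1252() is a hypothetical helper for illustration.
// Assumes the iconv extension is loaded.

/**
 * Shows what a UTF-8 string looks like when its raw bytes are
 * interpreted as Windows-1252 instead of UTF-8.
 */
function misreadAsCp1252($utf8)
{
    // Treat the bytes as CP1252 and re-encode the result as UTF-8 for display.
    return iconv('CP1252', 'UTF-8', $utf8);
}

$glottal = "\xCB\x80";               // UTF-8 encoding of U+02C0
echo bin2hex($glottal), "\n";        // the two raw bytes
echo misreadAsCp1252($glottal), "\n"; // the same bytes read as CP1252
```

    If the stored value shows the same pattern, that would suggest the bytes reached the column as single-byte characters rather than as Unicode text, but that remains an assumption to verify.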

    How can we store the exact Unicode text in SQL Server using PHP and sqlsrv?

    Thanks a lot.

    Thursday, March 15, 2012 9:12 AM
  • You must make sure that when you save the data, and read the data in your web applications, you include the 'CharacterSet' option when you connect to SQL Server:

    $conn = sqlsrv_connect('server', array(/** Other parameters, **/ 'CharacterSet'=>'UTF-8'));

    Rob

    Thursday, March 15, 2012 9:48 AM
  • Thanks for the fast reply, and sorry for any missing information. If I use only PHP for storing and retrieving the data in SQL Server 2008 (PHP 5.3.10 and SQLSRV driver 2.0), the data displays correctly in PHP without problems.

    But we only use PHP to migrate the phonetic data from Excel files to SQL Server 2008, instead of entering it manually through our web application, which is built in SharePoint.

    So when I store the data from Excel in SQL Server using PHP and SQLSRV, both SQL Server Management Studio and our SharePoint web application display the data incorrectly.

    Excel file: [mᴧŋasaˀ]
    PHP (without charset): [mᴧŋasaË€]
    PHP (with charset UTF-8): [mᴧŋasaˀ]

    I use SQLSRV and PHP to store the phonetic in SQL Server. The table collation is "SQL_Latin1_General_CP1_CI_AS" and the column is NVARCHAR.

    The Excel file is exported to a tab-delimited text file in UTF-8 encoding.

    PHP code:
    header( 'Content-Type: text/html; charset=UTF-8' );
    mb_internal_encoding( 'UTF-8' );
    // Read the first tab-delimited line and take its first field as the phonetic.
    $file_handle = fopen($file, "r");
    $text = fgets($file_handle, 4096);
    $text = explode("\t", $text);
    $Fonetik = $text[0];
    fclose($file_handle);
    $connectionInfo = array("UID"=>$username, "PWD"=>$password, "Database"=>$db, "CharacterSet" => "UTF-8");
    // sqlsrv_errors() returns an array, so format it with print_r() before printing.
    $sqlconnect = sqlsrv_connect($server, $connectionInfo) or die ("ERROR: Connect. " . print_r(sqlsrv_errors(), true));
    $sqlquery = "INSERT INTO $table (Fonetik) VALUES (?)";
    // Each parameter is itself an array: value, direction, PHP type.
    $sqlparam = array(array($Fonetik, SQLSRV_PARAM_IN, SQLSRV_PHPTYPE_STRING('UTF-8')));
    $sqlresult = sqlsrv_query($sqlconnect, $sqlquery, $sqlparam) or die ("ERROR: Insert (LEMA). " . print_r(sqlsrv_errors(), true));
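
    Before a value read with fgets() is bound to the query, it may be worth confirming that it really is clean UTF-8. The sketch below is one possible check, not a confirmed fix for the problem described here: normalizeUtf8Line() is a hypothetical helper, and it assumes the mbstring extension is available.

```php
<?php
// Sketch only: normalizeUtf8Line() is a hypothetical helper for illustration.
// Assumes the mbstring extension is loaded.

/**
 * Strips a leading UTF-8 byte order mark and the trailing line break
 * from a line, and rejects input that is not valid UTF-8.
 */
function normalizeUtf8Line($line)
{
    // Remove a UTF-8 BOM if the line starts with one.
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        $line = substr($line, 3);
    }
    // Drop the line break that fgets() leaves at the end.
    $line = rtrim($line, "\r\n");
    // Fail loudly instead of storing bytes in an unexpected encoding.
    if (!mb_check_encoding($line, 'UTF-8')) {
        throw new InvalidArgumentException('Line is not valid UTF-8: ' . bin2hex($line));
    }
    return $line;
}
```

    It could be applied to the result of fgets() before explode(), so that an export in a different encoding is reported instead of being inserted silently.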

    The phonetic data from Excel is stored successfully in SQL Server 2008, but the stored value is [mᴧŋasaË€]. So when we use our SharePoint web application it displays [mᴧŋasaË€], and SQL Server Management Studio shows the same value in the column.

    What we want is for it to be stored as [mᴧŋasaˀ]. How can we do that?

    Thanks a lot, I do appreciate your help.



    Thursday, March 15, 2012 10:16 AM
  • Continued here.

    Rob

    Thursday, March 15, 2012 12:49 PM