Solved

ODX Server and unicode data source fields

  • 17 October 2023
  • 6 replies
  • 75 views

Userlevel 3
Badge +2

Hi community,

I have a MariaDB data source which I connect to using ODX Server (v 6346) and the CData/ADO.net connector for MariaDB v 23.0.8565.0.

The source database is set to collation utf8mb4_general_ci. I noticed by copying from ODX Server to a Data Area that some special characters are slightly different than the source/ODX. I can fix this by manually changing the data type to nvarchar instead of varchar, but that is a lot of field editing.

The Override Data Type function can help as well, but as we have a wide range of field lengths, this also means a lot of override rules.

Is there a way to tell ODX Server that all varchar fields should be unicode/nvarchar by default (eg. based on the source database collation)?

Example from ODX Server. These fields should all be nvarchar(xx) instead of varchar(xx):

 

Implementing this idea would work for me as well :)

 

In TX v20 Business Units, we had this option:

 

icon

Best answer by Christian Hauggaard 18 October 2023, 15:52

View original

6 replies

Userlevel 6
Badge +5

Hi @erik.van.mastrigt 

I tested this and as far as I can tell it seems like unicode characters are still coming through despite the data source explorer showing varchar. Below is a screenshot of the data from a Maria DB using Maria Cdata connector. Please note that in my DW table I manually changed the data type to nvarchar. Can you please test if the same is true on your end?

Maria DB / MySQL seems to be a bit different in terms of data types i.e. even if you set the data type to nvarchar, when you go back to alter the table, the column is still varchar.

Reading up on this online, I found the following potential explanation:

There is no need for NVARCHAR here as Mysql handles Unicode fine with VARCHAR. (Actually, NVARCHAR is just VARCHAR with predefined utf8 char set - see https://dev.mysql.com/doc/refman/5.7/en/charset-national.html)

Userlevel 3
Badge +2

Hi Christian,

Reading from MariaDB into Data Lake is going fine. Here’s an example character in field Naam, ‘≤’

 

When I put the table into a Data Area, this field becomes a VARCHAR and this particular character has changed to ‘=’

 

When I manually change the field to NVARCHAR, the transfer goes fine:

 

Point is that this happens in a lot of fields and I don’t want to manually change all of them.

Userlevel 6
Badge +5

OK, I will reach out to Cdata to see if they have any other ideas.

The only workarounds I can think of currently are:

  • manually changing the field to NVARCHAR in the DW as you describe above
  • using the override with a set length (e.g. nvarchar(100))  or nvarchar(max)
Userlevel 6
Badge +5

I got the following response from Cdata:

“MySQL doesn't really have a NVARCHAR data type, and when you create a NVARCHAR, it actually just stores as a VARCHAR. With the metadata we get from the MariaDB server, there isn't a way for us to differentiate between columns that were created with NVARCHAR or VARCHAR, because the server itself is reporting the column as a VARCHAR in both scenarios.

You can see in the MariaDB docs that the NVARCHAR data type doesn't exist, and the VARCHAR page explains more on how NVARCHAR are just VARCHAR:
String Data Types - MariaDB Knowledge Base

So the workarounds mentioned above seem to be the only feasible workarounds at the moment.

Userlevel 5
Badge +7

Hi @Christian Hauggaard ,

perhaps it is better to approach this the other way around: an encoding of utf8mb4 can have at most 4 bytes per character and the collation defines the id of the codepage. In my opinion that specific combination should always result in nvarchar. I understand that from CData's perspective this may not always be true and therefore they may not want to change, but in the context of dealing with Microsoft's SQL stack this would be always true (or at least not incorrect).

Userlevel 3
Badge +2

Thanks for asking CData @Christian Hauggaard . MySQL/MariaDB doesn’t have a nvarchar type, and as you can see in my example the characters are transferred from source to data lake without any modifications.

The character changes when the data gets into SQL Server. TX v20 (business units) had an option to override the encoding to unicode, which solved the issue for all fields.

I agree with @rory.smith, thanks!

Reply