Skip to main content

Hi,


Let's assume I am importing from CSV file. The Data is very dirty and somehow I have stings in my int filed "ID" so the filed is interpreted as NVARCHAR(MAX). If I change data type of the field (right click > Edit Field) from string back to integer I receive a runtime error during the execution. What I was expecting was to see was corresponding entries in the _M and _L tables. Is this not part of data cleansing after all?


We have this problem all over and could not find a better solution than to build custom views with TRY_CAST etc. where we lose lineage etc.


What is the best way to clean up such data (keep valid rows, write errors/warnings on invalid rows)? We are talking 100+ tables so I'm looking for a highly scalable solution here not a one time work around.


Thanks and BR, Tobias

Hi Tobias,

Thank you for your question!

A quick solution I could come up with is creating custom data selection rules to separate valid from invalid records:

Creating log messages for the invalid rows takes some additional work - so that may or may not be feasible for large numbers of tables.

This will allow you to separate out the dirty records from the clean ones:

I will propose to the product team to offer TRY_CAST as an option and log messages for dirty records in a more "natural/integrated" way.


Hi Tobias,

my colleague Thomas suggested that you can also use RSD files to clean "dirty" columns.

I think this is a great solution also for larger numbers of columns/tables.

These are the steps to take:

  • In your CSV data source you set these two parameters: Generate Schema Files and Schema Location
  • Running the Synchronise task will generate a RSD file.
  • Then you modify the generated RSD files to clean the data on reading, in the columns you specify:
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">

  <api:info title="dirtyfacts.csv" other:catalog="CData" other:schema="CSV" description="" other:version="20">
    <attr name="cleanint" xs:type="int" isnullable="true" other:internalname="cleanint" />
    <attr name="cleanstring" xs:type="string" isnullable="true" other:internalname="cleanstring" />
    <attr name="dirtyint" xs:type="int" isnullable="true" other:internalname="dirtyint" />
  </api:info>

  <api:set attr="uri" value=""_connection.uri]" />

  <api:script method="GET">
    <api:call op="csvproviderGet">

      <api:set attr="_out1.dirtyint" value="rdirtyint | isnumeric(ndirtyint], -999)]" />

      <api:push />


    </api:call>
  </api:script>

</api:script>

This will give you the columns, cleaned, with invalid values already replaced with a default integer value.

From this source:

This data will be produced:

I hope this helps.


Reply