Solved

Change Data Type in Cleansing

  • September 9, 2022
  • 2 replies
  • 142 views

Hi,

Let's assume I am importing from a CSV file. The data is very dirty and somehow I have strings in my int field "ID", so the field is interpreted as NVARCHAR(MAX). If I change the data type of the field (right click > Edit Field) from string back to integer, I receive a runtime error during execution. What I was expecting to see was corresponding entries in the _M and _L tables. Is this not part of data cleansing after all?

We have this problem all over and could not find a better solution than to build custom views with TRY_CAST etc., where we lose lineage, among other things.
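For illustration, the kind of view we build today looks roughly like this (a simplified sketch; the staging.Customers table and its columns are made-up names):

-- Keep only rows whose ID parses as an integer.
CREATE VIEW dbo.Customers_Clean AS
SELECT TRY_CAST(ID AS INT) AS ID, Name
FROM staging.Customers
WHERE TRY_CAST(ID AS INT) IS NOT NULL;

-- Rows with a non-numeric ID end up here instead of in the _M/_L logs.
CREATE VIEW dbo.Customers_Errors AS
SELECT ID, Name
FROM staging.Customers
WHERE TRY_CAST(ID AS INT) IS NULL;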

What is the best way to clean up such data (keep valid rows, write errors/warnings for invalid rows)? We are talking about 100+ tables, so I'm looking for a highly scalable solution here, not a one-time workaround.

Thanks and BR, Tobias


2 replies

fwagner
Employee
  • 33 replies
  • Answer
  • September 26, 2022

Hi Tobias,

Thank you for your question!

A quick solution I could come up with is to create custom data selection rules that separate valid from invalid records; a sketch of such a rule follows below.

Creating log messages for the invalid rows takes some additional work - so that may or may not be feasible for large numbers of tables.

This will allow you to separate out the dirty records from the clean ones.
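As a sketch (the screenshots from the original answer are not reproduced here, and the table/field names are made up), the conditions behind such rules could be plain SQL predicates:

-- Selection rule for the valid table: keep rows whose ID casts to INT.
TRY_CAST([ID] AS INT) IS NOT NULL

-- Selection rule for the invalid/quarantine table: the complement.
TRY_CAST([ID] AS INT) IS NULL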

I will propose to the product team that TRY_CAST be offered as an option, together with log messages for dirty records, in a more "natural/integrated" way.


fwagner
Employee
  • 33 replies
  • September 27, 2022

Hi Tobias,

My colleague Thomas suggested that you can also use RSD files to clean "dirty" columns.

I think this is a great solution, even for larger numbers of columns/tables.

These are the steps to take:

  • In your CSV data source, set these two parameters: Generate Schema Files and Schema Location.
  • Running the Synchronise task will generate an RSD file.
  • Modify the generated RSD files to clean the data on reading, for the columns you specify:
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.cdata.com/ns/rsbscript/2" xmlns:other="http://apiscript.com/ns?v1">

  <!-- Column definitions as generated by the Synchronise task -->
  <api:info title="dirtyfacts.csv" other:catalog="CData" other:schema="CSV" description="" other:version="20">
    <attr name="cleanint" xs:type="int" isnullable="true" other:internalname="cleanint" />
    <attr name="cleanstring" xs:type="string" isnullable="true" other:internalname="cleanstring" />
    <attr name="dirtyint" xs:type="int" isnullable="true" other:internalname="dirtyint" />
  </api:info>

  <api:set attr="uri" value="[_connection.uri]" />

  <api:script method="GET">
    <api:call op="csvproviderGet">

      <!-- Replace non-numeric dirtyint values with the default -999 while reading -->
      <api:set attr="_out1.dirtyint" value="[dirtyint | isnumeric([dirtyint], -999)]" />

      <api:push />

    </api:call>
  </api:script>

</api:script>

This will give you the cleaned columns, with invalid values already replaced by a default integer value.
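As a hypothetical illustration (the before/after screenshots from the original reply are not reproduced here), a source file such as:

cleanint,cleanstring,dirtyint
1,alpha,10
2,beta,oops
3,gamma,30

would be read with the non-numeric dirtyint value replaced by the default:

cleanint  cleanstring  dirtyint
1         alpha        10
2         beta         -999
3         gamma        30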

I hope this helps.



