Let's assume I am importing from a CSV file. The data is very dirty and somehow I have strings in my int field "ID", so the field is interpreted as NVARCHAR(MAX). If I change the data type of the field (right-click > Edit Field) from string back to integer, I receive a runtime error during execution. What I was expecting to see was corresponding entries in the _M and _L tables. Is this not part of data cleansing after all?
We have this problem all over the place and have not found a better solution than building custom views with TRY_CAST etc., where we lose lineage and so on.
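For context, such a workaround view might look roughly like this (table and column names are made up for the example):

```sql
-- Illustrative only: [stg].[Orders] and its columns are placeholder names.
-- Valid rows get a proper INT; rows where ID is not numeric are filtered out
-- and would have to be surfaced separately for any error logging.
CREATE VIEW [stg].[Orders_Clean] AS
SELECT
    TRY_CAST([ID] AS INT) AS [ID],  -- NULL when the value cannot be converted
    [CustomerName],
    [OrderDate]
FROM [stg].[Orders]
WHERE TRY_CAST([ID] AS INT) IS NOT NULL;
```

Doing something like this by hand for every affected column is exactly what does not scale, and the view breaks the lineage back to the original field.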
What is the best way to clean up such data (keep valid rows, write errors/warnings on invalid rows)? We are talking about 100+ tables, so I'm looking for a highly scalable solution here, not a one-time workaround.
Thanks and BR, Tobias
Hi Tobias,
Thank you for your question!
A quick solution I could come up with is to create custom data selection rules that separate valid records from invalid ones.
Creating log messages for the invalid rows takes some additional work - so that may or may not be feasible for large numbers of tables.
This will allow you to separate out the dirty records from the clean ones.
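As an illustration (assuming the selection rule is expressed as a plain SQL condition, which is how I have sketched it here), the rule for the table that should keep only the valid records could be:

```sql
-- Keep only rows whose ID actually converts to an integer
-- (TRY_CAST requires SQL Server 2012 or later).
TRY_CAST([ID] AS INT) IS NOT NULL
```

The opposite condition, TRY_CAST([ID] AS INT) IS NULL, on a second copy of the table would then collect the invalid rows, which you could use as the basis for your error/warning output.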
I will propose to the product team that TRY_CAST be offered as an option and that log messages for dirty records be produced in a more "natural/integrated" way.
Hi Tobias,
My colleague Thomas suggested that you can also use RSD files to clean up "dirty" columns.
I think this is a great solution, even for larger numbers of columns/tables.
These are the steps to take:
In your CSV data source, set these two parameters: Generate Schema Files and Schema Location.
Running the Synchronise task will then generate an RSD file.
Then you modify the generated RSD files to clean the data on reading for the columns you specify; a rough sketch of such a file is shown below.
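To give you an idea of what you would be editing, here is a skeleton of what a generated RSD file can look like. Treat it purely as an illustration: the exact elements, attributes, and the read operation depend on the CSV provider and its version, and the file generated for your table will differ. The point is that the column definitions sit in the api:info block, and that is where you adjust how a column such as ID is typed and handled when the file is read:

```xml
<!-- Illustrative skeleton only; the generated file for your table will differ
     depending on the CSV provider and version. -->
<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <api:info title="Orders" desc="Generated schema file.">
    <!-- Column definitions: this is where the ID column's type and handling
         are adjusted (shown here typed as integer instead of string). -->
    <attr name="ID"           xs:type="integer" />
    <attr name="CustomerName" xs:type="string" />
  </api:info>
  <api:script method="GET">
    <!-- The read operation; keep whatever the generated file contains here. -->
    <api:call op="csvproviderGet">
      <api:push/>
    </api:call>
  </api:script>
</api:script>
```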