
Hey all, 

 

We use incremental load in the ODX on a datetime field, with an offset of 3 days (also configured in the ODX). We've noticed that when we load a table from the ODX to the data warehouse (even with a full load), the _R table can contain duplicate records.

In our Data Lake each increment is a parquet file containing the last 3 days of data. Since we load every day, consecutive files can overlap by 2 days. I believe this overlap is the cause of the duplicates in our _R table.
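The overlap can be sketched with a toy example (hypothetical dates and keys, not our actual data): each daily load extracts a 3-day window, so two consecutive batch files share 2 days of rows.

```python
from datetime import date, timedelta

# Hypothetical source rows, one per day, keyed by their modified date.
rows = [{"pk": d.isoformat(), "modified": d}
        for d in (date(2023, 1, 1) + timedelta(n) for n in range(6))]

def daily_batch(load_day, window_days=3):
    """Simulate one ODX increment: rows from the last `window_days` days."""
    cutoff = load_day - timedelta(days=window_days)
    return [r for r in rows if cutoff < r["modified"] <= load_day]

batch_day1 = daily_batch(date(2023, 1, 4))  # covers Jan 2-4
batch_day2 = daily_batch(date(2023, 1, 5))  # covers Jan 3-5

# Two days' worth of PKs land in both parquet batch files.
overlap = {r["pk"] for r in batch_day1} & {r["pk"] for r in batch_day2}
print(sorted(overlap))
```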
 

What happens next is that the error report gets spammed with “Violation of primary key constraint” (which is accurate). Our Valid table does, however, contain the expected result.

A full load transfer task solves the issue, but of course we do not want to do that all the time. 

 

 

Is this the intended design? Is there a page that explains how the ODX handles these parquet files before inserting them into the Raw table? 

Any suggestions or insights on how to handle this issue would be greatly appreciated.

Dear @sierd.zuiderveld ,

Which version are you using?

I've had this happen to me once, and it came down to a setting I had forgotten to turn on in the incremental load.
In the incremental selection rule you can set up the Updates and Deletes handling.



You need to select Updates (always, otherwise you get duplicates, unless the tables you are loading are 100% transactional), Deletes, or both. Mind you that selecting the check-for-deletes option might kill your performance.

Hope this helps
= Daniel


Hey @daniel

Thanks, we do load with deletes, but not with updates. Do you think enabling updates would solve our issue?


Hey @sierd.zuiderveld ,
I think so, yes. Since the record (primary key) is not deleted but changed, the incremental load adds it to the set again, and TX will not replace the existing record, so you end up with two records with the same PK.


Hi @sierd.zuiderveld 

Yes, you need to use updates on your incremental rules.

That is why you get the primary key violation errors.

The DW instance can’t do updates on incrementally loaded tables if the option isn’t set. It will expect all rows to be unique and won't check whether a row already exists in the valid table.

I did attempt to explain this in our incremental load guide.

Perhaps the link between the ODX and the DWH isn’t obvious.

Due to how incremental load works when connecting to an ODX instance, a DW incremental table only behaves as it does because of the data source's incremental rules.


Hi @sierd.zuiderveld, has this issue been resolved? If so, please help us by marking a best answer above. If you have any follow-up questions, please let us know.


@Thomas Lind: This is still an issue in our environment. Could you shed some light on how ‘Incremental load with primary key updates’ works? Maybe we can then figure out why we get the primary key violation errors in the transfer from ODX to MDW.


Hi @JogchumSiR 
It starts with a request.

SELECT max(incrementaldatefield) FROM table WHERE incrementaldatefield > (last max value)

Then if there is a value higher than this it will run

SELECT * FROM table WHERE incrementaldatefield > (last max value)

So all the rows that have a higher date than this will be added.

There is no check for updates on the transfer to the data lake parquet file. The rows with a higher value are simply added in a new batch file. If you use a program to read the values in the files, you will likely find rows with the same primary keys.

When you then transfer from the ODX to the DW, it will attempt to add the new rows to the existing ones where a row has the same PK values and a higher incremental value.

If the source table is transactional, meaning it adds a new row for each change, and there are multiple changes in one transfer, then the additional ones may be sent to the errors view.
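To illustrate the difference the option makes, here is a simplified Python model (my own sketch, not TimeXtender's actual implementation) of merging overlapping batch files into a valid table with and without update handling:

```python
# Two hypothetical batch files; pk 2 appears in both because the
# incremental windows overlap (its second version has a higher value).
batches = [
    [{"pk": 1, "value": "a", "ts": 1}, {"pk": 2, "value": "b", "ts": 1}],
    [{"pk": 2, "value": "b2", "ts": 2}, {"pk": 3, "value": "c", "ts": 2}],
]

def transfer(batches, handle_updates):
    valid, errors = {}, []
    for batch in batches:
        for row in batch:
            if row["pk"] in valid:
                if handle_updates:
                    valid[row["pk"]] = row  # updates on: replace the existing row
                else:
                    errors.append(row)      # updates off: PK constraint violation
            else:
                valid[row["pk"]] = row
    return valid, errors

_, errs_off = transfer(batches, handle_updates=False)
valid_on, errs_on = transfer(batches, handle_updates=True)
print(len(errs_off), len(errs_on))
```

With updates off, the duplicate PK lands in the errors view; with updates on, the newer row replaces the older one and no error is raised.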

 


@Thomas Lind: Thanks for your explanation. We profiled it ourselves and did not see any extra queries compared to only having ‘Incremental load with primary key deletes’ enabled, and your explanation confirms this.

For updates, we expected some kind of extra file or field with the latest update timestamp in the source so it could compare with the other data, but we don't see that. In that regard it is a bit of a black box for us.

From our information, having the updates setting turned on would bring a big performance drawback in the ODX loading process. We will go ahead and test it, checking whether it resolves the primary key violation errors without causing a big performance dip compared to having it off.

