Solved

SCD does not update correctly

  • 9 November 2023
  • 5 replies
  • 95 views

Hi!

I beg your pardon for this somewhat lengthy post, but I try to be thorough to avoid confusion. :)

I have a problem with updating an SCD.

We are moving an old adress history table from another system into a SCD in TX. It has gone well so far since the old system made one insert per person per day regardless of whether there was an update or not, which has made us go from about 108M rows to about 275k rows with actual changes. So far so good. :)


The problem now is that have a hard time to get this moved table to update correctly from a new source. The setup looks like this at the moment:
 

The MDW table KIR_Persons is the correct and prepared historical data from the old system where I have fixed the From- and To-dates along with the “Is Current” flag. The “break date” for the data from the old system is 2023-10-16 which is when I made the final export.

The new source for the address data is the DSA table KI0VAA0POPULATIONPERSON_QT which is the result of a query table. I’m using the DSA view Kir_Persons_Stage_v  as a source and for data preparation for the target MDW table KIR_Persons_stage for testing this out. The settings for the SCD looks like this:
 


 

Should this actually be Automatic?
The TipTool says that the “Truncate valid table..” is ignored if using this. 


In the attached file Initial_SCD_Data (had to do it like that because I don’t know how you do this fancy “click on the picture to zoom in” thing :)) you see the important columns, along with the metadata, that the stage table was filled with before the new load.
This data was copied from the prepared KIR_Persons table (that’s why the screenshot is from that table but it’s the same query that I used filling the stage table) and in the attachment you can also see the comments about different cases that I wanted to investigate.

Then I executed the load to the stage table, with the same id:s that I had preloaded, with the keys and dates like this (obviously not all columns are present in this screenshot, but the PersonID and the UpdateDate are the most important ones here):
 

Some had an UpdateDate before 2023-10-16 and some after - and as you can see some had two update dates after 2023-10-16. Now, the UpdateDate is not in the Type2 so that should not matter for what gets updated or not. It’s just shown to illustrate what changes that should occur.
What I would presume is that rows with a date before 2023-10-16 would typically not result in an update while rows with a date after 2023-10-16 should be a new active row. If there are more than one row with updates for an ID you would get a very short timespan between the From- and To-dates but it would still show.


So, I had 15 rows in the KIR_Persons_stage table - loaded 16 rows with the expectation that some of the old rows would just be ignored, and some would end up as new active rows on the same key with correct From- and To-dates. This was not the case as can be seen in the attachment SCD_data_after_insert.

I ended up with 31 rows in the table and it seems like in most cases the SCD functionality just ignored the hash of the Type2 columns and just inserted a new row anyway. I can say that the ID starting with 0x00077B… got updated even though it had the same adress data but that was because of other changes in the columns.

For the ID starting with 0x05AD.. that had two updates after the break date both the new rows got inserted as active rows while the first insert did actually set the already existing row as not active, so half way there.

Is there something I’m missing here in the setup of the SCD?
When does the comparison of the existing checksum in the valid table and the checksum of the incoming data actually take place?
I noticed that the latest imported data is still in the RAW-table but that get truncated at each load, right? Or do I have to “preload” the raw-table with current data the first time for the check to go right?

And one other thing that I came to think of while writing this - how do the SCD treat NULLs compared to empty string?

 

I hope all this makes sense. :)

 

icon

Best answer by rory.smith 10 November 2023, 08:43

View original

5 replies

Userlevel 6
Badge +5

Hi @thzman 

OK, I took a quick look at it. My best guess would be that it is because the date field is a primary key field and not just a normal Type 0 field.

OK. Thank you for the quick reply.
Hmmm… Well, I set the PK like that out of table integrity and to ensure that there would not be more than one update on one day. Which…….. when I come to think about it, kind of don’t matter since we should only get one Current flag anyway. 

I’ll do a test to remove it to see if that changes things.

Update:
I removed the UpdateDate from the PK and it seems like that gives a more consistent behaviour regarding the updates. What didn’t change though were when there was more than one update in a batch. Then only the first (oldest) row got inserted while the newer one didn’t.
Example:

Original data
Incoming data from the view with obvious changes
Data from RAW-table (just for reference)
End result

So I guess that the question of when TX evaluates the hash key to determine when to insert or not still stands. The reason for this is that even if we set up a daily job we are not in control of the information from the supplier. If there is a problem on their side and they send us a “catch up” file with multiple updates per PersonId there is a risk that we don’t catch the latest data without human intervention, which we obviously are trying to avoid.

Userlevel 6
Badge +7

Hi @thzman ,

 

if you are running TimeXtender's history you can only have one record per natural key (and PK) per execution coming into the _R table. You can check the Error Reports where you should find the 2nd, 3rd, etc. listed as primary key violations.

I ran into this a long time ago as well. I have been discussing sorting options with @alex.van.wijnen  which could help make this less onerous. If you want to recreate history, you will have to sort the data on age and cut it into batches to avoid PK clashes.

@rory.smith 
Ok. Thank you for the information. :)
Yes, the splitting of data into batches was what i was doing to get the data into the history table in the first place but that was supposed to be a onetime thing. Now, hopefully this problem should not occur very often since like how many times in a week does a person change addresses.
Technically though a person could, for example, move to a new address and then like get a divorce or something and that would render two changes in one batch.
I guess I have to set up some kind of watchdog to make sure that we catch these kinds of events.

Reply