Skip to main content

Dear Community,

I like to build my data estates with supernatural keys but in lager datasets the data cleansing starts to take very, very long. Do you happen to have the same issues? Is there a way to make the supernatural keys load faster? Even with incremental loading it begins to be super slow:
I've ran a test on 435,397 records. This is on a Azure SQL with 10 vCores
1. is a full load on the table with 7 supernatural keys.
2. is a full load on the same table without the supernatural keys.


1 has data cleansing of 1 second. 2 has a data cleansing of 104 seconds!

Second I've done a test on the same tables but now incremental loads:

1. incremental load with 7 supernatural keys
2. incremental load without supernatural keys


1 has a data cleansing time 1,6 seconds and 2 a data cleansing of a whopping 129 seconds!

I'm not so sure I want to keep using the supernatural keys. What do you guys do?

 

Take care

= Daniel

Hi Daniel

I did some research.

I have two tables with the same source data. One has a bunch of supernatural keys and the other has none. Besides that both have the same lookup fields and transformations. Another difference is that one uses Incremental load with hard deletes and keep lookup values up to date.

These 3 images are the times, raw and valid rows for three executions. First is an inital load, second is a execution where no new rows were added and the final one is an incremental load where new rows was added.

With Super keys:

Without supernatural keys.

As you can see there is a difference for about 10 seconds between them.

So to see if I can increase or decrease the times I made a switch, so the one with supernatural keys is not using Hard deletes and keep lookup values up to date and the other is using this.

Then this are the times.

With super keys no hard deletes and no keep lookup values up to date

With hard deletes, keep lookup values up to date and no supernatural keys.

Now the speed is increased on the one without supernatural keys but remaining almost the same for the one without.

So my point with this is that it may not just be the supernatural key generation part that takes longer.

You need to consider what you want to use it for. It was added so you would not have to make foreign keys from the source tables added to the fact table. So adding the DW_Id from the Customer table as a foreign key instead of a supernatural key. Then whenever there was a change in either table you could have to run a full load to make the data align correctly.


As Thomas points out, the table settings are important: deletion handling requires pulling the full PK set from source and comparing, which may be very expensive for incrementals with a long history. Keep-lookup-values up-to-date is also very expensive and has a pretty narrow use-case.

Supernatural key stores are also things to check: if your keys can be NULL you get key store bloat because NULL == NULL is false, this means any null keys just get added to your store without doing anything useful other than slowing things down.

For some reason unknown to me the DW_Key field in the keystore is a timestamp which is casted to int everywhere in the code, potentially causing bad statistics to influence query plans.

I tend to use supernatural keys in cases where I want to be able to couple records across separate source systems (the key is deterministic), or where the business key is broader than the overhead of hashing. For generic surrogate keys where history (type 2 or new records upon reintroduced deleted records) or source deletion may occur I tend to prefer the DW_Id + incremental fact + type 1 dimension route.

 

 


Reply