Skip to main content

Best practice when using Supernatural keys


Hi,

I have seen clients use Supernatural key fields to create keys between the fact and dimension tables, usually when the logical key is a combination of fields. I have seen these two methods used:

  • The Supernatural field key is created in both the fact and dim table, using the same key store
  • The Supernatural key is created in the dim table, then mapped to the fact table with a conditional lookup

Is there a difference in performance here? Is there a best practice to advocate for? 

Dear @pontus.berglund ,

 

My go to feature to use was the supernatural key. Loved the fuctionality. Really easy to hook up your facts and dims with the corresponding keys. 
 

But as the data of the client grew, I got formance issues with the supernatural keys. This due to the fact that when the data is loaded (incrementally) still all the keys need to be checked for changes in the key store. When I deleted all the supernatural keys and replaced them with lookups from the dims, the performance increase was massive. A table which used to take 30 mins (supernatural keys) to load now loaded in 3 mins (Incremental).

So now I create my keys in my DSA / Transformation layer by looking up the DW_Id from the dim table to the fact tables. This worka like a charm  the downside is that this method does not keep the changed or deleted rows up to date in the fact table so you should either run full loads once in a while or turn on the ‘keep lookups up to date’ feature. This will decrease the performance but it will be faster then the supernatural keys (unfortunately) 

 

a third option might be the custom hash key  love this functionality as well  but it is to bad that Power BI / SSAS does not support varbinary data types. if you’re using Qlik this will work though! 
 

Hope this helps

 

= Daniel

 

 

 

 

 


Hi @daniel if we are using supernatural keys and we made full load at least once a week, all supernatural keys are rebuilt, right?


Dear @rvgfox ,

 

I would believe so. In a full load all tables are truncated and so would a key store be handled. Because the keys in the key store are hashed and then auto incremented (thats the setting I use) it might be that the keys get the same number because of the order the data is loaded. I’ve also seen that the keys changed (number wise) from the previouse (incremental) load because of this. So after this full load, all the dims attached to this fact also need to full load. To make it even more complicated: I usually set up the dims as ‘masters’, they provide the keys to the key store (this is a best practise fron Kimball and master data management) and the facts may only read from the stores. This might give issues with early arriving facts so I set up a sql snippet which checks the supernatural keys and puts in a ‘-1’ when empty. If the fact is also allowed to put keys into the key store this row will get a number from the key store but is has not corresponding dimension in the dim table. This may hurt the data quality of your model. So after setting up the ‘-1’ rule, I go to the dimension and set up a custom data row with the -1 as the key and in the names or categories or whatever columns might be in the dimension and enter an ‘— Empty’ there. Now my referential integrity of my model will be 100%. The ‘— Empty’ starts with two - so this entry will pop up to the top in the visualisation or filterpane and hopefully the business will take this as an call to action

 

Hope this helps

 

= Daniel
 

 

 


The reason the Supernatural key feature was added was to help resolve issues when using foreign keys to do relations between fact an dimension tables.

So what you would do before was to add a lookup field containing the DW_Id field from a source table say a customer table and then you would add the DW_Id field as a key in the Customer table as well and could relate the two fields.

The issues with this is that the ids would keep being unaligned due to incremental load and similar. A DW_Id field in  the fact table did not point to the correct DW_Id value in the dimension.

So with supernatural keys you resolve this issue as it will always generate the same id when the key fields are the same no matter what table is being used.

@daniel have you tried to set the Only read from the store option to this on the fact tables and kept it turned off for the dimension?

 

I wonder if that performs better.


Hi @Thomas Lind can you explain in more detail the differences between set on and set off the “Only read for store option”?


Hi @Thomas Lind 

I have, but for big fact tables the supernatural keys become a serious impact on the performance. The reload time become too big. 
I believe I have posted this before here somewhere with some screen shots and such. 
 

Again I really like the supernatural key functionality but in high performance enviroments they’re not delivering. So for now I stared using the DW_Id’s again and the checking the keep lookups fields up to date. 
 

= Daniel

 

 

 

 


@rvgfox 

If the option is set the table will not add rows to the key store, it will only pull from it.

So if you have turned it off on a fact table and a value that do not exist in the key store appears it will become null.

@daniel 

So if you are not using keep field values up to date it still updates all the supernatural keys in the existing valid table, not just the new arrivals?


Hi,

I have seen clients use Supernatural key fields to create keys between the fact and dimension tables, usually when the logical key is a combination of fields. I have seen these two methods used:

  • The Supernatural field key is created in both the fact and dim table, using the same key store
  • The Supernatural key is created in the dim table, then mapped to the fact table with a conditional lookup

Is there a difference in performance here? Is there a best practice to advocate for? 

Yes, there is a performance difference. Maybe in smaller fact tables you won’t notice so much difference, but with huge fact tables with lots of supernatural keys your performance can decrease significantly. A (conditional) lookup has a better performance.

From an aesthetical point of view I prefer the first method, because in the GUI you immediately see which fields are supernatural keys, but in practice I mostly use the latter. In terms of performance it’s equal to the DW_Id lookup method. Difference is that DW_Id changes with every reload, and the supernatural key does not.

Key stores are never truncated and rebuilt, unless you manually reset them in the TX GUI. Equal input results in the same supernatural key. So you can apply this method on incrementally loaded fact tables, also when your dimensions get a daily flush-and-fill.


If you prevent deletions in your dimension tables by making them Type1 historic, you remove one of the desync scenarios in DW_Id vs. Supernatural Key. Supernatural keys also avoid circular references by their nature as they do not have explicit lookups.

I always use ‘Only read from store’ at the Fact side if I use Supernatural keys, early-arriving-facts then get handled by a dummy record on the dimension side and fallback transformation on the SK. If you want to explicitly add the early Business Keys to the dimension I would use the Related Table Insert method over allowing the Fact to write to the Keystore.

You should also check the size of your keystore, it should not be much larger than the dimension. If it is, there are probably junk records in there slowing down every lookup.

I suspect one of the performance differences stems from the fact that the DW_Key field is a timestamp, but is used as an int which means the the implicit conversion messes with SQL Server's query statistics.


I see that the KeyStore tables are never emptied. 
Is it a good practice to empty them on a full load?

How can that be done?


I would never empty a keystore unless if it has incorrect entries. As the hashes of Business Key fields are deterministic, there shouldn't be a reason to. The actual key value is based on load order, but the value does not matter as long as it is in sync across all places relying on the key.

If you do clear a keystore, you need to full reload all tables using the keystore to be sure the correct DW_Key values are loaded into those tables. Otherwise you may get the same issues as with a DW_Id SK in incremental tables.


@rory.smith I agree with you, but if we are thinking in a development environment and a full load (all tables are emptied and filled again) maybe it makes sense. or not?


I don't think it will have much of an advantage: from the Dim table you are essentially checking whether the key exists in the store and adding it if not, from the Fact you are always pulling from the store (as long as you have ‘Only read from store’).

If the sort order that your source delivers data is always the same, the keystore post-wipe will be identical to the keystore pre-wipe.

I can only think of very obscure situations where you have multiple sources feeding a consolidated dimension and you change the priorities of the source systems and then want the keystore records’ DW_Sourcetable to match the new source. I.e. SourceA and SourceB both have a record for DimElement x and the business is migrating from SourceA to SourceB. Once SourceA is decommissioned, you might want your keystore to only point to SourceB for records that are live there.


Regarding the use of the Supernatural key in the fact table, I think it is better to use a lookup to the dimension with the business key instead of redefining the Supernatural key with the option "Only read from store", isn't it?


I think that is a matter of taste somewhat: if your fact table is too large for a Supernatural key in the Fact without a lookup from the Dimension I would use a DW_Id and make all dimensions coupled to that Fact Type 1 historic.

You can also lookup the Supernatural Key to the Fact but that is mixing two styles of approach to my feeling.


I think that the use of relationships to get the surrogate key from the dimension (obtained with the supernatural key) it’s more clear and will be useful when we tansfer the fact table to the semantic model because there the relationship will be infered and always it will use the surrogate keys….


Well that depends on your layering approach: I would set up SK in a DSA layer and push them to MDW where they form the basis of the relation between Dim and Fact. That relation then gets picked up in Semantic Models.

If you are reading from the Key Store in a Fact table in DSA, you will usually set the relation on those fields in MDW.


We are actually facing this problem as well. Old setup used SNK's in both DIM and FACT table. A medium large fact table (44M records, 14 DIM SNK's and 1 own SNK) took over 20 hours for a full load (on a 16 core hyperscale).

Changing the dim SNK's in the fact table to lookups saves around 80-90% of the load time. I have had our Azure SQL expert take a look at this and as it turns out, the SNK's are resolved one-by-one and also one row at a time, which is a very time consuming process. Also, the key stores store the key itself in a field of datatype ‘timestamp’, which is casted to a bigint during the lookup process. This all does not look very efficient in terms of SQL code:

(taken from data cleansing proc)

But, we would like to keep the benefits of the SNK's themselve, primarily because they return the same ID for the same business key so we are able to truncate/reload dimensions without having to reload the facts the dim is used in, and second because we can make it database unique so we can prevent the so called power users to join the wrong dimension to an ID in a fact table (and yes that happens 🙂 ).

So, our approach now is to use an SNK in the DIM tables (and also in FACT tables that need a surrogate key for themselves because they are linked to other fact tables usink link tables), and to do a lookup to that SNK from the fact table using the business key:

 

This way we can use the benefits of the SNK's and the performance of the lookups in the fact table.


@wouter.goslinga That’s exactly what I’m doing


Hi @wouter.goslinga ,

That is a very nice sollution. Although I also kind like he History Type1 solution.
I had a chat with an other developer and he creates a custom hash field and converts that to varchar like this: CONVERT(VARCHAR, Field1,(2)).
I really like the custom hash fields, but Power BI does not like the varbinary data type. 

Thanks for the inspiration.

= Daniel


The reason why supernatural keys are slow is because the way the updates of the fact tables have been implemented. If you look at the data cleansing stored procedure for a table you will notice that all supernatural keys are updated separately. When you do lookups of the DW_Id from dimensions tables they are all grouped together into a single update statement.

I have raised a feature request to group all supernatural keys on a table into a single update statement. My testing on 14m records and 12 supernatural keys:

Separate update statements: 06:11:41

One update statement: 02:11:13

Improvement: 65%

@Thomas Lind, any updates on grouping supernatural keys into a single update statement? 


@LarsBoyen 

Not really, it is still on the idea stage. It would of course be a benefit if it was added.


Reply