
Hi all,

 

I'm looking at choosing the right keys for our database model. I've been reading some topics on the forum, but haven't found a shared opinion on best practice.

  1. Super natural keys
    1. Advantage:
      Easy to use.
       
    2. Disadvantage:
      Huge performance impact. According to some topics, reload times can increase from 3 to 30 minutes.
      There are dependencies between the tables that force them to reload in the correct sequence. TX will calculate this sequence; however, there is an increased chance of circular references.
  2. Hashed Keys
    1. Advantage:
      Easy to use. 
      Should be faster than SNK, since it doesn't need a lookup to the keystore.
    2. Disadvantage:
      There is a small (but non-zero) chance of duplicate keys from hash collisions. Source: https://stackoverflow.com/questions/297960/hash-collision-what-are-the-chances
  3. Look up DW_Id from different tables
    1. Advantage:
      Able to make all joins on last layer.
      Easy to use. 
      BIGINT in keys is better than VARCHAR in keys.
    2. Disadvantage:
      It costs you simple mode when transferring data from data area X to Y.
  4. Concatenated values (joining multiple columns together without hashing)
    1. Advantage: …?
    2. Disadvantage:
      VARCHAR in key instead of INT.
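Regarding the collision risk under option 2: the classic birthday bound gives a quick estimate of how unlikely a duplicate hashed key is. A minimal Python sketch (the key counts and hash sizes below are illustrative examples, not TX defaults):

```python
import math

def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ≈ 1 - exp(-n² / 2^(b+1))."""
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(n_keys ** 2) / (2.0 ** (hash_bits + 1)))

# One billion keys hashed to a 128-bit value (MD5-sized output):
p = collision_probability(10**9, 128)
print(f"collision probability ≈ {p:.3e}")  # vanishingly small

# The same billion keys on a 32-bit hash would be hopeless:
print(f"32-bit, 100k keys ≈ {collision_probability(10**5, 32):.2f}")
```

So for realistically sized warehouses and a 128-bit hash, collisions are a theoretical rather than practical concern, which matches the StackOverflow discussion linked above.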

 

My personal preference is not to use the super natural keys. It has (in my opinion) too big an impact on the performance of TX reloads, so I would choose one of the other three options. However, I'm not sure which one is best. Hashed keys seem doable, but I'm not sure of the impact of this method.

What do you think?

Hi,

As with everything, there is no “right” way of doing things, as there are a lot of considerations. If you must choose only one, I would go for DW_Id, as options 1) and 2) will not work with type 2 history or deletion tracking by themselves. There is one that is wrong/bad, and that is 4).

Supernatural keys are “easy” and deterministic; the latter attribute is sometimes a requirement. DW_Ids are ephemeral, and that is sometimes a requirement too. I am pretty sure I have some longer descriptions in the community here, so maybe ask XPilot.


Dear @Maarsen ,

I’d like to add that for point 2 there is another downside (at the moment): not all visualisation tools are compatible with varbinary data types. So you cannot use those keys in AAS / SSAS / Power BI.

I’m using a combination of supernatural keys and lookups. @erik.van.mastrigt opts for a solution where you create the SNK in the dimension and then do a lookup on the natural key to the fact. You still have the SNK functionality, but now with better performance.

I would not use the DW_Id. Especially when loading incrementally, this will cause issues when the dimension changes, gets updates, or gets a full load. Then the old DW_Ids don’t hook up to the correct DW_Ids in the dimension, because the fact doesn’t update its keys.

Hope this helps.

= Daniel


My current approach to surrogate keys in TX is combining supernatural keys and lookups:
use key stores to create the supernatural key in the dimension tables, and do a lookup in the fact tables to fetch the supernatural key from the dimension table. Then it comes down to where you want to do the work: in the DSA or the DW layer. My preference is the DSA layer, to keep the DW layer clean, but there can be benefits to doing it in the DW layer. E.g. when multiple tables are mapped to a single fact table in the DW layer, you only have to do one lookup instead of adding lookups in all of the source tables in the DSA.
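The keystore idea above can be sketched with a toy example (plain Python, purely illustrative: TX's actual key stores are database tables, and the column names here are made up):

```python
# Toy keystore: natural key -> integer supernatural key (SNK).
keystore: dict[str, int] = {}

def snk(natural_key: str) -> int:
    """Return the SNK for a natural key, assigning the next integer if unseen."""
    if natural_key not in keystore:
        keystore[natural_key] = len(keystore) + 1
    return keystore[natural_key]

# The dimension creates the SNK from its natural key...
dim = [{"CustomerNo": c, "SK_Customer": snk(c)} for c in ["A-100", "B-200"]]

# ...and the fact looks it up on the same natural key, so both sides
# carry a compact integer key instead of the (wide) natural key.
fact = [{"CustomerNo": c, "SK_Customer": snk(c)} for c in ["B-200", "A-100", "A-100"]]
```

The point of the pattern is that the assignment happens once, in the dimension, and every fact row just resolves its natural key to the same integer.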


Hi all,

“I would not use the DW_Id. Especially when loading incrementally, this will cause issues when the dimension changes, gets updates, or gets a full load. Then the old DW_Ids don’t hook up to the correct DW_Ids in the dimension, because the fact doesn’t update its keys.” ← this only occurs if dimensions linked to incremental fact tables are not protected against deletion by making them history tables. SNKs incur a hashing overhead, and are fragile with respect to column ordering in the key definition. Which of the options is more performant is pretty much “it depends”; which of the many options is “better” is a matter of taste, or of considering other aspects of the platform as a whole.

In my personal opinion, looking up the DW_Id based on the supernatural key to avoid defending against fact-table desynchronisation is only a user-side difference. The history table uses hashes internally anyway...


Hey all,

I know that some make all the dimensions history tables. In that case you should be able to use the DW_Id, like @rory.smith just mentioned. My issue with this is that everyone in the team needs to know this, and potentially have a deep understanding of what is going on and why. The combination of SNK and a lookup is my style; it is easy to implement and to explain to others who are continuing my work.

I’m not looking up the SNK on the DW_Id, btw. This is a lookup on the natural key, from the dimension where the SNK is created, to the fact table which needs this SNK.


One reason I didn’t like using lookups on the DW_Id is the aesthetics of the naming convention. The dimension tables need to be historical, and you would want to do that in the DW layer. Then you are stuck with the SK field being named DW_Id in the dimension tables, which I don’t like.


Hi,

  • You can add a custom field, name it SK_<dimension>, and copy the DW_Id value into the dimension table. DW_Ids are by definition the surrogate key (and physical primary key) in all TX tables.
  • In all of the trainings I have given, SNK is the topic most people struggle with. Given that there are situations where DW_Id is required (type 2 history, or incrementally loaded tables with deletion handling that are looked up from) and not that many common ones where SNKs are mandatory, I tend to avoid SNKs. The fact that the default hashing is slow, and that you often run into varchar(8) hashed against varchar(20) situations (I am looking at you, SAP), makes for more complexity.

By all means, do whatever is easy for you and your team to understand and maintain however.
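The varchar(8)-vs-varchar(20) pitfall is easy to reproduce: hashing the padded and unpadded forms of the same business key yields different hashes, so the hashed keys no longer join. A small illustration (SHA-1 chosen arbitrarily here; TX's actual hash function may differ):

```python
import hashlib

def hashed_key(value: str) -> str:
    """Hash a business key the naive way: whatever string arrives, as-is."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()

# The same business key, stored as varchar(8) in one source and
# zero-padded to varchar(20) in another (SAP-style leading zeros):
short = hashed_key("12345678")
padded = hashed_key("12345678".zfill(20))

print(short == padded)  # the hashes differ, so the join silently loses rows
```

The usual fix is to normalise (trim/pad/cast) both sides to an agreed canonical form *before* hashing, which is exactly the kind of hidden complexity being described above.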


I see your point @rory.smith .
In terms of adding a custom field in TX and copying the DW_Id: that will not work, since it will use the DW_Id value from the raw table. We would need an option to create a calculated column inside TX.

It would be good if there was an option to overwrite what field TX uses as the physical PK for the table.

And you are right, we do need to go another way when dealing with Type 2 history and the easiest option for that is to make use of the DW_Id.


Hi @andri ,

 

good point: that would make for an interesting situation in your MDW layer :-)


Interesting views and opinions!

 

We have a fact table that contains a lot of values that are not in the dimension table. So the ‘optimized’ SNK is not a possible solution in this case. 

 

For now we will use the hashed-key method. We can export it to Qlik, so the varbinary limitation isn't a disadvantage for us.


Hi @Maarsen 

I have set your last response as the best answer, I hope that is OK. Then others will be able to see your solution as well as this discussion.

I personally do not have a lot of knowledge about the speed issues, I would always use SNK whenever I needed it. I guess I have limited data to make it slow.

I would agree with the others about the DW_Id method; I believe the supernatural key was developed as an answer to this.

