Currently, junk dimension tables are automatically keyed with varbinary 64 fields. This takes up a lot of room in a fact table and can’t be used as a join in Tabular. Can junk dimensions share the functionality of the supernatural key stores by storing the hash locally, and using an integer as a key instead?
6 comments

Bas Hopstaken Totally agree!

Paw Jensen Hi Doug,
The key data type is defined by the Hashing algorithm:

CalmCo Support We've been wrestling with this as well.
Our workaround is to cast the hash as bigint, which is in fact a plain old truncate of 64 bytes to 8 bytes, and use that for joins and key fields.
It's not optimal but given the chance of collisions is in the billions and  as much as our customers are growing  aren't running into tables with billions of entries, it works (and is very fast when used in multidimensional, tabular and qlik as key :)).
The best way to generate these keys would be to have TimeXtender include a CRC64 function, or a noncryptographic hash implementation which can be used to generate these bigints natively.
Example for Postgres SQL: https://github.com/theory/pghash64
Wikipedia has a good list: https://en.wikipedia.org/wiki/List_of_hash_functions
fasthash is used by many: https://github.com/ZilongTan/fasthash
There are opensource CRC64 implementations in Csharp: https://github.com/damieng/DamienGKit/blob/master/CSharp/DamienG.Library/Security/Cryptography/Crc64.cs
So if I might make a suggestion: implement this natively in TX DH https://github.com/ZilongTan/fasthash

Radek Buczkowski It is an interesting topic. Like Paw showed, the truncation algorithm is already implemented. In the hashing algorithm drop down list (see Paw's picture), just choose "Legacy Integer". It will do exactly what Wim mentioned: cast the hash (20 bytes) as bigint (8 bytes/64bits) integer. As much as I understand about SHA1, its values are evenly distributed. The existing solution is therefore as good as the best algorithms calculating a 64bit hash.
Unfortunately the cast is over "Legacy Binary", which does not work as good as the more modern typesafe algorithms in Discovery Hub, e.g. if the combined length of all columns is longer than 8000 characters a Legacy hash will ignore everything after 8000 characters.
It would be fairly easy to add another algorithm to the list doing the same cast to bigint over one of the typesafe algorithms. (It would be just one extra line of code.) It would work as good as any of the algorithms presented by Wim.
However, with 64bits the chances of collisions are many orders of magnitude bigger than with the regular SHA1 hashing algorithms. This is a passage from the document about hash fields:
"A standard 20 bytes long SHA1 hash field can hold an astronomically big decimal number which is 48 digits long: 999999999999999999999999999999999999999999999999. The longest hash field in Discovery Hub (with the “SHA2 512” hashing algorithm) can hold a 154 digits long decimal number, which is considerably more than there are atoms in the entire universe. (The number of atoms in the whole observable universe is estimated to be “only” 80 decimal digits long.) But even the standard Hashing Algorithm (“SHA1”) with 48 decimal digits is long enough for a very secure representation of any possible data you can calculate your hash values from, despite of its combined lengths and the number of combined columns. This means that hash fields in Discovery Hub are very safe to use.
Note also that we don’t use hash fields in Discovery Hub for any cryptographic applications, so algorithm vulnerability to hacker’s attacks plays absolutely no role. (Even though the hashing algorithms used in Discovery Hub can be used in cryptography, they are only used as a hashing function mapping data of any arbitrary length and number of constituents to a hash value of fixed size.)"
This means that with SHA2 256 chances of collisions of hashes or supernatural keys are in all practical sense impossible. With SHA1 in a practical sense, they are so unlikely that we can also assume they are close to impossible.
With bigint it is a little different. bigint is only 8 bytes, so the maximum number of values it can store is "only" 18446744073709551615. It is still a very big number, but it cannot compare to the safety of the standard hashing algorithms above. I am fairly confident that values of both SHA1 (which is also used by the Legacy hashing algorithms including "Legacy Integer") and SHA2 are evenly distributed. Which means that chances of collisions depend only on the number of bits in the result.To summarize, the simplest solution would be to add the extra algorithm truncating a typesafe algorithm to bigint, which will work as good as any algorithm listed by Wim.
But an even better, safer, and more modern solution would be to combine junk dimensions with supernatural keys, like it was suggested by Doug. The only question, as always, is how many users would use it.

Doug Wynkoop Hi Radek,
You summarized my thoughts on this perfectly! The legacy algorithms aren't safe to use, and may result in collisions. Thanks for weighing in so thoughtfully on this!
Best regards,
Doug Wynkoop

Joseph Treadwell I know our current recommended fix for Qlik circular references is to use a "link table" which we create through a junk Dimension. So at least for Qlik users, this would be a big improvement.