PK table in ODX database is growing rapidly in terms of disk space
Hi community,
I have a question about the ‘PK’ table that is created in the ODX when you enable ‘Handle primary key updates’ and ‘Handle primary key deletes’.
I’ve noticed the ODX database (a SQL database) has been growing rapidly in disk space since I enabled an incremental load schedule that reloads data every 5 minutes. When I used the built-in SQL Server ‘Disk Usage by Top Tables’ report, I noticed the PK tables are the biggest in terms of disk space.
My PK table for the GL Entry table contains 4.5 billion rows and is 95 GB, while my DATA table contains only 118 million records.
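For anyone who wants to verify this without the report, a query along these lines over the standard SQL Server DMVs lists tables by reserved space (nothing in it is ODX-specific):

-- Reserved space (in MB, 8 KB pages) and row count per table, largest first.
SELECT s.name AS schema_name,
       t.name AS table_name,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables AS t ON t.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
GROUP BY s.name, t.name
ORDER BY reserved_mb DESC;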
When I query the PK table and filter on a single primary key value, that value turns out to be stored 75 times: once for every odx_batch_number (0 – 74). Is this normal? I find it a bit strange that my PK tables are the biggest tables in the ODX in terms of disk space.
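For reference, a check along these lines shows how many PK rows each batch holds (dbo.[GL Entry_PK] is a placeholder; substitute the actual _PK table name in your ODX database):

-- Row count per ODX batch in a PK table (the table name is hypothetical).
SELECT [odx_batch_number], COUNT(*) AS pk_rows
FROM dbo.[GL Entry_PK]
GROUP BY [odx_batch_number]
ORDER BY [odx_batch_number];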
Even when I run the storage management task, it doesn’t clean the PK tables; they still contain the primary keys for every ODX batch load.
The customer is using version 20.10.37.64
Hi @bas.hopstaken
It adds all PK values to the PK table every time you do an execution.
The table is used to compare which records have been deleted or updated. It does not matter whether you use SQL or a Data Lake to store the data.
To decrease the size, run a full load and use the storage management task to only allow 1 version.
Also, this rollup feature may be able to reduce the size. It does not apply to the SQL storage type, though.
Hi Thomas,
Thanks for the quick reply. I'm just wondering: is it intended to work like this?
I understand the purpose of the PK table, but saving the last 2 loads/batches in this PK table should be enough, right?
Is it really necessary to save all primary keys for every load/batch?
Hi Bas,
I expect having older PK states would be useful for replaying history; this way the ODX replaces the “type 2 every field in every table” approach. Handling deletes and updates is quite expensive in terms of performance, so it is generally not useful for every table.
Hi @rory.smith,
Replaying history can be a reason to save the PKs for all batches. I know that this option is quite expensive in terms of performance, but I didn't know that the PKs for all batches are saved.
Nevertheless, it would be great if we had the option to choose how many batches are kept in the PK tables. Maybe the storage management task could be used for this; that way we could save a lot of disk space on the PK tables.
Hi @bas.hopstaken ,
I agree: Storage Management tasks should have an option to clean up the _PK tables (as these can be quite expensive). I would intuitively expect those tasks to clean that up too.
Please post and upvote this in the idea section. Many thanks!
Hi All,
We are running into the same issue. The size of the PK table is absolutely ridiculous compared to the table it is used for. Is there any progress from TX on cleaning up the PK tables?
Hi all,
As a temporary workaround, I've created a dynamic SQL script to truncate all ‘PK’ tables:
-- Variables used by the cursor loop (these declarations were missing from the original snippet)
DECLARE @TableName sysname,
        @TableSchema sysname,
        @SQL nvarchar(max);

-- Create a cursor to loop through the PK table names
-- ([_] escapes the underscore, which is otherwise a single-character wildcard in LIKE)
DECLARE table_cursor CURSOR FOR
    SELECT table_name, table_schema
    FROM information_schema.tables
    WHERE TABLE_TYPE = 'BASE TABLE'
      AND table_name LIKE '%[_]PK[_]%';

-- Loop through the tables and generate TRUNCATE TABLE statements
OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @TableName, @TableSchema;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 'TRUNCATE TABLE ' + QUOTENAME(@TableSchema) + '.' + QUOTENAME(@TableName);
    EXEC sp_executesql @SQL;

    FETCH NEXT FROM table_cursor INTO @TableName, @TableSchema;
END;

-- Close and deallocate the cursor
CLOSE table_cursor;
DEALLOCATE table_cursor;
Please be aware: Do not run this script while reloads are running.
I created an idea for this a while ago:
Let's upvote this idea and hope it will be implemented quickly.
Hi @bas.hopstaken ,
You are dropping the whole _PK set here, right? Doesn't that break the deletion handling / PK update feature? I would only drop everything older than the last state, but that is probably an expensive delete and might require index maintenance.
As the _PK table seems to be required only for table-internal processing, you should be able to run a post-script on the _PK table after cleanse without (b)locking other processes. Depending on process duration, you might either do a simple post-process or run weekend maintenance or something like that.
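As a sketch of what “only drop everything older than the last state” might look like (assuming the highest odx_batch_number is the only state the delete/update handling needs, and using a hypothetical table name):

-- Keep only the most recent batch; delete in chunks to limit lock escalation and log growth.
-- dbo.[GL Entry_PK] is a hypothetical table name; adjust to your environment.
DECLARE @LastBatch int = (SELECT MAX([odx_batch_number]) FROM dbo.[GL Entry_PK]);

WHILE 1 = 1
BEGIN
    DELETE TOP (100000)
    FROM dbo.[GL Entry_PK]
    WHERE [odx_batch_number] < @LastBatch;

    IF @@ROWCOUNT = 0 BREAK;
END;

Chunking keeps the transaction log and lock footprint manageable, but rebuilding the _PK table's indexes afterwards is probably still advisable.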
I would really like to know what the reasoning behind storing all the states is, as I cannot imagine actually using the data without writing a whole load of complex SQL in some kind of disaster recovery scenario.