Batch cleansing option - best practices, tips and tricks

  • 26 January 2023
  • 1 reply

Userlevel 2
Badge +3

Hi all,

I'd like to start a discussion about the batch cleansing option. You can find this option in: Table Settings → Performance → Enable batch data cleansing. 

This option is disabled by default. I'm looking for best practices when you should use it and how to determine the best batch cleansing size.

I normally enable it on big tables (5.000.000 + rows), this reduces the disk space used for logging. But in some cases it also improves performance. It can result in faster loading of a table. However, I'm always struggling determining the correct batch size. I know the answer will be ‘it depends’. But I'm trying to find some best practices for this option.

For now I start at 250.000 rows for batch cleansing and increase with 250.000 rows. I write down the timings and pick the best setting. This setting depends on the number of rows and number of transformations in the table. 

I'm curious if there are any practices or tips and tricks for this?

1 reply

Userlevel 5
Badge +5

Hi Bas

This is a good topic.

I did testing of this feature when it was initially added.

The main benefit of this feature was that it would allow it to release the data from memory, so you would not use a swap memory to store it in.

I found as you say that doing some testing with various setups to find the number that gives the best speed.

Here is an old snippet I sent to another partner.

Having it set to 50.000 rows uses far less memory, but it also requires many batches which makes the executions take more time. A large number can be faster, but it will use far more memory.

While it is some time since I tested this, I believe I found 500.000 to be the best number where the amount of data weren't too high and the speed wasn't too low.
What you also must take into consideration is the server you do this on, so it may be different on your server.

I can't remember if this was done on tables with a million rows, but I would probably choose 250.000 if it contains about an million.
Hope this explains it?

The point where it is best, is the point where the batch is not too big for the memory to handle and not too small to make the it slow due to all the batches that needs to be created. The amount of free memory on the server is the deciding factor.