
Hi all,

I'd like to start a discussion about the batch cleansing option. You can find this option in: Table Settings → Performance → Enable batch data cleansing. 

This option is disabled by default. I'm looking for best practices on when to use it and how to determine the best batch cleansing size.

I normally enable it on big tables (5.000.000+ rows), because it reduces the disk space used for logging. In some cases it also improves performance and can result in faster loading of a table. However, I always struggle to determine the correct batch size. I know the answer will be ‘it depends’, but I'm trying to find some best practices for this option.

For now I start at 250.000 rows for batch cleansing and increase in steps of 250.000 rows. I write down the timings and pick the best setting. This setting depends on the number of rows and the number of transformations in the table.
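For illustration, the bookkeeping for such a sweep doesn't need to be more than something like the sketch below. Nothing in it talks to TimeXtender; the durations are simply whatever you note down for each run, and the batch sizes are just the steps from my own sweep.

```python
# Purely illustrative bookkeeping for the manual sweep described above.
# Fill in the duration (in seconds) that you measured for each batch size;
# nothing here calls TimeXtender, and no timing below is a real measurement.
timings = {
    250_000: None,    # e.g. a run of 12m34s would become 754
    500_000: None,
    750_000: None,
    1_000_000: None,
}

measured = {size: secs for size, secs in timings.items() if secs is not None}
if measured:
    best_size, best_secs = min(measured.items(), key=lambda kv: kv[1])
    print(f"Fastest so far: {best_size:,} rows per batch ({best_secs} s)")
else:
    print("No timings recorded yet.")
```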

I'm curious if there are any practices or tips and tricks for this?

Hi Bas

This is a good topic.

I did testing of this feature when it was initially added.

The main benefit of this feature is that it releases the data from memory between batches, so you don't end up using swap memory to store it.

As you say, I found that it takes some testing with various setups to find the number that gives the best speed.

Here is an old snippet I sent to another partner.

Having it set to 50.000 rows uses far less memory, but it also requires many batches, which makes the executions take more time. A large number can be faster, but it will use far more memory.

While it is some time since I tested this, I believe I found 500.000 to be the best number, where the amount of data wasn't too high and the speed wasn't too low.
You also have to take the server you run this on into consideration, so the optimum may be different on your server.


I can't remember if this was done on tables with a million rows, but I would probably choose 250.000 if the table contains about a million rows.
Hope this explains it?

The best point is where the batch is not too big for the memory to handle, but also not so small that execution becomes slow because of all the batches that need to be created. The amount of free memory on the server is the deciding factor.
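To make that trade-off a bit more concrete, here is a minimal back-of-the-envelope sketch. It is not a TimeXtender feature; the average bytes per row and the free memory figure are assumptions you would have to measure on your own server.

```python
import math

# Back-of-the-envelope only: bytes_per_row and free_memory_mb are assumptions,
# not values read from TimeXtender, and should be measured on your own server.
def estimate(rows: int, batch_size: int, bytes_per_row: int):
    """Return (number of batches, rough peak memory per batch in MB)."""
    batches = math.ceil(rows / batch_size)
    peak_mb = batch_size * bytes_per_row / (1024 ** 2)
    return batches, peak_mb

rows = 5_000_000        # table size from the example earlier in the thread
bytes_per_row = 200     # assumed average row width
free_memory_mb = 8_192  # assumed free memory on the server

for batch_size in (50_000, 250_000, 500_000, 1_000_000):
    batches, peak_mb = estimate(rows, batch_size, bytes_per_row)
    fits = "fits" if peak_mb <= free_memory_mb else "exceeds free memory"
    print(f"batch {batch_size:>9,}: {batches:>3} batches, ~{peak_mb:,.0f} MB per batch ({fits})")
```

The largest batch size whose per-batch estimate still sits comfortably below free memory at least gives a starting point for the trial runs described above.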


How have you found the batch size that is most suitable for your data? Was it just trial and error, or is there a golden standard that looks at free memory, the rows/fields and the amount/type of transformations performed on the data? You state that the amount of free memory is the deciding factor, so I am wondering if there is a way to find a good batch size without having to wait hours for all of my trials to run.


Hi,

It isn't really that simple: the table you enable batch cleansing on runs in a broader execution context controlled by TimeXtender, and the resources you use as the processing engine have their own limits, scaling and context. For example, you might determine a trend of performance versus batch size when running your table separately, suggesting an optimum that is not optimal during your nightly run, because there you are running 12 threads in parallel.

Optimizing this is no different from the approach you would use to optimize queries or resources in SQL Server. There is no hard-and-fast “this is the best way” here; it will always be “it depends”.

In this context you need to get a baseline during your scheduled loads, then change settings and try to find an optimum. You also need to realize that tables that grow or shrink may change their position in the run order, thereby changing the scarcity of the resources available.
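As a purely illustrative way to see why an isolated optimum can shift, you can divide the same kind of back-of-the-envelope memory estimate over the number of parallel threads. All figures in the sketch below are assumptions, including the 12 threads used as an example above.

```python
# Illustrative only: all figures are assumptions (200 bytes per row, 8 GB free,
# 12 parallel threads as in the example above), not measurements.
free_memory_mb = 8_192
threads = 12
per_table_share_mb = free_memory_mb / threads  # rough share each parallel table can count on

for batch_size in (250_000, 500_000, 1_000_000):
    per_batch_mb = batch_size * 200 / (1024 ** 2)  # assumed 200 bytes per row
    verdict = "within its share" if per_batch_mb <= per_table_share_mb else "exceeds its share"
    print(f"batch {batch_size:>9,}: ~{per_batch_mb:,.0f} MB vs ~{per_table_share_mb:,.0f} MB share ({verdict})")
```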

