Batch cleansing option - best practices, tips and tricks

2 years ago
January 26, 2023
3 replies
694 views

bas.hopstaken
TimeXtender Xpert
91 replies

Hi all,

I'd like to start a discussion about the batch cleansing option. You can find this option in: Table Settings → Performance → Enable batch data cleansing.

This option is disabled by default. I'm looking for best practices when you should use it and how to determine the best batch cleansing size.

I normally enable it on big tables (5.000.000 + rows), this reduces the disk space used for logging. But in some cases it also improves performance. It can result in faster loading of a table. However, I'm always struggling determining the correct batch size. I know the answer will be ‘it depends’. But I'm trying to find some best practices for this option.

For now I start at 250.000 rows for batch cleansing and increase with 250.000 rows. I write down the timings and pick the best setting. This setting depends on the number of rows and number of transformations in the table.

I'm curious if there are any practices or tips and tricks for this?

Thomas Lind
Community Manager
1100 replies
2 years ago
January 26, 2023

Hi Bas

This is a good topic.

I did testing of this feature when it was initially added.

The main benefit of this feature was that it would allow it to release the data from memory, so you would not use a swap memory to store it in.

I found as you say that doing some testing with various setups to find the number that gives the best speed.

Here is an old snippet I sent to another partner.

Having it set to 50.000 rows uses far less memory, but it also requires many batches which makes the executions take more time. A large number can be faster, but it will use far more memory.

While it is some time since I tested this, I believe I found 500.000 to be the best number where the amount of data weren't too high and the speed wasn't too low.
What you also must take into consideration is the server you do this on, so it may be different on your server.

I can't remember if this was done on tables with a million rows, but I would probably choose 250.000 if it contains about an million.
Hope this explains it?

The point where it is best, is the point where the batch is not too big for the memory to handle and not too small to make the it slow due to all the batches that needs to be created. The amount of free memory on the server is the deciding factor.

Mickey0
Starter
1 reply
7 months ago
November 26, 2024

How have you found the batch size that is most suitable for your data, was it just trial and error? Or is there a golden standard that looks at free memory, the rows/fields and amount/type of transformations performed on the data? You are stating that the amount of free memory is the deciding factor, so i am wondering if there is a way to find a nice batchsize without having to wait hours for all of my trials to run.

rory.smith
TimeXtender Xpert
711 replies
7 months ago
November 26, 2024

Hi,

it isn't really that simple: your table that you set batch-cleansing on is running in a broader execution context controlled by TimeXtender, the resources you use as processing engine have their own limits and scaling and context. I.e. you might determine a trend on performance vs package size when running your table separately which suggests an optimum which is not optimal during your nightly run because you are running 12 threads in parallel.

Optimizing this is no different than the type of approach you would use to optimize queries or resources in SQL Server. There are no hard-and-fast “this is the best way” there, it will always be “it depends”.

In this context you need to get a baseline during your scheduled loads and then change settings and try to find an optimum. You also need to realize that tables that grow or shrink may affect their ordering in the run, thereby changing the scarcity of resources avaialable.

Reply

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Reply

Related topics

Batch Data Cleansing

Troubleshooting data cleansing rulesicon

The More You Know: Why TimeXtender is Essential in Today's Data-Driven World

The More You Know? Time is Money! What are the 5 features of TimeXtender that are "so money"? And what's with the name TimeXtender?

Connecting to AFAS with the TX REST Connector

Most helpful members this week

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings