Solved

Optimizing data selection in prepare instance when loading incremental Parquet files

18 days ago
June 16, 2025
2 replies
47 views

sva
Explorer
5 replies

I am loading a table incrementally into my ODX storage in an Azure Data Lake, where new Parquet files are added daily. This approach is because the source only holds data every two weeks, and I want to maintain a log in the ODX. The Parquet storage is very compact.

However, for downstream analysis, I only need to retrieve data from the last 1 to 2 days into my prepare instance. I am using a data selection rule on the mapping, and I have also tried applying it directly on the table. Both approaches take a very long time to complete (+1 hour), whereas running the same query on the source SQL database filtering for 2 days of data completes in about 10 seconds.

I suspect that the prepare instance is scanning through all the Parquet files, including older days, causing the slow performance.

My question:
Is there a way to configure the TX prepare instance to only process the most recent X Parquet files (e.g., the last 2 days) instead of scanning all files? This would significantly improve the selection speed.

Thanks in advance for your help!

Best answer by Thomas Lind

Hi @sva

When you are using a Data Lake to store your data in and are using incremental load on your data sources all the parquet files that gets generated are all the current data. So the second transfer will only contain data that is newer than the latest.

When you use incremental load on a data source, it will automatically be applied to a table from this source when it is added to the Data Area. So on each execution of the table the latest parquet file is added as new rows.

My point is that besides the first transfer all subsequent ones are filtered to only add the newest file.

What I would do is to add the Data Selection Rule on the second level.

I always have two data areas in my Prepare instances. One called Landing where the tables are added from the Ingest and one called Present which is the one that I connect to my Deliver instance.

I would set the filter rule on the Present version of the table. You can make a dynamic filter that only returns the last two days of data.

A completely alternate method would be to add a Managed Query on the data source itself. So you create a version of the table that has a dynamic date filter to only contain the last 2 days of data. This could then be connected to.

The issues with this approach is that some data source types do not support managed queries.

View original

Did this topic help you find an answer to your question?

+5

Thomas Lind
Community Manager
1094 replies
Answer
11 days ago
June 23, 2025