I am loading a table incrementally into my ODX storage in an Azure Data Lake, where new Parquet files are added daily. I use this approach because the source only retains data for two weeks, and I want to keep a full history in the ODX. The Parquet storage is very compact.
However, for downstream analysis, I only need to retrieve the last 1 to 2 days of data into my prepare instance. I am using a data selection rule on the mapping, and I have also tried applying it directly on the table. Both approaches take a very long time to complete (over an hour), whereas running the same query against the source SQL database, filtered to 2 days of data, completes in about 10 seconds.
I suspect that the prepare instance is scanning through all the Parquet files, including older days, causing the slow performance.
My question:
Is there a way to configure the TX prepare instance to only process the most recent X Parquet files (e.g., the last 2 days) instead of scanning all files? This would significantly improve the selection speed.
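To make the desired behavior concrete, here is a minimal sketch of the kind of file pruning I have in mind. It runs outside of TimeXtender and does not reflect how the ODX actually names or reads its files. The file naming pattern (one file per load day with the date in the name) and the `recent_files` helper are assumptions made purely for illustration.

```python
from datetime import date, timedelta
import re

# Hypothetical naming scheme: one Parquet file per load day,
# e.g. "orders_2024-05-03.parquet". The real ODX layout may differ.
DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})")


def recent_files(file_names, days, today):
    """Return the files whose embedded date falls within the last `days` days."""
    cutoff = today - timedelta(days=days - 1)
    selected = []
    for name in file_names:
        match = DATE_IN_NAME.search(name)
        if match is None:
            continue  # skip files without a recognizable date
        file_date = date.fromisoformat(match.group(1))
        if cutoff <= file_date <= today:
            selected.append(name)
    return sorted(selected)
```

With this kind of pruning, only the selected files would be read and filtered, rather than every file in the folder.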
Thanks in advance for your help!