Solved

Partition Parquet Files

  • December 16, 2025
  • 3 replies
  • 68 views

rsanchez
Starter

We are loading data from an on-prem SQL Server into a data lake (Ingest instance). The data is stored as parquet, but TX is creating one huge parquet file per table. We are using ADF to ingest the data into the Ingest instance. Is there a way to load the data so that, once it lands in the data lake, each table's parquet is split into parts? For example table_xxx_0001.parquet, table_xxx_0002.parquet, etc. (I know that is not the actual parquet naming, it is just an example.) Is this possible? We are having performance issues reading from parquet into the Azure SQL DB Prepare instance. We are using ADO.NET from Ingest to Prepare.
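For context, the target layout described above can be illustrated with a small pyarrow sketch that splits one large parquet file into numbered parts. This is not an ADF or TimeXtender setting, just a hypothetical post-processing example; the file names, paths and chunk size are assumptions.

# Illustrative sketch only: split one large parquet file into numbered parts.
# The source file name, output naming and 1M-row chunk size are assumptions,
# not ADF or TimeXtender settings.
import pyarrow as pa
import pyarrow.parquet as pq

SOURCE = "table_xxx.parquet"      # the single huge file produced per table
ROWS_PER_PART = 1_000_000         # rough target size of each part

reader = pq.ParquetFile(SOURCE)
for i, batch in enumerate(reader.iter_batches(batch_size=ROWS_PER_PART), start=1):
    part = pa.Table.from_batches([batch])
    pq.write_table(part, f"table_xxx_{i:04d}.parquet")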

Best answer by rory.smith (see the accepted reply below)

3 replies

Thomas Lind
Community Manager
  • December 17, 2025

Hi @rsanchez,

Are you using incremental load?

It doesn't partition the files, but after the initial full load the subsequent runs should generate much smaller parquet files.

On the other end, when the file data is imported into the Prepare instance, only the new file is loaded instead of the whole table.

For incremental load there is also a rollup task that can be executed against the parquet file folder and splits the files out into evenly sized chunks. The requirements are that you do not use updates and deletes in your incremental rule and that you use Azure Data Factory to do the transfer. The sketch below shows the general idea of such a rollup.
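(A conceptual pyarrow sketch of that rollup idea: compacting a folder of small incremental parquet files into evenly sized parts. The folder paths and row limits are assumptions, and this is not TimeXtender's actual rollup implementation.)

# Conceptual sketch only: compact a folder of small incremental parquet files
# into evenly sized output files. Paths and row limits are assumptions; this is
# not how TimeXtender's rollup task is implemented.
import pyarrow.dataset as ds

SOURCE_DIR = "datalake/table_xxx"          # folder with many small parquet files
TARGET_DIR = "datalake/table_xxx_rolled"   # destination for the compacted files

dataset = ds.dataset(SOURCE_DIR, format="parquet")
ds.write_dataset(
    dataset,
    TARGET_DIR,
    format="parquet",
    max_rows_per_file=1_000_000,           # cap the size of each output file
    max_rows_per_group=100_000,            # keep row groups reasonably sized
    basename_template="part-{i}.parquet",  # {i} is replaced with a counter
)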


rory.smith
TimeXtender Xpert
  • December 17, 2025

Hi,

When you say huge: how large is the parquet file compared to the storage size of the SQL table? Are the fields appropriately typed?

If you are using Azure SQL DB, your log I/O is limited, which is likely what slows the load into Prepare down. Hyperscale has a 150 MB/s log rate compared to something like 36 MB/s on the lower vCore Azure SQL DB tiers. You can see this in the metrics in the Azure Portal or by looking at log rate governor wait statistics.
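As a minimal sketch of that check, the same counters are exposed in the sys.dm_db_resource_stats DMV; the connection details below are placeholders, and pyodbc is just one client that could be used.

# Minimal sketch: read sys.dm_db_resource_stats (15-second samples, roughly the
# last hour) to see whether log writes or data I/O are near 100%.
# Server, database and credentials are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=your_prepare_db;"
    "UID=your_user;PWD=your_password"
)

rows = conn.execute(
    """
    SELECT TOP (20) end_time,
           avg_cpu_percent,
           avg_data_io_percent,
           avg_log_write_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC
    """
).fetchall()

for r in rows:
    print(r.end_time,
          f"cpu={r.avg_cpu_percent}%",
          f"data_io={r.avg_data_io_percent}%",
          f"log_write={r.avg_log_write_percent}%")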


rsanchez
Starter
  • Author
  • December 17, 2025

Hi guys, thanks for replying. This is a development environment, and the client is using DTU-based servers, no serverless or vCores, at 400 DTU. Data from Ingest to Prepare goes through ADO.NET.


rory.smith
TimeXtender Xpert
  • Answer
  • December 18, 2025

Hi,

Azure SQL DB S6 at 400 DTU delivers at most 4 * 400 = 1,600 IOPS at 10 ms latency. As far as I am aware, the log rate follows from that.

Azure SQL DB Hyperscale Premium at 2 vCores would deliver 8,500 IOPS at 4 ms latency with a 150 MB/s log rate.

Price-wise, a 400 DTU database is more or less equivalent to a 2 vCore Hyperscale Premium machine.

https://learn.microsoft.com/en-us/azure/azure-sql/database/service-tiers-dtu?view=azuresql

https://learn.microsoft.com/en-us/azure/azure-sql/database/resource-limits-vcore-single-databases?view=azuresql#hyperscale-premium-series-part-1-of-3

So I would check wait statistics to see whether I/O is your bottleneck here; if so, you could consider trying out a different type of PaaS database.
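As a minimal sketch of that wait-statistics check (reusing the same placeholder pyodbc connection as above), listing the top waits in sys.dm_db_wait_stats will show whether log rate governor or other I/O waits dominate.

# Minimal sketch: list the top waits for the database so that log-rate or I/O
# governor waits (e.g. LOG_RATE_GOVERNOR, WRITELOG, PAGEIOLATCH_*) stand out.
# Connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=yourserver.database.windows.net;"
    "DATABASE=your_prepare_db;"
    "UID=your_user;PWD=your_password"
)

rows = conn.execute(
    """
    SELECT TOP (10) wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_db_wait_stats
    ORDER BY wait_time_ms DESC
    """
).fetchall()

for r in rows:
    print(f"{r.wait_type:<35} waits={r.waiting_tasks_count} wait_ms={r.wait_time_ms}")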