We are currently running a database on Hyperscale with 12 vCores and premium memory-optimized hardware. The ODX storage is a data lake.
We are running execution packages on 4 threads, and we want to reduce our ODX transfer times, as that has the biggest impact right now.
This is an example of a load.
I am seeing long ODX transfer times for very small tables that I can't find the cause of or explain.
We have a lot of small tables (see the example below) that take 1-3 minutes, while some tables doing data cleansing on 1 million+ rows finish in the same amount of time.
See this small table with 400 rows, only 10 columns, and no transformations.
I can see that we are peaking log IO at 100% during certain periods.
However, that seems to happen during the later parts of the loads and not during the ODX transfer at the start.
Is there anything that could cause this, or any way to improve it?
Thank you, Victor
I would also be interested to know the reason behind this.
Hi @Victor
Please try to use SQL profiler in Azure Data Studio to identify which step is taking the longest time.
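If setting up a profiler trace is not convenient, a rough alternative (a minimal DMV sketch, not TimeXtender tooling; adjust the filter as needed) is to poll the active requests on the DSA/MDW database while a load is running, to see which statement is taking the time and what it is waiting on:

```sql
-- Minimal sketch: poll this while an execution package is running to see
-- which statements are active on the database and what they are waiting on.
SELECT
    r.session_id,
    r.status,
    r.command,
    r.wait_type,
    r.wait_time          AS wait_time_ms,
    r.total_elapsed_time AS elapsed_ms,
    SUBSTRING(t.text, r.statement_start_offset / 2 + 1,
        (CASE WHEN r.statement_end_offset = -1
              THEN DATALENGTH(t.text)
              ELSE r.statement_end_offset END - r.statement_start_offset) / 2 + 1) AS running_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.total_elapsed_time DESC;
```

Running this repeatedly during the slow data lake => DSA window should show whether the time is spent in an active statement on the database, or whether nothing is running there at all (which would point back at the ODX server / transfer side).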
Also, are you using incremental load? Can you please share screenshots of the table settings?
Have you considered using simple mode for these tables without transformations in an attempt to improve performance?
We have had some customers reporting performance issues with hyperscale recently, so it might be worth reaching out to somebody at Microsoft as well.
Hello @Christian Hauggaard
We are running incremental load on as many tables as we can. The load example (an hourly load) therefore has a combination of fully loaded and incremental tables.
We have simple mode on every table in the MDW, plus any DSA tables that do not have transformations.
We made the switch from Premium DTU (1000) to Hyperscale two weeks ago to try to get an increase in performance, but it has stayed the same. We first tried an increase in DTUs and later increased vCores from 8 => 12 => 16 without an improvement in ODX transfer (only data cleansing got better).
We are seeing an increase in ODX data transfer times over the past months, with a lot of smaller tables taking longer to run an ODX execute on.
It reminds me of a situation a year back when our ODX backlog file was corrupted, which made every table take slightly longer on every load. However, we are not seeing the same error messages, so nothing indicates it's the same problem.
Previously, when we needed better performance, we tweaked the database; however, that is not improving the ODX transfer.
The DSA => MDW step is extremely quick and barely noticeable in the execution package.
The issue seems to lie in the data lake => DSA step.
Here is a table in the DSA with 3 rows and 3 columns that I ran twice; it took 60s to execute in the DSA while no other package was running.
Today, while monitoring the loads, I saw the CPU usage on the VM running the ODX server sometimes spike to 90-100%.
I assume this bottlenecks the ADO.NET transfer? However, most of the time it is running below 60%, which was the case during the above example.
Hi,
there are quite a few things to look at here:
Look at your database metrics at a fine time granularity: you are probably running into log I/O governor waits (100% log I/O). Make sure you choose the max variant of the metrics and a fine enough time granularity so your peaks are not hidden.
Look at the Query Store in your database through SSMS and analyze your waits (see the T-SQL sketch at the end of this reply). If the log I/O governor has a fairly high share, then you are bottlenecked by that. The only way to get faster log I/O is by running IaaS or by changing the amount of data or the ordering of the load.
Look at the most expensive queries in your Query Store: if you are spooling to tempdb, that is slower on Hyperscale than on the non-Hyperscale databases.
If you are using ADO.NET to transfer from ADLS to Azure SQL DB Hyperscale, you may be limited by the memory on your TimeXtender application server. The Parquet files need to be transferred to the VM and decompressed there before being sent to the Hyperscale DB. There is a setting to limit the memory; if you are not running out of memory, you can increase it. If you are running out of CPU, you can of course add more cores to the VM. If you are running a VM in the D-series, consider scaling it to an E-series for more RAM per core.
You can consider using ADF with an Azure integration runtime or a self-hosted integration runtime on a dedicated VM (or cluster) to make this transfer (potentially) faster. This has some settings you will need to tweak to ensure fast flows.
If your ODX is refreshing data, any DSA transfer will wait for that refresh to finish. This is included in the Data Movement time and is not separately visible.
If your ODX Data Source runs out of parallel connections (DSA transfers count towards this), your DSA transfer will be pending and waiting in the queue. This also shows up as Data Movement time and is not separately visible.
I have not seen much benefit from the premium memory-optimized hardware compared to regular premium hardware, but that will depend strongly on the specific cleansing you are doing. Adding cores does not increase log I/O, which is 100 MB/s for all scalings; premium hardware simply ensures you always get a gen8 CPU instead of something from gen5 to gen8.
There have been some issues with Hyperscale resources about a month ago, but those are resolved now as far as I am aware
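For the metrics and Query Store checks above, a minimal T-SQL sketch along these lines (run in the data warehouse database; the 24-hour window and TOP 20 are arbitrary examples) can show whether log-rate-governor waits dominate and which queries are the most expensive:

```sql
-- Recent resource usage at ~15-second granularity (roughly the last hour),
-- so short log I/O peaks are not averaged away.
SELECT end_time,
       avg_cpu_percent,
       avg_log_write_percent,
       avg_data_io_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

-- Query Store wait categories over the last 24 hours:
-- a large share in the log-rate-governor category points at the log I/O cap.
SELECT ws.wait_category_desc,
       SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS ws
JOIN sys.query_store_runtime_stats_interval AS i
    ON ws.runtime_stats_interval_id = i.runtime_stats_interval_id
WHERE i.start_time > DATEADD(HOUR, -24, SYSUTCDATETIME())
GROUP BY ws.wait_category_desc
ORDER BY total_wait_ms DESC;

-- Most expensive queries in the Query Store by total duration.
SELECT TOP (20)
       q.query_id,
       qt.query_sql_text,
       stats.executions,
       stats.total_duration_ms
FROM (
    SELECT p.query_id,
           SUM(rs.count_executions) AS executions,
           SUM(rs.avg_duration * rs.count_executions) / 1000.0 AS total_duration_ms
    FROM sys.query_store_plan AS p
    JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
    GROUP BY p.query_id
) AS stats
JOIN sys.query_store_query AS q ON q.query_id = stats.query_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
ORDER BY stats.total_duration_ms DESC;
```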
Hello Rory,
Thank you for your feedback.
I have looked into each point before, but I went through them again today while running tests.
I have been keeping track of the log IO, and while we reach 100% at later stages (understandable given the data size), it is not during the period where we run a lot of small tables, where we have a max log IO of 40%, nor during my testing when the schedules are turned off.
CPU and memory on the VM doing the ADO.NET transfer never went above 30%.
We do not have any sources running ODX syncs, transfers, or any other tasks during scheduled loads or during my testing.
But the question remains the same.
Why does a 3-row table take 60s-120s for TimeXtender to transfer from the ODX data lake to a SQL database? The ODX transfer task takes 1s in comparison.
This table is not running alongside expensive queries during any of my testing.
The log IO, CPU usage, and every other metric on the SQL database are below 20% during all testing on the above table. The VM is running at 20-30% CPU and memory. No ODX tasks are running that should be affecting it.
Please note that I took a small table as an example, but this is affecting a lot of other "small" tables as well.
I can find no metrics that explain this bottleneck, nor can I find any way to improve it.
Hi @Victor ,
How are your incremental loads set up? If you have delete handling or PK updates turned on, your incremental loads will always pull the full primary key set from the source, which may be substantial. If that set has crossed some kind of tipping point, it could explain the increased slowness.
Are your sources database connections or CData? If they are CData, do you have logging turned on? This can substantially slow down ingestion.
It may be worth having a look at the disk queues on your TimeXtender VM while you are ingesting; the ODX Server uses a SQLite database as a local cache, and that may be part of your bottleneck.
It may be interesting to try and see if you can get logging from your source. That way you could monitor any latency between triggering an execute in TX and the source system being queried and subsequently the writing of parquet to ADLS. If there is lag, I would expect the SQLite DB or something else local to the VM to be the cause.
If your ODX → DWH step is the main culprit, you could consider restoring a backup of your Hyperscale database to a new database and verifying whether that has a different performance profile. I have encountered cases (and am still trying to find the RCA with Microsoft) where Hyperscale database performance can 'degrade' even though nothing seems wrong from an infrastructural perspective.
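If a point-in-time restore through the portal is awkward to schedule, a database copy gives a similarly fresh database to compare against. A minimal sketch, assuming a copy is acceptable for the comparison; the database names and service objective are placeholders:

```sql
-- Minimal sketch: create a copy of the existing Hyperscale database to compare
-- its performance profile against the original. 'DSA' / 'DSA_copy' and the
-- service objective are placeholders; match your own names and SLO.
-- Run against the master database of the logical server.
CREATE DATABASE [DSA_copy]
AS COPY OF [DSA]
( SERVICE_OBJECTIVE = 'HS_Gen5_12' );

-- The copy is ready once it is no longer in the COPYING state:
SELECT name, state_desc
FROM sys.databases
WHERE name = N'DSA_copy';
```

Pointing a test execution package at the copy would then show whether the slow data lake => DSA transfer follows the database or stays with the original.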
@Victor do you have an update on Rory’s questions above? Were you able to restore a backup of your Hyperscale database to a new database?
Hello,
@Christian Hauggaard, sorry for the late response.
During the past month we have finally had time to run further tests.
We tried several different setups and combinations on the same data sets running for several days.
1. We went over to Business Critical again and got the same result.
2. We restored the database as well as going back to Hyperscale, and got the same result.
3. We tested several different VMs, and while we found one that resulted in a slightly better transfer compared to the others, the difference was not noticeable enough.
I also had the opportunity to test the ODX with a data lake (as we had above) and the same ODX with a SQL database, to compare the differences in another environment.
The performance between the two was very different, and the results surprised me. The ODX running on a SQL database had on average 3x faster transfer times for ODX transfer tasks (source => ODX), as well as better ODX => DSA performance. This was tested across multiple sources and data sizes.
While I do not have the technical knowledge to fully understand the infrastructure behind the solution in depth, this testing leads me to believe the culprit lies in the Parquet and/or data lake structure, as the difference between them was so big.
Ignoring the cost difference for the moment, I would have a hard time recommending a data lake over a SQL database for any solution requiring quicker loads, based on the results we got.
Has anyone else tested this and seen different or similar results?