Solved

CData REST API connector hangs when loading large datasets


Userlevel 3
Badge +2

We’re facing issues with the CData REST API connector when requesting a large dataset from an API. TX sends the request, but it seems to never receive a response. The execution fails when it hits the timeout limit (currently set to 3600 seconds).

In the logs (verbosity 4) I can see that nothing happens between the moment the request is sent and the moment of the timeout.

 

I put the same URI, headers and parameters into Azure Data Factory; after 9 minutes it received a response of 204 MB and 176K rows. In Postman, I received a response after 8 minutes.

 

I agree that it might be better to somehow make smaller API requests; in fact, when I limit the date range to only this year, TX gets a response in about a minute. However, I still expect TX/CData to finish the request when the dataset is larger and the server needs more time to generate the response.

Due to NDAs I cannot post logs, RSDs or credentials here, but I’ll send some additional files via email.

TX version 20.10.39
CData REST API version 22.0.8423.0
 


Best answer by Thomas Lind 24 March 2023, 16:28


15 replies

Userlevel 6
Badge +5

Hi Erik

What sort of settings are you using in the GET call for this?
Do you use pagination?
Have you set up a batch size in the setup of the provider?

 

Userlevel 3
Badge +2

Hi Thomas,

Just sent the settings via email.

Pagination is not used and batch size on Business Unit level is set to 300K rows.

Userlevel 6
Badge +5

Hi Erik

I can think of some workarounds, but it should not just time out. I have asked CData about this.

Userlevel 3
Badge +2

Hi Thomas,

Thank you. I agree that it should not just time out. We’re also discussing the issue with the creator of this API. It would be better to load this API incrementally, but so far we don’t have the parameters needed.

Userlevel 6
Badge +5

Hi Erik

If you have start and end date options in the URI, you can apply something close to an incremental load. The issue will be if you don’t have a date field to apply the incremental rule on.

If that is the case, I would do the following.

Use the dynamic date option from my RSD guide, which is in the file you got. For the first few executions, run one year at a time with fixed start and stop dates.

You store this in a table with history applied and no deletes. Then you run all subsequent years like this until you reach the present time, and after that you set it to pull only the last few days on each execution.

It may transfer some of the same rows again, but it will not be that many, and since the source table runs with history, it will be able to control this so that no duplicates are applied.
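As a minimal sketch of this chunking idea (in Python, purely to illustrate the logic rather than the RSD syntax itself), something like the following could be used. The endpoint URL and the startdate/enddate/outputtype parameter names are assumptions based on the example URI quoted later in this thread, not confirmed details of the actual API.

# A minimal sketch, assuming a hypothetical endpoint URL and startdate/enddate/
# outputtype query parameters: backfill one fixed year per run until the present,
# then switch to a short rolling window that only re-reads the last few days.
# History on the target table keeps the overlapping rows from being duplicated.
from datetime import date, timedelta
import requests

BASE_URL = "https://production.rest.com/Report/report/<report-id>/task/report/datareport"  # placeholder

def date_windows(today: date, first_year: int = 2021, overlap_days: int = 3):
    """Yield (start, end) ranges: whole years first, then a small rolling window."""
    year = first_year
    while date(year, 12, 31) < today:
        yield date(year, 1, 1), date(year, 12, 31)
        year += 1
    yield today - timedelta(days=overlap_days), today

def fetch_range(start: date, end: date) -> bytes:
    """Request a single date range with a generous timeout for large responses."""
    params = {
        "startdate": f"{start.isoformat()}T00:00:00Z",
        "enddate": f"{end.isoformat()}T23:59:59Z",
        "outputtype": "JSON",
    }
    resp = requests.get(BASE_URL, params=params, timeout=3600)
    resp.raise_for_status()
    return resp.content

In TX you would of course run only one window per execution; the generator above simply enumerates the windows in the order they would be processed.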

Userlevel 3
Badge +2

Hi Thomas,

Ok, thanks for the suggestion. Will have a look at it. What makes it complicated is that the API queries a MongoDB and translates all tables to a relational model. As a result, some tables, including this one, do not have a primary key. Even the combination of all fields does not result in unique rows. Somehow (according to the business) this makes sense, and they told me not to remove duplicates. I will discuss it once more, because I have the idea that we are not receiving all attributes.

Userlevel 6
Badge +5

Hi Erik

If the table does not have a primary key, you can get around it by generating a supernatural key that consists of all of the fields, so it hopefully becomes a unique key. You can set this key as the primary key and then still set up history.

If there is no way of making a row unique even when setting all fields as the primary key, I would argue that the only benefit would be to show how many times each row appeared, perhaps with an additional custom query that counts the rows grouped by all the fields.
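As a rough sketch of both ideas, assuming the rows are available as a hypothetical pandas DataFrame, the supernatural key can be built by hashing the concatenation of all fields, and the duplicate count by grouping on all fields:

# A minimal sketch, assuming the extracted rows sit in a pandas DataFrame.
import hashlib
import pandas as pd

def add_supernatural_key(df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate every field of a row and hash it into one candidate key."""
    concatenated = df.astype(str).agg("|".join, axis=1)
    out = df.copy()
    out["supernatural_key"] = concatenated.map(
        lambda s: hashlib.sha1(s.encode("utf-8")).hexdigest()
    )
    return out

def duplicate_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Group by all original fields and count how often each identical row appears."""
    return df.groupby(list(df.columns), dropna=False).size().reset_index(name="row_count")

Note that rows which are identical in every field still end up with the same key, which is exactly why the count query is the only way to preserve how many times such a row occurred.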

Userlevel 3
Badge +2

Hi Thomas,

We’re working on a better solution with the company who created the API, but it can take some time. Any updates from CData yet?

Userlevel 6
Badge +5

Hi Erik

This is the reply I received from CData:

Hi Thomas,

Our driver isn't receiving any kind of response from the API, not even headers, which is fairly strange. The only significant difference I see between the Postman request and our driver's request is that the filters in our request are URL-encoded. Can the customer attempt to query the endpoint without the URL-encoded filters and see if this returns the larger dataset successfully? The easiest way to do this would be to remove the WHERE filters and set the URI in the RSD file to:
production.rest.com/Report/report/acde1234-321e-4a96-a7bb-0178cbf5ecab/task/report/datareport?startdate=2021-01-01T00:00:00Z&enddate=2023-12-31T23:59:59Z&outputtype=JSON

Another thing the customer can change is to remove the Accept-Encoding:gzip line from the CustomHeaders property in their connection string. Our driver automatically appends this header, and it is possible that having two separate Accept-Encoding headers could be causing an issue in the server's response.
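To try both suggestions outside the driver, a small Python sketch like the one below could be used. It is only an illustration: the URL is the anonymised example above (scheme assumed to be HTTPS), and the Authorization header is a placeholder for whatever authentication the API actually requires.

import requests

# The literal URI from the suggestion above, with the date filters left unencoded.
URL = ("https://production.rest.com/Report/report/acde1234-321e-4a96-a7bb-0178cbf5ecab"
       "/task/report/datareport"
       "?startdate=2021-01-01T00:00:00Z&enddate=2023-12-31T23:59:59Z&outputtype=JSON")

headers = {
    "Accept-Encoding": "gzip",          # exactly one Accept-Encoding header
    "Authorization": "Bearer <token>",  # placeholder credential
}

# Stream the body so a ~200 MB response does not need to fit into a single read,
# and allow plenty of time, since the server takes several minutes to answer.
with requests.get(URL, headers=headers, timeout=3600, stream=True) as resp:
    resp.raise_for_status()
    size = sum(len(chunk) for chunk in resp.iter_content(chunk_size=1 << 20))
    print(f"Received {size / 1e6:.0f} MB, status {resp.status_code}")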

If they still continue to run into timeout issues, would they be willing to allow us to use their credentials to test this API on our end and see if we can identify what is causing the response to be blocked?

Let me know if they continue to run into issues or if they have any questions.

Userlevel 3
Badge +2

Hi Thomas,

Unfortunately no success.

The API developer will set up a test environment soon, also to test a new feature which will allow us to load incrementally. Maybe I can share these credentials when it’s ready.

Userlevel 6
Badge +5

Hi @erik.van.mastrigt just following up - any update on this?

Userlevel 3
Badge +2

Hi @Christian Hauggaard, no, not yet. There’s no test environment yet.

Userlevel 6
Badge +5

Hi @erik.van.mastrigt have you managed to set up a test environment?

Userlevel 3
Badge +2

Hi @Christian Hauggaard, yes we have and we perform incremental loads now. It’s been running fine for about two weeks. I don’t think the issue mentioned here is fixed, but we have a solid workaround.

I can still share the test credentials for CData if you like.

Userlevel 6
Badge +5

@erik.van.mastrigt which workaround are you currently using? If you could share the test credentials, that would be great, thanks.
