Skip to main content

There is a host of pages under the open data umbrella and one of the providers is called Socrata. This https://soda.demo.socrata.com/ is the main URL for this and in there you have a lot of data based options on specific things such as environment, governance, and health.

This API uses a specific pagination method where you need to state an offset and limit value to get all the data.

Many other APIs do something similar. I have seen some use take and size, some start and size and so forth. They all work the same way and this should work for these as well.

Initial Setup

Start by finding the base URL.

First, you find the endpoint you want to use I use the Current Employee Salaries dataset.

https://soda.demo.socrata.com/dataset/Current-Employee-Names-Salaries-and-Position-Title/9d3p-tn9t/about_data

Once located you can find the actual endpoint by clicking on Actions and then API.

You can copy it from this location.

Once copied you can add this part as the Base URL.

https://soda.demo.socrata.com/resource/

It should look like this.

Pagination

To set up pagination we need to use a dynamic SQL query and the {TX_CurrentPage} number operator option. This option will increase its value by one on each iteration starting with 0.

The two options that are used for pagination on this API are called $offset and $limit. The default limit value is 1000, but you can increase it to 2000 maximum. That is what we want to do to keep the pages as few as possible.

The dynamic SQL query I made looks like this.

SELECT {TX_CurrentPage} * 2000 as offsetValue

This means that it will use the current {TX_CurrentPage} number and increase it with 2000 on each iteration and this value will be saved in the offsetValue name.

You also point to an SQL Server, but you do not need to specify what database it should run on.

server=localhost

The User field is set to an SQL Authorization user and the Password field is this user's password.

Finally, the result of this is used in a parameter where the Name is equal to $offset and the Value is equal to {offsetValue}.

To figure out how to find when it should stop you can add a check on what is returned and then see if the returned contains what you look for. So if you set up that it should increase the page value on each iteration, the first page will look like this. 

And the last page will look like this.

This means that the rule should check whether the file is empty and if so stop. The XPath rule that can do that is the following {*/*/*}n1]. That setup means that it will run if there is any content after the third level and if there isn’t it will stop. A different wording of the XPath could be (TX_Autogenerated_Root/TX_Autogenerated_Element)r1]/*_1]. The reason it checks for three levels is that there could be another level TX_Autogenerated_Element that could be received without having any real data.

You need a Parameter with the Name field equal to stopCondition, the Type field equal to XPath, and the Value field equal to (*/*/*)d1].

It should all look like this.

There is no $limit query parameter set up on this as it will run the first iteration without this setup and therefore only return 1000 on the first page.

Endpoint

The endpoint for this dataset is Employees, so the endpoint path is Name equal to Employee salaries and Path equal to employee-salaries.json.

Besides this, you need to add the @limit query parameter as Name equal to @limit and Value equal to 2000.

As an option, you can also add an $order parameter to order the data based on one of the fields. I made the result be ordered by the ID field. To name a field add a : in front of it. 

I set the Name field to be equal to $order and the Value field to be equal to :id.

It should look like this.

That is it. It should iterate over the 32.6k rows and show this info.

Hej @Thomas Lind,

Thanks for making this guide, it will come in very useful! I have implemented the steps as described, and it seems to probably be doing something with the pagination, as I am getting the following error from the data provider: 

Failed to execute endpoint 'TX_Medewerkers':
Pagination for endpoint 'TX_Medewerkers' stopped due to hitting configured max limit of 1000 pages. Please increase limit.

I only expect to receive about 50 pages of data, of 1000 pages each. This are my settings: 

Kind regards,

Andrew

 


Hi @andrew.gebhard 

I think it returns the same URI on each iteration.

You should be able to see this if you set the Caching option to point at a local folder and look at the files that are generated.

I can’t see the SQL query in the image or the query parameter. That is the important bit. For reference this part

 


Hi @Thomas Lind,

This is what I have. For this API, the offset parameter is called ‘skip’ and the limit parameter is called ‘take’. So skip 1000, take 1000, skip 2000, take 1000 etc. 

I assume the name of the parameter should not matter, is that correct? 

Kind regards,

Andrew


Hi Thomas, 

I found the mistake myself; i was using TX_NextPage instead of TX_CurrentPage. I'm not sure why, but that meant it just recursively got the first 100 records, regardless of the take/limit parameter. The skip was not updated with each call. 

The stopCondition also did not work for me as written; with only looking at emptiness after the third level, the pagination would continue after receiving the following ‘empty’ XML:

<TX_Autogenerated_Root>
<TX_Autogenerated_Element>
<skip>39000</skip>
<take>1000</take>
</TX_Autogenerated_Element>
</TX_Autogenerated_Root>

So, I tried adding another */ to the condition, so it ends up: (*/*/*/*)/1]. This successfully instructs the connector to stop getting new pages. 

Kind regards,

Andrew


Hi @andrew.gebhard 

Good point. 

I was attempting to explain what the stop condition is referring to and how it always could contain some data and that it needed to hit a lower level than that. 

This is also a great visual representation of what it means if the source always returns data no matter if it is usable or not. 


Reply