Solved

Ingesting JSON files from nested date and hour subfolders

7 months ago
November 5, 2024
2 replies
81 views

RobvdBrink
Starter
1 reply

Hi all,

I am currently working with a client who is using TimeXtender version 6766.1 in an Azure environment. We are trying to ingest data from a source that generates JSON files stored in a hierarchical folder structure organized by date and hour. The folder structure is as follows:

subject/version/date=YYYY-MM-DD/hr=HH/*.json

Example Path: dev/Call/1.2/date=2024-10-30/hr=14/*.json

Within each hr=HH (hour) folder, there may be one or more JSON files. Similarly, each date=YYYY-MM-DD (date) folder may contain multiple hour subfolders, though it's rare for all 24 hour subfolders to be present.

I have successfully configured the data source connector in TimeXtender to read all JSON files from a specific hour folder, as shown in the example above. However, our goal is to configure TimeXtender to ingest all JSON files across all date and hour subfolders under the main path:

dev/Call/1.2/

When I attempt to use wildcards to achieve this—such as:

dev/Call/1.2/*/*/*.json

I encounter an error stating:

Error 404: Path not found

My questions are:

Is it possible to configure TimeXtender to recursively ingest JSON files from all subfolders using wildcards or via another method?
If so, could you please provide guidance on how to set up the data source connector to accomplish this?
If this functionality is not supported, would you recommend implementing an Azure Function or another solution to consolidate all JSON files into a single folder for TimeXtender to access?

Your assistance in resolving this issue would be greatly appreciated.

Thanks in advance for your support.

Best regards,

Rob van den Brink

Best answer by Thomas Lind

Hi @RobvdBrink

Let me just add my findings and the current status of this issue.

Here is the response regarding if it was possible in the normal setup to do this.

Unfortunately, all the aggregated file related properties are for CSV files, not JSON since trying to aggregate JSON files dynamically would be significantly more complicated than with CSV files. I do think this should be possible in an RSD, likely with some kind of pseudo-column input for which dates/times you want to include and use of a nested set of enum blocks to loop through all the relevant URIs. One thing that may be an issue would be the exact names of the JSON files, looking through the ZIP example here, it seems like these files have randomly generated GUIDs as names, which would make it difficult to build the URI to that JSON file. If the files were instead named with a scheme that you could iterate through (something like file1.json, file2.json, etc.), that would probably help with the implementation in the RSD file.

I then made some RSD files.

This is how it ended up.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<!-- See Column Definitions to specify column behavior and use XPaths to extract column values from JSON. -->
	<api:info title="VehicleCombination" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
		<!-- You can modify the name, type, and column size here. -->
		<attr name="VehicleId" xs:type="string" readonly="false" other:xPath="/json/Call/CallVisit/RoadVisit/VehicleCombinations/VehicleCombination/VehicleId"/>
	</api:info>

	<api:set attr="DataModel" value="DOCUMENT"/>
	<api:set attr="baseURI" value="C:\\FlatFiles\\e-mergo\\Call\\1.2\\date={date}\\hr={hour}"/>
	<api:set attr="JSONPath" value="$.Call.CallVisit.RoadVisit.VehicleCombinations.VehicleCombination"/>
	<api:set attr="EnablePaging" value="True"/>

	<!-- The GET method corresponds to SELECT. Here you can override the default processing of the SELECT statement. The results of processing are pushed to the schema's output. See SELECT Execution for more information. -->
	<api:script method="GET">
		<api:set attr="method" value="GET"/>
		<api:set attr="dates" value="2024-10-25,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31"/>
		<api:enum list="[dates]" separator=",">
			<api:set attr="date" value="[_value]"/>
			<api:set attr="URIDate" value="[baseURI | replace('{date}', [date])]" />
			<api:enum range="0..23">
				<api:set attr="hour" value="[_value | rjust(2, '0')]" />
				<api:set attr="URI" value="[URIDate | replace('{hour}', [hour])]" />
				<api:call op="jsonproviderGet">
					<api:catch code="500"></api:catch>
					<api:push/>
				</api:call>
			</api:enum>
		</api:enum>
	</api:script>
</api:script>

However there is an issue with this.

I do not point at a specific file or have a third enum iterating over filenames as they are all randomly given GUIDS like this 2c2426e0-9737-496b-bd2d-0fad7b62a33b.json. When you do that looping does not work.

Here is the response from CData regarding this.

Unfortunately, we only support retrieving multiple files by pointing the URI to the folder containing them when you are not using paging. When you enable the user paging, it is expected that you will provide the URI to a specific file when making the jsonproviderGet call.

So for this setup, this file do not work, but if others find this question and have a similar issue it might.

You just need to point at a specific file not just a folder.

View original

Did this topic help you find an answer to your question?

+5

Thomas Lind
Community Manager
1091 replies
7 months ago
November 11, 2024

Hi @RobvdBrink

Let me just add that we have an internal case for this. If a solution is found, I will add it here.

+5

Thomas Lind
Community Manager
1091 replies
Answer
7 months ago
November 15, 2024

Hi @RobvdBrink

Let me just add my findings and the current status of this issue.

Here is the response regarding if it was possible in the normal setup to do this.

Unfortunately, all the aggregated file related properties are for CSV files, not JSON since trying to aggregate JSON files dynamically would be significantly more complicated than with CSV files. I do think this should be possible in an RSD, likely with some kind of pseudo-column input for which dates/times you want to include and use of a nested set of enum blocks to loop through all the relevant URIs. One thing that may be an issue would be the exact names of the JSON files, looking through the ZIP example here, it seems like these files have randomly generated GUIDs as names, which would make it difficult to build the URI to that JSON file. If the files were instead named with a scheme that you could iterate through (something like file1.json, file2.json, etc.), that would probably help with the implementation in the RSD file.

I then made some RSD files.

This is how it ended up.

<api:script xmlns:api="http://apiscript.com/ns?v1" xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<!-- See Column Definitions to specify column behavior and use XPaths to extract column values from JSON. -->
	<api:info title="VehicleCombination" desc="Generated schema file." xmlns:other="http://apiscript.com/ns?v1">
		<!-- You can modify the name, type, and column size here. -->
		<attr name="VehicleId" xs:type="string" readonly="false" other:xPath="/json/Call/CallVisit/RoadVisit/VehicleCombinations/VehicleCombination/VehicleId"/>
	</api:info>

	<api:set attr="DataModel" value="DOCUMENT"/>
	<api:set attr="baseURI" value="C:\\FlatFiles\\e-mergo\\Call\\1.2\\date={date}\\hr={hour}"/>
	<api:set attr="JSONPath" value="$.Call.CallVisit.RoadVisit.VehicleCombinations.VehicleCombination"/>
	<api:set attr="EnablePaging" value="True"/>

	<!-- The GET method corresponds to SELECT. Here you can override the default processing of the SELECT statement. The results of processing are pushed to the schema's output. See SELECT Execution for more information. -->
	<api:script method="GET">
		<api:set attr="method" value="GET"/>
		<api:set attr="dates" value="2024-10-25,2024-10-25,2024-10-26,2024-10-27,2024-10-28,2024-10-29,2024-10-30,2024-10-31"/>
		<api:enum list="[dates]" separator=",">
			<api:set attr="date" value="[_value]"/>
			<api:set attr="URIDate" value="[baseURI | replace('{date}', [date])]" />
			<api:enum range="0..23">
				<api:set attr="hour" value="[_value | rjust(2, '0')]" />
				<api:set attr="URI" value="[URIDate | replace('{hour}', [hour])]" />
				<api:call op="jsonproviderGet">
					<api:catch code="500"></api:catch>
					<api:push/>
				</api:call>
			</api:enum>
		</api:enum>
	</api:script>
</api:script>

However there is an issue with this.

I do not point at a specific file or have a third enum iterating over filenames as they are all randomly given GUIDS like this 2c2426e0-9737-496b-bd2d-0fad7b62a33b.json. When you do that looping does not work.

Here is the response from CData regarding this.

Unfortunately, we only support retrieving multiple files by pointing the URI to the folder containing them when you are not using paging. When you enable the user paging, it is expected that you will provide the URI to a specific file when making the jsonproviderGet call.

So for this setup, this file do not work, but if others find this question and have a similar issue it might.

You just need to point at a specific file not just a folder.

Ingesting JSON files from nested date and hour subfolders

2 replies

Reply

Most helpful members this week

Cookie policy

Cookie settings

Reply

Related topics

TimeXtender Excel data source

The bcp client received an invalid column length for column ID 1.icon

Need some help with ingestingicon

TimeXtender CSV Data source

TimeXtender Parquet Data source

Most helpful members this week

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings