Solved

Read latest CSV file (from S3 bucket)

1 year ago
January 22, 2024
10 replies
263 views

pontus.berglund
Contributor
64 replies

Hi,

I have a working connection to an S3 bucket with CSV files, using the CData CSV connector.

I would like to create a solution where I read only the latest file added to the bucket.

Calling the stored procedure “ListFiles” (https://cdn.cdata.com/help/RVJ/ado/pg_sp-listfiles.htm) in a query table returns a table of the files, with filename and created timestamp.

Is it possible to call ListFiles in an .rsd file? If so, I think it would be possible to access the created timestamp in another .rsd file using a query slicer which can be used to get the latest file.

Have anyone solved this problem, or something similar?

I am using the latest version, 6512.1.

Best answer by Christian Hauggaard

Hi @pontus.berglund

Please let us know if you have found a solution to this, by using file aggregation as suggested in the support ticket or using another method. Thanks!

View original

Did this topic help you find an answer to your question?

daniel
TimeXtender Xpert
192 replies
1 year ago
January 22, 2024

Great question @pontus.berglund

I'm really curious if this can work.

pontus.berglund
Author
Contributor
64 replies
1 year ago
January 22, 2024

@daniel ,

What version are you running? I would test this setup if I was on v.20 and had a BU:

Create a QueryTable with “EXEC ListFiles”
Make a project variable getting the filename of the file with largest ChangeDate and call it “LatestFileVariable”
Generate an .rsd file for the file I want to extract
Add an input parameter to the .rsd file called “LatestFileInput” or something like that
Add a custom data selection rule to the table resulting from the .rsd file and set it to LatestFileInput = LatestFileVariable

I don’t believe this setup is possible in later versions since we can’t access instance variables from the ODX that way. It would be interesting to see how a setup in the ODX would work.

rory.smith
TimeXtender Xpert
702 replies
1 year ago
January 23, 2024

It may be possible with .rsd magic like using the IncludeFiles parameter (see: https://cdn.cdata.com/help/RVJ/ado/RSBCSV_p_IncludeFiles.htm) which has some filters you could use if the filename contains a date but what you could also do:

create a source that extracts the CSV source metadata into a table
filter down to the files you want to load
generate a string with rsd file syntax for every table that passes your filter
export the strings to .rsd file swith a standardized name format that matches another CSV source's table filter / IncludeFiles filter
extract the files’ content from the secondary csv source

In the new release I would be tempted to handle this in PowerShell.

Thomas Lind
Community Manager
1095 replies
1 year ago
January 23, 2024

Hi @pontus.berglund

It should be possible to do something like this with RSD files.

Essentially you create a managed query that runs the stored procedure. The result of this needs to be a URI.

You can use the result of a managed query in another managed query.

So if the RSD file use the Query slicer option as suggested, it will be able to use the result of the list file stored procedure.

Additionally I want to add this part.

We recently got a question about how to get info of what is the latest file and I think it may be related to this. I got a response then about the Listfiles procedure option.

To address this need, we have introduced a stored procedure called Listfiles. This procedure provides the following information:

Name

Type

Description

FileName

String

Returns the file name.

LastModified

Long

Returns the Unix timestamp when this file was last modified.

CreatedAt

Long

Returns the Unix timestamp when this file was created. Returns -1 when the storage does not support this field.

URI

String

Returns the URI.

To execute the stored procedure, you can call it using the following query:
 
EXEC Listfiles @mask='<filename.csv>' 
This query will assist you in fetching the respective result set.

So to me this seems like you would need to generate the correct file names and may not be able to use asterisks in the name.

pontus.berglund
Author
Contributor
64 replies
1 year ago
January 23, 2024

@Thomas Lind,

Yeah that pretty much sums up what I want to do. My issue is that I don’t know how to generate a schema file for the SP Listfiles’s result set. Without doing that, I am not able to select the latest URI from the result set and pass it to the .rsd file making the data call.

I vaguely recall us talking about .rsb files, which are similar to stored procedures, and can be referenced by .rsd files. Do you think using one of those might help here?

Thomas Lind
Community Manager
1095 replies
1 year ago
January 23, 2024

You do not use a schema file for that, you create a managed query to run the procedure. You can generally always run SQL queries against the CData provider.

You can run it like this.

pontus.berglund
Author
Contributor
64 replies
1 year ago
January 23, 2024

@Thomas Lind,

I have created a QueryTable like the one in your screenshot, but I don’t see how I could pass the URI of ListFiles resultset to the .rsd call for the data.

The way I have done it before is setting up a query slicer in the .rsd, then I create a QueryTable like this:
SELECT col1, col2, col3

FROM [CSV].[TABLE1]
WHERE col1 IN (SELECT col1 FROM [CSV].[TABLE2])

Using this method, I am not able to get data from ListFiles since you cant select from the result set of a stored procedure in the same way. Using temp tables or declaring variables wont work either.

Did you have any other workaround in mind?

Thomas Lind
Community Manager
1095 replies
1 year ago
January 25, 2024

I do not know how to catch the result of the stored procedure so you can filter it and return only the URI for the newest addition.

I tried some things but I can only make it list the files.

However once this part is resolved you can make a RSD file that can take a URI input.

First you use a file as a template to generate the URI with. Note the top row in the info tag, it needs to have a specific title, schema and no URI path part.

You need to add an input field with the URI and to change that to be what is added.

Then you can run it in a Managed Query provided you can apply a path to the file as an URI.

pontus.berglund
Author
Contributor
64 replies
1 year ago
January 25, 2024

Hi @Thomas Lind,

Yes, that sounds like what I had in mind with this approach. But unfortunately, I think it is only possible in v.20.

pontus.berglund wrote:

@daniel ,

What version are you running? I would test this setup if I was on v.20 and had a BU:

Create a QueryTable with “EXEC ListFiles”
Make a project variable getting the filename of the file with largest ChangeDate and call it “LatestFileVariable”
Generate an .rsd file for the file I want to extract
Add an input parameter to the .rsd file called “LatestFileInput” or something like that
Add a custom data selection rule to the table resulting from the .rsd file and set it to LatestFileInput = LatestFileVariable

I don’t believe this setup is possible in later versions since we can’t access instance variables from the ODX that way. It would be interesting to see how a setup in the ODX would work.

I have sent the question to the CData support. I will post what they say here.

Christian Hauggaard
Community Manager
1161 replies
Answer
1 year ago
February 2, 2024

Hi @pontus.berglund

Please let us know if you have found a solution to this, by using file aggregation as suggested in the support ticket or using another method. Thanks!

Reply

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Name	Type	Description
FileName	String	Returns the file name.
LastModified	Long	Returns the Unix timestamp when this file was last modified.
CreatedAt	Long	Returns the Unix timestamp when this file was created. Returns -1 when the storage does not support this field.
URI	String	Returns the URI.

Reply

Related topics

TimeXtender CSV Data source

TimeXtender Parquet Data source

TimeXtender Excel data source

Connect to CSV data sources

Welcome to TimeXtender Data Integration!

Most helpful members this week

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings