Solved

Read latest CSV file (from S3 bucket)

  • 22 January 2024
  • 10 replies
  • 114 views

Userlevel 3
Badge +1

Hi,

I have a working connection to an S3 bucket with CSV files, using the CData CSV connector.

I would like to create a solution where I read only the latest file added to the bucket.

Calling the stored procedure “ListFiles” (https://cdn.cdata.com/help/RVJ/ado/pg_sp-listfiles.htm) in a query table returns a table of the files, with filename and created timestamp.

Is it possible to call ListFiles in an .rsd file? If so, I think it would be possible to access the created timestamp in another .rsd file using a query slicer which can be used to get the latest file.

Have anyone solved this problem, or something similar? 

I am using the latest version, 6512.1. 

icon

Best answer by Christian Hauggaard 2 February 2024, 09:58

View original

10 replies

Userlevel 4
Badge +5

Great question @pontus.berglund 

I'm really curious if this can work. 

Userlevel 3
Badge +1

@daniel ,

What version are you running? I would test this setup if I was on v.20 and had a BU:

  1. Create a QueryTable with “EXEC ListFiles”
  2. Make a project variable getting the filename of the file with largest ChangeDate and call it “LatestFileVariable”
  3. Generate an .rsd file for the file I want to extract
  4. Add an input parameter to the .rsd file called “LatestFileInput” or something like that
  5. Add a custom data selection rule to the table resulting from the .rsd file and set it to LatestFileInput = LatestFileVariable

I don’t believe this setup is possible in later versions since we can’t access instance variables from the ODX that way. It would be interesting to see how a setup in the ODX would work. 

Userlevel 5
Badge +7

It may be possible with .rsd magic like using the IncludeFiles parameter (see: https://cdn.cdata.com/help/RVJ/ado/RSBCSV_p_IncludeFiles.htm) which has some filters you could use if the filename contains a date but what you could also do:

  • create a source that extracts the CSV source metadata into a table
  • filter down to the files you want to load
  • generate a string with rsd file syntax for every table that passes your filter
  • export the strings to .rsd file swith a standardized name format that matches another CSV source's table filter / IncludeFiles filter
  • extract the files’ content from the secondary csv source

 

In the new release I would be tempted to handle this in PowerShell.

Userlevel 6
Badge +5

Hi @pontus.berglund 

It should be possible to do something like this with RSD files.

Essentially you create a managed query that runs the stored procedure. The result of this needs to be a URI.

You can use the result of a managed query in another managed query. 

So if the RSD file use the Query slicer option as suggested, it will be able to use the result of the list file stored procedure.

Additionally I want to add this part.

We recently got a question about how to get info of what is the latest file and I think it may be related to this. I got a response then about the Listfiles procedure option.

To address this need, we have introduced a stored procedure called Listfiles. This procedure provides the following information:

Name

Type

Description

FileName

String

Returns the file name.

LastModified

Long

Returns the Unix timestamp when this file was last modified.

CreatedAt

Long

Returns the Unix timestamp when this file was created. Returns -1 when the storage does not support this field.

URI

String

Returns the URI.

To execute the stored procedure, you can call it using the following query:

 
EXEC Listfiles @mask='<filename.csv>' 

This query will assist you in fetching the respective result set.

 

So to me this seems like you would need to generate the correct file names and may not be able to use asterisks in the name.

Userlevel 3
Badge +1

@Thomas Lind,

Yeah that pretty much sums up what I want to do. My issue is that I don’t know how to generate a schema file for the SP Listfiles’s result set. Without doing that, I am not able to select the latest URI from the result set and pass it to the .rsd file making the data call.

I vaguely recall us talking about .rsfiles, which are similar to stored procedures, and can be referenced by .rsd files. Do you think using one of those might help here?

Userlevel 6
Badge +5

You do not use a schema file for that, you create a managed query to run the procedure. You can generally always run SQL queries against the CData provider.

You can run it like this.

 

Userlevel 3
Badge +1

@Thomas Lind,

I have created a QueryTable like the one in your screenshot, but I don’t see how I could pass the URI of ListFiles resultset to the .rsd call for the data.

The way I have done it before is setting up a query slicer in the .rsd, then I create a QueryTable like this:
SELECT col1, col2, col3

FROM [CSV].[TABLE1]
WHERE col1 IN (SELECT col1 FROM [CSV].[TABLE2])

Using this method, I am not able to get data from ListFiles since you cant select from the result set of a stored procedure in the same way. Using temp tables or declaring variables wont work either.

Did you have any other workaround in mind?

Userlevel 6
Badge +5

I do not know how to catch the result of the stored procedure so you can filter it and return only the URI for the newest addition.

I tried some things but I can only make it list the files.

However once this part is resolved you can make a RSD file that can take a URI input.

First you use a file as a template to generate the URI with. Note the top row in the info tag, it needs to have a specific title, schema and no URI path part.

You need to add an input field with the URI and to change that to be what is added.

Then you can run it in a Managed Query provided you can apply a path to the file as an URI.

Userlevel 3
Badge +1

Hi @Thomas Lind,

Yes, that sounds like what I had in mind with this approach. But unfortunately, I think it is only possible in v.20.

@daniel ,

What version are you running? I would test this setup if I was on v.20 and had a BU:

  1. Create a QueryTable with “EXEC ListFiles”
  2. Make a project variable getting the filename of the file with largest ChangeDate and call it “LatestFileVariable”
  3. Generate an .rsd file for the file I want to extract
  4. Add an input parameter to the .rsd file called “LatestFileInput” or something like that
  5. Add a custom data selection rule to the table resulting from the .rsd file and set it to LatestFileInput = LatestFileVariable

I don’t believe this setup is possible in later versions since we can’t access instance variables from the ODX that way. It would be interesting to see how a setup in the ODX would work. 

 

I have sent the question to the CData support. I will post what they say here.

Userlevel 6
Badge +5

Hi @pontus.berglund 

Please let us know if you have found a solution to this, by using file aggregation as suggested in the support ticket or using another method. Thanks!

Reply