Tutorial

Connect to CSV data sources

  • 20 January 2023
  • 8 replies
  • 2071 views

Userlevel 6
Badge +5

This is a CData data source, more on what this is in the following link. Add a CData data source

CSV Information

Here is a link to the file I use in this example

http://mysafeinfo.com/api/data?list=moviesbyearnings2005&format=csv

It is set up with elements like this.

Year,Rank,Title,Gross $,Month Opened,Theatres2005,1,Star Wars: Episode III - Revenge of the Sith,380270577.000000,2015-05,36632005,2,The Chronicles of Narnia: The Lion the Witch and the Wardrobe,291710957.000000,2015-12,38532005,3,Harry Potter and the Goblet of Fire,290013036.000000,2015-11,38582005,4,War of the Worlds,234280354.000000,2015-06,3910

Setup of the provider

Add a data source, search for CSV and choose this provider.

The first step in setting up a CData connector should always be to click the following link CSV guide from CData it will show the documentation CData provides.

Connecting to one file

You can save the link in the CSV Information area as something csv and save it in a folder your user has rights to.

I saved mine in C:/Flat Files/Movies2005.csv. Then you just add a link to it in the Data Source field.

You can then set it up like this.

You just need to add the location of the file in the URI field and then it should work.

An important thing to check is the following Row Scan Depth field.

It is now by default this is set to 0, it used to be 100. If it is 100 it meant that it would define the data types by reading the first 100 rows. The new default of 0 means that it is a indefinite number, then it will read the whole file before determining the fields data types. If your file is very large it can take a lot of time to synchronize and here setting it to 10000 is better.

Connecting to more than one file

If you have multiple files, you can connect to a folder. Instead of adding a path to a file you add the last level in the URI field. C:\FlatFiles\TESTCSVFTP

You can also point to files across many folders. You need to be sure the following features are turned on.

Connect to more than one file across sub-directories

You need to set Include Subdirectories to True and if you do not want it to go all the way to the end of the path, you can change the Directory Retrieval Depth to another number. -1 means that it has no limit and is the default value.

​Connect to CSV files on a FTP/SFTP server

You also have the option of connecting to a FTP server that contains CSV files. The connection string is as follows.

URI=ftp://localhost:port/folder1; User=user1; Password=password1;

I have a FTP server with a folder that contains some CSV files and here is how I set it up.

I added the password to the Password field, the user account to the User field and in the URI field I added the location of the server. The default ftp port is 21.

If the files are stored on a SFTP server it works similarly, but you need to change the string slightly.

URI=sftp://127.0.0.1:port/folder1; User=user1; Password=password1;
 
​So the URI needs to start with sftp:// and the default port is 22 instead.

Connect to files on Azure Storage

Files stored in an Azure Data Lake Gen2 Storage Container are generally accessed using one of the following two authentication methods.

  1. Azure Service Principal. Click on How to Use an Azure Service Prinicipal to access files stored in an Azure Data Lake Gen2 Storage Container for a detailed explanation on how this is done.
  2. Azure Access Key. The instructions on how to use an Access Key is outlined below.

Connect using Storage Account and Access key

Start by setting the Auth Scheme field to AzureAD.

​Find the storage account name and add that to the Azure Storage Account field. You also need a Azure Access Key. This can be found here.
​It is applied as so in the Azure Authentication area

Data Lake Gen 2 ABFSS

abfss://myfilesystem/remotePath
URI=abfss://myfilesystem/folder1; AzureStorageAccount=myAccount; AzureAccessKey=myKey;

Your myfilesystem is the name of the container and the remotePath is the name of the folder

Important: specify abfss:// in the URI field in lowercase letters

I have a container called sharedfiles and a file there with the name applied.

AzureBlob

azureblob://mycontainer/myblob/ 
URI=azureblob://mycontainer/myblob/; AzureStorageAccount=myAccount; AzureAccessKey=myKey;

Your mycontainer is the name of the container and the myblob is the foldername.

For both methods what comes after the // is the same. First is the container name, then the folders and finally the file. Though you can choose to point at only the folder if you want.

AzureFile File Share

If your file is stored in a File Share instead of a container you can use the azurefile:// option

 

azurefile://fileShare/remotePath 
URI=azurefile://bucket/folder1; AzureStorageAccount='myAccount'; AzureSharedAccessSignature='mySharedAccessSignature';

Troubleshooting

Here are some of the common issues you can experience and how to resolve these issues.

The file is not using the default delimiter settings

As long as it is comma separated, this is all you need. If it is not comma separated there are two options.

There are two fields you can use to change the default settings.

The first field is the FMT field. It has the default value CsvDelimited.

As explained here Link this field can be filled with whatever sign you want to use as a delimiter.

Here I overwrote the std options with a 'pipe' | sign.

When synchronized it will search for this field to base the fields on.

The other field is this Row Delimiter field.

So if your file doesn't use CR LF as the row delimiter, you can add it here.

Aggregate multiple files into one

A feature of CSV is the option to aggregate files in a folder into one. I have this folder with some files that contain data from a specified month.

Other than that they are exactly the same. It is the same folder I pointed at in the section regarding connection to multiple files. For this to work it is important that they have the exact same field names in all of the files.

You connect to the files the same way as in the connect to multiple files, the only difference is that you set the Aggregate Files field to true.

URI: points to the folder where the files are stored. This could also be a folder on a FTP/SFTP server.

Aggregate Files: Needs to be set to True for the CSV provider to attempt to merge the files.

Specifically for this I also change the default Row Scan Depth to 10000. This is because I had no decimal points in a field for at least 100 rows, but there where one in a row above that number. In this instance I would not set it to 0 as it would read all the merged files first before choosing the data types and this makes synchronizing slow.

When I synchronize the data source it will generate a table called AggregatedFiles.

It contains data from all 10 files with the URI field showing the file name.

Select and aggregate files with similar names across folders

It can be that you have more than one file type in the folder you do aggregates at, for example some files with a date part of their names, but also files that do not contain the same fields and data types.

You only want to merge the Orders files not the others. To do this you change the Include Files field to contain the name and add an asterisk for the part you want to ignore.

Include Files: Order*.csv

Now it will only aggregate the two order files and not the others. You can also mix this with the Include Subdirectories feature to do it across many folders. This also works for all the special connections such as FTP, Azure and similar.

Controlling the names of the fields

If you cannot control that the field names are always the same, you can set it to ignore the field names and skip the 1st row in each file.

Uncheck the field Include Column Headers and set the field Skip Top to -1 in the Data Formatting area.

The above settings will give this. Where all field names is called Col1 and so forth.

Troubleshooting data types

The provider attempts to detect data types, but sometimes it may run into errors.  There are a few workarounds for quick data extraction:

  1. Generate a RSD file and change the field types manually to override specific data types.
     
  2. Set Type Detection Scheme = “None” in the provider setup dialog options.
    • This will bring all data as string type (varchar). You may convert it later to other types as needed.
    • This technique may work in any CData provider which exposes Type Detection Scheme property.

Dates are shown in the wrong timezone

Sometimes your datetime fields will have added or subtracted hours. This is due to it being converted to your timezone. The CData provider expects all dates will come in UTC and add hours according to the settings on your computer. It even knows about summertime.

If you do not want this to happen, you need to add this to the other field

ConvertDateTimeToGMT

Convert the character set

If your CSV file is in a specific encoding, you can convert it by adding this to the field.

Charset=UTF-16

It could be other encodings as well, but UTF-8 is the default. The file needs to be in this, it will not convert it by changing this.

The files were made with different culture settings

If your file is created with some specific culture settings you can use the two fields Culture and Client Culture to align it to a specific one.

The list of available options is here.

So if my files are set up with Dutch culture settings and I wanted to convert them to settings for USA, I would add nl-NL in the Client Culture field and en-US in the Culture field. Then it should align the values after this when you synchronize.

 


8 replies

Userlevel 4
Badge +1

Hi @Thomas Lind can you provide the setup if the location it’s a SharePoint folder?

Userlevel 4
Badge +1

Hi @Thomas Lind It would be nice to have too the setup to access to multiple files stored in a ODX Server’s folder.

Obviouslly these setups in the portal based data sources configuration.

Userlevel 6
Badge +5

HI @rvgfox 

You need an app registration that has rights to the SharePoint you connect to.

You do not need to state the SharePoint URL unless the file is located in a specific site, otherwise it will use the default, which is known by the app, it seems.

You need to specify that your connection type is Sharepoint Rest and the URI needs to start with sprest://folder/file.csv
It is similar to the locations in Azure DL that you don’t specify all of it in the URI only the folders.

So in the above, I point to a file that is located in the following folders.

As you can see this file is not stored on a specific site, but the default page. So it doesn’t need to be added anywhere, but if it was stored in a site, you would add the site in the Storage Base URL field.

So if you see this

The URI would be

sprest://Exported/c

Then in the Include files field I would change it so it contained

PEC_Nicor_Gardens_*.txt

I would then check that the Aggregate Files field is set to true to merge all the files in this folder.

I just updated the Excel guide that explains the same thing.

Userlevel 6
Badge +5

Hi @Thomas Lind It would be nice to have too the setup to access to multiple files stored in a ODX Server’s folder.

Obviouslly these setups in the portal based data sources configuration.

Hi @rvgfox 

I needed to update this, as it is explained in this section.

 

Userlevel 4
Badge +1

Thanks @Thomas Lind 

but how is the setup if the multiple files are in one folder on the ODX server machine (example C:XTXT).

How is the setup?
 

 

Userlevel 6
Badge +5

Hi @rvgfox 

You keep saying the same thing and I will share the same link again.

Can you maybe explain how the above is not helping you?

Is there a way to have several aggregate files within 1(ONE) datasource like order*.csv and a table customers*.csv

Userlevel 6
Badge +5

Hi @malkn 

Yes there is a way.

You can make an RSD file that Aggregates one part that is specified by what is in the include field.

So the RSD has order*.csv as the URI source. Once these work you can connect to as many as you want in one CSV data source.

It was something I did a while ago, so I will have to do some research to find them again and it was something that used to not work, so you need to be sure to use the newest version of the provider.

Reply