Before you attempt this you need to add the CSV provider to your list of available sources.
In the business unit we already have two data source providers that can connect to CSV/TXT files and for the most part I would suggest using those. There is however a few times where this can be used. Especially because this can connect to a folder with multiple files with different columns and fields in each of them.
It is now possible to aggregate multiple files with different names, but the same structure into a single file, with the file name as a field.
Go through this guide before you start.
- CSV Information
- CData Setup
- Connecting to one file
- Connecting to more than one file
- Connect to files on a FTP server
- Connect to files on an Azure Storage
- The file is not using the default delimiter settings
- Getting the data
- Aggregate multiple files into one
Here is a link to the file I use in this example
This one is set up with elements like this.
Year,Rank,Title,Gross $,Month Opened,Theatres 2005,1,Star Wars: Episode III - Revenge of the Sith,380270577.000000,2015-05,3663 2005,2,The Chronicles of Narnia: The Lion the Witch and the Wardrobe,291710957.000000,2015-12,3853 2005,3,Harry Potter and the Goblet of Fire,290013036.000000,2015-11,3858 2005,4,War of the Worlds,234280354.000000,2015-06,3910
Add a CData data source and choose the CSV provider.
The first step in setting up a CData connector should always be to click the ? in the top right corner.
You need to save the link in the information area as something csv and save it in a folder your user has rights to.
I saved mine in C:/Flat Files/Movies2005.csv. Then you just add a link to it in the Data Source field.
You can then set it up like this.
You just need to add the location of the file in the URI or Data Source field and then it should work.
Another important thing to change is the following Row Scan Depth field.
By default this is set to 100 as you can see, but I would change it to 0 which is equal indefinite and then it will read the whole file before determining the fields data types.
If you have multiple files, you can connect to a folder. Like this
You can also point to files across many folders. You just need to be sure the following features are turned on.
So you need to set Include Subdirectories to True and if you do not want it to go all the way to the end of the path, you can change the Directory Retrieval Depth to another number. -1 means that it has no limit.
You also have the option of connecting to a FTP server that contains CSV files. The connection string is as follows.
URI=ftps://localhost:990/folder1; User=user1; Password=password1;
I have a FTP server with a folder that contains some CSV files and here is how I set it up.
So I added my password to the Password field, my user account to the User field and in the URI field I added the location of the server. The default ftp port is 21.
If the files are stored on a SFTP server it works similarly, but you need to change the string slightly.
URI=sftp://127.0.0.1:22/folder1; User=user1; Password=password1;
So the URI needs to start with sftp:// and the default port is 22 instead.
Connect to files on an Azure Storage
There are two methods:
ABFSS:// Data Lake Gen 2
Important: specify abfss:// in URI in lowercase letters
For both methods what comes after the // is the same. First is the container name, then the folders and finally the file. Though you can choose to point at only the folder if you want.
As long as it is comma separated, this is all you need. If it is not comma separated there are two options.
There are two fields you can use to change the default settings.
The first field is the FMT field. Seemingly it has three options.
That is actually not the case. As explained here Link it can actually be filled with whatever sign you want to use as a delimiter.
Here I overwrote the std options with a 'pipe' | sign.
This when synchronized will give the same as if it was a , sign like default.
The other field is this Row Delimiter field.
So if your file doesn't use CR LF as the row delimiter, you can add it here.
It still needs to be saved in that folder in C:/Flat Files/Movies2005.csv. Additionally, you need to create a file called schema.ini. It has to be named exactly that, or it wont work.
You can read about the various parameters by googling csv and schema.ini, but for this we need it setup like this.
First off you need to have the full file name as the first line in square brackets. Then you need to add each change to the file that is not default, mainly delimiters and such.
Then when you synchronize it looks at this file and finds the correct delimiter. This also works for multiple files, they just need to be added below with the same setup.
Like this for a folder.
Format=Delimited(;) [Sales Invoice Line.csv]
Are you connecting to files on a FTP, they also need the ini file added to its folder.
After setting up the provider, all needed is to synchronize/read the data source.
After that you can set up incremental load and make data type overwrites.
If your data doesn't come out as you expecting, go through the CData guide and see if you can turn on a setting that will give you the correct look.
A new feature of CSV is the option to aggregate files in a folder into one.
I have this folder with some files that contain data from a specified month.
Other than that they are exactly the same. This is important for it to work, they have the exact same field names in all of the files. If not there are some options which is present in the issues segment.
I connect to the files and use the following settings.
URI: points to the folder where the files are stored. As described before, this could also be a folder on a FTP/SFTP server.
Aggregate Files: Needs to be set to True for the CSV provider to attempt to merge the files.
Specifically for this I also change the default Row Scan Depth to 10000. This is because I had no decimal points in a field for at least 100 rows, but there where one in a row above that number.
When I synchronize the data source it will generate a table called AggregatedFiles.
It contains data from all 10 files with the URI field showing the file name.
Using a asterisk to point at specific files
It can be that you have more than one file type in the folder you do aggregates at, for example some files with a date part of their names, but also files that do not contain the same fields and data types, like so.
You only want to merge the Orders files not the others. To do this you change the Include Files field to contain the name and add an asterisk for the part you want to ignore. Like so.
Now it will only aggregate the two order files and not the others. You can also mix this with the Include Subdirectories feature to do it across many folders. It also works for all the special connections like FTP, Azure and similar.
If you cannot control that the field names always is the same, you can set it to ignore the field names and skip the 1st row in each file. Such as below.
Another option is to point to a specific Metadata file in the Metadata Discovery URI field and pull the info from there.
The above settings will give this. Where all field names is called Col1 and so forth.
Troubleshooting data types
CData provider attempts to detect data types, but sometimes it may run into errors. There are a few workarounds for quick data extraction:
1. Row Scan Depth
The number of rows to scan when dynamically determining columns for the table.
Set it to 0 if the source file is not too big. Setting this value to 0 (zero) will parse the entire CSV document.
increasing RowScanDepth ( >100 ) may help in some cases.
Manually set up rules to override specific data types.
Dates are shown in the wrong timezone
Sometimes your datetime fields will have added or subtracted hours. This is due to it being converted to your timezone. The CData provider expects all dates will come in UTC and add hours according to the settings on your computer. It even knows about summertime.
If you do not want this to happen, you need to add this to the other field
Convert the character set
UPDATE - The following are not necessary there now are a field you can set.
If your CData release is a legacy version, you can set it like below still
If you have an CSV file in a specfic encoding, you can convert it by adding this to the Other field.
You can also use Charset=CP1252, which gives the same result and it could be other encodings as well.