Solved

CData Parquet - Settings for Azure Blob Storage (multiple tables in seperate folders)

  • 11 July 2023
  • 3 replies
  • 170 views

Badge +1

Hi all,

I have a client that has a datalake with parquet files in an Azure Storage Account Blob container (Datalake Gen2). This datalake has not been created by ODX server, but has been created by Synapse (scripted, delta lake parquet files). Since we want to load multiple tables from multiple folders we use the CData Parquet connector v2023 (TX 20.10.41). The issue we have is that we can’t find the right settings for the connector to recognize that the folder we’re connecting to contains multiple tables and that the parquet files are in a subfolder of the table folder. 

For the connection we use a Shared Access Signature and the connection is working:

The name of the container is ‘development’ which we configure in the Base URL. In the URI we point to the folder of the bronze layer, the system and the database:

 

The database folder contains a folder for each table. In each table folder some tables have a a partition folder followed by the actual parquet file. 

Since the connector needs to scan multiple folders we’ve set the setting ‘Include subdirectories’ to True. The ‘Aggregate files’ option is set to False (default).

Example of the full URL to a parquet file for a partitioned table:

<<storage-account>>.blob.core.windows.net / <<container>> / bronze/<<system>>/<<database-name>>/ <<table-name>>/DW_PARTITION=A000/part-00011-59d17d8c-7064-466c-b039-5edf43df4b78.c000.snappy.parquet

Some table are partitioned, some not. 

Anyone any ideas? :-) 

ps: I’m also trying the CData ADO.NET ADLS connector as mentioned in this TX KB article

 

Edit: example screenshots of the structure in a partitioned table ADRMST:

 

Example of the contents of a partition folder:

 

The folder contains multiple files since the data is created with an incremental load.

 

Example of the contents for an unpartitioned table:

 

icon

Best answer by Christian Hauggaard 13 July 2023, 15:22

View original

3 replies

Userlevel 6
Badge +5

Hi @rogier.helmus 

I have been able to extract multiple parquet files using this URI

And by setting include subdirectories to True

This leads to the following tables

Which corresponds to the parquet files in both the parent folder and the subfolders

 

Badge +1

Hi @Christian Hauggaard , thanks for the suggestion and showing a working example! 

I used the abfss:// prefix in the URL, since the connection is ADLS Gen2. 

Unfortunately I still didn’t see an improvement in the Synch operation.

I think the main difference is in the structure that is used to store the data. In your example structure there is no complexity. The connector can simply loop through the folders, check for parquet files and extract the tablename from the filename. The datalake I’m working with for the client looks different, but it does look a bit like how ODX server stores the data in the data lake. ODX server also:

  • Doesn’t store the tablename in the parquet filename
  • Can have multiple parquet files in case of an incremental load (as I do have in the non-ODX server data lake)

An added difference is that the delta lake I’m connecting to partitions the tables as well.

Because of the similarity with ODX server I’ve tested if it’s possible to use the CData Parquet connector to load data from ODX it’s own data lake.

Used URI: 

 

Result after a successful synch:

As you can see, the connector doesn’t recognize that it should extract the table name from the first folder and skip the subfolder part with \DATA_xxxx_xx_xx\ . Looks logical though if the connector indeed just loops through the folders.

In the connector I see the following property which might allow a definition of which folders and files to load and merge, but I have no idea if it can be used and how it should be done:

 

@Christian Hauggaard It looks like connecting to a more complex structure like a delta requires more flexibility/options which I can’t find in the CData Parquet connector. Are there TX clients that have a delta lake already setup and that are connecting to it via TimeXtender ODX Server? And would a ‘Virtual ODX server” like mentioned in this forum post be a possible solution in the future? 

Userlevel 6
Badge +5

@rogier.helmus I believe you are right, CData Parquet connector does not seem to support this use case. Please feel free to submit an idea or upvote this idea for virtual ODX

Perhaps the following providers might also be worth exploring

Reply