Solved

CData Parquet - Settings for Azure Blob Storage (multiple tables in seperate folders)

2 years ago
July 11, 2023
3 replies
267 views

rogier.helmus
Problem Solver

Hi all,

I have a client that has a datalake with parquet files in an Azure Storage Account Blob container (Datalake Gen2). This datalake has not been created by ODX server, but has been created by Synapse (scripted, delta lake parquet files). Since we want to load multiple tables from multiple folders we use the CData Parquet connector v2023 (TX 20.10.41). The issue we have is that we can’t find the right settings for the connector to recognize that the folder we’re connecting to contains multiple tables and that the parquet files are in a subfolder of the table folder.

For the connection we use a Shared Access Signature and the connection is working:

The name of the container is ‘development’ which we configure in the Base URL. In the URI we point to the folder of the bronze layer, the system and the database:

The database folder contains a folder for each table. In each table folder some tables have a a partition folder followed by the actual parquet file.

Since the connector needs to scan multiple folders we’ve set the setting ‘Include subdirectories’ to True. The ‘Aggregate files’ option is set to False (default).

Example of the full URL to a parquet file for a partitioned table:

<<storage-account>>.blob.core.windows.net / <<container>> / bronze/<<system>>/<<database-name>>/ <<table-name>>/DW_PARTITION=A000/part-00011-59d17d8c-7064-466c-b039-5edf43df4b78.c000.snappy.parquet

Some table are partitioned, some not.

Anyone any ideas? :-)

ps: I’m also trying the CData ADO.NET ADLS connector as mentioned in this TX KB article.

Edit: example screenshots of the structure in a partitioned table ADRMST:

Example of the contents of a partition folder:

The folder contains multiple files since the data is created with an incremental load.

Example of the contents for an unpartitioned table:

Best answer by Christian Hauggaard

@rogier.helmus I believe you are right, CData Parquet connector does not seem to support this use case. Please feel free to submit an idea or upvote this idea for virtual ODX

https://support.timextender.com/ideas/data-directly-from-data-lake-to-mdw-without-odx-1063

Perhaps the following providers might also be worth exploring

View original

Did this topic help you find an answer to your question?

Christian Hauggaard
Community Manager
2 years ago
July 11, 2023

Hi @rogier.helmus

I have been able to extract multiple parquet files using this URI

And by setting include subdirectories to True

This leads to the following tables

Which corresponds to the parquet files in both the parent folder and the subfolders

rogier.helmus
Problem Solver
2 years ago
July 11, 2023

Hi @Christian Hauggaard , thanks for the suggestion and showing a working example!

I used the abfss:// prefix in the URL, since the connection is ADLS Gen2.

Unfortunately I still didn’t see an improvement in the Synch operation.

I think the main difference is in the structure that is used to store the data. In your example structure there is no complexity. The connector can simply loop through the folders, check for parquet files and extract the tablename from the filename. The datalake I’m working with for the client looks different, but it does look a bit like how ODX server stores the data in the data lake. ODX server also:

Doesn’t store the tablename in the parquet filename
Can have multiple parquet files in case of an incremental load (as I do have in the non-ODX server data lake)

An added difference is that the delta lake I’m connecting to partitions the tables as well.

Because of the similarity with ODX server I’ve tested if it’s possible to use the CData Parquet connector to load data from ODX it’s own data lake.

Used URI:

Result after a successful synch:

As you can see, the connector doesn’t recognize that it should extract the table name from the first folder and skip the subfolder part with \DATA_xxxx_xx_xx\ . Looks logical though if the connector indeed just loops through the folders.

In the connector I see the following property which might allow a definition of which folders and files to load and merge, but I have no idea if it can be used and how it should be done:

@Christian Hauggaard It looks like connecting to a more complex structure like a delta requires more flexibility/options which I can’t find in the CData Parquet connector. Are there TX clients that have a delta lake already setup and that are connecting to it via TimeXtender ODX Server? And would a ‘Virtual ODX server” like mentioned in this forum post be a possible solution in the future?

Christian Hauggaard
Community Manager
2 years ago
July 13, 2023

@rogier.helmus I believe you are right, CData Parquet connector does not seem to support this use case. Please feel free to submit an idea or upvote this idea for virtual ODX

https://support.timextender.com/ideas/data-directly-from-data-lake-to-mdw-without-odx-1063

Perhaps the following providers might also be worth exploring

Reply

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Reply

Related topics

Detecting identical aliases in your SQL join- or where clauses

T-SQL features you might have missed - Part 1

JSON examples

Case: A message protocol-independent web service (part 1 of 2)

Release notes Indicium (2021.3.11)

Most helpful members this week

Sign up

Login with SSO

Login to the community

Login with SSO

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings