Ingest Instances offer the flexibility to store their data not only in SQL Server databases but also as Parquet files within an Azure Data Lake Gen2 storage container. This guide explains how to query these Parquet files using SQL Server Management Studio, much as you would query data in a SQL Server database.
Once your Azure Synapse Workspace is connected to the Azure Data Lake Gen2 storage container, you can run ad hoc queries on the Parquet files using the SQL on-demand tool, enabling you to examine and assess the data stored within the Ingest Instance.
In this article you will learn about:
- Ingest Instance Parquet file structure and incremental loads
- Querying the data lake with a Synapse Workspace and SSMS
- Using the filepath function to determine the Ingest Instance version and batch
- Filtering query results by version
- Troubleshooting
Ingest Instance Parquet File Structure and Incremental Loads
Ingest Instances changed TimeXtender's approach to data storage and management. A pivotal shift is the principle that "an Ingest Instance never forgets": by default, data is no longer truncated from Ingest Instance storage. Instead, new full loads are added alongside existing data. The Ingest Instance manages data extraction so that only the most recent version is transferred into the Prepare Instance. However, this approach makes it harder to query data directly in Ingest Instance storage.
To help with this, an Ingest Instance now identifies data in a table by batch and version.
- A batch contains only the rows for a table processed during a single execution.
- A version consists of multiple batches: the initial full load (Batch 0000) and all subsequent incremental loads (Batches 0001-9999). For example, a full load on Monday (Batch 0000) followed by incremental loads on Tuesday and Wednesday (Batches 0001 and 0002) all belong to the same version until a new full load or schema change starts a new one.
When using Azure Data Lake storage, the Ingest Instance creates a unique file structure that can be used to determine the most recent version of the data. Starting at the Container level, the file structure for a single table may appear as follows:
- <DataSource>
  - <Schema_Table>
    - DATA_<2020_10_25…> (VERSION create date)
      - DATA
        - DATA_0000.parquet (initial full load BATCH)
        - DATA_0001.parquet (subsequent incremental batch)
        - DATA_0004.parquet (subsequent incremental batch; batch numbers do not always follow an exact increment)
    - DATA_<2020_10_27…> (new VERSION created due to schema drift or a manually executed full load)
      - DATA
        - DATA_0000.parquet (the same pattern repeats as above…)
  - <Schema_Table> (additional tables repeat the same VERSION/DATA/batch structure)
The file/folder structure shown above makes it possible to visually assess each version folder: each one contains ALL of the rows that the Ingest Instance has extracted up to the point where the next version is created. In short, reviewing the data in the latest VERSION folder is a quick way to query the most recent data extracted by the Ingest Instance and stored in the data lake. The following steps outline how to use the Azure Synapse Workspace SQL on-demand tool to perform ad hoc queries on the Ingest Instance Parquet storage files.
Prerequisites
- Create a Synapse Workspace (MSFT: Create a Synapse Workspace)
- Note: When creating your workspace, make sure to create your own security credential rather than using the default, sqladminuser.
- Connect SQL Server Management Studio (SSMS) to your workspace (MSFT: Connect Synapse to SSMS)
- Note: When connecting SSMS to your Synapse workspace, make sure to use the "SQL on-demand endpoint" rather than the "SQL endpoint".
- In SQL Server Management Studio (SSMS), run a query similar to the following to create a credential using an identity that has access to the Azure Data Lake Gen2 storage location.
--Create a credential for the storage account using a Shared Access Signature
CREATE CREDENTIAL [https://<StorageAccountName>.dfs.core.windows.net]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<SharedAccessKeyValue>'
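To confirm the credential was created, you can optionally list the server's credentials. This is a quick check and assumes your login has permission to query the system views.
--Verify the credential exists; the storage account URL should appear in the results
SELECT name, credential_identity
FROM sys.credentials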
Query the Data Lake with Synapse Workspace and SSMS
Using a Synapse Workspace to query Parquet files is straightforward. (MSFT: Query Parquet files)
For illustration, we'll use the AdventureWorks2014 database for our query result examples, specifically the Production.Product table.
- In TimeXtender, execute a data source in the Ingest Instance (connected to your ADLS Gen2 account).
- In Azure Storage Explorer, verify that the data source files have been created.
- In SSMS, execute the following query to read the Parquet files.
--Query Ingest Instance Parquet files (returns all records)
SELECT *
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/*/*/*.parquet'
, FORMAT = 'PARQUET') AS a
- Note: An alias must be used with the OPENROWSET statement. If an alias is not specified, the following error may be displayed, "A correlation name must be specified for the bulk rowset in the from clause."
- Wildcards, using the * symbol, can be used in the file path.
- StorageAccountName is the name of the ADLS Gen2 account
- ContainerName is the target storage container
- DataSourceShortName is the short name of the data source, as defined in the Ingest Instance
- Schema_TableName is the data source schema and table name
- Generic SQL queries can be run directly on the data stored in the data lake, as shown in the example below.
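As an example, the following aggregate query runs directly against the Parquet files. This is a sketch: the account, container, and data source names are hypothetical, and it assumes the Production.Product table from AdventureWorks2014 (which includes a Color column).
--Example: count rows per product color directly in the data lake
SELECT [Color], COUNT(*) AS [RowCount]
FROM OPENROWSET(
BULK 'https://mystorageacct.dfs.core.windows.net/mycontainer/AW2014/Production_Product/*/*/*.parquet'
, FORMAT = 'PARQUET') AS a
GROUP BY [Color]
Note that, because the wildcards match every version and batch, these counts include rows from all loads, not just the latest.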

Using the FILEPATH function to determine Ingest Instance version and batch
- As of TimeXtender 20.10.1, Ingest Instance data sources do not include an Ingest_Timestamp column within the table; however, the file path itself contains the timestamp information. Review the file path to identify the data timestamps.
- The filepath function can be used to extract either the entire file path string or specific parts of it. To get the entire file path, use the filepath function with no arguments.
--Get the complete filepath
a.filepath() AS [FilePath]
- Next, provide an argument to extract the first wildcard from the full filepath, which is the Ingest_Version. For example, filepath(1) will return the string represented by the first instance of a wildcard.
--Extract the date/time part of the file path
a.filepath(1) AS [Ingest_Version]
- Lastly, provide an argument to extract the third wildcard from the full file path, which is the Ingest_Batch. For example, filepath(3) will return the string represented by the third instance of a wildcard.
--Extract the batch part of the file path
a.filepath(3) AS [Ingest_Batch]
- Further simplify the wildcard string results by adding "DATA_" before the wildcard symbol in the OPENROWSET BULK path. For example, "/DATA_*/" excludes "DATA_" from the results and returns only the characters matched by the wildcard. An illustration of the returned values follows the snippet below.
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET') AS a
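For illustration, assuming a version folder named DATA_2020_10_25_09_15_00 and a batch file named DATA_0001.parquet (hypothetical names), the pattern above would yield:
--Hypothetical results for the path pattern above
--a.filepath(1) returns '2020_10_25_09_15_00' (the version timestamp, without the DATA_ prefix)
--a.filepath(3) returns '0001' (the batch number, without the DATA_ prefix)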
- Putting it all together looks like the following. We'll add three columns to our query result to help clarify the timestamps: FilePath, Ingest_Version, and Ingest_Batch.
--Query Ingest Instance parquet files with TimeStamps on rows. (Returns all records)
SELECT
--Get the Filepath (you need to add the table reference or this function will not work)
a.filepath() AS [FilePath]
--Extract the date/time part of the file path
,a.filepath(1) AS [Ingest_Version]
--Extract the batch part of the file path
, a.filepath(3) AS [Ingest_Batch]
,*
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET') AS a
- NOTE: An alias must be used with the OPENROWSET statement and needs to be the same alias that was used in the items above.
- The query result is much easier to understand, but we still have duplicate records.
Adding a WHERE clause to filter query results by version
- Add a WHERE clause with a subquery to select only the most recent data load.
WHERE a.filepath(1) = (
SELECT MAX(b.filepath(1))
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET'
) AS [b] )
- The following comprehensive query incorporates all of the above.
--Query and filter Ingest Instance Parquet files with TimeStamps on rows. (Returns only latest records)
SELECT
--Get the Filepath (you need to add the table reference or this function will not work)
a.filepath() AS [FilePath]
--Extract the date/time part of the file path
,a.filepath(1) AS [Ingest_Version]
--Extract the batch part of the file path
, a.filepath(3) AS [Ingest_Batch]
,*
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET') AS [a]
WHERE a.filepath(1) = (
SELECT MAX(b.filepath(1))
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET'
) AS [b] )
- NOTE: An alias must be used with the OPENROWSET statement and needs to be the same alias that was used in the items above.
- Voila! With this latest query, the results now include only the latest data loaded into the Ingest Instance.
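If you plan to run the latest-version query regularly, one option is to wrap it in a view. The following is a sketch, assuming you have created a user database in the SQL on-demand pool (views cannot be created in the master database); the view name is hypothetical.
--Optional: wrap the latest-version query in a view for reuse (hypothetical names)
CREATE VIEW dbo.Production_Product_Latest AS
SELECT
a.filepath(1) AS [Ingest_Version]
, a.filepath(3) AS [Ingest_Batch]
,*
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET') AS [a]
WHERE a.filepath(1) = (
SELECT MAX(b.filepath(1))
FROM OPENROWSET(
BULK 'https://<StorageAccountName>.dfs.core.windows.net/<ContainerName>/<DataSourceShortName>/<Schema_TableName>/DATA_*/*/DATA_*.parquet'
, FORMAT = 'PARQUET'
) AS [b] )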
Troubleshooting
- Error message in SSMS, "Incorrect Syntax near Format"
- Ensure you are using the SQL on-demand endpoint to connect SSMS to your Synapse Workspace. If you re-connect to your Synapse workspace, you may need to create a new query to point to the SQL on-demand endpoint.
- If your login to the Synapse Workspace (from SSMS) is sqladminuser, try either changing the login credentials (i.e. username and password) or creating a new workspace.
- Error message in SSMS, "A correlation name must be specified for the bulk rowset in the from clause.
- Ensure that you have aliased the OPENROWSET clause, and the alias matches throughout the query.
- Error message in TimeXtender, "errorCode": "2200", "message": "Failure happened on 'Sink' side."
- The issue may be resolved by installing Java Runtime from https://www.java.com/en/download. (Please see KB article JreNotFound error when using Azure Data Factory Data Source with ODX)
- Error message in SSMS, "Cannot find the CREDENTIAL '<filepath>', because it does not exist or you do not have permission."
- Check that you have run the query to create the CREDENTIAL using an identity that has access to the storage location, as outlined in the Prerequisites section above.