
Benefits of Ingest Instances


Christian Hauggaard
Community Manager

This article describes the key benefits of Ingest Instances:

  • Data Sources 
    • A growing list of providers for various data sources that are regularly updated.
    • These data sources can be updated to new versions without having to upgrade TimeXtender to a new version
  • Reaping the performance benefits of Azure Data Factory (ADF) 
    • Access to ADF data sources
    • When using an Ingest Instance server, ADF can be leveraged for transferring source data to storage, and on to a Prepare Instance. The data pipeline architecture behind ADF offers improved performance and is highly scalable, meaning that transfer times can be rapidly reduced
    • The TimeXtender Ingest Service (TIS) can also be installed on-premises and use on-premises data sources. This is possible because of the Self-hosted Integration Runtime, which allows the ADF engine to access and transfer on-premises data, and allows TX to leverage the ADF pipeline architecture while the firewall remains in place as-is
  • Promoting the Data Lake Concept and Infrastructure 
    • An Ingest Instance Server does not support transforming data or overriding original field names during ingestion. This promotes the data lake concept by ensuring that the data is initially ingested and stored in a raw format. Data scientists often request raw data, and initially storing data in its raw form also allows for greater insight into data lineage (i.e. what the original source of the data is and how it is later transformed). It also makes the transfer of data from source to storage faster, due to fewer transformations. However, it is worth noting that it is possible to define "query tables" using SQL code when setting up data sources, and thereby apply transformations and rename fields, although in general this is not considered best practice
    • An Ingest Instance Server allows for data storage within the Azure Data Lake infrastructure, which cannot be achieved when using a Business Unit.
    • The Ingest Instance Server creates multiple versions of source data, which are stored after each execution of a transfer task (e.g. one version of a file per transfer). TX projects pointing to the Ingest Instance Server automatically use the latest version of the data. It is possible to configure and schedule a storage management task to manage and delete old versions of data to free up storage (see the first sketch after this list). This archival process is made viable by inexpensive storage and lends itself to the data lake concept. It also allows for the creation of backup files which may be used for recovery
  • Select data quickly and dynamically
    • Selection of data is quick, as it is simple to dynamically choose which tables and columns from a data source are brought in (e.g. all tables from a particular schema, or tables with names containing a particular term)
    • Alternatively, you can use the simple selection option to individually choose which tables and fields your data source should connect to.
  • TIS server improves team development and supports the server-client experience 
    • TimeXtender Data Integration (TDI) does not have to be installed on the Ingest Instance server. Developers can install TDI on their own machines and load source data into storage through the Ingest Instance server's transfer tasks, without ingesting the data onto the developer machine. The data is transferred directly to the Ingest Instance destination server
    • This provides a much more server-client-like experience, as opposed to a desktop-client experience. Using Ingest Instances in combination with ADF transfers to Prepare Instances means that the TimeXtender Data Integration (TDI) application becomes purely an orchestration tool, transferring data without involving the application server
  • You can use incremental loading of tables into your Ingest Instance storage
    • Doing this will automatically set up the corresponding tables in your Prepare Instance to incrementally load data from the storage, with no additional setup required.
    • Instead of generating a new folder containing all of the data, a batch file containing only the new or updated rows is added to the existing folder (see the second sketch after this list).
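
For illustration, here is a minimal sketch of the version clean-up idea described above: keep only the newest versions of a table's data in ADLS Gen2 and delete the rest. This is not TimeXtender's actual storage management task or folder layout; the container name, the "one timestamped folder per transfer" structure, and the credentials are assumptions made for the example.

```python
# Hypothetical clean-up of old data versions in ADLS Gen2.
# Assumes each transfer task execution wrote a new timestamped folder under the table path.
from azure.storage.filedatalake import DataLakeServiceClient

KEEP_VERSIONS = 3  # number of most recent versions to retain

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key-or-sas-token>",
)
fs = service.get_file_system_client("ingest")  # assumed container name

# List the version folders for one table, e.g. ingest/Sales/Orders/<timestamp>/
versions = sorted(
    p.name
    for p in fs.get_paths(path="Sales/Orders", recursive=False)
    if p.is_directory
)

# Timestamped folder names sort chronologically, so delete all but the newest ones
for old_version in versions[:-KEEP_VERSIONS]:
    fs.delete_directory(old_version)
    print(f"Deleted old version: {old_version}")
```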
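
And a minimal sketch of the incremental batch-file pattern from the last bullet: find the highest value already stored, pull only newer rows from the source, and add them as an extra parquet file in the existing folder rather than rewriting the full dataset. The table, column, and connection names are assumptions, and in TimeXtender the equivalent behaviour is driven by the incremental selection rule on the data source rather than hand-written code.

```python
# Hypothetical incremental batch load: append only new/updated rows as a new parquet file.
from datetime import datetime

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyodbc

DATA_DIR = "ingest/Sales/Orders/current"  # assumed folder holding the existing full load

# 1. Find the high-water mark already present in storage
stored = ds.dataset(DATA_DIR, format="parquet")
modified = stored.to_table(columns=["ModifiedDate"])["ModifiedDate"]
max_modified = max(modified.to_pylist()) if len(modified) else datetime.min

# 2. Pull only new or updated rows from the source
conn = pyodbc.connect("DSN=SourceDb")  # assumed ODBC DSN for the source database
cursor = conn.cursor()
cursor.execute("SELECT * FROM Sales.Orders WHERE ModifiedDate > ?", max_modified)
rows = cursor.fetchall()
columns = [c[0] for c in cursor.description]

# 3. Append the delta as a new batch file next to the existing data
if rows:
    batch = pa.table({col: [row[i] for row in rows] for i, col in enumerate(columns)})
    pq.write_table(batch, f"{DATA_DIR}/batch_{datetime.now():%Y%m%d%H%M%S}.parquet")
```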

6 replies

daniel
TimeXtender Xpert
November 9, 2023

Nice article @Christian Hauggaard !

Do you happen to know what the road map is for working with Delta Lake?

Is the issue fixed with Microsoft where ADF could not go beyond 4 DIUs? ADF is very scalable, but if it can't go above 4 DIUs that is a pretty big bottleneck. A while back we tested this, and the outcome of the tests showed us that getting and moving the data from and to the ODX was fastest (and cheapest) with ADO.NET.

Thanks again!


Christian Hauggaard
Community Manager

Thanks @daniel!

I cannot provide an update for Delta Lake at this time, although I know it is being investigated. Regarding the ADF limit, this still seems to be the case; please see the article below for more info:

https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features

 


daniel
TimeXtender Xpert
November 10, 2023

Dear @Christian Hauggaard ,

That is too bad, as this would take the ODX server to the next level.
Thanks for the information on ADF. Hopefully this will get changed in the near future :)


rory.smith
TimeXtender Xpert
November 10, 2023

Hi @Christian Hauggaard ,

 

Individual source table → parquet pipelines and parquet → SQL table pipelines cannot be scaled above 4 DIUs if you are using Azure runtimes. This will likely never change; the pattern Microsoft wants you to use is to split the source table into chunks and handle each chunk through a separate pipeline. Likewise, large tables in parquet can be problematic to pull into SQL in one go. There are settings that can be used to partition the streams, as described here: https://learn.microsoft.com/en-us/azure/data-factory/connector-sql-server?tabs=data-factory#parallel-copy-from-sql-database

"- Copy from partition-option-enabled data stores (including Azure Database for PostgreSQL, Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note per source data partition can use up to 4 DIUs." In other words, if the ODX Server is loading from one of these DB sources and large source tables are partitioned, you can reach more DIUs for that table transfer. Individual chunks will never go over 4 DIUs. With the current folder structure in the ODX ADLS, I think only large tables could be partitioned to reach a higher speed. Delta-parquet would already mitigate day-to-day performance issues, but the initial load would still benefit.

If you want to speed up ADF, you should run self-hosted integration runtimes (you can scale up the VM as far as your budget will allow). These can be clustered to allow for more performance. As you usually load data from other networks, I typically use a self-hosted integration runtime for source → ODX transfers and Azure runtimes for ODX → DWH. If the ODX → DWH transfer becomes a bottleneck, that can be migrated to a separate self-hosted integration runtime.
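
For illustration, a partitioned copy along the lines described above would look roughly like the following in the copy activity's JSON (shown here as a Python dict). The activity, dataset, and column names are placeholders; the property names (parallelCopies, dataIntegrationUnits, partitionOption, partitionSettings) come from the Microsoft documentation linked above.

```python
# Sketch of an ADF copy activity that reads a large SQL table in parallel partitions
# and writes to a folder of parquet files (placeholder names throughout).
copy_activity = {
    "name": "CopyOrdersToParquet",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            # Split the read into parallel streams based on a numeric or date column
            "partitionOption": "DynamicRange",
            "partitionSettings": {
                "partitionColumnName": "OrderID",
                "partitionLowerBound": "1",
                "partitionUpperBound": "10000000",
            },
        },
        "sink": {"type": "ParquetSink"},
        # Writing to a folder allows 2-256 parallel copies; a single file caps at 2-4
        "parallelCopies": 16,
        "dataIntegrationUnits": 16,
    },
}
```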


daniel
TimeXtender Xpert
November 10, 2023

Thanks @rory.smith, that's very insightful. Perhaps this makes me like the ADO.NET option even more.
Sure, you can self-host the runtime and scale up the VM. This will make it faster than ADO.NET, but also way more expensive.

For really big tables I'm hoping that TX will implement an archiving strategy option, so you can incrementally load the table against a 'this year' dataset, or load batches incrementally so that one 'block' of data is overwritten instead of the whole dataset. In my opinion this would make the loading of data much faster (especially for big transactional tables) and add the possibility to only get data from 'this year' without overwriting the 'other years' of data.


rory.smith
TimeXtender Xpert
November 10, 2023

Hi @daniel,

 

You can also run the self-hosted integration runtime on your ODX Server / TX machine if you want, as long as you have enough RAM free. Note that for ADF the packing/unpacking of parquet is done in the memory space of your runtime, be it Azure or self-hosted. Using ADO.NET it is done on your ODX Server VM; this is the reason for the 'limit memory use' setting.

 

The parquet files aren't really optimized for IOPS speed (column cardinality and sorting would improve this), so the parquet packing/unpacking probably adds a relatively high overhead due to file bloat.

ADF also generally seems slow because an Azure runtime can take 60 seconds to spin up (you can tweak settings to keep them alive).



