Incremental loading facilitates faster load times by enabling you to load only new transactional data into the data warehouse and staging databases. This is useful when the volume of transactional data in the data sources causes scheduled execution to take too long.
You should consider enabling incremental loading if your scheduled executions run longer than your execution strategy allows. For instance, if you load data during the night when business is closed to have fresh data in the morning, the load obviously needs to be completed before the users begin their day. If you want to transfer data from sources that are in use, you might also like to use incremental loading to minimize the impact on performance data transfer can have on users.
The default load plan in Discovery Hub is a full load. During a full load, the existing tables in the staging database and data warehouse are truncated, which removes all of the existing data, and new data is subsequently loaded from the data sources.
While it can be slow, the advantage of full loading is that you can be sure that any changes made in the source systems are carried over to the data warehouse. Therefore, it is common strategy to schedule periodical full loads while otherwise utilizing incremental loading. For instance, you could set up a full load during the weekend to make absolutely sure that your data warehouse is up to date, and schedule incremental loads during the workweek.
If your data contains one ore more fields that can define what records are new, you can use incremental loading. During the first deployment after incremental loading has been enabled, Discovery Hub will create additional tables in the staging database and data warehouse that have an _INCR or _I suffix. Discovery Hub will then do a full load. During subsequent executions of the project, truncation is disabled so that the data from the full load is not removed. Using the field, or number of fields, that you have chosen, Discovery Hub determines which records have been added to the data source since the last load and only transfers new records to the staging database and data warehouse.
Incremental load is enabled on the table level. Naturally, you will get the greatest increase in performance by enabling it on tables with a large amount of transactional data, such as those that contain large volumes of general ledger and inventory transactions. On the other hand, smaller tables with relatively few records, such as a Customer or Item table, usually do not take long to execute in the first place and enabling incremental load on them will only give you marginally better overall performance.
An issue with incremental load is when records are deleted in the data source and cause the data in the data warehouse to come out of sync with the data source. You can handle this by scheduling a periodical full load, which is also useful if your incremental selection rule do not catch records that have been changed.
For larger data sources, a full load might not be feasible or you might want a better sync between source and data warehouse between full loads. For these situations, you can enable hard deletes, where the records deleted in the data source are also deleted in the data warehouse, or soft deletes, where the records are just marked as deleted in the data warehouse, on the incrementally loaded table.
The feature is implemented as shown in the figure below. On each incremental load, data is loaded from the source and into two instances of the table. The raw instance contains all the new records, while the primary key (“PK”) instance contains the content of the primary key columns for all records. The “magic” happens once data has been transferred from raw to valid through the transformation view. Discovery Hub will then delete any record in the valid instance that has a primary key that cannot be found in either the raw or the PK instance. If soft delete is enabled, Discovery Hub will add a column, “Is tombstone” to the valid instance and updated deleted records with the value “true”.
The implementation is designed to be fast and simple. To handle more advanced use cases, the feature can be used in conjunction with, or replaced by, scripts. Among the cases you should be aware of are the following:
- If a record is undeleted/restored in the source, Discovery Hub will not undelete it in the valid in-stance unless it is loaded into the raw table again. Discovery Hub will not do anything with primary keys that are in the PK instance, but not in the valid instance. However, with a good incremental selection rule that loads both new and modified records or loads the new records and e.g. 100 more, this should not be an issue.
- If the primary key is transformed from raw to valid, Discovery Hub will not be able to match them to the primary keys stored in the raw and PK instance. All records in valid will therefore be deleted.
Enabling Incremental Loading
To use incremental loading on a table, the table must contain a field that represents new data. This could be an identifier field, an entry number or a date. In addition to that, the table must have a primary key defined. To define a field or fields to be used as the primary key for the table, right-click a field and click Include in Primary Key.
To enable source based incremental loading, follow the steps below.
- Right click the table for which you want to enable incremental loading and click Table Settings.
- Click the Data Extraction tab and select Enable source-based incremental Load.
- Click on the option that represents how you would like to handle records deleted in the source. For more information, see Handling Records Deleted in the Data Source. You have the following options:
- Don't handle deletes
- Use hard deletes
- Use soft deletes
- Click Keep field values up-to-date to recalculate conditional lookup field and supernatural key field values on tables that are already on the data warehouse when the values being looked up have changed.
Note: This option cannot be used when the table is on a Azure Synapse Analytics data warehouse.
- If an error icon appears next to the setting, it means that another setting needs to be changed to enable source based incremental loading. Move you mouse over the error icon to see the error message. Once any settings have been changed as required, click OK. The table icon will now be overlaid with an "I" to make it easy for you to identify it as an incrementally loaded table.
- If the table belongs to the staging database, right click the corresponding source table under Data Sources and click Add Incremental Selection Rule.
- OR -
If the table belongs to the data warehouse, right click the table and click Add Incremental Selection Rule.
- The Incremental Selection Rules pane appears. Select the fields identifying which records have been added or changed since the last incremental load.
In the Subtract from value, you can enter an integer (from numeric fields) or an amount of time (for date fields) to subtract from the value of the selected field before determining what records are new. This allows you to create an overlap in the records loaded to make sure any recent changes in existing records are reflected in the database.
The fields you choose should ideally be fields that are generated by the system and incremented sequentially when new records are added. Here are some recommended fields for the Microsoft Dynamics ERP systems:
- Dynamics NAV: timestamp
- Dynamics AX: Modified_Datetime
- Dynamics GP: DEX_ROW_ID
- Repeat steps 1-7 for all tables you want to enable incremental loading on.
- Deploy and execute the table(s). Discovery Hub will begin the first full load of the tables with source based incremental loading now enabled. The necessary incremental tables will be automatically added to the staging database and populated with the latest incremental values. The next time the table is executed, Discovery Hub will query these tables to determine the last record that was previously loaded and will only extract the data from the data source that was added after the last execution.
Enabling Incremental Loading for Multiple Tables
With the Set Up Incremental Load wizard, you can automate the set-up of incremental load to enable it for a number of tables in one go. You still have to choose primary keys and what fields to use for the incremental load rule, but you selections are applied automatically. In addition to enabling incremental load on the tables you select, any settings that are incompatible with incremental load will also be changed.
Note: When you run the wizard on a data warehouse, additional help will be available in the form of button, Auto Suggest. It will suggest primary keys based on the primary keys set on the staging database. For selection rules, the suggestion depends is a rule based on a custom field created on the table and mapped to the 'DW_TimeStamp' system field of the source table. This rule is displayed in the wizard as 'IncrementalTimeStamp' and has a robot icon. If the table is source from an ODX data storage, the selection rule suggestion will be to use the '__ODX_TIMESTAMP' instead.
To use the Set Up Incremental Load wizard, follow the steps below.
- Right click a data source, staging database or data warehouse, click Automate and click Set Up Incremental Load.
- The wizard appears.
In the Available tables list, double-click the tables you want to enable incremental load for. You can also click a table and then click Add to add an individual table. To add all visible tables, click Add all. Use the filter below the list to filter the list on table name. The following wildcards are supported:
- %: Any string of zero or more characters.
- _: Any single character.
- [ ]: Any single character within the specified range ([a-f]) or set ([abcdef]).
- [^]: Any single character not within the specified range ([^a-f]) or set ([^abcdef]).
- Click Next. All incrementally loaded tables need to have a primary key, which you can add on this page. Any tables that do not have a primary key when you proceed to the next page, will be removed from the wizard.
In the Available fields list, double-click the fields you want to use as primary keys on the tables. You can also click a field and then click Add to add an individual field.
The tables in the Available fields list that have at least one primary key field will be shown in bold.
In the Included fields list, any fields that are already primary keys on the tables you selected in the previous step are listed in gray. You cannot remove existing primary keys on this page, only add new ones.
- Click Next. All incrementally loaded tables need an incremental selection rule. You select those on this page.
Select the fields you want to use as primary keys on the tables. Under Rule Settings, you can enter a subtraction value for a field if you want the incremental load to overlap the previous load.
The tables in the Available fields list that have a field selected for incremental selection rule will be shown in bold.
- Click Next. Discovery Hub can handle records deleted in the data source for you. On tables with hard delete enabled, records deleted in the data source is also deleted from the data warehouse. For more information about delete handling, see Handling Records Deleted in the Data Source.
Select the tables you want to enable hard deletes for.
- Click Next.
Select tables you want to enable soft deletes for. The tables you do not select for either hard and soft deletes, will not have any delete handling enabled.
- Click Next.
Click Previous to go back to an earlier page and adjust the selections there or click Finish to apply the changes if they are correct.