Go through this guide before you start.
Contents
XML Information
XML files can be structured in different ways
Here below is how a XML file structured in element form looks:
<?xml version="1.0" encoding="utf-8"?>
<top100rocksongs_billboard_2013>
<t>
<ID>1</ID>
<Sequence>1</Sequence>
<Song>STAIRWAY TO HEAVEN</Song>
<Performer>Led Zeppelin</Performer>
</t>
<t>
<ID>2</ID>
<Sequence>2</Sequence>
<Song>BOHEMIAN RHAPSODY</Song>
<Performer>QUEEN</Performer>
</t>
</top100rocksongs_billboard_2013>
Here is how a XML file structured in attribute form looks:
<top100rocksongs_billboard_2013>
<t ID="1" Sequence="1" Song="STAIRWAY TO HEAVEN" Performer="Led Zeppelin" />
<t ID="2" Sequence="2" Song="BOHEMIAN RHAPSODY" Performer="QUEEN" />
<t ID="3" Sequence="3" Song="FREE BIRD" Performer="LYNYRD SKYNYRD" />
<t ID="4" Sequence="4" Song="SMOKE ON THE WATER" Performer="DEEP PURPLE" />
<t ID="5" Sequence="5" Song="COMFORTABLY NUMB" Performer="Pink Floyd" />
<t ID="6" Sequence="6" Song="KASHMIR" Performer="Led Zeppelin" />
<t ID="7" Sequence="7" Song="STARGAZER" Performer="RAINBOW" />
<t ID="8" Sequence="8" Song="ALL RIGHT NOW" Performer="FREE" />
<t ID="9" Sequence="9" Song="WHOLE LOTTA LOVE" Performer="Led Zeppelin" />
</top100rocksongs_billboard_2013>
To make the Kings.XML file Download this link and save it as Kings.XML
This is the URI link to the file I use for the element setup.
This is the URI link to the file that uses attributes.
Setup
Add a data source and choose the XML provider.
The first step in setting up a connector should always be to look at the guide for the provider.
XML guide from CData
With Elements
If the XML source is a file you need to fill out the following fields. Add a link to the file. Be sure that all accounts that could decide to execute this have access to the folder where the file is stored.
- URI: <the location of the kings.xml file>
If your XML is stored in an online location, you need to fill out the following fields. Add a link to the file.
- URI: http://mysafeinfo.com/api/data?list=moviesbyearnings2006&mappingtype=element&rootname=moviesbyear&elementname=Movie
With Attributes
If the XML file is using attributes as its element type, you need to take into account that it cant automatically read the XPath. You need to provide it.
You need to add a link to the location of the data and a XPath:
- URI:Â https://mysafeinfo.com/api/data?list=top100rocksongsbillboard2013&format=xml&select=Sequence,Song,Performer,ID&case=default&token=e3ODIxQFAgT9TD69GSlvZ6rfgMKnZdVt&download=false
- XPath: top100rocksongsbillboard2013\t
XPath is the path to the data. It is the rootname followed by the element name. It is a good idea to locate this beforehand.
Getting the data
Following the setup, you need to map the data source to an Ingest Instance, see that the data source has been added there and execute the synchronize task. This is no different than all other data source types.
Besides this you can also set up Query Tables, Data Type overwrites, Incremental load and data selection rules.
Advanced features
The options for this provider are similar to those you got for REST and to some degree JSON. You can use REST instead of XML, the only limit is that the data must be XML, otherwise they are interchangeable.
Automatically getting the correct data types
By default it will scan the first 100 rows, determine what fields, and data types that are in this. Sometimes this gives an error because the 101 row contained a longer string than was present in the previous rows. The solution is to change the Row Scan Depth field to a higher number. In regards to JSON/XML files, this field doesn't only solve the issue of data type, but also what tables are available. If the file contains more than 100 rows in total these will not be added. So a good idea is to add 0 in the field, which will make sure it reads the whole file before determining the fields and their data types.
Parsing Hierarchical Data
If your file is containing nested objects, you now have the opportunity of seeing this. I copied the Raw data from here and saved it in a file on my computer. Raw nested data. It has this structure.
<?xml version="1.0" encoding="UTF-8" ?>
<root>
<rootAttr1>rootValue1</rootAttr1>
<people>
<personal>
<age>20</age>
<gender>M</gender>
<name>
<first>John</first>
<last>Doe</last>
</name>
</personal>
<jobs>support</jobs>
<jobs>coding</jobs>
<vehicles>
<type>car</type>
<model>Honda Civic</model>
<insurance>
<company>ABC Insurance</company>
<policy_num>12345</policy_num>
</insurance>
<features>sunroof</features>
<features>rims</features>
<maintenance>
<date>07-17-2017</date>
<desc>oil change</desc>
</maintenance>
<maintenance>
<date>01-03-2018</date>
<desc>new tires</desc>
</maintenance>
</vehicles>
<vehicles>
<type>truck</type>
<model>Dodge Ram</model>
<insurance>
<company>ABC Insurance</company>
<policy_num>12345</policy_num>
</insurance>
<features>lift kit</features>
<features>tow package</features>
<maintenance>
<date>08-27-2017</date>
<desc>new tires</desc>
</maintenance>
<maintenance>
<date>01-08-2018</date>
<desc>oil change</desc>
</maintenance>
</vehicles>
<addresses>
<type>work</type>
<zip>12345</zip>
</addresses>
<addresses>
<type>home</type>
<zip>12357</zip>
</addresses>
<source>internet</source>
</people>
<people>
<personal>
<age>24</age>
<gender>F</gender>
<name>
<first>Jane</first>
<last>Roberts</last>
</name>
</personal>
<jobs>sales</jobs>
<jobs>marketing</jobs>
<source>phone</source>
<vehicles>
<type>car</type>
<model>Toyota Camry</model>
<insurance>
<company>Car Insurance</company>
<policy_num>98765</policy_num>
</insurance>
<features>upgraded stereo</features>
<maintenance>
<date>05-11-2017</date>
<desc>tires rotated</desc>
</maintenance>
<maintenance>
<date>11-03-2017</date>
<desc>oil change</desc>
</maintenance>
</vehicles>
<vehicles>
<type>car</type>
<model>Honda Accord</model>
<insurance>
<company>Car Insurance</company>
<policy_num>98765</policy_num>
</insurance>
<features>custom paint</features>
<features>custom wheels</features>
<maintenance>
<date>10-07-2017</date>
<desc>new air filter</desc>
</maintenance>
<maintenance>
<date>01-13-2018</date>
<desc>new brakes</desc>
</maintenance>
</vehicles>
<addresses>
<type>home</type>
<zip>98765</zip>
</addresses>
<addresses>
<type>work</type>
<zip>98753</zip>
</addresses>
</people>
<rootAttr2>rootValue2</rootAttr2>
<rootAttr3>rootValue3</rootAttr3>
<rootAttr3>rootValue4</rootAttr3>
</root>
When I create a file from this, adds it as a data source like this and use no other features. It gives me one table called people and when I preview it, each nested part is put into one field.
What you can do about this is to use the Data Model feature. As standard it is not used and it has two options besides that. Flatten Documents and Relational.
The setup is as follows.
Relational creates a table for each nested part and creates ids to relate them to each other.
Like this.
People now looks like this.
And Vehicles
As you can see the jobs and features fields still have issues, but the flatten arrays feature below can solve this as well.
FlattenDocument option makes all these previous tables one large one.
Like this.
Again it has the same issues with jobs and features, but this is still solvable with the flatten arrays feature.
Flatten Arrays
Is an option to split fields into more fields while keeping the existing one.
If I use the RawData.XML file as from above and set the Flatten Arrays field to 10. As you can see here.
It will look as below when you synchronize.
The original field still exists and new versions of the field contain the data with the name from before and a digit. Like features, features.0 features.1, and so forth.
You do not need to know the amount of iterations it needs to go through, it just has to be high enough to get all values added.