
Hello,

 

I want to get metadata from my Azure Data Lake using the Blob API. 

I wasn't seeing any data in the Ingest storage, so I turned on caching to file to try to see what's happening. 

 

There are three files in my caching folder: 

  • Data_.raw: The return of the call, i.e. my actual data. This looks excellent, except that it's a .raw file. Contents:
    <?xml version="1.0" encoding="utf-8"?>
    <EnumerationResults ServiceEndpoint="https://xxxx.blob.core.windows.net/" ContainerName="datalake">
    <Prefix>my_prefix</Prefix>
    <Blobs>
    <Blob>
    ....
    </Blob>
    </Blobs>
    <NextMarker/>
    </EnumerationResults>

     

  • Data_.xml: Basically the same as Data_.raw, but with the content of Data_.raw as the text of a value element. The data also contains the XML header (so now the document has two headers) and the brackets have been encoded (i.e. all the `<` are now `&lt;`). 
    <?xml version="1.0" encoding="utf-8"?>
    <Table_flattening_name>
    <value>
    &lt;?xml version="1.0" encoding="utf-8"?&gt;
    &lt;EnumerationResults
    ServiceEndpoint="https://xxxx.blob.core.windows.net/"
    ContainerName="datalake"&gt;
    &lt;Prefix&gt;my_prefix&lt;/Prefix&gt;
    &lt;Blobs&gt;
    &lt;Blob&gt;
    ...
    &lt;/Blob&gt;
    &lt;/Blobs&gt;
    &lt;NextMarker /&gt;
    &lt;/EnumerationResults&gt;
    </value>
    </Table_flattening_name>

     

  • Data_transformed_1.xml: The result of my XSLT on Data_.xml

Data_transformed_1.xml contains one empty element, which is caused by Data_.xml being malformed. 

I can't really figure out what's going on. With other APIs I only had two files. I'm not sure what the Data_.raw file is doing, but everything would work if that file were Data_.xml. 

 

What could be causing this? Why is there a Data_.raw file? How can I fix this?

 

Hi @Benny 

It would seem like there is no data in either of the examples, unless the …. is meant to indicate where the data is.

The two files are the raw output from the source, and the XML is that data after it has been converted.

What are you connecting to? It seems like some sort of Microsoft service.


The ellipsis is indeed meant to replace actual data. That all looks fine. 

I figured that the .raw represents raw data. But in the caching of REST endpoints that do work, I don't see this file. There's just an XML file with the raw data and a transformed .xml containing the transformed data. 

So that leaves me puzzled as to what's happening, especially since the .xml is malformed. 

 

I’m trying to get data from an Azure Storage account, specifically to list the blobs in a certain container: https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id.
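
For context, this is roughly what that List Blobs call looks like. A minimal sketch in Python (the account name, container, prefix and SAS token below are placeholders, and a SAS token is only one of several ways to authenticate):

    import requests
    import xml.etree.ElementTree as ET

    # Placeholder values - substitute a real account, container, prefix and credential.
    account = "xxxx"
    container = "datalake"
    prefix = "my_prefix"
    sas_token = "sv=...&sig=..."  # not a real token

    # List Blobs: GET <container URL>?restype=container&comp=list&prefix=<prefix>
    url = (f"https://{account}.blob.core.windows.net/{container}"
           f"?restype=container&comp=list&prefix={prefix}&{sas_token}")
    resp = requests.get(url)
    resp.raise_for_status()

    # The response body is the EnumerationResults document shown above.
    root = ET.fromstring(resp.content)
    for blob in root.iter("Blob"):
        print(blob.findtext("Name"))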

 

From other REST endpoints I have connected to, I expected to see something like:

Data_.xml --> Data_transformed.xml

Instead, the caching folder implies

Data_.raw --> Data_.xml --> Data_transformed.xml

Data_.raw contains exactly the XML I need. Some weird transformation is applied that turns it into the malformed Data_.xml. Then my table-flattening XSLT is applied, which can't make sense of the malformed XML. 
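
For what it's worth, the wrapping in Data_.xml amounts to the whole List Blobs document being escaped into the text of the value element. If the wrapper were well-formed, the inner document could be pulled back out along these lines (a sketch; it assumes the elided Blob elements contain Name children, as in the List Blobs docs):

    import xml.etree.ElementTree as ET

    # Read the wrapper produced by the caching step.
    wrapper = ET.parse("Data_.xml").getroot()    # <Table_flattening_name>
    inner_text = wrapper.findtext("value")       # &lt; entities are decoded on read

    # Drop leading whitespace and any stray BOM character so the inner
    # XML declaration sits at the very start, then parse the inner document.
    inner = ET.fromstring(inner_text.lstrip("\ufeff \r\n\t"))
    for blob in inner.iter("Blob"):
        print(blob.findtext("Name"))             # assumes <Name> children per the docs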


Hi @Benny 

I have been trying to replicate this behavior, but it just creates one Data_.xml file.

Are you on version 6814.1 or newer, and working with version 7.1.0.0 of the REST data source?


Hello Thomas,

 

Ingest and DI were updated this morning to 6848.1. REST data source is on version 7.1.0.0.


Hi @Benny 

Because we had this conversation in Zendesk, I wanted to add what was done to resolve this, for others to see.

The raw file has a BOM. This means that the XML parser does not see the `<` sign at the beginning of the file and instead reads the BOM bytes first.
 
The developers believe that this behavior will be fixed in the next release, in which you can force the code to treat the data as XML.
 
In the meantime, you may get it to work by simply turning off caching to a file and keeping it set to in memory.

 

The Byte Order Mark (or BOM) is a special marker added at the very beginning of a Unicode file encoded in UTF-8, UTF-16 or UTF-32. For UTF-16 and UTF-32 it indicates whether the file uses big-endian or little-endian byte order; for UTF-8 it only marks the encoding and is optional.
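
For anyone who wants to check a cached file themselves, here is a quick sketch of detecting and stripping a UTF-8 BOM before handing the bytes to an XML parser (using the Data_.raw file name from the caching folder above):

    import codecs
    import xml.etree.ElementTree as ET

    with open("Data_.raw", "rb") as f:
        data = f.read()

    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]

    root = ET.fromstring(data)
    print(root.tag)  # EnumerationResults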


For reference to others: turning off caching to a file is what was done to make it work, and it was confirmed as working by Benny.

 

