XML

Restonomer can parse a text-type API response in XML format. Configure the checkpoint as follows:

name = "checkpoint_xml_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/xml-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "XMLTextFormat"
        row-tag = "ROW"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Compression

If the XML text returned by the API is compressed, configure the checkpoint as follows:

name = "checkpoint_xml_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/xml-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      compression = "GZIP"
      text-format = {
        type = "XMLTextFormat"
        row-tag = "ROW"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Currently, Restonomer supports only the GZIP compression format.

XML Text Format Configurations

In addition to row-tag, you can configure the following properties for the XML text format to control how Restonomer parses the response:

| Parameter Name | Default Value | Description |
|---|---|---|
| attribute-prefix | _ | The prefix for attributes, used to differentiate attributes from elements. |
| charset | UTF-8 | Defaults to 'UTF-8' but can be set to other valid charset names. |
| column-name-of-corrupt-record | _corrupt_record | Renames the field that holds the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. |
| date-format | yyyy-MM-dd | Sets the string that indicates a date format. |
| exclude-attribute | false | Whether to exclude attributes in elements. |
| ignore-surrounding-spaces | false | Defines whether surrounding whitespace in the values being read should be skipped. |
| ignore-namespace | false | If true, namespace prefixes on XML elements and attributes are ignored. Note that, at the moment, namespaces cannot be ignored on the row-tag element, only on its children. Note that XML parsing is in general not namespace-aware even if false. |
| infer-schema | true | Infers the input schema automatically from data. |
| mode | FAILFAST | The mode for dealing with corrupt records during parsing. Allowed values are PERMISSIVE, DROPMALFORMED and FAILFAST. |
| null-value | null | The value to read as a null value. |
| row-tag | row | The tag in the XML text to treat as a row. |
| sampling-ratio | 1.0 | Defines the fraction of rows used for schema inference. |
| timestamp-format | yyyy-MM-dd HH:mm:ss | Sets the string that indicates a timestamp format. |
| value-tag | _VALUE | The tag used for the value when an element has attributes but no child elements. |
| wildcard-col-name | xs_any | Name of a column existing in the provided schema which is interpreted as a 'wildcard'. It must have type string or array of strings. It will match any XML child element that is not otherwise matched by the schema. The XML of the child becomes the string value of the column. If an array, then all unmatched elements will be returned as an array of strings. As its name implies, it is meant to emulate XSD's xs:any type. |
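
As an illustration of how the properties in the table above might be combined, the sketch below extends the earlier checkpoint with a few of them. It assumes that these properties sit inside the text-format block next to row-tag, and that boolean and numeric values can be written unquoted. The chosen values ("attr_", "_malformed", "N/A", "dd/MM/yyyy", 0.5) are hypothetical and only show the shape of the configuration.

```hocon
name = "checkpoint_xml_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/xml-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "XMLTextFormat"
        row-tag = "ROW"

        # Assumption: the optional properties are siblings of row-tag.
        # Prefix attribute-derived columns so they do not clash with elements.
        attribute-prefix = "attr_"

        # Keep malformed records instead of failing (the default is FAILFAST),
        # and store the raw malformed string in a custom column.
        mode = "PERMISSIVE"
        column-name-of-corrupt-record = "_malformed"

        # Illustrative values only; adjust them to match the API payload.
        null-value = "N/A"
        date-format = "dd/MM/yyyy"
        ignore-surrounding-spaces = true

        # Infer the schema from half of the rows (assumed numeric literal).
        sampling-ratio = 0.5
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}
```

Any property that is left out keeps the default value listed in the table.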