
JSON

Restonomer can parse a text-type API response in JSON format. Configure the checkpoint as follows:

name = "checkpoint_json_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/json-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "JSONTextFormat"
        primitives-as-string = true
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Compression

If the JSON text returned by the API is compressed, configure the checkpoint as follows:

name = "checkpoint_json_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/json-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      compression = "GZIP"
      text-format = {
        type = "JSONTextFormat"
        primitives-as-string = true
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Currently, Restonomer supports only the GZIP compression format.

JSON Text Format Configurations

Besides primitives-as-string, the following properties can be configured for the JSON text format to control how Restonomer parses the response:

| Parameter Name | Default Value | Description |
|---|---|---|
| allow-backslash-escaping-any-character | false | Allows quoting of all characters using the backslash quoting mechanism. |
| allow-comments | false | Ignores Java/C++ style comments in JSON records. |
| allow-non-numeric-numbers | true | Allows the JSON parser to recognize the set of "Not-a-Number" (NaN) tokens as legal floating-point number values. |
| allow-numeric-leading-zeros | false | Allows leading zeros in numbers (e.g. 00012). |
| allow-single-quotes | true | Allows single quotes in addition to double quotes. |
| allow-unquoted-control-chars | false | Whether JSON strings may contain unquoted control characters (ASCII characters with a value less than 32, including tab and line feed characters). |
| allow-unquoted-field-names | false | Allows unquoted JSON field names. |
| column-name-of-corrupt-record | _corrupt_record | Renames the new field that holds the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. |
| data-column-name | None | The name of the column that actually contains the dataset. If present, only the dataset in this column is parsed into the dataframe. |
| date-format | yyyy-MM-dd | Sets the string that indicates a date format. |
| drop-field-if-all-null | false | Whether to ignore columns of all null values or empty arrays during schema inference. |
| enable-date-time-parsing-fallback | true | Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns. |
| encoding | UTF-8 | Decodes the JSON files using the given encoding type. |
| infer-schema | true | Infers the input schema automatically from data. |
| line-sep | \n | Defines the line separator that should be used for parsing. Maximum length is 1 character. |
| locale | en-US | Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. |
| mode | FAILFAST | Sets the mode for dealing with corrupt records during parsing. Allowed values are PERMISSIVE, DROPMALFORMED, and FAILFAST. |
| multi-line | false | Parses one record, which may span multiple lines, per file. |
| prefers-decimal | false | Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. |
| primitives-as-string | false | Infers all primitive values as a string type. |
| sampling-ratio | 1.0 | Defines the fraction of rows used for schema inference. |
| timestamp-format | yyyy-MM-dd HH:mm:ss | Sets the string that indicates a timestamp format. |
| timestamp-ntz-format | yyyy-MM-dd'T'HH:mm:ss[.SSS] | Sets the string that indicates a timestamp without timezone format. |
| time-zone | UTC | Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. |
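
As an illustration, the sketch below combines a few of these properties in one checkpoint. It assumes they sit inside text-format next to primitives-as-string, as in the earlier examples, and that string-valued properties such as mode are written as quoted strings. The checkpoint name, the corrupt-record column name and the date pattern are hypothetical values chosen for the example; the URL and file path are the same placeholders used above.

```hocon
# Hypothetical checkpoint name, used only for this example.
name = "checkpoint_json_multiline_permissive"

data = {
  data-request = {
    # Placeholder URL reused from the examples above.
    url = "http://localhost:8080/json-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "JSONTextFormat"
        # Assumption: these properties are set at the same level as primitives-as-string.
        multi-line = true                             # one record may span several lines
        mode = "PERMISSIVE"                           # overrides the FAILFAST default
        column-name-of-corrupt-record = "_malformed"  # hypothetical column name
        date-format = "dd-MM-yyyy"                    # hypothetical date pattern
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}
```

With mode set to PERMISSIVE, malformed records would be expected to land in the renamed corrupt-record column rather than failing the run, as described in the table. Properties left out of text-format keep the defaults listed there.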