CSV

Restonomer can parse a text API response that is in CSV format. Configure the checkpoint as follows:

name = "checkpoint_csv_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Compression

If the CSV text returned by the API is compressed, configure the checkpoint as follows:

name = "checkpoint_csv_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      compression = "GZIP"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

As of now, Restonomer supports only the GZIP compression format.

CSV Text Format Configurations

In addition to sep, the following properties can be configured for the CSV text format to control how Restonomer parses the response:

| Parameter Name | Default Value | Description |
|---|---|---|
| `char-to-escape-quote-escaping` | `\` | Sets a single character used for escaping the escape for the quote character. |
| `column-name-of-corrupt-record` | `_corrupt_record` | Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. |
| `comment` | `#` | Sets a single character used for skipping lines beginning with this character. |
| `date-format` | `yyyy-MM-dd` | Sets the string that indicates a date format. |
| `empty-value` | `""` (empty string) | Sets the string representation of an empty value. |
| `enable-date-time-parsing-fallback` | `true` | Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns. |
| `encoding` | `UTF-8` | Decodes the CSV files by the given encoding type. |
| `enforce-schema` | `true` | If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforce-schema option to avoid incorrect results. |
| `escape` | `\` | Sets a single character used for escaping quotes inside an already quoted value. |
| `header` | `true` | Boolean flag to tell whether csv text contains header names or not. |
| `infer-schema` | `true` | Infers the input schema automatically from data. |
| `ignore-leading-white-space` | `false` | A flag indicating whether or not leading whitespaces from values being read should be skipped. |
| `ignore-trailing-white-space` | `false` | A flag indicating whether or not trailing whitespaces from values being read should be skipped. |
| `line-sep` | `\n` | Defines the line separator that should be used for parsing. Maximum length is 1 character. |
| `locale` | `en-US` | Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. |
| `max-chars-per-column` | `-1` | Defines the maximum number of characters allowed for any given value being read. |
| `max-columns` | `20480` | Defines a hard limit of how many columns a record can have. |
| `mode` | `FAILFAST` | Allows a mode for dealing with corrupt records during parsing. Allowed values are PERMISSIVE, DROPMALFORMED, and FAILFAST. |
| `multi-line` | `false` | Parse one record, which may span multiple lines, per file. |
| `nan-value` | `NaN` | Sets the string representation of a non-number value. |
| `negative-inf` | `-Inf` | Sets the string representation of a negative infinity value. |
| `null-value` | `null` | Sets the string representation of a null value. |
| `positive-inf` | `Inf` | Sets the string representation of a positive infinity value. |
| `prefer-date` | `true` | During schema inference (infer-schema), attempts to infer string columns that contain dates as Date if the values satisfy the date-format option or the default date format. For columns that contain a mixture of dates and timestamps, try inferring them as TimestampType if the timestamp format is not specified; otherwise, infer them as StringType. |
| `quote` | `"` | Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not null but an empty string. |
| `record-sep` | `\n` | Delimiter by which rows are separated in a CSV text. |
| `sampling-ratio` | `1.0` | Defines the fraction of rows used for schema inferring. |
| `sep` | `,` | Delimiter by which fields in a row are separated in a CSV text. |
| `timestamp-format` | `yyyy-MM-dd HH:mm:ss` | Sets the string that indicates a timestamp format. |
| `timestamp-ntz-format` | `yyyy-MM-dd'T'HH:mm:ss[.SSS]` | Sets the string that indicates a timestamp without timezone format. |
| `unescaped-quote-handling` | `STOP-AT-DELIMITER` | Defines how the CsvParser will handle values with unescaped quotes. Allowed values are STOP-AT-CLOSING-QUOTE, BACK-TO-DELIMITER, STOP-AT-DELIMITER, SKIP-VALUE, RAISE-ERROR. |
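
As an illustration of how several of these properties might be combined, the fragment below is a minimal sketch of a body block that sets a few of them alongside sep. The property names come from the table above. The values ("NA", "dd/MM/yyyy", "DROPMALFORMED") are hypothetical choices for an imagined API and are not recommendations. Writing the boolean values unquoted is also an assumption about how the configuration is read.

```hocon
body = {
  type = "Text"
  text-format = {
    type = "CSVTextFormat"
    sep = ";"

    # The first row of the CSV text holds the column names (this is also the default).
    header = true

    # Hypothetical: this API writes missing values as NA instead of null.
    null-value = "NA"

    # Hypothetical: dates arrive as day/month/year rather than yyyy-MM-dd.
    date-format = "dd/MM/yyyy"

    # Trim spaces that precede each value.
    ignore-leading-white-space = true

    # Drop malformed rows instead of failing on the first one (the default is FAILFAST).
    mode = "DROPMALFORMED"
  }
}
```

Any property left out of text-format keeps the default value listed in the table.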