CSV
Restonomer can parse an API response of text type in CSV format. Users need to configure the checkpoint in the format below:
name = "checkpoint_csv_response_dataframe_converter"
data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }
  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }
    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}
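
To make the effect of `sep = ";"` concrete, the sketch below splits a semicolon-delimited body into a header row and data rows using Python's standard `csv` module. It only illustrates the parsing idea and is not Restonomer's implementation; the sample body and the `parse_csv_text` helper are hypothetical.

```python
import csv
import io

# Hypothetical response body; the real API output is not shown in this document.
body = "id;name;city\n1;Alice;Berlin\n2;Bob;Paris\n"


def parse_csv_text(text, sep=","):
    """Split delimiter-separated text into a header row and a list of data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    return rows[0], rows[1:]


header, records = parse_csv_text(body, sep=";")
print(header)   # ['id', 'name', 'city']
print(records)  # [['1', 'Alice', 'Berlin'], ['2', 'Bob', 'Paris']]
```

With the default `sep` of `,`, the same body would come back as a single column per row, which is why the separator has to match what the API actually returns.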
Compression
If the CSV text returned by the API is compressed, users can configure the checkpoint in the format below:
name = "checkpoint_csv_response_dataframe_converter"
data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }
  data-response = {
    body = {
      type = "Text"
      compression = "GZIP"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }
    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}
As of now, Restonomer supports only the GZIP compression format.
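
Conceptually, a compressed body has to be decompressed before it can be parsed as CSV. The sketch below shows that order of operations with Python's standard `gzip` and `csv` modules. It is an illustration under assumptions, not Restonomer's code: the `decode_gzip_csv` helper and the sample payload are hypothetical, and the payload is compressed locally to stand in for an API response.

```python
import csv
import gzip
import io


def decode_gzip_csv(payload, sep=",", encoding="utf-8"):
    """Decompress a GZIP payload, decode it to text, then parse it as CSV rows."""
    text = gzip.decompress(payload).decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=sep))


# Hypothetical compressed body, built locally in place of a real API response.
compressed = gzip.compress("id;name\n1;Alice\n".encode("utf-8"))

rows = decode_gzip_csv(compressed, sep=";")
print(rows)  # [['id', 'name'], ['1', 'Alice']]
```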
CSV Text Format Configurations
Just like sep, users can configure the following properties of the CSV text format to help Restonomer parse the response:
Parameter Name | Default Value | Description |
---|---|---|
char-to-escape-quote-escaping | \ | Sets a single character used for escaping the escape for the quote character. |
column-name-of-corrupt-record | _corrupt_record | Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. |
comment | # | Sets a single character used for skipping lines beginning with this character. |
date-format | yyyy-MM-dd | Sets the string that indicates a date format. |
empty-value | "" (empty string) | Sets the string representation of an empty value. |
enable-date-time-parsing-fallback | true | Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns. |
encoding | UTF-8 | Decodes the CSV files by the given encoding type. |
enforce-schema | true | If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforce-schema option to avoid incorrect results. |
escape | \ | Sets a single character used for escaping quotes inside an already quoted value. |
header | true | Boolean flag indicating whether the CSV text contains a header row with column names. |
infer-schema | true | Infers the input schema automatically from data. |
ignore-leading-white-space | false | A flag indicating whether or not leading whitespaces from values being read should be skipped. |
ignore-trailing-white-space | false | A flag indicating whether or not trailing whitespaces from values being read should be skipped. |
line-sep | \n | Defines the line separator that should be used for parsing. Maximum length is 1 character. |
locale | en-US | Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. |
max-chars-per-column | -1 | Defines the maximum number of characters allowed for any given value being read. |
max-columns | 20480 | Defines a hard limit of how many columns a record can have. |
mode | FAILFAST | Allows a mode for dealing with corrupt records during parsing. Allowed values are PERMISSIVE, DROPMALFORMED, and FAILFAST. |
multi-line | false | Parse one record, which may span multiple lines, per file. |
nan-value | NaN | Sets the string representation of a non-number value. |
negative-inf | -Inf | Sets the string representation of a negative infinity value. |
null-value | null | Sets the string representation of a null value. |
positive-inf | Inf | Sets the string representation of a positive infinity value. |
prefer-date | true | During schema inference (infer-schema), attempts to infer string columns that contain dates as Date if the values satisfy the date-format option or the default date format. For columns that contain a mixture of dates and timestamps, try inferring them as TimestampType if the timestamp format is not specified; otherwise, infer them as StringType. |
quote | " | Sets a single character used for escaping quoted values where the separator can be part of the value. To turn off quotations when reading, set this to an empty string rather than null. |
record-sep | \n | Delimiter by which rows are separated in a CSV text. |
sampling-ratio | 1.0 | Defines the fraction of rows used for schema inferring. |
sep | , | Delimiter by which fields in a row are separated in a CSV text. |
timestamp-format | yyyy-MM-dd HH:mm:ss | Sets the string that indicates a timestamp format. |
timestamp-ntz-format | yyyy-MM-dd'T'HH:mm:ss[.SSS] | Sets the string that indicates a timestamp without timezone format. |
unescaped-quote-handling | STOP-AT-DELIMITER | Defines how the CsvParser will handle values with unescaped quotes. Allowed values are STOP-AT-CLOSING-QUOTE, BACK-TO-DELIMITER, STOP-AT-DELIMITER, SKIP-VALUE, and RAISE-ERROR. |
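
The three `mode` values differ in what happens to a record that cannot be parsed. The sketch below is a simplified plain-Python model of the behavior described in Spark's CSV reader documentation, assuming Restonomer passes the option through to Spark unchanged. It is not Spark's or Restonomer's code: the two-column `SCHEMA`, the helper names, and the sample rows are hypothetical, and "malformed" here just means a field is missing or fails type conversion.

```python
# Hypothetical two-column schema: (column name, converter).
SCHEMA = [("id", int), ("amount", float)]


def convert_row(raw):
    """Convert one row field by field; unparseable or missing fields become None."""
    record, malformed = {}, False
    for index, (name, convert) in enumerate(SCHEMA):
        try:
            record[name] = convert(raw[index])
        except (ValueError, IndexError):
            record[name] = None
            malformed = True
    return record, malformed


def parse_records(rows, mode="FAILFAST", corrupt_column="_corrupt_record", sep=";"):
    """Apply PERMISSIVE, DROPMALFORMED, or FAILFAST handling to malformed rows."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"Unknown mode: {mode}")
    parsed = []
    for raw in rows:
        record, malformed = convert_row(raw)
        if malformed and mode == "FAILFAST":
            raise ValueError(f"Malformed record: {raw}")  # stop at the first bad row
        if malformed and mode == "DROPMALFORMED":
            continue  # silently skip the bad row
        # PERMISSIVE keeps the row and stores the raw text in the corrupt column.
        record[corrupt_column] = sep.join(raw) if malformed else None
        parsed.append(record)
    return parsed


rows = [["1", "9.5"], ["x", "3.0"], ["3", "7.25"]]  # second row has a bad id
permissive = parse_records(rows, mode="PERMISSIVE")
dropped = parse_records(rows, mode="DROPMALFORMED")
print(permissive[1])  # {'id': None, 'amount': 3.0, '_corrupt_record': 'x;3.0'}
print(len(dropped))   # 2
```

Calling `parse_records(rows, mode="FAILFAST")` on the same input raises an error at the second row, which is the trade-off of the documented default: bad data is surfaced immediately instead of being dropped or nulled.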