CSV

Restonomer can parse a text-type API response in CSV format. Configure the checkpoint as follows:

name = "checkpoint_csv_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Compression

If the CSV text returned by the API is compressed, configure the checkpoint as follows:

name = "checkpoint_csv_response_dataframe_converter"

data = {
  data-request = {
    url = "http://localhost:8080/csv-response-converter"
  }

  data-response = {
    body = {
      type = "Text"
      compression = "GZIP"
      text-format = {
        type = "CSVTextFormat"
        sep = ";"
      }
    }

    persistence = {
      type = "LocalFileSystem"
      file-format = {
        type = "ParquetFileFormat"
      }
      file-path = "/tmp/response_body"
    }
  }
}

Currently, Restonomer supports only the GZIP compression format.

CSV Text Format Configurations

In addition to sep, the following properties can be configured for the CSV text format to control how Restonomer parses the response:

Parameter Name	Default Value	Description
char-to-escape-quote-escaping	\	Sets a single character used for escaping the escape for the quote character.
column-name-of-corrupt-record	_corrupt_record	Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides `spark.sql.columnNameOfCorruptRecord`.
comment	#	Sets a single character used for skipping lines beginning with this character.
date-format	yyyy-MM-dd	Sets the string that indicates a date format.
empty-value	"" (empty string)	Sets the string representation of an empty value.
enable-date-time-parsing-fallback	true	Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.
encoding	UTF-8	Decodes the CSV files by the given encoding type.
enforce-schema	true	If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforce-schema option to avoid incorrect results.
escape	\	Sets a single character used for escaping quotes inside an already quoted value.
header	true	Boolean flag indicating whether the CSV text contains a header row with column names.
infer-schema	true	Infers the input schema automatically from data.
ignore-leading-white-space	false	A flag indicating whether or not leading whitespaces from values being read should be skipped.
ignore-trailing-white-space	false	A flag indicating whether or not trailing whitespaces from values being read should be skipped.
line-sep	\n	Defines the line separator that should be used for parsing. Maximum length is 1 character.
locale	en-US	Sets a locale as a language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps.
max-chars-per-column	-1	Defines the maximum number of characters allowed for any given value being read.
max-columns	20480	Defines a hard limit of how many columns a record can have.
mode	FAILFAST	Allows a mode for dealing with corrupt records during parsing. Allowed values are PERMISSIVE, DROPMALFORMED, and FAILFAST.
multi-line	false	Parse one record, which may span multiple lines, per file.
nan-value	NaN	Sets the string representation of a non-number value.
negative-inf	-Inf	Sets the string representation of a negative infinity value.
null-value	null	Sets the string representation of a null value.
positive-inf	Inf	Sets the string representation of a positive infinity value.
prefer-date	true	During schema inference (infer-schema), attempts to infer string columns that contain dates as Date if the values satisfy the date-format option or the default date format. For columns that contain a mixture of dates and timestamps, try inferring them as TimestampType if the timestamp format is not specified; otherwise, infer them as StringType.
quote	"	Sets a single character used for escaping quoted values where the separator can be part of the value. To turn off quotations when reading, set this to an empty string rather than null.
record-sep	\n	Delimiter by which rows are separated in a CSV text.
sampling-ratio	1.0	Defines the fraction of rows used for schema inferring.
sep	,	Delimiter by which fields in a row are separated in a CSV text.
timestamp-format	yyyy-MM-dd HH:mm:ss	Sets the string that indicates a timestamp format.
timestamp-ntz-format	yyyy-MM-dd'T'HH:mm:ss[.SSS]	Sets the string that indicates a timestamp without timezone format.
unescaped-quote-handling	STOP-AT-DELIMITER	Defines how the CsvParser will handle values with unescaped quotes. Allowed values are STOP-AT-CLOSING-QUOTE, BACK-TO-DELIMITER, STOP-AT-DELIMITER, SKIP-VALUE, and RAISE-ERROR.
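
As a minimal sketch, the fragment below sets a few of the properties from the table inside the text-format block, next to sep. The property names are taken from the table; the values are illustrative only, and placing them at the same level as sep is an assumption based on how sep is configured in the checkpoints above.

```hocon
body = {
  type = "Text"
  text-format = {
    type = "CSVTextFormat"
    sep = ";"

    # Illustrative values only, not recommendations.
    header = true
    infer-schema = true
    date-format = "dd/MM/yyyy"
    null-value = "NA"

    # Keep malformed rows and expose them in a renamed column
    # (assumed to behave as described in the table for PERMISSIVE mode).
    mode = "PERMISSIVE"
    column-name-of-corrupt-record = "_malformed"
  }
}
```

Any property left out of text-format keeps the default value listed in the table.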
