Version: Next

File Formats

File-based data connectors (including s3://, abfs://, file://, ftp://, sftp://, and others) support multiple structured and document file formats. This page details the format-specific parameters available for each format.

Common Parameters

These parameters apply across multiple file formats.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file_format | String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See Supported Formats. |
| file_extension | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
| schema_infer_max_records | Integer | 1000 | Maximum number of records scanned to infer the schema. |
| file_compression_type | String | UNCOMPRESSED | File-level compression for CSV, TSV, and JSON files. Valid values: GZIP, BZIP2, XZ, ZSTD, UNCOMPRESSED. |
| hive_partitioning_enabled | Boolean | false | Enables Hive-style partition discovery from directory structure. |
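As an illustration of the common parameters, the sketch below enables Hive-style partition discovery on a Parquet dataset. The bucket name, directory layout, and dataset name are hypothetical, and the `key=value` directory naming is the usual Hive convention rather than something this page specifies.

```yaml
# Assumed (hypothetical) layout, following the usual Hive key=value convention:
#   s3://my-bucket/events/year=2024/month=01/part-0.parquet
#   s3://my-bucket/events/year=2024/month=02/part-0.parquet
datasets:
  - from: s3://my-bucket/events/
    name: events
    params:
      file_format: parquet
      hive_partitioning_enabled: true   # default is false
      schema_infer_max_records: 5000    # scan more records than the default 1000
```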

Supported Formats

The file_format parameter accepts these values:

| Value | Reader | Default Extension | Notes |
| --- | --- | --- | --- |
| parquet | Apache Parquet | .parquet | |
| csv | CSV | .csv | Uses csv_* parameters. |
| tsv | TSV (tab-delimited) | .tsv | Uses tsv_* parameters. Delimiter is tab. |
| json | JSON | .json | Auto-detects format. Uses json_format to control parsing mode. |
| jsonl | JSON Lines | .jsonl | Line-delimited JSON. |

When file_format is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
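A minimal sketch of the inference rule, using hypothetical paths and dataset names: the first dataset relies on the `.parquet` extension, while the second points at a directory prefix with no extension and therefore sets `file_format` explicitly.

```yaml
datasets:
  # The .parquet extension matches a supported value, so file_format can be omitted.
  - from: file:data/orders.parquet
    name: orders
  # The path has no extension to infer from, so file_format is set explicitly.
  - from: s3://my-bucket/orders/
    name: orders_archive
    params:
      file_format: parquet
```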

Parquet

Spice reads any Parquet file regardless of the compression codec or data encoding.

Supported compression codecs:

Supported data encodings:

CSV

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| csv_has_header | Boolean | true | Whether the first row contains column headers. |
| csv_quote | Char | " | Character used to quote fields containing special characters. |
| csv_escape | Char | none | Character used to escape special characters within a field. |
| csv_delimiter | Char | , | Character used to separate fields. |
| csv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |
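For example, a semicolon-delimited export without a header row might be configured roughly as follows. The bucket, prefix, and dataset name are hypothetical; only the parameter names come from the table above.

```yaml
datasets:
  - from: s3://my-bucket/exports/   # hypothetical bucket and prefix
    name: exports
    params:
      file_format: csv
      csv_has_header: false   # first row is data, not column names
      csv_delimiter: ";"      # fields are separated by semicolons
      csv_quote: "'"          # fields are quoted with single quotes
```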

TSV

TSV (tab-separated values) is a first-class format. Set file_format: tsv or use a .tsv file extension. The delimiter is always tab and cannot be changed.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tsv_has_header | Boolean | true | Whether the first row contains column headers. |
| tsv_quote | Char | " | Character used to quote fields containing special characters. |
| tsv_escape | Char | none | Character used to escape special characters within a field. |
| tsv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |
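A short sketch of a TSV dataset, assuming a hypothetical local file. Because the path ends in `.tsv`, `file_format` is left to inference, and the non-deprecated `schema_infer_max_records` is used in place of `tsv_schema_infer_max_records`.

```yaml
datasets:
  - from: file:data/measurements.tsv   # hypothetical local file
    name: measurements
    params:
      tsv_has_header: true             # the default, shown for clarity
      schema_infer_max_records: 5000   # replaces tsv_schema_infer_max_records
```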

JSON

Set file_format: json for JSON files. Use the json_format parameter to select the parsing mode.

Parsing Modes

The json_format parameter controls how JSON content is interpreted.

| Value | Description |
| --- | --- |
| auto | Default. Auto-detects the format by inspecting content. Detects SODA responses, JSON arrays, single objects, and line-delimited JSON. |
| json | Auto-detects array vs line-delimited JSON by peeking at the first byte, but does not perform SODA auto-detection. Used implicitly when file_format: json is set explicitly. |
| jsonl, ndjson, ldjson | Line-delimited JSON. Each line contains one JSON value. |
| array | The file contains a single top-level JSON array. Each element becomes a row. |
| object | The file contains a single JSON object, producing one row. |
| soda, socrata | Socrata Open Data API (SODA) format. Schema is derived from meta.view.columns in the response. Cannot be combined with json_pointer. |

When file_format is omitted and the file extension is .json, the default parsing mode is auto, which includes SODA auto-detection. When file_format: json is set explicitly, the default mode is json, which skips SODA auto-detection.
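The difference between these defaults can be sketched with three datasets that read the same hypothetical file; the file path and dataset names are illustrative only.

```yaml
datasets:
  # file_format omitted: the .json extension selects the JSON reader and
  # json_format defaults to auto, which includes SODA auto-detection.
  - from: file:data/rows.json
    name: rows_auto
  # file_format set explicitly: json_format defaults to json, which detects
  # array vs line-delimited JSON but skips SODA auto-detection.
  - from: file:data/rows.json
    name: rows_plain
    params:
      file_format: json
  # json_format set explicitly: the file is treated as a single top-level array.
  - from: file:data/rows.json
    name: rows_array
    params:
      file_format: json
      json_format: array
```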

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| json_format | String | auto | Parsing mode. The default is json instead of auto when file_format: json is set explicitly. See Parsing Modes. |
| json_pointer | String | none | Extracts a sub-value from the document before parsing. Alias: json_path. Cannot be used with soda format. |
| flatten_json | Boolean | false | When true, nested JSON objects are flattened with . as a separator (e.g., address.city). |
| soda_metadata | String | disabled | When enabled, includes Socrata metadata columns (:sid, :id, :position, :created_at, :created_meta, :updated_at, :updated_meta, :meta) in the output. Only applies when parsing SODA format data. |
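As a sketch of `flatten_json`, assume a hypothetical file whose records contain a nested `address` object. With flattening enabled, the nested keys would be expected to surface as dot-separated column names such as `address.city`.

```yaml
# Hypothetical input document (data/customers.json):
#   [{"id": 1, "address": {"city": "Hartford", "zip": "06103"}}]
datasets:
  - from: file:data/customers.json
    name: customers
    params:
      file_format: json
      flatten_json: true   # nested keys become address.city, address.zip
```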

Setting file_format: jsonl uses the DataFusion JSON Lines reader directly, without json_format, flatten_json, or json_pointer support.
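The two ways of reading line-delimited JSON can be contrasted as below. Paths and dataset names are hypothetical. The second dataset assumes, based on the parsing modes table, that `file_format: json` combined with `json_format: jsonl` keeps options such as `flatten_json` available.

```yaml
datasets:
  # Reads JSON Lines directly; json_format, flatten_json, and json_pointer
  # are not supported with this reader.
  - from: s3://my-bucket/logs/
    name: logs
    params:
      file_format: jsonl
  # Assumption: line-delimited input read through the JSON reader,
  # so flatten_json can still be applied.
  - from: s3://my-bucket/logs/
    name: logs_flat
    params:
      file_format: json
      json_format: jsonl
      flatten_json: true
```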

Examples

Extract a nested value from a JSON document using json_pointer:

datasets:
  - from: s3://my-bucket/data/
    name: events
    params:
      file_format: json
      json_pointer: /results/events

Read a SODA response with metadata columns included:

curl -sL "https://data.ct.gov/api/views/kf98-j89e/rows.json?accessType=DOWNLOAD" -o house_price_index.json

datasets:
  - from: file:house_price_index.json
    name: house_price_index
    params:
      json_format: soda
      soda_metadata: enabled

sql> select * from house_price_index limit 5;

+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+
| :sid               | :id                                  | :position | :created_at | :created_meta | :updated_at | :updated_meta | :meta | observation_date    | ctsthpi |
| varchar            | varchar                              | int64     | int64       | varchar       | int64       | varchar       |varchar| timestamp[s]        | float64 |
+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+
| row-r4ag~gfrd~dqcz | 00000000-0000-0000-6A52-5730E0309BF7 | 0         | 1768216213  |               | 1768216213  |               | { }   | 1975-01-01T00:00:00 | 62.9    |
| row-65s5_stm6-jbjc | 00000000-0000-0000-3B25-D45DF23837A7 | 0         | 1768216213  |               | 1768216213  |               | { }   | 1975-04-01T00:00:00 | 62.94   |
| row-buhc_mzb7.95pa | 00000000-0000-0000-A414-989C0238E96F | 0         | 1768216213  |               | 1768216213  |               | { }   | 1975-07-01T00:00:00 | 61.93   |
| row-3fp4~bx38-mwgi | 00000000-0000-0000-58CE-D4AF76C589B0 | 0         | 1768216213  |               | 1768216213  |               | { }   | 1975-10-01T00:00:00 | 61.85   |
| row-khut~7dd3-vi9e | 00000000-0000-0000-A274-646F4876CC81 | 0         | 1768216213  |               | 1768216213  |               | { }   | 1976-01-01T00:00:00 | 64.83   |
+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+

Time: 0.0028895 seconds. 5 rows.