File Formats
File-based data connectors — including s3://, abfs://, file://, ftp://, sftp://, and others — support multiple structured and document file formats. This page details the format-specific parameters available for each.
Common Parameters
These parameters apply across multiple file formats.
| Parameter | Type | Default | Description |
|---|---|---|---|
file_format | String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See Supported Formats. |
file_extension | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
schema_infer_max_records | Integer | 1000 | Maximum number of records scanned to infer the schema. |
file_compression_type | String | UNCOMPRESSED | File-level compression for CSV, TSV, and JSON files. Valid values: GZIP, BZIP2, XZ, ZSTD, UNCOMPRESSED. |
hive_partitioning_enabled | Boolean | false | Enables Hive-style partition discovery from directory structure. |
Supported Formats
The file_format parameter accepts these values:
| Value | Reader | Default Extension | Notes |
|---|---|---|---|
parquet | Apache Parquet | .parquet | |
csv | CSV | .csv | Uses csv_* parameters. |
tsv | TSV (tab-delimited) | .tsv | Uses tsv_* parameters. Delimiter is tab. |
json | JSON | .json | Auto-detects format. Uses json_format to control parsing mode. |
jsonl | JSON Lines | .jsonl | Line-delimited JSON. |
When file_format is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
Parquet
Spice reads any Parquet file regardless of the compression codec or data encoding.
Supported compression codecs:
Supported data encodings:
PLAINPLAIN_DICTIONARY/RLE_DICTIONARYRLEBIT_PACKED(deprecated in favor ofRLE)DELTA_BINARY_PACKEDDELTA_LENGTH_BYTE_ARRAYDELTA_BYTE_ARRAYBYTE_STREAM_SPLIT
CSV
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
csv_has_header | Boolean | true | Whether the first row contains column headers. |
csv_quote | Char | " | Character used to quote fields containing special characters. |
csv_escape | Char | none | Character used to escape special characters within a field. |
csv_delimiter | Char | , | Character used to separate fields. |
csv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |
TSV
TSV (tab-separated values) is a first-class format. Set file_format: tsv or use a .tsv file extension. The delimiter is always tab and cannot be changed.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
tsv_has_header | Boolean | true | Whether the first row contains column headers. |
tsv_quote | Char | " | Character used to quote fields containing special characters. |
tsv_escape | Char | none | Character used to escape special characters within a field. |
tsv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |
JSON
Set file_format: json for JSON files. Use the json_format parameter to select the parsing mode.
Parsing Modes
The json_format parameter controls how JSON content is interpreted.
| Value | Description |
|---|---|
auto | Default. Auto-detects the format by inspecting content. Detects SODA responses, JSON arrays, single objects, and line-delimited JSON. |
json | Auto-detects array vs line-delimited JSON by peeking at the first byte, but does not perform SODA auto-detection. Used implicitly when file_format: json is set explicitly. |
jsonl, ndjson, ldjson | Line-delimited JSON. Each line contains one JSON value. |
array | The file contains a single top-level JSON array. Each element becomes a row. |
object | The file contains a single JSON object, producing one row. |
soda, socrata | Socrata Open Data API (SODA) format. Schema is derived from meta.view.columns in the response. Cannot be combined with json_pointer. |
When file_format is omitted and the file extension is .json, the default parsing mode is auto, which includes SODA auto-detection. When file_format: json is set explicitly, the default mode is json, which skips SODA auto-detection.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
json_format | String | auto | Parsing mode. See Parsing Modes. |
json_pointer | String | none | Extracts a sub-value from the document before parsing. Alias: json_path. Cannot be used with soda format. |
flatten_json | Boolean | false | When true, nested JSON objects are flattened with . as a separator (e.g., address.city). |
soda_metadata | String | disabled | When enabled, includes Socrata metadata columns (:sid, :id, :position, :created_at, :created_meta, :updated_at, :updated_meta, :meta) in the output. Only applies when parsing SODA format data. |
Setting file_format: jsonl uses the DataFusion JSON Lines reader directly, without json_format, flatten_json, or json_pointer support.
Examples
Extract a nested value from a JSON document using json_pointer:
datasets:
- from: s3://my-bucket/data/
name: events
params:
file_format: json
json_pointer: /results/events
Read a SODA response with metadata columns included:
curl -sL "https://data.ct.gov/api/views/kf98-j89e/rows.json?accessType=DOWNLOAD" -o house_price_index.json
datasets:
- from: file:house_price_index.json
name: house_price_index
params:
json_format: soda
soda_metadata: enabled
sql> select * from house_price_index limit 5;
+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+
| :sid | :id | :position | :created_at | :created_meta | :updated_at | :updated_meta | :meta | observation_date | ctsthpi |
| varchar | varchar | int64 | int64 | varchar | int64 | varchar |varchar| timestamp[s] | float64 |
+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+
| row-r4ag~gfrd~dqcz | 00000000-0000-0000-6A52-5730E0309BF7 | 0 | 1768216213 | | 1768216213 | | { } | 1975-01-01T00:00:00 | 62.9 |
| row-65s5_stm6-jbjc | 00000000-0000-0000-3B25-D45DF23837A7 | 0 | 1768216213 | | 1768216213 | | { } | 1975-04-01T00:00:00 | 62.94 |
| row-buhc_mzb7.95pa | 00000000-0000-0000-A414-989C0238E96F | 0 | 1768216213 | | 1768216213 | | { } | 1975-07-01T00:00:00 | 61.93 |
| row-3fp4~bx38-mwgi | 00000000-0000-0000-58CE-D4AF76C589B0 | 0 | 1768216213 | | 1768216213 | | { } | 1975-10-01T00:00:00 | 61.85 |
| row-khut~7dd3-vi9e | 00000000-0000-0000-A274-646F4876CC81 | 0 | 1768216213 | | 1768216213 | | { } | 1976-01-01T00:00:00 | 64.83 |
+--------------------+--------------------------------------+-----------+-------------+---------------+-------------+---------------+-------+---------------------+---------+
Time: 0.0028895 seconds. 5 rows.
