Semantic Model
The semantic model is the layer that gives every dataset, view, and column a human-readable description plus structured metadata, and exposes that context uniformly to SQL queries, LLM tools, and language models.
Every description ends up in two places at once:
- on the dataset's Arrow schema as
descriptionmetadata — readable from SQL viaobj_descriptionandcol_description; and - in the LLM tool context — surfaced by the built-in
list_datasetsandtable_schematools, so models see your descriptions automatically.
Defining a Semantic Model​
Semantic data models are defined within the spicepod.yaml file under the datasets section. Each dataset supports description, metadata, and a columns field where individual columns are described with metadata and features for utility and clarity.
Example Configuration​
datasets:
- name: taxi_trips
description: NYC taxi trip rides
metadata:
instructions: Always provide citations with reference URLs.
reference_url_template: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_<YYYY-MM>.parquet
columns:
- name: tpep_pickup_time
description: 'The time the passenger was picked up by the taxi'
- name: notes
description: 'Optional notes about the trip'
embeddings:
- from: hf_minilm # A defined Spice Model
chunking:
enabled: true
target_chunk_size: 512
overlap_size: 128
trim_whitespace: true
Dataset Metadata​
Datasets can be defined with the following metadata:
instructions: Optional. Instructions to provide to a language model when using this dataset.reference_url_template: Optional. A URL template for citation links.
Arbitrary additional keys may be added under metadata: and are passed through to the Arrow schema unchanged, so anything an LLM tool or downstream consumer expects to find there (e.g. governance labels, owning team, source-of-truth links) can be attached without code changes.
For detailed metadata configuration, see the Dataset Reference.
Column Definitions​
Each column can be defined with the following attributes:
description: Optional. A description of the column's contents and purpose.type(aliasdata_type): Optional. Declared column type (Postgres-style or Arrow display form).nullable: Optional. Override column nullability.embeddings: Optional. Vector embeddings configuration for this column.full_text_search: Optional. Full-text-search configuration.metadata: Optional. Arbitrary key/value metadata attached to the column.
For detailed columns configuration, see the Dataset Reference.
Source-Side Comments​
When a dataset is loaded from a source that exposes table or column comments, Spice automatically imports those comments into the Arrow schema metadata under the description key. This means database-native COMMENT ON TABLE and COMMENT ON COLUMN annotations show up alongside Spicepod-defined description values, giving the semantic model the same context that already lives in the source database.
Source-side comments are imported automatically from:
| Source | What is imported |
|---|---|
| PostgreSQL | Table and column comments via obj_description / col_description in pg_catalog. |
| MySQL | information_schema.tables.table_comment and information_schema.columns.column_comment. |
| Snowflake | information_schema.tables.comment and information_schema.columns.comment. |
| Databricks SQL Warehouse | Table and column comments returned by the SQL Warehouse driver's metadata API. |
| BigQuery (via ADBC catalog) | INFORMATION_SCHEMA.TABLE_OPTIONS (description) for the table and INFORMATION_SCHEMA.COLUMN_FIELD_PATHS.description for top-level columns. |
Connectors not listed above (object stores, file formats, HTTP, etc.) do not have a comment concept on the source side; for those datasets, attach descriptions in the Spicepod.
Precedence​
When both a Spicepod description and a source-side comment exist for the same table or column, the Spicepod value wins. This lets you override unhelpful or out-of-date source comments without round-tripping a DDL change through the upstream database.
If the Spicepod does not set a description, the source-side comment passes through unchanged.
Reading descriptions from SQL​
Two PostgreSQL-compatible scalar UDFs read the description metadata at query time:
-- Table-level description
SELECT obj_description('public.taxi_trips');
-- Column-level description (by name or 1-based ordinal)
SELECT col_description('public.taxi_trips', 'fare_amount');
SELECT col_description('public.taxi_trips', 3);
Full signatures, including PostgreSQL-shaped (oid, 'pg_class') variants and four-part (catalog, schema, table, column) forms, are documented in obj_description and col_description.
Because these UDFs read the same metadata that source-side comment extraction populates, queries against PostgreSQL or Snowflake datasets that were authored with COMMENT ON … will return those comments without any additional configuration.
How LLMs see the semantic model​
The built-in LLM tools surface descriptions and metadata into the model's context automatically.
list_datasets​
Returns every dataset, view, and catalog table visible to the runtime as JSON, with each entry's fully-qualified table name, description, metadata, columns, and capability flags (e.g. can_search_documents). Descriptions come from the Spicepod, falling back to source-side comments when no Spicepod description is set.
table_schema​
Returns one or more tables' columns as a markdown table. When called with output: full (the default), the rendered schema includes a Metadata: block at the top with the table-level description and any other dataset metadata keys, and a per-column Metadata cell containing the column's description and any column metadata. With output: minimal, only column names, types, and nullability are returned.
**Table: spice.public.taxi_trips**
Metadata:
description: NYC taxi trip rides
instructions: Always provide citations with reference URLs.
| Column | Sql Type | Arrow Type | Nullable | Metadata |
| ---------------- | --------- | ---------------------- | -------- | --------------------------------------------------- |
| tpep_pickup_time | TIMESTAMP | Timestamp(Microsecond) | true | description: The time the passenger was picked up |
| notes | VARCHAR | Utf8 | true | description: Optional notes about the trip |
Because both list_datasets and table_schema are part of the auto and all tool groups, no extra configuration is required — every model declared with tools: auto (or any explicit list that includes table_schema / list_datasets) gets the semantic model in its context.
When the Tool Registry is active, dataset descriptions also feed the hybrid search index used by tool_search to match user questions to relevant dataset-bound tools, so well-described datasets are easier for the model to discover at scale.
End-to-end flow​
spicepod.yaml description source COMMENT ON TABLE / COLUMN
| |
| (Spicepod wins on overlap) |
+---------------+----------------+
v
Arrow schema metadata (`description` key)
|
+---------------+-----------------+
v v v
obj_description table_schema list_datasets
col_description (LLM context) (LLM context)
(SQL)
