Observability & Monitoring

Spice can be monitored using the Spice Prometheus-compatible Metrics Endpoint. Spice also supports distributed tracing by integrating with Zipkin and compatible tracing systems.

Monitoring clients configuration:

Spice Metrics Endpoint Configuration

The metrics endpoint uses port 9090 by default. The address and port it is bound to are logged at startup:

2024-11-28T19:48:10.942003Z  INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090

Pass the --metrics parameter to bind to a specific address and port. For example, to bind to port 9091 on all interfaces:

spiced --metrics 0.0.0.0:9091

or when using Docker:

FROM spiceai/spiceai:latest

# Docker configuration ...

# Configure the metrics endpoint on port 9090
CMD ["--metrics", "0.0.0.0:9090"]
EXPOSE 9090

Configuration of the metrics endpoint can be verified using an HTTP GET request, for example:

curl http://localhost:9090/metrics

# HELP runtime_flight_server_started Indicates the runtime Flight server has started.
# TYPE runtime_flight_server_started counter
runtime_flight_server_started 1
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1

# HELP dataset_load_state Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_load_state{dataset="taxi_trips_accelerated"} 2

# HELP dataset_active_count Number of currently loaded datasets.
# TYPE dataset_active_count gauge
dataset_active_count{engine="None"} 1
dataset_active_count{engine="duckdb"} 1
...
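
The response above is plain Prometheus-style text, so a monitoring script can read it with standard-library code. The following is a minimal sketch in Python and not part of Spice: the helper names parse_metrics and fetch_metrics are hypothetical, and the parser only handles simple `name{labels} value` samples like the ones shown above (no escaped quotes or timestamps).

```python
import re
import urllib.request

# Matches sample lines such as: dataset_load_state{dataset="taxi_trips"} 2
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple metric samples into {(name, labels): value}.

    labels is a sorted tuple of (key, value) pairs. Comment lines
    (# HELP, # TYPE) and lines that do not look like samples are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # for example, the "..." elision in the sample output
        name, raw_labels, raw_value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        try:
            samples[(name, labels)] = float(raw_value)
        except ValueError:
            continue  # value is not numeric, ignore the line
    return samples


def fetch_metrics(url="http://localhost:9090/metrics"):
    """Fetch and parse the endpoint. Assumes a runtime is listening at url."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# A shortened copy of the output shown above, used so the sketch runs offline.
SAMPLE = """\
# HELP runtime_http_server_started Indicates the runtime HTTP server has started.
# TYPE runtime_http_server_started counter
runtime_http_server_started 1
# TYPE dataset_load_state gauge
dataset_load_state{dataset="taxi_trips"} 2
dataset_active_count{engine="duckdb"} 1
...
"""

if __name__ == "__main__":
    for (name, labels), value in sorted(parse_metrics(SAMPLE).items()):
        print(name, dict(labels), value)
```

Keying each sample by name plus sorted labels keeps series such as dataset_load_state for different datasets separate. For production scraping, a Prometheus server or an established client library is the more robust choice.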

Metrics

Each entry below lists the metric name, its type in parentheses, and a description.
accelerated_ready_state_federated_fallback
(count)
Number of times the federated table was queried because the accelerated table was still loading its initial data.
accelerated_zero_results_federated_fallback
(count)
Number of times the federated table was queried due to the accelerated table returning zero results.
ai_inferences_with_spice_count
(count)
Number of AI inferences performed with Spice.
catalog_load_errors
(count)
Number of errors loading the catalog provider.
catalog_load_state
(gauge)
Status of the catalog provider. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
component_metric_registered_count
(gauge)
Number of currently registered component metrics.
dataset_acceleration_ingestion_lag_ms
(gauge)
Lag between the current wall-clock time and the maximum time_column value after the refresh operation, in milliseconds.
dataset_acceleration_last_refresh_time_ms
(gauge)
Unix timestamp in milliseconds when the last refresh completed.
dataset_acceleration_max_timestamp_after_refresh_ms
(gauge)
Maximum value of the dataset's time_column after the refresh operation, in milliseconds.
dataset_acceleration_max_timestamp_before_refresh_ms
(gauge)
Maximum value of the dataset's time_column before the refresh operation, in milliseconds.
dataset_acceleration_refresh_data_fetches_skipped
(count)
Number of refresh data fetches skipped due to unchanged file metadata.
dataset_acceleration_refresh_duration_ms
(histogram)
Duration in milliseconds to load the data for a full or append refresh.
dataset_acceleration_refresh_errors
(count)
Number of errors refreshing the dataset.
dataset_acceleration_refresh_lag_ms
(gauge)
Difference between the maximum time_column value after and before the refresh operation, in milliseconds.
dataset_acceleration_refresh_worker_panics
(count)
Number of times a refresh worker panicked while refreshing a dataset.
dataset_acceleration_snapshot_bootstrap_bytes
(gauge)
Number of bytes downloaded when bootstrapping the acceleration from a snapshot.
dataset_acceleration_snapshot_bootstrap_checksum
(gauge)
Checksum of the snapshot downloaded during bootstrap (emitted with checksum attribute).
dataset_acceleration_snapshot_bootstrap_duration_ms
(count)
Time in milliseconds taken to download the snapshot used to bootstrap acceleration.
dataset_acceleration_snapshot_failure_count
(count)
Number of failures encountered while writing snapshots.
dataset_acceleration_snapshot_write_bytes
(gauge)
Number of bytes written for the most recent snapshot.
dataset_acceleration_snapshot_write_checksum
(gauge)
Checksum of the most recent snapshot write (emitted with checksum attribute).
dataset_acceleration_snapshot_write_duration_ms
(histogram)
Time in milliseconds taken to write the latest snapshot to object storage.
dataset_acceleration_snapshot_write_timestamp
(gauge)
Unix timestamp (seconds) when the most recent snapshot write completed.
dataset_active_count
(gauge)
Number of currently loaded datasets.
dataset_load_errors
(count)
Number of errors loading the dataset.
dataset_load_state
(gauge)
Status of the dataset. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
dataset_unavailable_time_ms
(gauge)
Time at which the dataset went offline, in milliseconds.
embeddings_active_count
(gauge)
Number of currently loaded embeddings.
embeddings_cache_evictions
(count)
Number of cache evictions.
embeddings_cache_hit_ratio
(gauge)
Cache hit ratio (hits / total requests).
embeddings_cache_hits
(count)
Cache hit count.
embeddings_cache_items_count
(gauge)
Number of items currently in the cache.
embeddings_cache_max_size_bytes
(gauge)
Maximum allowed size of the cache in bytes.
embeddings_cache_misses
(count)
Cache miss count.
embeddings_cache_requests
(count)
Number of requests to get a key from the cache.
embeddings_cache_size_bytes
(gauge)
Size of the cache in bytes.
embeddings_cache_stale_swr_count
(count)
Number of stale-while-revalidate background refreshes skipped due to existing in-flight revalidation.
embeddings_cache_swr_background_query_count
(count)
Number of background queries triggered for stale-while-revalidate cache refreshes.
embeddings_failures
(count)
Number of embedding failures.
embeddings_internal_request_duration_ms
(histogram)
The duration of running one or more embeddings internally, in milliseconds.
embeddings_load_errors
(count)
Number of errors loading the embedding.
embeddings_load_state
(gauge)
Status of the embedding. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
embeddings_requests
(count)
Number of embedding requests.
flight_do_exchange_data_updates_sent
(count)
Number of data updates sent via DoExchange.
flight_request_duration_ms
(histogram)
Measures the duration of Flight requests in milliseconds.
flight_requests
(count)
Total number of Flight requests.
http_requests
(count)
Number of HTTP requests.
http_requests_duration_ms
(histogram)
Measures the duration of HTTP requests in milliseconds.
llm_failures
(count)
Number of LLM failures.
llm_internal_request_duration_ms
(histogram)
The duration of running an LLM request internally, in milliseconds.
llm_load_state
(gauge)
Status of the LLM model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
llm_requests
(count)
Number of LLM requests.
model_active_count
(gauge)
Number of currently loaded models.
model_load_duration_ms
(histogram)
Duration in milliseconds to load the model.
model_load_errors
(count)
Number of errors loading the model.
model_load_state
(gauge)
Status of the model. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
query_active_count
(histogram)
Number of concurrent top-level queries actively being processed in the runtime. Includes the protocol dimension (http, flight, flightsql, internal) to indicate the query type.
query_duration_ms
(histogram)
The total amount of time spent planning and executing queries in milliseconds.
query_execution_duration_ms
(histogram)
The total amount of time spent only executing queries, in milliseconds (0 for cached queries).
query_executions
(count)
Number of query executions.
query_failures
(count)
Number of query failures.
query_processed_bytes
(count)
Number of bytes processed by the runtime.
query_produced_spills
(count)
Number of spills produced by the query.
query_returned_bytes
(count)
Number of bytes returned to query clients.
query_returned_rows
(histogram)
Number of rows returned to query clients.
query_spilled_bytes
(count)
Number of spilled bytes produced by the query.
query_spilled_rows
(count)
Number of spilled rows produced by the query.
results_cache_evictions
(count)
Number of cache evictions.
results_cache_hit_ratio
(gauge)
Cache hit ratio (hits / total requests).
results_cache_hits
(count)
Cache hit count.
results_cache_items_count
(gauge)
Number of items currently in the cache.
results_cache_max_size_bytes
(gauge)
Maximum allowed size of the cache in bytes.
results_cache_misses
(count)
Cache miss count.
results_cache_requests
(count)
Number of requests to get a key from the cache.
results_cache_size_bytes
(gauge)
Size of the cache in bytes.
results_cache_stale_swr_count
(count)
Number of stale-while-revalidate background refreshes skipped due to existing in-flight revalidation.
results_cache_swr_background_query_count
(count)
Number of background queries triggered for stale-while-revalidate cache refreshes.
runtime_flight_server_started
(count)
Indicates the runtime Flight server has started.
runtime_http_server_started
(count)
Indicates the runtime HTTP server has started.
search_results_cache_evictions
(count)
Number of cache evictions.
search_results_cache_hit_ratio
(gauge)
Cache hit ratio (hits / total requests).
search_results_cache_hits
(count)
Search cache hit count.
search_results_cache_items_count
(gauge)
Number of items currently in the search cache.
search_results_cache_max_size_bytes
(gauge)
Maximum allowed size of the search cache in bytes.
search_results_cache_misses
(count)
Cache miss count.
search_results_cache_requests
(count)
Number of requests to get a key from the search cache.
search_results_cache_size_bytes
(gauge)
Size of the search cache in bytes.
search_results_cache_stale_swr_count
(count)
Number of stale-while-revalidate background refreshes skipped due to existing in-flight revalidation.
search_results_cache_swr_background_query_count
(count)
Number of background queries triggered for stale-while-revalidate cache refreshes.
secrets_store_load_duration_ms
(histogram)
Duration in milliseconds to load the secret stores.
tool_active_count
(gauge)
Number of currently loaded LLM tools.
tool_load_errors
(count)
Number of errors loading the LLM tool.
tool_load_state
(gauge)
Status of the LLM tools. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
view_load_errors
(count)
Number of errors loading the view.
view_load_state
(gauge)
Status of the views. 0=Initializing, 1=Ready, 2=Disabled, 3=Error, 4=Refreshing, 5=ShuttingDown.
worker_active_count
(gauge)
Number of currently loaded workers.
workers_load_duration_ms
(histogram)
Duration in milliseconds to load the worker.
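
The *_load_state gauges listed above share one numeric encoding, and the *_cache_hit_ratio gauges are described as hits divided by total requests. A minimal sketch of how a monitoring client might interpret these values follows. The names LoadState, describe_load_state, and cache_hit_ratio are hypothetical helpers, not part of Spice, and returning 0.0 when there are no requests is an assumption made here to avoid division by zero.

```python
from enum import IntEnum


class LoadState(IntEnum):
    """Numeric encoding documented for the *_load_state gauges."""
    INITIALIZING = 0
    READY = 1
    DISABLED = 2
    ERROR = 3
    REFRESHING = 4
    SHUTTING_DOWN = 5


def describe_load_state(value):
    """Map a gauge value such as 1.0 to a state name, or 'UNKNOWN'."""
    try:
        return LoadState(int(value)).name
    except (ValueError, OverflowError):
        # Value outside 0..5, NaN, or infinity.
        return "UNKNOWN"


def cache_hit_ratio(hits, requests):
    """Recompute a hit ratio from the *_cache_hits and *_cache_requests counters.

    Assumption: report 0.0 when no requests have been recorded yet.
    """
    if requests <= 0:
        return 0.0
    return hits / requests


if __name__ == "__main__":
    print(describe_load_state(2))   # state name for gauge value 2
    print(cache_hit_ratio(30, 40))  # ratio from two counter readings
```

Recomputing the ratio from the counters can be useful over a time window, since the gauge only reports the current overall value.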

Component Metrics

In addition to these core metrics, individual components can expose their own metrics. For example, the MySQL data connector exposes connection pool metrics. See Component Metrics for more information.