Parquet Files

Parquet is the recommended format for Agent Context. It’s self-describing (schema embedded in the file), columnar (fast aggregations), and compressed. No configuration needed — just point at the file.

When you connect a Parquet file (from S3, GCS, or local upload), Agent Context:

  1. Reads the embedded schema directly from the Parquet metadata
  2. Maps Parquet types to SQL types automatically
  3. Enables filter pushdown — queries that filter columns only read relevant row groups

No sampling, no inference, no guessing. The schema is exact.
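The type-mapping step above can be pictured as a simple lookup table. The sketch below is illustrative only — the SQL type names and the `sql_type` helper are assumptions for the example, not Agent Context's actual mapping:

```python
# Illustrative sketch: mapping Parquet/Arrow logical types to SQL types.
# The SQL names here are assumptions for the example, not the product's
# actual mapping table.
PARQUET_TO_SQL = {
    "Int32": "INTEGER",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
    "Utf8": "VARCHAR",
    "Boolean": "BOOLEAN",
    "Date32": "DATE",
    "Timestamp": "TIMESTAMP",
}

def sql_type(parquet_type: str) -> str:
    """Return the SQL type for a Parquet logical type, or raise if unmapped."""
    try:
        return PARQUET_TO_SQL[parquet_type]
    except KeyError:
        raise ValueError(f"unmapped Parquet type: {parquet_type}")
```

Because the schema is read from the file's own metadata, this lookup is exact — there is no sampling step that could misclassify a column.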

All standard Parquet/Arrow types are supported:

| Category  | Types |
| --------- | ----- |
| Numeric   | `Int8`, `Int16`, `Int32`, `Int64`, `Float32`, `Float64`, `Decimal128`, `Decimal256` |
| Text      | `Utf8` (String), `LargeUtf8` |
| Binary    | `Binary`, `LargeBinary` |
| Boolean   | `Boolean` |
| Date/Time | `Date32`, `Date64`, `Timestamp`, `Time32`, `Time64`, `Duration` |
| Complex   | `List`, `LargeList`, `Map`, `Struct` |

Parquet files work with zero configuration. Just select Parquet as the file format (or let auto-detection pick it up from the .parquet extension).

Parquet files get automatic optimizations:

  • Filter pushdown — WHERE clauses push down into Parquet row group/page filtering, skipping irrelevant data
  • Column pruning — SELECT col1, col2 only reads those columns from the file
  • Page index — row-level filtering using Parquet page indexes (enabled by default)
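The row-group skipping behind filter pushdown is easy to model: each row group stores min/max statistics per column, and any group whose range cannot satisfy the predicate is skipped without reading its data pages. A simplified stdlib-only model (not the engine's implementation):

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Simplified stand-in for one row group's column statistics."""
    min_val: int
    max_val: int

def groups_to_scan(groups: list[RowGroup], lower: int) -> list[RowGroup]:
    """Keep only row groups that could contain rows matching `value > lower`.

    A group whose max is <= lower cannot match the predicate, so the
    engine skips it entirely — no data pages are read for that group.
    """
    return [g for g in groups if g.max_val > lower]

groups = [RowGroup(0, 99), RowGroup(100, 199), RowGroup(200, 299)]
# Predicate `value > 150`: the first group (max 99) is pruned via
# statistics alone; only the other two groups are scanned.
survivors = groups_to_scan(groups, 150)
```

The page index takes the same idea one level finer, applying min/max pruning at the page level within each surviving row group.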

Hive-style partitioned directories (e.g., year=2024/month=01/data.parquet) are supported by SpiceD but not currently configurable through the UI. Point at a single Parquet file or a flat directory of Parquet files for the best experience.
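Hive-style layouts encode column values directly in directory names, which is what lets an engine prune whole directories before opening any file. A minimal parser for the convention (`parse_hive_partitions` is a hypothetical helper for illustration):

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a Hive-style path.

    e.g. "year=2024/month=01/data.parquet" -> {"year": "2024", "month": "01"}
    Segments without "=" (like the filename) are ignored.
    """
    parts: dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts
```

A query filtering on `year = 2024` only needs to descend into the `year=2024/` directory, skipping all other partitions.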

| Limitation | Details |
| ---------- | ------- |
| Complex types in queries | Struct, List, and Map columns are supported in the schema but may need explicit casting in some SQL contexts. |
| Hive partitioning not in UI | Partitioned directory structures are supported by the engine but not currently configurable through the UI. |

Best practices:

  • Use Parquet when possible. It’s faster, smaller, and schema-exact compared to CSV or JSON.
  • Use Snappy compression (default for most tools) — best speed/size tradeoff.
  • Partition large datasets by date or category for faster filtered queries.