Parquet Files

Parquet is the recommended format for Agent Context. It’s self-describing (schema embedded in the file), columnar (fast aggregations), and compressed. No configuration needed — just point at the file.

When you connect a Parquet file (from S3, GCS, or local upload), Agent Context:

  1. Reads the embedded schema directly from the Parquet metadata
  2. Maps Parquet types to SQL types automatically
  3. Enables filter pushdown — queries that filter columns only read relevant row groups

No sampling, no inference, no guessing. The schema is exact.
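The type-mapping step above can be pictured as a simple lookup table. The sketch below is illustrative only — the SQL type names and the `sql_type` helper are assumptions for the example, not Agent Context's actual mapping:

```python
# Illustrative sketch: mapping Parquet/Arrow logical types to SQL types.
# The SQL names here are assumptions for the example, not the product's
# actual mapping table.
PARQUET_TO_SQL = {
    "Int32": "INTEGER",
    "Int64": "BIGINT",
    "Float64": "DOUBLE",
    "Utf8": "VARCHAR",
    "Boolean": "BOOLEAN",
    "Date32": "DATE",
    "Timestamp": "TIMESTAMP",
}

def sql_type(parquet_type: str) -> str:
    """Return the SQL type for a Parquet logical type, or raise if unmapped."""
    try:
        return PARQUET_TO_SQL[parquet_type]
    except KeyError:
        raise ValueError(f"unmapped Parquet type: {parquet_type}")
```

Because the schema is read from the file's own metadata, this lookup is exact — there is no sampling step that could misclassify a column.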

All standard Parquet/Arrow types are supported:

| Category  | Types |
| --------- | ----- |
| Numeric   | `Int8`, `Int16`, `Int32`, `Int64`, `Float32`, `Float64`, `Decimal128`, `Decimal256` |
| Text      | `Utf8` (String), `LargeUtf8` |
| Binary    | `Binary`, `LargeBinary` |
| Boolean   | `Boolean` |
| Date/Time | `Date32`, `Date64`, `Timestamp`, `Time32`, `Time64`, `Duration` |
| Complex   | `List`, `LargeList`, `Map`, `Struct` |

Parquet files work with zero configuration. Just select Parquet as the file format (or let auto-detection pick it up from the .parquet extension).

Parquet files get automatic optimizations:

  • Filter pushdown — WHERE clauses push down into Parquet row group/page filtering, skipping irrelevant data
  • Column pruning — SELECT col1, col2 only reads those columns from the file
  • Page index — row-level filtering using Parquet page indexes (enabled by default)
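The row-group skipping behind filter pushdown is easy to model: each row group stores min/max statistics per column, and any group whose range cannot satisfy the predicate is skipped without reading its data pages. A simplified stdlib-only model (not the engine's implementation):

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    """Simplified stand-in for one row group's column statistics."""
    min_val: int
    max_val: int

def groups_to_scan(groups: list[RowGroup], lower: int) -> list[RowGroup]:
    """Keep only row groups that could contain rows matching `value > lower`.

    A group whose max is <= lower cannot match the predicate, so the
    engine skips it entirely — no data pages are read for that group.
    """
    return [g for g in groups if g.max_val > lower]

groups = [RowGroup(0, 99), RowGroup(100, 199), RowGroup(200, 299)]
# Predicate `value > 150`: the first group (max 99) is pruned via
# statistics alone; only the other two groups are scanned.
survivors = groups_to_scan(groups, 150)
```

The page index takes the same idea one level finer, applying min/max pruning at the page level within each surviving row group.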

Hive-style partitioned directories (e.g., year=2024/month=01/data.parquet) are supported by SpiceD but not currently configurable through the UI. Point at a single Parquet file or a flat directory of Parquet files for the best experience.
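Hive-style layouts encode column values directly in directory names, which is what lets an engine prune whole directories before opening any file. A minimal parser for the convention (`parse_hive_partitions` is a hypothetical helper for illustration):

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a Hive-style path.

    e.g. "year=2024/month=01/data.parquet" -> {"year": "2024", "month": "01"}
    Segments without "=" (like the filename) are ignored.
    """
    parts: dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts
```

A query filtering on `year = 2024` only needs to descend into the `year=2024/` directory, skipping all other partitions.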

| Limitation | Details |
| ---------- | ------- |
| Complex types in queries | Struct, List, and Map columns are supported in the schema but may need explicit casting in some SQL contexts. |
| Hive partitioning not in UI | Partitioned directory structures are supported by the engine but not currently configurable through the UI. |

Best practices:

  • Use Parquet when possible. It’s faster, smaller, and schema-exact compared to CSV or JSON.
  • Use Snappy compression (default for most tools) — best speed/size tradeoff.
  • Partition large datasets by date or category for faster filtered queries.