Skip to content

Conversation

@AlyAbdelmoneim
Copy link

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

change catalog.format from string to CatalogFormat.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

- Add CatalogFormat enum with valid format values (CSV, PARQUET, JSON, NDJSON, AVRO, ARROW)
- Implement FromStr for CatalogFormat with validation and clear error messages
- Implement Display and ConfigField traits for CatalogFormat
- Update CatalogOptions.format from Option<String> to Option<CatalogFormat>
- Update session_state_defaults.rs to work with enum
- Add test for catalog format validation
@github-actions github-actions bot added the common Related to common crate label Dec 26, 2025
@AlyAbdelmoneim
Copy link
Author

@Jefffrey Hey, I opened this PR to validate catalog format, since I didn't want to have all validations in one PR as you requested. I think it's important you review that my assumption for the default is correct.

In datafusion/common/src/format.rs:323:

impl Default for CatalogFormat { fn default() -> Self { CatalogFormat::CSV } }

This is needed because Option<CatalogFormat> requires CatalogFormat: Default for the ConfigField implementation, though the actual field default is None. Please confirm if CSV is the right default value here.

if this PR is good I will proceede with the other validations.
Thank you !

@Jefffrey
Copy link
Contributor

I don't know if this config field is one we should be tackling. I don't think we can ever have an exhaustive list of valid values here if we consider user extensibility?

@AlyAbdelmoneim
Copy link
Author

I don't know if this config field is one we should be tackling. I don't think we can ever have an exhaustive list of valid values here if we consider user extensibility?

Oh you are right! I just wasn't aware of that use case where users can register custom table provider factories. I saw the default_table_factories() in session_state_defaults.rs and thought those were the only valid variants. I'll revert this change.

For my next PR, I'll be working on validating ParquetOptions fields. Here's what I plan to validate:

Fields to validate with enum wrappers (following the pattern from PR #17549):

  1. writer_version: StringDFWriterVersion enum

    • Valid values: "1.0", "2.0"
    • Maps to parquet::file::properties::WriterVersion
    • Files: datafusion/common/src/config.rs (update field type), new enum in datafusion/common/src/format.rs (or separate file)
  2. coerce_int96: Option<String>Option<DFTimeUnit> enum

    • Valid values: "ns", "us", "ms", "s"
    • Maps to arrow::datatypes::TimeUnit
    • Files: datafusion/common/src/config.rs, enum in datafusion/common/src/format.rs
    • Reference: parse_coerce_int96_string() in datafusion/datasource-parquet/src/source.rs:497
  3. statistics_enabled: Option<String>Option<DFEnabledStatistics> enum

    • Valid values: "none", "chunk", "page"
    • Maps to parquet::file::properties::EnabledStatistics
    • Files: datafusion/common/src/config.rs, enum in datafusion/common/src/format.rs
    • Reference: parse_statistics_string() in datafusion/common/src/file_options/parquet_writer.rs:388
  4. encoding: Option<String>Option<DFEncoding> enum

    • Valid values: "plain", "plain_dictionary", "rle", "bit_packed", "delta_binary_packed", "delta_length_byte_array", "delta_byte_array", "rle_dictionary", "byte_stream_split"
    • Maps to parquet::basic::Encoding
    • Files: datafusion/common/src/config.rs, enum in datafusion/common/src/format.rs

And I have a question for this field

  1. compression: Option<String> → Format pattern validation
    • Pattern: codec_name or codec_name(level) where:
      • Codecs requiring levels: gzip(level), brotli(level), zstd(level)
      • Codecs without levels: uncompressed, snappy, lzo, lz4, lz4_raw
    • This is more complex than a simple enum due to the pattern and level requirements
    • Reference: parse_compression_string() in datafusion/common/src/file_options/parquet_writer.rs:323
    • Question: Should I validate the compression format pattern at config time, or would you prefer a different approach? The current implementation only validates when actually writing parquet files, which means invalid values like "invalid_codec" or "snappy(2)" (snappy doesn't support levels) are accepted at SET time and only fail later.

if you think this are the edits we want, I will create a new PR for these changes.
@Jefffrey please review when you have time, thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants