arro3.io

infer_csv_schema builtin

infer_csv_schema(
    file: IO[bytes] | Path | str,
    *,
    has_header: bool | None = None,
    max_records: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> Schema

Infer a CSV file's schema
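
A minimal usage sketch ("data.csv" is a hypothetical file name assumed to have a header row):

    from arro3.io import infer_csv_schema

    # Sample up to the first 1000 records to infer column types.
    schema = infer_csv_schema("data.csv", has_header=True, max_records=1000)
    print(schema)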

infer_json_schema builtin

infer_json_schema(
    file: IO[bytes] | Path | str, *, max_records: int | None = None
) -> Schema

Infer a JSON file's schema
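
A minimal sketch, assuming a hypothetical "records.json" file:

    from arro3.io import infer_json_schema

    # Only the first 1000 records are sampled for type inference.
    schema = infer_json_schema("records.json", max_records=1000)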

read_csv builtin

read_csv(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    has_header: bool | None = None,
    batch_size: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> RecordBatchReader

Read a CSV file to an Arrow RecordBatchReader
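
A sketch that infers the schema first and then reuses it for reading ("data.csv" is a hypothetical file name):

    from arro3.io import infer_csv_schema, read_csv

    schema = infer_csv_schema("data.csv", has_header=True)
    reader = read_csv("data.csv", schema, has_header=True, batch_size=65_536)

The returned RecordBatchReader can be passed directly as the data argument of any of the write_* functions below.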

read_ipc builtin

read_ipc(file: IO[bytes] | Path | str) -> RecordBatchReader

Read an Arrow IPC file to an Arrow RecordBatchReader
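
A sketch that streams a hypothetical IPC file back out as Parquet, using only functions documented on this page:

    from arro3.io import read_ipc, write_parquet

    reader = read_ipc("table.arrow")        # hypothetical input file
    write_parquet(reader, "table.parquet")  # stream the batches into a Parquet file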

read_ipc_stream builtin

read_ipc_stream(file: IO[bytes] | Path | str) -> RecordBatchReader

Read an Arrow IPC Stream file to an Arrow RecordBatchReader
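
A sketch converting a hypothetical IPC stream file into the random-access IPC file format:

    from arro3.io import read_ipc_stream, write_ipc

    reader = read_ipc_stream("stream.arrows")  # hypothetical input file
    write_ipc(reader, "table.arrow")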

read_json builtin

read_json(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    batch_size: int | None = None
) -> RecordBatchReader

Read a JSON file to an Arrow RecordBatchReader
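
A sketch that infers the schema and then reads with it ("records.json" is a hypothetical file name):

    from arro3.io import infer_json_schema, read_json

    schema = infer_json_schema("records.json")
    reader = read_json("records.json", schema, batch_size=10_000)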

read_parquet builtin

read_parquet(file: Path | str) -> RecordBatchReader

Read a Parquet file to an Arrow RecordBatchReader

Parameters:

  • file (Path | str) –

    The path to the Parquet file to read.

Returns:

  A RecordBatchReader over the file's record batches.
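
A sketch re-exporting a hypothetical Parquet file as an Arrow IPC file:

    from arro3.io import read_parquet, write_ipc

    reader = read_parquet("table.parquet")  # hypothetical input file
    write_ipc(reader, "table.arrow")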

write_csv builtin

write_csv(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    header: bool | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    date_format: str | None = None,
    datetime_format: str | None = None,
    time_format: str | None = None,
    timestamp_format: str | None = None,
    timestamp_tz_format: str | None = None,
    null: str | None = None
) -> None

Write an Arrow Table or stream to a CSV file
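
A hypothetical round trip that streams a Parquet file out as CSV (file names are illustrative):

    from arro3.io import read_parquet, write_csv

    reader = read_parquet("table.parquet")
    write_csv(reader, "table.csv", header=True, delimiter=",", null="NULL")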

write_ipc builtin

write_ipc(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
) -> None

Write an Arrow Table or stream to an IPC File
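
A sketch converting a hypothetical CSV file into an Arrow IPC file:

    from arro3.io import infer_csv_schema, read_csv, write_ipc

    schema = infer_csv_schema("data.csv", has_header=True)
    write_ipc(read_csv("data.csv", schema, has_header=True), "data.arrow")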

write_ipc_stream builtin

write_ipc_stream(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
) -> None

Write an Arrow Table or stream to an IPC Stream
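
A sketch writing to an open binary file object, since file also accepts IO[bytes] (file names are illustrative):

    from arro3.io import read_parquet, write_ipc_stream

    reader = read_parquet("table.parquet")
    with open("table.arrows", "wb") as f:
        write_ipc_stream(reader, f)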

write_json builtin

write_json(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None

Write an Arrow Table or stream to a JSON file
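
A minimal sketch with hypothetical file names:

    from arro3.io import read_parquet, write_json

    reader = read_parquet("table.parquet")
    # explicit_nulls=True writes null fields rather than omitting them.
    write_json(reader, "table.json", explicit_nulls=True)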

write_ndjson builtin

write_ndjson(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None

Write an Arrow Table or stream to a newline-delimited JSON file
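
A sketch converting a hypothetical CSV file to newline-delimited JSON, one object per row:

    from arro3.io import infer_csv_schema, read_csv, write_ndjson

    schema = infer_csv_schema("data.csv", has_header=True)
    write_ndjson(read_csv("data.csv", schema, has_header=True), "data.ndjson")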

write_parquet builtin

write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression]
    | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None
) -> None

Write an Arrow Table or stream to a Parquet file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow table, record batch, or stream to write.

  • file (IO[bytes] | Path | str) –

    The output path or writable binary file object.

Other Parameters:

  • bloom_filter_enabled (bool | None) –

Sets whether the bloom filter is enabled by default for all columns (defaults to false).

  • bloom_filter_fpp (float | None) –

    Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).

  • bloom_filter_ndv (int | None) –

Sets the default number of distinct values (ndv) for the bloom filter for all columns (defaults to 1_000_000).

  • column_compression (dict[ParquetColumnPath, ParquetCompression] | None) –

    Sets compression codec for a specific column. Takes precedence over compression.

  • column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) –

    Sets flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.

  • column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) –

    Sets encoding for a specific column. Takes precedence over encoding.

  • column_max_statistics_size (dict[ParquetColumnPath, int] | None) –

    Sets max size for statistics for a specific column. Takes precedence over max_statistics_size.

  • compression (ParquetCompression | None) –

Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".

  • created_by (str | None) –

    Sets "created by" property (defaults to parquet-rs version <VERSION>).

  • data_page_row_count_limit (int | None) –

Sets the best-effort maximum number of rows in a data page (defaults to 20_000).

    The parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

Note: this is a best-effort limit based on the value of write_batch_size.

  • data_page_size_limit (int | None) –

Sets the best-effort maximum size of a data page in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the sizes of each DataPage to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

Note: this is a best-effort limit based on the value of write_batch_size.

  • dictionary_enabled (bool | None) –

    Sets default flag to enable/disable dictionary encoding for all columns (defaults to True).

  • dictionary_page_size_limit (int | None) –

Sets the best-effort maximum dictionary page size, in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

Note: this is a best-effort limit based on the value of write_batch_size.

  • encoding (ParquetEncoding | None) –

    Sets default encoding for all columns.

If dictionary encoding is not enabled, this is treated as the primary encoding for all columns. When dictionary encoding is enabled for a column, this value is used as the fallback encoding for that column.

  • key_value_metadata (dict[str, str] | None) –

    Sets "key_value_metadata" property (defaults to None).

  • max_row_group_size (int | None) –

    Sets maximum number of rows in a row group (defaults to 1024 * 1024).

  • max_statistics_size (int | None) –

    Sets default max statistics size for all columns (defaults to 4096).

  • write_batch_size (int | None) –

    Sets write batch size (defaults to 1024).

    For performance reasons, data for each column is written in batches of this size.

Additional limits such as data_page_row_count_limit are checked between batches, so the write batch size acts as an upper bound on the enforcement granularity of the other limits.

  • writer_version (Literal['parquet_1_0', 'parquet_2_0'] | None) –

    Sets the WriterVersion written into the parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
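
A sketch exercising a few of the options above ("table.arrow" and "table.parquet" are hypothetical file names; the compression string format is described under compression):

    from arro3.io import read_ipc, write_parquet

    reader = read_ipc("table.arrow")
    write_parquet(
        reader,
        "table.parquet",
        compression="zstd(3)",  # custom compression level, as described above
        max_row_group_size=100_000,
        key_value_metadata={"written_by": "arro3 example"},
        writer_version="parquet_2_0",
    )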