arro3.io

ParquetColumnPath module-attribute

ParquetColumnPath = str | Sequence[str]

Allowed types to refer to a Parquet column.

ParquetCompression module-attribute

ParquetCompression = (
    Literal[
        "uncompressed",
        "snappy",
        "gzip",
        "lzo",
        "brotli",
        "lz4",
        "zstd",
        "lz4_raw",
    ]
    | str
)

Allowed compression schemes for Parquet.

ParquetEncoding module-attribute

ParquetEncoding = Literal[
    "plain",
    "plain_dictionary",
    "rle",
    "bit_packed",
    "delta_binary_packed",
    "delta_length_byte_array",
    "delta_byte_array",
    "rle_dictionary",
    "byte_stream_split",
]

Allowed Parquet encodings.
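
For illustration, a minimal sketch of values these aliases describe, as they might be passed to the per-column options of write_parquet documented below (the column names are hypothetical):

# Hypothetical per-column settings keyed by ParquetColumnPath (plain column
# names here; nested columns can be addressed with a sequence of path segments).
column_compression = {
    "id": "snappy",
    "name": "zstd(3)",  # ParquetCompression as a string with an explicit level
}
column_encoding = {
    "id": "delta_binary_packed",  # one of the ParquetEncoding literals
}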

ArrowArrayExportable

Bases: Protocol

An object with an __arrow_c_array__ method.

Supported objects include:

  • arro3 Array or RecordBatch objects.
  • pyarrow Array or RecordBatch objects.

Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
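
For example, a minimal sketch (pyarrow is used only to produce an object implementing the protocol, and the file name is arbitrary):

import pyarrow as pa

from arro3.io import write_parquet

# A pyarrow RecordBatch exposes __arrow_c_array__, so it satisfies
# ArrowArrayExportable and can be passed to write_parquet (documented below).
batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})
assert hasattr(batch, "__arrow_c_array__")

write_parquet(batch, "batch.parquet")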

ArrowSchemaExportable

Bases: Protocol

An object with an __arrow_c_schema__ method.

Supported objects include:

  • arro3 Schema, Field, or DataType objects.
  • pyarrow Schema, Field, or DataType objects.

Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.

ArrowStreamExportable

Bases: Protocol

An object with an __arrow_c_stream__ method.

Supported objects include:

  • arro3 Table, RecordBatchReader, ChunkedArray, or ArrayReader objects.
  • Polars Series or DataFrame objects (polars v1.2 or higher).
  • pyarrow RecordBatchReader, Table, or ChunkedArray objects (pyarrow v14 or higher).
  • pandas DataFrames (pandas v2.2 or higher).
  • ibis Table objects.

For an up-to-date list of supported objects, see this issue.

Such an object implements the Arrow C Stream interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
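
For example, a sketch assuming pyarrow v14+ and polars v1.2+ are installed:

import polars as pl
import pyarrow as pa

# Tables and DataFrames from different libraries all expose
# __arrow_c_stream__, so any of them can be handed to a consumer such as
# write_parquet (documented below).
pa_table = pa.table({"x": [1, 2, 3]})
pl_df = pl.DataFrame({"x": [1, 2, 3]})

assert hasattr(pa_table, "__arrow_c_stream__")
assert hasattr(pl_df, "__arrow_c_stream__")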

RecordBatchReader

An Arrow RecordBatchReader.

A RecordBatchReader holds a stream of RecordBatch.

closed property

closed: bool

Returns true if this reader has already been consumed.

schema property

schema: Schema

Access the schema of this RecordBatchReader.

__arrow_c_stream__

__arrow_c_stream__(requested_schema: object | None = None) -> object

An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.

For example, you can call pyarrow.RecordBatchReader.from_stream to convert this stream to a pyarrow RecordBatchReader. Alternatively, you can call pyarrow.table() to consume this stream into a pyarrow Table, or Table.from_arrow() to consume it into an arro3 Table.
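
A sketch of both conversions (assuming example.parquet exists and a pyarrow version that provides RecordBatchReader.from_stream):

import pyarrow as pa

from arro3.io import read_parquet

reader = read_parquet("example.parquet")

# Convert to a pyarrow RecordBatchReader; the stream is handed over zero-copy.
pa_reader = pa.RecordBatchReader.from_stream(reader)

# Alternatively, materialize everything at once (a stream can only be
# consumed once, hence the second read_parquet call):
pa_table = pa.table(read_parquet("example.parquet"))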

from_arrow classmethod

Construct this from an existing Arrow object.

It can be called on anything that exports the Arrow stream interface (has an __arrow_c_stream__ method), such as a Table or RecordBatchReader.

from_arrow_pycapsule classmethod

from_arrow_pycapsule(capsule) -> RecordBatchReader

Construct this object from a bare Arrow PyCapsule.

Schema

An Arrow Schema.

metadata property

metadata: dict[bytes, bytes]

The schema's metadata.

metadata_str property

metadata_str: dict[str, str]

The schema's metadata where keys and values are str, not bytes.

names property

names: list[str]

The schema's field names.

types property

types: list[DataType]

The schema's field types.

__arrow_c_schema__

__arrow_c_schema__() -> object

An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.

For example, you can call pyarrow.schema() to convert this schema into a pyarrow schema, without copying memory.
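
As a sketch (the file name is arbitrary):

import pyarrow as pa

from arro3.io import read_parquet

schema = read_parquet("example.parquet").schema

# pyarrow.schema() understands __arrow_c_schema__, so no memory is copied.
pa_schema = pa.schema(schema)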

append

append(field: ArrowSchemaExportable) -> Schema

Append a field at the end of the schema.

In contrast to Python's list.append(), this returns a new object, leaving the original Schema unmodified.

Parameters:

  • field (ArrowSchemaExportable) –

    The field to append.

Returns:

  • Schema

    A new schema with the field appended.
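
For example, a sketch (the arro3.core import location for Schema is assumed; pyarrow objects are used because any ArrowSchemaExportable is accepted):

import pyarrow as pa

from arro3.core import Schema  # assumed import location

schema = Schema.from_arrow(pa.schema([("x", pa.int64()), ("y", pa.utf8())]))

# append returns a new Schema; the original is left unmodified.
appended = schema.append(pa.field("z", pa.float64()))
assert appended.names == ["x", "y", "z"]
assert schema.names == ["x", "y"]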

empty_table

empty_table() -> Table

Provide an empty table according to the schema.

Returns:

  • Table

    An empty table with this schema.

equals

equals(other: ArrowSchemaExportable) -> bool

Test if this schema is equal to the other.

Parameters:

  • other (ArrowSchemaExportable) –

    The schema to compare against.

Returns:

  • bool

    True if the two schemas are equal.

field

field(i: int | str) -> Field

Select a field by its column name or numeric index.

Parameters:

  • i (int | str) –

    The column name or numeric index of the field to select.

Returns:

  • Field

    The selected field.

from_arrow classmethod

from_arrow(input: ArrowSchemaExportable) -> Schema

Construct this from an existing Arrow object.

Parameters:

  • input (ArrowSchemaExportable) –

    An object exporting an Arrow schema (has an __arrow_c_schema__ method).

Returns:

  • Schema

    A new Schema.

from_arrow_pycapsule classmethod

from_arrow_pycapsule(capsule) -> Schema

Construct this object from a bare Arrow PyCapsule.

get_all_field_indices

get_all_field_indices(name: str) -> list[int]

Return sorted list of indices for the fields with the given name.

Parameters:

  • name (str) –

    The field name to look up.

Returns:

  • list[int]

    The sorted indices of all fields with the given name.

get_field_index

get_field_index(name: str) -> int

Return index of the unique field with the given name.

Parameters:

  • name (str) –

    The field name to look up.

Returns:

  • int

    The index of the unique field with the given name.

insert

insert(i: int, field: ArrowSchemaExportable) -> Schema

Add a field at position i to the schema.

Parameters:

  • i (int) –

    The position at which to insert the field.

  • field (ArrowSchemaExportable) –

    The field to insert.

Returns:

  • Schema

    A new schema with the field inserted at position i.

remove

remove(i: int) -> Schema

Remove the field at index i from the schema.

Parameters:

  • i (int) –

    The index of the field to remove.

Returns:

  • Schema

    A new schema without the field at index i.

remove_metadata

remove_metadata() -> Schema

Create a new schema without metadata, if any.

Returns:

  • Schema

    A new schema without metadata.

set

set(i: int, field: ArrowSchemaExportable) -> Schema

Replace a field at position i in the schema.

Parameters:

  • i (int) –

    The position of the field to replace.

  • field (ArrowSchemaExportable) –

    The new field.

Returns:

  • Schema

    A new schema with the field at position i replaced.

with_metadata

with_metadata(metadata: dict[str, str] | dict[bytes, bytes]) -> Schema

Add metadata as a dict of string (or bytes) keys and values to the Schema.

Parameters:

  • metadata (dict[str, str] | dict[bytes, bytes]) –

    The metadata to attach to the schema.

Returns:

  • Schema

    A new schema with the metadata attached.

read_parquet

read_parquet(file: Path | str) -> RecordBatchReader

Read a Parquet file to an Arrow RecordBatchReader.

Parameters:

  • file (Path | str) –

    The path to the Parquet file to read.

Returns:

  • RecordBatchReader

    A stream of record batches read from the file.
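
For example, a sketch (the file name is arbitrary and the arro3.core import location for Table is assumed):

from arro3.core import Table  # assumed import location
from arro3.io import read_parquet

reader = read_parquet("example.parquet")
print(reader.schema)  # inspect the Arrow schema before consuming the stream

# Consume the stream into an in-memory arro3 Table. A reader can only be
# consumed once; reader.closed reports whether it has been.
table = Table.from_arrow(reader)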

write_parquet

write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression]
    | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None
) -> None

Write an Arrow Table or stream to a Parquet file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow table, stream, or batch of data to write.

  • file (IO[bytes] | Path | str) –

    The output path or writable binary file object.

Other Parameters:

  • bloom_filter_enabled (bool | None) –

    Sets whether a bloom filter is enabled by default for all columns (defaults to false).

  • bloom_filter_fpp (float | None) –

    Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).

  • bloom_filter_ndv (int | None) –

    Sets default number of distinct values (ndv) for bloom filter for all columns (defaults to 1_000_000).

  • column_compression (dict[ParquetColumnPath, ParquetCompression] | None) –

    Sets compression codec for a specific column. Takes precedence over compression.

  • column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) –

    Sets flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.

  • column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) –

    Sets encoding for a specific column. Takes precedence over encoding.

  • column_max_statistics_size (dict[ParquetColumnPath, int] | None) –

    Sets max size for statistics for a specific column. Takes precedence over max_statistics_size.

  • compression (ParquetCompression | None) –

    Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass in a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".

  • created_by (str | None) –

    Sets "created by" property (defaults to parquet-rs version <VERSION>).

  • data_page_row_count_limit (int | None) –

    Sets best effort maximum number of rows in a data page (defaults to 20_000).

    The parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • data_page_size_limit (int | None) –

    Sets best effort maximum size of a data page in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the sizes of each DataPage to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • dictionary_enabled (bool | None) –

    Sets default flag to enable/disable dictionary encoding for all columns (defaults to True).

  • dictionary_page_size_limit (int | None) –

    Sets best effort maximum dictionary page size, in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • encoding (ParquetEncoding | None) –

    Sets default encoding for all columns.

    If dictionary encoding is not enabled, this is treated as the primary encoding for all columns. When dictionary encoding is enabled for a column, this value is used as the fallback encoding for that column.

  • key_value_metadata (dict[str, str] | None) –

    Sets "key_value_metadata" property (defaults to None).

  • max_row_group_size (int | None) –

    Sets maximum number of rows in a row group (defaults to 1024 * 1024).

  • max_statistics_size (int | None) –

    Sets default max statistics size for all columns (defaults to 4096).

  • write_batch_size (int | None) –

    Sets write batch size (defaults to 1024).

    For performance reasons, data for each column is written in batches of this size.

    Additional limits such as data_page_row_count_limit are checked between batches, and thus the write batch size acts as an upper bound on the enforcement granularity of other limits.

  • writer_version (Literal['parquet_1_0', 'parquet_2_0'] | None) –

    Sets the WriterVersion written into the parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
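
For example, a sketch writing a pyarrow Table with a default codec plus per-column overrides (the column names are hypothetical):

import pyarrow as pa

from arro3.io import write_parquet

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

write_parquet(
    table,
    "example.parquet",
    compression="zstd(3)",                    # default codec, with an explicit level
    column_compression={"name": "snappy"},    # takes precedence over `compression`
    column_dictionary_enabled={"id": False},  # takes precedence over `dictionary_enabled`
    max_row_group_size=100_000,
    key_value_metadata={"written_by": "arro3 example"},
)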