arro3.io

ParquetColumnPath module-attribute

ParquetColumnPath = str | Sequence[str]

Allowed types to refer to a Parquet column.

ParquetCompression module-attribute

ParquetCompression = (
    Literal[
        "uncompressed",
        "snappy",
        "gzip",
        "lzo",
        "brotli",
        "lz4",
        "zstd",
        "lz4_raw",
    ]
    | str
)

Allowed compression schemes for Parquet.

ParquetEncoding module-attribute

ParquetEncoding = Literal[
    "plain",
    "plain_dictionary",
    "rle",
    "bit_packed",
    "delta_binary_packed",
    "delta_length_byte_array",
    "delta_byte_array",
    "rle_dictionary",
    "byte_stream_split",
]

Allowed Parquet encodings.
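
For illustration, a minimal sketch of values these aliases describe, as they might be passed to the per-column options of write_parquet documented below (the column names are hypothetical):

# Hypothetical per-column settings keyed by ParquetColumnPath (plain column
# names here; nested columns can be addressed with a sequence of path segments).
column_compression = {
    "id": "snappy",
    "name": "zstd(3)",  # ParquetCompression as a string with an explicit level
}
column_encoding = {
    "id": "delta_binary_packed",  # one of the ParquetEncoding literals
}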

ArrowArrayExportable

Bases: Protocol

An object with an __arrow_c_array__ method.

Supported objects include:

  • arro3 Array or RecordBatch objects.
  • pyarrow Array or RecordBatch objects.

Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
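
For example, a minimal sketch (pyarrow is used only to produce an object implementing the protocol, and the file name is arbitrary):

import pyarrow as pa

from arro3.io import write_parquet

# A pyarrow RecordBatch exposes __arrow_c_array__, so it satisfies
# ArrowArrayExportable and can be passed to write_parquet (documented below).
batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})
assert hasattr(batch, "__arrow_c_array__")

write_parquet(batch, "batch.parquet")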

ArrowSchemaExportable

Bases: Protocol

An object with an __arrow_c_schema__ method.

Supported objects include:

  • arro3 Schema, Field, or DataType objects.
  • pyarrow Schema, Field, or DataType objects.

Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.

ArrowStreamExportable

Bases: Protocol

An object with an __arrow_c_stream__ method.

Supported objects include:

  • arro3 Table, RecordBatchReader, ChunkedArray, or ArrayReader objects.
  • Polars Series or DataFrame objects (polars v1.2 or higher).
  • pyarrow RecordBatchReader, Table, or ChunkedArray objects (pyarrow v14 or higher).
  • pandas DataFrames (pandas v2.2 or higher).
  • ibis Table objects.

For an up-to-date list of supported objects, see this issue.

Such an object implements the Arrow C Stream interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
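
For example, a sketch assuming pyarrow v14+ and polars v1.2+ are installed:

import polars as pl
import pyarrow as pa

# Tables and DataFrames from different libraries all expose
# __arrow_c_stream__, so any of them can be handed to a consumer such as
# write_parquet (documented below).
pa_table = pa.table({"x": [1, 2, 3]})
pl_df = pl.DataFrame({"x": [1, 2, 3]})

assert hasattr(pa_table, "__arrow_c_stream__")
assert hasattr(pl_df, "__arrow_c_stream__")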

RecordBatchReader

An Arrow RecordBatchReader.

A RecordBatchReader holds a stream of RecordBatch.

closed property

closed: bool

Returns true if this reader has already been consumed.

schema property

schema: Schema

Access the schema of this RecordBatchReader.

__arrow_c_stream__

__arrow_c_stream__(requested_schema: object | None = None) -> object

An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.

For example, you can call pyarrow.RecordBatchReader.from_stream to convert this stream to a pyarrow RecordBatchReader. Alternatively, you can call pyarrow.table() to consume this stream into a pyarrow Table, or Table.from_arrow() to consume it into an arro3 Table.
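
A sketch of both conversions (assuming example.parquet exists and a pyarrow version that provides RecordBatchReader.from_stream):

import pyarrow as pa

from arro3.io import read_parquet

reader = read_parquet("example.parquet")

# Convert to a pyarrow RecordBatchReader; the stream is handed over zero-copy.
pa_reader = pa.RecordBatchReader.from_stream(reader)

# Alternatively, materialize everything at once (a stream can only be
# consumed once, hence the second read_parquet call):
pa_table = pa.table(read_parquet("example.parquet"))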

from_arrow classmethod

Construct this from an existing Arrow object.

It can be called on anything that exports the Arrow stream interface (has an __arrow_c_stream__ method), such as a Table or RecordBatchReader.

from_arrow_pycapsule classmethod

from_arrow_pycapsule(capsule) -> RecordBatchReader

Construct this object from a bare Arrow PyCapsule.

Schema

An Arrow Schema.

metadata property

metadata: dict[bytes, bytes]

The schema's metadata.

metadata_str property

metadata_str: dict[str, str]

The schema's metadata where keys and values are str, not bytes.

names property

names: list[str]

The schema's field names.

types property

types: list[DataType]

The schema's field types.

__arrow_c_schema__

__arrow_c_schema__() -> object

An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.

For example, you can call pyarrow.schema() to convert this schema into a pyarrow schema, without copying memory.
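
As a sketch (the file name is arbitrary):

import pyarrow as pa

from arro3.io import read_parquet

schema = read_parquet("example.parquet").schema

# pyarrow.schema() understands __arrow_c_schema__, so no memory is copied.
pa_schema = pa.schema(schema)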

append

append(field: ArrowSchemaExportable) -> Schema

Append a field at the end of the schema.

In contrast to Python's list.append(), this returns a new object, leaving the original Schema unmodified.

Parameters:

  • field (ArrowSchemaExportable) –

    The field to append.

Returns:

  • Schema

    A new schema with the field appended.
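
For example, a sketch (the arro3.core import location for Schema is assumed; pyarrow objects are used because any ArrowSchemaExportable is accepted):

import pyarrow as pa

from arro3.core import Schema  # assumed import location

schema = Schema.from_arrow(pa.schema([("x", pa.int64()), ("y", pa.utf8())]))

# append returns a new Schema; the original is left unmodified.
appended = schema.append(pa.field("z", pa.float64()))
assert appended.names == ["x", "y", "z"]
assert schema.names == ["x", "y"]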

empty_table

empty_table() -> Table

Provide an empty table according to the schema.

Returns:

  • Table

    An empty table with this schema.

equals

equals(other: ArrowSchemaExportable) -> bool

Test if this schema is equal to the other.

Parameters:

  • other (ArrowSchemaExportable) –

    The schema to compare against.

Returns:

  • bool

    True if the two schemas are equal.

field

field(i: int | str) -> Field

Select a field by its column name or numeric index.

Parameters:

  • i (int | str) –

    The column name or numeric index of the field to select.

Returns:

  • Field

    The selected field.

from_arrow classmethod

from_arrow(input: ArrowSchemaExportable) -> Schema

Construct this from an existing Arrow object.

Parameters:

  • input (ArrowSchemaExportable) –

    An object exporting an Arrow schema (has an __arrow_c_schema__ method).

Returns:

  • Schema

    A new Schema.

from_arrow_pycapsule classmethod

from_arrow_pycapsule(capsule) -> Schema

Construct this object from a bare Arrow PyCapsule.

get_all_field_indices

get_all_field_indices(name: str) -> list[int]

Return sorted list of indices for the fields with the given name.

Parameters:

  • name (str) –

    The field name to look up.

Returns:

  • list[int]

    The sorted indices of all fields with the given name.

get_field_index

get_field_index(name: str) -> int

Return index of the unique field with the given name.

Parameters:

  • name (str) –

    The field name to look up.

Returns:

  • int

    The index of the unique field with the given name.

insert

insert(i: int, field: ArrowSchemaExportable) -> Schema

Add a field at position i to the schema.

Parameters:

  • i (int) –

    The position at which to insert the field.

  • field (ArrowSchemaExportable) –

    The field to insert.

Returns:

  • Schema

    A new schema with the field inserted at position i.

remove

remove(i: int) -> Schema

Remove the field at index i from the schema.

Parameters:

  • i (int) –

    The index of the field to remove.

Returns:

  • Schema

    A new schema without the field at index i.

remove_metadata

remove_metadata() -> Schema

Create a new schema without metadata, if any.

Returns:

  • Schema

    A new schema without metadata.

set

set(i: int, field: ArrowSchemaExportable) -> Schema

Replace a field at position i in the schema.

Parameters:

  • i (int) –

    The position of the field to replace.

  • field (ArrowSchemaExportable) –

    The new field.

Returns:

  • Schema

    A new schema with the field at position i replaced.

with_metadata

with_metadata(metadata: dict[str, str] | dict[bytes, bytes]) -> Schema

Add metadata as a dict of string (or bytes) keys and values to the Schema.

Parameters:

  • metadata (dict[str, str] | dict[bytes, bytes]) –

    The metadata to attach to the schema.

Returns:

  • Schema

    A new schema with the metadata attached.

read_parquet

read_parquet(file: Path | str) -> RecordBatchReader

Read a Parquet file to an Arrow RecordBatchReader.

Parameters:

  • file (Path | str) –

    The path to the Parquet file to read.

Returns:

  • RecordBatchReader

    A stream of record batches read from the file.
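
For example, a sketch (the file name is arbitrary and the arro3.core import location for Table is assumed):

from arro3.core import Table  # assumed import location
from arro3.io import read_parquet

reader = read_parquet("example.parquet")
print(reader.schema)  # inspect the Arrow schema before consuming the stream

# Consume the stream into an in-memory arro3 Table. A reader can only be
# consumed once; reader.closed reports whether it has been.
table = Table.from_arrow(reader)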

write_parquet

write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression]
    | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None
) -> None

Write an Arrow Table or stream to a Parquet file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow table, stream, or batch of data to write.

  • file (IO[bytes] | Path | str) –

    The output path or writable binary file object.

Other Parameters:

  • bloom_filter_enabled (bool | None) –

    Sets whether a bloom filter is enabled by default for all columns (defaults to false).

  • bloom_filter_fpp (float | None) –

    Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).

  • bloom_filter_ndv (int | None) –

    Sets default number of distinct values (ndv) for bloom filter for all columns (defaults to 1_000_000).

  • column_compression (dict[ParquetColumnPath, ParquetCompression] | None) –

    Sets compression codec for a specific column. Takes precedence over compression.

  • column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) –

    Sets flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.

  • column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) –

    Sets encoding for a specific column. Takes precedence over encoding.

  • column_max_statistics_size (dict[ParquetColumnPath, int] | None) –

    Sets max size for statistics for a specific column. Takes precedence over max_statistics_size.

  • compression (ParquetCompression | None) –

    Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass in a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".

  • created_by (str | None) –

    Sets "created by" property (defaults to parquet-rs version <VERSION>).

  • data_page_row_count_limit (int | None) –

    Sets best effort maximum number of rows in a data page (defaults to 20_000).

    The parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • data_page_size_limit (int | None) –

    Sets best effort maximum size of a data page in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the sizes of each DataPage to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • dictionary_enabled (bool | None) –

    Sets default flag to enable/disable dictionary encoding for all columns (defaults to True).

  • dictionary_page_size_limit (int | None) –

    Sets best effort maximum dictionary page size, in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • encoding (ParquetEncoding | None) –

    Sets default encoding for all columns.

    If dictionary encoding is not enabled, this is treated as the primary encoding for all columns. When dictionary encoding is enabled for a column, this value is used as the fallback encoding for that column.

  • key_value_metadata (dict[str, str] | None) –

    Sets "key_value_metadata" property (defaults to None).

  • max_row_group_size (int | None) –

    Sets maximum number of rows in a row group (defaults to 1024 * 1024).

  • max_statistics_size (int | None) –

    Sets default max statistics size for all columns (defaults to 4096).

  • write_batch_size (int | None) –

    Sets write batch size (defaults to 1024).

    For performance reasons, data for each column is written in batches of this size.

    Additional limits such as data_page_row_count_limit are checked between batches, and thus the write batch size acts as an upper bound on the enforcement granularity of other limits.

  • writer_version (Literal['parquet_1_0', 'parquet_2_0'] | None) –

    Sets the WriterVersion written into the parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
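
For example, a sketch writing a pyarrow Table with a default codec plus per-column overrides (the column names are hypothetical):

import pyarrow as pa

from arro3.io import write_parquet

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

write_parquet(
    table,
    "example.parquet",
    compression="zstd(3)",                    # default codec, with an explicit level
    column_compression={"name": "snappy"},    # takes precedence over `compression`
    column_dictionary_enabled={"id": False},  # takes precedence over `dictionary_enabled`
    max_row_group_size=100_000,
    key_value_metadata={"written_by": "arro3 example"},
)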