arro3.io ¶
ParquetColumnPath module-attribute ¶
Allowed types to refer to a Parquet Column.
ParquetCompression module-attribute ¶
ParquetCompression = (
    Literal[
        "uncompressed",
        "snappy",
        "gzip",
        "lzo",
        "brotli",
        "lz4",
        "zstd",
        "lz4_raw",
    ]
    | str
)
Allowed compression schemes for Parquet.
ParquetEncoding module-attribute ¶
ParquetEncoding = Literal[
    "plain",
    "plain_dictionary",
    "rle",
    "bit_packed",
    "delta_binary_packed",
    "delta_length_byte_array",
    "delta_byte_array",
    "rle_dictionary",
    "byte_stream_split",
]
Allowed Parquet encodings.
ArrowArrayExportable ¶
Bases: Protocol
An object with an __arrow_c_array__ method.
Supported objects include:
- arro3 Array or RecordBatch objects.
- pyarrow Array or RecordBatch objects.
Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
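A minimal sketch of consuming such an object, assuming pyarrow v14+ (which added the PyCapsule methods) and using write_parquet from this module; the output path is hypothetical:

```python
import pyarrow as pa

from arro3.io import write_parquet

# A pyarrow RecordBatch exposes __arrow_c_array__, so it can be passed anywhere
# an ArrowArrayExportable is accepted, e.g. write_parquet below.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
write_parquet(batch, "example.parquet")  # hypothetical output path
```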
ArrowSchemaExportable ¶
Bases: Protocol
An object with an __arrow_c_schema__ method.
Supported objects include:
- arro3 Schema, Field, or DataType objects.
- pyarrow Schema, Field, or DataType objects.
Such an object implements the Arrow C Data Interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
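For instance, a pyarrow schema can be handed to Schema.from_arrow (documented below). A minimal sketch, assuming pyarrow v14+ and that Schema is importable from arro3.core:

```python
import pyarrow as pa

from arro3.core import Schema

# pa.schema(...) returns an object with __arrow_c_schema__, so it is a valid
# ArrowSchemaExportable and can be consumed without copying.
schema = Schema.from_arrow(
    pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.utf8())])
)
```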
ArrowStreamExportable ¶
Bases: Protocol
An object with an __arrow_c_stream__ method.
Supported objects include:
- arro3 Table, RecordBatchReader, ChunkedArray, or ArrayReader objects.
- Polars Series or DataFrame objects (polars v1.2 or higher).
- pyarrow RecordBatchReader, Table, or ChunkedArray objects (pyarrow v14 or higher).
- pandas DataFrames (pandas v2.2 or higher).
- ibis Table objects.
For an up-to-date list of supported objects, see this issue.
Such an object implements the Arrow C Stream interface via the Arrow PyCapsule Interface. This allows for zero-copy Arrow data interchange across libraries.
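As a sketch, a polars DataFrame can be passed straight to write_parquet from this module (assuming polars v1.2+); the output path is hypothetical:

```python
import polars as pl

from arro3.io import write_parquet

# polars DataFrames expose __arrow_c_stream__, so the stream can be handed
# to arro3 without converting to pyarrow first.
df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
write_parquet(df, "from_polars.parquet")  # hypothetical output path
```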
RecordBatchReader ¶
An Arrow RecordBatchReader.
A RecordBatchReader holds a stream of RecordBatch objects.
__arrow_c_stream__ ¶
An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.
For example, you can call pyarrow.RecordBatchReader.from_stream to convert this stream to a pyarrow RecordBatchReader. Alternatively, you can call pyarrow.table() to consume this stream into a pyarrow Table, or Table.from_arrow() to consume it into an arro3 Table.
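A sketch of the consumer side, assuming pyarrow v15+ (for RecordBatchReader.from_stream) and that RecordBatchReader is importable from arro3.core:

```python
import pyarrow as pa

from arro3.core import RecordBatchReader

# Build an arro3 RecordBatchReader from any Arrow stream exporter
# (a pyarrow Table here, purely for illustration).
reader = RecordBatchReader.from_arrow(pa.table({"a": [1, 2, 3]}))

# A stream can only be consumed once, so pick exactly one consumer:
pa_reader = pa.RecordBatchReader.from_stream(reader)  # hand the stream to pyarrow
# pa_table = pa.table(reader)             # ...or materialize a pyarrow Table
# arro3_table = Table.from_arrow(reader)  # ...or an arro3 Table (arro3.core.Table)
```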
from_arrow classmethod ¶
from_arrow(
    input: ArrowArrayExportable | ArrowStreamExportable,
) -> RecordBatchReader
Construct this from an existing Arrow object.
It can be called on anything that exports the Arrow stream interface (has an __arrow_c_stream__ method), such as a Table or RecordBatchReader.
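A sketch of the ArrowArrayExportable path, assuming pyarrow v14+:

```python
import pyarrow as pa

from arro3.core import RecordBatchReader

# from_arrow also accepts a single batch exporter (__arrow_c_array__),
# presumably yielding a one-batch stream.
batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
reader = RecordBatchReader.from_arrow(batch)
```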
from_arrow_pycapsule classmethod ¶
from_arrow_pycapsule(capsule) -> RecordBatchReader
Construct this object from a bare Arrow PyCapsule.
Schema ¶
An Arrow Schema.
metadata property ¶
metadata_str property ¶
types property ¶
__arrow_c_schema__ ¶
__arrow_c_schema__() -> object
An implementation of the Arrow PyCapsule Interface. This dunder method should not be called directly, but enables zero-copy data transfer to other Python libraries that understand Arrow memory.
For example, you can call pyarrow.schema() to convert this schema into a pyarrow schema, without copying memory.
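A round-trip sketch, assuming pyarrow v14+ and that Schema is importable from arro3.core:

```python
import pyarrow as pa

from arro3.core import Schema

# Build an arro3 Schema from pyarrow, then hand it straight back:
# pa.schema() consumes any object exposing __arrow_c_schema__ without copying.
arro3_schema = Schema.from_arrow(pa.schema({"a": pa.int64(), "b": pa.utf8()}))
pa_schema = pa.schema(arro3_schema)
```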
append ¶
append(field: ArrowSchemaExportable) -> Schema
Append a field at the end of the schema.
In contrast to Python's list.append(), it returns a new object, leaving the original Schema unmodified.
Parameters:
- field (ArrowSchemaExportable) – New field.
Returns:
- Schema – New Schema.
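A short sketch of that behavior, using a pyarrow Field (pyarrow v14+) as the input:

```python
import pyarrow as pa

from arro3.core import Schema

schema = Schema.from_arrow(pa.schema([pa.field("a", pa.int64())]))
# append returns a new Schema; `schema` itself is left unmodified.
wider = schema.append(pa.field("b", pa.utf8()))
```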
equals ¶
equals(other: ArrowSchemaExportable) -> bool
Test if this schema is equal to the other.
Parameters:
- other (ArrowSchemaExportable) – The schema to compare against.
Returns:
- bool – Whether the two schemas are equal.
field ¶
from_arrow classmethod ¶
from_arrow(input: ArrowSchemaExportable) -> Schema
Construct this from an existing Arrow object.
Parameters:
- input (ArrowSchemaExportable) – Arrow schema to use for constructing this object.
Returns:
- Schema – The new Schema.
from_arrow_pycapsule classmethod ¶
from_arrow_pycapsule(capsule) -> Schema
Construct this object from a bare Arrow PyCapsule.
get_all_field_indices ¶
get_field_index ¶
insert ¶
insert(i: int, field: ArrowSchemaExportable) -> Schema
Add a field at position i to the schema.
Parameters:
- i (int) – Position at which to insert the new field.
- field (ArrowSchemaExportable) – The field to insert.
Returns:
- Schema – New Schema with the field inserted.
remove ¶
set ¶
set(i: int, field: ArrowSchemaExportable) -> Schema
Replace a field at position i in the schema.
Parameters:
- i (int) – Position of the field to replace.
- field (ArrowSchemaExportable) – The replacement field.
Returns:
- Schema – New Schema with the field replaced.
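A sketch of insert and set together, using pyarrow Fields (pyarrow v14+) as inputs:

```python
import pyarrow as pa

from arro3.core import Schema

schema = Schema.from_arrow(
    pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.utf8())])
)
# insert adds a field at position i; set replaces the field at position i.
# Both return a new Schema and leave the original untouched.
with_c = schema.insert(1, pa.field("c", pa.float64()))
with_int32_a = schema.set(0, pa.field("a", pa.int32()))
```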
read_parquet ¶
read_parquet(file: Path | str) -> RecordBatchReader
Read a Parquet file to an Arrow RecordBatchReader.
Parameters:
- file (Path | str) – The path of the Parquet file to read.
Returns:
- RecordBatchReader – A stream of record batches read from the file.
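A minimal sketch, materializing the stream as a pyarrow Table (pyarrow v14+); the input path is hypothetical:

```python
import pyarrow as pa

from arro3.io import read_parquet

# read_parquet returns a RecordBatchReader, i.e. a stream of record batches.
reader = read_parquet("example.parquet")  # hypothetical path
table = pa.table(reader)  # consume the stream into a pyarrow Table
```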
write_parquet ¶
write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression] | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None,
) -> None
Write an Arrow Table or stream to a Parquet file.
Parameters:
- data (ArrowStreamExportable | ArrowArrayExportable) – The Arrow Table, RecordBatchReader, or RecordBatch to write to Parquet.
- file (IO[bytes] | Path | str) – The output file.
Other Parameters:
- bloom_filter_enabled (bool | None) – Sets whether the bloom filter is enabled by default for all columns (defaults to false).
- bloom_filter_fpp (float | None) – Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).
- bloom_filter_ndv (int | None) – Sets the default number of distinct values (ndv) for the bloom filter for all columns (defaults to 1_000_000).
- column_compression (dict[ParquetColumnPath, ParquetCompression] | None) – Sets the compression codec for a specific column. Takes precedence over compression.
- column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) – Sets a flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.
- column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) – Sets the encoding for a specific column. Takes precedence over encoding.
- column_max_statistics_size (dict[ParquetColumnPath, int] | None) – Sets the max statistics size for a specific column. Takes precedence over max_statistics_size.
- compression (ParquetCompression | None) – Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass in a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".
- created_by (str | None) – Sets the "created by" property (defaults to parquet-rs version <VERSION>).
- data_page_row_count_limit (int | None) – Sets the best-effort maximum number of rows in a data page (defaults to 20_000). The parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading. Note: this is a best-effort limit based on the value of set_write_batch_size.
- data_page_size_limit (int | None) – Sets the best-effort maximum size of a data page in bytes (defaults to 1024 * 1024). The parquet writer will attempt to limit the size of each DataPage to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading. Note: this is a best-effort limit based on the value of set_write_batch_size.
- dictionary_enabled (bool | None) – Sets the default flag to enable/disable dictionary encoding for all columns (defaults to True).
- dictionary_page_size_limit (int | None) – Sets the best-effort maximum dictionary page size, in bytes (defaults to 1024 * 1024). The parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading. Note: this is a best-effort limit based on the value of set_write_batch_size.
- encoding (ParquetEncoding | None) – Sets the default encoding for all columns. If dictionary encoding is not enabled, this is treated as the primary encoding for all columns. If dictionary encoding is enabled for a column, this value is considered the fallback encoding for that column.
- key_value_metadata (dict[str, str] | None) – Sets the "key_value_metadata" property (defaults to None).
- max_row_group_size (int | None) – Sets the maximum number of rows in a row group (defaults to 1024 * 1024).
- max_statistics_size (int | None) – Sets the default max statistics size for all columns (defaults to 4096).
- write_batch_size (int | None) – Sets the write batch size (defaults to 1024). For performance reasons, data for each column is written in batches of this size. Additional limits such as set_data_page_row_count_limit are checked between batches, so the write batch size acts as an upper bound on the enforcement granularity of other limits.
- writer_version (Literal['parquet_1_0', 'parquet_2_0'] | None) – Sets the WriterVersion written into the parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
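A sketch pulling a few of these options together; the data, output path, and option values are illustrative only:

```python
import pyarrow as pa

from arro3.io import write_parquet

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
write_parquet(
    table,
    "output.parquet",                          # hypothetical output path
    compression="zstd(3)",                     # codec with an explicit level
    max_row_group_size=100_000,
    key_value_metadata={"source": "example"},  # stored in the file footer metadata
    writer_version="parquet_2_0",
)
```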