arro3.io
infer_csv_schema
infer_csv_schema(
    file: IO[bytes] | Path | str,
    *,
    has_header: bool | None = None,
    max_records: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> Schema
Infer a CSV file's schema
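A minimal sketch of schema inference; the file name is hypothetical:

```python
from arro3.io import infer_csv_schema

# Sample only the first 1000 records when inferring column types.
schema = infer_csv_schema("observations.csv", has_header=True, max_records=1000)
print(schema)
```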
infer_json_schema
Infer a JSON file's schema
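The full signature is not reproduced here. As a hedged sketch, assuming the input file is accepted as the first argument (mirroring infer_csv_schema above) and that "records.json" is a hypothetical path:

```python
from arro3.io import infer_json_schema

# Assumption: the file is passed positionally, as with infer_csv_schema.
schema = infer_json_schema("records.json")
print(schema)
```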
read_csv
read_csv(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    has_header: bool | None = None,
    batch_size: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> RecordBatchReader
Read a CSV file to an Arrow RecordBatchReader
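Schema inference and reading can be combined. This sketch uses a hypothetical observations.csv and assumes the returned RecordBatchReader can be iterated batch by batch:

```python
from arro3.io import infer_csv_schema, read_csv

schema = infer_csv_schema("observations.csv", has_header=True)
reader = read_csv("observations.csv", schema, has_header=True, batch_size=64_000)

# Assumption: the reader yields one RecordBatch per iteration.
for batch in reader:
    ...  # process each RecordBatch
```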
read_ipc
read_ipc(file: IO[bytes] | Path | str) -> RecordBatchReader
Read an Arrow IPC file to an Arrow RecordBatchReader
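Because the returned RecordBatchReader is accepted as the data argument of the writers on this page, it can be piped straight into another format. The file names below are hypothetical:

```python
from arro3.io import read_ipc, write_parquet

# Convert an Arrow IPC file to Parquet without materializing the whole table.
reader = read_ipc("data.arrow")
write_parquet(reader, "data.parquet", compression="zstd(3)")
```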
read_ipc_stream
read_ipc_stream(file: IO[bytes] | Path | str) -> RecordBatchReader
Read an Arrow IPC Stream file to an Arrow RecordBatchReader
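A sketch converting a streaming IPC payload into a random-access IPC file; the file names are hypothetical:

```python
from arro3.io import read_ipc_stream, write_ipc

reader = read_ipc_stream("stream.arrows")
write_ipc(reader, "data.arrow")
```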
read_json
read_json(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    batch_size: int | None = None
) -> RecordBatchReader
Read a JSON file to an Arrow RecordBatchReader
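Any ArrowSchemaExportable can serve as the schema. This sketch assumes pyarrow 14 or newer (where Schema exports via the Arrow PyCapsule interface) and a hypothetical records.json:

```python
import pyarrow as pa
from arro3.io import read_json

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
reader = read_json("records.json", schema, batch_size=10_000)
```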
read_parquet
read_parquet(file: Path | str) -> RecordBatchReader
Read a Parquet file to an Arrow RecordBatchReader
Parameters:
- file (Path | str) – The input Parquet file.
Returns:
- RecordBatchReader – A stream of Arrow record batches read from the file.
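As a sketch, the returned reader can be streamed straight into one of the writers documented below; the file names are hypothetical:

```python
from arro3.io import read_parquet, write_csv

# Convert Parquet to CSV batch by batch.
reader = read_parquet("data.parquet")
write_csv(reader, "data.csv", header=True)
```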
write_csv
write_csv(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    header: bool | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    date_format: str | None = None,
    datetime_format: str | None = None,
    time_format: str | None = None,
    timestamp_format: str | None = None,
    timestamp_tz_format: str | None = None,
    null: str | None = None
) -> None
Write an Arrow Table or stream to a CSV file
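A sketch of the formatting options, assuming a recent pyarrow (14 or newer) is used to build a small in-memory table; a pyarrow Table exports an Arrow stream, so it can be passed as data directly:

```python
import pyarrow as pa
from arro3.io import write_csv

table = pa.table({"city": ["Oslo", "Lima"], "temp_c": [3.5, 19.2]})
# Semicolon-delimited output with "NA" written for nulls.
write_csv(table, "cities.csv", header=True, delimiter=";", null="NA")
```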
write_ipc
write_ipc(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
) -> None
Write an Arrow Table or stream to an IPC File
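Since file accepts IO[bytes], the IPC file can also be written to an in-memory buffer. This sketch uses pyarrow only to build sample data:

```python
import io

import pyarrow as pa
from arro3.io import write_ipc

table = pa.table({"x": [1, 2, 3]})
buf = io.BytesIO()
write_ipc(table, buf)
payload = buf.getvalue()  # the serialized IPC file as bytes
```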
write_ipc_stream
write_ipc_stream(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
) -> None
Write an Arrow Table or stream to an IPC Stream
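A minimal sketch; the stream format is well suited to pipes and sockets, but a hypothetical file path is used here, with pyarrow providing the sample data:

```python
import pyarrow as pa
from arro3.io import write_ipc_stream

table = pa.table({"x": [1, 2, 3]})
write_ipc_stream(table, "x.arrows")
```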
write_json
write_json(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None
Write an Arrow Table or stream to a JSON file
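A sketch showing explicit_nulls, with pyarrow used for the sample table and a hypothetical output path:

```python
import pyarrow as pa
from arro3.io import write_json

table = pa.table({"id": [1, 2], "note": ["a", None]})
# With explicit_nulls=True, missing values are written as null
# rather than the key being omitted from the record.
write_json(table, "rows.json", explicit_nulls=True)
```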
write_ndjson
write_ndjson(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None
Write an Arrow Table or stream to a newline-delimited JSON file
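A minimal sketch, again with pyarrow sample data and a hypothetical path:

```python
import pyarrow as pa
from arro3.io import write_ndjson

# One JSON object per line, convenient for log-style pipelines.
table = pa.table({"id": [1, 2], "ok": [True, False]})
write_ndjson(table, "rows.ndjson")
```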
write_parquet
write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression] | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None
) -> None
Write an Arrow Table or stream to a Parquet file.
Parameters:

- data (ArrowStreamExportable | ArrowArrayExportable) – The Arrow Table, RecordBatchReader, or RecordBatch to write to Parquet.
- file (IO[bytes] | Path | str) – The output file.

Other Parameters:

- bloom_filter_enabled (bool | None) – Sets whether the bloom filter is enabled by default for all columns (defaults to false).
- bloom_filter_fpp (float | None) – Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).
- bloom_filter_ndv (int | None) – Sets the default number of distinct values (ndv) for the bloom filter for all columns (defaults to 1_000_000).
- column_compression (dict[ParquetColumnPath, ParquetCompression] | None) – Sets the compression codec for a specific column. Takes precedence over compression.
- column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) – Sets a flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.
- column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) – Sets the encoding for a specific column. Takes precedence over encoding.
- column_max_statistics_size (dict[ParquetColumnPath, int] | None) – Sets the max statistics size for a specific column. Takes precedence over max_statistics_size.
- compression (ParquetCompression | None) – Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass in a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".
- created_by (str | None) – Sets the "created by" property (defaults to parquet-rs version <VERSION>).
- data_page_row_count_limit (int | None) – Sets the best-effort maximum number of rows in a data page (defaults to 20_000). The Parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger Parquet files, but may improve the effectiveness of page-index-based predicate pushdown during reading. Note: this is a best-effort limit based on the value of write_batch_size.
- data_page_size_limit (int | None) – Sets the best-effort maximum size of a data page in bytes (defaults to 1024 * 1024). The Parquet writer will attempt to limit the size of each DataPage to this many bytes. Reducing this value will result in larger Parquet files, but may improve the effectiveness of page-index-based predicate pushdown during reading. Note: this is a best-effort limit based on the value of write_batch_size.
- dictionary_enabled (bool | None) – Sets the default flag to enable/disable dictionary encoding for all columns (defaults to True).
- dictionary_page_size_limit (int | None) – Sets the best-effort maximum dictionary page size, in bytes (defaults to 1024 * 1024). The Parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger Parquet files, but may improve the effectiveness of page-index-based predicate pushdown during reading. Note: this is a best-effort limit based on the value of write_batch_size.
- encoding (ParquetEncoding | None) – Sets the default encoding for all columns. If dictionary encoding is not enabled, this is treated as the primary encoding for all columns. When dictionary encoding is enabled for any column, this value is considered the fallback encoding for that column.
- key_value_metadata (dict[str, str] | None) – Sets the "key_value_metadata" property (defaults to None).
- max_row_group_size (int | None) – Sets the maximum number of rows in a row group (defaults to 1024 * 1024).
- max_statistics_size (int | None) – Sets the default max statistics size for all columns (defaults to 4096).
- write_batch_size (int | None) – Sets the write batch size (defaults to 1024). For performance reasons, data for each column is written in batches of this size. Additional limits such as data_page_row_count_limit are checked between batches, so the write batch size acts as an upper bound on the enforcement granularity of other limits.
- writer_version (Literal["parquet_1_0", "parquet_2_0"] | None) – Sets the WriterVersion written into the Parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
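A sketch combining a few of these options; pyarrow is used only to build the sample table, and it is assumed that a column path can be given as a plain column name string:

```python
import pyarrow as pa
from arro3.io import write_parquet

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

write_parquet(
    table,
    "example.parquet",
    compression="zstd(3)",                  # file-wide default codec
    column_compression={"name": "snappy"},  # per-column override takes precedence
    max_row_group_size=100_000,
    key_value_metadata={"source": "example"},
)
```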