arro3.io

ParquetColumnPath module-attribute

ParquetColumnPath = str | Sequence[str]

Allowed types to refer to a Parquet column.

ParquetCompression module-attribute

ParquetCompression = (
    Literal[
        "uncompressed",
        "snappy",
        "gzip",
        "lzo",
        "brotli",
        "lz4",
        "zstd",
        "lz4_raw",
    ]
    | str
)

Allowed compression schemes for Parquet.

ParquetEncoding module-attribute

ParquetEncoding = Literal[
    "plain",
    "plain_dictionary",
    "rle",
    "bit_packed",
    "delta_binary_packed",
    "delta_length_byte_array",
    "delta_byte_array",
    "rle_dictionary",
    "byte_stream_split",
]

Allowed Parquet encodings.

infer_csv_schema

infer_csv_schema(
    file: IO[bytes] | Path | str,
    *,
    has_header: bool | None = None,
    max_records: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> Schema

Infer a CSV file's schema.

If max_records is None, all records are read to infer the schema; otherwise, only the first max_records records are read.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input CSV path or buffer.

  • has_header (bool | None, default: None ) –

    Set whether the CSV file has a header. Defaults to None.

  • max_records (int | None, default: None ) –

    The maximum number of records to read to infer schema. Defaults to None.

  • delimiter (str | None, default: None ) –

    Set the CSV file's column delimiter as a byte character. Defaults to None.

  • escape (str | None, default: None ) –

    Set the CSV escape character. Defaults to None.

  • quote (str | None, default: None ) –

    Set the CSV quote character. Defaults to None.

  • terminator (str | None, default: None ) –

    Set the line terminator. Defaults to None.

  • comment (str | None, default: None ) –

    Set the comment character. Defaults to None.

Returns:

  • Schema

    inferred schema from data

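A minimal usage sketch (the file path, delimiter, and record limit are illustrative assumptions):

from arro3.io import infer_csv_schema

# Infer field names and types from the first 1,000 records of a
# hypothetical semicolon-delimited CSV file.
schema = infer_csv_schema(
    "example.csv",
    has_header=True,
    max_records=1_000,
    delimiter=";",
)
print(schema)
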
infer_json_schema

infer_json_schema(
    file: IO[bytes] | Path | str, *, max_records: int | None = None
) -> Schema

Infer the schema of a JSON file by reading records from the start of the buffer, with max_records controlling the maximum number of records to read.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input JSON path or buffer.

  • max_records (int | None, default: None ) –

    The maximum number of records to read to infer schema. If not provided, will read the entire file to deduce field types. Defaults to None.

Returns:

  • Schema

    Inferred Arrow Schema

read_csv

read_csv(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    has_header: bool | None = None,
    batch_size: int | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    terminator: str | None = None,
    comment: str | None = None
) -> RecordBatchReader

Read a CSV file to an Arrow RecordBatchReader.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input CSV path or buffer.

  • schema (ArrowSchemaExportable) –

    The Arrow schema for this CSV file. Use infer_csv_schema to infer an Arrow schema if you do not already know it.

  • has_header (bool | None, default: None ) –

    Set whether the CSV file has a header. Defaults to None.

  • batch_size (int | None, default: None ) –

    Set the batch size (number of records to load at one time). Defaults to None.

  • delimiter (str | None, default: None ) –

    Set the CSV file's column delimiter as a byte character. Defaults to None.

  • escape (str | None, default: None ) –

    Set the CSV escape character. Defaults to None.

  • quote (str | None, default: None ) –

    Set the CSV quote character. Defaults to None.

  • terminator (str | None, default: None ) –

    Set the line terminator. Defaults to None.

  • comment (str | None, default: None ) –

    Set the comment character. Defaults to None.

Returns:

  • RecordBatchReader

    A stream of record batches parsed from the CSV file.

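A minimal usage sketch (the file path and options are illustrative; it assumes the returned RecordBatchReader is iterable over record batches):

from arro3.io import infer_csv_schema, read_csv

# Hypothetical file: infer the schema first, then stream the data in
# batches of 8,192 records.
schema = infer_csv_schema("example.csv", has_header=True)
reader = read_csv("example.csv", schema, has_header=True, batch_size=8_192)
for batch in reader:
    ...  # process each RecordBatch as it is decoded
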
read_ipc

read_ipc(file: IO[bytes] | Path | str) -> RecordBatchReader

Read an Arrow IPC file into memory.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input Arrow IPC file path or buffer.

Returns:

  • RecordBatchReader

    A stream of record batches read from the IPC file.

read_ipc_stream

read_ipc_stream(file: IO[bytes] | Path | str) -> RecordBatchReader

Read an Arrow IPC stream into memory.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input Arrow IPC stream path or buffer.

Returns:

  • RecordBatchReader

    A stream of record batches read from the IPC stream.

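A minimal usage sketch covering both IPC readers (the file paths are illustrative): read_ipc expects the Arrow IPC file format (with a footer), while read_ipc_stream expects the Arrow IPC streaming format.

from arro3.io import read_ipc, read_ipc_stream

# Hypothetical paths: one IPC file and one IPC stream.
file_reader = read_ipc("table.arrow")
stream_reader = read_ipc_stream("table.arrows")
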
read_json

read_json(
    file: IO[bytes] | Path | str,
    schema: ArrowSchemaExportable,
    *,
    batch_size: int | None = None
) -> RecordBatchReader

Read JSON data with a known schema into Arrow.

Parameters:

  • file (IO[bytes] | Path | str) –

    The JSON file or buffer to read from.

  • schema (ArrowSchemaExportable) –

    The Arrow schema representing the JSON data.

  • batch_size (int | None, default: None ) –

    Set the batch size (number of records to load at one time). Defaults to None.

Returns:

  • RecordBatchReader

    A stream of record batches parsed from the JSON data.

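A minimal usage sketch that infers a schema and then reads the same JSON data (the file path and limits are illustrative assumptions):

from arro3.io import infer_json_schema, read_json

# Hypothetical file: deduce field types from the first 500 records, then
# decode the file in batches with that schema.
schema = infer_json_schema("records.json", max_records=500)
reader = read_json("records.json", schema, batch_size=4_096)
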
read_parquet

read_parquet(file: IO[bytes] | Path | str) -> RecordBatchReader

Read a Parquet file to an Arrow RecordBatchReader.

Parameters:

  • file (IO[bytes] | Path | str) –

    The input Parquet file path or buffer.

Returns:

  • RecordBatchReader

    A stream of record batches read from the Parquet file.

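A minimal usage sketch (the file path is illustrative; it assumes the returned RecordBatchReader is iterable over record batches):

from arro3.io import read_parquet

# Hypothetical path; batches are decoded lazily as the reader is consumed.
reader = read_parquet("data.parquet")
for batch in reader:
    ...  # process each RecordBatch
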
write_csv

write_csv(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    header: bool | None = None,
    delimiter: str | None = None,
    escape: str | None = None,
    quote: str | None = None,
    date_format: str | None = None,
    datetime_format: str | None = None,
    time_format: str | None = None,
    timestamp_format: str | None = None,
    timestamp_tz_format: str | None = None,
    null: str | None = None
) -> None

Write an Arrow Table or stream to a CSV file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    The output buffer or file path for where to write the CSV.

  • header (bool | None, default: None ) –

    Set whether to write the CSV file with a header. Defaults to None.

  • delimiter (str | None, default: None ) –

    Set the CSV file's column delimiter as a byte character. Defaults to None.

  • escape (str | None, default: None ) –

    Set the CSV file's escape character as a byte character.

    In some variants of CSV, quotes are escaped using a special escape character like \ (instead of escaping quotes by doubling them).

    By default, writing these idiosyncratic escapes is disabled, and they are only used when double_quote is disabled. Defaults to None.

  • quote (str | None, default: None ) –

    Set the CSV file's quote character as a byte character. Defaults to None.

  • date_format (str | None, default: None ) –

    Set the CSV file's date format. Defaults to None.

  • datetime_format (str | None, default: None ) –

    Set the CSV file's datetime format. Defaults to None.

  • time_format (str | None, default: None ) –

    Set the CSV file's time format. Defaults to None.

  • timestamp_format (str | None, default: None ) –

    Set the CSV file's timestamp format. Defaults to None.

  • timestamp_tz_format (str | None, default: None ) –

    Set the CSV file's timestamp tz format. Defaults to None.

  • null (str | None, default: None ) –

    Set the value to represent null in output. Defaults to None.

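A minimal usage sketch streaming a Parquet file out as CSV (file paths and formatting options are illustrative assumptions):

from arro3.io import read_parquet, write_csv

# Hypothetical round trip: write a Parquet file out as pipe-delimited CSV,
# using the string NULL for null values.
reader = read_parquet("data.parquet")
write_csv(reader, "data.csv", header=True, delimiter="|", null="NULL")
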
write_ipc

write_ipc(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    compression: Literal["LZ4", "lz4", "ZSTD", "zstd"] | None = None
) -> None

Write Arrow data to an Arrow IPC file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    The output buffer or file path for where to write the IPC file.

Other Parameters:

  • compression (Literal['LZ4', 'lz4', 'ZSTD', 'zstd'] | None) –

    Compression to apply to the file.

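A minimal usage sketch (file paths are illustrative assumptions):

from arro3.io import read_parquet, write_ipc

# Hypothetical paths: rewrite a Parquet file as an Arrow IPC file with
# zstd-compressed buffers.
reader = read_parquet("data.parquet")
write_ipc(reader, "data.arrow", compression="zstd")
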
write_ipc_stream

write_ipc_stream(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    compression: Literal["LZ4", "lz4", "ZSTD", "zstd"] | None = None
) -> None

Write Arrow data to an Arrow IPC stream.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    The output buffer or file path for where to write the IPC stream.

Other Parameters:

  • compression (Literal['LZ4', 'lz4', 'ZSTD', 'zstd'] | None) –

    Compression to apply to the stream.

write_json

write_json(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None

Write Arrow data to JSON.

By default the writer will skip writing keys with null values for backward compatibility.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    the Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    the output file or buffer to write to

  • explicit_nulls (bool | None, default: None ) –

    Set whether to keep keys with null values, or to omit writing them. Defaults to skipping nulls.

write_ndjson

write_ndjson(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    explicit_nulls: bool | None = None
) -> None

Write Arrow data to newline-delimited JSON.

By default the writer will skip writing keys with null values for backward compatibility.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    the Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    the output file or buffer to write to

  • explicit_nulls (bool | None, default: None ) –

    Set whether to keep keys with null values, or to omit writing them. Defaults to skipping nulls.

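A minimal usage sketch (file paths are illustrative assumptions):

from arro3.io import read_parquet, write_ndjson

# Hypothetical paths: keep keys with null values in the output instead of
# omitting them.
reader = read_parquet("data.parquet")
write_ndjson(reader, "data.ndjson", explicit_nulls=True)
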
write_parquet

write_parquet(
    data: ArrowStreamExportable | ArrowArrayExportable,
    file: IO[bytes] | Path | str,
    *,
    bloom_filter_enabled: bool | None = None,
    bloom_filter_fpp: float | None = None,
    bloom_filter_ndv: int | None = None,
    column_compression: dict[ParquetColumnPath, ParquetCompression]
    | None = None,
    column_dictionary_enabled: dict[ParquetColumnPath, bool] | None = None,
    column_encoding: dict[ParquetColumnPath, ParquetEncoding] | None = None,
    column_max_statistics_size: dict[ParquetColumnPath, int] | None = None,
    compression: ParquetCompression | None = None,
    created_by: str | None = None,
    data_page_row_count_limit: int | None = None,
    data_page_size_limit: int | None = None,
    dictionary_enabled: bool | None = None,
    dictionary_page_size_limit: int | None = None,
    encoding: ParquetEncoding | None = None,
    key_value_metadata: dict[str, str] | None = None,
    max_row_group_size: int | None = None,
    max_statistics_size: int | None = None,
    skip_arrow_metadata: bool = False,
    write_batch_size: int | None = None,
    writer_version: Literal["parquet_1_0", "parquet_2_0"] | None = None
) -> None

Write an Arrow Table or stream to a Parquet file.

Parameters:

  • data (ArrowStreamExportable | ArrowArrayExportable) –

    The Arrow Table, RecordBatchReader, or RecordBatch to write.

  • file (IO[bytes] | Path | str) –

    The output buffer or file path for where to write the Parquet file.

Other Parameters:

  • bloom_filter_enabled (bool | None) –

    Sets whether the bloom filter is enabled by default for all columns (defaults to False).

  • bloom_filter_fpp (float | None) –

    Sets the default target bloom filter false positive probability (fpp) for all columns (defaults to 0.05).

  • bloom_filter_ndv (int | None) –

    Sets the default number of distinct values (ndv) for the bloom filter for all columns (defaults to 1_000_000).

  • column_compression (dict[ParquetColumnPath, ParquetCompression] | None) –

    Sets compression codec for a specific column. Takes precedence over compression.

  • column_dictionary_enabled (dict[ParquetColumnPath, bool] | None) –

    Sets flag to enable/disable dictionary encoding for a specific column. Takes precedence over dictionary_enabled.

  • column_encoding (dict[ParquetColumnPath, ParquetEncoding] | None) –

    Sets encoding for a specific column. Takes precedence over encoding.

  • column_max_statistics_size (dict[ParquetColumnPath, int] | None) –

    Sets max size for statistics for a specific column. Takes precedence over max_statistics_size.

  • compression (ParquetCompression | None) –

    Sets the default compression codec for all columns (defaults to uncompressed). Note that you can pass in a custom compression level with a string like "zstd(3)", "gzip(9)", or "brotli(3)".

  • created_by (str | None) –

    Sets "created by" property (defaults to parquet-rs version <VERSION>).

  • data_page_row_count_limit (int | None) –

    Sets best effort maximum number of rows in a data page (defaults to 20_000).

    The parquet writer will attempt to limit the number of rows in each DataPage to this value. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • data_page_size_limit (int | None) –

    Sets best effort maximum size of a data page in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the sizes of each DataPage to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • dictionary_enabled (bool | None) –

    Sets default flag to enable/disable dictionary encoding for all columns (defaults to True).

  • dictionary_page_size_limit (int | None) –

    Sets best effort maximum dictionary page size, in bytes (defaults to 1024 * 1024).

    The parquet writer will attempt to limit the size of each DataPage used to store dictionaries to this many bytes. Reducing this value will result in larger parquet files, but may improve the effectiveness of page index based predicate pushdown during reading.

    Note: this is a best-effort limit based on the value of write_batch_size.

  • encoding (ParquetEncoding | None) –

    Sets default encoding for all columns.

    If dictionary encoding is not enabled, this is the primary encoding for all columns. If dictionary encoding is enabled for a column, this value is used as the fallback encoding for that column.

  • key_value_metadata (dict[str, str] | None) –

    Sets "key_value_metadata" property (defaults to None).

  • max_row_group_size (int | None) –

    Sets maximum number of rows in a row group (defaults to 1024 * 1024).

  • max_statistics_size (int | None) –

    Sets default max statistics size for all columns (defaults to 4096).

  • skip_arrow_metadata (bool) –

    Parquet files generated by this writer contain an embedded Arrow schema by default. Set skip_arrow_metadata to True to skip encoding this embedded metadata (defaults to False).

  • write_batch_size (int | None) –

    Sets write batch size (defaults to 1024).

    For performance reasons, data for each column is written in batches of this size.

    Additional limits such as data_page_row_count_limit are checked between batches, and thus the write batch size acts as an upper bound on the enforcement granularity of other limits.

  • writer_version (Literal['parquet_1_0', 'parquet_2_0'] | None) –

    Sets the WriterVersion written into the parquet metadata (defaults to "parquet_1_0"). This value can determine what features some readers will support.
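
A minimal usage sketch combining several of these options (file paths, the column name, and the metadata value are illustrative assumptions):

from arro3.io import infer_csv_schema, read_csv, write_parquet

# Hypothetical paths and column name.
schema = infer_csv_schema("events.csv", has_header=True)
reader = read_csv("events.csv", schema, has_header=True)
write_parquet(
    reader,
    "events.parquet",
    compression="zstd(3)",                     # default codec for all columns
    column_compression={"user_id": "snappy"},  # per-column override
    max_row_group_size=512_000,
    key_value_metadata={"source": "events.csv"},
    writer_version="parquet_2_0",
)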