RTree¶

geoindex_rs.rtree ¶

ArrayLike `module-attribute` ¶

ArrayLike = Union[ndarray, ArrowArrayExportable, Buffer]

A type alias for accepted input to the RTreeBuilder.add method.

Accepted input includes numpy arrays, Arrow arrays, and C-contiguous buffer protocol input.

IndexLike `module-attribute` ¶

IndexLike = Union[ndarray, ArrowArrayExportable, Buffer, RTree]

A type alias for accepted input as an RTree.

RTree ¶

Bases: Buffer

A fast, memory-efficient, zero-copy-compatible, 2D RTree.

Use RTreeBuilder, and then call finish to construct this.

This uses a binary-stable memory layout under the hood. So you can write the index to disk, then later load it and perform queries on it.

This class implements the Python buffer protocol, so you can pass it to the Python bytes constructor to copy the underlying binary memory into a Python bytes object.

import numpy as np
from geoindex_rs import rtree as rt

min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)

builder = rt.RTreeBuilder(num_items=3)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()

# Copy to a Python bytes object
copied_tree = bytes(tree)

# The entire RTree is contained within this 144 byte buffer
assert len(copied_tree) == 144

# We can use the bytes object (or anything else implementing the Python buffer
protocol) directly in searches
results = rt.neighbors(copied_tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]

metadata `property` ¶

metadata: RTreeMetadata

Access the metadata instance of this RTree.

Use this to infer the number of items or nodes in this RTree.

RTreeBuilder ¶

A builder class to create an RTree.

Example:

import numpy as np
from geoindex_rs import rtree as rt

builder = rt.RTreeBuilder(3)
min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()

init ¶

__init__(
    num_items: int,
    node_size: int = 16,
    coord_type: Literal["float32", "float64"] = "float64",
) -> None

Initialize an RTree with the given parameters.

This will provision memory for the RTree, to be filled in the add method.

Parameters:

num_items (int) –

The number of items in the tree
node_size (int, default: 16 ) –

The node size of the tree. Defaults to 16.
coord_type (Literal['float32', 'float64'], default: 'float64' ) –

The coordinate type to use in the tree. Currently only float32 and float64 are permitted. Defaults to None.

add ¶

add(
    min_x: ArrayLike,
    min_y: ArrayLike | None = None,
    max_x: ArrayLike | None = None,
    max_y: ArrayLike | None = None,
) -> Array

Insert bounding boxes in this RTree.

There are multiple ways to call this method:

Four numpy or Arrow numeric arrays, passed as min_x, min_y, max_x, max_y.
A 2D numpy array, with shape (N, 4). It must be C-contiguous.
One argument with an Arrow array of type FixedSizeList or Struct. The FixedSizeListArray must have a list_size of 4, and the StructArray must have four children, ordered min_x, min_y, max_x, max_y.

In this case, all other parameters should be left as None.

Example: Adding shapely geometries

import numpy as np
import shapely
from geoindex_rs import rtree as rt

# Shapely array of geometries, such as from GeoPandas
geometries = [...]

builder = rt.RTreeBuilder(len(geometries))

# Find the bounding box of each geometry
bounds = shapely.bounds(geometries)
builder.add(bounds)
tree = builder.finish()

results = rt.neighbors(tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]

Note

Most of the time, you should call add only once. This function is vectorized, and you should avoid calling it in a loop with a few rows of data at a time.

In some cases, it may be useful to call this in a loop, such as if the input geometries do not all fit in memory at once.

It's important to add arrays at a time. This should usually not be called in a loop.

Parameters:

min_x (ArrayLike) –

array-like input. If this is the only provided input, it should represent the entire bounding box, as described above. Otherwise, pass four separate parameters.
min_y (ArrayLike | None, default: None ) –

array-like input. Defaults to None.
max_x (ArrayLike | None, default: None ) –

array-like input. Defaults to None.
max_y (ArrayLike | None, default: None ) –

array-like input. Defaults to None.

Returns:

Array –

An Arrow array with the insertion index of each element, which provides a lookup back into the original data.

This can be converted to a pyarrow.Array by passing to pyarrow.array.

finish ¶

finish(method: Literal['hilbert', 'str'] = 'hilbert') -> RTree

Sort the internal index and convert this class to an RTree instance.

Parameters:

method (Literal['hilbert', 'str'], default: 'hilbert' ) –
The method used for sorting the RTree. Defaults to "hilbert".
- "hilbert" will use a Hilbert Curve for sorting.
- "str" will use the Sort-Tile-Recursive algorithm.

Returns:

RTree –

An immutable RTree instance, which can be used for spatial queries.

RTreeMetadata ¶

Common metadata to describe an RTree.

This can be used to know the number of items, node information, or total byte size of an RTree.

Additionally, this can be used to know how much memory an RTree would use with the given number of items and node size. An RTree with 1 million items and a node size of 20 would take up 38 MiB.

from geoindex_rs import rtree as rt

metadata = rt.RTreeMetadata(num_items=1_000_000, node_size=20)
assert metadata.num_bytes == 37_894_796

node_size `property` ¶

node_size: int

The maximum number of items per node.

num_bytes `property` ¶

num_bytes: int

The number of bytes that an RTree with this metadata would have.

num_items `property` ¶

num_items: int

The number of items indexed in the tree.

num_levels `property` ¶

num_levels: int

The number of levels in the tree.

num_nodes `property` ¶

num_nodes: int

The total number of nodes at all levels in the tree.

init ¶

__init__(
    num_items: int,
    node_size: int = 16,
    coord_type: Literal["float32", "float64"] = "float64",
) -> None

Create a new RTreeMetadata given a number of items and node size.

Parameters:

num_items (int) –

The number of items in the tree
node_size (int, default: 16 ) –

The node size of the tree. Defaults to 16.
coord_type (Literal['float32', 'float64'], default: 'float64' ) –

The coordinate type to use in the tree. Currently only float32 and float64 are permitted. Defaults to None.

from_index `classmethod` ¶

from_index(index: IndexLike) -> RTreeMetadata

Create from an existing RTree buffer.

boxes_at_level ¶

boxes_at_level(index: IndexLike, level: int, *, copy: bool = False) -> Array

Access the raw bounding box data contained in the RTree at a given tree level.

Parameters:

index (IndexLike) –

the RTree to search.
level (int) –

The level of the tree to read from. Level 0 is the base of the tree. Each integer higher is one level higher of the tree.

Other Parameters:

copy (bool) –

if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.

Returns:

Array –

An Arrow FixedSizeListArray containing the bounding box coordinates.

If copy is False, the returned array is a a zero-copy view from Rust. Note that it will keep the entire index memory alive until the returned array is garbage collected.

neighbors ¶

neighbors(
    index: IndexLike,
    x: int | float,
    y: int | float,
    *,
    max_results: int | None,
    max_distance: int | float | None,
) -> Array

Search items in order of distance from the given point.

Example:

import numpy as np
from geoindex_rs import rtree as rt

builder = rt.RTreeBuilder(3)
min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()

results = rt.neighbors(tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]

Parameters:

index (IndexLike) –

the RTree to search.
x (int | float) –

the x coordinate of the query point
y (int | float) –

the y coordinate of the query point
max_results (int | None) –

The maximum number of results to search. If not provided, all results (within max_distance) will be returned.
max_distance (int | float | None) –

The maximum distance from the query point to search. If not provided, all results (up to max_results) will be returned.

Returns:

Array –

An Arrow array with the insertion indexes of query results.

partition_boxes ¶

partition_boxes(index: IndexLike, *, copy: bool = False) -> RecordBatch

Extract the geometries of the spatial partitions from an RTree.

In order for these boxes to be zero-copy from Rust, they are returned as a FixedSizeListArray, where each element has 4 items.

Note

While partitions returns a RecordBatch that has a length of the number of items, this returns a RecordBatch with a length matching the number of partitions.

This is equivalent to calling boxes_at_level(1), plus the partition_id column, which is a monotonically-increasing integer column.

Parameters:

index (IndexLike) –

the RTree to use.

Other Parameters:

copy (bool) –

if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.

Returns:

RecordBatch –

An Arrow RecordBatch with two columns: boxes and partition_ids. boxes stores the box geometry of each partition and partition_ids refers to the partition each row belongs to.

If copy is False, the boxes column is constructed as a zero-copy view on the internal boxes data. The partition_id column will be uint16 type if there are less than 65,536 partitions; otherwise it will be uint32 type.

partitions ¶

partitions(index: IndexLike, *, copy=False) -> RecordBatch

Extract the spatial partitions from an RTree.

This can be used to find the sorted groups for spatially partitioning the original data, such as when writing spatially-partitioned GeoParquet.

Note

Currently, this always uses the lowest level of the tree when inferring partitioning. Thus, for large input you may want to use the largest node size possible (65535) for constructing a tree for use in spatial partitioning.

Future work may allow higher levels of the tree to be used for partitioning.

Example:

import numpy as np
from geoindex_rs import rtree as rt

# Create a new builder with space for 100 million boxes
num_items = 100_000_000
builder = rt.RTreeBuilder(num_items, 65535)
min_x = np.random.uniform(-100, -10, num_items)
min_y = np.random.uniform(-50, -10, num_items)
max_x = np.random.uniform(10, 100, num_items)
max_y = np.random.uniform(10, 50, num_items)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()

partitions = rt.partitions(tree)
# There are 1526 partitioned groups
assert partitions["partition_id"][-1].as_py() == 1525

Parameters:

index (IndexLike) –

the RTree to use.

Other Parameters:

copy –

if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.

Returns:

RecordBatch –

An Arrow RecordBatch with two columns: indices and partition_ids. indices refers to the insertion index of each row and partition_ids refers to the partition each row belongs to.

If copy is False, the indices column is constructed as a zero-copy view on the provided index. Therefore, the indices array will have type uint16 if the tree has fewer than 16,384 items; otherwise it will have type uint32.

search ¶

search(
    index: IndexLike,
    min_x: int | float,
    min_y: int | float,
    max_x: int | float,
    max_y: int | float,
) -> Array

Search an RTree for elements intersecting the provided bounding box.

Results are the insertion indexes of items that match the query.

Parameters:

index (IndexLike) –

the RTree to search
min_x (int | float) –

The min_x coordinate of the query bounding box
min_y (int | float) –

The min_y coordinate of the query bounding box
max_x (int | float) –

The max_x coordinate of the query bounding box
max_y (int | float) –

The max_y coordinate of the query bounding box

Returns:

Array –

An Arrow array with the insertion indexes of query results.

tree_join ¶

tree_join(left: IndexLike, right: IndexLike) -> RecordBatch

Find the overlapping elements of two RTrees.

This is the first part of a spatial join: to find which elements from two different data sources potentially intersect.

Note that this only evaluates intersection of the bounding boxes in the tree. Assuming that these bounding boxes represent actual vector geometries, the result of this join are candidates. Any pairs present implies that the bounding boxes of those two indices intersect. Any pairs not present implies that the bounding boxes of those two indices cannot intersect, and therefore the represented geometries cannot intersect.

This returns an Arrow RecordBatch with positional indexes from the left and right trees. The RecordBatch has two uint32 columns named "left" and "right".

Parameters:

left (IndexLike) –

The "left" spatial index for the join.
right (IndexLike) –

The "right" spatial index for the join.

Returns:

RecordBatch –

An Arrow RecordBatch with the positional indexes of intersecting tree elements.

RTree¶

geoindex_rs.rtree ¶

ArrayLike module-attribute ¶

IndexLike module-attribute ¶

RTree ¶

metadata property ¶

RTreeBuilder ¶

__init__ ¶

add ¶

finish ¶

RTreeMetadata ¶

node_size property ¶

num_bytes property ¶

num_items property ¶

num_levels property ¶

num_nodes property ¶

__init__ ¶

from_index classmethod ¶

boxes_at_level ¶

neighbors ¶

partition_boxes ¶

partitions ¶

search ¶

tree_join ¶

ArrayLike `module-attribute` ¶

IndexLike `module-attribute` ¶

metadata `property` ¶

init ¶

node_size `property` ¶

num_bytes `property` ¶

num_items `property` ¶

num_levels `property` ¶

num_nodes `property` ¶

init ¶

from_index `classmethod` ¶