RTree¶
geoindex_rs.rtree ¶
ArrayLike
module-attribute
¶
ArrayLike = Union[ndarray, ArrowArrayExportable, Buffer]
A type alias for accepted input to the RTreeBuilder.add
method.
Accepted input includes numpy arrays, Arrow arrays, and C-contiguous buffer protocol input.
IndexLike
module-attribute
¶
IndexLike = Union[ndarray, ArrowArrayExportable, Buffer, RTree]
A type alias for accepted input as an RTree.
RTree ¶
Bases: Buffer
A fast, memory-efficient, zero-copy-compatible, 2D RTree.
Use RTreeBuilder
, and then call
finish
to construct this.
This uses a binary-stable memory layout under the hood. So you can write the index to disk, then later load it and perform queries on it.
This class implements the Python buffer protocol, so you can pass it to the Python
bytes
constructor to copy the underlying binary memory into a Python bytes
object.
import numpy as np
from geoindex_rs import rtree as rt
min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)
builder = rt.RTreeBuilder(num_items=3)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()
# Copy to a Python bytes object
copied_tree = bytes(tree)
# The entire RTree is contained within this 144 byte buffer
assert len(copied_tree) == 144
# We can use the bytes object (or anything else implementing the Python buffer
protocol) directly in searches
results = rt.neighbors(copied_tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]
metadata
property
¶
metadata: RTreeMetadata
Access the metadata instance of this RTree.
Use this to infer the number of items or nodes in this RTree.
RTreeBuilder ¶
A builder class to create an RTree.
Example:
import numpy as np
from geoindex_rs import rtree as rt
builder = rt.RTreeBuilder(3)
min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()
__init__ ¶
__init__(
num_items: int,
node_size: int = 16,
coord_type: Literal["float32", "float64"] = "float64",
) -> None
Initialize an RTree with the given parameters.
This will provision memory for the RTree, to be filled in the add
method.
Parameters:
add ¶
add(
min_x: ArrayLike,
min_y: ArrayLike | None = None,
max_x: ArrayLike | None = None,
max_y: ArrayLike | None = None,
) -> Array
Insert bounding boxes in this RTree.
There are multiple ways to call this method:
- Four numpy or Arrow numeric arrays, passed as
min_x
,min_y
,max_x
,max_y
. - A 2D numpy array, with shape
(N, 4)
. It must be C-contiguous. - One argument with an Arrow array of type FixedSizeList or Struct. The
FixedSizeListArray must have a
list_size
of 4, and theStructArray
must have four children, orderedmin_x
,min_y
,max_x
,max_y
.
In this case, all other parameters should be left as None
.
Example: Adding shapely geometries
import numpy as np
import shapely
from geoindex_rs import rtree as rt
# Shapely array of geometries, such as from GeoPandas
geometries = [...]
builder = rt.RTreeBuilder(len(geometries))
# Find the bounding box of each geometry
bounds = shapely.bounds(geometries)
builder.add(bounds)
tree = builder.finish()
results = rt.neighbors(tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]
Note
Most of the time, you should call add
only once. This function is
vectorized, and you should avoid calling it in a loop with a few rows of
data at a time.
In some cases, it may be useful to call this in a loop, such as if the input geometries do not all fit in memory at once.
It's important to add arrays at a time. This should usually not be called in a loop.
Parameters:
-
min_x
(ArrayLike
) –array-like input. If this is the only provided input, it should represent the entire bounding box, as described above. Otherwise, pass four separate parameters.
-
min_y
(ArrayLike | None
, default:None
) –array-like input. Defaults to None.
-
max_x
(ArrayLike | None
, default:None
) –array-like input. Defaults to None.
-
max_y
(ArrayLike | None
, default:None
) –array-like input. Defaults to None.
Returns:
-
Array
–An Arrow array with the insertion index of each element, which provides a lookup back into the original data.
This can be converted to a
pyarrow.Array
by passing topyarrow.array
.
finish ¶
Sort the internal index and convert this class to an RTree instance.
Parameters:
-
method
(Literal['hilbert', 'str']
, default:'hilbert'
) –The method used for sorting the RTree. Defaults to
"hilbert"
."hilbert"
will use a Hilbert Curve for sorting."str"
will use the Sort-Tile-Recursive algorithm.
Returns:
-
RTree
–An immutable RTree instance, which can be used for spatial queries.
RTreeMetadata ¶
Common metadata to describe an RTree.
This can be used to know the number of items, node information, or total byte size of an RTree.
Additionally, this can be used to know how much memory an RTree would use with the given number of items and node size. An RTree with 1 million items and a node size of 20 would take up 38 MiB.
from geoindex_rs import rtree as rt
metadata = rt.RTreeMetadata(num_items=1_000_000, node_size=20)
assert metadata.num_bytes == 37_894_796
num_bytes
property
¶
num_bytes: int
The number of bytes that an RTree with this metadata would have.
__init__ ¶
__init__(
num_items: int,
node_size: int = 16,
coord_type: Literal["float32", "float64"] = "float64",
) -> None
Create a new RTreeMetadata given a number of items and node size.
Parameters:
from_index
classmethod
¶
from_index(index: IndexLike) -> RTreeMetadata
Create from an existing RTree buffer.
boxes_at_level ¶
Access the raw bounding box data contained in the RTree at a given tree level.
Parameters:
-
index
(IndexLike
) –the RTree to search.
-
level
(int
) –The level of the tree to read from. Level 0 is the base of the tree. Each integer higher is one level higher of the tree.
Other Parameters:
-
copy
(bool
) –if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.
Returns:
-
Array
–An Arrow FixedSizeListArray containing the bounding box coordinates.
If
copy
isFalse
, the returned array is a a zero-copy view from Rust. Note that it will keep the entire index memory alive until the returned array is garbage collected.
neighbors ¶
neighbors(
index: IndexLike,
x: int | float,
y: int | float,
*,
max_results: int | None,
max_distance: int | float | None,
) -> Array
Search items in order of distance from the given point.
Example:
import numpy as np
from geoindex_rs import rtree as rt
builder = rt.RTreeBuilder(3)
min_x = np.arange(0, 3)
min_y = np.arange(0, 3)
max_x = np.arange(2, 5)
max_y = np.arange(2, 5)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()
results = rt.neighbors(tree, 5, 5)
assert results.to_pylist() == [2, 1, 0]
Parameters:
-
index
(IndexLike
) –the RTree to search.
-
x
(int | float
) –the
x
coordinate of the query point -
y
(int | float
) –the
y
coordinate of the query point -
max_results
(int | None
) –The maximum number of results to search. If not provided, all results (within
max_distance
) will be returned. -
max_distance
(int | float | None
) –The maximum distance from the query point to search. If not provided, all results (up to
max_results
) will be returned.
Returns:
-
Array
–An Arrow array with the insertion indexes of query results.
partition_boxes ¶
partition_boxes(index: IndexLike, *, copy: bool = False) -> RecordBatch
Extract the geometries of the spatial partitions from an RTree.
In order for these boxes to be zero-copy from Rust, they are returned as a FixedSizeListArray, where each element has 4 items.
Note
While partitions
returns a RecordBatch
that has a length of the number of
items, this returns a RecordBatch
with a length matching the number of
partitions.
This is equivalent to calling boxes_at_level(1)
, plus the partition_id
column,
which is a monotonically-increasing integer column.
Parameters:
-
index
(IndexLike
) –the RTree to use.
Other Parameters:
-
copy
(bool
) –if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.
Returns:
-
RecordBatch
–An Arrow
RecordBatch
with two columns:boxes
andpartition_ids
.boxes
stores the box geometry of each partition andpartition_ids
refers to the partition each row belongs to.If
copy
isFalse
, theboxes
column is constructed as a zero-copy view on the internal boxes data. Thepartition_id
column will beuint16
type if there are less than 65,536 partitions; otherwise it will beuint32
type.
partitions ¶
partitions(index: IndexLike, *, copy=False) -> RecordBatch
Extract the spatial partitions from an RTree.
This can be used to find the sorted groups for spatially partitioning the original data, such as when writing spatially-partitioned GeoParquet.
Note
Currently, this always uses the lowest level of the tree when inferring
partitioning. Thus, for large input you may want to use the largest node size
possible (65535
) for constructing a tree for use in spatial partitioning.
Future work may allow higher levels of the tree to be used for partitioning.
Example:
import numpy as np
from geoindex_rs import rtree as rt
# Create a new builder with space for 100 million boxes
num_items = 100_000_000
builder = rt.RTreeBuilder(num_items, 65535)
min_x = np.random.uniform(-100, -10, num_items)
min_y = np.random.uniform(-50, -10, num_items)
max_x = np.random.uniform(10, 100, num_items)
max_y = np.random.uniform(10, 50, num_items)
builder.add(min_x, min_y, max_x, max_y)
tree = builder.finish()
partitions = rt.partitions(tree)
# There are 1526 partitioned groups
assert partitions["partition_id"][-1].as_py() == 1525
Parameters:
-
index
(IndexLike
) –the RTree to use.
Other Parameters:
-
copy
–if True, make a copy of the data from the underlying RTree instead of viewing it directly. Making a copy can be preferred if you'd like to delete the index itself to save memory.
Returns:
-
RecordBatch
–An Arrow
RecordBatch
with two columns:indices
andpartition_ids
.indices
refers to the insertion index of each row andpartition_ids
refers to the partition each row belongs to.If
copy
isFalse
, theindices
column is constructed as a zero-copy view on the provided index. Therefore, theindices
array will have typeuint16
if the tree has fewer than 16,384 items; otherwise it will have typeuint32
.
search ¶
search(
index: IndexLike,
min_x: int | float,
min_y: int | float,
max_x: int | float,
max_y: int | float,
) -> Array
Search an RTree for elements intersecting the provided bounding box.
Results are the insertion indexes of items that match the query.
Parameters:
-
index
(IndexLike
) –the RTree to search
-
min_x
(int | float
) –The
min_x
coordinate of the query bounding box -
min_y
(int | float
) –The
min_y
coordinate of the query bounding box -
max_x
(int | float
) –The
max_x
coordinate of the query bounding box -
max_y
(int | float
) –The
max_y
coordinate of the query bounding box
Returns:
-
Array
–An Arrow array with the insertion indexes of query results.
tree_join ¶
tree_join(left: IndexLike, right: IndexLike) -> RecordBatch
Find the overlapping elements of two RTrees.
This is the first part of a spatial join: to find which elements from two different data sources potentially intersect.
Note that this only evaluates intersection of the bounding boxes in the tree. Assuming that these bounding boxes represent actual vector geometries, the result of this join are candidates. Any pairs present implies that the bounding boxes of those two indices intersect. Any pairs not present implies that the bounding boxes of those two indices cannot intersect, and therefore the represented geometries cannot intersect.
This returns an Arrow RecordBatch
with positional indexes from the left and right
trees. The RecordBatch
has two uint32
columns named "left"
and "right"
.
Parameters:
-
left
(IndexLike
) –The "left" spatial index for the join.
-
right
(IndexLike
) –The "right" spatial index for the join.
Returns:
-
RecordBatch
–An Arrow
RecordBatch
with the positional indexes of intersecting tree elements.