feat(rust/sedona-spatial-join): Add a bounding box sampler for building spatial partitioners or other purposes #442

Kontinuation · 2025-12-12T07:52:16Z

This implements part of #436. Spatial partitioners are the core component of the spatially partitioned spatial join. We need to collect samples of geospatial objects to build the spatial partitioning grid.

The goal is to collect enough samples to create a high-quality spatial partitioning even for small datasets, without collecting so many samples for large datasets that we run out of memory. The sampling should be uniform, so that the collected samples faithfully represent the distribution of the entire dataset. The sampler should go through the sampled stream in a single pass, since evaluating the sampled stream multiple times may trigger repeated computation in upstream physical operators.

The sampling algorithm we adopted is a combination of reservoir sampling and Bernoulli sampling: it collects at least $N_\text{min}$ and at most $N_\text{max}$ samples per partition, and makes sure that the sampling rate won't go below $R$ before hitting $N_\text{max}$.

The algorithm maintains a set of sampled envelopes $S$ and goes through 4 stages as the number of rows seen, $k$, grows:

- Stage 1 - Filling the small reservoir: When $k < N_\text{min}$, simply add the envelope of the geometry to $S$.
- Stage 2 - Small reservoir sampling: When $N_\text{min} \leq k < \dfrac{N_\text{min}}{R}$, use the [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) method to maintain a fixed number of samples ($N_\text{min}$) in $S$.
- Stage 3 - Bernoulli sampling: When $k \geq \dfrac{N_\text{min}}{R} \land ||S|| < N_\text{max}$, use Bernoulli sampling with rate $R$ to decide whether to accept the next sample. $S$ starts to grow in this stage.
- Stage 4 - Large reservoir sampling: When $||S|| = N_\text{max}$, use the reservoir sampling method to maintain a fixed number of samples ($N_\text{max}$) in $S$.
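
The four stages above can be sketched as a single `push` method. This is a minimal illustration and not the code in `bbox_sampler.rs`: the `BBox` and `BBoxSampler` types, their field names, and the small `SplitMix64` generator (used here only so the sketch needs no external crate; the PR itself uses `fastrand`) are all assumptions. Here $k$ is taken to be the number of rows seen before the current one.

```rust
/// Axis-aligned bounding box (hypothetical stand-in for the envelope type in the PR).
#[derive(Clone, Copy, Debug, PartialEq)]
struct BBox { min_x: f64, min_y: f64, max_x: f64, max_y: f64 }

/// Tiny SplitMix64 generator, an assumption made to keep the sketch dependency-free.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
    /// Integer in [0, n); the slight modulo bias is acceptable for a sketch.
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Hypothetical single-pass sampler following the four stages described above.
struct BBoxSampler {
    n_min: usize,
    n_max: usize,
    rate: f64,
    seen: usize,
    samples: Vec<BBox>,
    rng: SplitMix64,
}

impl BBoxSampler {
    fn new(n_min: usize, n_max: usize, rate: f64, seed: u64) -> Self {
        assert!(n_min > 0 && n_min <= n_max && rate > 0.0 && rate <= 1.0);
        Self { n_min, n_max, rate, seen: 0, samples: Vec::new(), rng: SplitMix64(seed) }
    }

    fn push(&mut self, bbox: BBox) {
        let k = self.seen; // rows seen before this one
        self.seen += 1;
        if k < self.n_min {
            // Stage 1: fill the small reservoir.
            self.samples.push(bbox);
        } else if self.samples.len() >= self.n_max {
            // Stage 4: reservoir sampling with a fixed size of n_max.
            let j = self.rng.below(self.seen);
            if j < self.n_max { self.samples[j] = bbox; }
        } else if (k as f64) < self.n_min as f64 / self.rate {
            // Stage 2: reservoir sampling with a fixed size of n_min.
            let j = self.rng.below(self.seen);
            if j < self.n_min { self.samples[j] = bbox; }
        } else {
            // Stage 3: Bernoulli sampling, the sample set grows.
            if self.rng.next_f64() < self.rate { self.samples.push(bbox); }
        }
    }
}
```

Both reservoir stages use the classic replacement rule: the incoming row draws an index in `[0, rows seen so far)` and replaces a stored sample only if that index falls inside the reservoir.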

This algorithm guarantees that:

- Collect enough samples even for small partitions: If the number of rows in a partition is not less than $N_\text{min}$, at least $N_\text{min}$ samples will be collected. If the number of rows in a partition is less than $N_\text{min}$, all rows will be collected as samples.
- Won't collect too many samples for large partitions: $||S||$ will never exceed $N_\text{max}$, no matter how large the partition is.
- Uniform sampling: The samples are uniformly sampled even though the algorithm is composed of 4 distinct stages. This is trivial to prove.

Here is a figure illustrating the 4 stages of the sampling algorithm; it shows which stage is used to sample each portion of the row stream. We take $N_\text{min} = 1000$, $N_\text{max} = 10000$, $R = 0.01$ as an example.
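
For these example parameters, the stage boundaries can be worked out by hand. This is a back-of-the-envelope calculation derived from the stage definitions above, not a result reported in the PR:

$$
\frac{N_\text{min}}{R} = \frac{1000}{0.01} = 100{,}000, \qquad \frac{N_\text{max}}{R} = \frac{10000}{0.01} = 1{,}000{,}000
$$

Stage 1 covers the first $1000$ rows and Stage 2 runs until $k = 100{,}000$. At that boundary each row is in the reservoir with probability $N_\text{min} / k = R$, which matches the Bernoulli rate used in Stage 3. Stage 3 then needs $N_\text{max} - N_\text{min} = 9000$ more accepted rows at rate $R$, which takes about $9000 / 0.01 = 900{,}000$ rows in expectation, so Stage 4 is expected to start around $k \approx 1{,}000{,}000$.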

Copilot

Pull request overview

This PR implements a bounding box sampler for collecting representative samples from spatial datasets to build high-quality spatial partitioning grids. The sampler uses a multi-stage algorithm combining reservoir sampling and Bernoulli sampling to ensure uniform sampling while maintaining memory bounds.

Key changes:

- Implements a 4-stage sampling algorithm that guarantees minimum sample collection, maximum memory bounds, and uniform distribution
- Provides functionality to combine samples from multiple partitions while preserving uniform sampling properties
- Adds comprehensive test coverage for all sampling stages and edge cases

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| rust/sedona-spatial-join/src/utils/bbox_sampler.rs | New module implementing the multi-stage bounding box sampler with extensive documentation and tests |
| rust/sedona-spatial-join/src/utils.rs | Adds the bbox_sampler module to the utils module exports |
| rust/sedona-spatial-join/Cargo.toml | Adds fastrand dependency for non-cryptographic random number generation |


rust/sedona-spatial-join/src/utils/bbox_sampler.rs


paleolimbot

Thank you!

paleolimbot · 2025-12-12T20:34:40Z

rust/sedona-spatial-join/src/utils/bbox_sampler.rs

+        // make both sides having the same sampling rate
+        let subsampling_rate = self_sampling_rate / other_sampling_rate;
+        let mut samples = self.samples;
+        let mut rng = Rng::with_seed(seed);


Should these always be combined with a fresh Rng with the same seed or should this function accept an Rng?

Changed the parameter of combine to a mutable reference to Rng.
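
The idea in the snippet above, bringing both sides to the same sampling rate before merging, can be sketched as follows. This is an illustration under assumptions rather than the PR's `combine`: the `Samples` type and its fields are hypothetical, the sampling rate is assumed to be `items.len() / population`, and the random source is passed as a closure returning a float in `[0, 1)` instead of a `fastrand` `Rng`.

```rust
/// Samples drawn from one partition, plus the number of rows they were drawn from.
/// (Hypothetical type; the names are not taken from the PR.)
struct Samples<T> {
    population: usize,
    items: Vec<T>,
}

impl<T> Samples<T> {
    /// Fraction of the population that ended up in the sample.
    fn rate(&self) -> f64 {
        if self.population == 0 {
            0.0
        } else {
            self.items.len() as f64 / self.population as f64
        }
    }

    /// Merge two sample sets so that rows from both sides are kept at the same rate.
    /// `next_f64` is the caller-owned random source, passed by mutable reference.
    fn combine(self, other: Samples<T>, next_f64: &mut impl FnMut() -> f64) -> Samples<T> {
        // Nothing to rebalance if one side saw no rows.
        if self.population == 0 {
            return other;
        }
        if other.population == 0 {
            return self;
        }
        // Keep the side with the lower rate intact and thin out the other side.
        let (low, high) = if self.rate() <= other.rate() {
            (self, other)
        } else {
            (other, self)
        };
        let keep = low.rate() / high.rate(); // subsampling rate in [0, 1]
        let population = low.population + high.population;
        let mut items = low.items;
        items.extend(high.items.into_iter().filter(|_| next_f64() < keep));
        Samples { population, items }
    }
}
```

Taking the random source by mutable reference means successive `combine` calls continue one random sequence rather than each restarting from the same seed.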

Kontinuation force-pushed the bbox-sampler branch 2 times, most recently from f96d89f to c408d12 Compare December 12, 2025 09:32

Kontinuation requested a review from Copilot December 12, 2025 09:33

Copilot AI reviewed Dec 12, 2025


Add a bounding box sampler for building spatial partitions or other p…

a9b1a03

…urposes

Kontinuation force-pushed the bbox-sampler branch from c408d12 to a9b1a03 Compare December 12, 2025 09:53

paleolimbot approved these changes Dec 12, 2025


paleolimbot changed the title ~~feat: Add a bounding box sampler for building spatial partitioners or other purposes~~ feat(rust/sedona-spatial-join): Add a bounding box sampler for building spatial partitioners or other purposes Dec 12, 2025

Pass a mutable random number generator instead of seed into combine

c8a1ee0

Kontinuation marked this pull request as ready for review December 15, 2025 07:26

Kontinuation mentioned this pull request Dec 15, 2025

feat(rust/sedona-spatial-join): Add spatial partitioners for partitioning spatially joined data by partitioning grids #443

Merged

paleolimbot approved these changes Dec 16, 2025


Kontinuation merged commit 68cf66e into apache:main Dec 17, 2025
14 checks passed
