Skip to content

Conversation

@Kontinuation
Copy link
Member

This implements part of #436 . Spatial partitioners is the core component of the spatial partitioned spatial join. We need to collect samples of geospatial objects to build the spatial partitioning grid.

The goal is that we collect enough number of samples to create a high quality spatial partition even for small datasets, while not collecting too many samples for large datasets to avoid running out of memory. The sampling should be uniform, so that the collected samples could faithfully represent the distribution of the entire dataset. The sampler should only go through the sampled stream in one single pass, since evaluating the sampled stream multiple times may trigger repeated computations of upstream physical operators.

The sampling algorithm we adopted is a combination of reservoir sampling and Bernoulli sampling: it collects at least $N_\text{min}$ , at most $N_\text{max}$ samples per partition, and make sure that the sampling rate won’t go below $R$ before hitting $N_\text{max}$.

The algorithm maintains a set of sampled envelopes $S$, and will go through 4 stages as the number of rows seen $k$ proceeds:

  • Stage 1 - Filling the small reservoir: When $k < N_\text{min}$, simply add the envelope of the geometry to $S$
  • Stage 2 - Small reservoir sampling: when $N_\text{min} \leq k < \dfrac{N_\text{min}}{R}$, use [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) method to maintain a fixed number of samples ($N_\text{min}$) in $S$
  • Stage 3 - Bernoulli sampling: when $k \geq \dfrac{N_\text{min}}{R} \land ||S|| < N_\text{max}$, use Bernoulli sampling to determine if we accept the next sample or not. $S$ starts to grow in this stage.
  • Stage 4 - Large reservoir sampling: when $||S|| = N_\text{max}$, use reservoir sampling method to maintain a fixed number of samples ($N_\text{max}$) in $S$

This algorithm guarantees that:

  1. Collect enough samples even for small partitions: If number of rows in a partition is not less than $N_\text{min}$, at least $N_\text{min}$ samples will be collected. If number of rows in a partition is less than $N_\text{min}$, all rows will be collected as samples.
  2. Won’t collect too many samples for large partitions: $||S||$ will never exceed $N_\text{max}$, no matter how large the partition is.
  3. Uniform sampling: The samples are uniformly sampled even though the algorithm is composed by 4 distinct stages. This is trivial to prove.

Here is a figure illustrating the 4 stages of the sampling algorithm, it shows which stage is used to sample each portion of the row stream. We take $N_\text{min} = 1000$, $N_\text{max} = 10000$, $R = 0.01$ as an example.

SamplingStages drawio

@Kontinuation Kontinuation force-pushed the bbox-sampler branch 2 times, most recently from f96d89f to c408d12 Compare December 12, 2025 09:32
@Kontinuation Kontinuation requested a review from Copilot December 12, 2025 09:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a bounding box sampler for collecting representative samples from spatial datasets to build high-quality spatial partitioning grids. The sampler uses a multi-stage algorithm combining reservoir sampling and Bernoulli sampling to ensure uniform sampling while maintaining memory bounds.

Key changes:

  • Implements a 4-stage sampling algorithm that guarantees minimum sample collection, maximum memory bounds, and uniform distribution
  • Provides functionality to combine samples from multiple partitions while preserving uniform sampling properties
  • Adds comprehensive test coverage for all sampling stages and edge cases

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

File Description
rust/sedona-spatial-join/src/utils/bbox_sampler.rs New module implementing the multi-stage bounding box sampler with extensive documentation and tests
rust/sedona-spatial-join/src/utils.rs Adds the bbox_sampler module to the utils module exports
rust/sedona-spatial-join/Cargo.toml Adds fastrand dependency for non-cryptographic random number generation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

// make both sides having the same sampling rate
let subsampling_rate = self_sampling_rate / other_sampling_rate;
let mut samples = self.samples;
let mut rng = Rng::with_seed(seed);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these always be combined with a fresh Rng with the same seed or should this function accept an Rng?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the parameter of combine to a mutable reference to Rng.

@paleolimbot paleolimbot changed the title feat: Add a bounding box sampler for building spatial partitioners or other purposes feat(rust/sedona-spatial-join): Add a bounding box sampler for building spatial partitioners or other purposes Dec 12, 2025
@Kontinuation Kontinuation merged commit 68cf66e into apache:main Dec 17, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants