-
Notifications
You must be signed in to change notification settings - Fork 38
feat(rust/sedona-spatial-join): Add a bounding box sampler for building spatial partitioners or other purposes #442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f96d89f to
c408d12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements a bounding box sampler for collecting representative samples from spatial datasets to build high-quality spatial partitioning grids. The sampler uses a multi-stage algorithm combining reservoir sampling and Bernoulli sampling to ensure uniform sampling while maintaining memory bounds.
Key changes:
- Implements a 4-stage sampling algorithm that guarantees minimum sample collection, maximum memory bounds, and uniform distribution
- Provides functionality to combine samples from multiple partitions while preserving uniform sampling properties
- Adds comprehensive test coverage for all sampling stages and edge cases
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rust/sedona-spatial-join/src/utils/bbox_sampler.rs | New module implementing the multi-stage bounding box sampler with extensive documentation and tests |
| rust/sedona-spatial-join/src/utils.rs | Adds the bbox_sampler module to the utils module exports |
| rust/sedona-spatial-join/Cargo.toml | Adds fastrand dependency for non-cryptographic random number generation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
c408d12 to
a9b1a03
Compare
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
| // make both sides having the same sampling rate | ||
| let subsampling_rate = self_sampling_rate / other_sampling_rate; | ||
| let mut samples = self.samples; | ||
| let mut rng = Rng::with_seed(seed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these always be combined with a fresh Rng with the same seed or should this function accept an Rng?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the parameter of combine to a mutable reference to Rng.
This implements part of #436 . Spatial partitioners is the core component of the spatial partitioned spatial join. We need to collect samples of geospatial objects to build the spatial partitioning grid.
The goal is that we collect enough number of samples to create a high quality spatial partition even for small datasets, while not collecting too many samples for large datasets to avoid running out of memory. The sampling should be uniform, so that the collected samples could faithfully represent the distribution of the entire dataset. The sampler should only go through the sampled stream in one single pass, since evaluating the sampled stream multiple times may trigger repeated computations of upstream physical operators.
The sampling algorithm we adopted is a combination of reservoir sampling and Bernoulli sampling: it collects at least$N_\text{min}$ , at most $N_\text{max}$ samples per partition, and make sure that the sampling rate won’t go below $R$ before hitting $N_\text{max}$ .
The algorithm maintains a set of sampled envelopes$S$ , and will go through 4 stages as the number of rows seen $k$ proceeds:
This algorithm guarantees that:
Here is a figure illustrating the 4 stages of the sampling algorithm, it shows which stage is used to sample each portion of the row stream. We take$N_\text{min} = 1000$ , $N_\text{max} = 10000$ , $R = 0.01$ as an example.