-
Notifications
You must be signed in to change notification settings - Fork 2k
Description
Is your feature request related to a problem or challenge?
Right now in Datafusion there are bugs when you are sorting & spilling with multiple partitions. There is even a comment that on the multi level merge that this is not a great solution:
datafusion/datafusion/physical-plan/src/sorts/multi_level_merge.rs
Lines 361 to 362 in d412ba5
| // For memory pools that are not shared this is good, for other this is not | |
| // and there should be some upper limit to memory reservation so we won't starve the system |
There is also some contention between when we should spill and when we are out of memory. Both of these are essentially two limits on the memory cap.
After we fix memory accounting via solutions in #20714, we should be able to adjust the memory pool implementation to allow for more memory coordination.
Describe the solution you'd like
Implement a new "memory coordinator" infrastructure which has the following properties:
- Partition Aware Scheduling: if there are multiple partitions across one or more active queries, we should ensure that we fairly allocate memory according to partitions
- High and Low memory watermarks: we should have two thresholds for our memory limits. The first is the low watermark: when things should start to spill, and high watermark: when we should error out because we are at max memory and there is no way to continue without blowing out memory allocations
- Ability to "wait" for memory to be free: rather than error out immediately if there is no memory, have the ability to wait until memory is available. This would mean that we would need some sort of method to wait until memory is available. However: this is deadlock prone so we need to be careful, or have it as an opt-in. I.e, have the default behaviour to error out immediately, but allow users of datafusion the ability to wait around for a bit
I have experimented with all of these style properties in this branch: main...pydantic:datafusion:memory_observations but it was vibe coded to prove the concept (which it does), but probably needs to be broken out into discrete chunks of work and more thoroughly thought out.
Describe alternatives you've considered
No response
Additional context
No response