Skip to content

Fix Memory Backpressure/Spilling Coordination #20715

@cetra3

Description

@cetra3

Is your feature request related to a problem or challenge?

Right now in Datafusion there are bugs when you are sorting & spilling with multiple partitions. There is even a comment that on the multi level merge that this is not a great solution:

// For memory pools that are not shared this is good, for other this is not
// and there should be some upper limit to memory reservation so we won't starve the system

There is also some contention between when we should spill and when we are out of memory. Both of these are essentially two limits on the memory cap.

After we fix memory accounting via solutions in #20714, we should be able to adjust the memory pool implementation to allow for more memory coordination.

Describe the solution you'd like

Implement a new "memory coordinator" infrastructure which has the following properties:

  • Partition Aware Scheduling: if there are multiple partitions across one or more active queries, we should ensure that we fairly allocate memory according to partitions
  • High and Low memory watermarks: we should have two thresholds for our memory limits. The first is the low watermark: when things should start to spill, and high watermark: when we should error out because we are at max memory and there is no way to continue without blowing out memory allocations
  • Ability to "wait" for memory to be free: rather than error out immediately if there is no memory, have the ability to wait until memory is available. This would mean that we would need some sort of method to wait until memory is available. However: this is deadlock prone so we need to be careful, or have it as an opt-in. I.e, have the default behaviour to error out immediately, but allow users of datafusion the ability to wait around for a bit

I have experimented with all of these style properties in this branch: main...pydantic:datafusion:memory_observations but it was vibe coded to prove the concept (which it does), but probably needs to be broken out into discrete chunks of work and more thoroughly thought out.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions