[CALCITE-6893] Remove agg from Union children in IntersectToDistinctRule - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.40.0
Component/s: None
Labels:
- pull-request-available

Description

Whether the aggregate is pushed down below the union should be decided by cost, not fixed by the rule.

SQL:

select ename from emp where deptno = 10
intersect
select ename from emp where deptno = 20

Apply the rules INTERSECT_TO_DISTINCT (updated version) and AGGREGATE_UNION_TRANSPOSE in the Hep planner.

This produces the following logical plan:

LogicalProject(ENAME=[$0])
  LogicalFilter(condition=[=($1, 2)])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)])
      LogicalUnion(all=[true])
        LogicalAggregate(group=[{0}], agg#0=[COUNT()])
          LogicalProject(ENAME=[$1])
            LogicalFilter(condition=[=($7, 10)])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        LogicalAggregate(group=[{0}], agg#0=[COUNT()])
          LogicalProject(ENAME=[$1])
            LogicalFilter(condition=[=($7, 20)])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
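
To make the shape of the plan above concrete, here is a minimal Java sketch that evaluates the same steps over in-memory rows: a per-branch group-by with COUNT(), a union all, a sum of the counts per name, and a filter on the sum being 2. The `Emp` class, the method names, and the sample rows are hypothetical and are not part of Calcite. The sample rows keep ename unique within each branch; the sketch does not explore how the plan behaves when a branch contains duplicate names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IntersectViaUnionSketch {
    /** Hypothetical stand-in for an EMP row; only the two columns the query touches. */
    static final class Emp {
        final String ename;
        final int deptno;

        Emp(String ename, int deptno) {
            this.ename = ename;
            this.deptno = deptno;
        }
    }

    /** One union branch: filter on deptno, project ename, group by ename with COUNT(). */
    static Map<String, Long> branch(List<Emp> emp, int deptno) {
        Map<String, Long> counts = new TreeMap<>();
        for (Emp e : emp) {
            if (e.deptno == deptno) {
                counts.merge(e.ename, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Union all of both branches, sum the counts per ename, keep names whose sum is 2. */
    static List<String> intersect(List<Emp> emp) {
        Map<String, Long> summed = new TreeMap<>();
        for (int deptno : new int[] {10, 20}) {
            branch(emp, deptno).forEach((name, c) -> summed.merge(name, c, Long::sum));
        }
        List<String> result = new ArrayList<>();
        summed.forEach((name, total) -> {
            if (total == 2L) {
                result.add(name);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        List<Emp> emp = List.of(
            new Emp("ALLEN", 10), new Emp("ALLEN", 20),
            new Emp("CLARK", 10), new Emp("SMITH", 20));
        System.out.println(intersect(emp)); // prints [ALLEN]
    }
}
```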

Next, apply the same two rules in the Volcano planner.

Final physical plan:

EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = {43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85
  EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = {42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84
    EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83
      EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = {40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82
        EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79
          EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78
            EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
        EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81
          EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80
            EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69

In the best plan, the children of the union have no aggregate.
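
As a reading aid for the cost figures in the physical plan, the following Java sketch re-adds the "rows" component bottom-up. It assumes that each node's cumulative cost is the sum of its inputs' cumulative costs plus the node's own contribution. The per-node contributions used here (2.1, 4.2, 1.125, 1.0) are inferred by subtracting adjacent cumulative values in the printed plan; they are not taken from Calcite's cost code, and the class and method names are hypothetical.

```java
import java.util.Locale;

public class CumulativeCostCheck {
    /** Cumulative cost = the node's own contribution plus its inputs' cumulative costs. */
    static double cumulative(double self, double... inputs) {
        double total = self;
        for (double input : inputs) {
            total += input;
        }
        return total;
    }

    /** Rebuilds the "rows" cost of the plan root from the table scans upward. */
    static double rootRows() {
        double scan = 14.0;                                 // EnumerableTableScan
        double branchFilter = cumulative(2.1, scan);        // 16.1 in the plan
        double branchProject = cumulative(2.1, branchFilter); // about 18.2
        double union = cumulative(4.2, branchProject, branchProject); // about 40.6
        double aggregate = cumulative(1.125, union);        // about 41.725
        double filter = cumulative(1.0, aggregate);         // about 42.725
        return cumulative(1.0, filter);                     // about 43.725
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "%.3f", rootRows())); // prints 43.725
    }
}
```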

DAG: (image not reproduced here)

Currently, Calcite does not support distributed planning. In a distributed plan, an aggregate is split into two stages. If the first stage reduces the row count substantially, pushing the aggregate down is worthwhile because it cuts the data sent over the network during the shuffle. Even so, changing the current rule is still worthwhile: Calcite already has rules that push aggregates down, so the choice can be left to the Volcano planner.
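
The two-stage aggregation mentioned above can be sketched in plain Java. This is only an illustration of the general idea, not Calcite code: each hypothetical worker partition computes partial counts locally, and only those partial rows are shuffled to the final stage. All names and data below are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStageAggSketch {
    /** Stage 1: a local COUNT per key, computed on one partition before the shuffle. */
    static Map<String, Long> partialCount(List<String> partition) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : partition) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    /** Stage 2: sums the partial counts that arrive from every partition. */
    static Map<String, Long> finalCount(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, c) -> totals.merge(key, c, Long::sum));
        }
        return totals;
    }

    /** Number of rows that cross the network when partial results are shuffled. */
    static int shuffledRows(List<Map<String, Long>> partials) {
        int rows = 0;
        for (Map<String, Long> partial : partials) {
            rows += partial.size();
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("A", "A", "A", "B"),
            List.of("A", "B", "B"));

        List<Map<String, Long>> partials = new ArrayList<>();
        int rawRows = 0;
        for (List<String> partition : partitions) {
            partials.add(partialCount(partition));
            rawRows += partition.size();
        }

        System.out.println(rawRows);                // prints 7 (rows shuffled without stage 1)
        System.out.println(shuffledRows(partials)); // prints 4 (rows shuffled with stage 1)
        System.out.println(finalCount(partials));   // prints {A=4, B=3}
    }
}
```

When the keys in a partition are mostly distinct, the first stage removes few rows, which is why leaving the decision to a cost-based planner is reasonable.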

Attachments

image-2025-03-16-11-02-04-725.png
16/Mar/25 03:02
895 kB
Zhen Chen

Issue Links

Blocked

CALCITE-7086 Implement a rule that performs the inverse operation of AggregateCaseToFilterRule

Resolved

links to

GitHub Pull Request #4246

People

Assignee: Zhen Chen

Reporter: Zhen Chen

Votes: 0

Watchers: 4

Dates

Due: 15/Mar/25

Created: 15/Mar/25 14:58

Updated: 06/Jul/25 10:47

Resolved: 08/Apr/25 17:55