Skip to content

[CALCITE-5944] Add metadata for Sample #3382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 8, 2023

Conversation

JiajunBernoulli
Copy link
Contributor

No description provided.

@@ -104,6 +105,10 @@ public class RelMdMaxRowCount
return mq.getMaxRowCount(rel.getInput());
}

public @Nullable Double getMaxRowCount(Sample rel, RelMetadataQuery mq) {
return mq.getMaxRowCount(rel.getInput());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,I have a question is that why getMaxRowCount method not consider the sample rate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bernoulli sampling can, in the worst case, return all rows, or no rows.

Not sure about system sampling.

@@ -103,6 +104,10 @@ public Double getMinRowCount(Filter rel, RelMetadataQuery mq) {
return mq.getMinRowCount(rel.getInput());
}

public @Nullable Double getMinRowCount(Sample rel, RelMetadataQuery mq) {
return mq.getMinRowCount(rel.getInput());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -169,6 +170,12 @@ public Double getRowCount(Calc rel, RelMetadataQuery mq) {
return limit < rowCount ? limit : rowCount;
}

public @Nullable Double getRowCount(Sample rel, RelMetadataQuery mq) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is make sense for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Perhaps worth justifying with a comment (because nothing is obvious in statistics). The expected value of a binomial distribution with probability p over n trials is n * p.

* <a href="https://issues.apache.org/jira/browse/CALCITE-5944">[CALCITE-5944]
* RelMdRowCount can return more accurate rowCount. </a>. */
@Test void testRowCountSample() {
sql("select * from emp tablesample bernoulli(50) repeatable(1)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the test for getMaxRowCount(Sample rel, RelMetadataQuery mq) and getMinRowCount(Sample rel, RelMetadataQuery mq)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many '.'s.

Also, the jira summary is not very descriptive. This is about adding support for Sample, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missing Sample.

@JiajunBernoulli JiajunBernoulli changed the title [CALCITE-5944] RelMdRowCount can return more accurate rowCount [CALCITE-5944] RelMdRowCount can return more accurate rowCount for Sample Aug 21, 2023
* RelMdRowCount can return more accurate rowCount. </a>. */
@Test void testRowCountSample() {
sql("select * from emp tablesample bernoulli(50) repeatable(1)")
.assertThatRowCount(is(EMP_SIZE * 0.5), is(0D), is(Double.POSITIVE_INFINITY));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LakeShen
The second parameter 0D is MinRowCount.
The third parameter Double.POSITIVE_INFINITY is MaxRowCount.

So we have test for getMaxRowCount(Sample rel, RelMetadataQuery mq) and getMinRowCount(Sample rel, RelMetadataQuery mq).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LakeShen The second parameter 0D is MinRowCount. The third parameter Double.POSITIVE_INFINITY is MaxRowCount.

So we have test for getMaxRowCount(Sample rel, RelMetadataQuery mq) and getMinRowCount(Sample rel, RelMetadataQuery mq).

Hi @JiajunBernoulli ,thank you for reminding me.

@JiajunBernoulli
Copy link
Contributor Author

JiajunBernoulli commented Aug 26, 2023

The unit test case is unstable.


FAILURE 311.7sec, org.apache.calcite.adapter.tpch.TpchTest > testQuery07()

    java.util.concurrent.TimeoutException: testQuery07() timed out after 5 minutes

        at org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29)

        at org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58)

        at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)

        at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)

        at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)

        at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)

        at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)

        at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)

        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)

        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)

        at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)

        at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)

        at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)

        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:217)

        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)

        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:213)

        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:138)

        at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:68)

        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)

        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)

        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)

        at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)

        at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)

        at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)

        at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)

@JiajunBernoulli
Copy link
Contributor Author

@rubenada , would you please review it?

@JiajunBernoulli JiajunBernoulli added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Sep 2, 2023
@rubenada rubenada changed the title [CALCITE-5944] RelMdRowCount can return more accurate rowCount for Sample [CALCITE-5944] Add metadata for Sample Sep 4, 2023
@@ -169,6 +170,12 @@ public Double getRowCount(Calc rel, RelMetadataQuery mq) {
return limit < rowCount ? limit : rowCount;
}

public @Nullable Double getRowCount(Sample rel, RelMetadataQuery mq) {
final Double rowCount = mq.getRowCount(rel.getInput());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I'd name this variable inputRowCount or something like that.
Also, as Julian mentions, perhaps a comment would be helpful.

@rubenada
Copy link
Contributor

rubenada commented Sep 4, 2023

@JiajunBernoulli overall lgtm, I left a minor comment.

- RelMdAllPredicates
- RelMdColumnOrigins
- RelMdExpressionLineage
- RelMdMaxRowCount
- RelMdMinRowCount
- RelMdPredicates
- RelMdRowCount
@JiajunBernoulli
Copy link
Contributor Author

Rename inputRowCount and rebase main.

@rubenada Thank you~

@sonarqubecloud
Copy link

sonarqubecloud bot commented Sep 7, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

100.0% 100.0% Coverage
0.0% 0.0% Duplication

@JiajunBernoulli JiajunBernoulli merged commit 64268b9 into apache:main Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LGTM-will-merge-soon Overall PR looks OK. Only minor things left.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants