-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[CALCITE-5944] Add metadata for Sample #3382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -104,6 +105,10 @@ public class RelMdMaxRowCount | |||
return mq.getMaxRowCount(rel.getInput()); | |||
} | |||
|
|||
public @Nullable Double getMaxRowCount(Sample rel, RelMetadataQuery mq) { | |||
return mq.getMaxRowCount(rel.getInput()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,I have a question is that why getMaxRowCount method not consider the sample rate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bernoulli sampling can, in the worst case, return all rows, or no rows.
Not sure about system sampling.
@@ -103,6 +104,10 @@ public Double getMinRowCount(Filter rel, RelMetadataQuery mq) { | |||
return mq.getMinRowCount(rel.getInput()); | |||
} | |||
|
|||
public @Nullable Double getMinRowCount(Sample rel, RelMetadataQuery mq) { | |||
return mq.getMinRowCount(rel.getInput()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@@ -169,6 +170,12 @@ public Double getRowCount(Calc rel, RelMetadataQuery mq) { | |||
return limit < rowCount ? limit : rowCount; | |||
} | |||
|
|||
public @Nullable Double getRowCount(Sample rel, RelMetadataQuery mq) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is make sense for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Perhaps worth justifying with a comment (because nothing is obvious in statistics). The expected value of a binomial distribution with probability p over n trials is n * p.
* <a href="https://issues.apache.org/jira/browse/CALCITE-5944">[CALCITE-5944] | ||
* RelMdRowCount can return more accurate rowCount. </a>. */ | ||
@Test void testRowCountSample() { | ||
sql("select * from emp tablesample bernoulli(50) repeatable(1)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add the test for getMaxRowCount(Sample rel, RelMetadataQuery mq) and getMinRowCount(Sample rel, RelMetadataQuery mq)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many '.'s.
Also, the jira summary is not very descriptive. This is about adding support for Sample, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missing Sample.
* RelMdRowCount can return more accurate rowCount. </a>. */ | ||
@Test void testRowCountSample() { | ||
sql("select * from emp tablesample bernoulli(50) repeatable(1)") | ||
.assertThatRowCount(is(EMP_SIZE * 0.5), is(0D), is(Double.POSITIVE_INFINITY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LakeShen
The second parameter 0D
is MinRowCount.
The third parameter Double.POSITIVE_INFINITY
is MaxRowCount.
So we have test for getMaxRowCount(Sample rel, RelMetadataQuery mq)
and getMinRowCount(Sample rel, RelMetadataQuery mq)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LakeShen The second parameter
0D
is MinRowCount. The third parameterDouble.POSITIVE_INFINITY
is MaxRowCount.So we have test for
getMaxRowCount(Sample rel, RelMetadataQuery mq)
andgetMinRowCount(Sample rel, RelMetadataQuery mq)
.
Hi @JiajunBernoulli ,thank you for reminding me.
The unit test case is unstable.
|
@rubenada , would you please review it? |
@@ -169,6 +170,12 @@ public Double getRowCount(Calc rel, RelMetadataQuery mq) { | |||
return limit < rowCount ? limit : rowCount; | |||
} | |||
|
|||
public @Nullable Double getRowCount(Sample rel, RelMetadataQuery mq) { | |||
final Double rowCount = mq.getRowCount(rel.getInput()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I'd name this variable inputRowCount
or something like that.
Also, as Julian mentions, perhaps a comment would be helpful.
@JiajunBernoulli overall lgtm, I left a minor comment. |
- RelMdAllPredicates - RelMdColumnOrigins - RelMdExpressionLineage - RelMdMaxRowCount - RelMdMinRowCount - RelMdPredicates - RelMdRowCount
08089ab
to
92757f3
Compare
Rename @rubenada Thank you~ |
Kudos, SonarCloud Quality Gate passed! |
No description provided.