[CALCITE-6322] Casts to DECIMAL types are ignored #3733

Open
wants to merge 3 commits into main from issue6322
Conversation

mihaibudiu
Contributor

This PR addresses the problem that casts such as CAST(X AS DECIMAL(12, 2)) were actually interpreted as no-ops in Calcite.
With this change the value X is actually rounded to the specified precision and scale, and if the result overflows a runtime error is produced. This is in line with all the other SQL implementations I have checked.

This PR handles casts where the source type of X is an integer or a decimal type, but there may be additional cases to consider; we leave those for a later PR.

With this check quite a few tests would fail, so I had to insert additional casts to make them pass. This has changed the numeric precision of some results.
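
To illustrate the intended semantics, here is a minimal standalone Java sketch (not the PR's actual code; the class and helper names are made up, and it assumes the rounding mode is DOWN, i.e. truncation toward zero, as discussed later in this thread):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalCastSketch {
  // Hypothetical helper illustrating CAST(x AS DECIMAL(precision, scale)).
  static BigDecimal castToDecimal(BigDecimal x, int precision, int scale) {
    // Round the value to the requested scale, truncating toward zero.
    BigDecimal rounded = x.setScale(scale, RoundingMode.DOWN);
    // If the integer part needs more digits than precision - scale allows,
    // the value cannot be represented: raise a runtime error.
    if (rounded.precision() - rounded.scale() > precision - scale) {
      throw new ArithmeticException(
          "Value " + x + " out of range for DECIMAL(" + precision + ", " + scale + ")");
    }
    return rounded;
  }

  public static void main(String[] args) {
    System.out.println(castToDecimal(new BigDecimal("1234.5678"), 12, 2)); // 1234.56
    System.out.println(castToDecimal(new BigDecimal("1234.5678"), 4, 2));  // throws
  }
}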

@mihaibudiu
Contributor Author

I will add a commit handling the case of casting floating-point values to decimals; it's important.


sonarcloud bot commented Mar 14, 2024

@mihaibudiu
Contributor Author

This PR is ready for review. I consider this a major bug, since it essentially affects all DECIMAL computations.

@mihaibudiu
Contributor Author

This PR fixes a major Calcite bug which has been open for many years. It would be nice if someone could take a look.
Unfortunately, the changes affect many tests.
The problem is that casts to a DECIMAL(x, y) type were completely ignored; this PR implements them correctly.

@mihaibudiu mihaibudiu force-pushed the issue6322 branch 2 times, most recently from bea5205 to 69d3ab0 Compare April 18, 2024 18:32
@zabetak zabetak added the discussion-in-jira There's open discussion in JIRA to be resolved before proceeding with the PR label Apr 19, 2024
@@ -579,7 +579,7 @@ private static void assertRows(Interpreter interpreter,
final String sql = "select x, min(y), max(y), sum(y), avg(y)\n"
+ "from (values ('a', -1.2), ('a', 2.3), ('a', 15)) as t(x, y)\n"
+ "group by x";
sql(sql).returnsRows("[a, -1.2, 15.0, 16.1, 5.366666666666667]");
sql(sql).returnsRows("[a, -1.2, 15.0, 16.1, 5.3]");
Contributor

Shouldn't it be 5.4 in that case?
It seems the rounding should be tweaked a bit, or did I miss anything?

Contributor Author

@snuyanzin I have written extensive comments about the rounding mode in Calcite in several places, e.g.
https://issues.apache.org/jira/browse/CALCITE-6322
I don't like this choice myself, but it's consistent with how Calcite handles rounding elsewhere.
If we want to change this we should first (1) write a spec for the semantics of casts in Calcite, and (2) implement it consistently everywhere.

Contributor Author

I have no problem doing this work myself, but first everyone should agree on the desired result, because otherwise every piece of code in Calcite that handles conversions (and there are a lot of them) will do something different.
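
For concreteness, here is what the two candidate rounding modes do to the AVG value from the test above (a small standalone Java illustration, not Calcite code):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class RoundingModeDemo {
  public static void main(String[] args) {
    BigDecimal avg = new BigDecimal("5.366666666666667");
    // Truncation toward zero, which matches the 5.3 in the updated test:
    System.out.println(avg.setScale(1, RoundingMode.DOWN));    // 5.3
    // Round half up, which would produce the 5.4 the reviewer expected:
    System.out.println(avg.setScale(1, RoundingMode.HALF_UP)); // 5.4
  }
}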

!ok

select stddev_pop(sal) from emp;
EXPR$0
1139.488618295281639802851714193820953369140625
1139.48
Contributor

It looks like this should be
1139.49

EXPR$0
1512779.356060606
1512779.3560
Contributor

Also seems to be an issue with rounding.

Comment on lines +381 to +384
7654, 1366.66
7698, 1737.50
7844, 1690.00
7900, 1566.666666666667
7900, 1566.66
Contributor

same issue with rounding

Comment on lines +542 to +544
7698, 585.9465
7844, 602.7713
7900, 602.7713
Contributor

same here

Contributor

@zabetak zabetak left a comment

Overall the PR looks good, but the explicit CASTs that were added make me a bit skeptical, so I would like to understand better why they are needed. Apart from that I left some other small comments/suggestions.

@@ -523,78 +523,78 @@ select deptno, ratio_to_report(sal) over (partition by deptno) from emp;
!}

# STDDEV_POP
select empno, stddev_pop(comm) over (order by empno rows unbounded preceding) from emp where deptno = 30 order by 1;
select empno, stddev_pop(CAST(comm AS DECIMAL(12, 4))) over (order by empno rows unbounded preceding) from emp where deptno = 30 order by 1;
Contributor

According to the description, I assume that the CAST was added to avoid a runtime error but it is not evident why the error is raised. Does the plan have another CAST somewhere? For what reason?

As far as I see the column comm is declared as DECIMAL(7,2) and the result of the stddev_pop function should also be a DECIMAL(7,2) according to RelDataTypeSystemImpl#deriveAvgAggType. Based on the existing data the result of stddev_pop is always between 0 and 523 so I am not sure why it cannot be represented as DECIMAL(7,2).

From a SQL user perspective it is strange to require an explicit CAST for functions that are supposed to handle exact numerics such as stddev_pop, sum, variance etc.

Can you please elaborate a bit why it is necessary to introduce this explicit cast?

Contributor Author

Initially I had filed issue https://issues.apache.org/jira/browse/CALCITE-6324 about this problem. See the comments there as well. Indeed, in Calcite, as in Oracle, the result type of an aggregate is the same as the type of the value aggregated.

Moreover, the Calcite implementation of aggregates also uses the same type internally for all calculations. Since STDDEV needs to square values, it requires double the precision to avoid overflowing. So without changing the type here, this program would fail because of overflows for decimal values (prior to this PR such overflows were simply ignored, which is wrong).

The problem of internal overflows could be solved in two ways:
(1) change the internal implementation of all aggregation functions that could overflow to use a different precision for intermediate results.
(2) require the user to use a type wide enough to accommodate possible overflows

Unfortunately, the compiler cannot safely choose a type that guarantees the results won't overflow, even for a simple aggregate such as AVG. The reason is that, although the compiler has a bound on the precision of the values aggregated, in general it has no bound on the number of rows in the table at runtime (especially for cases like incremental view maintenance, where the generated plans need to work even for future contents of the tables). So no matter what precision the compiler chooses for internal computations, that precision may be exceeded at runtime for some input data.

This leaves the other alternative: the user should choose a precision high enough to carry out all computations. Although the user may also not know the size of the tables, at least the user has some domain-specific knowledge. So this PR adopts approach (2), which is also compatible with the existing Calcite implementation of aggregates.

If we want to implement (1), it would require much more work: we would have to modify the implementation of all Calcite aggregates, choosing a suitable type for intermediate results.
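
As a small standalone illustration of the overflow described above (not the actual Calcite aggregate implementation), using the DECIMAL(7, 2) comm column mentioned earlier in this thread:

import java.math.BigDecimal;

public class StddevOverflowDemo {
  public static void main(String[] args) {
    // comm is declared as DECIMAL(7, 2), i.e. at most 5 integer digits.
    BigDecimal comm = new BigDecimal("1400.00");
    // STDDEV needs the sum of squares; a single square already requires
    // 7 integer digits, so it does not fit in DECIMAL(7, 2).
    BigDecimal square = comm.multiply(comm); // 1960000.0000
    int intDigits = square.precision() - square.scale();
    System.out.println(square + " needs " + intDigits + " integer digits");
  }
}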

Contributor

Before this PR was there a loss of precision in the results? Was the old result of this query incorrect?

Does the same query in Oracle fail without adding the explicit CAST?

Contributor Author

Calcite does not specify anywhere the type of the results produced by an aggregate function. However, the type inference code for aggregates implies that the result type is the same as the input type. If this is correct, these results were wrong, because the scale of the results is not the same as the scale of the input data.

As you see, many of these problems stem from the fact that there is no written spec for what Calcite should be doing.

I don't have access to an Oracle database to check. My knowledge of Oracle for aggregations comes from their published documentation, e.g.: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/STDDEV.html

Postgres returns double if inputs are integers, or DECIMAL otherwise, a completely different behavior from Oracle or Calcite.

Contributor

Requiring an explicit CAST in order to make the query work is not user friendly and I would be a bit surprised if other databases behave the same way.

Since there is no doc about the expected types of aggregate functions, it is not completely accurate to say that the old result was wrong.

How about relaxing the precision overflow check till a more complete solution is found? In fact, I recall that there are some nuances across engines on what they consider "loss of precision" and "significant digits".

I am a bit worried that this kind of breaking change (requiring a CAST) will make some users unhappy. Nevertheless, I don't feel strongly about it, so I am leaving the final decision to you. Feel free to advance this as you see fit.

+ " DruidQuery(table=[[foodmart, foodmart]],"
+ " intervals=[[1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z]],"
+ " projects=[[$63, $90, $91, $89]], groups=[{0}], aggs=[[SUM($1), SUM($2), SUM($3)]],"
+ " post_projects=[[$0, /($1, $2), CASE(=($3, 0), CAST(1:DECIMAL(19, 0)):DECIMAL(19, 0),"
Contributor

The CAST(1:DECIMAL(19, 0)):DECIMAL(19, 0) seems like a redundant cast and it was not there before. Why do we need it now?

Contributor Author

I suspect this is some lack of optimization in the Druid code generator, which doesn't remove redundant casts. If we think this is a bug, I suggest we file a new issue to deal with it.

Contributor

The new CAST appeared as part of the changes in this PR, so even though it may be a missing optimization in Druid, it might have an impact on other Calcite users if CASTs are added redundantly for some reason. It may also be reproducible outside Druid, but maybe we are missing the right test case.

Contributor Author

The code I wrote in this PR does not modify the program representation in any way; it only changes the way constant expressions involving casts are evaluated. So this must be some interaction with the optimizer. I will try to find out what is going on here, though.

Contributor

I suppose this was addressed by other PRs? If that's the case you can mark this discussion as resolved.

final String plan =
"DruidQuery(table=[[foodmart, foodmart]], intervals=[[1900-01-09T00:00:00.000Z/"
+ "2992-01-10T00:00:00.000Z]], projects=[[$63, $90, $91, $89]], groups=[{0}], "
+ "aggs=[[SUM($1), SUM($2), SUM($3)]], post_projects=[[$0, /($1, $2), "
+ "CASE(=($3, 0), 1.0:DECIMAL(19, 0), CAST($3):DECIMAL(19, 0))]], sort0=[1], dir0=[DESC])";
+ "CASE(=($3, 0), CAST(1:DECIMAL(19, 0)):DECIMAL(19, 0), "
Contributor

redundant CAST?


/** Called from BuiltInMethod.INTEGER_DECIMAL_CAST */
public static @Nullable Object integerDecimalCast(
@Nullable Object value, int precision, int scale) {
Contributor

Later we are casting the Object to Number, so I assume that the generated code guarantees that it will always be like that. In that case, why not declare the value type as Number to begin with?

@@ -1836,7 +1836,7 @@ public static boolean isValidDecimalValue(@Nullable BigDecimal value, RelDataTyp
case DECIMAL:
final int intDigits = value.precision() - value.scale();
final int maxIntDigits = toType.getPrecision() - toType.getScale();
return intDigits <= maxIntDigits;
return (intDigits <= maxIntDigits) && (value.scale() <= toType.getScale());
Contributor

Why is this change needed? Is this something mandated by the standard? What does it fix?

Contributor Author

The question is really about what this function is supposed to implement.
My assumption is that this function decides whether a BigDecimal value can be represented exactly in the specified type. In this case I believe that both conditions need to be satisfied. If the scale is not large enough, the number may need rounding to be represented, so the representation won't be exact.
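
A small standalone example of the case the extra scale condition is meant to catch (illustration only; the value and target type are made up): the integer digits fit, yet the value still needs rounding.

import java.math.BigDecimal;

public class ExactRepresentationDemo {
  public static void main(String[] args) {
    BigDecimal v = new BigDecimal("1.25");             // precision 3, scale 2
    int targetPrecision = 4, targetScale = 1;          // target type DECIMAL(4, 1)
    int intDigits = v.precision() - v.scale();         // 1
    int maxIntDigits = targetPrecision - targetScale;  // 3
    // The old check passes (1 <= 3), but 1.25 cannot be stored in DECIMAL(4, 1)
    // without rounding, so the representation would not be exact.
    System.out.println(intDigits <= maxIntDigits);     // true
    System.out.println(v.scale() <= targetScale);      // false
  }
}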

Contributor

If we don't know what it does and why it exists, better not to modify it. That way we avoid accidental changes that are not directly related to this PR.

Contributor Author

I will check; this may be a leftover that is not needed here but is needed for https://issues.apache.org/jira/browse/CALCITE-3522. I expect that eventually we will need this code.

Contributor Author

This change is actually necessary, so I will create a new helper function for it to avoid modifying the existing one.

Comment on lines 603 to 607
@Test void testDecimalDecimalCast() {
final SqlOperatorFixture f = fixture();
f.checkScalar("CAST(1.123 AS DECIMAL(4, 0))", "1", "DECIMAL(4, 0) NOT NULL");
}

Contributor

Instead of adding a new method should we rather enrich testCastToExactNumeric?

@@ -598,6 +598,13 @@ void testCastBooleanToNumeric(CastType castType, SqlOperatorFixture f) {
"Cast function cannot convert value of type BOOLEAN to type DECIMAL\\(19, 0\\)", false);
}

/** Test case for <a href="https://issues.apache.org/jira/projects/CALCITE/issues/CALCITE-6322">
* [CALCITE-6322] Casts to DECIMAL types are ignored</a>. */
Contributor

The PR claims to add support for three kinds of CASTs:

  • decimal to decimal
  • int to decimal
  • double/float to decimal

I've seen a few tests for the first category; do we have something for the other two? If not, I think we should add some.

Moreover, I've noticed some tests that are currently disabled by the DECIMAL flag. A few of them should be fixed by this PR, so please enable those that are relevant.

Contributor Author

I will re-enable many of these tests, but this doesn't completely solve the cases covered by SqlOperatorTest.DECIMAL. I have filed a new issue https://issues.apache.org/jira/browse/CALCITE-6379 that covers arithmetic. Moreover, the current PR also does not cover casts from STRING and INTERVAL. I have filed https://issues.apache.org/jira/browse/CALCITE-6380 to cover these two cases.

@mihaibudiu
Contributor Author

@zabetak I have tried to address your comments. In the process I found at least one more bug in my PR: the rounding mode is really DOWN and not FLOOR. This was found through the new tests you requested.
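
For reference, the difference between the two modes only shows up for negative values (a standalone Java illustration, not Calcite code):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class DownVsFloorDemo {
  public static void main(String[] args) {
    BigDecimal negative = new BigDecimal("-1.27");
    // DOWN truncates toward zero; FLOOR rounds toward negative infinity.
    System.out.println(negative.setScale(1, RoundingMode.DOWN));  // -1.2
    System.out.println(negative.setScale(1, RoundingMode.FLOOR)); // -1.3
  }
}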

Contributor

@zabetak zabetak left a comment

Currently, the PR brings in some net improvements but also some potentially debatable breaking changes. Let's continue iterating on the discussion of the newly added CASTs till we reach consensus.

Comment on lines 558 to 565
select empno, sum(CAST(comm AS DECIMAL(12, 4))) over (order by empno rows unbounded preceding) from emp where deptno = 30 order by 1;
EMPNO, EXPR$1
7499, 300.00
7521, 800.00
7654, 2200.00
7698, 2200.00
7844, 2200.00
7900, 2200.00
7499, 300.0000
7521, 800.0000
7654, 2200.0000
7698, 2200.0000
7844, 2200.0000
7900, 2200.0000
Contributor

From a user perspective the old results for this query seem to be rather fine. With the changes in this PR it seems that there will be an ArithmeticException if there is no explicit CAST added by the user. Moreover, the extra CAST changes the results, which doesn't seem desired.

Is the new behavior better than the old one and why?

Contributor Author

If I use CAST(comm AS DECIMAL(12, 2)) the results will be unchanged. I will make that change to make the diff smaller.

The implementation of casts to decimal in Calcite treats them as no-ops, which is clearly wrong. This bug was masking many other potential problems, which will now become apparent. For example, aggregation results were also incorrectly computed.

There is no clear spec in Calcite how aggregation results are computed (i.e., the precision of intermediate results, or the precision of the final results). SQL dialects differ in this respect, and the standard does not mandate a specific implementation. The current Calcite implementation is a reasonable one, but it won't work for this example without the manually-inserted cast, because overflows in decimal computations will cause the computation to fail. If we think that the aggregation implementation in Calcite is wrong, we should file a separate issue.

Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
@mihaibudiu
Contributor Author

@zabetak I think I have addressed your comments. I moved part of this PR into a separate one, which has been merged (#3779), so what's left should be less controversial. I will also submit a new PR for https://issues.apache.org/jira/browse/CALCITE-6380, which will complete the implementation of the DECIMAL tests.


sonarcloud bot commented May 10, 2024

Contributor

@zabetak zabetak left a comment

Please consider also adding an entry in history.md about the potential breaking changes in this PR and, if necessary, updating the reference doc for the behavior of the CAST function.

Comment on lines +1915 to +1919
+ " DruidQuery(table=[[foodmart, foodmart]],"
+ " intervals=[[1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z]],"
+ " projects=[[$63, $90, $91, $89]], groups=[{0}], aggs=[[SUM($1), SUM($2), SUM($3)]],"
+ " post_projects=[[$0, /($1, $2), CASE(=($3, 0), 1.0:DECIMAL(19, 0),"
+ " CAST($3):DECIMAL(19, 0))]], sort0=[1], dir0=[DESC])\n";
Contributor

Apart from formatting is there anything else that changed? If not then please restore the formatting to avoid polluting the git history with unnecessary changes.

Comment on lines +2223 to +2224
+ "CASE(=($3, 0), 1.0:DECIMAL(19, 0), "
+ "CAST($3):DECIMAL(19, 0))]], sort0=[1], dir0=[DESC])";
Contributor

If it is just formatting please revert.

Comment on lines 1753 to +1755
f.checkScalarExact("case 1 when 1 then 11.2 "
+ "when 2 then 4.543 else null end",
"DECIMAL(5, 3)", "11.200");
"DECIMAL(5, 3)", "11.2");
Contributor

This test is still under the DECIMAL flag. Is it not fixed?

Comment on lines +11877 to +11878
f.checkScalarExact("ceil(cast(3.45 as decimal(19, 1)))",
"DECIMAL(19, 1) NOT NULL", "4");
Contributor

Why do we change the default type (DECIMAL(19,0))?

+ " DruidQuery(table=[[foodmart, foodmart]],"
+ " intervals=[[1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z]],"
+ " projects=[[$63, $90, $91, $89]], groups=[{0}], aggs=[[SUM($1), SUM($2), SUM($3)]],"
+ " post_projects=[[$0, /($1, $2), CASE(=($3, 0), CAST(1:DECIMAL(19, 0)):DECIMAL(19, 0),"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this was addressed by other PRs? If that's the case you can mark this discussion as resolved.
