[DM-17872] Parsing for math and coercing types #116

cbanek · 2021-07-15T00:26:54Z

We had some problems at LSST where if someone was doing math in a select statement, such as doing a column plus a constant, and this would end up coming out as a string representation of the number, which causes some problems in portals and other things with smart sorting for number types. This is a pretty rough stab, and isn't well tested, but I was wondering what you think.

pdowler · 2021-07-15T17:17:50Z

The approach looks sound: figure out the obvious cases and fall back to string (like before)... can be extended in future if we encounter something more obscure.

The one thing that will cause pain is the resulting names in the TapSelectItem. From the test changes, these look to be the full function(arg) string, which is informative but doesn't follow the recommendation on names for expressions (https://www.ivoa.net/documents/TAP/20190927/REC-TAP-1.1.html#tth_sEc3, second paragraph of 3.2). The purpose of that is so that an output table can (probably) be used as an input (upload) table in the same or another tap service without having to modify field names. A "simple" solution allowed by the spec would be to use quoted identifiers (double quotes) for expressions, but I can't say for sure that our parser actually handles quoted identifiers correctly. taplint doesn't complain and I'm pretty sure Mark would test that, but you might want to try them (eg in an upload table) before going that way.

cbanek · 2021-07-15T20:00:01Z

Thanks for the double check on the methodology, and yes I was wondering about the names in the TapSelection. I think I can get this back to the way it was before, although now knowing why we do it is certainly helpful. My type coercion is still broken though, I think what I'm going to do is say that the output from an arithmetic expression is either long or double, and try to avoid the smaller datatypes just because I don't want to have to worry about computing overflow into a bigger type (since I don't know the results, I don't know if anything has actually overflowed). I guess those types will be a bit bigger, but should be fine to still upload back as a temporary table like you mention without having to modify the types. I'll keep working on this and let you know!

cbanek · 2021-07-15T20:11:50Z

I also just noticed that apparently the computed names of columns have to be unique! That'll be interesting. It seems like it might not exactly be unique now, if I have like two max calls, it seems like there will be two columns named max. I guess I could go for some kind of made up name like col42, col43, etc. and just increment the number. I liked how I had the entire subexpression since it seemed a little more useful in the finding the columns, but in reality users should be using aliases. I'll keep thinking about this a bit more.

pdowler · 2021-07-15T20:28:07Z

At some point in the code the description field from tap_schema.columns gets into the output... if you could get the original expression into the description that would help clarity. Might have to augment TapSelectItem to do that and track down where that info gets merged with the ColumnDesc and put into the output (votable field).

Another thing is if the ADQL has an alias for the expression that always takes precedence over a generated column name.

Fun times with SQL!

cbanek · 2021-07-17T01:59:26Z

Okay so an update here - I've gone ahead and gone back to the way the tests were with the function names and got that to pass. The only changes are the types where it's now a number and not a string.

Sorry this change got a bit big and I ended up kind of rewriting most of this. It's a bit nervewracking for me, but having the tests show green makes me feel pretty good. The one architectural thing I'd like to do that I'm struggling with is I think it'd be better if TapSelectItem could have a setAlias function or something to set the alias on an already created TapSelectItem. That'd get rid of some of the spread out alias handling and would be able to get it down to one spot. But when I make those changes, since they are in a different java library, I was having the trouble with the builds not finding the new interface. Anyway, this one is up for another round of review when you are ready!

cbanek · 2021-09-23T17:34:25Z

Any more thoughts on taking this one?

So this PR hasn't been accepted yet upstream, but we have one that fixes some of the parsing and type coercian for math in sql. Until then we should pin this version that I've checked in and built myself with the PR branch. opencadc/tap#116

pdowler

Sorry for the long delay on this... finally getting back to it.

I think it would be good to add a member var colIndex that starts at 0 that would be incremented in the loop over select list items (eg line incremented up in the loop over selectListItems (around line 182 where the item is added). Then the colIndex could be used to make sure all generated TapSelectItem name(s) are unique and as a side effect they'd have a number indicating which column it was. Usage in other comments below.

The ExtractorTest would benefit from at least one uniqueness test, eg

select max(table_name), max(column_name) from tap_schema.columns

which with the suggested changes would have columns max0 and max1. Now that I write this comment, I could see an argument for a 1-based colIndex :-)

cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java

pdowler · 2023-04-13T01:08:09Z

cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java

+            TapDataType rightType = right.getDatatype();
+
+            log.info("leftType: " + leftType + " rightType: " + rightType);
+


this is tricky. My instinct is to default to DOUBLE and see if the types allow one to go back to LONG. Then as a second order improvement see if it's more correct to go back to a smaller type (FLOAT, INT, SHORT, ...). I'm not sure that second step is worth any effort so I might stop at DOUBLE and LONG, because:

jsqlparser only has long and double

a lot of numeric functions are hard-coded to double in TapSchemaDAO.getFunctionDescs

it will get plenty more complicated when we get to adding ADQL-2.1 cast operation

Yeah, unless you specifically have a use case for FLOAT, I'd remove that bit. If you do have a use case, then you do need to remove the else below or check for DOUBLE before FLOAT.

OK, l just pulled out this whole thing. It's really hard to infer the type even if there's a column in there, since I can think of at least 1 case for each + - * /. And deefaulting to double seems too small since say double x double could be a long. And we don't have the number to test its range or anything. I think just a numbered col is where I'm at.

I don't think you need to worry about double * double going to anything but double, certainly not to long.

I still think fallback to double in SelectListExpressionExtractor.getItemFromExxpression() will work more often, Certainly expressions can output real values and if we use long the resulting VOTable will be invalid. Chosing long over double has to be the extra effort to improve the metadata quality when it is possible to infer that is safe/correct.

Now that I look at the complete function, is there also the possibility that an expression that's not a column or function results in a string (concat? anything else?) and that's not yet handled? I think that's why it was originally using string, but maybe that was overly cautious/lazy on my part :-)

There is a Concatenate extends BinaryExpression in another package but I don't recall where that gets shows up exactly... I think it's something that can be injected in the select list after the SelectListExpressionExtractor so the internal SQL concatenates multiple values into one output column (eg replace a Column with a Concatenate and probably have a custom Format for output). The presence of that class makes me thing jsqlparser doesn't have a concatenate.

cbanek · 2023-12-21T00:52:11Z

OK hopefully I've taken care of the points you mentioned.

cbanek · 2024-01-10T22:57:50Z

I'm still a little stumped by the concat operator - I've gone back to what you were doing with the default being string, and maybe that will catch that?

I've tried doing some queries in the unit tests with "||" which I think is the concat operator but yeah, jsqlparser did not like that and didn't parse the query. I do see that concatenate operator class but I'm not sure how to get to it.

Anyway, hopefully this is a bit better after some vacation to rest the ol' brain pan.

pdowler · 2024-01-22T21:25:34Z

I don't think that operator is valid SQL92 (hence ADQL), so that class can be used internally to replace something in the ADQL with a concatenate so the executed SQL has the || operator. So I guess it is safe to assume you won't see this in the SelectListExtractor because the extractor runs before any query mangling happens. That was just an example that came to mind of an "expression" that produces a string...

I still wonder if there are other expressions that can result in a string rather than a number... but maybe this is good enough: we can just go with double, attempt to deduce that long is correct and better, and then wait and see if anyone finds that obscure use case where the result is a string (and the votable is invalid)

cbanek · 2024-01-22T21:48:32Z

Well I am on line 384 of the SLEE defaulting back to string, so if there's another operator, and it's not one that I've already picked up above there in the if statements it should fall through to be a string. I can't think of any reason why a math expression would return a string, but I was thinking that maybe a function could return a string, like some databases I think have concat() But if someone used a function like that then they'd say it was a string output from that function, and we should be able to handle that no problem.

cbanek mentioned this pull request Oct 11, 2021

[DM-31965] Add in patched cadc-adql lsst-sqre/lsst-tap-service#64

Merged

pdowler requested changes Apr 13, 2023

View reviewed changes

pdowler approved these changes Feb 20, 2024

View reviewed changes

[DM-17872] Parsing for math and coercing types

3a8a15c

pdowler merged commit 5039841 into opencadc:master Feb 22, 2024
1 check passed

gpdf mentioned this pull request Feb 24, 2024

[DM-17872] Fix column names rubin-dp0/tutorial-notebooks#174

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DM-17872] Parsing for math and coercing types #116

[DM-17872] Parsing for math and coercing types #116

cbanek commented Jul 15, 2021

pdowler commented Jul 15, 2021

cbanek commented Jul 15, 2021

cbanek commented Jul 15, 2021

pdowler commented Jul 15, 2021

cbanek commented Jul 17, 2021

cbanek commented Sep 23, 2021

pdowler left a comment •

edited

Loading

pdowler Apr 13, 2023

cbanek Dec 21, 2023

gpdf Dec 27, 2023

pdowler Jan 4, 2024

cbanek commented Dec 21, 2023

cbanek commented Jan 10, 2024

pdowler commented Jan 22, 2024 •

edited

Loading

cbanek commented Jan 22, 2024

		TapDataType rightType = right.getDatatype();

		log.info("leftType: " + leftType + " rightType: " + rightType);

[DM-17872] Parsing for math and coercing types #116

[DM-17872] Parsing for math and coercing types #116

Conversation

cbanek commented Jul 15, 2021

pdowler commented Jul 15, 2021

cbanek commented Jul 15, 2021

cbanek commented Jul 15, 2021

pdowler commented Jul 15, 2021

cbanek commented Jul 17, 2021

cbanek commented Sep 23, 2021

pdowler left a comment • edited Loading

Choose a reason for hiding this comment

pdowler Apr 13, 2023

Choose a reason for hiding this comment

cbanek Dec 21, 2023

Choose a reason for hiding this comment

gpdf Dec 27, 2023

Choose a reason for hiding this comment

pdowler Jan 4, 2024

Choose a reason for hiding this comment

cbanek commented Dec 21, 2023

cbanek commented Jan 10, 2024

pdowler commented Jan 22, 2024 • edited Loading

cbanek commented Jan 22, 2024

pdowler left a comment •

edited

Loading

pdowler commented Jan 22, 2024 •

edited

Loading