-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DM-17872] Parsing for math and coercing types #116
Conversation
The approach looks sound: figure out the obvious cases and fall back to string (like before)... can be extended in future if we encounter something more obscure. The one thing that will cause pain is the resulting names in the TapSelectItem. From the test changes, these look to be the full |
Thanks for the double check on the methodology, and yes I was wondering about the names in the TapSelection. I think I can get this back to the way it was before, although now knowing why we do it is certainly helpful. My type coercion is still broken though, I think what I'm going to do is say that the output from an arithmetic expression is either long or double, and try to avoid the smaller datatypes just because I don't want to have to worry about computing overflow into a bigger type (since I don't know the results, I don't know if anything has actually overflowed). I guess those types will be a bit bigger, but should be fine to still upload back as a temporary table like you mention without having to modify the types. I'll keep working on this and let you know! |
I also just noticed that apparently the computed names of columns have to be unique! That'll be interesting. It seems like it might not exactly be unique now, if I have like two max calls, it seems like there will be two columns named max. I guess I could go for some kind of made up name like col42, col43, etc. and just increment the number. I liked how I had the entire subexpression since it seemed a little more useful in the finding the columns, but in reality users should be using aliases. I'll keep thinking about this a bit more. |
At some point in the code the description field from tap_schema.columns gets into the output... if you could get the original expression into the description that would help clarity. Might have to augment TapSelectItem to do that and track down where that info gets merged with the ColumnDesc and put into the output (votable field). Another thing is if the ADQL has an alias for the expression that always takes precedence over a generated column name. Fun times with SQL! |
Okay so an update here - I've gone ahead and gone back to the way the tests were with the function names and got that to pass. The only changes are the types where it's now a number and not a string. Sorry this change got a bit big and I ended up kind of rewriting most of this. It's a bit nervewracking for me, but having the tests show green makes me feel pretty good. The one architectural thing I'd like to do that I'm struggling with is I think it'd be better if TapSelectItem could have a setAlias function or something to set the alias on an already created TapSelectItem. That'd get rid of some of the spread out alias handling and would be able to get it down to one spot. But when I make those changes, since they are in a different java library, I was having the trouble with the builds not finding the new interface. Anyway, this one is up for another round of review when you are ready! |
Any more thoughts on taking this one? |
So this PR hasn't been accepted yet upstream, but we have one that fixes some of the parsing and type coercian for math in sql. Until then we should pin this version that I've checked in and built myself with the PR branch. opencadc/tap#116
So this PR hasn't been accepted yet upstream, but we have one that fixes some of the parsing and type coercian for math in sql. Until then we should pin this version that I've checked in and built myself with the PR branch. opencadc/tap#116
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay on this... finally getting back to it.
I think it would be good to add a member var colIndex
that starts at 0 that would be incremented in the loop over select list items (eg line incremented up in the loop over selectListItems (around line 182 where the item is added). Then the colIndex could be used to make sure all generated TapSelectItem name(s) are unique and as a side effect they'd have a number indicating which column it was. Usage in other comments below.
The ExtractorTest would benefit from at least one uniqueness test, eg
select max(table_name), max(column_name) from tap_schema.columns
which with the suggested changes would have columns max0
and max1
. Now that I write this comment, I could see an argument for a 1-based colIndex :-)
cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java
Outdated
Show resolved
Hide resolved
cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java
Show resolved
Hide resolved
cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java
Outdated
Show resolved
Hide resolved
cadc-adql/src/main/java/ca/nrc/cadc/tap/parser/extractor/SelectListExpressionExtractor.java
Outdated
Show resolved
Hide resolved
TapDataType rightType = right.getDatatype(); | ||
|
||
log.info("leftType: " + leftType + " rightType: " + rightType); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is tricky. My instinct is to default to DOUBLE and see if the types allow one to go back to LONG. Then as a second order improvement see if it's more correct to go back to a smaller type (FLOAT, INT, SHORT, ...). I'm not sure that second step is worth any effort so I might stop at DOUBLE and LONG, because:
- jsqlparser only has long and double
- a lot of numeric functions are hard-coded to double in TapSchemaDAO.getFunctionDescs
- it will get plenty more complicated when we get to adding ADQL-2.1 cast operation
Yeah, unless you specifically have a use case for FLOAT, I'd remove that bit. If you do have a use case, then you do need to remove the else
below or check for DOUBLE before FLOAT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, l just pulled out this whole thing. It's really hard to infer the type even if there's a column in there, since I can think of at least 1 case for each + - * /. And deefaulting to double seems too small since say double x double could be a long. And we don't have the number to test its range or anything. I think just a numbered col is where I'm at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to worry about double * double
going to anything but double
, certainly not to long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think fallback to double in SelectListExpressionExtractor.getItemFromExxpression() will work more often, Certainly expressions can output real values and if we use long the resulting VOTable will be invalid. Chosing long over double has to be the extra effort to improve the metadata quality when it is possible to infer that is safe/correct.
Now that I look at the complete function, is there also the possibility that an expression that's not a column or function results in a string (concat? anything else?) and that's not yet handled? I think that's why it was originally using string, but maybe that was overly cautious/lazy on my part :-)
There is a Concatenate extends BinaryExpression
in another package but I don't recall where that gets shows up exactly... I think it's something that can be injected in the select list after the SelectListExpressionExtractor so the internal SQL concatenates multiple values into one output column (eg replace a Column with a Concatenate and probably have a custom Format for output). The presence of that class makes me thing jsqlparser doesn't have a concatenate.
OK hopefully I've taken care of the points you mentioned. |
I'm still a little stumped by the concat operator - I've gone back to what you were doing with the default being string, and maybe that will catch that? I've tried doing some queries in the unit tests with "||" which I think is the concat operator but yeah, jsqlparser did not like that and didn't parse the query. I do see that concatenate operator class but I'm not sure how to get to it. Anyway, hopefully this is a bit better after some vacation to rest the ol' brain pan. |
I don't think that operator is valid SQL92 (hence ADQL), so that class can be used internally to replace something in the ADQL with a concatenate so the executed SQL has the I still wonder if there are other expressions that can result in a string rather than a number... but maybe this is good enough: we can just go with |
Well I am on line 384 of the SLEE defaulting back to string, so if there's another operator, and it's not one that I've already picked up above there in the if statements it should fall through to be a string. I can't think of any reason why a math expression would return a string, but I was thinking that maybe a function could return a string, like some databases I think have concat() But if someone used a function like that then they'd say it was a string output from that function, and we should be able to handle that no problem. |
We had some problems at LSST where if someone was doing math in a select statement, such as doing a column plus a constant, and this would end up coming out as a string representation of the number, which causes some problems in portals and other things with smart sorting for number types. This is a pretty rough stab, and isn't well tested, but I was wondering what you think.