Proposal: mild mode for the `code` parser #155

luav · 2018-05-09T23:22:46Z

Spellright works fine with text but becomes hardly usable for code (Python, C++, JS, ...) with docstrings, even in the `code` parser mode (where only comments and strings are spelled).

It makes sense to offer a milder, less strict mode for parsed code that omits most variables instead of showing them as errors. This could be done by specifying a lower bound on similarity for typos and treating words that differ too much from any dictionary word as variables. It would be nice to be able to adjust this similarity value ∈ [0, 1] per project to trade false alarms against precision.
A [pseudo] example of the implementation can be checked here.

bartosz-antosik · 2018-05-10T08:44:47Z

Hi! I think I understand your sentiment, but there are a number of problems with the proposed approach.

First: Spell Right is supposed to spell "fenced" comments (e.g. docstrings) in Python and other code as well, see here:

If it does not work in some language, let me know. And BTW, the parser is already supposed to omit all the variables.

Second, the similarity measure. As you may have noticed, my extension uses native spelling APIs, which results in better spelling quality (e.g. short words, abbreviations, case and many other things are taken into consideration, whereas other spellers e.g. start spelling from three letters to speed things up). But this has its limitations; one of them is that it does not allow adjusting spelling metrics as you propose. Of course there are approaches which could, as you propose, infer what is code and what is a string (e.g. there is a nice Bayesian approach to this; I remember an extension which automatically chose the dictionary for a mail being written), but that seems a bit of an overshoot, especially when you cannot control anything below the API level.

The main problem with the solution that you postulate is that so far VSCode does not give extensions access to the document's syntax information. If I could use the parser VSCode uses to colorize code to determine what text is what (variable, keyword, identifier etc.), I could easily apply e.g. CamelCase or snake_case rules for spelling. It would not require a similarity measure; it could be just strict spelling. On the other hand, I am not able to provide parsers fine-grained enough to serve a multitude of code documents. Hence the comments/strings approach, long known e.g. from Visual Studio extensions. The whole extension originated from VSCode's issue #20266, which points out that there is generally a problem with spelling in VSCode.
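As an illustration of the CamelCase / snake_case rule mentioned above, here is a minimal Python sketch that splits an identifier into plain words which could then be spelled strictly. The function name `split_identifier` and the regular expression are assumptions made for this example; they are not taken from the Spell Right source.

```python
import re

# Hypothetical splitting rule: runs of capitals (acronyms), capitalized or
# lowercase words, and digit runs. Underscores act as separators.
_PART_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name):
    """Split a CamelCase or snake_case identifier into lowercase parts."""
    return [part.lower() for part in _PART_RE.findall(name)]


print(split_identifier("TaskInfoExt"))  # ['task', 'info', 'ext']
print(split_identifier("snake_case"))   # ['snake', 'case']
```

Each returned part would then go through the regular speller, so no similarity measure is needed for identifiers that follow one of these conventions.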

There is another speller which spells code in a brute-force way. That is, it does not care about the syntax; it just spells everything and eliminates, in some way, as much as possible of what seems to be OK: keywords in separate dictionaries etc. I have simply decided on a different approach.

But I am not ruling anything out, especially if there were some document syntax support on VSCode's part.

luav · 2018-05-10T13:16:17Z

Here is an example of hundreds of false alarms for variables in Python docstrings in a small project. Note that only the comments and strings are parsed (`code` parser mode) and I have already excluded all CamelCase and snake_case variables, URLs, etc. via `spellright.ignoreRegExps`:

Note: the selected term multinode is correctly treated as a spelling error, but ncpunodes, tinfext, indstep, colsep and dozens of other variables are clear false alarms.

> ... but has its limitations one of them is that It does not allow to adjust spelling metrics like you propose

It seems that the outlined limitation might be easily solvable in the following way.
Spellright shows a hint (the yellow "Show Fixes (Ctrl + .)" bulb) for each spelling error, with the most probable / relevant corrections suggested in a drop-down list. It is possible to evaluate the similarity of the flagged word to the top 3-5 suggestions; if the maximum similarity is lower than a custom threshold, then the word is most likely a variable, not a typo, and should not be shown as a typo. Such an extension operates on already available data and should not require much computation, since it works only on single words (no Bayesian statistics is required).

Note that the proposed approach does not require access to the VSCode parser and operates only on data produced by the plugin itself.
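A rough Python sketch of this idea follows. It assumes the suggestion list is already available from the speller, and it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure in [0, 1]; the names `similarity` and `looks_like_typo`, the default threshold of 0.8 and the choice of measure are all assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like_typo(word: str, suggestions: Sequence[str],
                    threshold: float = 0.8, top_n: int = 5) -> bool:
    """Keep the error only if a top suggestion is close enough to the word.

    Words whose best suggestion falls below the threshold are assumed to be
    variables and would not be reported.
    """
    best = max((similarity(word, s) for s in suggestions[:top_n]), default=0.0)
    return best >= threshold


# "treshold" is one letter away from its suggestion, so it stays an error.
print(looks_like_typo("treshold", ["threshold", "thresholds"]))  # True
# "tinfext" resembles none of the (made-up) suggestions, so it is skipped.
print(looks_like_typo("tinfext", ["tinfoil", "context"]))        # False
```

Raising the threshold hides more words (fewer false alarms, more missed typos); lowering it does the opposite, which is the per-project trade-off described in the proposal.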

PS Thank you for providing Spellright, which is a very useful tool, and I believe it is possible to make it more convenient with little effort ;-)

bartosz-antosik · 2018-05-10T15:19:46Z

Oh, gosh! I have finally understood what you desire! One well commented picture is worth more than ... never mind.

This threshold can easily be done. Plus e.g. a condition that variables are lowercase. But! What I see suggests that it may be possible to search for divisions of the word to see if it is a compound, e.g. multinode would spell correctly as multi node... If this were limited to words suspected of being variables in the code parser, then it might not take too much time.
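The compound check could look roughly like the Python sketch below. The `is_known` callback stands in for a lookup in the native speller, and the tiny `KNOWN` set is a made-up dictionary for the example; `split_compound` and `min_part` are hypothetical names as well.

```python
def split_compound(word, is_known, min_part=3):
    """Return the first (left, right) split where both halves are known words.

    Returns None when no such split exists. Halves shorter than min_part
    characters are not tried, to avoid accidental matches on short fragments.
    """
    lower = word.lower()
    for i in range(min_part, len(lower) - min_part + 1):
        left, right = lower[:i], lower[i:]
        if is_known(left) and is_known(right):
            return left, right
    return None


# Toy dictionary, assumed for illustration only.
KNOWN = {"multi", "node", "index", "step"}

print(split_compound("multinode", KNOWN.__contains__))  # ('multi', 'node')
print(split_compound("indstep", KNOWN.__contains__))    # None
```

A word of length n needs at most n - 1 pairs of lookups, so restricting the check to words already suspected of being variables keeps the extra cost small.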

bartosz-antosik · 2018-05-12T22:00:43Z

I gave the thing another thought and I think I have found a much better way to do this. The speller should use the document's symbols! That is, I cannot use the parsers used by VSCode, but there is a way to get the symbols, exactly like the list you get when you do Go to Symbol (Ctrl+Shift+O on Windows). And these symbols are (should be) exactly what you have a problem with. Extending the speller to use these symbols as a dictionary seems reasonable, since they may appear especially in comments.
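To make the idea concrete, here is a small Python sketch that treats the document's symbol names as an extra dictionary. How the symbol list is obtained from VSCode is left out; `symbols`, `is_known` and `filter_with_symbols` are hypothetical names, and the real extension may work differently.

```python
import re

# Identifier-like words: letters, digits and underscores.
WORD_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def filter_with_symbols(text, symbols, is_known):
    """Return words from text that are neither dictionary words nor symbols."""
    symbol_set = set(symbols)
    flagged = []
    for word in WORD_RE.findall(text):
        if word in symbol_set or is_known(word.lower()):
            continue  # an exact symbol name or a correctly spelled word
        flagged.append(word)
    return flagged


# Assumed inputs: symbols from the document, plus a toy dictionary.
symbols = ["TaskInfoExt", "ncpunodes"]
known = {"pass", "with", "set"}

print(filter_with_symbols("Pass TaskUnfoExt with ncpunodes set",
                          symbols, known.__contains__))
# ['TaskUnfoExt']
```

Because symbols are matched exactly, a misspelled symbol name in a comment is still reported, while correctly written symbol names are no longer flagged.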

Because of some considerations it was a complicated code modification and I would like to ask you to test the modified extension. You can download it from here:

https://drive.google.com/open?id=1IBI8JseAlrQNHGYQVV2O3B--TiSmJLFi

Just install it from file using the Install from VSIX option.

luav · 2018-05-13T15:42:24Z

Hi @bartosz-antosik,
I installed the update from the provided file and reloaded (restarted VSCode), but it has not helped:

bartosz-antosik · 2018-05-13T15:54:39Z

Is the document you test it with available somewhere? Could I have a similar document to test things?

luav · 2018-05-13T15:59:54Z

I also tried uninstalling the extension and then installing it from the provided .vsix, which reduced the number of errors from 184 to 171 (helped a bit, but not much):

But now the synchronization between the reported errors and the actual lines of code is lost, and most of the errors still remain.
It might be easier for you to debug the plugin on my project directly from your environment.

bartosz-antosik · 2018-05-13T16:16:03Z

Something is not OK here (e.g. switching documents sometimes has problems), but it seems that this approach does reduce the number of spelling errors (see the Args names below) and helps to detect spelling errors (TaskUnfoExt instead of TaskInfoExt):

There is a known problem with reading symbols from the document: it sometimes returns an empty list at first. I will investigate it further, but it seems that this is the way to go. Partially at least.

bartosz-antosik · 2018-05-14T11:31:55Z

Another take, should be better (version 2.3.7):

https://drive.google.com/open?id=1_iOS4VgN0Njp8rfhd-jTbMXNt6GmbQ_S

Could you please install and tell me whether it eliminates document symbols correctly?

luav · 2018-05-14T12:31:56Z

Thanks, it works much better!
But there are still about a hundred false alarms for variable names, mainly in commented-out code and in constructions like print('{afnbin} ...'.format(afnbin=...)), which could be solved by introducing the similarity threshold.
By the way, I identified the issue with lost synchronization of reported lines, see #159.

bartosz-antosik · 2018-05-14T17:18:00Z

The last release has the symbols spelling included. I will deliberate a bit on these other ideas of yours.

bartosz-antosik added the enhancement label May 10, 2018

bartosz-antosik added a commit that referenced this issue May 14, 2018

Document sumbols used for spelling (#155)

4080c68
