Proposal: mild mode for the code parser #155

Open
luav opened this issue May 9, 2018 · 11 comments

@luav

luav commented May 9, 2018

Spellright works fine with text but becomes hardly usable for code (Python, C++, JS, ...) with docstrings, even in parser mode (where only comments and strings are spelled).

It makes sense to use a milder, less strict mode when code is parsed, so that most variables are omitted instead of being shown as errors. This could be done by specifying a lower bound on similarity for typos and treating words that differ too much from the dictionary as variables. It would be nice to be able to adjust this similarity value ∈ [0, 1] per project to control the trade-off between false alarms and precision.
A [pseudo] example of the implementation can be checked here.

@bartosz-antosik
Owner

Hi! I think I understand your sentiment, but there are a number of problems with the proposed approach.

First: Spell Right is supposed to spell "fenced" comments (e.g. docstrings) in e.g. Python and other code as well, see here:

image

If it does not work in some language, let me know. And BTW, the parser is already supposed to omit all the variables.

Second, the similarity measure. As you may have noticed, my extension uses native spelling APIs, which results in better spelling quality (e.g. short words, abbreviations, case and many other things are taken into consideration, whereas other spellers e.g. start spelling from three letters to speed things up). But this has its limitations, and one of them is that it does not allow adjusting spelling metrics like you propose. Of course there are approaches which could, as you propose, infer what is code and what is string (e.g. there is a nice Bayesian approach to this; I remember an extension which automatically chose the dictionary for a mail being written), but that seems a bit of an overshoot, especially when you cannot control anything below the API level.

The main problem with the solution that you postulate is that so far VSCode does not give extensions access to the document's syntax information. If I could use the parser that VSCode uses to colorize code to determine what each piece of text is (variable, keyword, identifier etc.), I could easily apply e.g. CamelCase or snake_case rules for spelling. It would not require a similarity measure approach; it could be just strict spelling. On the other hand, I am not able to provide parsers fine-grained enough to serve a multitude of code documents. Hence the comments/strings approach, known for a long time e.g. from Visual Studio extensions. The whole extension originated from VSCode's issue #20266, which points out that there is generally a problem with spelling in VSCode.
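
The CamelCase / snake_case rule mentioned above could look roughly like the following. This is a minimal sketch only: `KNOWN_WORDS` is a hypothetical stand-in for a real dictionary lookup, and the function names are invented for illustration, not taken from the extension.

```python
import re

# Hypothetical stand-in for a real dictionary lookup (assumption).
KNOWN_WORDS = {"task", "info", "ext", "get", "node", "count"}

# Matches acronym runs (HTTP in HTTPServer), capitalized or lowercase
# words, remaining uppercase runs, and digit runs.
_WORD_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier: str) -> list[str]:
    """Split a CamelCase or snake_case identifier into lowercase words."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(_WORD_RE.findall(chunk))
    return [p.lower() for p in parts]


def misspelled_parts(identifier: str, known=KNOWN_WORDS) -> list[str]:
    """Return the parts of an identifier that are not in the dictionary."""
    return [
        p for p in split_identifier(identifier)
        if not p.isdigit() and p not in known
    ]
```

With strict spelling applied per part, `misspelled_parts("TaskUnfoExt")` would flag only `unfo`, while `TaskInfoExt` passes, provided the dictionary knows `task`, `info` and `ext`.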

There is another speller which does spell code in a brute-force way. That is, it does not care about the syntax; it just spells everything and eliminates, in some way, as much as possible of what seems to be OK: keywords in separate dictionaries etc. I have simply decided on a different approach.

But I am not ruling anything out, especially if there were some document syntax support on VSCode's part.

@luav
Author

luav commented May 10, 2018

An example of hundreds of false alarms for variables in Python docstrings in a small project. Note that only the comments and strings are parsed (code parser mode) and I have already excluded all the CamelCase and snake_case variables, URLs, etc. via spellright.ignoreRegExps:
spellright_pythondocstringvars
Note: the selected multinode term is correctly treated as a spelling error but ncpunodes, tinfext, indstep, colsep and dozens of other variables are clear false alarms.

... but has its limitations one of them is that It does not allow to adjust spelling metrics like you propose

It seems that the outlined limitations might be easily solved in the following way.
Spellright shows a hint (the "Show Fixes (Ctrl + .)" yellow bulb) for each spelling correction, where the most probable / relevant corrections are suggested in a drop-down list. It is possible to evaluate the outlined similarity of the word being corrected against the top 3-5 suggestions; if the maximum similarity is lower than a custom threshold, then the word is most likely a variable, not a typo, and should not be shown as a typo. Such an extension of functionality operates on already available data and should not require much computation, since it operates only on single words (no Bayesian statistics are required).

Note that the proposed approach does not require access to the VSCode parser and operates only on the data produced by the plugin itself.
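
A rough sketch of the threshold idea, to make it concrete. It assumes the speller's suggestions are available as a plain list of strings; `difflib.SequenceMatcher` is used here only as one possible similarity measure in [0, 1], and the default threshold of 0.8, the function names and the suggestion lists in the usage note are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like_typo(word, suggestions, threshold=0.8, top_n=5):
    """Report a word only if some top suggestion is close enough to it.

    A word whose best suggestion is below the threshold is assumed to be
    a variable name rather than a misspelling and is not reported.
    """
    best = max(
        (similarity(word, s) for s in suggestions[:top_n]),
        default=0.0,  # no suggestions at all: treat as a variable
    )
    return best >= threshold
```

For example, with made-up suggestion lists, `looks_like_typo("multinode", ["multimode"])` is true (the words differ by one letter), while `looks_like_typo("tinfext", ["tinfoil", "tinned"])` is false, so `tinfext` would be silently treated as a variable.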

PS Thank you for providing Spellright, which is a very useful tool, and I believe it is possible to make it more convenient with little effort ;-)

@bartosz-antosik
Owner

bartosz-antosik commented May 10, 2018

Oh, gosh! I have finally understood what you desire! One well commented picture is worth more than ... never mind.

This threshold can easily be done, plus e.g. a condition that variables are lowercase. But! What I see suggests that it may be possible to search for divisions of the word to see if it is a compound, e.g. multinode would spell correctly as multi node... If this were limited to words suspected of being variables in the code parser, then it might not take too much time.
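
The search for divisions of a word could be sketched like this. It is an assumption-laden illustration: `known` stands for whatever dictionary lookup is available, only two-part splits are tried, and `min_part` is an invented parameter that avoids trivial one- or two-letter pieces.

```python
def split_compound(word, known, min_part=3):
    """Try to split a word into two dictionary words.

    Returns the (left, right) pair for the first split point where both
    halves are known words, or None if no such split exists.
    """
    word = word.lower()
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in known and right in known:
            return (left, right)
    return None
```

With `known = {"multi", "node"}`, `split_compound("multinode", known)` yields `("multi", "node")`, while a word like `tinfext` has no valid split and returns `None`. The loop is linear in the word length, which fits the remark that this should stay cheap if limited to suspected variables.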

@bartosz-antosik
Owner

I gave the thing another thought and I think I have found a much better way to do this. The speller should use the document's symbols! That is, I cannot use the parsers which are used by VSCode, but there is a way to get the symbols, exactly like the list that you get when you do a Go to Symbol (Ctrl+Shift+O on Windows). And these symbols are (should be) exactly what you have a problem with. Extending the speller to use these symbols as a dictionary seems reasonable, since they may appear especially in the comments.
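
To illustrate the idea of using symbols as an extra dictionary, here is a small sketch. It does not use VSCode's symbol list; Python's standard `ast` module serves as a stand-in symbol source for Python files, and both function names are made up for this example.

```python
import ast


def collect_symbols(source: str) -> set[str]:
    """Collect class, function, argument and assigned variable names."""
    symbols = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.add(node.name)
        elif isinstance(node, ast.arg):
            symbols.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            symbols.add(node.id)
    return symbols


def filter_flagged(flagged, symbols):
    """Drop flagged words that exactly match a symbol of the document."""
    return [w for w in flagged if w not in symbols]


SAMPLE = '''
class TaskInfoExt:
    def run(self, ncpunodes, colsep=" "):
        indstep = 2
        return indstep
'''

symbols = collect_symbols(SAMPLE)
remaining = filter_flagged(["ncpunodes", "multinode", "TaskUnfoExt"], symbols)
```

Here `ncpunodes` is dropped because it is an argument name in the document, while `multinode` and the misspelled `TaskUnfoExt` stay flagged, since neither matches a symbol exactly.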

Because of some considerations it was a complicated code modification and I would like to ask you to test the modified extension. You can download it from here:

https://drive.google.com/open?id=1IBI8JseAlrQNHGYQVV2O3B--TiSmJLFi

Just install it from file using Install from VSIX option.

@luav
Author

luav commented May 13, 2018

Hi @bartosz-antosik ,
I installed the update from the provided file and reloaded (restarted VSCode), but it has not helped:
spellright_pythondocstringvarsf1

@bartosz-antosik
Owner

Is the document you test it with available somewhere? Could I have a similar document to test things?

@luav
Author

luav commented May 13, 2018

I also tried uninstalling the extension and then installing it from the provided .vsix, which reduced the number of errors from 184 to 171 (helped a bit, but not much):
spellright_pythondocstringvarsf2

But now the synchronization between the reported errors and the actual lines of code is lost, and most of the errors still remain.
It might be easier for you to debug the plugin on my project directly from your environment.

@bartosz-antosik
Owner

bartosz-antosik commented May 13, 2018

Something is not OK here (e.g. switching documents sometimes has problems), but it seems that this approach does reduce the number of spelling errors (see the Args names below) and helps to detect spelling errors (TaskUnfoExt instead of TaskInfoExt):

image

There is a known problem with reading symbols from the document: it sometimes returns an empty list at first. I will investigate it further, but it seems that this is the way to go. Partially at least.

@bartosz-antosik
Owner

Another take, should be better (version 2.3.7):

https://drive.google.com/open?id=1_iOS4VgN0Njp8rfhd-jTbMXNt6GmbQ_S

Could you please install and tell me whether it eliminates document symbols correctly?

@luav
Author

luav commented May 14, 2018

Thanks, it works much better!
But there are still about a hundred false alarms for variable names, mainly in commented-out code and constructions like print('{afnbin} ...'.format(afnbin=...)), which could be solved by introducing the similarity threshold.
By the way, I identified the issue with lost synchronization of reported lines, see #159.

@bartosz-antosik
Owner

bartosz-antosik commented May 14, 2018

The latest release has the symbols spelling included. I will deliberate a bit on these other ideas of yours.
