This repository has been archived by the owner on Aug 5, 2024. It is now read-only.

Ignore Whitespace in DiffMatchPatch #133

Open
deepanshukhanna opened this issue Jan 16, 2023 · 3 comments

Comments

@deepanshukhanna

Currently there is no way to ignore diffs that consist only of whitespace or formatting changes in a code block. Could a feature be added that, when enabled, ignores changes caused by whitespace alone?

@dmsnell

dmsnell commented Jan 16, 2023

@deepanshukhanna this library creates a diff patch to get from one version of a document to another; ignoring whitespace or formatting would break that output, as the diffs it produces could no longer be used to move between the two versions.

you might want to look into treating diff-match-patch as a lower-level library and handling those aesthetic preferences in your own application or in a language-specific diffing utility. it's pretty hard in the general sense to say that something is actually whitespace, formatting, or truly relevant for a diff. for example, a whitespace change in human language may split paragraphs; a newline in a Python document may introduce another code block or continue an existing one depending on how much whitespace follows the newline; in some languages the addition or removal of whitespace may trigger different syntax parsing.

@anmol-rana-ar

anmol-rana-ar commented Jan 17, 2023

@dmsnell We are using diff-match-patch as a library, and we have written some extensions to adapt it to our requirements.
Ignoring whitespace changes is challenging for us because diff-match-patch performs diffing on unicodes so we can't modify that. Removing whitespace shifts the locations of changes, and workarounds to handle that would add complexity, unnecessary overhead, and repeated computation.

Motivation behind this ask: Most editors and git provide a configuration option to ignore whitespace between versions of code. We rely on diff-match-patch to get diffs between versions of code. Without an option to ignore whitespace changes, our diffs include whitespace changes as well. When doing critical computations on newly introduced changes, it doesn't make sense to run them on old code that is picked up only because some whitespace was introduced.
It would be a great feature if diff-match-patch could produce output that is aligned with these editors based on some ignore_Whitespace configuration.

"it's pretty hard in the general sense to say that something is actually whitespace, formatting, or truly relevant for a diff"

Removing whitespaces from start and end of line before changing it to unicodes might be helpful and give expected diffs here while ignoring whitespaces. But yes, I believe the same unicodes are converted back to lines before the diffs are returned, so the removed spaces would have to be added back to those lines somehow.
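
One way to picture the "strip, compare, then add the spaces back" idea is to compare lines by a stripped key while reporting the untouched original lines. The sketch below is only an illustration: it uses Python's standard `difflib` as a stand-in for diff-match-patch's line mode, and the function name `diff_ignoring_edge_whitespace` is hypothetical, not part of either library.

```python
import difflib

def diff_ignoring_edge_whitespace(old_text, new_text):
    """Line diff that matches lines by their stripped form but reports the originals."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # Comparison keys: the same lines without leading or trailing whitespace.
    old_keys = [line.strip() for line in old_lines]
    new_keys = [line.strip() for line in new_lines]
    matcher = difflib.SequenceMatcher(a=old_keys, b=new_keys, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Report the new version's text, so its own whitespace is kept.
            result += [("=", line) for line in new_lines[j1:j2]]
        else:
            result += [("-", line) for line in old_lines[i1:i2]]
            result += [("+", line) for line in new_lines[j1:j2]]
    return result

old = "def f():\n  return 1\n"
new = "def f():\n    return 1\n    # done\n"
result = diff_ignoring_edge_whitespace(old, new)
```

Here the re-indented `return 1` line is treated as unchanged and only the added comment line is reported. Because the keys discard indentation, this approach shares the limitation raised elsewhere in the thread for indentation-sensitive languages.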

@dmsnell

dmsnell commented Jan 17, 2023

> Removing whitespaces from start and end of line before changing it to unicodes might be helpful and give expected diffs here while ignoring whitespaces.

my point is that in some languages this gives a diff that hides relevant semantic changes in the code.

if __DEBUG__:
	print_sys_stats()
	print_all_the_secrets()

to

if __DEBUG__:
	print_sys_stats()
print_all_the_secrets()

is a fairly massive change but hidden if we assume whitespace is irrelevant.
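
A small check makes the point concrete. This is a sketch using only plain Python string handling; the helper name `strip_lines` is made up for the illustration.

```python
before = "if __DEBUG__:\n\tprint_sys_stats()\n\tprint_all_the_secrets()\n"
after = "if __DEBUG__:\n\tprint_sys_stats()\nprint_all_the_secrets()\n"

def strip_lines(text):
    """Drop leading and trailing whitespace from every line."""
    return [line.strip() for line in text.splitlines()]

# The two programs differ, and behave differently when __DEBUG__ is false...
exact_equal = before == after
# ...but a comparison that ignores edge whitespace sees them as identical.
whitespace_blind_equal = strip_lines(before) == strip_lines(after)
```

With whitespace stripped, the two versions compare equal, so a whitespace-blind diff would report no change at all.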


to restate, I don't think diff-match-patch is where you want to hide whitespace changes. you might have much better luck if you process diff-match-patch's output to hide whitespace changes you decide aren't important or use a diffing algorithm that works on the code structure.
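
As one illustration of comparing code structure rather than text, Python's standard `ast` module can compare two sources by their parse trees. This is a sketch for Python sources only and is unrelated to diff-match-patch itself; the helper name `same_structure` is hypothetical.

```python
import ast

def same_structure(source_a, source_b):
    """Return True when both sources parse to the same syntax tree."""
    # ast.dump() leaves out line and column attributes by default,
    # so layout-only edits do not affect the comparison.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))

# Spacing around an operator does not change the tree.
formatting_only = same_structure("x = 1+2\n", "x = 1 + 2\n")

# Dedenting a statement out of the `if` block does change the tree.
before = "if __DEBUG__:\n\tprint_sys_stats()\n\tprint_all_the_secrets()\n"
after = "if __DEBUG__:\n\tprint_sys_stats()\nprint_all_the_secrets()\n"
dedent_detected = not same_structure(before, after)
```

A comparison like this ignores cosmetic spacing yet still flags the dedent from the example above. It also ignores comments, which may or may not be what an application wants.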

or you could pre-process your files and remove the whitespace, then create a diff.
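
A rough sketch of that pre-processing route follows. The normalization policy (keep indentation, drop trailing whitespace, collapse inner runs of spaces and tabs) is an assumption chosen for the example, `normalize` is a hypothetical helper, and `difflib` from the standard library stands in for diff-match-patch so the snippet runs on its own.

```python
import difflib
import re

def normalize(text):
    """Keep leading indentation, drop trailing whitespace, collapse inner runs."""
    lines = []
    for line in text.splitlines():
        indent = line[: len(line) - len(line.lstrip())]
        body = re.sub(r"[ \t]+", " ", line.strip())
        lines.append(indent + body)
    return lines

old = "total = a  +  b   \nprint(total)\n"
new = "total = a + b\nprint( total )\n"

old_lines, new_lines = normalize(old), normalize(new)
matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
# Keep only the regions that still differ after normalization.
changes = [
    (tag, old_lines[i1:i2], new_lines[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
```

Only the `print` line is reported, since its spacing change is not covered by this particular policy. Collapsing inner whitespace would also alter string literals, so a real policy needs to be chosen per language.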

either way I think looking to this library to ignore whitespace might be an uphill battle compared to addressing it at your application level.


> diff-match-patch performs diffing on unicodes so we can't modify that

something may be lost in the comment here because I'm not understanding what you are talking about. note that you can get structural diff output which contains a list of diffing operations and examine that to look for those operations which only add or remove whitespace.
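
Examining the operation list could look roughly like the sketch below. It assumes the shape used by the Python port, where a diff is a list of `(op, text)` tuples with `-1` for delete, `0` for equal and `1` for insert. The list here is written by hand so the snippet runs without the library installed, and `significant_changes` is a hypothetical helper name.

```python
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1

def significant_changes(diffs):
    """Keep inserts and deletes whose text is more than whitespace."""
    return [
        (op, text)
        for op, text in diffs
        if op != DIFF_EQUAL and text.strip()
    ]

# Hand-written stand-in for the list a diff call would return.
diffs = [
    (DIFF_EQUAL, "x = 1"),
    (DIFF_INSERT, "  "),
    (DIFF_EQUAL, "\ny = "),
    (DIFF_DELETE, "2"),
    (DIFF_INSERT, "3"),
    (DIFF_EQUAL, "\n"),
]

significant = significant_changes(diffs)
```

The original diff stays intact and can still be used to build patches; only the application's view of it is filtered, and the application decides which whitespace-only operations are safe to hide.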

diff-match-patch doesn't operate on Unicode code points, or Unicode anything apart from when converting a diff to a patch string when it has to convert to UTF-8. each language operates on different atomic units, but largely that's either diffing bytes or diffing code units in whatever size is native to the language (e.g. UTF-16 with JavaScript).
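
To show why the unit matters, the snippet below measures one short string three ways. It is only an illustration of the units themselves and says nothing about which unit a particular port uses.

```python
# The letter "a" followed by an emoji outside the Basic Multilingual Plane.
text = "a\U0001F600"

code_points = len(text)                           # Python str counts code points
utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_bytes = len(text.encode("utf-8"))            # raw byte count
```

The same text has a length of 2, 3 or 5 depending on the unit, so offsets taken from a diff are only meaningful in the unit of the implementation that produced them.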

@deepanshukhanna deepanshukhanna changed the title Ignore Whitespace in DiffMatchPath Ignore Whitespace in DiffMatchPatch Jan 19, 2023