An experiment in tracking and diffing versions of modern Microsoft Office files in Git.
Modern Office file formats are ZIP archives with XML files in them. The ZIP archives are binary files, so Git (and furthermore GitHub and GitLab, where the diff cannot be tweaked) won't display a nice diff for them. The XML files are not binary, so in order to display a diff for these, this project unpacks the ZIP files to directories that are tracked in Git. Tracking generated files is pretty dumb, but so is tracking binary files, and when forced to have one, it's not a leap to have the other as well if it brings something useful to the table.
This is achieved using a PowerShell script which unpacks the ZIP file to a tracked directory, formats the XML files for nicer diffs, and tracks the formatted files as well.
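The unpack-and-format step can be illustrated with a short sketch. It is written in Python purely for illustration and is not the project's PowerShell script; the function name `unpack_and_format`, the choice to format files in place, and the set of extensions treated as XML are all assumptions.

```python
import zipfile
from pathlib import Path
from xml.dom import minidom

# Assumption: parts with these extensions are XML and worth pretty-printing.
XML_SUFFIXES = {".xml", ".rels"}


def unpack_and_format(office_file: Path, out_dir: Path) -> list[Path]:
    """Extract an Office file (a ZIP archive) and pretty-print its XML parts.

    Returns the paths of the files that were reformatted.
    """
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()

    formatted = []
    for name in names:
        path = out_dir / name
        if not path.is_file() or path.suffix.lower() not in XML_SUFFIXES:
            continue  # directories and binary parts (images etc.) stay as-is
        dom = minidom.parseString(path.read_bytes())
        # One element per line makes line-based diffs far more readable.
        path.write_bytes(dom.toprettyxml(indent="  ", encoding="utf-8"))
        formatted.append(path)
    return formatted
```

Office files typically store each XML part on a single long line, so without the formatting step any edit would show up as one changed line per part.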
Looking for OpenOffice format support? Check out Tim Wiel's version.
Examples:
The XML diff captures the exact change whereas the TXT diff captures text-only change for quick content inspection.
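To make the lossy TXT view concrete, here is a minimal Python sketch that keeps only the text nodes of an XML part, one per line. It is an illustration, not the project's PowerShell implementation; the function name `xml_to_text` and the one-node-per-line layout are assumptions.

```python
from xml.dom import minidom


def xml_to_text(xml_source: bytes) -> str:
    """Return the non-empty text nodes of an XML document, one per line."""
    lines = []

    def walk(node):
        if node.nodeType == node.TEXT_NODE:
            text = node.data.strip()
            if text:
                lines.append(text)
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_source))
    return "\n".join(lines)


# A made-up fragment shaped like a Word paragraph (namespace URI is a placeholder).
sample = (
    b'<w:p xmlns:w="urn:example">'
    b"<w:r><w:t>Hello</w:t></w:r><w:r><w:t>world</w:t></w:r>"
    b"</w:p>"
)
print(xml_to_text(sample))
```

All markup, attributes and formatting runs are discarded, which is why this view is quick to read but cannot show structural changes.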
Features:
- Every Office file (DOCX, XLSX, PPTX) has a complementary .git directory with XML and TXT files for diffing
- Formatting XML files for nicer diffing
- Generating TXT files from just text nodes for lossy text-only diffing
- Warning in extracted and generated content that the data is read-only
- Skipping processing unchanged files for fast operation even in repos with many Office files
- Removing associated generated content automatically for Office files that have been removed from the repo
- Ability to run as a Git hook for worry-free tracking
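Two of the features above, skipping unchanged files and cleaning up after removed ones, can be sketched as follows. This is a Python illustration under stated assumptions, not the project's PowerShell code: the README does not say how changes are detected, so the SHA-256 stamp file (`.source-sha256`) is hypothetical, as is the `<file name>.git` naming of the generated directory and every function name.

```python
import hashlib
import shutil
from pathlib import Path

OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx"}
STAMP_NAME = ".source-sha256"  # hypothetical marker file, not from the project


def generated_dir_for(office_file: Path) -> Path:
    # Assumption: report.docx is unpacked into a sibling report.docx.git directory.
    return office_file.with_name(office_file.name + ".git")


def _digest(office_file: Path) -> str:
    return hashlib.sha256(office_file.read_bytes()).hexdigest()


def needs_processing(office_file: Path) -> bool:
    """True if the file has no stamp yet or its content changed since the stamp."""
    stamp = generated_dir_for(office_file) / STAMP_NAME
    return not stamp.is_file() or stamp.read_text().strip() != _digest(office_file)


def mark_processed(office_file: Path) -> None:
    """Record the current content hash next to the generated files."""
    out_dir = generated_dir_for(office_file)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / STAMP_NAME).write_text(_digest(office_file))


def remove_orphans(repo_root: Path) -> list[Path]:
    """Delete generated directories whose Office file no longer exists."""
    # Collect first so the tree is not modified while it is being walked.
    candidates = sorted(
        d for d in repo_root.rglob("*.git") if d.is_dir() and d.name != ".git"
    )
    removed = []
    for directory in candidates:
        source = directory.with_name(directory.name[: -len(".git")])
        is_office = source.suffix.lower() in OFFICE_SUFFIXES
        if is_office and not source.exists() and directory.exists():
            shutil.rmtree(directory)
            removed.append(directory)
    return removed
```

Hashing content rather than comparing timestamps is one possible design choice: it survives a fresh clone or checkout, where modification times carry no information.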
Limitations:
- Stores compressed and uncompressed versions in Git - by design, for plain text diffing and binary source of truth
- No support for DOC, XLS and PPT, only XLSX, DOCX and PPTX (XML based formats) - by design, no use diffing binary formats
- Risk of getting generated files out of sync if hook is not run or a manual edit is made to the generated files
- Won't process files uploaded to repository through GitHub/GitLab online UI (no pre-commit hook)
Support:
- Windows: 10.0.16299+ (cmd /c ver)
- Ubuntu: 16.0.0+ (lsb_release -r)
Run PowerShell scripts using the VS Code PowerShell Integrated Console to avoid security blocks. Open it by clicking on any .ps1 file with the integrated terminal open, or by running the PowerShell: Show Integrated Console VS Code command (F1 + (p + s + c + i)).
- Run cmd/version-office-files.ps1 from the command line
- Run cmd/edit-in-powrshell-ise.ps1 to open in PowerShell ISE (Integrated Scripting Environment)
- Add a Git pre-commit hook:
  cp .git/hooks/pre-commit.sample .git/hooks/pre-commit
  code .git/hooks/pre-commit
- Paste the contents of hook/pre-commit.sh into .git/hooks/pre-commit.
Observe commit diffs to see Office file changes in the XML and TXT files.
Run cmd/run-tests.ps1, which will run the NodeJS tests in test/ (prerequisites).
In this repository, the tests run together with the main script in a pre-commit hook in order to catch any bugs as soon as possible during development. When using this script as a tool in a repository other than this one, only the main script would be run, as shown in the Git pre-commit hook setup code.
Use WSL (Ubuntu) to test portability of the PowerShell script. Use lsb_release -a to find the WSL Ubuntu version and follow the PowerShell Linux installation instructions.
Of course the basic web editor UI still won't…
The privileges security thing currently makes committing through VS Code fail.
Use hook/pre-commit-development.sh when contributing to this repository to also run tests.
Derived works based on this project:
- Tim Wiel (2020): derived work for OpenOffice document formats
Some notable prior art:
All of these focus on generating text-only versions of the files on demand (non-tracked) and do not capture structural changes. This project aims to explore the other route, potentially less useful but nonetheless interesting, of versioning both the compressed and the uncompressed forms of a file in parallel. See the Features and Limitations sections for pros and cons.