Speed up git #1162

Closed
michielbdejong opened this issue Sep 10, 2024 · 5 comments · May be fixed by OpenTermsArchive/docs#142

Comments

@michielbdejong

I ran some timing tests on a server with 4 GB of memory, and it's way too slow.
When a document declaration has just been added, it takes about 9 seconds to crawl it (not ideal but acceptable; this includes launching and stopping the headless browser).
But when a document has a slightly longer history, such as Musi, the run gets stuck in git log processes:

[Screenshot, 2024-09-10 11:02:37]

You can also see that running git log on the Musi file by itself already takes a long time:

crawler@ota-tosdr-ubuntu-20-04:~/engine/data/snapshots$ time git log Musi/Terms\ of\ Service.html
commit 83421d340c68f8d691713ae0e5e8856f3965b87d
Author: Open Terms Archive Bot <[email protected]>
Date:   Wed Jun 19 05:16:50 2024 +0000

    First record of Musi Terms of Service

real	0m33.702s
user	0m33.239s
sys	0m0.458s

I'll investigate ways to speed up git log. If that fails, I'll investigate whether we can move away from git.

@michielbdejong

Ah, thanks to Stack Overflow I was able to reduce these 33 seconds to 0.5 seconds! I will create a PR to the OTA docs about this.
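The comment above does not name the technique that was used. As an illustration only, here is one commonly documented way to speed up path-limited git log: writing a commit-graph file with changed-path Bloom filters. Whether this matches the fix applied here is an assumption, and the scratch repository, file names, and identity below are hypothetical. The --changed-paths option needs a reasonably recent git (2.27 or later).

```shell
#!/bin/sh
set -eu

# Hypothetical scratch repository, NOT the real data/snapshots directory.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git config commit.gpgsign false

# Six commits: odd ones touch terms.html, even ones touch other.html.
i=1
while [ "$i" -le 6 ]; do
  if [ $((i % 2)) -eq 1 ]; then
    echo "revision $i" > terms.html
    git add terms.html
  else
    echo "revision $i" > other.html
    git add other.html
  fi
  git commit -q -m "Record $i"
  i=$((i + 1))
done

# ASSUMPTION: this may or may not be the fix referred to in the comment.
# It precomputes commit metadata plus per-commit changed-path filters,
# which lets path-limited history walks skip commits that cannot match.
git commit-graph write --reachable --changed-paths

graph_file=.git/objects/info/commit-graph
path_commits=$(git log --oneline -- terms.html | wc -l | tr -d ' ')
echo "commits touching terms.html: $path_commits"
```

On a large repository, comparing time git log -- <path> before and after the write step would show whether it helps in a given setup.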

@michielbdejong changed the title from "OTA with git isn't working for us" to "Speed up git" on Sep 10, 2024
@michielbdejong added a commit to michielbdejong/docs-2 that referenced this issue on Sep 10, 2024

@michielbdejong

For our crawler there might be other quick fixes, such as truncating the git history after git push?

@michielbdejong

Indeed: if I run git rev-parse HEAD~5 > .git/shallow in both data/versions and data/snapshots, then the run time of npx ota track --services Musi drops from 5 minutes to 13 seconds.
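A minimal sketch of what that command does, run against a throwaway repository rather than the real data/versions or data/snapshots. The scratch directory, file name, and commit count are hypothetical and only serve to make the effect visible.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch repository, NOT the real OTA data directories.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git config commit.gpgsign false

# Build a linear history of ten commits on a single file.
i=1
while [ "$i" -le 10 ]; do
  echo "revision $i" > terms.html
  git add terms.html
  git commit -q -m "Record $i"
  i=$((i + 1))
done

full_count=$(git rev-list --count HEAD)

# Same command as in the comment: commits listed in .git/shallow are
# treated as having no parents, so history walks stop at HEAD~5.
git rev-parse HEAD~5 > .git/shallow

shallow_count=$(git rev-list --count HEAD)
echo "before: $full_count, after: $shallow_count"
```

With ten commits, HEAD through HEAD~5 stay visible, so this should print before: 10, after: 6. The older objects are still in the object database; only the history traversal is cut short. Deleting .git/shallow restores the full view in this sketch.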

@michielbdejong

It does then show a grafted commit in the log, and I'm not sure whether that's going to mess things up. I'll try running npx ota track on the shallow repos and see how that goes!
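For anyone wanting to check this state on their own checkout, here is a small sketch that shows how git reports a shallow repository and where the grafted marker appears. It again uses a hypothetical throwaway repository with made-up file names.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch repository for illustration only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git config commit.gpgsign false

i=1
while [ "$i" -le 3 ]; do
  echo "revision $i" > terms.html
  git add terms.html
  git commit -q -m "Record $i"
  i=$((i + 1))
done

# Prints "false" for a repository with full history.
before=$(git rev-parse --is-shallow-repository)

# Cut the visible history at the parent of HEAD.
git rev-parse HEAD~1 > .git/shallow

# Prints "true" once .git/shallow lists a boundary commit.
after=$(git rev-parse --is-shallow-repository)
echo "shallow before: $before, after: $after"

# The boundary commit is decorated with "grafted" in the log.
git log --oneline --decorate
grafted_lines=$(git log --oneline --decorate | grep -c grafted || true)
```

The grafted decoration only marks where the visible history ends; it does not change the content of the commit itself.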

@michielbdejong

Done
