Viewing log in directories with thousands of files does not work, ends with Apache "504 Gateway Timeout" #148

Open
loeschers opened this issue Jul 26, 2021 · 16 comments

Comments

@loeschers

Hi!

Thanks for the great work!
I use WebSVN 2.6.1 and most things are working perfectly.

But when I want to view the log for a file that is located in a directory with thousands of other files, the request ends with the Apache error "504 Gateway Timeout".
I did some debugging and found that WebSVN executes a huge number of commands like
svnauthz accessof --repository 'xxxxxx' --path '/trunk/myfiles/somefile.txt' --username '***' '/etc/httpd/conf.d/svn_rep_access'
This is done at least three times for each file in this directory.
With ~10,000 files in a directory, this leads to ~30,000 svnauthz executions, which (IMO) causes the timeout.

My basic question: Why is svnauthz executed for each file in that directory?
To my understanding, it should be enough to execute it only for this single file, for which the log was requested.

Is there a way to fix this?

BR,
Stephan.

@michael-o
Member

Although 30,000 calls looks like inefficient code, svnauthz is likely unavoidable because every single file could be subject to authz. How fast is the default SVN view for this directory with 10,000 files?

@ams-tschoening
Contributor

My basic question: Why is svnauthz executed for each file in that directory?
To my understanding, it should be enough to execute it only for this single file, for which the log was requested.

You should provide the exact link you are executing and for which you have seen all those svnauthz calls. It's not clear to me currently if you e.g. simply browse to some directory, see all of those calls and then click on the log-link and don't see those calls anymore or if you see them AGAIN after clicking the log-link only. Having a quick look at the log-code, it seems access is only checked per repo there instead of per file. OTOH, when browsing a directory, like @michael-o said, checking access per file seems the correct thing to do.

Performance problems like these have been discussed in the past already, but from our point of view, using svnauthz instead of a custom access check like in the past is the correct thing to do. The custom code in the past had errors as well. There's a cache for results of access checks available, though it's only short lived currently. You might consider enhancing that to a longer period of time in combination with something like FastCGI.

8821af1#diff-5b6bdb07f82e491a5daf1f78b8afae0e1d995fa67a06f957e467a29460f404b1R100
#78 (comment)
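
The caching idea can be illustrated with a minimal sketch. It assumes a hypothetical wrapper class `AccessCheckCache` (not WebSVN's actual cache) around any callable that performs the expensive check, and it only remembers results for the lifetime of the PHP process. The stand-in check below just counts its own invocations.

```php
<?php
// Hypothetical wrapper, not WebSVN's actual cache: remembers the result of
// an expensive per-path access check for the lifetime of the PHP process.
final class AccessCheckCache
{
    /** @var callable */
    private $check;

    /** @var array<string, bool> */
    private $results = [];

    public function __construct(callable $check)
    {
        $this->check = $check;
    }

    public function hasAccess(string $path): bool
    {
        if (!array_key_exists($path, $this->results)) {
            // Only the first lookup for a path pays for the external call.
            $this->results[$path] = (bool) ($this->check)($path);
        }

        return $this->results[$path];
    }
}

// Stand-in for the real check; it only counts how often it is called.
$calls = 0;
$cache = new AccessCheckCache(function (string $path) use (&$calls): bool {
    $calls++;
    return true;
});

foreach (['/trunk/a.txt', '/trunk/b.txt', '/trunk/a.txt', '/trunk/a.txt'] as $path) {
    $cache->hasAccess($path);
}
echo $calls, "\n"; // prints 2: one call per distinct path
```

A real cache would also need the repository and the user name in the key, and it would have to live outside the request (for example in a shared store) to outlast a single page view.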

@loeschers
Author

How fast is the default SVN view for this directory with 10 000 files?

Entering this directory takes about 3 seconds for viewing. That's OK.

@loeschers
Author

I start e.g. with the URL https://myserver.my.domain/websvn/browse/myrepo/trunk/dir_with_thousand_files/
Opening this URL takes about 3 seconds, which is OK.
Then I click on "Log"-link for one file.
URL is https://myserver.my.domain/websvn/browse/myrepo/trunk/dir_with_thousand_files/myfile.txt?op=log&
This ends with the 504 error.
To find out what is taking so long, I added some debug output to the function runCommand() in the file /usr/share/websvn/include/command.php, which shows this large number of svnauthz commands.
I fully understand that when browsing a directory, every file needs an auth check.
But when I only want to see the log for a single file, I think it should be enough to check auth for this single file only, not for all files in this directory.
Should I add some debug code to help you find the source of these svnauthz calls?

@michael-o
Member

Yes, please. This should be better. I wonder whether we could use Redis somehow for this. The fundamental difference between WebSVN and the SVN authz module is that we don't have access to the C API and have to exec, which is comparatively slow. Maybe someone knows how to properly call C from PHP, and we could solve this problem that way.

@ams-tschoening
Contributor

Should I add some debug code to help you find the source of these svnauthz calls?

That would be great, especially have a look at svnlook.php and the function getLog. That is called multiple times in log.php as part of SVNRepository::getLog and contains a loop over entries checking read access to those.

foreach ($curLog->entries as $entryKey => $entry) {
	$fullModAccess = true;
	$anyModAccess = (count($entry->mods) == 0);
	$precisePath = null;
	foreach ($entry->mods as $modKey => $mod) {
		$access = $this->repConfig->hasLogReadAccess($mod->path);
		if ($access) {

I suggest simply removing that check in favor of true to see if that's the bottleneck already and if so, you need to provide more detailed output. Would be good to look at the path the function is called with, the returned XML history, the entries iterated over and stuff like that.

Have a look at var_dump to output things into HTML and afterwards simply copy the necessary data when looking at HTML source in your browser.
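
When the dumped value is XML, the browser interprets the tags as markup, which makes plain var_dump output hard to read. A minimal sketch of a debug helper that escapes the value first, assuming a hypothetical function name `debugDump` (not part of WebSVN):

```php
<?php
// Hypothetical debug helper, not part of WebSVN.
// Renders any value as escaped text so that XML or HTML inside it
// stays visible in the browser instead of being parsed as markup.
function debugDump(string $label, $value): string
{
    // var_export with true returns the representation instead of printing it.
    $text = is_string($value) ? $value : var_export($value, true);

    return '<pre>' . htmlspecialchars($label . ': ' . $text, ENT_QUOTES, 'UTF-8') . "</pre>\n";
}

// Usage: echo the result wherever the value is available.
echo debugDump('mod path', '/trunk/dir_with_thousand_files/myfile0001.txt');
echo debugDump('xml', '<logentry revision="231634"></logentry>');
```

With this, the output can be read directly on the page instead of only in the HTML source.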

@loeschers
Author

I already started yesterday to find out where to insert some debug code.
I don't really know PHP, but I am on a steep learning curve :-)

I updated from 2.6.1 to the current Git code.

After this change

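			foreach ($entry->mods as $modKey => $mod) {
				// Debug: print every path that would be checked.
				var_dump($mod->path); echo "<br>\n";
				// Debug: skip the real access check and always grant access.
				$access = 1;
				// $access = $this->repConfig->hasLogReadAccess($mod->path);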

the log is displayed quickly and correctly.
It lists all files in the directory exactly five times in the form:

string(33) "/trunk/dir_with_thousand_files/myfile0001.txt"

Then I added the $cmd output to see the svn commands:

string(137) "svn --non-interactive --config-dir /tmp/websvn log --xml --limit 1 -r HEAD:1 'file:///data/svn/myrepo/trunk/dir_with_thousand_files/myfile0001.txt@HEAD'"
string(149) "svn --non-interactive --config-dir /tmp/websvn log --xml --quiet --limit 40 -r 231634:1 'file:///data/svn/myrepo/trunk/dir_with_thousand_files/myfile0001.txt@231634'"
string(145) "svn --non-interactive --config-dir /tmp/websvn log --xml --verbose -r 231634:192697 'file:///data/svn/myrepo/trunk/dir_with_thousand_files/myfile0001.txt@231634'"

I don't know how to dump the XML results in PHP, but on the command line the first two svn commands list only entries related to the file myfile0001.txt, while the last one displays entries for all files in that directory.
And I know the reason: In revision 231634 all ~6000 files were modified!

What can we do now?

@ams-tschoening
Contributor

And I know the reason: In revision 231634 all ~6000 files were modified!

I expected something like that. There are two aspects here:

  1. Is the behaviour of WebSVN to check all files mentioned in the revision correct or not?
  2. If it's correct, is there any way to speed things up?

For question 1, it might make sense to test how other clients behave, like TortoiseSVN or similar, especially if access to some of the files is restricted and to others not. I've already seen that TSVN sometimes doesn't show log entries because of a lack of permissions, but I'm not sure how possibly unrelated files influence this. Besides that, you might want to ask on the SVN user mailing list what the correct behaviour is. I simply don't know either.

From my point of view, when only the history of one specific file is of interest, checking the permissions for only that file should be enough. But people might have implemented it the way it is for a reason.

About performance: As long as the external svnauthz is used, there's not much that can be done. Caches don't help when hitting the problem the first time. A C binding probably doesn't exist currently and is most likely too much work for us as well. In theory, as SVN is OSS, one might be able to put a wrapper around the existing svnauthz code to make a daemon out of it... But this is a lot of work as well, getting everything built and so on; I won't be able to do it.

@michael-o
Member

About performance: As long as the external svnauthz is used, there's not much that can be done. Caches don't help when hitting the problem the first time. A C binding probably doesn't exist currently and is most likely too much work for us as well. In theory, as SVN is OSS, one might be able to put a wrapper around the existing svnauthz code to make a daemon out of it... But this is a lot of work as well, getting everything built and so on; I won't be able to do it.

A daemon isn't necessary. I assume that the Apache authz module loads the file at request time, holds it in memory, processes the request, and then frees the resources. This is basically what we need. svnauthz is just a C wrapper around SVN library calls. We need a PHP wrapper around these calls.

@michael-o
Member

This is what we need:
https://github.com/apache/subversion/blob/7bc28e4c79fb2e2211c8f569790e880a4115e405/tools/server-side/svnauthz.c#L321-L401 plus the glue code from main() and sub_main().

@loeschers
Author

1. Is the behaviour of WebSVN to check all files mentioned in the revision correct or not?

I would say: Checking all files in the revision is basically valid, but checking files in the same directory is unnecessary.
Because as far as I know, SVN does not support access restrictions on a single file. Access control always works at the directory level, so there is no need to check the files in the same directory.

@ams-tschoening
Contributor

Because as far as I know, SVN does not support access restrictions on a single file.

I'm pretty sure SVN maintains access to arbitrary paths:

As files are paths, too, it's even possible to restrict access on a per file basis.

https://svnbook.red-bean.com/en/1.7/svn.serverconfig.pathbasedauthz.html

@michael-o
Member

@loeschers mod_authz_svn has full logging. Can you enable it to see whether it checks each and every node (file)? I read the source code and in my opinion it does.

@loeschers
Author

@ams-tschoening You are right. Current SVN is able to restrict access on a per-file basis as well.

@michael-o I enabled Apache debug LogLevel and then executed something like that:
svn log https://myserver.my.domain/myrepo/trunk/dir_with_thousand_files/myfile0001.txt
This generates thousands of authz_svn messages in the log and also checks all ~6000 files.

This means as a result: The logic in WebSVN is correct. :-)

But would there be a way to speed it up or at least avoid the Apache timeout?

@michael-o
Member

@ams-tschoening You are right. Current SVN is able to restrict access on a per-file basis as well.

@michael-o I enabled Apache debug LogLevel and then executed something like that:
svn log https://myserver.my.domain/myrepo/trunk/dir_with_thousand_files/myfile0001.txt
This generates thousands of authz_svn messages in the log and also checks all ~6000 files.

This means as a result: The logic in WebSVN is correct. :-)

But would there be a way to speed it up or at least avoid the Apache timeout?

I see only two ways:

  1. Investigate why three calls are necessary and what they represent in the UI.
  2. Use C from PHP to make the validation without svnauthz command.
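
For the first option, one way to find out where repeated checks come from is to count checks per path together with the function that requested them. A minimal sketch, assuming a hypothetical helper `countAccessCheck` that would be called as the first statement of the access-check function; `checkAccess` and `renderLog` are stand-ins for the real call chain, not WebSVN functions.

```php
<?php
// Hypothetical diagnostic, not part of WebSVN: call it as the first
// statement of the access-check function to record who asked for which path.
function countAccessCheck(string $path, array &$counts): void
{
    // Frame 0 is this function, frame 1 the access check, frame 2 its caller.
    $trace = debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 3);
    $caller = $trace[2]['function'] ?? '(top level)';

    $key = $caller . ' ' . $path;
    $counts[$key] = ($counts[$key] ?? 0) + 1;
}

// Stand-in for the real access check.
function checkAccess(string $path, array &$counts): bool
{
    countAccessCheck($path, $counts);
    return true;
}

// Stand-in for the code that renders the log and checks the same path twice.
function renderLog(array &$counts): void
{
    checkAccess('/trunk/a.txt', $counts);
    checkAccess('/trunk/a.txt', $counts);
}

$counts = [];
renderLog($counts);
arsort($counts); // most frequently checked caller/path pairs first
print_r($counts);
```

Sorting by count puts the caller and path combinations that are checked most often at the top, which should show whether the repeats come from one loop or from several independent code paths.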

@ams-tschoening
Contributor

But would there be a way to speed it up or at least avoid the Apache timeout?

Avoiding the timeout is easy in theory, just set it high enough... ;-)

Timeout 3600

https://httpd.apache.org/docs/2.4/mod/core.html#timeout
