mountpoint-s3 has high memory usage, resulting in an "Out of memory: Kill process" error in the analysis software diamond #566
Comments
Hi! Can you please attach log outputs after adding the logging options?
@sauraank Hi Ankit, here is the last part of the log output:
You can get the whole log file from the attachment: mountpoint-s3-2023-10-27T06-44-50Z.zip
Hey @sunl, I have three quick questions:
I took a quick look over the logs. Here are some rough notes for others looking into this issue.
My bad, I misunderstood part of the issue. It's the diamond application which is encountering the "Out of memory" error due to memory usage on the machine. This explains some of the behaviors, such as FUSE receiving an …

We recently made a fix to some prefetching behavior which could cause excessive data to be fetched from S3: #488. The fix was released in mount-s3 v1.1.0. @sunl, can you see if using the latest version solves the issue?

Separately, I am planning to dig into what other metrics may help us understand what's going on on the mountpoint-s3 side, so we can diagnose an issue like this much faster.
In background mode we have two processes both racing on the log file, and they can scribble each other's log entries (I saw this in awslabs#566). O_APPEND should fix that. We should also log the version number as a point of reference. Signed-off-by: James Bornholt <[email protected]>
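As a rough Python illustration of the point in that commit message (Mountpoint itself is written in Rust, so this is not its actual code): opening a shared log file with O_APPEND makes every write land atomically at the current end of the file, so two processes appending to the same log cannot overwrite each other's entries.

```python
import os

# Open the log file in append mode: O_APPEND guarantees each write goes to
# the current end of file, even when several processes share the same file.
fd = os.open("/tmp/example.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"log line from one process\n")
os.close(fd)
```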
By the way, I saw the new version released last week, so I upgraded to v1.1.0 and did some tests with the --debug, --log-metrics, and --log-directory options. I still got the same issue.
Hi @sunl, agreed that the memory in those instances should be enough to run your applications, and we might want to dig deeper into this problem. The next step for us is to reproduce the issue, but it's a bit hard to use diamond as we're not familiar with the software. It would be great if you could provide a small script (probably Python) that constantly reads data from the mounted directory until we see the OOM problem. We will let you know as soon as we have any updates on this issue.
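A minimal sketch of the kind of reproduction script being asked for, assuming the bucket is mounted at /database and nr.dmnd is the large object to read (both taken from the issue description; the path and chunk size are illustrative):

```python
import time

MOUNTED_FILE = "/database/nr.dmnd"  # large object on the Mountpoint mount (illustrative path)
CHUNK_SIZE = 1024 * 1024            # read in 1 MiB chunks

def read_once(path):
    """Stream the whole file once and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total

if __name__ == "__main__":
    passes = 0
    while True:  # keep reading until memory pressure (or OOM) is observed
        passes += 1
        start = time.time()
        nbytes = read_once(MOUNTED_FILE)
        print(f"pass {passes}: read {nbytes} bytes in {time.time() - start:.1f}s", flush=True)
```

Running this on the same instance type while watching free memory should show whether usage grows without bound.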
I'm experiencing a similar issue. Running a find …
Hey @daltschu22, do you mind opening a new bug report? This sounds like a case that will be easier to reproduce: details of the EC2 instance or machine being used, the number of S3 objects, and how they are organized will all be useful. The exact find command (redactions OK) would also be great.
Hi: Per Dev's request, I opened AWS support ticket 14460192231 with more details.
Update: we can no longer reproduce this leak behavior using the latest release. As we run the scan, we see memory usage creep upwards, but at the end of the scan it's on the order of 100 MB used. That usage doesn't go away until we unmount, but it doesn't go higher than that. In earlier releases, it would climb uncontrollably until the system became unresponsive.
That's brilliant news, thank you for sharing! I hope it is broadly applicable, though I doubt it will have solved the issue @sunl is facing; the logs shared there show only one file handle hanging around. The last release (v1.3.2) included a change (6e7252d) which fixed a potential memory leak due to unreleased file handles. This leak was reported in #670, and you may have been impacted if you saw lots of messages in the logs like "mountpoint_s3::fuse: release failed: unable to unwrap file handle reference". I'll share this in that particular issue so folks can try it out too.
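If you want to check your own logs for that symptom, a small sketch like the following works; the log location below is an assumption and should match whatever directory was passed to --log-directory:

```python
import glob

LOG_GLOB = "/var/log/mountpoint-s3/*.log"  # assumed location; match your --log-directory
PATTERN = "release failed: unable to unwrap file handle reference"

# Count how many log lines mention the unreleased-file-handle symptom.
count = 0
for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as f:
        count += sum(1 for line in f if PATTERN in line)

print(f"found {count} matching log lines")
```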
I do recall seeing that message earlier in one of our test envs late last year (running an earlier version though I don't recall which one), but I didn't have time then to capture logs and look into it. I definitely did not see that when I was trying to reproduce this most recently. Anyway, sorry to hijack @sunl's issue, I thought I was hitting the same bug. I'll shut up now ;)
Mountpoint v1.10.0 has been released with some prefetcher improvements and might reduce memory usage. Could you please try upgrading to see if it provides any improvements for you?
With adaptive prefetching (#987) released in Mountpoint v1.10.0, we expect that many OOM cases will be avoided. We'll close this issue for now. If you do see OOM errors, please do open a new bug report. |
Mountpoint for Amazon S3 version
mount-s3 v1.0.2
AWS Region
cn-northwest-1
Describe the running environment
Running on EC2 (c6i.4xlarge) with CentOS 7 against an S3 bucket in the same account.
What happened?
My testing scenario uses diamond blast for genomics analysis; the command used is as follows:
In that command, nr.dmnd is the reference database that the analysis relies on, stored in the S3 bucket. The S3 bucket is mounted to the /database directory on the EC2 instance with mount-s3.
Each EC2 instance runs one task to process one sample, and each task takes approximately 6 hours. I started 20 EC2 instances simultaneously, running 20 tasks (indexed from 1 to 20). Among the 20 tasks, 2-3 often fail after running for about 5 hours, and the indices of the failing tasks differ each time. The following error message is observed:
Then I replaced mount-s3 with goofys, and all 20 tasks succeeded every time, but the analysis took about 8%-10% longer than with mount-s3.
I suspect that this issue is caused by the high memory usage of mount-s3. Do you have any suggestions, and could the memory usage of mount-s3 be optimized in the future?
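Not part of the original report, but a small sketch (assuming a Linux host where the Mountpoint process shows up as mount-s3) of how the mount-s3 resident memory could be tracked during a run to confirm this suspicion:

```python
import subprocess
import time

def mount_s3_rss_kb():
    """Return the VmRSS of the mount-s3 process in kB, or None if not running."""
    pid = subprocess.run(["pidof", "-s", "mount-s3"],
                         capture_output=True, text=True).stdout.strip()
    if not pid:
        return None
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value reported in kB
    return None

while True:
    rss = mount_s3_rss_kb()
    print(f"mount-s3 RSS: {rss} kB" if rss else "mount-s3 not running", flush=True)
    time.sleep(60)  # sample once a minute over the ~6 hour run
```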
Relevant log output
No response