-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support hdfs trash #17756
base: master-2.x
Are you sure you want to change the base?
Support hdfs trash #17756
Conversation
@humengyu2012 Thanks for your nice and consistent contributions! @HelloHorizon @elega Would you mind taking a review of this PR? |
@humengyu2012 Thanks for your contribution for this. I have some question for this PR.
Please let me know if something wrong with me, thanks! |
Firstly, when deleting a large directory in HDFS, a large number of block reporting operations can cause the namenode to hang, so we need to protect the namenode. Secondly, we mount HDFS to users' pods through Alluxio Fuse. They may perform deletion operations, but because the local directory doesn't appear to be "that large," they may not realize that they are deleting a large directory. Lastly, if a user accidentally deletes a file in Alluxio, there are no protective measures in place since Alluxio directly deletes the file. As a result, the user will be unable to recover the mistakenly deleted file. |
@humengyu2012 Thanks for your explain, after reading and thinking about your comment, i think the trash is useful for alluxio, so could it be possible to set up a policy for alluxio filesystem? |
Unfortunately, it is difficult to support this in the Alluxio filesystem because some file systems, such as object storage, do not support the rename operation. Therefore, it is challenging to create a unified abstraction. Even for UFS (e.g., HDFS) that supports the rename operation, it is still subject to limitations imposed by viewfs and RBF. Hence, I tend to favor each UFS providing its own implementation for this feature, and it is up to the user to evaluate whether to enable it. |
AFAIK, after |
Cleaning the trash directory can indeed cause the Namenode to become unresponsive. As far as I know, most users have implemented their own asynchronous cleanup logic for the trash directory. In this scenario, moving files to the trash is preferred over direct deletion. However, the primary objective of this pull request (PR) is not to alleviate the HDFS load but rather to provide users with a way to recover files from the recycle bin after deletion. |
return hdfs.delete(hdfsPath, recursive); | ||
} | ||
// move to trash | ||
if (isDirectory && !recursive) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the directory is empty and recursion is false, it should delete the directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix it.
if (isDirectory && !recursive) { | ||
return false; | ||
} | ||
return trash.get().moveToTrash(hdfsPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if moveToTrash
method return false(For example, this file is already in the trash). The file should be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting that after a failed deletion in the recycle bin, fallback to the regular deletion logic should be implemented? However, this may change the user's expectations. I believe that if deletion fails, the user should be informed of the failure instead of automatically deleting the file for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, HDFS trash automatically handles files with the same name for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example is that if the moveToTrash method is called on the directory /user/hdfs/.Trash/xxx, it will return false. If occur exception like network exception or permission denied exception, it will throw IOException.
I think what you mean by asynchronous clean up is essentially calling the synchronous deletion interface, FYI https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java#L396 |
Asynchronous file cleanup is implemented by the users themselves and is independent of HDFS. They will move the files from the trash to another location and then delete them one by one. This approach helps protect the namenode. |
I believe that most users are using HDFS's native trash mechanism for cleanup. In addition, I believe that converting all deletion operation to a movement operation is more suitable to be implemented on the namenode. |
This PR provides the fastest implementation of the HDFS trash feature, and I don't think it's crucial whether it gets merged or not. If there are Alluxio users who want this functionality in the future, this PR provides them with an option, considering that Alluxio currently lacks any implementation related to the hdfs trash. BTW, has Alluxio considered implementing a generic trash feature? |
My point of view is that if users want to protect the data in HDFS, then it is more appropriate to perform this conversion on the HDFS namenode, as this can prevent all deletion requests to HDFS not only through Alluxio. If you want to prevent deleting data through Alluxio and instead move it to trash, then implementing a universal trash framework in Alluxio makes more sense. As you mentioned, designing a universal trash framework is challenging because different storage systems have different semantics. |
Thank you for continuously providing me with advice. I highly agree with your viewpoints. However, implementing a universal trash feature in Alluxio is extremely complex and not an easily achievable task. For users who have a strong need for HDFS trash functionality, waiting for a long development cycle can be very painful. Perhaps they simply want a simple and usable HDFS trash feature. Based on this, I have submitted this PR, which is a quick implementation of HDFS trash. It has some value as a reference, and although it may not be perfect, it is simple and easy to use. Therefore, whether this PR gets merged or not is not important. I just hope that when users search for "alluxio hdfs trash," they can see this PR. From my interactions with the community, not every user has the energy to modify Alluxio's code themselves when they need a new feature. Having some PRs that can serve as references will make it easier for them to accomplish their work. Lastly, thank you once again for your review. |
What changes are proposed in this pull request?
Support hdfs trash in hdfs ufs.
Why are the changes needed?
#17723
HDFS trash is a very useful feature, and we should support it. It allows users to recover deleted files from the trash after accidental deletion.
Usage:
add config in
alluxio-site.properties
:add config in hdfs
core-site.xml
:Does this PR introduce any user facing changes?
Please list the user-facing changes introduced by your change, including