Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support hdfs trash #17756

Open
wants to merge 4 commits into
base: master-2.x
Choose a base branch
from

Conversation

humengyu2012
Copy link
Contributor

@humengyu2012 humengyu2012 commented Jul 10, 2023

What changes are proposed in this pull request?

Support hdfs trash in hdfs ufs.

Why are the changes needed?

#17723

HDFS trash is a very useful feature, and we should support it. It allows users to recover deleted files from the trash after accidental deletion.

Usage:

add config in alluxio-site.properties:

alluxio.underfs.hdfs.trash.enable=true

add config in hdfs core-site.xml:

    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>

Does this PR introduce any user facing changes?

Please list the user-facing changes introduced by your change, including

  1. change in user-facing APIs YES
  2. addition or removal of property keys YES
  3. webui NO

@ChunxuTang
Copy link
Member

@humengyu2012 Thanks for your nice and consistent contributions!

@HelloHorizon @elega Would you mind taking a review of this PR?

@maobaolong
Copy link
Contributor

@humengyu2012 Thanks for your contribution for this. I have some question for this PR.

  • Could you explain the use case of yours?
  • AFAIK, HDFS trash is protected(Delete->Move) in client-side, so I guess we should do nothing in the alluxio layer, just let hdfs client do this. Further more, I found hdfs shell command use the hdfs trash and namenode manage the trash.

Please let me know if something wrong with me, thanks!

@humengyu2012
Copy link
Contributor Author

@humengyu2012 Thanks for your contribution for this. I have some question for this PR.

  • Could you explain the use case of yours?
  • AFAIK, HDFS trash is protected(Delete->Move) in client-side, so I guess we should do nothing in the alluxio layer, just let hdfs client do this. Further more, I found hdfs shell command use the hdfs trash and namenode manage the trash.

Please let me know if something wrong with me, thanks!

Firstly, when deleting a large directory in HDFS, a large number of block reporting operations can cause the namenode to hang, so we need to protect the namenode. Secondly, we mount HDFS to users' pods through Alluxio Fuse. They may perform deletion operations, but because the local directory doesn't appear to be "that large," they may not realize that they are deleting a large directory. Lastly, if a user accidentally deletes a file in Alluxio, there are no protective measures in place since Alluxio directly deletes the file. As a result, the user will be unable to recover the mistakenly deleted file.

@maobaolong
Copy link
Contributor

@humengyu2012 Thanks for your explain, after reading and thinking about your comment, i think the trash is useful for alluxio, so could it be possible to set up a policy for alluxio filesystem?

@humengyu2012
Copy link
Contributor Author

Unfortunately, it is difficult to support this in the Alluxio filesystem because some file systems, such as object storage, do not support the rename operation. Therefore, it is challenging to create a unified abstraction. Even for UFS (e.g., HDFS) that supports the rename operation, it is still subject to limitations imposed by viewfs and RBF. Hence, I tend to favor each UFS providing its own implementation for this feature, and it is up to the user to evaluate whether to enable it.

@qian0817
Copy link
Contributor

AFAIK, after fs.trash.interval time, the namenode will delete the trash directory. This still seems to cause the namenode hang problem you mentioned. Can you explain why configuring the trash will avoid this problem? thanks.

@humengyu2012
Copy link
Contributor Author

humengyu2012 commented Jul 18, 2023

AFAIK, after fs.trash.interval time, the namenode will delete the trash directory. This still seems to cause the namenode hang problem you mentioned. Can you explain why configuring the trash will avoid this problem? thanks.

Cleaning the trash directory can indeed cause the Namenode to become unresponsive. As far as I know, most users have implemented their own asynchronous cleanup logic for the trash directory. In this scenario, moving files to the trash is preferred over direct deletion. However, the primary objective of this pull request (PR) is not to alleviate the HDFS load but rather to provide users with a way to recover files from the recycle bin after deletion.

return hdfs.delete(hdfsPath, recursive);
}
// move to trash
if (isDirectory && !recursive) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the directory is empty and recursion is false, it should delete the directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix it.

if (isDirectory && !recursive) {
return false;
}
return trash.get().moveToTrash(hdfsPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if moveToTrash method return false(For example, this file is already in the trash). The file should be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting that after a failed deletion in the recycle bin, fallback to the regular deletion logic should be implemented? However, this may change the user's expectations. I believe that if deletion fails, the user should be informed of the failure instead of automatically deleting the file for them.

Copy link
Contributor Author

@humengyu2012 humengyu2012 Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, HDFS trash automatically handles files with the same name for us.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example is that if the moveToTrash method is called on the directory /user/hdfs/.Trash/xxx, it will return false. If occur exception like network exception or permission denied exception, it will throw IOException.

@qian0817
Copy link
Contributor

AFAIK, after fs.trash.interval time, the namenode will delete the trash directory. This still seems to cause the namenode hang problem you mentioned. Can you explain why configuring the trash will avoid this problem? thanks.

Cleaning the trash directory can indeed cause the Namenode to become unresponsive. As far as I know, most users have implemented their own asynchronous cleanup logic for the trash directory. In this scenario, moving files to the recycle bin is preferred over direct deletion. However, the primary objective of this pull request (PR) is not to alleviate the HDFS load but rather to provide users with a way to recover files from the recycle bin after deletion.

I think what you mean by asynchronous clean up is essentially calling the synchronous deletion interface, FYI https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/TrashPolicyDefault.java#L396

@humengyu2012
Copy link
Contributor Author

Asynchronous file cleanup is implemented by the users themselves and is independent of HDFS. They will move the files from the trash to another location and then delete them one by one. This approach helps protect the namenode.

@qian0817
Copy link
Contributor

I believe that most users are using HDFS's native trash mechanism for cleanup. In addition, I believe that converting all deletion operation to a movement operation is more suitable to be implemented on the namenode.

@humengyu2012
Copy link
Contributor Author

I believe that most users are using HDFS's native trash mechanism for cleanup. In addition, I believe that converting all deletion operation to a movement operation is more suitable to be implemented on the namenode.

This PR provides the fastest implementation of the HDFS trash feature, and I don't think it's crucial whether it gets merged or not. If there are Alluxio users who want this functionality in the future, this PR provides them with an option, considering that Alluxio currently lacks any implementation related to the hdfs trash. BTW, has Alluxio considered implementing a generic trash feature?

@qian0817
Copy link
Contributor

I believe that most users are using HDFS's native trash mechanism for cleanup. In addition, I believe that converting all deletion operation to a movement operation is more suitable to be implemented on the namenode.

This PR provides the fastest implementation of the HDFS trash feature, and I don't think it's crucial whether it gets merged or not. If there are Alluxio users who want this functionality in the future, this PR provides them with an option, considering that Alluxio currently lacks any implementation related to the hdfs trash. BTW, has Alluxio considered implementing a generic trash feature?

My point of view is that if users want to protect the data in HDFS, then it is more appropriate to perform this conversion on the HDFS namenode, as this can prevent all deletion requests to HDFS not only through Alluxio. If you want to prevent deleting data through Alluxio and instead move it to trash, then implementing a universal trash framework in Alluxio makes more sense. As you mentioned, designing a universal trash framework is challenging because different storage systems have different semantics.

@humengyu2012
Copy link
Contributor Author

Thank you for continuously providing me with advice. I highly agree with your viewpoints. However, implementing a universal trash feature in Alluxio is extremely complex and not an easily achievable task. For users who have a strong need for HDFS trash functionality, waiting for a long development cycle can be very painful. Perhaps they simply want a simple and usable HDFS trash feature. Based on this, I have submitted this PR, which is a quick implementation of HDFS trash. It has some value as a reference, and although it may not be perfect, it is simple and easy to use. Therefore, whether this PR gets merged or not is not important. I just hope that when users search for "alluxio hdfs trash," they can see this PR. From my interactions with the community, not every user has the energy to modify Alluxio's code themselves when they need a new feature. Having some PRs that can serve as references will make it easier for them to accomplish their work. Lastly, thank you once again for your review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants