Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cloud): Add File Cache Consistency Check (#41280) #43440

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Lupinus
Copy link

@Lupinus Lupinus commented Nov 7, 2024

Add a feature to verify the consistency of the file cache, checking whether the disk cache is under the control of the file cache management system.

What problem does this PR solve?

Occasionally, we found that there have been cases of disk cache data escaping from the management of Doris file cache, causing disk space leaks. To make it easier for debugging, we need a checking tool that compares the contents in the Doris file cache memory management structure with the current disk contents to identify the differences between the two (which are potential problematic data).

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No colde files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.
  • Release note

    None

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -588,6 +588,114 @@ void FSFileCacheStorage::load_cache_info_into_memory(BlockFileCache* _mgr) const
TEST_SYNC_POINT_CALLBACK("BlockFileCache::TmpFile2");
}

void FSFileCacheStorage::check_consistency(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'check_consistency' has cognitive complexity of 59 (threshold 50) [readability-function-cognitive-complexity]

void FSFileCacheStorage::check_consistency(
                         ^
Additional context

be/src/io/cache/fs_file_cache_storage.cpp:596: nesting level increased to 1

    auto check = [_mgr, &cache_lock, &confirmed_blocks, &inconsistent_cache_context,
                 ^

be/src/io/cache/fs_file_cache_storage.cpp:598: +2, including nesting penalty of 1, nesting level increased to 2

        for (; key_it != std::filesystem::directory_iterator(); ++key_it) {
        ^

be/src/io/cache/fs_file_cache_storage.cpp:607: +3, including nesting penalty of 2, nesting level increased to 3

            if (ec) [[unlikely]] {
            ^

be/src/io/cache/fs_file_cache_storage.cpp:613: +3, including nesting penalty of 2, nesting level increased to 3

            for (; offset_it != std::filesystem::directory_iterator(); ++offset_it) {
            ^

be/src/io/cache/fs_file_cache_storage.cpp:618: +4, including nesting penalty of 3, nesting level increased to 4

                if (!this->parse_filename_suffix_to_cache_type(
                ^

be/src/io/cache/fs_file_cache_storage.cpp:627: +4, including nesting penalty of 3, nesting level increased to 4

                if (!cell || cell->is_deleted) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:627: +1

                if (!cell || cell->is_deleted) {
                          ^

be/src/io/cache/fs_file_cache_storage.cpp:636: +4, including nesting penalty of 3, nesting level increased to 4

                                ? cell->dowloading_size()
                                ^

be/src/io/cache/fs_file_cache_storage.cpp:635: +1

                        (is_tmp && cell->file_block->state() == FileBlock::State::DOWNLOADING)
                                ^

be/src/io/cache/fs_file_cache_storage.cpp:639: +4, including nesting penalty of 3, nesting level increased to 4

                if (size != expected_size) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:643: +4, including nesting penalty of 3, nesting level increased to 4

                if (cache_type != cell->file_block->cache_type()) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:647: +4, including nesting penalty of 3, nesting level increased to 4

                if (expiration_time != cell->file_block->expiration_time()) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:652: +4, including nesting penalty of 3, nesting level increased to 4

                if (inconsistent_type != InconsistentCacheContext::InconsistentType::NONE) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:659: +1, including nesting penalty of 0, nesting level increased to 1

    if constexpr (USE_CACHE_VERSION2) {
    ^

be/src/io/cache/fs_file_cache_storage.cpp:661: +2, including nesting penalty of 1, nesting level increased to 2

        if (ec) {
        ^

be/src/io/cache/fs_file_cache_storage.cpp:665: +2, including nesting penalty of 1, nesting level increased to 2

        for (; key_prefix_it != std::filesystem::directory_iterator(); ++key_prefix_it) {
        ^

be/src/io/cache/fs_file_cache_storage.cpp:666: +3, including nesting penalty of 2, nesting level increased to 3

            if (!key_prefix_it->is_directory()) {
            ^

be/src/io/cache/fs_file_cache_storage.cpp:670: +3, including nesting penalty of 2, nesting level increased to 3

            if (key_prefix_it->path().filename().native().size() != KEY_PREFIX_LENGTH) {
            ^

be/src/io/cache/fs_file_cache_storage.cpp:675: +4, including nesting penalty of 3, nesting level increased to 4

                if (ec) {
                ^

be/src/io/cache/fs_file_cache_storage.cpp:682: +3, including nesting penalty of 2, nesting level increased to 3

            if (ec) {
            ^

be/src/io/cache/fs_file_cache_storage.cpp:688: +1, nesting level increased to 1

    } else {
      ^

be/src/io/cache/fs_file_cache_storage.cpp:690: +2, including nesting penalty of 1, nesting level increased to 2

        if (ec) {
        ^

@@ -588,6 +588,114 @@
TEST_SYNC_POINT_CALLBACK("BlockFileCache::TmpFile2");
}

void FSFileCacheStorage::check_consistency(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'check_consistency' exceeds recommended size/complexity thresholds [readability-function-size]

void FSFileCacheStorage::check_consistency(
                         ^
Additional context

be/src/io/cache/fs_file_cache_storage.cpp:590: 102 lines including whitespace and comments (threshold 80)

void FSFileCacheStorage::check_consistency(
                         ^

@@ -21,6 +21,7 @@
#include <gen_cpp/Types_types.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gen_cpp/Types_types.h' file not found [clang-diagnostic-error]

#include <gen_cpp/Types_types.h>
         ^

@@ -5284,6 +5285,145 @@
}
}

TEST_F(BlockFileCacheTest, check_fs_file_cache_consistency) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

    ^
Additional context

be/test/io/cache/block_file_cache_test.cpp:5287: 137 lines including whitespace and comments (threshold 80)

    ^

@@ -102,6 +102,12 @@ class FSFileCacheStorage : public FileCacheStorage {

void load_cache_info_into_memory(BlockFileCache* _mgr) const;

void check_consistency(
BlockFileCache* _mgr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better not couple this interface with FileCacheManager. The method of checking here is that the storage can return a list with comprehensive "file" information (including path, size, creation time, etc.), preferably in an ordered manner. Let the FileCacheManager, which depends on this storage, perform its own forward and reverse checks to see if there is any extra or missing data in the storage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the correction, I will make the modifications promptly.

@freemandealer
Copy link
Contributor

run buildall

Copy link
Contributor

@freemandealer freemandealer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greaaat work! thanks for your contribution!

@@ -50,6 +50,8 @@ constexpr static std::string_view CLEAR = "clear";
constexpr static std::string_view RESET = "reset";
constexpr static std::string_view HASH = "hash";
constexpr static std::string_view LIST_CACHE = "list_cache";
constexpr static std::string_view VIEW = "view";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_base_paths might be more explicit

*json_metrics = json_array.ToString();
} else if (operation == CHECK) {
const std::string& cache_base_path = req->param(BASE_PATH.data());
auto* block_file_cache = io::FileCacheFactory::instance()->get_by_path(cache_base_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle the error if the path is invalid. otherwise this will crash into nullptr.

std::vector<std::string> BlockFileCache::check_file_cache_consistency() {
std::unordered_set<AccessKeyAndOffset, KeyAndOffsetHash> confirmed_blocks;
std::vector<InconsistentCacheContext> inconsistent_cache_context;
std::lock_guard<std::mutex> lock(_mutex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming advice: lock -> cache_lock for efficient code search.

uint32_t type;

public:
enum : uint32_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very compact, that's nice...

public:
enum : uint32_t {
NONE = 0,
FILES_INCONSISTENT = 1 << 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the description for each type is a necessary here

enum : uint32_t {
NONE = 0,
FILES_INCONSISTENT = 1 << 0,
STORAGE_INCONSISTENT = 1 << 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"INCONSISTENT" is bidirectional. Might 'MISSING_IN_STORAGE' be more explicit?

<< "Expiration Time: " << expiration_time << "\n"
<< "Offset: " << offset << "\n"
<< "Cache Type: " << file_cache_type_to_string(cache_type) << "\n"
<< "Inconsistent Type: " << inconsistent_type.to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Reason:"

FileCacheType cache_type = FileCacheType::NORMAL;
if (!this->parse_filename_suffix_to_cache_type(
fs, offset_it->path().filename().native(), expiration_time, size,
&offset, &is_tmp, &cache_type)) {
Copy link
Contributor

@freemandealer freemandealer Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if is_tmp == true, the type will never be assigned (always be the default value FileCacheType::NORMAL). check the internal of parse_filename_suffix_to_cache_type for details.

maybe add TMP as another 'type' in file_cache_type_to_string(be/src/io/cache/file_cache_common.cpp ). As I see it, there is no counterpart in management structure in memory. Maybe we can deal with TMP like this:
if a tmp file with state 'DOWNLOADING', it's consistent. otherwise, inconsistent.

std::string offset_path = offset_it->path();
auto* cell = _mgr->get_cell(hash, offset, cache_lock);

if (!cell || cell->is_deleted) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_deleted works as a flag, the file is not deleted yet

continue;
}
if (key_prefix_it->path().filename().native().size() != KEY_PREFIX_LENGTH) {
LOG(WARNING) << "Unknown directory " << key_prefix_it->path().native()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no deletion here. this is just a checking tool, don't touch anything!

check(key_it);
}
} else {
std::filesystem::directory_iterator key_it {_cache_base_path, ec};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure this check works for both VERSION1 and VERSION2? if not, just warning here "only support version2"

_storage->check_consistency(this, confirmed_blocks, inconsistent_cache_context, lock);
for (const auto& [hash, offset_to_cell] : _files) {
for (const auto& [offset, cell] : offset_to_cell) {
if (cell.is_deleted || confirmed_blocks.contains({hash, offset})) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_deleted works as a flag, the file is not deleted yet

public:
enum : uint32_t {
NONE = 0,
FILES_INCONSISTENT = 1 << 0,
Copy link
Contributor

@freemandealer freemandealer Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be 'NOT_LOADED'?

assert_range(2, blocks[0], io::FileBlock::Range(10, 18), io::FileBlock::State::DOWNLOADING);
download(blocks[0]);
mgr._files[key1].erase(10);
mgr.check_file_cache_consistency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's for DEBUG, forgot to remove :D

std::string dir_path = fs_file_cache_storage->get_path_in_local_cache(key1, 0);
fs::path block_file_path = std::filesystem::path(dir_path) / "20";
fs::remove(block_file_path);
mgr.check_file_cache_consistency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

std::ofstream out_file(block_file_path, std::ios::out | std::ios::app);
out_file << data;
out_file.close();
mgr.check_file_cache_consistency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

assert_range(2, blocks[0], io::FileBlock::Range(40, 48), io::FileBlock::State::DOWNLOADING);
download(blocks[0]);
blocks[0]->_key.meta.type = io::FileCacheType::INDEX;
mgr.check_file_cache_consistency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

assert_range(2, blocks[0], io::FileBlock::Range(0, 8), io::FileBlock::State::DOWNLOADING);
download(blocks[0]);
blocks[0]->_key.meta.expiration_time = 0;
mgr.check_file_cache_consistency();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

@@ -5284,6 +5285,155 @@ TEST_F(BlockFileCacheTest, file_cache_path_storage_parse) {
}
}

TEST_F(BlockFileCacheTest, check_fs_file_cache_consistency) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'TEST_F' exceeds recommended size/complexity thresholds [readability-function-size]

    ^
Additional context

be/test/io/cache/block_file_cache_test.cpp:5287: 147 lines including whitespace and comments (threshold 80)

    ^

@Lupinus Lupinus force-pushed the issue#41280 branch 2 times, most recently from b2e2c54 to 590dffa Compare November 13, 2024 07:37
@Lupinus
Copy link
Author

Lupinus commented Nov 13, 2024

run buildall

@Lupinus
Copy link
Author

Lupinus commented Nov 21, 2024

run buildall

Copy link
Contributor

@freemandealer freemandealer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (cache_base_path.empty()) {
st = Status::InvalidArgument("missing parameter: {} is required", VALUE.data());
} else {
auto* block_file_cache = io::FileCacheFactory::instance()->get_by_path(cache_base_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!block_file_cache) {
st = Status::InvalidArgument("invalid path: {}", VALUE.data());
return st;
}

@freemandealer
Copy link
Contributor

run external

Add a feature to verify the consistency of the file cache, checking
whether the disk cache is under the control of the file cache management
system.
@Lupinus
Copy link
Author

Lupinus commented Dec 5, 2024

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.48% (10015/26025)
Line Coverage: 29.56% (84077/284459)
Region Coverage: 28.66% (43206/150758)
Branch Coverage: 25.26% (21978/86990)
Coverage Report: http://coverage.selectdb-in.cc/coverage/84a990d31626571d7dcb2646d2ce9243de9391c7_84a990d31626571d7dcb2646d2ce9243de9391c7/report/index.html

@Lupinus
Copy link
Author

Lupinus commented Dec 5, 2024

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.49% (10014/26015)
Line Coverage: 29.55% (84069/284497)
Region Coverage: 28.65% (43201/150788)
Branch Coverage: 25.25% (21973/87020)
Coverage Report: http://coverage.selectdb-in.cc/coverage/84a990d31626571d7dcb2646d2ce9243de9391c7_84a990d31626571d7dcb2646d2ce9243de9391c7/report/index.html

@Lupinus
Copy link
Author

Lupinus commented Dec 7, 2024

run p0

@Lupinus
Copy link
Author

Lupinus commented Dec 15, 2024

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.77% (10116/26094)
Line Coverage: 29.73% (85028/285997)
Region Coverage: 28.79% (43638/151559)
Branch Coverage: 25.36% (22190/87498)
Coverage Report: http://coverage.selectdb-in.cc/coverage/84a990d31626571d7dcb2646d2ce9243de9391c7_84a990d31626571d7dcb2646d2ce9243de9391c7/report/index.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants