Meta server with authentication enabled failed and could not be started normally after dropping table #2149

Open
empiredan opened this issue Nov 21, 2024 · 1 comment
Labels
type/bug This issue reports a bug.

Comments


empiredan commented Nov 21, 2024

There was a Pegasus cluster with 3 meta servers and 5 replica servers, with authentication enabled. A script was written to drop a large number of tables.

While the script was running, the meta server crashed with nothing in its logs except `got signal id: 11`, plus the following dmesg output:

[Tue Nov 12 15:32:39 2024]  meta.meta_stat[681978]: segfault at 40 ip 00007faa351ea839 sp 00007faa0d48abc0 error 4 in libdsn_utils.so[7faa35124000+115000]
[Tue Nov 12 15:32:39 2024] Code: 23 f9 ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 48 8d 45 cf 53 4c 8d 67 08 48 83 ec 28 <4c> 8b 7f 18 48 89 45 b8 48 8d 45 ce 4d 39 e7 48 89 45 b0 74 6c 48

In the logs of the failed meta server (namely the primary meta server), many errors like the following were also found:

E2024-11-12 15:32:45.711 (1731396765711870108 a67f5)   meta.meta_server4.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:45.713 (1731396765713890084 a67f5)   meta.meta_server4.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:32:52.225 (1731396772225348205 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:52.226 (1731396772226529887 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:32:58.919 (1731396778919427343 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:58.921 (1731396778921276545 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:06.374 (1731396786374687523 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:06.376 (1731396786376019669 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:14.775 (1731396794775332362 a67f2)   meta.meta_server1.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:14.777 (1731396794777299007 a67f2)   meta.meta_server1.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:30.679 (1731396810679840313 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:30.681 (1731396810681638580 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:37.501 (1731396817501816052 a67f7)   meta.meta_server6.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:37.503 (1731396817503027320 a67f7)   meta.meta_server6.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:44.338 (1731396824338693868 a67f4)   meta.meta_server3.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:44.339 (1731396824339976731 a67f4)   meta.meta_server3.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.

After that, the other standby meta servers also failed while they tried to take over. See the following logs:

E2024-11-12 15:34:33.624 (1731396873624300621 19c265)   meta.meta_server0.010200030000042e: server_state.cpp:689:operator()(): assertion expression: false
F2024-11-12 15:34:33.624 (1731396873624310529 19c265)   meta.meta_server0.010200030000042e: server_state.cpp:689:operator()(): invalid status(app_status::AS_DROPPING) for app(abc(1)) in remote storage

AS_DROPPING was found persisted on the remote meta storage (namely ZooKeeper) as the status of the table:

{"status":"app_status::AS_DROPPING","app_type":"pegasus","app_name":"abc","app_id":1,"partition_count":8, ...}

However, AS_DROPPING is just an intermediate state, which should never appear on ZooKeeper.

From then on, none of the meta servers could be started normally: they exited immediately after startup.


empiredan commented Nov 21, 2024

There are two problems that should be solved:

  1. Why did the primary meta server fail with a segfault while tables were being dropped?
  2. Why could none of the meta servers be restarted normally after the primary meta server failed?

To explain the reasons for both problems more clearly, I'll first describe some of the mechanisms for updating metadata. A Pegasus cluster flushes security policies to the remote meta storage periodically (controlled by `update_ranger_policy_interval_sec`) in the form of environment variables, via `server_state::set_app_envs()`. However, after the metadata has been updated on the remote storage (namely ZooKeeper), there is no check that the table still exists before its environment variables are updated in local memory. See the following code:

void server_state::set_app_envs(const app_env_rpc &env_rpc)
{

...

    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, since `app` is `nullptr` (the table has just been dropped), `app->envs` points to an invalid address, leading to the segfault in `libdsn_utils.so`, which is where `dsn::utils::kv_map_to_string` resides.

Therefore, the reason for the 1st problem is clear: the callback for updating metadata on remote storage runs just after the table has been removed, and an invalid address is accessed through the null pointer.
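
For illustration, here is a minimal sketch of the kind of guard that avoids the dereference, reusing the names from the snippet above (this is a sketch of the idea, not necessarily the exact patch that was merged):

```C++
// Illustrative sketch: inside the callback of do_update_app_info(), re-check the
// table under the write lock before touching its in-memory envs, since the table
// may have been dropped while the remote update was in flight.
zauto_write_lock l(_lock);
std::shared_ptr<app_state> app = get_app(app_name);
if (app == nullptr) {
    LOG_WARNING("app({}) has been dropped, skip updating its envs in memory", app_name);
    return;
}
std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
// ... continue updating app->envs as before ...
```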

As for the 2nd problem: a meta server loads metadata from remote storage after it restarts. The intermediate status AS_DROPPING had been flushed to remote storage together with the security policies, because all metadata for a table is a single JSON object: the whole JSON is written to remote storage once any property is updated. AS_DROPPING is not a valid persisted status, so it cannot pass the assertion below, which makes the meta server fail again and again on every restart, which is the reason for the 2nd problem. See the following code:

server_state::sync_apps_from_remote_storage()
{

...

                    std::shared_ptr<app_state> app = app_state::create(info);
                    {
                        zauto_write_lock l(_lock);
                        _all_apps.emplace(app->app_id, app);
                        if (app->status == app_status::AS_AVAILABLE) {
                            app->status = app_status::AS_CREATING;
                            _exist_apps.emplace(app->app_name, app);
                            _table_metric_entities.create_entity(app->app_id, app->partition_count);
                        } else if (app->status == app_status::AS_DROPPED) {
                            app->status = app_status::AS_DROPPING;
                        } else {
                            CHECK(false,
                                  "invalid status({}) for app({}) in remote storage",
                                  enum_to_string(app->status),
                                  app->get_logname());
                        }
                    }

...

}
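
As a side note, the loading path itself could conceivably be made tolerant of a leaked `AS_DROPPING` status, e.g. by treating it like `AS_DROPPED` and letting the drop procedure resume. This is purely illustrative and is not the approach taken by the commits below, which instead keep `AS_DROPPING` out of remote storage in the first place:

```C++
// Purely illustrative alternative (not the shipped fix): tolerate a leaked
// AS_DROPPING status in sync_apps_from_remote_storage() instead of asserting.
if (app->status == app_status::AS_DROPPED || app->status == app_status::AS_DROPPING) {
    // The table was being dropped either way; resume the dropping procedure.
    app->status = app_status::AS_DROPPING;
} else {
    CHECK(false,
          "invalid status({}) for app({}) in remote storage",
          enum_to_string(app->status),
          app->get_logname());
}
```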

empiredan added a commit that referenced this issue Nov 25, 2024
… setting environment variables after dropping table (#2148)

#2149.

There are two problems that should be solved:

1. Why did the primary meta server fail with a segfault while tables were being dropped?
2. Why could none of the meta servers be restarted normally after the primary
meta server failed?

A Pegasus cluster flushes security policies to the remote meta storage
periodically (controlled by `update_ranger_policy_interval_sec`) in the form of
environment variables, via `server_state::set_app_envs()`. However, after the
metadata has been updated on the remote storage (namely ZooKeeper), there is no
check that the table still exists before its environment variables are updated
in local memory:

```C++
void server_state::set_app_envs(const app_env_rpc &env_rpc)
{

...

    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}
```

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, since
`app` is `nullptr` (the table has just been dropped), `app->envs` points to an invalid
address, leading to the segfault in `libdsn_utils.so`, which is where
`dsn::utils::kv_map_to_string` resides.

Therefore, the reason for the 1st problem is clear: the callback for updating
metadata on remote storage runs just after the table has been removed, and an
invalid address is accessed through the null pointer.

As for the 2nd problem: a meta server loads metadata from remote storage after it
restarts. The intermediate status `AS_DROPPING` had been flushed to remote storage
together with the security policies, because all metadata for a table is a single
`json` object: the whole `json` is written to remote storage once any property is
updated.

`AS_DROPPING` is not a valid persisted status, so it cannot pass the assertion
below, which makes the meta server fail again and again on every restart:

```C++
server_state::sync_apps_from_remote_storage()
{

...

                    std::shared_ptr<app_state> app = app_state::create(info);
                    {
                        zauto_write_lock l(_lock);
                        _all_apps.emplace(app->app_id, app);
                        if (app->status == app_status::AS_AVAILABLE) {
                            app->status = app_status::AS_CREATING;
                            _exist_apps.emplace(app->app_name, app);
                            _table_metric_entities.create_entity(app->app_id, app->partition_count);
                        } else if (app->status == app_status::AS_DROPPED) {
                            app->status = app_status::AS_DROPPING;
                        } else {
                            CHECK(false,
                                  "invalid status({}) for app({}) in remote storage",
                                  enum_to_string(app->status),
                                  app->get_logname());
                        }
                    }

...

}
```

To fix the 1st problem, we check whether the table still exists after the metadata
has been updated on the remote storage. To fix the 2nd problem, we prevent metadata
with the intermediate status `AS_DROPPING` from being flushed to remote storage.
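
For reference, a minimal sketch of the second idea, assuming the guard sits in
`server_state::set_app_envs()` before the remote write is issued (the placement and
the error code are assumptions for illustration, not a quote of the merged patch):

```C++
// Illustrative sketch (assumed placement and error code, not the merged patch):
// refuse to flush app envs for a table that is not in a stable state, so that the
// transient AS_DROPPING status is never serialized to remote storage as a side
// effect of set_app_envs().
zauto_read_lock l(_lock);
std::shared_ptr<app_state> app = get_app(app_name);
if (app == nullptr || app->status != app_status::AS_AVAILABLE) {
    env_rpc.response().err = ERR_APP_NOT_EXIST;
    return;
}
// ... build ainfo from the current app state and call do_update_app_info() as before ...
```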
empiredan added a commit that referenced this issue Dec 17, 2024
…r table was dropped (#2170)

#2149.

Previously, in #2148, we fixed the problem that the meta server failed due to a null
pointer while setting environment variables locally immediately after a table was
dropped. The same problem exists while deleting environment variables.