# Meta server with authentication enabled failed and could not be started normally after dropping table #2149
… setting environment variables after dropping table (#2148) #2149

There are two problems that should be solved:

1. Why did the primary meta server fail with `segfault` while dropping tables?
2. Why could the meta servers never be restarted normally after the primary meta server failed?

A Pegasus cluster flushes security policies to remote meta storage periodically (at an interval of `update_ranger_policy_interval_sec`) in the form of environment variables. This is done by `server_state::set_app_envs()`. However, after the metadata has been updated on the remote storage (namely ZooKeeper), there is no check that the table still exists before the environment variables in local memory are updated:

```C++
void server_state::set_app_envs(const app_env_rpc &env_rpc)
{
    ...
    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}
```

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, since `app` is `nullptr` once the table has been dropped, `app->envs` points to an invalid address, leading to a `segfault` in `libdsn_utils.so`, where `dsn::utils::kv_map_to_string` resides. Therefore, the reason for the 1st problem is clear: the callback for updating metadata on remote storage was invoked immediately after the table was removed, and an invalid address was accessed through the null pointer. Then, after a restart, the meta server loads metadata from remote storage.
However, the intermediate status `AS_DROPPING` had also been flushed to remote storage along with the security policies, since all metadata for a table is a unitary `json` object: the whole `json` is written to remote storage once any property is updated. `AS_DROPPING` is not a valid status on remote storage, so it cannot pass the following assertion, which makes the meta server fail again and again. This is the reason for the 2nd problem:

```C++
server_state::sync_apps_from_remote_storage()
{
    ...
    std::shared_ptr<app_state> app = app_state::create(info);
    {
        zauto_write_lock l(_lock);
        _all_apps.emplace(app->app_id, app);
        if (app->status == app_status::AS_AVAILABLE) {
            app->status = app_status::AS_CREATING;
            _exist_apps.emplace(app->app_name, app);
            _table_metric_entities.create_entity(app->app_id, app->partition_count);
        } else if (app->status == app_status::AS_DROPPED) {
            app->status = app_status::AS_DROPPING;
        } else {
            CHECK(false,
                  "invalid status({}) for app({}) in remote storage",
                  enum_to_string(app->status),
                  app->get_logname());
        }
    }
    ...
}
```

To fix the 1st problem, we just check whether the table still exists after the metadata has been updated on the remote storage. To fix the 2nd problem, we prevent metadata with the intermediate status `AS_DROPPING` from being flushed to remote storage.
There was a Pegasus cluster with 3 meta servers and 5 replica servers, and authentication was enabled. A script was written to drop a great number of tables.
While the script was being executed, the meta server failed with nothing but

```
got signal id: 11
```

and the following `dmesg` output:

In the logs of the failed meta server (namely the primary meta server), lots of errors were also found:
After that, the other standby meta servers also failed while trying to take over. See the following logs:
`AS_DROPPING` was found persisted on the remote meta storage (namely ZooKeeper) as the status of the table. However, this status is just an intermediate state, which should never appear on ZooKeeper. After that, none of the meta servers could be started normally: they exited immediately after being started.