Meta server with authentication enabled failed and could not be started normally after dropping table #2149

Open
empiredan opened this issue Nov 21, 2024 · 1 comment
Labels
type/bug This issue reports a bug.

Comments


empiredan commented Nov 21, 2024

There was a Pegasus cluster with 3 meta servers and 5 replica servers, with authentication enabled. A script was written to drop a large number of tables.

While the script was running, the meta server crashed with nothing in its logs except `got signal id: 11`, plus the following dmesg output:

[Tue Nov 12 15:32:39 2024]  meta.meta_stat[681978]: segfault at 40 ip 00007faa351ea839 sp 00007faa0d48abc0 error 4 in libdsn_utils.so[7faa35124000+115000]
[Tue Nov 12 15:32:39 2024] Code: 23 f9 ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 48 8d 45 cf 53 4c 8d 67 08 48 83 ec 28 <4c> 8b 7f 18 48 89 45 b8 48 8d 45 ce 4d 39 e7 48 89 45 b0 74 6c 48

In the logs of the failed meta server (namely the primary meta server), many errors like the following were also found:

E2024-11-12 15:32:45.711 (1731396765711870108 a67f5)   meta.meta_server4.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:45.713 (1731396765713890084 a67f5)   meta.meta_server4.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:32:52.225 (1731396772225348205 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:52.226 (1731396772226529887 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:32:58.919 (1731396778919427343 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:32:58.921 (1731396778921276545 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:06.374 (1731396786374687523 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:06.376 (1731396786376019669 a67f6)   meta.meta_server5.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:14.775 (1731396794775332362 a67f2)   meta.meta_server1.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:14.777 (1731396794777299007 a67f2)   meta.meta_server1.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:30.679 (1731396810679840313 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:30.681 (1731396810681638580 a67f3)   meta.meta_server2.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:37.501 (1731396817501816052 a67f7)   meta.meta_server6.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:37.503 (1731396817503027320 a67f7)   meta.meta_server6.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.
E2024-11-12 15:33:44.338 (1731396824338693868 a67f4)   meta.meta_server3.01010000000009fa: ranger_resource_policy_manager.cpp:641:sync_policies_to_app_envs(): ERR_INVALID_PARAMETERS: set_app_envs failed.
E2024-11-12 15:33:44.339 (1731396824339976731 a67f4)   meta.meta_server3.01010000000009fa: ranger_resource_policy_manager.cpp:304:update_policies_from_ranger_service(): ERR_INVALID_PARAMETERS: Sync policies to app envs failed.

After that, the other standby meta servers also failed while they tried to take over. See the following logs:

E2024-11-12 15:34:33.624 (1731396873624300621 19c265)   meta.meta_server0.010200030000042e: server_state.cpp:689:operator()(): assertion expression: false
F2024-11-12 15:34:33.624 (1731396873624310529 19c265)   meta.meta_server0.010200030000042e: server_state.cpp:689:operator()(): invalid status(app_status::AS_DROPPING) for app(abc(1)) in remote storage

AS_DROPPING was found persisted on the remote meta storage (namely ZooKeeper) as the status of the table:

{"status":"app_status::AS_DROPPING","app_type":"pegasus","app_name":"abc","app_id":1,"partition_count":8, ...}

However, AS_DROPPING is just an intermediate state, which should never appear on ZooKeeper.

From then on, none of the meta servers could be started normally: they exited immediately after startup.


empiredan commented Nov 21, 2024

There are two problems that should be solved:

  1. Why did the primary meta server fail with a segfault while tables were being dropped?
  2. Why could none of the meta servers be restarted normally after the primary meta server failed?

To explain the reasons for both problems more clearly, I'll first describe some of the mechanisms for updating metadata. A Pegasus cluster flushes security policies to the remote meta storage periodically (controlled by `update_ranger_policy_interval_sec`) in the form of environment variables, via `server_state::set_app_envs()`. However, after the metadata has been updated on the remote storage (namely ZooKeeper), there is no check that the table still exists before its environment variables are updated in local memory. See the following code:

void server_state::set_app_envs(const app_env_rpc &env_rpc)
{

...

    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, since `app` is `nullptr` (the table has just been dropped), `app->envs` points to an invalid address, leading to the segfault in `libdsn_utils.so`, which is where `dsn::utils::kv_map_to_string` resides.

Therefore, the reason for the 1st problem is clear: the callback for updating metadata on remote storage runs just after the table has been removed, and an invalid address is accessed through the null pointer.
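
For illustration, here is a minimal sketch of the kind of guard that avoids the dereference, reusing the names from the snippet above (this is a sketch of the idea, not necessarily the exact patch that was merged):

```C++
// Illustrative sketch: inside the callback of do_update_app_info(), re-check the
// table under the write lock before touching its in-memory envs, since the table
// may have been dropped while the remote update was in flight.
zauto_write_lock l(_lock);
std::shared_ptr<app_state> app = get_app(app_name);
if (app == nullptr) {
    LOG_WARNING("app({}) has been dropped, skip updating its envs in memory", app_name);
    return;
}
std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
// ... continue updating app->envs as before ...
```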

As for the 2nd problem: a meta server loads metadata from remote storage after it restarts. The intermediate status AS_DROPPING had been flushed to remote storage together with the security policies, because all metadata for a table is a single JSON object: the whole JSON is written to remote storage once any property is updated. AS_DROPPING is not a valid persisted status, so it cannot pass the assertion below, which makes the meta server fail again and again on every restart, which is the reason for the 2nd problem. See the following code:

server_state::sync_apps_from_remote_storage()
{

...

                    std::shared_ptr<app_state> app = app_state::create(info);
                    {
                        zauto_write_lock l(_lock);
                        _all_apps.emplace(app->app_id, app);
                        if (app->status == app_status::AS_AVAILABLE) {
                            app->status = app_status::AS_CREATING;
                            _exist_apps.emplace(app->app_name, app);
                            _table_metric_entities.create_entity(app->app_id, app->partition_count);
                        } else if (app->status == app_status::AS_DROPPED) {
                            app->status = app_status::AS_DROPPING;
                        } else {
                            CHECK(false,
                                  "invalid status({}) for app({}) in remote storage",
                                  enum_to_string(app->status),
                                  app->get_logname());
                        }
                    }

...

}
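
As a side note, the loading path itself could conceivably be made tolerant of a leaked `AS_DROPPING` status, e.g. by treating it like `AS_DROPPED` and letting the drop procedure resume. This is purely illustrative and is not the approach taken by the commits below, which instead keep `AS_DROPPING` out of remote storage in the first place:

```C++
// Purely illustrative alternative (not the shipped fix): tolerate a leaked
// AS_DROPPING status in sync_apps_from_remote_storage() instead of asserting.
if (app->status == app_status::AS_DROPPED || app->status == app_status::AS_DROPPING) {
    // The table was being dropped either way; resume the dropping procedure.
    app->status = app_status::AS_DROPPING;
} else {
    CHECK(false,
          "invalid status({}) for app({}) in remote storage",
          enum_to_string(app->status),
          app->get_logname());
}
```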

empiredan added a commit that referenced this issue Nov 25, 2024
… setting environment variables after dropping table (#2148)

#2149.

There are two problems that should be solved:

1. Why did the primary meta server fail with a segfault while tables were being dropped?
2. Why could none of the meta servers be restarted normally after the primary
meta server failed?

A Pegasus cluster flushes security policies to the remote meta storage
periodically (controlled by `update_ranger_policy_interval_sec`) in the form of
environment variables, via `server_state::set_app_envs()`. However, after the
metadata has been updated on the remote storage (namely ZooKeeper), there is no
check that the table still exists before its environment variables are updated
in local memory:

```C++
void server_state::set_app_envs(const app_env_rpc &env_rpc)
{

...

    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}
```

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, since
`app` is `nullptr` (the table has just been dropped), `app->envs` points to an invalid
address, leading to the segfault in `libdsn_utils.so`, which is where
`dsn::utils::kv_map_to_string` resides.

Therefore, the reason for the 1st problem is clear: the callback for updating
metadata on remote storage runs just after the table has been removed, and an
invalid address is accessed through the null pointer.

As for the 2nd problem: a meta server loads metadata from remote storage after it
restarts. The intermediate status `AS_DROPPING` had been flushed to remote storage
together with the security policies, because all metadata for a table is a single
`json` object: the whole `json` is written to remote storage once any property is
updated.

`AS_DROPPING` is not a valid persisted status, so it cannot pass the assertion
below, which makes the meta server fail again and again on every restart:

```C++
server_state::sync_apps_from_remote_storage()
{

...

                    std::shared_ptr<app_state> app = app_state::create(info);
                    {
                        zauto_write_lock l(_lock);
                        _all_apps.emplace(app->app_id, app);
                        if (app->status == app_status::AS_AVAILABLE) {
                            app->status = app_status::AS_CREATING;
                            _exist_apps.emplace(app->app_name, app);
                            _table_metric_entities.create_entity(app->app_id, app->partition_count);
                        } else if (app->status == app_status::AS_DROPPED) {
                            app->status = app_status::AS_DROPPING;
                        } else {
                            CHECK(false,
                                  "invalid status({}) for app({}) in remote storage",
                                  enum_to_string(app->status),
                                  app->get_logname());
                        }
                    }

...

}
```

To fix the 1st problem, we check whether the table still exists after the metadata
has been updated on the remote storage. To fix the 2nd problem, we prevent metadata
with the intermediate status `AS_DROPPING` from being flushed to remote storage.
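
For reference, a minimal sketch of the second idea, assuming the guard sits in
`server_state::set_app_envs()` before the remote write is issued (the placement and
the error code are assumptions for illustration, not a quote of the merged patch):

```C++
// Illustrative sketch (assumed placement and error code, not the merged patch):
// refuse to flush app envs for a table that is not in a stable state, so that the
// transient AS_DROPPING status is never serialized to remote storage as a side
// effect of set_app_envs().
zauto_read_lock l(_lock);
std::shared_ptr<app_state> app = get_app(app_name);
if (app == nullptr || app->status != app_status::AS_AVAILABLE) {
    env_rpc.response().err = ERR_APP_NOT_EXIST;
    return;
}
// ... build ainfo from the current app state and call do_update_app_info() as before ...
```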
empiredan added a commit that referenced this issue Dec 17, 2024
…r table was dropped (#2170)

#2149.

Previously, in #2148, we fixed the problem that the meta server failed due to a null
pointer while setting environment variables locally immediately after a table was
dropped. The same problem exists while deleting environment variables.