Assertion failure routing_filter_prefetch() -> cache_get(), reporting invalid input page addr. #545

Closed
gapisback opened this issue Mar 2, 2023 · 3 comments
Labels: bug (Something isn't working)

@gapisback (Collaborator)

In /main (SHA 4b6e0b1), while trying to re-run and validate the repro created for issue #458 (branch agurajada/458-mini_destroy_unused-Assert; new test case large_inserts_stress_test.c), we run into the following assertion when using the debug binary:

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584)
    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737349731584, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7d10476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cf67f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055555555b495 in platform_assert_false (filename=0x7ffff7f9e586 "src/clockcache.c",
    linenumber=2092, functionname=0x7ffff7f9f960 <__FUNCTION__.17> "clockcache_get_internal",
    expr=0x7ffff7f9ed83 "((addr % page_size) == 0)",
    message=0x7ffff7f9ed6a "addr=%lu, page_size=%lu\n") at src/platform_linux/platform.c:377
#6  0x00007ffff7f599e9 in clockcache_get_internal (cc=0x5555555622c0, addr=1034551297, blocking=1,
    type=PAGE_TYPE_FILTER, page=0x7fffffffd460) at src/clockcache.c:2092
#7  0x00007ffff7f59f82 in clockcache_get (cc=0x5555555622c0, addr=1034551297, blocking=1,
    type=PAGE_TYPE_FILTER) at src/clockcache.c:2271
#8  0x00007ffff7f562f8 in clockcache_get_virtual (c=0x5555555622c0, addr=1034551297, blocking=1,
    type=PAGE_TYPE_FILTER) at src/clockcache.c:295
#9  0x00007ffff7f68c73 in cache_get (cc=0x5555555622c0, addr=1034551297, blocking=1,
    type=PAGE_TYPE_FILTER) at src/cache.h:283
#10 0x00007ffff7f6b1f3 in routing_filter_prefetch (cc=0x5555555622c0, cfg=0x7fffe6feb0d8,
    filter=0x7fffffffd80c, num_indices=1024) at src/routing_filter.c:646
#11 0x00007ffff7f6b6b3 in routing_filter_estimate_unique_fp (cc=0x5555555622c0, cfg=0x7fffe6feb0d8,
    hid=0x0, filter=0x7fffffffd7f0, num_filters=3) at src/routing_filter.c:715
#12 0x00007ffff7f865a4 in trunk_pivot_estimate_unique_keys (spl=0x7fffe6feb040, node=0x7fffffffe000,
    pdata=0x7ffff5d524ae) at src/trunk.c:5183
#13 0x00007ffff7f86962 in trunk_split_leaf (spl=0x7fffe6feb040, parent=0x7fffffffe140,
    leaf=0x7fffffffe000, child_idx=0) at src/trunk.c:5281
#14 0x00007ffff7f828ca in trunk_flush (spl=0x7fffe6feb040, parent=0x7fffffffe140,
    pdata=0x7fffe70fd4ae, is_space_rec=0) at src/trunk.c:4235
#15 0x00007ffff7f82f0c in trunk_flush_fullest (spl=0x7fffe6feb040, node=0x7fffffffe140)
    at src/trunk.c:4324
#16 0x00007ffff7f7efa4 in trunk_memtable_incorporate (spl=0x7fffe6feb040, generation=12, tid=0)
    at src/trunk.c:3380
#17 0x00007ffff7f7f276 in trunk_memtable_flush_internal (spl=0x7fffe6feb040, generation=12)
    at src/trunk.c:3432
#18 0x00007ffff7f7f2cd in trunk_memtable_flush_internal_virtual (arg=0x7fffe703c2f0,
    scratch=0x7ffff7adb040) at src/trunk.c:3443
#19 0x00007ffff7f727ac in task_group_run_task (group=0x55555556ba80, assigned_task=0x555555d3c580)
    at src/task.c:481

The extended assertion reports:

Running 1 CTests, suite name 'large_inserts_bugs_stress', test case 'all'.
TEST 1/1 large_inserts_bugs_stress:test_issue_458_mini_destroy_unused_debug_assert OS-pid=4159620, OS-tid=4159620, Thread-ID=0, Assertion failed at src/clockcache.c:2092:clockcache_get_internal(): "((addr % page_size) == 0)". addr=1034551297, page_size=4096

The input page-address addr=1034551297 is clearly wrong.
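
A quick sanity check on the faulting value: 1034551297 % 4096 == 1, so the address is exactly 1 byte past the page boundary at 1034551296. That is consistent with the diagnosis below: a small loop counter added directly to a page-aligned base address, instead of a page-sized byte offset.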

The relevant code in /main is this:

 630 void
 631 routing_filter_prefetch(cache          *cc,
 632                         routing_config *cfg,
 633                         routing_filter *filter,
 634                         uint64          num_indices)
 635 {
 636    uint64 last_extent_addr = 0;
 637    uint64 addrs_per_page =
 638       cache_config_page_size(cfg->cache_cfg) / sizeof(uint64);
 639    uint64 num_index_pages = (num_indices - 1) / addrs_per_page + 1;
 640    uint64 index_no        = 0;
 641
 642    for (uint64 index_page_no = 0; index_page_no < num_index_pages;
 643         index_page_no++) {
 644       uint64       index_addr = filter->addr + index_page_no;

On L644, filter->addr + index_page_no seems incorrect: index_page_no counts pages, so it should be multiplied by the page size before being added to the byte address.
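
A minimal sketch of the suggested change, reusing the cache_config_page_size() call already present in this function (an untested sketch of the idea, not the integrated fix; the loop body is elided):

 uint64 page_size = cache_config_page_size(cfg->cache_cfg);

 for (uint64 index_page_no = 0; index_page_no < num_index_pages;
      index_page_no++) {
    // filter->addr is a byte address, so scale the page number by the
    // page size to get a page-aligned byte offset.
    uint64 index_addr = filter->addr + (index_page_no * page_size);
    ...
 }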

gapisback self-assigned this and added the bug label on Mar 2, 2023.

@gapisback (Collaborator, Author)

NOTE: With an attempted fix for the reported issue in routing_filter_prefetch(), re-running the new test case hits this new BTree-pack failure:

(gdb) run
Starting program: /home/agurajada/Code/splinterdb/build/debug/bin/unit/large_inserts_stress_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running 1 CTests, suite name 'large_inserts_stress', test case 'all'.
TEST 1/1 large_inserts_stress:test_issue_458_mini_destroy_unused_debug_assert btree_pack exceeded output size limit: req->num_tuples=6291456, req->max_tuples=6291456
btree_pack exceeded output size limit: req->num_tuples=6291456, req->max_tuples=6291456

This error is triggered by this code in btree_pack():

2985    while (SUCCESS(iterator_at_end(req->itor, &at_end)) && !at_end) {
2986       iterator_get_curr(req->itor, &tuple_key, &data);
2987       if (!btree_pack_can_fit_tuple(req, tuple_key, data)) {
2988          platform_error_log("%s(): req->num_tuples=%lu exceeded output size "
2989                             "limit: req->max_tuples=%lu\n",
2990                             __func__,
2991                             req->num_tuples,
2992                             req->max_tuples);

Curiously, the checker function btree_pack_can_fit_tuple() does this: return req->num_tuples < req->max_tuples;

Seems like an off-by-1 error in the condition check.
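
For reference, the boundary behavior of the quoted check, and a hedged sketch of the suspected off-by-1 fix (whether max_tuples is a hard cap or an estimate with one tuple of slack would need to be confirmed against btree_pack()'s sizing logic):

 // With max_tuples == 6291456, the current check behaves as:
 //   num_tuples == 6291455  ->  6291455 < 6291456  ->  tuple fits
 //   num_tuples == 6291456  ->  6291456 < 6291456  ->  pack fails
 // i.e. the pack admits exactly max_tuples tuples and rejects the next
 // one. If one tuple of slack is intended, the relaxed form would be:
 return req->num_tuples <= req->max_tuples;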

@gapisback (Collaborator, Author)

Attn: @ajhconway, @rtjohnso -- just flagging for your attention the long-winded 'test-stabilization' fix attempts I am meandering through in the background. (We should talk in person for me to make progress on this item.)


After fixing the original issue ("((addr % page_size) == 0)") and with an attempted fix for what seems to be an off-by-1 error in the btree_pack_can_fit_tuple() check [see previous comment], re-running this single test case hits this error from btree_pack():

TEST 1/1 large_inserts_stress:test_issue_458_mini_destroy_unused_debug_assert btree_pack(): req->num_tuples=6291457 exceeded output size limit: req->max_tuples=6291456
btree_pack(): req->num_tuples=6291457 exceeded output size limit: req->max_tuples=6291456

Debugging this, I confirmed that we only run into it through this code-flow:

Breakpoint 2, btree_pack (req=0x7fffffffbc90) at src/btree.c:2988
2988	         platform_error_log("%s(): req->num_tuples=%lu exceeded output size "
(gdb) bt
#0  btree_pack (req=0x7fffffffbc90) at src/btree.c:2988
#1  0x00007ffff7f8479d in trunk_compact_bundle (arg=0x555557531680, scratch_buf=0x7ffff7adb040) at src/trunk.c:4797
#2  0x00007ffff7f727f7 in task_group_run_task (group=0x55555556cd40, assigned_task=0x555555d3c600) at src/task.c:481
#3  0x00007ffff7f730eb in task_group_perform_one (group=0x55555556cd40, queue_scale_percent=0) at src/task.c:712
#4  0x00007ffff7f73172 in task_perform_one_if_needed (ts=0x55555556a580, queue_scale_percent=0) at src/task.c:731
#5  0x00007ffff7f8a384 in trunk_insert (spl=0x7fffe6feb040, tuple_key=..., data=...) at src/trunk.c:6177
#6  0x00007ffff7f70f9c in splinterdb_insert_message (kvs=0x555555561c80, user_key=..., msg=...)
    at src/splinterdb.c:470
#7  0x00007ffff7f71044 in splinterdb_insert (kvsb=0x555555561c80, user_key=..., value=...) at src/splinterdb.c:478
#8  0x000055555555991d in ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_run (
    data=0x555555560240 <ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_data>)
    at tests/unit/large_inserts_stress_test.c:81
#9  0x0000555555557f8f in ctest_main (argc=1, argv=0x7fffffffe758) at tests/unit/main.c:320
#10 0x0000555555557a34 in main (argc=1, argv=0x7fffffffe758) at tests/unit/main.c:159

And it's a valid failure condition:

(gdb) p req->num_tuples
$9 = 6291457
(gdb) p req->max_tuples
$10 = 6291456

There is already code to handle this failure and clean up; i.e.:

4617 void
4618 trunk_compact_bundle(void *arg, void *scratch_buf)
[...]

4797    platform_status pack_status = btree_pack(&pack_req);
4798    if (!SUCCESS(pack_status)) {
4799       platform_default_log("btree_pack failed: %s\n",
4800                            platform_status_to_string(pack_status));
4801       trunk_compact_bundle_cleanup_iterators(
4802          spl, &merge_itor, num_branches, skip_itor_arr);
4803       btree_pack_req_deinit(&pack_req, spl->heap_id);
4804       platform_free(spl->heap_id, req);
4805       goto out;
4806    }

From L4799 onwards we seem to do a clean back-out, so no stale BTree-pack state should be left behind.

This happens twice in this test case's run. Then we run into this hard assertion:

OS-pid=11463, OS-tid=11463, Thread-ID=0, Assertion failed at src/trunk.c:3543:trunk_inc_filter(): "filter->addr != 0".

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737349731584, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7d10476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cf67f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055555555b495 in platform_assert_false (filename=0x7ffff7fa477b "src/trunk.c", linenumber=3543,
    functionname=0x7ffff7fa8e90 <__FUNCTION__.66> "trunk_inc_filter", expr=0x7ffff7fa4d95 "filter->addr != 0",
    message=0x7ffff7fa3fca "") at src/platform_linux/platform.c:377
#6  0x00007ffff7f7f594 in trunk_inc_filter (spl=0x7fffe6feb040, filter=0x7ffff43622d2) at src/trunk.c:3543
#7  0x00007ffff7f82042 in trunk_flush_into_bundle (spl=0x7fffe6feb040, parent=0x7fffffffbc30, child=0x7fffffffbaa0,
    pdata=0x7ffff419458c, req=0x555557531a40) at src/trunk.c:4102
#8  0x00007ffff7f82786 in trunk_flush (spl=0x7fffe6feb040, parent=0x7fffffffbc30, pdata=0x7ffff419458c,
    is_space_rec=0) at src/trunk.c:4214
#9  0x00007ffff7f82d2d in trunk_flush_fullest (spl=0x7fffe6feb040, node=0x7fffffffbc30) at src/trunk.c:4295
#10 0x00007ffff7f83cf8 in trunk_compact_bundle (arg=0x555557531680, scratch_buf=0x7ffff7adb040) at src/trunk.c:4642
#11 0x00007ffff7f727f7 in task_group_run_task (group=0x55555556cd40, assigned_task=0x555555d3c600) at src/task.c:481
#12 0x00007ffff7f730eb in task_group_perform_one (group=0x55555556cd40, queue_scale_percent=0) at src/task.c:712
#13 0x00007ffff7f73172 in task_perform_one_if_needed (ts=0x55555556a580, queue_scale_percent=0) at src/task.c:731
#14 0x00007ffff7f8a384 in trunk_insert (spl=0x7fffe6feb040, tuple_key=..., data=...) at src/trunk.c:6177
#15 0x00007ffff7f70f9c in splinterdb_insert_message (kvs=0x555555561c80, user_key=..., msg=...)
    at src/splinterdb.c:470
#16 0x00007ffff7f71044 in splinterdb_insert (kvsb=0x555555561c80, user_key=..., value=...) at src/splinterdb.c:478
#17 0x000055555555991d in ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_run (
    data=0x555555560240 <ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_data>)
    at tests/unit/large_inserts_stress_test.c:81

The error: trunk_inc_filter(): "filter->addr != 0" is very weird.

This points to the following code block:

4000 trunk_bundle *
4001 trunk_flush_into_bundle(trunk_handle             *spl,    // IN
[...]

4095          child_sb->end_branch = trunk_end_branch(spl, child);
4096          for (uint16 i = 0; i < filter_count; i++) {
4097             routing_filter *child_filter =
4098                trunk_subbundle_filter(spl, child, child_sb, i);
4099             routing_filter *parent_filter =
4100                trunk_subbundle_filter(spl, parent, parent_sb, i);
4101             *child_filter = *parent_filter;
4102             trunk_inc_filter(spl, child_filter);
4103          }

On L4102, we are accessing a child_filter that has not been set up properly.

Under the debugger we find:

(gdb) p child_filter
$12 = (routing_filter *) 0x7ffff43622d2
(gdb) p *child_filter
$13 = {addr = 0, meta_head = 0, num_fingerprints = 0, num_unique = 0, value_size = 0}

It is unclear why this is an uninitialized filter handle, or whether we are getting hold of the wrong data structure in the code above.

Curiously, parent_filter also appears 'bogus':

(gdb) p parent_filter
$14 = (routing_filter *) 0x7ffff419437a
(gdb) p *parent_filter
$15 = {addr = 0, meta_head = 0, num_fingerprints = 0, num_unique = 0, value_size = 0}

This will need further debugging.
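
One possible next step is a hypothetical debugging aid (assuming the debug_assert() macro used elsewhere in the tree): assert on the parent filter before the copy, so the zero addr is caught at its source rather than downstream in trunk_inc_filter():

 for (uint16 i = 0; i < filter_count; i++) {
    routing_filter *child_filter =
       trunk_subbundle_filter(spl, child, child_sb, i);
    routing_filter *parent_filter =
       trunk_subbundle_filter(spl, parent, parent_sb, i);
    // Hypothetical debugging aid: catch an uninitialized parent filter
    // before it is copied into the child.
    debug_assert(parent_filter->addr != 0);
    *child_filter = *parent_filter;
    trunk_inc_filter(spl, child_filter);
 }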

@gapisback (Collaborator, Author)

The original bug reported in this issue (in routing_filter_prefetch()) has been fixed and integrated.

The sub-issues reported in this thread have been peeled off into their own standalone issue #570.
