Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add action to run against asan #212

Merged
merged 29 commits into from
Oct 25, 2023

Conversation

ezra-varady
Copy link
Collaborator

@ezra-varady ezra-varady commented Oct 22, 2023

This adds a github action to run against ASAN when someone pushes to main. There's likely a more appropriate trigger, though considering it's fairly time consuming it may not be worth running on every PR. It should be easy to include UBSan too, however there is UB in postgres' index scan function and I was unable to make ubsan suppress this so for the time being it's disabled. There's a docker container as well that can be invoked using the associated script to test locally. Running scripts/sanitizers/run_sanitizers.sh in the project root will create a directory called sanitizer that contains the asan logs. This will be owned by uid 999 so you'll have to chown it. This will likely need some revision, but I wanted to get some eyes on this and test out the github workflow. @var77 if you have any feedback let me know, I recall you had done a lot of work on the other actions

ASan detects a use after free in getDataBlockNumber in a couple of tests. I haven't figured out the source. If anyone else is interested in this let me know and I can send along logs and details for recreating it. An example is shown below. It can also trigger in the immutable retriever from gettuple, but I think the insert case might be more telling since the entirety of aminsert nominally runs in a single memory context that is deleted at the end of the access method.

==postgres==420==ERROR: AddressSanitizer: use-after-poison on address 0x7fa40dea7b08 at pc 0x7fa415c8cfdb bp 0x7ffe1c99dab0 sp 0x7ffe1c99daa8
READ of size 4 at 0x7fa40dea7b08 thread T0
    #0 0x7fa415c8cfda in getDataBlockNumber /lantern/src/hnsw/external_index.c:549:16
    #1 0x7fa415c8d890 in ldb_wal_index_node_retriever_mut /lantern/src/hnsw/external_index.c:660:37
    ... absolutely disgusting C++ call stack
    #14 0x7fa415c9586b in usearch_add_external /lantern/third_party/usearch/c/lib.cpp:260:27
    #15 0x7fa415c8f4c6 in ldb_aminsert /lantern/src/hnsw/insert.c:169:5

It's not clear exactly what conditions cause this. It reads as if the block cache needs to be cleared at some point, but it's unclear why/where the memory is being released. One test fails scanning an index created from a file and the other fails inserting rows into an empty index. Unfortunately postgres' memory management system necessitates manually poisoning memory which means ASan doesn't log information about where it was freed. The order of operations is probably to set a watchpoint in the shadow memory and use this to determine when memory is poisoned. I made an attempt at this but was left somewhat confused

@ezra-varady
Copy link
Collaborator Author

not sure why this is failing looking for /tmp/lantern will look tomorrow

@var77
Copy link
Collaborator

var77 commented Oct 23, 2023

Thank you @ezra-varady this looks great! The use after free issues may be legit, saw your fix. I had one issue which caused segfault will try to reproduce it with this fix to see if it is resolved or not. The workflow is also okay, caching the postgres build is smart, I think if this workflow will take less than 10 mins we may also run it on PRs to detect possible issues before merging to main.

@ezra-varady
Copy link
Collaborator Author

ezra-varady commented Oct 23, 2023

I think things are working, if anyone knows of a more accurate way to test locally than act please let me know. It takes about 5-7m to run for an individual version, the bulk of the time is spent in building postgres. To mitigate this after the build succeeds I cache it regardless of whether tests pass. I added support for all current versions. I set it to run weekly and I also extended the cache lifetime to a month. UBSan currently doesn't run because I couldn't get it to suppress UB in postgres so I'm gonna circle back to that for a bit

UPDATE: this use after poison was legitimate. we weren't flushing the cache in ldb_retriever_area_reset despite freeing the underlying pinned buffers. score 1 asan!

@Ngalstyan4
Copy link
Contributor

There are important fixes in this PR.
Let's get those merged ASAP.
@var77 , could you review this?

@@ -571,7 +576,8 @@ BlockNumber getDataBlockNumber(RetrieverCtx *ctx, int id, bool add_to_extra_dirt

offset = id % HNSW_BLOCKMAP_BLOCKS_PER_PAGE;
blockno = blockmap_page->blocknos[ offset ];
cache_set_item(cache, &id, &blockmap_page->blocknos[ offset ]);
size_t cache_value = (size_t)blockno;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great catch!!

Copy link
Collaborator

@var77 var77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good, great fixes!

Are we running the sanitizers via cron to see if nothing breaks in postgres? Or is there anything else I missed?

# Suppressions for Clang Sanitizers #

This folder contains [supression files](https://clang.llvm.org/docs/SanitizerSpecialCaseList.html) for
running timescale using Clang's [AddressSanitizer](https://clang.llvm.org/docs/AddressSanitizer.html)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update this README file?

@ezra-varady
Copy link
Collaborator Author

PR should be ready to merge now. The issue had to do with the way the checkout action works. When it's used in a PR it brings the repo to a ref representing the merge of the PR branch and the target branch, not the PR branch HEAD. In main the ef_search GUC variable was being initialized to 0 but had a minimum value of 1 which violated an assertion in postgres. I rebased and lowered the minimum value to zero and tests are again passing. Since we test against prebuilt postgres these assertions aren't normally enforced

@Ngalstyan4 Ngalstyan4 merged commit 05336cf into lanterndata:main Oct 25, 2023
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants