Merge release/2.6 into google/2.6 #15725

Merged
merged 47 commits into google/2.6 from mjmac/google/2.6
Jan 14, 2025
mjmac
Contributor

@mjmac mjmac commented Jan 13, 2025

phender and others added 30 commits November 6, 2024 20:06
Tag second test build for 2.6.2.

faults-enabled: false

Signed-off-by: Phil Henderson <[email protected]>
2.6.2 release notes document

Signed-off-by: Phil Henderson <[email protected]>
- Enable write access to the Security section of Github project

- Use the GHA cache to avoid Trivy scan failures: overuse of the CVE database leads to database download failures.
Upgrade `trivy-action` to version 0.28.0, where the caching mechanism is enabled by default.
Enable the debug option in Trivy to prepare for detailed analysis of scan failures.

Signed-off-by: Tomasz Gromadzki <[email protected]>
Test deployment/ior_per_rank fails with 'No space' on some CI
clusters. Reduce the requested pool size to accommodate nodes
with smaller storage capacity.

Signed-off-by: James A. Nunez <[email protected]>
Split erasurecode/multiple_failure.py into two separate tests to
reduce the possibility that a large number of ERR messages in the server
log file causes other test variants to fail due to out-of-space
errors.

Signed-off-by: Phil Henderson <[email protected]>
#15411)

Loop retrying the check for the pool free space after destroying half of the containers. If the check doesn't pass within 60 seconds, then fail the test.
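
The retry described above can be sketched as a generic poll-until-deadline helper. This is a minimal illustration, not the code from the test: `wait_for_condition`, its parameters, and the 5-second polling interval are hypothetical; only the 60-second limit comes from the commit message.

```python
import time


def wait_for_condition(check, timeout=60.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns True or the timeout elapses.

    check is a zero-argument callable (hypothetical; in the test it would
    compare the pool free space against the expected value).
    Returns True if the check passed in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Out of time: let the caller fail the test.
            return False
        sleep(interval)
```

A caller would fail the test when the helper returns False, for example `if not wait_for_condition(free_space_ok): self.fail(...)`, where `free_space_ok` is again a hypothetical name.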

Signed-off-by: Phil Henderson <[email protected]>
… (#15540)

Support calling register cleanup methods for tests based upon the Test
and TestWithoutServers classes. Also remove stopping agents as part of
calling TestWithServers.stop_servers() since DAOS-6873 is no longer an
issue.

Signed-off-by: Phil Henderson <[email protected]>
Do not raise an exception if parsing empty json output.

Signed-off-by: Phil Henderson <[email protected]>
#15420) (#15457)

The object placement algorithm was changed by DAOS-16445. As a result,
data are written to targets more uniformly, while the amount of
leftover data after container destroy/garbage collection in each target
remains the same. That is, data are written to more targets, but the
cleanup method in each target has not been improved, which results in
more leftover data in aggregate.

To handle the larger amount of leftover data in SCM, increase the threshold
to 1.5MB.

Signed-off-by: Makito Kano <[email protected]>
In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that participated in the rebuild goes down again before the rebuild finishes;
   the number of failures then exceeds the pool RF, so the pool map does not change.
3. That engine is restarted by the administrator.

In that case the rebuild task should be recovered on the engine; to keep it simple, for now just
abort and retry the global rebuild task.
The typical recovery approach of restarting the whole system, including the PS leader,
does not have this issue.

Another backport commit:
947c76d DAOS-16175 container: fix a case for cont_iv_hdl_fetch (#15395)

Signed-off-by: Xuezhao Liu <[email protected]>
Fix stopping timed out processes run by a JobManager class by only
searching for and killing the command executable being run by clush,
orterun, mpirun, etc. Add a new harness/cmocka.py test to verify the
stopping of the processes with a test timeout.

Signed-off-by: Phil Henderson <[email protected]>
…) (#15595)

Update soak to support using an internal job scheduler.

Signed-off-by: Maureen Jean <[email protected]>
Co-authored-by: mjean308 <[email protected]>
Update flake8 to 7.1.1.
Adjust githook to work with newer flake8.
Also tested to be backwards compatible with flake8<6

Signed-off-by: Dalton Bohning <[email protected]>
Add a section on handling unavailable engines.

Signed-off-by: Li Wei <[email protected]>
clear the sc_ec_agg_active flag more proactively.

Signed-off-by: Xuezhao Liu <[email protected]>
- If failed to reply, skip rpc early buffer release

Signed-off-by: Alexander A Oganezov <[email protected]>
Use -r so if no scons or non-scons files are grep'ed, flake8 does not
run.

Signed-off-by: Dalton Bohning <[email protected]>
Add the use of reusable workflows and actions to reduce the amount of
duplicated code in this repository as well as dependency repositories.

Run Bullseye workflow on schedule (#15574)
Saturdays at midnight, UTC.

Accept and propagate a run-gha variable (#15576)
For the case where daos is being used as a downstream test.

Test inputs context before trying to use it.

Fixes: SRE-2570 DAOS-16262

Signed-off-by: Brian J. Murrell <[email protected]>
- Set Go minimum version to 1.21 in rpm and debian packaging spec files.
- Update scons Go version check to use version in go.mod.
- Add a reminder in go.mod file so we remember the packaging files when
  bumping the minimum Go version in the future.
- Update Ubuntu 22.04 Dockerfile to get an appropriate version of Go.
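
Reading the minimum Go version from go.mod, as the second bullet describes, might look roughly like the sketch below. It is an illustration under assumptions, not the actual SCons check: the function names are hypothetical, and it only handles a plain numeric `go` directive such as `go 1.21`.

```python
import re


def go_mod_min_version(go_mod_text):
    """Return the version from the 'go' directive in go.mod text, or None."""
    match = re.search(r"^go\s+(\d+(?:\.\d+)*)\s*$", go_mod_text, re.MULTILINE)
    return match.group(1) if match else None


def version_tuple(version):
    """Turn '1.21.5' into (1, 21, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def go_version_ok(installed, go_mod_text):
    """Check that the installed Go version meets the go.mod minimum."""
    minimum = go_mod_min_version(go_mod_text)
    if minimum is None:
        raise ValueError("no 'go' directive found in go.mod text")
    return version_tuple(installed) >= version_tuple(minimum)
```

Keeping go.mod as the single source for the check avoids a second hard-coded version in the build scripts; the packaging spec files still need a manual bump, which is what the reminder in go.mod is for.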

Signed-off-by: Kris Jacque <[email protected]>
…b26 (#15477)

For a collective RPC, when handling failure cases during crt_req_send(),
its reference may have already been released via crt_rpc_complete_and_unlock(),
which is triggered by crt_corpc_complete(). In that case, we should
check whether the RPC has completed before calling RPC_DECREF(),
to avoid releasing the RPC reference repeatedly.

The patch also initializes some local variables for the CHK RPC to avoid
accessing invalid DRAM when handling a failed collective CHK RPC.

Also includes some enhancements to the CR test logic.

Signed-off-by: Fan Yong <[email protected]>
Update netty-buffer to 4.1.115

Signed-off-by: Jeff Olivier <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
* Fix compiling issues in gcc 14

Signed-off-by: Jinshan Xiong <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]>
Co-authored-by: Jeff Olivier <[email protected]>
Update mantic (EOL) to oracular.
Update 22.04 LTS to 24.04 LTS.

Signed-off-by: Dalton Bohning <[email protected]>
…#15598)

Update some tests to use unique dfuse mount directory by letting the
framework generate one.

Remove mount_dir from run_ior_multiple_variants since it is no longer
needed and this level of fine control should be handled per test
ideally.

Signed-off-by: Dalton Bohning <[email protected]>
Remove workflows/version-checks.yml now that dependabot checks this.

Signed-off-by: Dalton Bohning <[email protected]>
)

* Add a suppression for Go runtime function racefuncenter.
* Add suppression for rt0_go CGo malloc

Signed-off-by: Kris Jacque <[email protected]>
At build time any more, as of e01970d.

Signed-off-by: Brian J. Murrell <[email protected]>
verify daos_server_helper on server instead of the runner/client.
misc cleanup

Signed-off-by: Dalton Bohning <[email protected]>
With tcp provider, using many sockets can cause significant
file descriptor usage.  Bump the soft limit, if possible
and warn if it appears insufficient.
Valgrind sets hard limit to soft limit, so work around that in NLT.
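
The idea of bumping the soft limit can be illustrated with Python's standard `resource` module (Unix only). This is a sketch of the general technique, not the DAOS change itself, which is presumably made in the library code rather than in Python; `raise_nofile_soft_limit` and its `wanted` parameter are hypothetical names.

```python
import resource


def raise_nofile_soft_limit(wanted):
    """Raise the soft RLIMIT_NOFILE toward `wanted`, capped at the hard limit.

    Returns (soft, sufficient): the resulting soft limit and whether it
    covers `wanted`. A caller can log a warning when sufficient is False.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit can never exceed the hard limit for an unprivileged process.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            soft = target
        except (ValueError, OSError):
            # Could not raise the limit; keep the current soft value.
            pass
    sufficient = soft == resource.RLIM_INFINITY or soft >= wanted
    return soft, sufficient
```

When the hard limit has been lowered to the soft limit, as the message says Valgrind does, `target` equals the current soft limit and nothing can be raised, which is why NLT needs a separate workaround.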

Signed-off-by: Jeff Olivier <[email protected]>
daltonbohning and others added 17 commits December 19, 2024 07:29
Add a requirement to protobufc for building daos control binaries.

Signed-off-by: Dalton Bohning <[email protected]>
…15642)

Merge yamllint and clang-format into the linting workflow so all lint checks
are grouped together.

Make yamllint required but clang-format optional until stable.

Signed-off-by: Dalton Bohning <[email protected]>
Test with Leap 15.6 instead of Leap 15.5.

To support building Leap 15.5 DAOS RPMs and testing them on Leap 15.6
the Functional on Leap 15.6 stage needs to explicitly specify the Leap
15.6 image for node provisioning.

Combination of #15561, #15684

Signed-off-by: Phil Henderson <[email protected]>
Signed-off-by: Dalton Bohning <[email protected]>
#15655)

Add GHA to check for copyright update.
Move core logic from update-copyright githook into
check_update_copyright.sh so the logic is shared between the githook and
GHA.

Remove required-githooks watermark since it did not work in all scenarios and the
GHA checks are more secure than client-side githooks.

Combination of #15552, #15596, #15636, #15639

Signed-off-by: Dalton Bohning <[email protected]>
Add a new HPE copyright line for modified files.
Update HPE copyright instead of Intel.

Signed-off-by: Dalton Bohning <[email protected]>
…DTX - b26 (#15658)

As long as the container is not destroyed, whenever a caller wants to deregister
a modification from the related active DTX entry (usually triggered by
vos discard or aggregation), it needs to pass the container handle to
vos_dtx_deregister_record() so that the DTX entry can be located in the active DTX table.
Otherwise, if the caller passes an empty container handle, a dangling
reference is left in the related DTX entry, which can lead to data corruption in a subsequent
DTX commit or abort.

On the other hand, if the container is going to be destroyed, all related DTX
entries for that container are no longer useful. We need to destroy the DTX
table first to avoid generating dangling DTX references while destroying
the container.

Signed-off-by: Fan Yong <[email protected]>
… (#15512)

After significant failures, the system may leave behind some suspect
engines that were marked as DEAD by the SWIM protocol, but were not
excluded from the system to prevent data loss. An administrator
can bring these ranks back online by restarting them.

This PR provides an administrative interface for querying
suspect engines following a massive failure. These suspect engines
can be retrieved using the --health-only option of the daos/dmg pool query command.

Example output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
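
For scripting against this output, a small parser could pull out the pool state and the disabled and suspect ranks. The sketch below is based only on the sample shown above; the real text format may differ or change, rank values are kept as raw strings because their exact format is not specified here, and `parse_pool_health` is a hypothetical helper rather than part of dmg.

```python
import re

# Sample text copied from the example output above.
SAMPLE = """\
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
"""


def parse_pool_health(text):
    """Extract state, disabled ranks and suspect ranks from the text output."""
    info = {}
    state = re.search(r"state=(\w+)", text)
    if state:
        info["state"] = state.group(1)
    for line in text.splitlines():
        match = re.match(r"-\s*(Disabled|Suspect) ranks:\s*(.+)", line.strip())
        if match:
            # Keep the rank list as a raw string; its format is an assumption.
            info[match.group(1).lower()] = match.group(2).strip()
    return info


print(parse_pool_health(SAMPLE))
# prints {'state': 'Degraded', 'disabled': '1', 'suspect': '2'}
```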

Signed-off-by: Wang Shilong <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
Also use D_ASPRINTF instead of asprintf

Signed-off-by: Lei Huang <[email protected]>
Since 0 is the minimum RF, we should allow setting it to 0. We can
revisit naming later.

Signed-off-by: Jeff Olivier <[email protected]>
…15545) (#15702)

Fix a regression which prevents dmg storage query usage from
enumerating devices backed with emulated (AIO file or kdev) NVMe.

Signed-off-by: Tom Nabarro <[email protected]>
…15632) (#15701)

When a single socket is missing a pmem namespace on a dual-socket host, a
confusing no-space error can be returned from daos_server scm prepare.
The previously required workaround was to specify --socket. Fix this
issue by adding NumaNode in the fall-back case where an ndctl region idset
overflow requires matching of numa/socket via ipmctl region info
instead. Also add unit test cases to cover the situation.

Signed-off-by: Tom Nabarro <[email protected]>
An invalid hole extent might be left by process_hole_ult(),
so skip it.

Signed-off-by: Di Wang <[email protected]>
This PR enhances the DDB functionality for CR purposes with
the following updates:

1. Pool Behavior Control:

Administrators can now control certain vos pool behaviors,
such as skipping vos pool loading or setting a vos pool to immutable mode.

2. Manual Pool Shard Removal:

A new command ddb rm_pool <vos_pool> has been introduced,
allowing administrators to manually remove pool shards.

3. SPDK Environment Initialization Bug Fix:

Fixed an issue where spdk_env_init() would fail during reinitialization.

These updates aim to improve system flexibility and stability,
providing administrators with more robust management capabilities.

Signed-off-by: Wang Shilong <[email protected]>
- Update go.mod.
- Update vendored dependencies.
- Exclude Go vendored dependencies from codespell githook.
- Fix codespell skip handling.

Signed-off-by: Kris Jacque <[email protected]>
Signed-off-by: Dalton Bohning <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]>
- Bump github/codeql-action from 3.24.9 to 3.27.7 (#15589)
- Bump github/codeql-action from 3.27.7 to 3.27.9 (#15618)
- Bump github/codeql-action from 3.27.9 to 3.28.0 (#15662)
- Bump thollander/actions-comment-pull-request from 2 to 3 (#15590)
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 (#15591)
- Bump codespell-project/actions-codespell to latest (#15592)
- Bump EnricoMi/publish-unit-test-result-action from 1.17 to 2.7 (#15593)
- Bump EnricoMi/publish-unit-test-result-action from 2.7.0 to 2.18.0 (#15660)
- Bump isort/isort-action from 1.1.0 to 1.1.1 (#15594)
- Bump phoenix-actions/test-reporting from 10 to 15 (#15617)
- Bump actions/setup-python from 5.1.0 to 5.3.0 (#15661)

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Dalton Bohning <[email protected]>

Errors are: component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/Merge


@github-advanced-security github-advanced-security bot left a comment

Scorecard found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15725/1/execution/node/1220/log

@jolivier23 jolivier23 merged commit 806b655 into google/2.6 Jan 14, 2025
65 of 72 checks passed
@jolivier23 jolivier23 deleted the mjmac/google/2.6 branch January 14, 2025 02:57