Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enhancement](tablet-meta) Avoid be coredump due to potential race condition when updating tablet cumu point #45643

Merged
merged 1 commit into from
Dec 23, 2024

Conversation

TangSiyang2001
Copy link
Collaborator

@TangSiyang2001 TangSiyang2001 commented Dec 19, 2024

What problem does this PR solve?

Problem Summary:

Currently, when setting tablet's cumu point, aseert fail will happend if new point is less than local value, resulting BE coredump.

This could happend when race condition happend:

  1. thread A try to sync rowset
  2. thread A fetch cumu point from ms
  3. thread B update cumu point(like sc/compaction),commit to ms after 2. and set be tablet cumu point before 4.
  4. thread A try to set cumu point seen before and meet the assertion, coredump.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Contributor

sh-checker report

To get the full details, please check in the job output.

shellcheck errors

'shellcheck ' returned error 1 finding the following syntactical issues:

----------

In gensrc/script/gen_build_version.sh line 38:
if [[ ${build_version_hotfix} > 0 ]]; then
                              ^-- SC2071 (error): > is for string comparisons. Use -gt instead.


In gensrc/script/gen_build_version.sh line 228:
if [[ ${build_version_hotfix} > 0 ]]; then
                              ^-- SC2071 (error): > is for string comparisons. Use -gt instead.

For more information:
  https://www.shellcheck.net/wiki/SC2071 -- > is for string comparisons. Use ...
----------

You can address the above issues in one of three ways:
1. Manually correct the issue in the offending shell script;
2. Disable specific issues by adding the comment:
  # shellcheck disable=NNNN
above the line that contains the issue, where NNNN is the error code;
3. Add '-e NNNN' to the SHELLCHECK_OPTS setting in your .yml action file.



shfmt errors
'shfmt ' found no issues.

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@TangSiyang2001
Copy link
Collaborator Author

run buildall

@TangSiyang2001 TangSiyang2001 marked this pull request as ready for review December 19, 2024 09:38
@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 38.87% (10133/26066)
Line Coverage: 29.79% (85198/285972)
Region Coverage: 28.83% (43690/151525)
Branch Coverage: 25.37% (22197/87482)
Coverage Report: http://coverage.selectdb-in.cc/coverage/65fb04d169f4112e54f32270610bad830993d6b8_65fb04d169f4112e54f32270610bad830993d6b8/report/index.html

@doris-robot
Copy link

TPC-H: Total hot run time: 39775 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 65fb04d169f4112e54f32270610bad830993d6b8, data reload: false

------ Round 1 ----------------------------------
q1	17580	7375	7408	7375
q2	2050	178	175	175
q3	10595	1102	1171	1102
q4	10532	771	777	771
q5	7604	2697	2638	2638
q6	241	146	145	145
q7	976	620	601	601
q8	9268	1825	1888	1825
q9	6672	6468	6526	6468
q10	7058	2268	2322	2268
q11	457	260	267	260
q12	434	223	227	223
q13	17792	2904	2906	2904
q14	270	210	204	204
q15	568	496	508	496
q16	663	583	582	582
q17	971	617	599	599
q18	7315	6771	6642	6642
q19	1355	955	1036	955
q20	461	193	186	186
q21	4062	3313	3040	3040
q22	382	316	329	316
Total cold run time: 107306 ms
Total hot run time: 39775 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7241	7192	8000	7192
q2	328	232	236	232
q3	2907	2828	2933	2828
q4	2071	1811	1880	1811
q5	5684	5668	5649	5649
q6	234	142	155	142
q7	2254	1855	1787	1787
q8	3403	3535	3505	3505
q9	8911	8890	8954	8890
q10	3599	3544	3540	3540
q11	606	520	521	520
q12	795	640	598	598
q13	11940	3177	3122	3122
q14	315	271	274	271
q15	576	530	508	508
q16	694	647	657	647
q17	1847	1639	1623	1623
q18	8366	7689	7726	7689
q19	2192	1547	1618	1547
q20	2079	1891	1882	1882
q21	5550	5576	5500	5500
q22	664	579	578	578
Total cold run time: 72256 ms
Total hot run time: 60061 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 196257 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 65fb04d169f4112e54f32270610bad830993d6b8, data reload: false

query1	1341	942	921	921
query2	6226	2207	2326	2207
query3	10960	4701	4636	4636
query4	32945	23461	23350	23350
query5	4365	463	452	452
query6	292	202	186	186
query7	3989	313	298	298
query8	302	242	239	239
query9	9338	2730	2715	2715
query10	481	250	252	250
query11	17988	15218	15159	15159
query12	163	106	107	106
query13	1603	438	431	431
query14	9239	6636	6665	6636
query15	270	201	207	201
query16	8125	457	484	457
query17	1541	620	604	604
query18	2151	301	318	301
query19	330	159	164	159
query20	125	124	123	123
query21	205	116	116	116
query22	4788	4605	4518	4518
query23	35372	33823	33961	33823
query24	11242	2487	2556	2487
query25	651	396	401	396
query26	1356	158	151	151
query27	2733	325	333	325
query28	7490	2458	2480	2458
query29	838	437	455	437
query30	241	151	147	147
query31	1076	845	859	845
query32	100	54	56	54
query33	783	332	312	312
query34	971	538	505	505
query35	897	769	792	769
query36	1127	961	974	961
query37	159	75	75	75
query38	4142	4207	4085	4085
query39	1526	1450	1455	1450
query40	214	109	100	100
query41	45	52	48	48
query42	117	108	107	107
query43	533	497	507	497
query44	1263	832	837	832
query45	195	175	169	169
query46	1195	736	740	736
query47	2042	1967	1946	1946
query48	444	330	321	321
query49	919	388	413	388
query50	854	411	394	394
query51	7444	7218	7265	7218
query52	113	92	93	92
query53	260	183	184	183
query54	1072	433	409	409
query55	85	79	77	77
query56	255	251	231	231
query57	1298	1165	1175	1165
query58	244	241	216	216
query59	3305	3196	3093	3093
query60	304	257	256	256
query61	109	107	111	107
query62	877	693	697	693
query63	219	196	223	196
query64	4015	666	633	633
query65	3325	3212	3243	3212
query66	1162	316	311	311
query67	16029	15599	15491	15491
query68	5944	551	550	550
query69	490	253	248	248
query70	1253	1134	1148	1134
query71	449	243	257	243
query72	7136	4204	4090	4090
query73	789	361	365	361
query74	9924	8941	8958	8941
query75	3684	2652	2665	2652
query76	3943	1033	1127	1033
query77	576	284	266	266
query78	10092	9406	9438	9406
query79	1133	617	593	593
query80	958	436	547	436
query81	512	238	225	225
query82	488	119	124	119
query83	204	144	146	144
query84	287	71	79	71
query85	1277	308	309	308
query86	389	309	251	251
query87	4631	4482	4378	4378
query88	3380	2252	2207	2207
query89	431	291	293	291
query90	1982	189	186	186
query91	151	104	106	104
query92	60	52	52	52
query93	1859	544	540	540
query94	668	276	287	276
query95	344	255	244	244
query96	610	281	279	279
query97	2848	2674	2652	2652
query98	217	197	197	197
query99	1641	1343	1314	1314
Total cold run time: 303758 ms
Total hot run time: 196257 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 32.97 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 65fb04d169f4112e54f32270610bad830993d6b8, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.03	0.03
query3	0.23	0.07	0.07
query4	1.62	0.10	0.10
query5	0.41	0.39	0.42
query6	1.16	0.68	0.65
query7	0.02	0.01	0.02
query8	0.04	0.03	0.03
query9	0.59	0.51	0.51
query10	0.56	0.58	0.55
query11	0.15	0.11	0.11
query12	0.13	0.11	0.11
query13	0.61	0.61	0.60
query14	2.86	2.76	2.74
query15	0.90	0.83	0.83
query16	0.39	0.39	0.38
query17	0.99	1.04	1.03
query18	0.23	0.21	0.22
query19	1.98	2.04	1.85
query20	0.01	0.01	0.02
query21	15.36	0.59	0.60
query22	2.87	2.14	1.98
query23	17.10	0.96	0.73
query24	3.18	1.22	2.08
query25	0.12	0.11	0.24
query26	0.53	0.13	0.14
query27	0.04	0.04	0.04
query28	9.80	1.13	1.09
query29	12.55	3.21	3.19
query30	0.24	0.06	0.07
query31	2.86	0.38	0.39
query32	3.24	0.46	0.46
query33	3.08	3.05	3.15
query34	16.68	4.45	4.51
query35	4.48	4.46	4.49
query36	0.66	0.51	0.49
query37	0.09	0.06	0.06
query38	0.05	0.04	0.03
query39	0.04	0.02	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.04
Total cold run time: 106.27 s
Total hot run time: 32.97 s

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Dec 19, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@gavinchou gavinchou merged commit cd916c6 into apache:master Dec 23, 2024
31 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 23, 2024
…ndition when updating tablet cumu point (#45643)

Currently, when setting tablet's cumu point, aseert fail will happend if
new point is less than local value, resulting BE coredump.

This could happend when race condition happend:
1. thread A try to sync rowset
2. thread A fetch cumu point from ms 
3. thread B update cumu point(like sc/compaction),commit to ms after 2.
and set be tablet cumu point before 4.
4. thread A try to set cumu point seen before and meet the assertion,
coredump.
dataroaring pushed a commit that referenced this pull request Dec 25, 2024
…tial race condition when updating tablet cumu point #45643 (#45785)

Cherry-picked from #45643

Co-authored-by: Siyang Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/3.0.4-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants