[Fix](Iceberg)Fix HDFS FileSystem Leak Caused by Frequent Refresh of Iceberg Catalog #45956

Closed

CalvinKirs
Member

@CalvinKirs CalvinKirs commented Dec 25, 2024

Background

When using an HMS-based Iceberg Catalog, every Catalog refresh creates a new HadoopAuthenticator instance, so frequent refreshes lead to the following issues:

Frequent FileSystem Creation:

Iceberg uses Path.getFileSystem(Configuration) to obtain a FileSystem instance. Even though the Configuration remains unchanged, each new UserGroupInformation (UGI) causes a new FileSystem instance to be created instead of an existing one being reused.

Resource Leakage:

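Newly created FileSystem instances are not released, so resource consumption grows over time. The class histogram below (columns: rank, instance count, bytes, class name) shows the accumulated Kerberos principals and HDFS client objects, including 1155 DistributedFileSystem and 1155 DFSClient instances: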

 198:          1978          72712  [Ljava.security.Principal;
 244:          2202          52848  javax.security.auth.kerberos.KerberosPrincipal
 418:           753          24096  sun.security.krb5.PrincipalName
 717:           391           9384  com.sun.security.auth.UnixNumericGroupPrincipal
 940:           382           6112  com.sun.security.auth.UnixNumericUserPrincipal
  64:          1158         240864  org.apache.hadoop.hdfs.client.impl.DfsClientConf
  99:          1155         147840  org.apache.hadoop.hdfs.DFSClient
 113:          1158         120432  org.apache.hadoop.hdfs.client.impl.DfsClientConf$ShortCircuitConf
 127:          3465         110880  org.apache.hadoop.hdfs.server.namenode.ha.AbstractNNFailoverProxyProvider$NNProxyInfo
 189:          1155          64680  org.apache.hadoop.hdfs.DistributedFileSystem
 204:           294          58800  org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage
 215:          2310          55440  org.apache.hadoop.hdfs.server.datanode.CachingStrategy

Root Cause

Each Catalog refresh creates a new HadoopAuthenticator instance. Because each HadoopAuthenticator carries its own UGI, the UGI changes on every refresh, and a new FileSystem instance is created even when the Configuration is the same.
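
To illustrate the root cause, here is a minimal, self-contained sketch in plain Java. It does not use Hadoop classes: `FsCacheModel`, `Identity`, `Key`, and `FakeFileSystem` are hypothetical stand-ins for the FileSystem cache, the UGI, the cache key, and the FileSystem. It assumes, as far as I understand Hadoop's behavior, that the cache key includes the UGI and that two UGIs only compare equal when they wrap the same underlying login. The nameservice and user name are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Simplified model of a FileSystem cache whose key includes the caller identity. */
class FsCacheModel {

    /** Stand-in for a UGI: equals/hashCode are NOT overridden, so equality is by identity. */
    static final class Identity {
        final String user;
        Identity(String user) { this.user = user; }
    }

    /** Stand-in for the cache key: (scheme, authority, identity). */
    static final class Key {
        final String scheme;
        final String authority;
        final Identity identity;

        Key(String scheme, String authority, Identity identity) {
            this.scheme = scheme;
            this.authority = authority;
            this.identity = identity;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            // Same user name is not enough: the identity object itself must match.
            return scheme.equals(other.scheme)
                    && authority.equals(other.authority)
                    && identity == other.identity;
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority, System.identityHashCode(identity));
        }
    }

    /** Stand-in for a FileSystem instance that holds client resources. */
    static final class FakeFileSystem { }

    private final Map<Key, FakeFileSystem> cache = new HashMap<>();

    /** Returns the cached instance for the key, creating one on a cache miss. */
    FakeFileSystem get(String scheme, String authority, Identity identity) {
        return cache.computeIfAbsent(new Key(scheme, authority, identity), k -> new FakeFileSystem());
    }

    int size() {
        return cache.size();
    }

    /** Every refresh builds a fresh identity, so every lookup misses the cache. */
    static int refreshWithNewIdentity(int refreshes) {
        FsCacheModel model = new FsCacheModel();
        for (int i = 0; i < refreshes; i++) {
            model.get("hdfs", "nameservice1", new Identity("doris"));
        }
        return model.size();
    }

    /** Refreshes that reuse one identity keep hitting the same cached instance. */
    static int refreshWithSharedIdentity(int refreshes) {
        FsCacheModel model = new FsCacheModel();
        Identity shared = new Identity("doris");
        for (int i = 0; i < refreshes; i++) {
            model.get("hdfs", "nameservice1", shared);
        }
        return model.size();
    }

    public static void main(String[] args) {
        System.out.println("new identity per refresh: " + refreshWithNewIdentity(1000)); // 1000
        System.out.println("shared identity: " + refreshWithSharedIdentity(1000));       // 1
    }
}
```

In this model the cache grows by one entry per refresh when the identity is recreated, which matches the pattern in the histogram above, and stays at one entry when the identity is reused. This is only a model of the mechanism, not the change made in this PR.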

Related PR: #xxx

Problem Summary:

Release note

None

@Thearas
Contributor

Thearas commented Dec 25, 2024

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@CalvinKirs
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32388 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c4bb2a95e9a93fb3e31a685d2f1279b0db30bc4b, data reload: false

------ Round 1 ----------------------------------
q1	17579	6151	6009	6009
q2	2050	286	161	161
q3	10438	1239	732	732
q4	10211	853	427	427
q5	7563	2207	1940	1940
q6	214	180	144	144
q7	908	743	615	615
q8	9226	1379	1129	1129
q9	5417	4979	4936	4936
q10	6757	2335	1871	1871
q11	475	282	253	253
q12	350	357	226	226
q13	17760	3566	2949	2949
q14	249	229	213	213
q15	562	510	508	508
q16	624	631	577	577
q17	564	852	322	322
q18	7073	6351	6478	6351
q19	1260	976	553	553
q20	310	320	186	186
q21	2842	2159	1986	1986
q22	359	331	300	300
Total cold run time: 102791 ms
Total hot run time: 32388 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6204	6204	6241	6204
q2	234	326	234	234
q3	2213	2622	2277	2277
q4	1393	1835	1345	1345
q5	4368	4750	4672	4672
q6	180	169	136	136
q7	1990	1866	1741	1741
q8	2513	2681	2571	2571
q9	6975	6902	6916	6902
q10	2967	3216	2687	2687
q11	574	508	487	487
q12	647	739	593	593
q13	3198	3666	2979	2979
q14	277	291	261	261
q15	551	514	496	496
q16	647	691	650	650
q17	1170	1692	1220	1220
q18	7321	7275	6795	6795
q19	865	1051	1094	1051
q20	1912	1995	1846	1846
q21	5415	5244	4892	4892
q22	598	625	534	534
Total cold run time: 52212 ms
Total hot run time: 50573 ms

@doris-robot

TPC-DS: Total hot run time: 191135 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c4bb2a95e9a93fb3e31a685d2f1279b0db30bc4b, data reload: false

query1	1004	400	379	379
query2	6524	2311	2417	2311
query3	6719	208	208	208
query4	33685	23568	23517	23517
query5	4380	620	459	459
query6	283	198	174	174
query7	4623	495	298	298
query8	291	247	223	223
query9	9311	2743	2727	2727
query10	471	306	245	245
query11	18077	15350	15228	15228
query12	163	110	114	110
query13	1669	557	410	410
query14	10193	7578	7025	7025
query15	238	201	188	188
query16	8076	617	464	464
query17	1607	732	539	539
query18	2077	401	295	295
query19	195	175	151	151
query20	117	108	115	108
query21	201	121	104	104
query22	4283	4320	4176	4176
query23	34264	33532	33458	33458
query24	6380	2303	2251	2251
query25	497	448	378	378
query26	1204	263	148	148
query27	2208	445	326	326
query28	5188	2466	2444	2444
query29	756	555	404	404
query30	229	190	154	154
query31	996	913	832	832
query32	99	62	61	61
query33	489	345	310	310
query34	750	842	502	502
query35	812	819	735	735
query36	1015	1067	957	957
query37	123	98	77	77
query38	4273	4245	4304	4245
query39	1528	1439	1437	1437
query40	207	114	103	103
query41	47	45	44	44
query42	117	104	103	103
query43	519	524	491	491
query44	1275	803	799	799
query45	191	172	167	167
query46	873	1034	652	652
query47	1939	1915	1876	1876
query48	385	410	326	326
query49	775	475	379	379
query50	617	653	393	393
query51	7061	7362	7202	7202
query52	96	100	89	89
query53	217	278	183	183
query54	471	480	400	400
query55	83	79	78	78
query56	257	272	238	238
query57	1192	1194	1128	1128
query58	230	217	231	217
query59	3053	3150	2986	2986
query60	277	259	260	259
query61	115	115	113	113
query62	859	806	742	742
query63	232	202	199	199
query64	4741	1082	761	761
query65	3263	3172	3240	3172
query66	1083	431	385	385
query67	15897	15865	15614	15614
query68	8983	749	522	522
query69	466	296	260	260
query70	1225	1166	1141	1141
query71	439	286	266	266
query72	5816	4037	3964	3964
query73	668	754	364	364
query74	9811	8942	9250	8942
query75	4377	3163	2657	2657
query76	4157	1182	783	783
query77	783	358	285	285
query78	9959	10232	9433	9433
query79	3512	836	582	582
query80	726	511	489	489
query81	479	266	232	232
query82	633	152	122	122
query83	200	173	144	144
query84	285	97	73	73
query85	784	378	308	308
query86	353	316	283	283
query87	4474	4556	4603	4556
query88	4404	2246	2210	2210
query89	406	338	307	307
query90	1917	190	190	190
query91	137	134	107	107
query92	71	58	54	54
query93	1340	862	551	551
query94	667	398	287	287
query95	339	267	252	252
query96	486	599	280	280
query97	2787	2827	2701	2701
query98	224	250	201	201
query99	1675	1563	1432	1432
Total cold run time: 293653 ms
Total hot run time: 191135 ms

@doris-robot

ClickBench: Total hot run time: 31.88 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c4bb2a95e9a93fb3e31a685d2f1279b0db30bc4b, data reload: false

query1	0.04	0.04	0.03
query2	0.06	0.03	0.04
query3	0.24	0.07	0.07
query4	1.63	0.10	0.11
query5	0.41	0.42	0.38
query6	1.15	0.66	0.65
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.58	0.52	0.50
query10	0.55	0.57	0.56
query11	0.14	0.10	0.11
query12	0.15	0.11	0.11
query13	0.62	0.58	0.60
query14	2.71	2.74	2.72
query15	0.89	0.82	0.82
query16	0.40	0.37	0.38
query17	1.05	1.06	1.06
query18	0.24	0.20	0.20
query19	2.00	1.87	2.02
query20	0.01	0.02	0.01
query21	15.36	0.95	0.60
query22	0.76	0.89	0.59
query23	15.31	1.44	0.59
query24	3.67	1.44	1.54
query25	0.14	0.21	0.23
query26	0.20	0.14	0.15
query27	0.05	0.05	0.06
query28	14.36	1.50	1.05
query29	12.58	3.92	3.25
query30	0.26	0.09	0.06
query31	2.84	0.60	0.38
query32	3.23	0.54	0.46
query33	3.03	3.07	3.11
query34	16.87	5.15	4.53
query35	4.59	4.56	4.50
query36	0.66	0.49	0.48
query37	0.09	0.06	0.05
query38	0.04	0.03	0.04
query39	0.03	0.03	0.02
query40	0.16	0.13	0.12
query41	0.07	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 107.3 s
Total hot run time: 31.88 s

@CalvinKirs
Member Author

run P0

@CalvinKirs
Member Author

run buildall

@CalvinKirs
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32859 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3ec2f54c9aab694b4a0d571e19962ac26373ba54, data reload: false

------ Round 1 ----------------------------------
q1	18050	6249	6133	6133
q2	2047	311	170	170
q3	10744	1222	759	759
q4	10620	871	429	429
q5	7587	2193	1989	1989
q6	215	189	154	154
q7	929	787	609	609
q8	9264	1419	1195	1195
q9	6218	5092	5161	5092
q10	6939	2339	1869	1869
q11	480	290	254	254
q12	352	364	219	219
q13	18036	3651	3008	3008
q14	244	245	216	216
q15	575	515	514	514
q16	641	615	590	590
q17	566	853	322	322
q18	6873	6443	6342	6342
q19	2170	979	569	569
q20	298	321	183	183
q21	2791	2156	1936	1936
q22	359	335	307	307
Total cold run time: 105998 ms
Total hot run time: 32859 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6320	6221	6232	6221
q2	230	325	233	233
q3	2221	2650	2305	2305
q4	1381	1830	1328	1328
q5	4347	4782	4892	4782
q6	178	179	142	142
q7	2129	2001	1837	1837
q8	2598	2763	2726	2726
q9	7318	7285	7350	7285
q10	3049	3358	2734	2734
q11	593	513	526	513
q12	679	773	592	592
q13	3443	3777	3135	3135
q14	305	300	294	294
q15	579	498	505	498
q16	666	671	661	661
q17	1186	1714	1242	1242
q18	7640	7501	7395	7395
q19	799	1163	1065	1065
q20	1990	2023	1895	1895
q21	5706	5109	4971	4971
q22	607	608	615	608
Total cold run time: 53964 ms
Total hot run time: 52462 ms

@doris-robot

TPC-DS: Total hot run time: 196901 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3ec2f54c9aab694b4a0d571e19962ac26373ba54, data reload: false

query1	1293	946	878	878
query2	6455	2254	2228	2228
query3	10981	4635	4702	4635
query4	33031	23750	23681	23681
query5	4432	623	491	491
query6	290	211	189	189
query7	4005	486	304	304
query8	294	246	226	226
query9	9632	2753	2749	2749
query10	474	299	253	253
query11	18111	15287	15070	15070
query12	148	110	104	104
query13	1558	539	410	410
query14	9875	7426	7015	7015
query15	279	205	182	182
query16	8088	631	432	432
query17	1522	748	594	594
query18	2192	410	309	309
query19	192	193	162	162
query20	122	117	116	116
query21	210	132	109	109
query22	4709	4669	4561	4561
query23	34559	33729	33512	33512
query24	6566	2319	2325	2319
query25	503	451	426	426
query26	790	280	152	152
query27	2040	485	334	334
query28	6052	2514	2508	2508
query29	676	551	426	426
query30	208	182	147	147
query31	1015	947	845	845
query32	94	59	63	59
query33	478	356	304	304
query34	764	865	509	509
query35	820	825	775	775
query36	1052	1101	973	973
query37	113	107	77	77
query38	4302	4296	4335	4296
query39	1599	1484	1461	1461
query40	211	114	102	102
query41	46	44	45	44
query42	126	109	143	109
query43	535	529	501	501
query44	1336	831	836	831
query45	180	170	171	170
query46	877	1055	669	669
query47	1984	1987	1953	1953
query48	377	406	330	330
query49	721	486	395	395
query50	651	678	418	418
query51	7339	7335	7119	7119
query52	100	105	96	96
query53	249	272	193	193
query54	506	526	453	453
query55	81	78	80	78
query56	256	260	245	245
query57	1284	1228	1148	1148
query58	229	230	226	226
query59	3203	3087	3018	3018
query60	280	290	261	261
query61	109	109	120	109
query62	861	791	773	773
query63	231	201	198	198
query64	3719	1126	759	759
query65	3300	3256	3251	3251
query66	827	445	323	323
query67	16691	15919	15554	15554
query68	9853	759	518	518
query69	483	290	258	258
query70	1238	1116	1138	1116
query71	443	302	256	256
query72	6324	3826	3848	3826
query73	661	748	364	364
query74	10329	9025	8987	8987
query75	4602	3160	2670	2670
query76	4723	1197	804	804
query77	991	374	272	272
query78	10132	10300	9415	9415
query79	2978	905	604	604
query80	748	516	430	430
query81	486	270	225	225
query82	625	150	131	131
query83	205	157	144	144
query84	280	92	75	75
query85	778	417	300	300
query86	340	321	305	305
query87	4456	4477	4677	4477
query88	3279	2273	2249	2249
query89	448	329	291	291
query90	1933	184	190	184
query91	135	136	103	103
query92	63	52	54	52
query93	1910	915	534	534
query94	667	399	254	254
query95	328	263	255	255
query96	484	610	281	281
query97	2768	2813	2713	2713
query98	220	201	197	197
query99	1698	1516	1464	1464
Total cold run time: 300441 ms
Total hot run time: 196901 ms

@doris-robot

ClickBench: Total hot run time: 31.57 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3ec2f54c9aab694b4a0d571e19962ac26373ba54, data reload: false

query1	0.03	0.03	0.02
query2	0.07	0.03	0.03
query3	0.23	0.07	0.08
query4	1.60	0.11	0.10
query5	0.42	0.41	0.42
query6	1.14	0.65	0.65
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.58	0.50	0.51
query10	0.55	0.56	0.55
query11	0.14	0.10	0.11
query12	0.14	0.11	0.11
query13	0.60	0.60	0.60
query14	2.73	2.74	2.72
query15	0.90	0.83	0.82
query16	0.38	0.39	0.38
query17	1.05	1.06	1.05
query18	0.23	0.22	0.21
query19	1.93	1.72	1.98
query20	0.01	0.01	0.01
query21	15.38	0.94	0.59
query22	0.74	0.80	0.82
query23	15.16	1.41	0.51
query24	3.22	1.28	1.58
query25	0.23	0.19	0.09
query26	0.34	0.15	0.14
query27	0.05	0.05	0.04
query28	13.96	1.49	1.05
query29	12.57	3.91	3.28
query30	0.25	0.09	0.07
query31	2.82	0.62	0.38
query32	3.23	0.55	0.47
query33	3.04	3.10	3.22
query34	16.63	5.10	4.46
query35	4.51	4.44	4.43
query36	0.69	0.49	0.47
query37	0.09	0.06	0.06
query38	0.05	0.04	0.04
query39	0.04	0.02	0.02
query40	0.16	0.13	0.14
query41	0.08	0.03	0.02
query42	0.03	0.02	0.02
query43	0.03	0.03	0.04
Total cold run time: 106.09 s
Total hot run time: 31.57 s

@CalvinKirs
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 33249 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 23a2723e3cf10177960809267d946f4e3e5635ce, data reload: false

------ Round 1 ----------------------------------
q1	17571	6312	6162	6162
q2	2047	293	166	166
q3	10505	1239	812	812
q4	10227	861	444	444
q5	7963	2192	2010	2010
q6	211	178	148	148
q7	917	787	617	617
q8	9246	1391	1365	1365
q9	5353	5028	5000	5000
q10	6755	2294	1859	1859
q11	488	291	267	267
q12	361	375	225	225
q13	17765	3600	2987	2987
q14	237	238	226	226
q15	581	518	485	485
q16	636	643	578	578
q17	560	868	339	339
q18	7115	6884	6385	6385
q19	2234	967	611	611
q20	323	323	186	186
q21	3072	2400	2050	2050
q22	365	347	327	327
Total cold run time: 104532 ms
Total hot run time: 33249 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6446	6324	6313	6313
q2	244	325	230	230
q3	2285	2655	2328	2328
q4	1434	1835	1414	1414
q5	4364	4769	4933	4769
q6	194	177	146	146
q7	2126	1964	1894	1894
q8	2652	2825	2721	2721
q9	7350	7171	7286	7171
q10	3131	3328	2785	2785
q11	591	521	503	503
q12	699	795	584	584
q13	3397	3816	3203	3203
q14	306	332	273	273
q15	574	513	493	493
q16	655	709	650	650
q17	1261	1759	1283	1283
q18	7614	7401	7341	7341
q19	904	1263	1137	1137
q20	2023	2021	1898	1898
q21	5835	5343	5126	5126
q22	625	628	598	598
Total cold run time: 54710 ms
Total hot run time: 52860 ms

@doris-robot

TPC-DS: Total hot run time: 197044 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 23a2723e3cf10177960809267d946f4e3e5635ce, data reload: false

query1	1327	964	922	922
query2	6498	2273	2321	2273
query3	11126	4861	4656	4656
query4	32768	23970	23425	23425
query5	4372	627	469	469
query6	271	206	187	187
query7	3986	494	296	296
query8	288	241	229	229
query9	9584	2749	2742	2742
query10	468	305	257	257
query11	18083	15419	15113	15113
query12	156	110	102	102
query13	1580	550	422	422
query14	10501	7444	7134	7134
query15	249	224	195	195
query16	8056	627	506	506
query17	1572	763	621	621
query18	2164	426	329	329
query19	230	195	172	172
query20	124	117	115	115
query21	207	143	110	110
query22	4583	4539	4600	4539
query23	35166	34182	33727	33727
query24	6413	2337	2284	2284
query25	468	436	388	388
query26	729	248	161	161
query27	2049	485	336	336
query28	5409	2540	2472	2472
query29	577	530	421	421
query30	215	178	152	152
query31	958	927	855	855
query32	85	62	62	62
query33	476	357	300	300
query34	802	868	521	521
query35	820	863	769	769
query36	1030	1057	973	973
query37	113	93	79	79
query38	4150	4203	4158	4158
query39	1517	1456	1469	1456
query40	209	134	106	106
query41	47	44	46	44
query42	122	104	103	103
query43	532	530	496	496
query44	1312	835	839	835
query45	189	174	172	172
query46	935	1068	688	688
query47	1943	1973	1965	1965
query48	420	425	323	323
query49	706	510	396	396
query50	692	682	410	410
query51	7341	7088	7309	7088
query52	107	106	96	96
query53	229	256	193	193
query54	507	500	418	418
query55	85	83	79	79
query56	271	257	273	257
query57	1250	1242	1151	1151
query58	256	226	232	226
query59	3122	3258	3197	3197
query60	268	265	277	265
query61	111	107	114	107
query62	847	815	783	783
query63	236	200	197	197
query64	3528	1039	683	683
query65	3336	3284	3230	3230
query66	775	431	314	314
query67	16700	16008	15517	15517
query68	9843	753	527	527
query69	504	299	248	248
query70	1199	1137	1168	1137
query71	454	286	244	244
query72	6290	3916	3777	3777
query73	803	764	373	373
query74	9924	9024	9216	9024
query75	4685	3175	2646	2646
query76	5615	1197	814	814
query77	1031	369	303	303
query78	10124	10393	9500	9500
query79	2838	913	615	615
query80	718	525	438	438
query81	509	266	231	231
query82	324	159	123	123
query83	190	168	143	143
query84	278	91	70	70
query85	729	356	354	354
query86	334	310	287	287
query87	4585	4374	4469	4374
query88	3332	2254	2236	2236
query89	411	342	295	295
query90	1990	188	190	188
query91	138	139	102	102
query92	62	59	53	53
query93	1054	897	542	542
query94	668	392	290	290
query95	340	272	253	253
query96	498	610	278	278
query97	2780	2811	2678	2678
query98	222	206	197	197
query99	1785	1541	1437	1437
Total cold run time: 299503 ms
Total hot run time: 197044 ms

@doris-robot

ClickBench: Total hot run time: 31.82 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 23a2723e3cf10177960809267d946f4e3e5635ce, data reload: false

query1	0.04	0.04	0.04
query2	0.08	0.03	0.03
query3	0.24	0.07	0.07
query4	1.61	0.11	0.10
query5	0.41	0.43	0.41
query6	1.16	0.65	0.66
query7	0.02	0.02	0.02
query8	0.04	0.03	0.04
query9	0.58	0.50	0.50
query10	0.55	0.58	0.55
query11	0.15	0.10	0.10
query12	0.14	0.12	0.11
query13	0.60	0.62	0.60
query14	2.70	2.74	2.85
query15	0.89	0.82	0.82
query16	0.38	0.39	0.38
query17	1.04	1.01	1.04
query18	0.23	0.21	0.21
query19	1.94	1.84	2.02
query20	0.01	0.02	0.01
query21	15.36	0.96	0.59
query22	0.79	0.77	0.70
query23	15.24	1.47	0.55
query24	3.39	1.39	1.78
query25	0.12	0.10	0.11
query26	0.26	0.16	0.14
query27	0.07	0.07	0.04
query28	14.46	1.50	1.05
query29	12.56	3.94	3.29
query30	0.25	0.09	0.06
query31	2.84	0.59	0.40
query32	3.23	0.56	0.45
query33	3.02	3.11	3.05
query34	16.91	5.18	4.51
query35	4.56	4.53	4.49
query36	0.63	0.50	0.47
query37	0.10	0.07	0.06
query38	0.04	0.04	0.04
query39	0.03	0.03	0.02
query40	0.18	0.14	0.13
query41	0.08	0.03	0.02
query42	0.04	0.02	0.03
query43	0.04	0.04	0.03
Total cold run time: 107.01 s
Total hot run time: 31.82 s

@CalvinKirs CalvinKirs force-pushed the master-fix-iceberg-memory-lake branch from 23a2723 to b7f2092 Compare December 26, 2024 07:20
@CalvinKirs
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32238 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b7f2092c549806511fce856d75d1f6043616b0f6, data reload: false

------ Round 1 ----------------------------------
q1	17604	6066	6022	6022
q2	2049	309	175	175
q3	10406	1256	724	724
q4	10285	857	435	435
q5	7509	2156	1956	1956
q6	210	180	146	146
q7	883	800	609	609
q8	9234	1353	1099	1099
q9	5396	4916	4886	4886
q10	6764	2329	1883	1883
q11	495	278	258	258
q12	343	366	216	216
q13	17787	3622	2928	2928
q14	223	245	227	227
q15	612	514	485	485
q16	627	613	594	594
q17	562	831	359	359
q18	7065	6588	6228	6228
q19	1224	967	550	550
q20	298	331	183	183
q21	2805	2188	1966	1966
q22	357	334	309	309
Total cold run time: 102738 ms
Total hot run time: 32238 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6289	6261	6243	6243
q2	239	319	229	229
q3	2249	2626	2334	2334
q4	1457	1813	1372	1372
q5	4328	4727	4690	4690
q6	180	172	138	138
q7	1924	1828	1709	1709
q8	2467	2629	2550	2550
q9	6933	6841	6859	6841
q10	2993	3233	2731	2731
q11	570	523	502	502
q12	681	724	584	584
q13	3227	3629	2988	2988
q14	274	281	268	268
q15	558	503	512	503
q16	627	665	639	639
q17	1146	1668	1206	1206
q18	7243	7060	6940	6940
q19	815	1102	995	995
q20	1887	1955	1899	1899
q21	5358	5168	5022	5022
q22	628	594	577	577
Total cold run time: 52073 ms
Total hot run time: 50960 ms

@doris-robot

TPC-DS: Total hot run time: 190423 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b7f2092c549806511fce856d75d1f6043616b0f6, data reload: false

query1	975	388	392	388
query2	6527	2363	2270	2270
query3	6707	217	207	207
query4	33765	23909	23818	23818
query5	5199	641	449	449
query6	290	200	186	186
query7	4630	497	300	300
query8	300	251	232	232
query9	9676	2767	2760	2760
query10	510	307	248	248
query11	18259	15383	15163	15163
query12	163	106	106	106
query13	1670	546	420	420
query14	12367	6993	7266	6993
query15	249	203	181	181
query16	7951	582	472	472
query17	1543	750	577	577
query18	2129	442	291	291
query19	241	183	145	145
query20	114	110	118	110
query21	204	128	104	104
query22	4288	4334	4270	4270
query23	34127	34144	33338	33338
query24	6054	2293	2220	2220
query25	434	436	389	389
query26	760	250	149	149
query27	2043	472	331	331
query28	4943	2481	2442	2442
query29	532	528	434	434
query30	233	179	149	149
query31	975	901	827	827
query32	89	60	58	58
query33	492	348	299	299
query34	770	835	503	503
query35	793	812	748	748
query36	1047	1038	955	955
query37	115	92	75	75
query38	4186	4092	4046	4046
query39	1524	1407	1421	1407
query40	205	115	105	105
query41	47	45	45	45
query42	121	113	102	102
query43	523	519	478	478
query44	1379	809	804	804
query45	178	174	169	169
query46	863	1037	647	647
query47	1884	1907	1853	1853
query48	390	420	319	319
query49	742	467	380	380
query50	625	657	388	388
query51	7072	7507	7147	7147
query52	103	104	95	95
query53	230	264	184	184
query54	480	483	408	408
query55	80	77	84	77
query56	265	254	250	250
query57	1185	1177	1133	1133
query58	229	222	230	222
query59	2977	3127	2963	2963
query60	270	267	246	246
query61	117	110	109	109
query62	881	758	755	755
query63	226	196	195	195
query64	3772	997	651	651
query65	3288	3174	3234	3174
query66	854	426	304	304
query67	15807	15741	15508	15508
query68	8924	768	529	529
query69	494	282	251	251
query70	1212	1149	1106	1106
query71	447	282	254	254
query72	6251	3838	3768	3768
query73	656	755	362	362
query74	9973	9078	9102	9078
query75	4704	3131	2663	2663
query76	5227	1212	778	778
query77	963	356	275	275
query78	10006	10198	9357	9357
query79	3373	902	602	602
query80	726	563	438	438
query81	474	281	249	249
query82	641	165	126	126
query83	195	168	146	146
query84	281	98	77	77
query85	768	378	331	331
query86	340	288	283	283
query87	4595	4381	4564	4381
query88	4480	2253	2210	2210
query89	422	333	292	292
query90	1935	191	191	191
query91	139	134	107	107
query92	71	57	53	53
query93	1255	894	550	550
query94	663	401	285	285
query95	330	258	259	258
query96	499	608	283	283
query97	2743	2852	2720	2720
query98	227	196	196	196
query99	1699	1545	1453	1453
Total cold run time: 296380 ms
Total hot run time: 190423 ms

@doris-robot

ClickBench: Total hot run time: 31.59 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b7f2092c549806511fce856d75d1f6043616b0f6, data reload: false

query1	0.04	0.04	0.03
query2	0.07	0.03	0.03
query3	0.24	0.06	0.07
query4	1.61	0.11	0.10
query5	0.40	0.41	0.41
query6	1.16	0.66	0.64
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.52	0.50
query10	0.55	0.58	0.55
query11	0.14	0.10	0.10
query12	0.14	0.11	0.12
query13	0.61	0.61	0.61
query14	2.71	2.74	2.70
query15	0.92	0.83	0.84
query16	0.38	0.38	0.39
query17	1.06	1.00	0.99
query18	0.23	0.21	0.20
query19	1.95	1.85	2.00
query20	0.01	0.01	0.01
query21	15.38	0.92	0.58
query22	0.76	0.75	0.71
query23	15.33	1.40	0.52
query24	3.02	1.20	1.50
query25	0.21	0.20	0.12
query26	0.36	0.16	0.14
query27	0.08	0.07	0.04
query28	13.80	1.52	1.04
query29	12.54	3.85	3.28
query30	0.24	0.10	0.06
query31	2.83	0.61	0.39
query32	3.23	0.55	0.45
query33	3.14	3.12	3.10
query34	16.94	5.11	4.52
query35	4.53	4.49	4.56
query36	0.67	0.50	0.51
query37	0.09	0.07	0.06
query38	0.05	0.03	0.04
query39	0.03	0.02	0.02
query40	0.17	0.14	0.12
query41	0.09	0.03	0.02
query42	0.03	0.03	0.02
query43	0.03	0.03	0.04
Total cold run time: 106.41 s
Total hot run time: 31.59 s

@CalvinKirs CalvinKirs closed this Dec 27, 2024