DRILL-8489: Sender memory leak when rpc encode exception #2901

Merged: 1 commit into apache:master on May 1, 2024

Conversation

@shfshihuafeng (Contributor) commented on Apr 17, 2024

DRILL-8489: Sender memory leak when rpc encode exception

Description

When encoding throws an exception, Netty can release the message automatically only if it is an instance of ReferenceCounted. Drill, however, converts the message to an OutboundRpcMessage, so Netty cannot release it and the buffers it carries leak on the sender side.

We can reproduce this scenario with a breakpoint and an added debug log; see Testing, Test 1 below.
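
For illustration, the usual way to avoid this kind of leak in a Netty pipeline is to release the wrapped payload explicitly when encoding fails. The sketch below is not the actual patch in this PR; WrappedMessage and its payload field are hypothetical stand-ins for Drill's OutboundRpcMessage and the buffers it carries.

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.handler.codec.MessageToMessageEncoder;
    import io.netty.util.ReferenceCountUtil;
    import io.netty.util.ReferenceCounted;

    import java.util.List;

    // Hypothetical wrapper: like OutboundRpcMessage, it is not itself
    // ReferenceCounted, so Netty's MessageToMessageEncoder will not
    // release it when encode() throws.
    class WrappedMessage {
      final ReferenceCounted payload; // e.g. the record batch buffers
      WrappedMessage(ReferenceCounted payload) {
        this.payload = payload;
      }
    }

    class SafeEncoder extends MessageToMessageEncoder<WrappedMessage> {
      @Override
      protected void encode(ChannelHandlerContext ctx, WrappedMessage msg, List<Object> out) throws Exception {
        try {
          // Serializing the header may allocate and can fail (e.g. OutOfMemoryException).
          out.add(ctx.alloc().buffer());
          // On success, ownership of the payload passes downstream with `out`.
          out.add(msg.payload);
        } catch (Exception e) {
          // Netty only releases `msg` automatically if it is ReferenceCounted.
          // The wrapper is not, so release the wrapped payload before
          // propagating the failure, otherwise the sender leaks this memory.
          ReferenceCountUtil.release(msg.payload);
          throw e;
        }
      }
    }

With this pattern the payload is released exactly once: downstream on the success path, or in the catch block when encoding fails.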

Documentation

(Please describe user-visible changes similar to what should appear in the Drill documentation.)

Testing

Test 1

1. set -Ddrill.memory.debug.allocator=TRUE

2. Add a debug hook to DrillByteBufAllocator#buffer() that throws an OutOfMemoryException whenever the marker file /data/shf/b.log exists, so that an encode failure can be triggered on demand:

      // Debug-only hook: fail buffer allocation while the marker file exists,
      // simulating an allocation failure during RPC encode.
      public ByteBuf buffer() {
        File file = new File("/data/shf/b.log");
        if (file.exists()) {
          throw new OutOfMemoryException("shf encode exception");
        }
        return buffer(DEFAULT_BUFFER_SIZE);
      }

3. Restart the drillbit.

4. Run TPC-H query 8:

select
o_year,
sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share
from (
select
extract(year from o_orderdate) as o_year,
l_extendedprice * 1.0 as volume,
n2.n_name as nation
from hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem, hive.tpch1s.orders, hive.tpch1s.customer, hive.tpch1s.nation n1, hive.tpch1s.nation n2, hive.tpch1s.region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'ASIA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01'
and date '1996-12-31'
and p_type = 'LARGE BRUSHED BRASS') as all_nations
group by o_year
order by o_year;   

5. Set a breakpoint in BroadcastSenderRootExec#innerNext at tunnels[i].sendRecordBatch(batch); and resume the program (F9 in IDEA) until memory has been allocated for the writableBatch object.


6. Set a breakpoint at MessageToMessageEncoder#encode and resume the program (F9 in IDEA) until the writableBatch from step 5 is being encoded.

7. mkdir /data/shf/b.log so that the debug hook added in step 2 starts throwing.
8. End the breakpoints and let the program continue.
9. Find the memory leak.
10. Check whether the leaked memory ids are equal to those allocated by writableBatch (a programmatic version of this check is sketched after the dump):
Allocator(frag:4:0) 3000000/1000000/4000512/30000000000 (res/actual/peak/limit)
  child allocators: 1
    Allocator(op:4:0:0:BroadcastSender) 1000000/53408/106816/10000000000 (res/actual/peak/limit)
      child allocators: 0
      ledgers: 5
        ledger[155] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 128, references: 1, life: 2050915810044022..0, allocatorManager: [130, life: 2050915807998314..0] holds 1 buffers.
            DrillBuf[156], udle: [132 0..128]
        ledger[159] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 4096, references: 1, life: 2050915810510561..0, allocatorManager: [138, life: 2050915808701687..0] holds 1 buffers.
            DrillBuf[160],
        ledger[161] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 32768, references: 1, life: 2050915810690813..0, allocatorManager: [134, life: 2050915808423055..0] holds 1 buffers.
            DrillBuf[162], udle: [135 0..32768]
          event log for: DrillBuf[162]
        ledger[160] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 16384, references: 1, life: 2050915810616308..0, allocatorManager: [136, life: 2050915808530627..0] holds 1 buffers.
            DrillBuf[161], udle: [137 0..16384]
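
The same check can also be expressed in code. The sketch below is not part of this PR and makes assumptions: senderAllocator is a hypothetical handle to the BroadcastSender operator's allocator, and the only API relied on is Drill's BufferAllocator#getAllocatedMemory().

    import org.apache.drill.exec.memory.BufferAllocator;

    // Sketch only (not part of this PR): after forcing an encode failure,
    // the BroadcastSender's operator allocator should hold no outstanding
    // memory once the batch buffers have been released.
    final class SenderLeakCheck {
      static void assertNoLeak(BufferAllocator senderAllocator) {
        final long allocated = senderAllocator.getAllocatedMemory();
        if (allocated != 0) {
          // A non-zero value corresponds to leaked ledgers like the
          // 128/4096/16384/32768-byte DrillBufs listed in the dump above.
          throw new IllegalStateException("sender leaked " + allocated + " bytes");
        }
      }
    }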

Test 2

  1. export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"2G"}
  2. Use the TPC-H 1s data set.
  3. Run TPC-H query 8.
  4. This scenario is relatively easy to reproduce by running the following script:
#!/bin/bash
drill_home=/data/shf/apache-drill-1.22.0-SNAPSHOT/bin
fileName=/data/shf/1s/shf.txt

random_sql(){
  # Re-run TPC-H query 8 in a loop until the stop file appears.
  while true
  do
    num=$((RANDOM%22+1))   # presumably for picking a random query; unused here
    if [ -f "$fileName" ]; then
      echo "$fileName exists, stopping"
      exit 0
    else
      "$drill_home"/sqlline -u "jdbc:drill:zk=jupiter-2:2181/drill_shf/jupiterbits_shf1" -f tpch_sql8.sql >> sql8.log 2>&1
    fi
  done
}

main(){
  unset HADOOP_CLASSPATH
  # TPC-H power test: 25 concurrent query loops
  for i in `seq 1 25`
  do
    random_sql &
  done
  wait
}

main "$@"

@cgivre added the minor-update, backport-to-stable, bug, stability, and verified labels on Apr 26, 2024
@cgivre self-requested a review on April 26, 2024 at 16:54
@cgivre (Contributor) left a comment:

LGTM +1

@cgivre merged commit 6d94399 into apache:master on May 1, 2024
8 checks passed
jnturton pushed a commit to jnturton/drill that referenced this pull request May 17, 2024