Bug #428
closedTT local memory leak / RCU race condition on shutdown
100%
Description
On shutdown/rmmod of batman-adv.ko with batman-adv compiled from the main branch Slab complains with the following error message about a memory leak in the batman-adv TT local cache (full dmesg output in attachment):
[ 112.208633] =============================================================================
[ 112.210359] BUG batadv_tl_cache (Tainted: G OE ): Objects remaining in batadv_tl_cache on __kmem_cache_shutdown()
[ 112.211943] -----------------------------------------------------------------------------
[ 112.212517] Slab 0xffffe8afc0216d00 objects=16 used=1 fp=0xffff93f4085b4340 flags=0xfffffc0000a00(workingset|slab|node=0|zone=1|lastcpupid=0x1fffff)
[ 112.212517] CPU: 1 PID: 776 Comm: rmmod Tainted: G OE 6.8.12-amd64 #1 Debian 6.8.12-1
[ 112.212517] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 112.212517] Call Trace:
[ 112.212517] <TASK>
[ 112.212517] dump_stack_lvl+0x64/0x80
[ 112.212517] slab_err+0xe6/0x120
[ 112.212517] __kmem_cache_shutdown+0x160/0x2e0
[ 112.212517] kmem_cache_destroy+0x55/0x160
[ 112.220849] batadv_tt_cache_destroy+0x15/0x60 [batman_adv]
[ 112.220849] __do_sys_delete_module+0x1d5/0x320
[ 112.220849] do_syscall_64+0x83/0x190
[ 112.220849] ? do_syscall_64+0x8f/0x190
[ 112.220849] ? exc_page_fault+0x7f/0x180
[ 112.220849] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 112.224478] RIP: 0033:0x7f2ac8434977
[ 112.224478] Code: 73 01 c3 48 8b 0d a9 94 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 94 0c 00 f7 d8 64 89 01 48
[ 112.224478] RSP: 002b:00007ffe0adf6138 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 112.224478] RAX: ffffffffffffffda RBX: 000055db9018e770 RCX: 00007f2ac8434977
[ 112.224478] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055db9018e7d8
[ 112.224478] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[ 112.224478] R10: 00007f2ac84a6ac0 R11: 0000000000000206 R12: 00007ffe0adf6390
[ 112.224478] R13: 000055db9018e770 R14: 000055db9018d2a0 R15: 0000000000000000
[ 112.233961] </TASK>
[ 112.233961] Disabling lock debugging due to kernel taint
[ 112.233961] Object 0xffff93f4085b4140 @offset=320
[ 112.233961] Allocated in batadv_tt_local_add+0x297/0xa20 [batman_adv] age=15835 cpu=1 pid=755
[ 112.233961] batadv_tt_local_add+0x297/0xa20 [batman_adv]
[ 112.233961] batadv_interface_set_mac_addr+0xf6/0x120 [batman_adv]
[ 112.233961] dev_set_mac_address+0xde/0x140
[ 112.233961] dev_set_mac_address_user+0x30/0x50
[ 112.233961] do_setlink+0x261/0x12d0
[ 112.233961] rtnl_setlink+0x11f/0x1d0
[ 112.233961] rtnetlink_rcv_msg+0x152/0x3c0
[ 112.241772] netlink_rcv_skb+0x5b/0x110
[ 112.241772] netlink_unicast+0x1a6/0x290
[ 112.241772] netlink_sendmsg+0x223/0x490
[ 112.241772] __sys_sendto+0x1df/0x1f0
[ 112.241772] __x64_sys_sendto+0x24/0x30
[ 112.241772] do_syscall_64+0x83/0x190
[ 112.241772] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 112.245994] ------------[ cut here ]------------
[ 112.246650] kmem_cache_destroy batadv_tl_cache: Slab cache still has objects when called from batadv_tt_cache_destroy+0x15/0x60 [batman_adv]
[ 112.246668] WARNING: CPU: 1 PID: 776 at mm/slab_common.c:493 kmem_cache_destroy+0x14d/0x160
[ 112.249584] Modules linked in: veth batman_adv(OE-) cfg80211 rfkill bridge stp llc libcrc32c crc32c_generic crc16 rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver binfmt_misc pcspkr button joydev evdev serio_raw loop dm_mod efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci qemu_fw_cfg ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc 9pnet_rdma rdma_cm iw_cm ib_cm ib_core configfs 9p netfs ata_generic ata_piix libata psmouse scsi_mod 9pnet_virtio i2c_piix4 9pnet e1000 scsi_common floppy crypto_simd cryptd
[ 112.256555] CPU: 1 PID: 776 Comm: rmmod Tainted: G B OE 6.8.12-amd64 #1 Debian 6.8.12-1
[ 112.258457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 112.260410] RIP: 0010:kmem_cache_destroy+0x14d/0x160
[ 112.261687] Code: 00 eb be 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 53 60 48 8b 4c 24 20 48 c7 c6 60 d5 e3 98 48 c7 c7 b8 ec 2d 99 e8 43 0d d8 ff <0f> 0b e9 e2 fe ff ff c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90
[ 112.265219] RSP: 0018:ffffb3b2806e7e48 EFLAGS: 00010282
[ 112.266044] RAX: 0000000000000000 RBX: ffff93f4270a2640 RCX: 0000000000000027
[ 112.267157] RDX: ffff93f43c521708 RSI: 0000000000000001 RDI: ffff93f43c521700
[ 112.268268] RBP: 000055db9018e7d8 R08: 0000000000000000 R09: ffffb3b2806e7cd8
[ 112.269418] R10: ffffb3b2806e7cd0 R11: 0000000000000003 R12: 0000000080012d00
[ 112.270572] R13: ffffb3b2806e7f58 R14: 0000000000000000 R15: 0000000000000000
[ 112.271699] FS: 00007f2ac8308440(0000) GS:ffff93f43c500000(0000) knlGS:0000000000000000
[ 112.273001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 112.273923] CR2: 00005584ef830110 CR3: 000000000787c000 CR4: 00000000000006f0
[ 112.275050] Call Trace:
[ 112.275464] <TASK>
[ 112.275810] ? kmem_cache_destroy+0x14d/0x160
[ 112.276518] ? __warn+0x81/0x130
[ 112.277043] ? kmem_cache_destroy+0x14d/0x160
[ 112.277730] ? report_bug+0x171/0x1a0
[ 112.278315] ? prb_read_valid+0x1b/0x30
[ 112.278919] ? handle_bug+0x3c/0x80
[ 112.279467] ? exc_invalid_op+0x17/0x70
[ 112.280071] ? asm_exc_invalid_op+0x1a/0x20
[ 112.280741] ? kmem_cache_destroy+0x14d/0x160
[ 112.281603] ? kmem_cache_destroy+0x14d/0x160
[ 112.282489] batadv_tt_cache_destroy+0x15/0x60 [batman_adv]
[ 112.283373] __do_sys_delete_module+0x1d5/0x320
[ 112.284080] do_syscall_64+0x83/0x190
[ 112.284696] ? do_syscall_64+0x8f/0x190
[ 112.285315] ? exc_page_fault+0x7f/0x180
[ 112.285970] entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 112.286768] RIP: 0033:0x7f2ac8434977
[ 112.287355] Code: 73 01 c3 48 8b 0d a9 94 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 94 0c 00 f7 d8 64 89 01 48
[ 112.290282] RSP: 002b:00007ffe0adf6138 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 112.291465] RAX: ffffffffffffffda RBX: 000055db9018e770 RCX: 00007f2ac8434977
[ 112.292595] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055db9018e7d8
[ 112.293724] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000
[ 112.294863] R10: 00007f2ac84a6ac0 R11: 0000000000000206 R12: 00007ffe0adf6390
[ 112.295982] R13: 000055db9018e770 R14: 000055db9018d2a0 R15: 0000000000000000
[ 112.297103] </TASK>
[ 112.297465] ---[ end trace 0000000000000000 ]---
The system is an x86-64 KVM instance running the latest Debian Sid, including the Linux kernel as provided by Debian through the package linux-image-6.8.12-amd64 / 6.8.12-1. The batman-adv kernel module was compiled with gcc 13.2.0-25, as provided by Debian.
qemu-system-x86-64 command:
$ qemu-system-x86_64 -enable-kvm -nodefaults -smp 2 -m 1024M -serial mon:stdio -nographic -kernel /home/linus/mesh-node/boot/vmlinuz-6.8.
12-amd64 -initrd /home/linus/mesh-node/boot/initrd.img-6.8.12-amd64 -append nodeid=1 slub_debug console=ttyS0 uts=1 rootflags=ro ro net
.ifnames=1 ip=192.168.23.2::192.168.23.1:255.255.255.0:mesh-node:ens2:off root=/dev/nfs nfsroot=192.168.23.1:/home/linus/mesh-node -dev
ice e1000,netdev=net0,mac=02:04:64:a4:39:b1 -netdev tap,id=net0,ifname=vmtap0,script=/home/linus/dev-priv/vm/ens2-ifup.sh -device e1000
,netdev=net1,mac=02:04:64:a4:39:c1 -netdev tap,id=net1,ifname=vmtap1,script=no -device e1000,netdev=net2,mac=02:04:64:a4:39:d1 -netdev
tap,id=net2,ifname=vmtap2,script=no -device e1000,netdev=net3,mac=02:04:64:a4:39:e1 -netdev tap,id=net3,ifname=vmtap3,script=no -device
e1000,netdev=net4,mac=02:04:64:a4:39:f1 -netdev tap,id=net4,ifname=vmtap4,script=no -fsdev local,security_model=passthrough,id=fsdev-r
oot,path=/home/linus/mesh-node -device virtio-9p-pci,id=fs-root,fsdev=fsdev-root,mount_tag=virtfs_root
Within the VM the issue is reliably reproduceable with the following commands (CONFIG_BATMAN_ADV_MCAST=n to keep the TT local cache and therefore the splat short and debugging easier):
$ modprobe crc16; modprobe libcrc32c; modprobe bridge; modprobe cfg80211
$ cd /mnt/batman-adv/
$ make clean && make CONFIG_BATMAN_ADV_MCAST=n
$ insmod ./net/batman-adv/batman-adv.ko
$ ip link add dev veth0 type veth peer name veth1
$ ip link add dev bat0 type batadv
$ ip link set dev veth0 master bat0
$ ip link set up dev veth0
$ ip link set up dev veth1
$ ip link set up dev bat0
$ batctl tl
[B.A.T.M.A.N. adv 2024.1-7-g08e8aefb, MainIF/MAC: veth0/ba:8d:34:7b:53:01 (bat0/da:00:ca:2f:34:50 BATMAN_IV), TTVN: 1]
Client VID Flags Last seen (CRC )
da:00:ca:2f:34:50 -1 [.P....] 0.000 (0x2babba54)
$ rmmod batman-adv.ko
Afterwards an entry in slabinfo remains for batadv_tl_cache, however with 0 objects:
$ cat /proc/slabinfo | grep batadv
batadv_tl_cache 0 16 256 16 1 : tunables 0 0 0 : slabdata 1 1 0
The issue seems to have started to appear with the following commit, reverting it makes the kernel dmesg splat disappear: https://git.open-mesh.org/batman-adv.git/commit/cf9845d4566698731c23cd7a02db956706d058a1
Alternatively, adding 10x calls with msleep(500) between rcu_barrier() and batadv_tt_cache_destroy() in batadv_exit() makes the issue disappear, too.
With extra debug output I also see the kref refcount to decrease to zero for this TT local entry.
So overall it seems that somehow rcu_barrier() call is not waiting long enough for this entry to be freed since switching from call_rcu() to kfree_rcu() for freeing TT local entries?
Update/Note: This can also be reproduced with "-smp 1" instead of "-smp 2" to qemu.
Files