Project

General

Profile

Bug #428

Updated by Linus Lüssing 6 months ago

On shutdown/rmmod of batman-adv.ko with batman-adv compiled from the main branch Slab complains with the following error message about a memory leak in the batman-adv TT local cache (full dmesg output in attachment): 

 <pre><code> 
 [    112.208633] ============================================================================= 
 [    112.210359] BUG batadv_tl_cache (Tainted: G             OE       ): Objects remaining in batadv_tl_cache on __kmem_cache_shutdown() 
 [    112.211943] ----------------------------------------------------------------------------- 

 [    112.212517] Slab 0xffffe8afc0216d00 objects=16 used=1 fp=0xffff93f4085b4340 flags=0xfffffc0000a00(workingset|slab|node=0|zone=1|lastcpupid=0x1fffff) 
 [    112.212517] CPU: 1 PID: 776 Comm: rmmod Tainted: G             OE        6.8.12-amd64 #1    Debian 6.8.12-1 
 [    112.212517] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 
 [    112.212517] Call Trace: 
 [    112.212517]    <TASK> 
 [    112.212517]    dump_stack_lvl+0x64/0x80 
 [    112.212517]    slab_err+0xe6/0x120 
 [    112.212517]    __kmem_cache_shutdown+0x160/0x2e0 
 [    112.212517]    kmem_cache_destroy+0x55/0x160 
 [    112.220849]    batadv_tt_cache_destroy+0x15/0x60 [batman_adv] 
 [    112.220849]    __do_sys_delete_module+0x1d5/0x320 
 [    112.220849]    do_syscall_64+0x83/0x190 
 [    112.220849]    ? do_syscall_64+0x8f/0x190 
 [    112.220849]    ? exc_page_fault+0x7f/0x180 
 [    112.220849]    entry_SYSCALL_64_after_hwframe+0x78/0x80 
 [    112.224478] RIP: 0033:0x7f2ac8434977 
 [    112.224478] Code: 73 01 c3 48 8b 0d a9 94 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 94 0c 00 f7 d8 64 89 01 48 
 [    112.224478] RSP: 002b:00007ffe0adf6138 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 
 [    112.224478] RAX: ffffffffffffffda RBX: 000055db9018e770 RCX: 00007f2ac8434977 
 [    112.224478] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055db9018e7d8 
 [    112.224478] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000 
 [    112.224478] R10: 00007f2ac84a6ac0 R11: 0000000000000206 R12: 00007ffe0adf6390 
 [    112.224478] R13: 000055db9018e770 R14: 000055db9018d2a0 R15: 0000000000000000 
 [    112.233961]    </TASK> 
 [    112.233961] Disabling lock debugging due to kernel taint 
 [    112.233961] Object 0xffff93f4085b4140 @offset=320 
 [    112.233961] Allocated in batadv_tt_local_add+0x297/0xa20 [batman_adv] age=15835 cpu=1 pid=755 
 [    112.233961]    batadv_tt_local_add+0x297/0xa20 [batman_adv] 
 [    112.233961]    batadv_interface_set_mac_addr+0xf6/0x120 [batman_adv] 
 [    112.233961]    dev_set_mac_address+0xde/0x140 
 [    112.233961]    dev_set_mac_address_user+0x30/0x50 
 [    112.233961]    do_setlink+0x261/0x12d0 
 [    112.233961]    rtnl_setlink+0x11f/0x1d0 
 [    112.233961]    rtnetlink_rcv_msg+0x152/0x3c0 
 [    112.241772]    netlink_rcv_skb+0x5b/0x110 
 [    112.241772]    netlink_unicast+0x1a6/0x290 
 [    112.241772]    netlink_sendmsg+0x223/0x490 
 [    112.241772]    __sys_sendto+0x1df/0x1f0 
 [    112.241772]    __x64_sys_sendto+0x24/0x30 
 [    112.241772]    do_syscall_64+0x83/0x190 
 [    112.241772]    entry_SYSCALL_64_after_hwframe+0x78/0x80 
 [    112.245994] ------------[ cut here ]------------ 
 [    112.246650] kmem_cache_destroy batadv_tl_cache: Slab cache still has objects when called from batadv_tt_cache_destroy+0x15/0x60 [batman_adv] 
 [    112.246668] WARNING: CPU: 1 PID: 776 at mm/slab_common.c:493 kmem_cache_destroy+0x14d/0x160 
 [    112.249584] Modules linked in: veth batman_adv(OE-) cfg80211 rfkill bridge stp llc libcrc32c crc32c_generic crc16 rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver binfmt_misc pcspkr button joydev evdev serio_raw loop dm_mod efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci qemu_fw_cfg ip_tables x_tables autofs4 nfsv3 nfs_acl nfs lockd grace sunrpc 9pnet_rdma rdma_cm iw_cm ib_cm ib_core configfs 9p netfs ata_generic ata_piix libata psmouse scsi_mod 9pnet_virtio i2c_piix4 9pnet e1000 scsi_common floppy crypto_simd cryptd 
 [    112.256555] CPU: 1 PID: 776 Comm: rmmod Tainted: G      B        OE        6.8.12-amd64 #1    Debian 6.8.12-1 
 [    112.258457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 
 [    112.260410] RIP: 0010:kmem_cache_destroy+0x14d/0x160 
 [    112.261687] Code: 00 eb be 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 8b 53 60 48 8b 4c 24 20 48 c7 c6 60 d5 e3 98 48 c7 c7 b8 ec 2d 99 e8 43 0d d8 ff <0f> 0b e9 e2 fe ff ff c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 
 [    112.265219] RSP: 0018:ffffb3b2806e7e48 EFLAGS: 00010282 
 [    112.266044] RAX: 0000000000000000 RBX: ffff93f4270a2640 RCX: 0000000000000027 
 [    112.267157] RDX: ffff93f43c521708 RSI: 0000000000000001 RDI: ffff93f43c521700 
 [    112.268268] RBP: 000055db9018e7d8 R08: 0000000000000000 R09: ffffb3b2806e7cd8 
 [    112.269418] R10: ffffb3b2806e7cd0 R11: 0000000000000003 R12: 0000000080012d00 
 [    112.270572] R13: ffffb3b2806e7f58 R14: 0000000000000000 R15: 0000000000000000 
 [    112.271699] FS:    00007f2ac8308440(0000) GS:ffff93f43c500000(0000) knlGS:0000000000000000 
 [    112.273001] CS:    0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
 [    112.273923] CR2: 00005584ef830110 CR3: 000000000787c000 CR4: 00000000000006f0 
 [    112.275050] Call Trace: 
 [    112.275464]    <TASK> 
 [    112.275810]    ? kmem_cache_destroy+0x14d/0x160 
 [    112.276518]    ? __warn+0x81/0x130 
 [    112.277043]    ? kmem_cache_destroy+0x14d/0x160 
 [    112.277730]    ? report_bug+0x171/0x1a0 
 [    112.278315]    ? prb_read_valid+0x1b/0x30 
 [    112.278919]    ? handle_bug+0x3c/0x80 
 [    112.279467]    ? exc_invalid_op+0x17/0x70 
 [    112.280071]    ? asm_exc_invalid_op+0x1a/0x20 
 [    112.280741]    ? kmem_cache_destroy+0x14d/0x160 
 [    112.281603]    ? kmem_cache_destroy+0x14d/0x160 
 [    112.282489]    batadv_tt_cache_destroy+0x15/0x60 [batman_adv] 
 [    112.283373]    __do_sys_delete_module+0x1d5/0x320 
 [    112.284080]    do_syscall_64+0x83/0x190 
 [    112.284696]    ? do_syscall_64+0x8f/0x190 
 [    112.285315]    ? exc_page_fault+0x7f/0x180 
 [    112.285970]    entry_SYSCALL_64_after_hwframe+0x78/0x80 
 [    112.286768] RIP: 0033:0x7f2ac8434977 
 [    112.287355] Code: 73 01 c3 48 8b 0d a9 94 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 79 94 0c 00 f7 d8 64 89 01 48 
 [    112.290282] RSP: 002b:00007ffe0adf6138 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 
 [    112.291465] RAX: ffffffffffffffda RBX: 000055db9018e770 RCX: 00007f2ac8434977 
 [    112.292595] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055db9018e7d8 
 [    112.293724] RBP: 0000000000000000 R08: 1999999999999999 R09: 0000000000000000 
 [    112.294863] R10: 00007f2ac84a6ac0 R11: 0000000000000206 R12: 00007ffe0adf6390 
 [    112.295982] R13: 000055db9018e770 R14: 000055db9018d2a0 R15: 0000000000000000 
 [    112.297103]    </TASK> 
 [    112.297465] ---[ end trace 0000000000000000 ]--- 
 </code></pre> 

 The system is an x86-64 KVM instance running the latest Debian Sid, including the Linux kernel as provided by Debian through the package linux-image-6.8.12-amd64 / 6.8.12-1. The batman-adv kernel module was compiled with gcc 13.2.0-25, as provided by Debian. 

 qemu-system-x86-64 command: 

 <pre><code> 
 $ qemu-system-x86_64 -enable-kvm -nodefaults -smp 2 -m 1024M -serial mon:stdio -nographic -kernel /home/linus/mesh-node/boot/vmlinuz-6.8. 
 12-amd64 -initrd /home/linus/mesh-node/boot/initrd.img-6.8.12-amd64 -append nodeid=1 slub_debug console=ttyS0 uts=1 rootflags=ro ro net 
 .ifnames=1 ip=192.168.23.2::192.168.23.1:255.255.255.0:mesh-node:ens2:off root=/dev/nfs nfsroot=192.168.23.1:/home/linus/mesh-node -dev 
 ice e1000,netdev=net0,mac=02:04:64:a4:39:b1 -netdev tap,id=net0,ifname=vmtap0,script=/home/linus/dev-priv/vm/ens2-ifup.sh -device e1000 
 ,netdev=net1,mac=02:04:64:a4:39:c1 -netdev tap,id=net1,ifname=vmtap1,script=no -device e1000,netdev=net2,mac=02:04:64:a4:39:d1 -netdev  
 tap,id=net2,ifname=vmtap2,script=no -device e1000,netdev=net3,mac=02:04:64:a4:39:e1 -netdev tap,id=net3,ifname=vmtap3,script=no -device 
  e1000,netdev=net4,mac=02:04:64:a4:39:f1 -netdev tap,id=net4,ifname=vmtap4,script=no -fsdev local,security_model=passthrough,id=fsdev-r 
 oot,path=/home/linus/mesh-node -device virtio-9p-pci,id=fs-root,fsdev=fsdev-root,mount_tag=virtfs_root 
 </code></pre> 

 Within the VM the issue is reliably reproduceable with the following commands (CONFIG_BATMAN_ADV_MCAST=n to keep the TT local cache and therefore the splat short and debugging easier): 

 <pre><code> 
 $ modprobe crc16; modprobe libcrc32c; modprobe bridge; modprobe cfg80211 
 $ cd /mnt/batman-adv/ 
 $ make clean && make CONFIG_BATMAN_ADV_MCAST=n 
 $ insmod ./net/batman-adv/batman-adv.ko 
 $ ip link add dev veth0 type veth peer name veth1 
 $ ip link add dev bat0 type batadv 
 $ ip link set dev veth0 master bat0 
 $ ip link set up dev veth0 
 $ ip link set up dev veth1 
 $ ip link set up dev bat0 
 $ batctl tl 
 [B.A.T.M.A.N. adv 2024.1-7-g08e8aefb, MainIF/MAC: veth0/ba:8d:34:7b:53:01 (bat0/da:00:ca:2f:34:50 BATMAN_IV), TTVN: 1] 
 Client               VID Flags      Last seen (CRC         ) 
 da:00:ca:2f:34:50     -1 [.P....]     0.000     (0x2babba54) 
 $ rmmod batman-adv.ko 
 </code></pre> 

 Afterwards an entry in slabinfo remains for batadv_tl_cache, however with 0 objects: 

 <pre><code> 
 $ cat /proc/slabinfo | grep batadv    
 batadv_tl_cache          0       16      256     16      1 : tunables      0      0      0 : slabdata        1        1        0 
 </code></pre> 

 The issue seems to have started to appear with the following commit, reverting it makes the kernel dmesg splat disappear: https://git.open-mesh.org/batman-adv.git/commit/cf9845d4566698731c23cd7a02db956706d058a1 

 Alternatively, adding 10x calls with msleep(500) between rcu_barrier() and batadv_tt_cache_destroy() in batadv_exit() makes the issue disappear, too. 

 With extra debug output I also see the kref refcount to decrease to zero for this TT local entry. 

 So overall it seems that somehow rcu_barrier() call is not waiting long enough for this entry to be freed since switching from call_rcu() to kfree_rcu() for freeing TT local entries? 


 Update/Note: This can also be reproduced with "-smp 1" instead of "-smp 2" to qemu.

Back