Bug #247
Kernel BUG in __netdev_adjacent_dev_remove in complex bridge/VLAN setups (closed)
Description
We've received a report that batman-adv crashes the kernel in the following setup:
- 5 VLANs (eth0.2, eth0.3, eth0.100, eth0.101, eth0.102)
- bat0 on eth0.100, eth0.101, eth0.102
- br-wan on eth0.2
- br-client on bat0, eth0.3
When eth0 goes down and OpenWrt's netifd subsequently removes eth0.* from bat0, the crash shown in the attached kernel log occurs: the kernel tries to remove eth0 from br-client, but eth0 isn't a port of br-client.
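To make the reported topology easier to reason about, here is a toy model of the device stacking in Python. It is not kernel code, and the names `UPPERS`, `direct_lowers` and `count_paths` are made up for this illustration. It only shows that eth0 is not a direct port of br-client but sits below it over several paths; whether that multiplicity is what triggers the BUG is an assumption, not something established in this report.

```python
from functools import lru_cache

# Direct "lower -> upper" links, taken from the setup described above.
UPPERS = {
    "eth0": ["eth0.2", "eth0.3", "eth0.100", "eth0.101", "eth0.102"],
    "eth0.2": ["br-wan"],
    "eth0.3": ["br-client"],
    "eth0.100": ["bat0"],
    "eth0.101": ["bat0"],
    "eth0.102": ["bat0"],
    "bat0": ["br-client"],
}


def direct_lowers(upper):
    """Devices that are attached directly below `upper` (its ports/slaves)."""
    return sorted(dev for dev, ups in UPPERS.items() if upper in ups)


@lru_cache(maxsize=None)
def count_paths(lower, upper):
    """Number of distinct stacking paths leading from `lower` up to `upper`."""
    if lower == upper:
        return 1
    return sum(count_paths(mid, upper) for mid in UPPERS.get(lower, []))


if __name__ == "__main__":
    print(direct_lowers("br-client"))         # ['bat0', 'eth0.3']
    print(count_paths("eth0", "br-client"))   # 4
    print(count_paths("eth0", "bat0"))        # 3
```

In this model eth0 reaches br-client once via eth0.3 and three times via bat0, which matches the observation that eth0 is below br-client without being one of its ports.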
The issue was reported for the OpenWrt sunxi target; I was not able to reproduce it using the same version and setup on x86, therefore I'm not sure which parts of the setup are relevant.
Kernel: 3.18.27 (current OpenWrt CC HEAD)
batman-adv: 2016.0
Gluon issue tracker reference: https://github.com/freifunk-gluon/gluon/issues/680
Updated by Sven Eckelmann over 8 years ago
Sounds like this one: https://patchwork.ozlabs.org/project/netdev/patch/56CCDD4D.4080303@cradlepoint.com/. The CC kernel and the kernel in the upstream report both hit this BUG: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/core/dev.c?h=v4.5#n5531
The other patch I found is mostly unrelated to this problem, but its report still shows output from this BUG: https://patchwork.ozlabs.org/patch/525329/
Updated by Anonymous over 8 years ago
Indeed there is also a macvlan device on top of br-client, which I forgot in my report.
The reporter of the Gluon ticket has tested that:
- removing the macvlan device makes the issue disappear
- https://patchwork.ozlabs.org/project/netdev/patch/56CCDD4D.4080303@cradlepoint.com/ changes the error messages, but doesn't fix it
Updated by Sven Eckelmann over 8 years ago
Interesting. Can you or the original reporter reply to the patch from this person? (You can find the mail in mbox format on the patchwork page.)
Updated by Anonymous over 8 years ago
Which patch are you talking about?
- https://patchwork.ozlabs.org/project/netdev/patch/56CCDD4D.4080303@cradlepoint.com/ was already tested and changes the symptoms a bit, but doesn't fix the crash
- https://patchwork.ozlabs.org/project/netdev/patch/1443738068-30750-1-git-send-email-xiyou.wangcong@gmail.com/ is a patch for VRF, which is not involved here (and didn't even exist back in 3.18)
Updated by Sven Eckelmann over 8 years ago
I mean https://patchwork.ozlabs.org/project/netdev/patch/56CCDD4D.4080303@cradlepoint.com/. The author knows that it is not the right way ("incomplete") to fix it, but he asked for comments.
Updated by Anonymous over 8 years ago
https://patchwork.ozlabs.org/patch/587118/ changes the log output of the crash: instead of trying to remove eth0 from br-client, it is now trying to remove eth0 from local-node (local-node is the macvlan device on top of br-client.) I've attached the new dmesg sent by the original reporter.
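Putting the pieces from this thread together, the full setup including the macvlan device could be approximated with plain iproute2 commands. The helper below is an untested sketch that only builds the command list; `reproduction_commands` is a hypothetical name. It assumes a batman-adv build that can be configured through `ip link` (batctl could be used instead), leaves out bringing the devices up, and guesses that detaching the VLANs from bat0 after eth0 goes down is close enough to what netifd does. The issue could not be reproduced on x86, so this may well not trigger the crash.

```python
def reproduction_commands():
    """Return iproute2 commands approximating the reported setup (sketch only)."""
    cmds = []

    # The five VLANs on top of eth0.
    for vid in (2, 3, 100, 101, 102):
        cmds.append(f"ip link add link eth0 name eth0.{vid} type vlan id {vid}")

    # bat0 on eth0.100, eth0.101 and eth0.102.
    cmds.append("ip link add name bat0 type batadv")
    for vid in (100, 101, 102):
        cmds.append(f"ip link set dev eth0.{vid} master bat0")

    # br-wan on eth0.2, br-client on bat0 and eth0.3.
    cmds.append("ip link add name br-wan type bridge")
    cmds.append("ip link set dev eth0.2 master br-wan")
    cmds.append("ip link add name br-client type bridge")
    cmds.append("ip link set dev bat0 master br-client")
    cmds.append("ip link set dev eth0.3 master br-client")

    # local-node: the macvlan device on top of br-client.
    cmds.append("ip link add link br-client name local-node type macvlan")

    # Assumed trigger: eth0 goes down, then the VLANs are removed from bat0.
    cmds.append("ip link set dev eth0 down")
    for vid in (100, 101, 102):
        cmds.append(f"ip link set dev eth0.{vid} nomaster")

    return cmds


if __name__ == "__main__":
    print("\n".join(reproduction_commands()))
```

Printing the commands instead of running them keeps the sketch harmless on a development machine; they would have to be run as root on a test device to actually build the topology.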
Updated by Sven Eckelmann over 8 years ago
Did you contact Andrew Collins + David Miller?
Updated by Anonymous over 8 years ago
I just answered Andrew's mail. I forgot to CC David though.
Updated by Sven Eckelmann over 8 years ago
- Status changed from New to Rejected
I am marking this bug as rejected because it looks like it was fixed by https://patchwork.ozlabs.org/project/netdev/patch/56F5B746.9050401@cradlepoint.com/ in the core networking code of the kernel (see https://github.com/freifunk-gluon/gluon/issues/680#issuecomment-202438927). But please feel free to reopen it in case you find more information about required changes in batman-adv.