Bug #338
Closed
batman-adv 2017.1 unstable with multicast optimization
Description
I'm using the freifunk-gluon OpenWrt firmware 2017.1.1.
This firmware uses batman-adv 2017.1.
After upgrading to gluon 2017.1, the whole network became unstable (high load, reboots, etc.). The gluon developers then deactivated the multicast optimizations, and the network became stable again with the 2017.1.1 release.
I've updated my supernodes (batman gateways) to batman-adv 2017.1 (MulOpt active) and the whole mesh became unstable again. This means that the nodes with MulOpt disabled were affected as well.
I've disabled MU on the supernodes and everything went stable again.
==> Something in the multicast optimization is able to destabilize the whole mesh. Maybe there are compatibility issues in heterogeneous networks where MulOpt is enabled on some nodes and disabled on others.
Updated by Sven Eckelmann over 7 years ago
- Is duplicate of Bug #336: tt: store is_wifi & isolation flag per announcing originator added
Updated by Sven Eckelmann over 7 years ago
- Related to Bug #335: tt: remove is_wifi & isolation flag from multicast tt entries added
Updated by Sven Eckelmann over 7 years ago
- Related to deleted (Bug #335: tt: remove is_wifi & isolation flag from multicast tt entries)
Updated by Sven Eckelmann over 7 years ago
- Is duplicate of Bug #335: tt: remove is_wifi & isolation flag from multicast tt entries added
Updated by Sven Eckelmann over 7 years ago
- Assignee changed from batman-adv developers to Linus Lüssing
Updated by Linus Lüssing over 7 years ago
- Priority changed from Normal to High
Changing the priority to "High", as there have been independent reports from several communities on the Gluon bug tracker already:
https://github.com/freifunk-gluon/gluon/issues/1183
Regarding "I've disabled MU on the supernodes and everything went stable again.": Do you mean you disabled it via "batctl mm 0" or did you recompile batman-adv with "CONFIG_BATMAN_ADV_MCAST=n"?
And concerning the "Assignee": there is nothing I can do about it for now, as far as I can tell. I'm waiting for some people to test and/or review the patch mentioned in #336.
Updated by Andre Kasper over 7 years ago
I've recompiled it.
fyi: The linked gluon ticket is mine.
Updated by Andre Kasper over 7 years ago
fyi:
I would like to test this fix, but I don't know how. I would need a fixed batman-adv for my "supernodes". If I understood the problem correctly, it wouldn't make sense to test the fix on the gluon nodes only, without patching the high-traffic supernodes, because if the supernodes confuse the flags, the gluon nodes could not fix it.
Can someone provide a how-to for getting a batman-adv build with the fix enabled?
Updated by Linus Lüssing over 7 years ago
Sure!
You can grab a version containing this patch via:
$ git clone https://git.open-mesh.org/batman-adv.git
$ cd batman-adv
$ git checkout v2017.2
$ git cherry-pick 382d020fe3fa528b1f
$ make
(Make sure you have your kernel headers present/installed. In Debian, that is the linux-headers-<your-kernel-version> package, for instance.)
Then load this newly, externally built kernel module with something like:
$ insmod <yourbuildfolder>/batman-adv/build/net/batman-adv/batman-adv.ko
Depending on your host configuration you might need to unload any other batman-adv version first:
$ modprobe -r batman-adv
You might also need to load some dependencies manually via modprobe, as insmod does not resolve them itself (for instance "modprobe crc16" and "modprobe libcrc32c"; check "dmesg" if you get errors while inserting your self-built batman-adv.ko).
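As a rough sketch of how the two points above could be checked, assuming the default location that out-of-tree module builds use for kernel headers (/lib/modules/<kernel-release>/build) and the module version that batman-adv exposes under /sys/module once loaded. The helper name kernel_build_dir is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: directory where an out-of-tree build expects the
# kernel headers for a given kernel release (default: the running kernel).
kernel_build_dir() {
    printf '/lib/modules/%s/build\n' "${1:-$(uname -r)}"
}

# Before "make": report whether headers for the running kernel are present.
if [ -d "$(kernel_build_dir)" ]; then
    echo "kernel headers found: $(kernel_build_dir)"
else
    echo "kernel headers missing: $(kernel_build_dir)" >&2
fi

# After "insmod": show which batman-adv version is actually loaded.
if [ -r /sys/module/batman_adv/version ]; then
    echo "loaded batman-adv: $(cat /sys/module/batman_adv/version)"
else
    echo "batman-adv is not loaded"
fi
```

If the reported version is not the one you just built, an older module is probably still loaded and needs to be removed first.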
Updated by Andre Kasper about 7 years ago
I've had some time to test it.
The problems seem to persist.
1. Is it important that ALL Nodes in the network have got the new and patched batman-adv?
2. The "clients" are gluon nodes connected via fastd. There is a ticket saying that BLA may be broken:
https://github.com/freifunk-gluon/gluon/issues/1198
I've had many reports that the supernodes received packets with their own MAC address. I'm not sure whether this is okay and BLA handles it (just "detected"), or whether this is an error in batman-adv or gluon.
Which information should I provide to find out what's happening?
Updated by Andre Kasper about 7 years ago
I'm sorry. I have to report that I failed to enable BLA at the gateway level; I had never enabled it before. The log entries were new, though, so something is still wrong: before the upgrade these messages were not there. I have now enabled BLA, so I can't tell what's happening. Any hints on what I should check?
Updated by Linus Lüssing about 7 years ago
Hi Andre,
Thanks for the additional feedback and testing!
Hm, BLA should have been enabled by default for several versions now. So I'm a little confused why you say you had to enable it on a gateway with batman-adv v2017.2 plus this TT sync patch.
No, you shouldn't need to patch all nodes to notice a difference. Upgrading central nodes, like gateways or nodes with a VPN connection, should reduce the impact of the issue already and should contain infected multicast TT entries to some degree. And upgrading nodes close to the zombie node which triggers the issue for you, should help, too.
Could you share the output of "batctl tg" from a patched gateway, so we can get an overview of the current degree of infection in your network? (For example, multicast TT entries with a "W" or "I" flag are infected.)
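To narrow the output down to the suspicious entries, something like the following filter might help. It is only a sketch and rests on two assumptions about the "batctl tg" layout that should be checked against your batctl version: that the first MAC-shaped field on a line is the client address, and that the flags appear as one bracketed field such as [.W..]. The function name infected_mcast_entries is made up for this example:

```shell
#!/bin/sh
# Hypothetical filter: print multicast TT entries whose flag field
# contains W or I. Reads "batctl tg"-style lines from stdin.
infected_mcast_entries() {
    awk '
    {
        mac = ""; flags = ""
        for (i = 1; i <= NF; i++) {
            if (mac == "" && length($i) == 17 && split($i, octets, ":") == 6) {
                # Assumption: the first MAC-shaped field is the client address.
                mac = $i
            } else if (flags == "" && $i ~ /^\[.*\]$/) {
                # Assumption: the flags are one bracketed field such as [.W..].
                flags = $i
            }
        }
        if (mac == "" || flags == "")
            next
        # Multicast addresses have the lowest bit of the first octet set,
        # i.e. the second hex digit of that octet is odd.
        if (index("13579bdfBDF", substr(mac, 2, 1)) == 0)
            next
        if (flags ~ /[WI]/)
            print mac, flags
    }'
}
```

It could then be used as "batctl tg | infected_mcast_entries". Lines without a client address or a flag field (such as the table header) are skipped.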
Updated by Sven Eckelmann over 6 years ago
- Status changed from New to Rejected
Closing it because reporter is MIA and we cannot reproduce it.