Bug #355
closedMissing TT entry leads to inconsistent but stable TT state
0%
Description
Bug report is actually from Leonardo Mörlein <me@irrelefant.net>:
We have three nodes, which have the following topology:
NDS-Agenda21Haus70 <-> sn04 <-> sn03
Something happened, so they came into a stable (for hours) but inconsistent
state of their translation tables:- sn03 has 11 entries in its translocal table and crc = 0x2b6d458c
- sn04 has only 10 entries in the table for sn03 but still crc = 0x2b6d458c
- NDS-Agenda21Haus70 has also 10 entries in the table for sn03 but crc = 0x74fb4f96It seems as like something (erroneously) removed the entry 33:33:00:00:00:01
from table on node sn04 without recalculating the crc sum.In the following are the outputs of the table states:
root@NDS-Agenda21Haus70:~# batctl tr 88:e6:40:20:30:01 > traceroute to 88:e6:40:20:30:01 (88:e6:40:20:30:01), 50 hops max, 20 byte packets > 1: 88:e6:40:20:40:01 16.567 ms 20.593 ms 16.324 ms > 2: 88:e6:40:20:30:01 20.760 ms 19.826 ms 19.840 ms > > root@NDS-Agenda21Haus70:~# batctl tg | grep 88:e6:40:20:30:01 > * 33:33:ff:06:3e:25 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > * 33:33:ff:00:18:32 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > 33:33:00:00:00:02 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > 01:00:5e:00:00:fc -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > 01:00:5e:00:00:01 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > 33:33:ff:00:00:00 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > * 33:33:ff:00:30:01 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > * 2a:d7:9a:06:3e:25 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > * 33:33:ff:00:01:08 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > 33:33:00:01:00:03 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x74fb4f96) > > [root@sn04]:~ # batctl tg | grep 88:e6:40:20:30:01 > * 33:33:ff:06:3e:25 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > * 33:33:ff:00:18:32 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > 33:33:00:00:00:02 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > 01:00:5e:00:00:fc -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > 01:00:5e:00:00:01 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > 33:33:ff:00:00:00 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > * 33:33:ff:00:30:01 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > * 2a:d7:9a:06:3e:25 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > * 33:33:ff:00:01:08 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > 33:33:00:01:00:03 -1 [....] ( 3) 88:e6:40:20:30:01 ( 3) (0x2b6d458c) > > [root@sn03]:~ # batctl tl > [B.A.T.M.A.N. adv 2018.0-3-g4b2b8c68, MainIF/MAC: mesh_fastd/88:e6:40:20:30:01 (bat0/2a:d7:9a:06:3e:25 BATMAN_IV), TTVN: 3] > Client VID Flags Last seen (CRC ) > 33:33:ff:06:3e:25 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:ff:00:18:32 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:00:00:00:02 -1 [.P....] 0.000 (0x2b6d458c) > 01:00:5e:00:00:fc -1 [.P....] 0.000 (0x2b6d458c) > 01:00:5e:00:00:01 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:ff:00:00:00 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:ff:00:30:01 -1 [.P....] 0.000 (0x2b6d458c) > 2a:d7:9a:06:3e:25 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:ff:00:01:08 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:00:01:00:03 -1 [.P....] 0.000 (0x2b6d458c) > 33:33:00:00:00:01 -1 [.P....] 0.000 (0x2b6d458c)In the attachment, I added a capture of the received full table response from
sn04 -> NDS-Agenda21Haus70 taken on NDS-Agenda21Haus70. I checked, that this
is really an intermediate response of sn04 and not a direct response from sn03
(sn03 is not receiving queries from NDS-Agenda21Haus70).
Files
Updated by Sven Eckelmann over 6 years ago
Wild and untested idea:
- two events happen at the same time at sn04
- both are processed at the same time in different contexts
- context a sees the list (hash) of entries not yet updated (11 entries) and calculates crc 0x2b6d458c
- context b sees the list (hash) of entries which was already updated (10 entries) and calculates crc 0x74fb4f96
- batadv_tt_global_update_crc does the crc calculation without a lock and thus operations can happen iterleaved
- context b finishes early (seen from crc memory write/commit perspective) and saves crc 0x74fb4f96
- context a finishes late (seen from crc memory write/commit perspective) and saves crc 0x2b6d458c
The hash (bat_priv->tt.local_hash) will then have only 10 entries and the crc of 11 entries. It will therefore not recover itself using the data from the originator to which the global entries belong to (because the crc is correct). It will still forward (as "knowing" intermediate node) the wrong number of entries to the querier of the TT full table for this originator.
Looks to me like the tt.crc access must always be protected with a specific lock for the complete hash (bat_priv->tt.local_hash or bat_priv->tt.global_hash).
Updated by Sven Eckelmann over 6 years ago
Antonio just mentioned that TT responses or OGM messages could trigger recalculations of the CRC. And multiple recalculations for the same originator are prevented using the orig_node->tt_lock (only different originators can calculate their CRC at the same time).
Updated by Linus Lüssing over 6 years ago
From looking at the code it seems to me that a node wrongly adding a ROAM or TEMP flag to a multicast TT entry in an OGM or TT Response might cause similar issues in tt_global_add() either here:
or here:
Which would delete a multicast TT entry for the same address on any other node, without updating any CRC.
As in Leonardo's setup we already observed bogus multicast TT entries with the 'W' and 'I' flag set it seems likely that this/those broken node(s) might inject a ROAM or TEMP flag in multicast TT entries, too.
If we can verify that this is indeed the issue for Leonardo then a simple fix should be to add a "!is_multicast_ether_addr()" check for the referenced two cases, to avoid interpreting these flag bits as ROAM or TEMP for multicast TT.
I'll hopefully be able to verify this this weekend.
Updated by Linus Lüssing over 6 years ago
These two patches hopefully fix this issue:
- batman-adv: Avoid storing non-TT-sync flags on singular entries too
- batman-adv: Fix multicast TT issues with bogus ROAM flags
Leonardo, would be great if you could check that this fixes it for you, too.
Updated by Sven Eckelmann over 6 years ago
- Status changed from New to Resolved
- Target version set to 2018.2
Updated by Sven Eckelmann over 6 years ago
- Status changed from Resolved to Closed
batman-adv 2018.2 was just released