Bug #355: Missing TT entry leads to inconsistent but stable TT state - batman-adv - Open Mesh

Actions

Copy link

Bug #355

closed

Missing TT entry leads to inconsistent but stable TT state

Added by Sven Eckelmann almost 7 years ago. Updated over 3 years ago.

Status:

Closed

Priority:

Normal

Assignee:

batman-adv developers

Target version:

2018.2

Start date:

05/10/2018

Due date:

% Done:

Estimated time:

Description

Bug report is actually from Leonardo Mörlein <me@irrelefant.net>:

We have three nodes, which have the following topology:

NDS-Agenda21Haus70 <-> sn04 <-> sn03

Something happened, so they came into a stable (for hours) but inconsistent
state of their translation tables:

- sn03 has 11 entries in its translocal table and crc = 0x2b6d458c
- sn04 has only 10 entries in the table for sn03 but still crc = 0x2b6d458c
- NDS-Agenda21Haus70 has also 10 entries in the table for sn03 but crc = 0x74fb4f96

It seems as like something (erroneously) removed the entry 33:33:00:00:00:01
from table on node sn04 without recalculating the crc sum.

In the following are the outputs of the table states:

root@NDS-Agenda21Haus70:~# batctl tr 88:e6:40:20:30:01
> traceroute to 88:e6:40:20:30:01 (88:e6:40:20:30:01), 50 hops max, 20 byte packets
>  1: 88:e6:40:20:40:01  16.567 ms  20.593 ms  16.324 ms
>  2: 88:e6:40:20:30:01  20.760 ms  19.826 ms  19.840 ms
> 
> root@NDS-Agenda21Haus70:~# batctl tg | grep 88:e6:40:20:30:01 
>  * 33:33:ff:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:18:32   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:00:00:00:02   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    01:00:5e:00:00:fc   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    01:00:5e:00:00:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:ff:00:00:00   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:30:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 2a:d7:9a:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>  * 33:33:ff:00:01:08   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
>    33:33:00:01:00:03   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x74fb4f96)
> 
> [root@sn04]:~ # batctl tg | grep 88:e6:40:20:30:01
>  * 33:33:ff:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:18:32   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:00:00:00:02   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    01:00:5e:00:00:fc   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    01:00:5e:00:00:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:ff:00:00:00   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:30:01   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 2a:d7:9a:06:3e:25   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>  * 33:33:ff:00:01:08   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
>    33:33:00:01:00:03   -1 [....] (  3) 88:e6:40:20:30:01 (  3) (0x2b6d458c)
> 
> [root@sn03]:~ # batctl tl
> [B.A.T.M.A.N. adv 2018.0-3-g4b2b8c68, MainIF/MAC: mesh_fastd/88:e6:40:20:30:01 (bat0/2a:d7:9a:06:3e:25 BATMAN_IV), TTVN: 3]
> Client             VID Flags    Last seen (CRC       )
> 33:33:ff:06:3e:25   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:18:32   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:00:00:02   -1 [.P....]   0.000   (0x2b6d458c)
> 01:00:5e:00:00:fc   -1 [.P....]   0.000   (0x2b6d458c)
> 01:00:5e:00:00:01   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:00:00   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:30:01   -1 [.P....]   0.000   (0x2b6d458c)
> 2a:d7:9a:06:3e:25   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:ff:00:01:08   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:01:00:03   -1 [.P....]   0.000   (0x2b6d458c)
> 33:33:00:00:00:01   -1 [.P....]   0.000   (0x2b6d458c)

In the attachment, I added a capture of the received full table response from
sn04 -> NDS-Agenda21Haus70 taken on NDS-Agenda21Haus70. I checked, that this
is really an intermediate response of sn04 and not a direct response from sn03
(sn03 is not receiving queries from NDS-Agenda21Haus70).

Files

missing_tg_entry.pcap (210 Bytes) missing_tg_entry.pcap

Sven Eckelmann, 05/10/2018 08:35 AM

Actions

Copy link

Updated by Sven Eckelmann almost 7 years ago

Description updated (diff)

Actions

Copy link

Updated by Sven Eckelmann almost 7 years ago

Wild and untested idea:

two events happen at the same time at sn04
both are processed at the same time in different contexts
- context a sees the list (hash) of entries not yet updated (11 entries) and calculates crc 0x2b6d458c
- context b sees the list (hash) of entries which was already updated (10 entries) and calculates crc 0x74fb4f96
batadv_tt_global_update_crc does the crc calculation without a lock and thus operations can happen iterleaved
- context b finishes early (seen from crc memory write/commit perspective) and saves crc 0x74fb4f96
- context a finishes late (seen from crc memory write/commit perspective) and saves crc 0x2b6d458c

The hash (bat_priv->tt.local_hash) will then have only 10 entries and the crc of 11 entries. It will therefore not recover itself using the data from the originator to which the global entries belong to (because the crc is correct). It will still forward (as "knowing" intermediate node) the wrong number of entries to the querier of the TT full table for this originator.

Looks to me like the tt.crc access must always be protected with a specific lock for the complete hash (bat_priv->tt.local_hash or bat_priv->tt.global_hash).

Actions

Copy link

Updated by Sven Eckelmann almost 7 years ago

Antonio just mentioned that TT responses or OGM messages could trigger recalculations of the CRC. And multiple recalculations for the same originator are prevented using the orig_node->tt_lock (only different originators can calculate their CRC at the same time).

Actions

Copy link

Updated by Linus Lüssing almost 7 years ago

From looking at the code it seems to me that a node wrongly adding a ROAM or TEMP flag to a multicast TT entry in an OGM or TT Response might cause similar issues in tt_global_add() either here:

https://elixir.bootlin.com/linux/latest/source/net/batman-adv/translation-table.c#L1752

or here:

https://elixir.bootlin.com/linux/latest/source/net/batman-adv/translation-table.c#L1781

Which would delete a multicast TT entry for the same address on any other node, without updating any CRC.

As in Leonardo's setup we already observed bogus multicast TT entries with the 'W' and 'I' flag set it seems likely that this/those broken node(s) might inject a ROAM or TEMP flag in multicast TT entries, too.

If we can verify that this is indeed the issue for Leonardo then a simple fix should be to add a "!is_multicast_ether_addr()" check for the referenced two cases, to avoid interpreting these flag bits as ROAM or TEMP for multicast TT.

I'll hopefully be able to verify this this weekend.

Actions

Copy link