Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FRR 10.0.1 route maps filtering prefixes that should be sent #16367

Open
2 tasks done
fedepaol opened this issue Jul 11, 2024 · 17 comments
Open
2 tasks done

FRR 10.0.1 route maps filtering prefixes that should be sent #16367

fedepaol opened this issue Jul 11, 2024 · 17 comments
Assignees
Labels
bgp triage Needs further investigation

Comments

@fedepaol
Copy link

Description

This is happening in metallb/frr-k8s#166 , where I am bumping the version used in frr-k8s from 9.1 to 10.0.1. The FRR-K8s CI is quite stable, while after the bump I am seeing a few flakes.

172.18.0.2 tries to announce 192.168.0.0/24 and 172.18.0.5 is not receiving it

I am going to attach the full configuration, but adding here the relevant snippets:

Neighbor route map:

address-family ipv4 unicast
  network 192.168.0.0/24
  network 192.168.1.0/24
  network 192.168.2.0/24
  network 192.168.3.0/24
  neighbor 172.18.0.5 activate
  neighbor 172.18.0.5 route-map 172.18.0.5-in in
  neighbor 172.18.0.5 route-map 172.18.0.5-out out

Prefix lists:

ip prefix-list 172.18.0.5-100-ipv4-localpref-prefixes seq 1 permit 192.168.0.0/24
ip prefix-list 172.18.0.5-100-ipv4-localpref-prefixes seq 2 permit 192.168.1.0/24
ip prefix-list 172.18.0.5-pl-ipv4 seq 1 permit 192.168.0.0/24
ip prefix-list 172.18.0.5-pl-ipv4 seq 2 permit 192.168.1.0/24
ip prefix-list 172.18.0.5-pl-ipv4 seq 3 permit 192.168.2.0/24
ip prefix-list 172.18.0.5-pl-ipv4 seq 4 permit 192.168.3.0/24
ip prefix-list 172.18.0.5-150-ipv4-localpref-prefixes seq 1 permit 192.168.2.0/24
ip prefix-list 172.18.0.5-inpl-ipv4 seq 1 deny any
ip prefix-list 172.18.0.5-denied-inpl-ipv4 seq 1 permit 192.167.9.0/24 le 32

Route maps:


route-map 172.18.0.5-out permit 1
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit
!
route-map 172.18.0.5-out permit 2
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit
!
route-map 172.18.0.5-out permit 3
 match ip address prefix-list 172.18.0.5-150-ipv4-localpref-prefixes
 on-match next
 set local-preference 150
exit
!
route-map 172.18.0.5-out permit 4
 match ip address prefix-list 172.18.0.5-pl-ipv4
exit
!
route-map 172.18.0.5-out permit 5
 match ipv6 address prefix-list 172.18.0.5-pl-ipv4
exit
!

Show bgp ipv4 on 172.18.0.2

Instance default:
BGP table version is 4, local router ID is 172.18.0.2, vrf id 0
Default local pref 100, local AS 64512
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *> 192.168.0.0/24   0.0.0.0                  0         32768 i
 *> 192.168.1.0/24   0.0.0.0                  0         32768 i
 *> 192.168.2.0/24   0.0.0.0                  0         32768 i
 *> 192.168.3.0/24   0.0.0.0                  0         32768 i

Show bgp ipv4 on 172.18.0.5:

Dumping show bgp ip for ebgp-single-hop, local addresses: ipv4 - 172.18.0.5, ipv6 - fc00:f853:ccd:e793::5

Instance default:
BGP table version is 20, local router ID is 172.18.0.5, vrf id 0
Default local pref 100, local AS 4200000000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *= 192.168.0.0/24   172.18.0.4               0             0 64512 i
 *>                  172.18.0.3               0             0 64512 i
 *= 192.168.1.0/24   172.18.0.4               0             0 64512 i
 *>                  172.18.0.3               0             0 64512 i
 *= 192.168.2.0/24   172.18.0.4               0             0 64512 i
 *>                  172.18.0.3               0             0 64512 i
 *= 192.168.3.0/24   172.18.0.4               0             0 64512 i
 *>                  172.18.0.3               0             0 64512 i

(as we can see some other instances were able to push the routes.

The logs say the prefix is being filtered by the route-map:


2024/07/11 11:05:26.147 BGP: [WJC21-WPGZF] u171:s171 announcing routes to 172.18.0.5, combined into 1 peers
2024/07/11 11:05:26.147 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.0.0/24, selected=0x7fb35fffd0b0
2024/07/11 11:05:26.147 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.0.0/24 is filtered by route-map '172.18.0.5-out'
2024/07/11 11:05:26.147 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.1.0/24, selected=0x7fb35fffd1f0
2024/07/11 11:05:26.147 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.1.0/24 is filtered by route-map '172.18.0.5-out'
2024/07/11 11:05:26.147 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.2.0/24, selected=0x7fb35fffd330
2024/07/11 11:05:26.147 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.2.0/24 is filtered by route-map '172.18.0.5-out'
2024/07/11 11:05:26.147 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.3.0/24, selected=0x7fb35fffd470
2024/07/11 11:05:26.147 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.3.0/24 is filtered by route-map '172.18.0.5-out'
2024/07/11 11:05:26.149 BGP: [WJC21-WPGZF] u174:s174 announcing routes to 172.30.0.3, combined into 1 peers

Version

I am experiencing this using the images quay.io/frrouting/frr:10.0.1 and quay.io/frrouting/frr:master

How to reproduce

It's a flake, the same test sometimes passes and some other time it does not. I have the feeling that it is associated to prefixlist with community / local preferences. On the other hand, it happens often enough to be able to provide the right dump commands after a test fails.

Expected behavior

The neighbor should receive the prefixes

Actual behavior

The instance running on the cluster is not advertising the prefixes.

Additional context

No response

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@fedepaol fedepaol added the triage Needs further investigation label Jul 11, 2024
@fedepaol
Copy link
Author

Full running config of 172.18.0.5 (the receiver)

####### Show running config
Building configuration...

Current configuration:
!
frr version 10.0.1_git
frr defaults traditional
hostname cd1e47ac22ac
log file /tmp/frr.log
log timestamp precision 3
!
debug zebra nht
debug bgp neighbor-events
debug bgp nht
debug bgp updates in
debug bgp updates out
debug bfd peer
!
password zebra
!
router bgp 4200000000
 bgp router-id 172.18.0.5
 no bgp ebgp-requires-policy
 no bgp default ipv4-unicast
 no bgp network import-check
 neighbor 172.18.0.2 remote-as 64512
 neighbor 172.18.0.3 remote-as 64512
 neighbor 172.18.0.4 remote-as 64512
 !
 address-family ipv4 unicast
  neighbor 172.18.0.2 activate
  neighbor 172.18.0.2 next-hop-self
  neighbor 172.18.0.3 activate
  neighbor 172.18.0.3 next-hop-self
  neighbor 172.18.0.4 activate
  neighbor 172.18.0.4 next-hop-self
 exit-address-family
exit
!
route-map RMAP permit 10
 set ipv6 next-hop prefer-global
exit
!
end

@fedepaol
Copy link
Author

Running config for the instance running on the cluster

runningconfig.txt

@fedepaol
Copy link
Author

logs of the instance running on the cluster
frrlogs.txt

@ton31337 ton31337 added the bgp label Jul 11, 2024
@ton31337 ton31337 self-assigned this Jul 11, 2024
@ton31337
Copy link
Member

ton31337 commented Jul 12, 2024

route-map 172.18.0.5-out permit 1
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit
!
route-map 172.18.0.5-out permit 2
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit

Two identical match/set with different sequence numbers?

Btw, could you please enable debug route-map detail? It would at least tell what's going on a bit more.

Also, I'm trying to understand the use case here with on-match next. Could you drop it and test without on-match next, because you don't use it at all (as it should be).

@fedepaol
Copy link
Author

fedepaol commented Jul 12, 2024

route-map 172.18.0.5-out permit 1
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit
!
route-map 172.18.0.5-out permit 2
 match ip address prefix-list 172.18.0.5-100-ipv4-localpref-prefixes
 on-match next
 set local-preference 100
exit

Two identical match/set with different sequence numbers?

Ugh, I guess we can remove them

Btw, could you please enable debug route-map detail? It would at least tell what's going on a bit more.

Yup, will do and re-run

Also, I'm trying to understand the use case here with on-match next. Could you drop it and test without on-match next, because you don't use it at all (as it should be).

The idea (maybe I am mis-using it) is to apply the local pref and then validate the ip later on to the following maps. I need it in case I want also to apply a community to the same prefixes

@fedepaol
Copy link
Author

@ton31337 here we go:

2024/07/12 15:53:58.328 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.0.0/24, selected=0x7f56a3a81330
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-100-ipv4-localpref-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:100-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-150-ipv4-localpref-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:101-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-pl-ipv4 specified does not exist defaulting to NO_MATCH
2024/07/12 15:53:58.328 BGP: [KZK58-6T4Y6] No best match sequence for pfx: 192.168.0.0/24 in route-map: 172.30.0.2-out, result: no match


Show running config and full log attached :
showrunning.txt.gz

fulllog.txt.gz

@fedepaol
Copy link
Author

I checked only the first, seem to be there, so I'd say it's not related to the redundant match (although it makes sense to remove it).

@ton31337
Copy link
Member

Is this happening during frr-reload, on boot, container restart?

@fedepaol
Copy link
Author

Definitely during reload. It's a battery of e2e tests, where on each test we run a different configuration and do some assertions.

@ton31337
Copy link
Member

Could it be possible to have a full reload log? frr-reload.py --debug --reload /etc/frr/frr.conf. I'd like to understand what actually triggers this flake...

@fedepaol
Copy link
Author

Here's another run:


This is the neighbor complaining about missing routes (172.18.0.5) 

Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *>i192.168.2.0/24   172.18.0.2               0    100      0 i
 *=i                 172.18.0.4               0    100      0 i

Displayed 1 routes and 2 total paths




This is 172.18.0.3:


Instance default:
BGP table version is 2, local router ID is 172.18.0.3, vrf id 0
Default local pref 100, local AS 64512
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

    Network          Next Hop            Metric LocPrf Weight Path
 *> 192.168.2.0/24   0.0.0.0                  0         32768 i
 *> 192.169.2.0/24   0.0.0.0                  0         32768 i

Displayed 2 routes and 2 total paths




ip prefix-list 172.18.0.5-pl-ipv4 seq 1 permit 192.168.2.0/24

Still have the filter

2024/07/23 15:29:13.553 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.2.0/24 is filtered by route-map '172.18.0.5-out'



@fedepaol
Copy link
Author

Last reload event:
lastreload.txt.gz

@fedepaol
Copy link
Author

Sequence of reloads in the last 10 minutes
reload.txt.gz

Every reload sequence starts with
Caught SIGHUP and acquired lock! Reloading FRR.. and ends with FRR reloaded successfully! 2 seconds

Every cycle, we do a frr-reload.py --test first and a frr-reload.py --overwrite later.

@fedepaol
Copy link
Author

frr.zip

@fedepaol
Copy link
Author

As discussed on slack, I attached the logs of the reloader and the generated configuration when running with 9.1 (the generated config is the same regardless).

@ton31337
Copy link
Member

ton31337 commented Jul 31, 2024

First, this is the wrong sequence of operations:

ip prefix-list 172.30.0.2-pl-ipv4 seq 1 permit 192.168.2.0/24
ip prefix-list 172.30.0.2-pl-ipv4 seq 2 permit 192.169.2.0/24
route-map 172.30.0.2-out permit 1
  match ip address prefix-list 172.30.0.2-pl-ipv4
route-map 172.30.0.2-out permit 2
  match ipv6 address prefix-list 172.30.0.2-pl-ipv4

ipv6 prefix-list 172.30.0.2-pl-ipv4 seq 3 deny any <<<<<<<<<<<<<<<<<

ipv6 prefix-list ... should be created before route-map statements to avoid noisy logs like below.

Looking at the logs you provided, it's not clear which AFI triggered this output (I assume IPv6):

2024/07/12 15:53:58.328 BGP: [RYF1Z-ZKDRS] route_match_address_prefix_list: Prefix List 172.30.0.2-10:100-ipv4-community-prefixes specified does not exist defaulting to NO_MATCH

Could you move ipv6 prefix-list ... creation above route-maps and re-test. I'd like to get rid of any false positives.

Also, could you enable log commands to see what actual commands were executed and in which order.

@fedepaol
Copy link
Author

Sorry for the extreme delay @ton31337 , here I am with the additional information:

Here we have 172.18.0.5 not getting routes from 172.18.0.2.

From the logs the usual:

2024/09/12 16:01:41.695 BGP: [WJC21-WPGZF] u109:s109 announcing routes to 172.18.0.5, combined into 1 peers
2024/09/12 16:01:41.695 BGP: [T5JFA-13199] subgroup_process_announce_selected: p=192.168.2.0/24, selected=0x7fb5d00e47d0
2024/09/12 16:01:41.695 BGP: [KZK58-6T4Y6] No best match sequence for pfx: 192.168.2.0/24 in route-map: 172.18.0.5-out, result: no match
2024/09/12 16:01:41.695 BGP: [H5AW4-JFYQC] Route-map: 172.18.0.5-out, prefix: 192.168.2.0/24, result: deny
2024/09/12 16:01:41.695 BGP: [VERY4-P6JC8] 172.18.0.5(Unknown) [Update:SEND] 192.168.2.0/24 is filtered by route-map '172.18.0.5-out'

I applied the requested changes (define the prefix lists before the maps and adding log commands in the configuration):

Attaching a tgz with:

  • the frr log
  • the configuration file
  • the output of show running config
  • the logs of the last 3 reloads

Also, note that in a few reloads I am seeing

BGP is already running; AS is 64512
line 684: Failure to communicate[13] to bgpd, line: router bgp 64513 vrf red

line 685: Warning[4]...: early exit from config file
[2101|bgpd] Configuration file[/etc/frr/frr.conf] processing failure: 13
[2094|mgmtd] done
2024-09-12 15:57:51,557 �[91mWARNING�[0m: frr-reload.py failed due to
vtysh (exec file) exited with status 13

Not sure it's related or not. I am going to attach also a file with more reload events (reload_full.txt) which contains them too.

logs.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bgp triage Needs further investigation
Projects
None yet
Development

No branches or pull requests

2 participants