New link struct - Vxlan #1532
Conversation
Force-pushed from 45201fd to aa614a6
@hellt this one is fine to get reviewed
remove vxlan stitched host interfaces; use EndpointGeneric in EndpointVxlan; generalize MTU handling for srl config template
Force-pushed from bd4f765 to 73bbd91
The tools command still needs to be migrated.
note to self:
Force-pushed from 65e3414 to 30cdcf1
nodes/srl/srl.go
Outdated
	}
	iface.Mtu = mtu
}
iface.Mtu = e.GetLink().GetMtu()
This is what causes the config to break now in the srl tests in CI.
Two issues here:
- when we use the srl node kind, the errors from srl postdeploy are not visible (need to understand why)
- when changed to the nokia_srlinux kind, the following becomes evident:
ERRO[0017] failed to run postdeploy task for node srl1: command execution error:At line 10: Error: Error in path: .interface{.name=="ethernet-1/1"}.mtu
[InvalidArgument] mtu 9500 out of range [1500-9412]
ERRO[0017] failed to run postdeploy task for node srl2: command execution error:At line 13: Error: Error in path: .interface{.name=="ethernet-1/1"}.mtu
[InvalidArgument] mtu 9500 out of range [1500-9412]
The default 9500 MTU shouldn't be used in the srl config, as the 9500 link MTU is a veth MTU only and is not meant to be configured in NOSes.
In b51def3 I changed this so that the mtu value is set in the template endpoints struct only if it is not the default veth mtu value.
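A minimal sketch of that guard, reusing the same imports and types as the srl config templating code; the DefaultLinkMTU constant name and the tplInterface type are assumptions for illustration, not the exact PR code:

// setTemplateMTU propagates the link MTU into the node config template only
// when it differs from the default veth MTU (9500B), which is meant for the
// Linux veth pair and must not leak into the NOS interface config.
func setTemplateMTU(iface *tplInterface, e links.Endpoint) {
	if mtu := e.GetLink().GetMtu(); mtu != links.DefaultLinkMTU { // assumed constant name
		iface.Mtu = mtu
	}
}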
So here is the issue with SR Linux and low-MTU interfaces.
SR Linux emulates a Broadcom datapath, and the port MTU on recent hw platforms can't be less than 1500B.
When using vxlan interfaces we ultimately rely on the MTU of the outgoing (parent) interface, which is often configured with 1500B itself.
This leaves the vxlan interface with a 1450B MTU, which is too low and can't be configured on the SR Linux platform.
Because of that, using vxlan interfaces inside the SRL netns can only be done if the parent interface of the vxlan tunnel has a 1550B+ MTU.
In practice, with connections to virtual labs, this likely means that we will still use the vxlan-stitched mode more than vxlan-embed.
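Putting numbers on it (VXLAN-over-IPv4 encapsulation adds 50B of overhead: 20B outer IP + 8B UDP + 8B VXLAN header + 14B inner Ethernet):

vxlan MTU = parent MTU - 50B
1500B parent -> 1450B vxlan   (below the 1500B SR Linux port minimum)
1550B parent -> 1500B vxlan   (smallest parent MTU that works)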
tests/08-vxlan/01-vxlan.clab.yml
Outdated
name: vxlan

topology:
  nodes:
    s1:
      kind: srl
      image: ghcr.io/nokia/srlinux
      startup-config: 01-vxlan-s1.config
  links:
    - type: vxlan
      endpoint:
        node: s1
        interface: e1-1
        mac: 02:00:00:00:00:04
      remote: 192.168.66.1
      vni: 100
      udp-port: 5555
I think instead of this lab we can use the following:
name: vxlan-embed
topology:
  nodes:
    l1:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.1/30
      mgmt-ipv4: 172.20.20.91
    l2:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.2/30
      mgmt-ipv4: 172.20.20.92
  links:
    - type: vxlan
      endpoint:
        node: l1
        interface: l1eth1
      remote: 172.20.20.92
      vni: 100
    - type: vxlan
      endpoint:
        node: l2
        interface: l2eth1
      remote: 172.20.20.91
      vni: 100
That way you don't have to touch the host at all (and clean it up afterwards, which is not happening currently).
Note that this lab can't be deployed today; there is some error about the same file being present...
Looking at the code, it seems this lab is not possible to deploy. Since we have to find the outgoing interface (I wonder if this is a mandatory step?), we rely on the host's routing stack. It makes sense, since in real-life applications the tunnels are established between host nodes, not from within the containers as in the lab above.
It is not possible to deploy this one, since vxlan links created in a single netns can't have the same VNI.
I will rework this lab.
It is still worth looking into this more closely. Could be an alpine issue or something. Everything below this line is an early observation.
This lab works, but with a prerequisite.
When the vxlan interface is first created in the host netns and then moved to the container netns, it needs to initiate the connection towards the remote first.
If the remote end (l2) starts to ping l1 first, it won't succeed; l1 needs to start pinging first.
name: vxlan-embed
topology:
  nodes:
    l1:
      kind: linux
      image: alpine:3
      exec:
        - ip addr add dev eth1 192.168.0.1/30
      mgmt-ipv4: 172.20.20.91
    l2:
      kind: linux
      image: alpine:3
      exec:
        - >
          ash -c '
          apk add iproute2 &&
          ip link add name vxlan0 type vxlan id 100 remote 172.20.20.91 dstport 14789 &&
          ip l set dev vxlan0 up &&
          ip addr add dev vxlan0 192.168.0.2/30'
      mgmt-ipv4: 172.20.20.92
  links:
    - type: vxlan
      endpoint:
        node: l1
        interface: eth1
      remote: 172.20.20.92
      vni: 100
There is something strange happening with vxlan tunnels moved to the container netns.
I have built a topology that is deployed on two different hosts like this:
#host1
name: vxlan-embed
prefix: ""
mgmt:
  network: fixedips
  ipv4_subnet: 100.103.1.0/24
topology:
  kinds:
    linux:
      image: ghcr.io/hellt/network-multitool
  nodes:
    node1:
      kind: linux
      mgmt_ipv4: 100.103.1.11
      exec:
        - ip addr add dev eth1 192.168.0.1/30
  links:
    - type: vxlan
      endpoint:
        node: node1
        interface: eth1
        mac: 02:00:00:00:00:02
      remote: 100.103.2.22
      vni: 100
      udp-port: 4789
#host2
name: vxlan-embed
prefix: ""
mgmt:
  network: fixedips
  ipv4_subnet: 100.103.2.0/24
topology:
  kinds:
    linux:
      image: ghcr.io/hellt/network-multitool
  nodes:
    node2:
      kind: linux
      mgmt_ipv4: 100.103.2.22
      exec:
        - ip addr add dev eth1 192.168.0.2/30
  links:
    - type: vxlan
      endpoint:
        node: node2
        interface: eth1
        mac: 02:00:00:00:00:01
      remote: 100.103.1.11
      vni: 100
      udp-port: 4789
Ping from 192.168.0.1 to 192.168.0.2 and vice versa does not work.
The first thing I see is that the vxlan tunnels are created with the “nolearning” flag. This disables the snooping of incoming packets that builds the MAC-to-VTEP-address fdb table.
2829: eth1@if2829: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 02:00:00:00:00:01 brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 100 remote 100.103.6.11 dev if30 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 192.168.0.1/30 scope global eth1
If we look at fdbs in node1 and node2:
host1:
00:00:00:00:00:00 dev eth1 dst 100.103.2.22 via ifindex 9 link-netnsid 0 self permanent
host2:
00:00:00:00:00:00 dev eth1 dst 100.103.1.11 via ifindex 30 link-netnsid 0 self permanent
BUM traffic will be delivered to those dst IP addresses.
These ifindexes (nexthops) are root-namespace interfaces. They are the interfaces used to reach the tunnel destination IP; that can be the mgmt bridge or another interface on the host.
If we do a tcpdump to check vxlan traffic, we see that it’s not coming from the “mgmt_ipv4” of the node but from the IP of the bridge in the host. It looks like the udp socket of the vxlan tunnel is created in the root namespace. I understand that’s also the reason you see “udp port 4789 unreachable” in the tcpdumps of the container.
If I add fdb entries, using IPs in the root namespace:
##host1
ip netns exec node1 bridge fdb append 00:00:00:00:00:00 dev eth1 self dst 100.103.2.1 vni 100 port 4789
##host2
ip netns exec node2 bridge fdb append 00:00:00:00:00:00 dev eth1 self dst 100.103.1.1 vni 100 port 4789
.. traffic starts to flow immediately.
Hope it helps.
Hi @michelredondo
We will bring the learning flag back.
As for the interface IP - it is indeed expected that the interface belongs to the root netns, as this is where the tunnel interface was created.
In general, I feel that moving the vxlan interface to the container netns is a bit hacky.
Even with the learning flag, topologies like the one I described will not work. Both containers have the “00:00:00:00:00:00” fdb entry pointing to the incorrect IP, so it will never be possible to learn MACs from the other tunnel end.
The https://github.com/srl-labs/containerlab/blob/main/tests/08-vxlan/01-vxlan.clab.yml test case works because the srl1 fdb entry points to the correct IP (172.20.25.22). When you originate traffic from that side, srl1 sends BUM traffic to the l2 node and l2 learns how to send traffic back.
If you try to originate traffic from the l2 node, BUM traffic is sent to the srl1 mgmt IP (172.20.25.21), but the udp vxlan socket is not listening there; it’s listening in the root namespace (bridge IP: 172.20.25.1). Learning in this case never happens.
The only way I can think of to make this work is to add an additional fdb entry pointing to the IP address in the root ns. Maybe we could add this as an optional parameter in the link definition.
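As a purely hypothetical illustration, such an optional knob in the link definition could look like this (vtep-fdb-dst is a made-up parameter name, not an existing containerlab option):

links:
  - type: vxlan
    endpoint:
      node: l2
      interface: eth1
    remote: 172.20.25.21
    vni: 100
    # made-up optional parameter: add an extra fdb entry pointing at the
    # root-namespace IP where the vxlan udp socket actually listens
    vtep-fdb-dst: 172.20.25.1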
links/link.go
Outdated
// rename the link created with random name if its length is acceptable by linux
if len(endpt.GetIfaceName()) < 15 {
	err := netlink.LinkSetName(l, endpt.GetIfaceName())
	if err != nil {
		return fmt.Errorf(
			"failed to rename link: %v", err)
	}
} else {
	// else we set the desired long name as alias
	// in future we need to set it as an alternative name,
	// pending https://twitter.com/ntdvps/status/1709580216648024296
	err := netlink.LinkSetAlias(l, endpt.GetIfaceName())
	if err != nil {
		return fmt.Errorf(
			"failed to add alias: %v", err)
	}
Here is a ~big change I made to the original proposal.
Instead of creating some hashed name for the interface I propose the following:
- if an interface name is 14+ chars, we keep the random name that has already been generated by clab
- the long name we set as an alias for the interface. The alias is like a description of the interface and can contain 128 chars. The alias, though, doesn't let you use ip link show <alias>.
- I am waiting on Add support for alternative names vishvananda/netlink#862 to merge so we can use an alternative name instead of the alias. The altname allows users to reference the interface in the ip utility (see the illustration after this list).
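For illustration, the altname behaviour we would get once that lands roughly matches what iproute2 already exposes on plain links; the device and long name below are made up:

# assign the desired long name as an alternative name to the short, randomly named link
ip link property add dev clab-4f2a1b altname srl1-ethernet-1-1-long-name
# unlike an alias, the altname can be used to address the interface directly
ip link show dev srl1-ethernet-1-1-long-name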
// VethCleanup iterates over links found in clab topology to initiate removal of dangling veths
// in host networking namespace or attached to linux bridge.
// See https://github.com/srl-labs/containerlab/issues/842 for the reference.
func (c *CLab) VethCleanup(ctx context.Context) error {
	hostBasedEndpoints := []links.Endpoint{}

	// collect the endpoints of regular nodes
	for _, n := range c.Nodes {
		if n.Config().IsRootNamespaceBased || n.Config().NetworkMode == "host" {
			hostBasedEndpoints = append(hostBasedEndpoints, n.GetEndpoints()...)
		}
	}

	// collect the endpoints of the fake nodes
	hostBasedEndpoints = append(hostBasedEndpoints, links.GetHostLinkNode().GetEndpoints()...)
	hostBasedEndpoints = append(hostBasedEndpoints, links.GetMgmtBrLinkNode().GetEndpoints()...)

	var joinedErr error
	for _, ep := range hostBasedEndpoints {
		// finally remove all the collected endpoints
		log.Debugf("removing endpoint %s", ep.String())
		err := ep.Remove()
		if err != nil {
			joinedErr = errors.Join(joinedErr, err)
		}
	}

	return joinedErr
}
We need to find a solution for this removed func. While working on the vxlan-stitch I see that when I remove the lab, the host link remains.
This was the reason to have this func.
The Remove() function is implemented on the vxlan link and forwarded to the local endpoint. The local endpoint is actually a generic endpoint. My thinking as of now is that the removal is triggered but executed in the wrong namespace - that of the container, not the host namespace.
Let me verify.
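If that hypothesis holds, a rough sketch of forcing the deletion into the host namespace could look like this (the helper name and the way the host netns path is obtained are assumptions for illustration, not the actual PR code):

import (
	"fmt"

	"github.com/containernetworking/plugins/pkg/ns"
	"github.com/vishvananda/netlink"
)

// removeInHostNs deletes the named link while executing in the host
// network namespace instead of the container namespace.
func removeInHostNs(hostNsPath, ifaceName string) error {
	return ns.WithNetNSPath(hostNsPath, func(_ ns.NetNS) error {
		l, err := netlink.LinkByName(ifaceName)
		if err != nil {
			return fmt.Errorf("failed to lookup link %s: %w", ifaceName, err)
		}
		return netlink.LinkDel(l)
	})
}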
links/link_vxlan.go
Outdated
ip := net.ParseIP(lr.Remote)
// if the returned ip is nil, an error occured.
// we consider, that we maybe have a textual hostname
// e.g. dns name so we try to resolve the string next
if ip == nil {
	ips, err := net.LookupIP(lr.Remote)
	if err != nil {
		return nil, err
	}

	// prepare log message
	sb := strings.Builder{}
	for _, ip := range ips {
		sb.WriteString(", ")
		sb.WriteString(ip.String())
	}
	log.Debugf("looked up hostname %s, received IP addresses [%s]", lr.Remote, sb.String()[2:])

	// always use the first address
	if len(ips) <= 0 {
		return nil, fmt.Errorf("unable to resolve %s", lr.Remote)
	}
	ip = ips[0]
}

parentIf := lr.ParentInterface

if parentIf == "" {
	conn, err := rtnl.Dial(nil)
	if err != nil {
		return nil, fmt.Errorf("can't establish netlink connection: %s", err)
	}
	defer conn.Close()
	r, err := conn.RouteGet(ip)
	if err != nil {
		return nil, fmt.Errorf("failed to find a route to VxLAN remote address %s", ip.String())
	}
	parentIf = r.Interface.Name
}
@steiler I think this code piece is now shared between
func (lr *LinkVxlanRaw) resolveStitchedVxlanComponent(params *ResolveParams) (*LinkVxlan, error)
and
func (lr *LinkVxlanRaw) resolveVxlan(params *ResolveParams) (Link, error)
It is a good candidate to create two common funcs in utils.
No, not in utils, but still in LinkVxlanRaw.
However, there is a slight difference in the vxlan endpoint part.
Anyway, let me see how we can carve the common part out.
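One possible shape for the shared piece, kept as a method on LinkVxlanRaw inside links/link_vxlan.go and using the same imports as the code above; the method name and return values are assumptions for illustration, not the final refactor:

// resolveRemoteAndParent sketches the common logic: resolve the remote
// address (IP literal or DNS name) and, if no parent interface was given,
// derive it from the host routing table.
func (lr *LinkVxlanRaw) resolveRemoteAndParent() (net.IP, string, error) {
	ip := net.ParseIP(lr.Remote)
	if ip == nil {
		ips, err := net.LookupIP(lr.Remote)
		if err != nil {
			return nil, "", err
		}
		if len(ips) == 0 {
			return nil, "", fmt.Errorf("unable to resolve %s", lr.Remote)
		}
		// always use the first address
		ip = ips[0]
	}

	parentIf := lr.ParentInterface
	if parentIf == "" {
		conn, err := rtnl.Dial(nil)
		if err != nil {
			return nil, "", fmt.Errorf("can't establish netlink connection: %s", err)
		}
		defer conn.Close()
		r, err := conn.RouteGet(ip)
		if err != nil {
			return nil, "", fmt.Errorf("failed to find a route to VxLAN remote address %s", ip)
		}
		parentIf = r.Interface.Name
	}

	return ip, parentIf, nil
}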
links/link_vxlan_stitched.go
Outdated
// Deploy provisions the stitched vxlan link with all its underlaying sub-links
func (l *VxlanStitched) Deploy(ctx context.Context) error {
	return l.deploy(ctx, false)
}

func (l *VxlanStitched) deploy(ctx context.Context, skipVethCreation bool) error {
	// deploy the vxlan link
	err := l.vxlanLink.Deploy(ctx)
	if err != nil {
		return err
	}

	// the veth creation might be skipped if it already exists
	if !skipVethCreation {
		err = l.vethLink.Deploy(ctx)
		if err != nil {
			return err
		}
	}
@steiler do we need the inner deploy then?
Yes, in my opinion we do.
If you take a look at where the deploy is called from, it is:
// DeployWithExistingVeth provisons the stitched vxlan link whilst the
// veth interface does already exist, hence it is not created as part of this
// deployment
func (l *VxlanStitched) DeployWithExistingVeth(ctx context.Context) error {
	return l.deploy(ctx, true)
}

// Deploy provisions the stitched vxlan link with all its underlaying sub-links
func (l *VxlanStitched) Deploy(ctx context.Context) error {
	return l.deploy(ctx, false)
}
DeployWithExistingVeth is basically there to support the tools command that we have.
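For context, the kind of workflow this supports looks roughly like the following; the exact flag spelling of the tools command is from memory and may differ from the current CLI:

# stitch an existing lab interface to a vxlan tunnel towards a remote host;
# DeployWithExistingVeth matters here because the veth pair already exists
containerlab tools vxlan create --remote 10.0.0.20 --id 100 --link srl_e1-1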
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1532      +/-   ##
==========================================
+ Coverage   48.58%   49.41%   +0.82%
==========================================
  Files         133      136       +3
  Lines       12798    13231     +433
==========================================
+ Hits         6218     6538     +320
- Misses       5874     5961      +87
- Partials      706      732      +26
Implement VxLAN as a struct that implements the links.Link interface.
This is then meant to be utilized via the tools command, but it will also allow for the definition of VxLAN links between containerlab nodes via the topology file.
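For example, mirroring the test lab discussed above, a topology-file definition of such a link looks roughly like this (values are illustrative):

topology:
  nodes:
    s1:
      kind: nokia_srlinux
      image: ghcr.io/nokia/srlinux
  links:
    - type: vxlan
      endpoint:
        node: s1
        interface: e1-1
      remote: 192.168.66.1
      vni: 100
      udp-port: 5555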