
Support for internet gateways #6475

Merged (36 commits, Oct 6, 2024)
Conversation

@rcgoodfellow (Contributor) commented Aug 29, 2024

@rcgoodfellow (Contributor Author):

Something I'm not too sure about with this PR is how it needs to interact with the reconfigurator. If any services move around, we need to trigger the vpc_route_manager background task. @jgallagher do you have some insight here?

// I think we should move decisions around change
// propagation to be based on actual delta
// calculation, rather than trying to manually
// maintain a signal.
@rcgoodfellow (Contributor Author):

Calling this comment out as something to discuss in this PR. CC @FelixMcFelix

Contributor:

Yeah, I had maybe overlooked the scope of things that could change in connection with an Internet Gateway. That's a big surface area to keep an eye on...

I'm not really sure how we would want to work around this, however. The motivation for using router versions was more for performance rather than correctness. My intent was to prevent each Nexus's copy of the background task from querying every Project's/VPC's/Subnet's state on a minute-by-minute (and per-wakeup) basis, since there are a lot of tables which are referenced during route resolution. This partly stems from the granularity of background task wakeups, which don't AFAIK allow a selective wakeup. So my initial feeling is that having Nexus store resolved tables and then send deltas is orthogonal to that?

It could also, in practice, not be too expensive to just rebuild VPC routing tables periodically! In which case, the places where we perform version bumps and/or background task wakeups would keep the system responsive rather than be crucial for correctness. If cost remains a concern, I wonder if we can just use modified timestamps in connected tables to drive the route update process (which would, I hope, be less error-prone).
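
To make the timestamp idea above concrete, here is a minimal sketch of how a route-manager pass could use a change watermark to decide whether anything needs rebuilding. All names and the shape of the query are assumptions for illustration only, not the actual Nexus datastore or background-task APIs.

use std::time::SystemTime;

/// Hypothetical watermark kept by the vpc_route_manager RPW: the newest
/// `time_modified` already folded into the resolved routing tables.
struct RouteManagerState {
    last_seen_modification: Option<SystemTime>,
}

impl RouteManagerState {
    /// `newest_modification` stands in for a cheap query such as the max
    /// `time_modified` across the tables that feed route resolution
    /// (routers, routes, subnets, gateway links, ...).
    fn needs_rebuild(&self, newest_modification: SystemTime) -> bool {
        match self.last_seen_modification {
            // First activation: always resolve from scratch.
            None => true,
            // Rebuild only if something changed since the last pass.
            Some(seen) => newest_modification > seen,
        }
    }

    fn mark_rebuilt(&mut self, newest_modification: SystemTime) {
        self.last_seen_modification = Some(newest_modification);
    }
}

fn main() {
    let mut state = RouteManagerState { last_seen_modification: None };
    let now = SystemTime::now();
    assert!(state.needs_rebuild(now));
    state.mark_rebuilt(now);
    // A wakeup with no intervening modifications can skip the expensive
    // per-VPC resolution entirely.
    assert!(!state.needs_rebuild(now));
}

With this shape, the version bumps and explicit wakeups become latency optimizations rather than correctness requirements, which is the property described above.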

@jgallagher (Contributor):

Something I'm not too sure about with this PR is how it needs to interact with the reconfigurator. If any services move around, we need to trigger the vpc_route_manager background task. @jgallagher do you have some insight here?

Maybe! The bit of reconfigurator that can actually move services around is also an RPW (blueprint_executor). However, it doesn't currently know whether any given execution actually moved services. It's structured as "idempotently send every sled the set of services it should be running", and almost always that does not result in any changes. So I think I have some questions and some options:

  • Does the vpc_route_manager already run periodically, so we're only looking for an optimization to tighten the window between "services moved around" and "VPC route manager fixed things"?
  • If so, would it make sense to trigger vpc_route_manager after every blueprint_executor run? That would effectively cause it to run at the period of blueprint execution + its own timer, I think; I don't know whether that would be okay.
  • I think blueprint_executor could know whether any sleds actually changed, if we modified the sled-agent API it's using to return information about whether it made any changes, at which point it could explicitly trigger vpc_route_manager only if services moved (or maybe only if certain services moved?). There might be wrinkles here I'm not aware of that make this tricky; should we look into this?
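
A minimal sketch of that last option, under the assumption that the sled-agent API could report whether a PUT of services actually changed anything. Every type and name here is hypothetical, not the real blueprint_executor or background-task plumbing.

/// Hypothetical per-sled result: did the sled actually change its set of
/// running services when given the blueprint? Today's sled-agent API does
/// not report this, which is the gap described above.
struct SledPutResult {
    services_changed: bool,
}

/// Stand-in for the handle used to poke another background task awake.
trait TaskHandle {
    fn activate(&self);
}

/// After pushing the blueprint to every sled, wake vpc_route_manager only
/// if at least one sled actually changed its running services.
fn maybe_trigger_route_manager(
    results: &[SledPutResult],
    vpc_route_manager: &dyn TaskHandle,
) {
    if results.iter().any(|r| r.services_changed) {
        vpc_route_manager.activate();
    }
}

struct LoggingHandle;

impl TaskHandle for LoggingHandle {
    fn activate(&self) {
        println!("vpc_route_manager activated");
    }
}

fn main() {
    let results = vec![
        SledPutResult { services_changed: false },
        SledPutResult { services_changed: true },
    ];
    // One sled changed, so the route manager gets an explicit wakeup.
    maybe_trigger_route_manager(&results, &LoggingHandle);
}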

@twinfees twinfees added the customer For any bug reports or feature requests tied to customer requests label Sep 18, 2024
@twinfees twinfees removed the customer For any bug reports or feature requests tied to customer requests label Sep 26, 2024
rcgoodfellow added a commit to oxidecomputer/oxide.rs that referenced this pull request Sep 28, 2024
@rcgoodfellow (Contributor Author) commented Sep 30, 2024

Testing Notes

I performed the following procedure to test the functionality of this PR in a local Omicron development environment.

This test procedure requires oxidecomputer/oxide.rs#853 and oxidecomputer/opte#588

The test creates two IP pools. The first one is the default IP pool for the silo. The second is an IP pool we create to associate with a gateway. In this test the gateway is called sv2, and we'll create a VPC route that ensures that all traffic destined to 45.154.216.0/24 goes through that gateway. Finally, we create a floating IP from each IP pool and assign both to an
instance. Then traffic is sent from the instance to addresses inside and outside 45.154.216.0/24. The traffic destined to 45.154.216.0/24 is observed to have a source address from the second IP pool, as expected. Traffic destined to
prefixes other than 45.154.216.0/24 is observed to have a source address from the first IP pool, as expected.

  1. Create IP pools

IP ranges are ones that work for the network my workstation is on. Adjust as necessary for other environments.

#
# default ip pool
#
oxide ip-pool create \
    --name default \
    --description default

oxide ip-pool range add \
    --pool default \
    --first 192.168.20.32 \
    --last 192.168.20.47

oxide ip-pool silo link \
    --pool default \
    --is-default true \
    --silo recovery

#
# sv2 ip pool
#
oxide ip-pool create \
    --name sv2 \
    --description colo

oxide ip-pool range add \
    --pool sv2 \
    --first 192.168.20.48 \
    --last 192.168.20.63

oxide ip-pool silo link \
    --pool sv2 \
    --is-default false \
    --silo recovery
  2. Create an internet gateway
oxide internet-gateway create \
    --project 'test' \
    --description 'colo gateway' \
    --vpc default \
    --name sv2

oxide internet-gateway ip-pool attach \
    --project 'test' \
    --vpc default \
    --gateway sv2 \
    --json-body igw-pool-attach.json

igw-pool-attach.json

{
    "gateway": "sv2",
    "ip_pool": "sv2",
    "name": "sv2",
    "description": "sv2 pool attachemnt"
}
  3. Create a custom VPC router

This router forwards traffic destined to 45.154.216.0/24 through the sv2 gateway.

oxide vpc router create \
    --project 'test' \
    --vpc default \
    --description 'default router' \
    --name default

oxide vpc router route create \
    --project 'test' \
    --vpc default \
    --router default \
    --json-body igw-route.json

oxide vpc subnet update \
    --project 'test' \
    --vpc default \
    --subnet default \
    --custom-router default

igw-route.json

{
    "name": "sv2",
    "description": "route to sv2",
    "target": {
        "type": "internet_gateway",
        "value": "sv2"
    },
    "destination": {
        "type": "ip_net",
        "value": "45.154.216.0/24"
    }
}
  4. Create a pair of floating IPs
oxide floating-ip create \
    --project 'test' \
    --name 'default-igw' \
    --description 'floating ip for default gateway' \
    --pool default

oxide floating-ip create \
    --project 'test' \
    --name 'sv2-igw' \
    --description 'floating ip for sv2 gateway' \
    --pool sv2
  5. Create an instance

This assumes the existence of a debian12 image. This is just what I was using for testing; the particular image does not matter.

oxide instance from-image \
    --project 'test' \
    --name 'oak' \
    --description 'a test instance' \
    --hostname 'test' \
    --memory 1g \
    --ncpus 2 \
    --image debian12 \
    --size 2g \
    --start
  6. Attach floating IPs
oxide floating-ip attach \
    --project 'test' \
    --floating-ip 'default-igw' \
    --kind instance \
    --parent 'oak'

oxide floating-ip attach \
    --project 'test' \
    --floating-ip 'sv2-igw' \
    --kind instance \
    --parent 'oak'

@rcgoodfellow rcgoodfellow marked this pull request as ready for review September 30, 2024 05:10
@FelixMcFelix (Contributor):

I've implemented the follow-up changes from the minor OPTE-side rework -- I had to work around the issue you mentioned in the oxide.rs PR, but was otherwise able to recreate the testing scenario you described:

kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte6 nat
Port opte6 - Layer nat
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP           SPORT  DST IP       DPORT  HITS  ACTION
ICMP   1.1.1.1          0      10.1.222.13  0      0     NAT
ICMP   9.9.9.9          0      10.1.222.13  0      0     NAT
ICMP   45.154.216.1     2484   10.1.223.1   2484   0     NAT
ICMP   45.154.216.2     2489   10.1.223.1   2489   0     NAT
TCP    142.250.187.206  80     10.1.222.13  35294  1     NAT
TCP    142.250.187.238  80     10.1.222.13  49950  1     NAT
TCP    142.250.187.238  443    10.1.222.13  53824  1     NAT
UDP    1.1.1.1          53     10.1.222.13  44319  1     NAT
UDP    1.1.1.1          53     10.1.222.13  45002  1     NAT
UDP    1.1.1.1          53     10.1.222.13  59701  1     NAT
UDP    9.9.9.9          53     10.1.222.13  44319  1     NAT
UDP    9.9.9.9          53     10.1.222.13  45002  1     NAT
UDP    9.9.9.9          53     10.1.222.13  59701  1     NAT
UDP    51.89.139.80     123    10.1.222.13  40867  0     NAT

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP      SPORT  DST IP           DPORT  HITS  ACTION
ICMP   172.30.0.5  0      1.1.1.1          0      0     NAT
ICMP   172.30.0.5  0      9.9.9.9          0      2     NAT
ICMP   172.30.0.5  2484   45.154.216.1     2484   1     NAT
ICMP   172.30.0.5  2489   45.154.216.2     2489   0     NAT
TCP    172.30.0.5  35294  142.250.187.206  80     0     NAT
TCP    172.30.0.5  49950  142.250.187.238  80     0     NAT
TCP    172.30.0.5  53824  142.250.187.238  443    0     NAT
UDP    172.30.0.5  44319  1.1.1.1          53     0     NAT
UDP    172.30.0.5  45002  1.1.1.1          53     0     NAT
UDP    172.30.0.5  59701  1.1.1.1          53     0     NAT
UDP    172.30.0.5  44319  9.9.9.9          53     0     NAT
UDP    172.30.0.5  45002  9.9.9.9          53     0     NAT
UDP    172.30.0.5  59701  9.9.9.9          53     0     NAT
UDP    172.30.0.5  40867  51.89.139.80     123    0     NAT

Inbound Rules
----------------------------------------------------------------------
ID    PRI  HITS  PREDICATES                           ACTION
1057  5    0     inner.ip.dst=10.1.222.13,10.1.223.1  "Stateful: 172.30.0.5 <=> (external)"
1058  10   0     inner.ip.dst=10.1.222.12             "Stateful: 172.30.0.5 <=> (external)"
DEF   --   0     --                                   "allow"

Outbound Rules
----------------------------------------------------------------------
ID    PRI  HITS  PREDICATES                                                   ACTION
2740  5    4     inner.ether.ether_type=IPv4                                  "Stateful: 172.30.0.5 <=> 10.1.222.13"
                 meta: router-target=ig=abe47d79-49d2-4e82-85c3-81872d02c8f2

2739  5    1     inner.ether.ether_type=IPv4                                  "Stateful: 172.30.0.5 <=> 10.1.223.1"
                 meta: router-target=ig=7eab28c6-979d-45de-b460-6f8d5e3fce2e

2741  10   0     inner.ether.ether_type=IPv4                                  "Stateful: 172.30.0.5 <=> 10.1.222.12"
                 meta: router-target=ig=abe47d79-49d2-4e82-85c3-81872d02c8f2

2742  100  0     inner.ether.ether_type=IPv4                                  "Stateful: 10.1.222.11:0-16383"
                 meta: router-target=ig=abe47d79-49d2-4e82-85c3-81872d02c8f2

2743  255  0     meta: router-target-class=ig                                 "Deny"
DEF   --   0     --                                                           "allow"

kyle@farme:~/gits/omicron$ pfexec opteadm dump-layer -p opte6 router
Port opte6 - Layer router
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP  SPORT  DST IP  DPORT  HITS  ACTION

Inbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES  ACTION
DEF  --   226   --          "allow"

Outbound Rules
----------------------------------------------------------------------
ID   PRI  HITS  PREDICATES                         ACTION
4    26   15    inner.ip.dst=45.154.216.0/24       "Meta: Target = IG(Some(7eab28c6-979d-45de-b460-6f8d5e3fce2e))"
1    31   0     inner.ip.dst=172.30.0.0/22         "Meta: Target = Subnet: 172.30.0.0/22"
2    75   236   inner.ip.dst=0.0.0.0/0             "Meta: Target = IG(Some(abe47d79-49d2-4e82-85c3-81872d02c8f2))"
3    139  0     inner.ip6.dst=fd22:d1be:7ce8::/64  "Meta: Target = Subnet: fd22:d1be:7ce8::/64"
0    267  0     inner.ip6.dst=::/0                 "Meta: Target = IG(Some(abe47d79-49d2-4e82-85c3-81872d02c8f2))"
DEF  --   0     --                                 "deny"

kyle@farme:~/gits/omicron$ pfexec opteadm dump-uft -p opte6
UFT Inbound: 7/65535
----------------------------------------------------------------------
PROTO  SRC IP           SPORT  DST IP       DPORT  HITS  XFORMS
TCP    142.250.187.238  80     10.1.222.13  49950  3     hdr: decap,NAT,, body:
TCP    142.250.187.238  443    10.1.222.13  53824  17    hdr: decap,NAT,, body:
UDP    1.1.1.1          53     10.1.222.13  45002  1     hdr: decap,NAT,, body:
UDP    1.1.1.1          53     10.1.222.13  59701  1     hdr: decap,NAT,, body:
UDP    9.9.9.9          53     10.1.222.13  45002  1     hdr: decap,NAT,, body:
UDP    9.9.9.9          53     10.1.222.13  59701  1     hdr: decap,NAT,, body:
UDP    93.93.131.118    123    10.1.222.13  33128  0     hdr: decap,NAT,, body:

UFT Outbound: 12/65535
----------------------------------------------------------------------
PROTO  SRC IP      SPORT  DST IP           DPORT  HITS  XFORMS
ICMP   172.30.0.5  0      1.1.1.1          0      1     hdr: NAT,encap, body:
ICMP   172.30.0.5  0      9.9.9.9          0      1     hdr: NAT,encap, body:
ICMP   172.30.0.5  2484   45.154.216.1     2484   0     hdr: NAT,encap, body:
ICMP   172.30.0.5  2489   45.154.216.2     2489   1     hdr: NAT,encap, body:
TCP    172.30.0.5  49950  142.250.187.238  80     5     hdr: NAT,encap, body:
TCP    172.30.0.5  53824  142.250.187.238  443    20    hdr: NAT,encap, body:
UDP    172.30.0.5  45002  1.1.1.1          53     1     hdr: NAT,encap, body:
UDP    172.30.0.5  59701  1.1.1.1          53     1     hdr: NAT,encap, body:
UDP    172.30.0.5  45002  9.9.9.9          53     1     hdr: NAT,encap, body:
UDP    172.30.0.5  59701  9.9.9.9          53     1     hdr: NAT,encap, body:
UDP    172.30.0.5  40867  51.89.139.80     123    0     hdr: NAT,encap, body:
UDP    172.30.0.5  33128  93.93.131.118    123    0     hdr: NAT,encap, body:

So far as I can tell now, generic gateway traffic is going out on the correct IP addresses (priority-respecting)!

I suppose this does raise some questions around edge cases (with the caveat that I have been focussed on configuring OPTE, rather than the API side):

  • What is the intended behaviour when an assigned floating IP does not come from any InetGW referenced by the routers an instance knows about?
  • Are we wanting InternetGateways to, at some point, determine what SNAT pool an instance draws from?
  • Discussed out of band a bit, but DB migration needs handling.

I'll run through reviewing the rest of the API (hopefully) tonight -- let me know if you agree with this tagging-style approach.

@FelixMcFelix (Contributor) left a comment:

Thanks a lot for putting this together; this is a huge chunk of work, and I'm excited for this to land soonish.

Reiterating some high-level questions around the actual use/configuration of Internet Gateways in addition to my earlier comment:

  • Is attaching a name and description to each link the right model?
  • What are the intended link cardinalities between an IGW and address sources?
    • I'm guessing that we should be able to link many IP pools/individual addresses to an IGW.
    • Between IGWs, should we be imposing any limitations on linked resources? E.g., an IP pool can only be linked to a single IGW in each VPC.
    • Do linked IpAddrs need to come from a pool linked to the silo? Or are they intended to fixup certain floating IPs? Should they be unique across all VPCs?

@@ -104,6 +108,20 @@ z_swadm () {
pfexec zlogin oxz_switch /opt/oxide/dendrite/bin/swadm $@
}

# only set this if you want to override the version of opte/xde installed by the
# install_opte.sh script
OPTE_COMMIT="7bba530f897f69755723be2218de2628af0eb771"
Contributor:
Marking this to make sure we comment this line out before merging.

Resolved review threads on: illumos-utils/src/opte/port_manager.rs, nexus/db-fixed-data/src/vpc.rs, schema/crdb/dbinit.sql
DELETE FROM omicron.public.router_route WHERE
time_deleted IS NULL AND
name = 'default-v6' AND
vpc_router_id = '001de000-074c-4000-8000-000000000001';
Contributor:

Migration notes:

  • The IDs for the existing inetgw:outbound routes in fixed-data are 001de000-074c-4000-8000-000000000002 and 001de000-074c-4000-8000-000000000003
  • If we're deleting these entries we need to replace them with equivalents: the datastore route ensure will only manage the Subnet entries today.
  • We need to create a default inetgw (and its link to the parent silo's default IP pool) per existing VPC.

@rcgoodfellow (Contributor Author):

Database migration added in schema/crdb/internet-gateway/up13.sql

Comment on lines +242 to +248
RefreshExternalIps {
tx: oneshot::Sender<Result<(), ManagerError>>,
},
Contributor:

I think in future, we can cut a lot of this circuitous movement through InstanceRunner out (including EIP attach/detach) if we place external IP state onto NetworkInterface directly.

Definitely out of scope here given the immediacy of this PR!
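
For context on what that might look like, here is a rough sketch of external IP state carried directly on the interface description. The field names and types are invented for illustration and are not the actual Omicron NetworkInterface definition.

use std::net::IpAddr;

/// Hypothetical external-IP state carried inline rather than threaded
/// through InstanceRunner on attach/detach.
struct ExternalIpState {
    ephemeral_ip: Option<IpAddr>,
    floating_ips: Vec<IpAddr>,
}

/// Illustrative interface description: if the external IPs travel with
/// the NIC, the OPTE port could be configured from this description
/// alone, with no separate refresh path.
struct NetworkInterface {
    name: String,
    private_ip: IpAddr,
    external_ips: ExternalIpState,
}

fn main() {
    let nic = NetworkInterface {
        name: "net0".to_string(),
        private_ip: "172.30.0.5".parse().unwrap(),
        external_ips: ExternalIpState {
            ephemeral_ip: None,
            floating_ips: vec![
                "192.168.20.33".parse().unwrap(),
                "192.168.20.49".parse().unwrap(),
            ],
        },
    };
    println!(
        "{} carries {} floating IPs",
        nic.name,
        nic.external_ips.floating_ips.len()
    );
}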

@rcgoodfellow (Contributor Author):

Noting that test_vpc_routers_custom_delivered_to_instance is now very flaky.

@rcgoodfellow (Contributor Author):

I've taken things for a lap with the test described above, and things still work as expected with the floating IP pinning.

@FelixMcFelix (Contributor):

I think I'm happy with how we percolate information out to instances now -- I've been running through the above gateway scenario in- and out-of-order (i.e., setting up routes before a valid gateway is created) and I haven't been able to break it just yet. So far as I can tell, the OPTE state is quickly updating in all cases I've tried, and our source address picking semantics are preserved.

@FelixMcFelix (Contributor) left a comment:

As mentioned in the OPTE side, I'm happy with the dataplane aspects and how we're setting them up. I think once the DB migration is in, we should be good to go! 🎉

rcgoodfellow and others added 18 commits October 3, 2024 14:32
This change mirrors OPTE#06ccc5b (and its more pedestrian followup
commits) -- InternetGateway routes now tag matched packets such that the
NAT layer can use them for selection. Nexus now produces a per-NIC map,
containing a list of EIP->IGW UUID mappings -- OPTE uses this map to
divide sets of external IPs into separate rules within the NAT layer.
This reverts VPC router calculation to a shared set of rules for all
NICs in a given subnet.

This change was necessary to ensure that, e.g., Ephemeral IP addresses
would be consistently chosen over SNAT (which would have been
unfortunate to hit!). The router layer isn't exactly best-placed to
enforce custom priorities or choices over equal-priority elements.

The downside is that this approach only really works (today) for
Instances. Probes/Services on sled agent discard their external IP
information fairly quickly -- in these cases, I've written IGW tagging
to recreate the legacy (single IGW) behaviour. I'll put this as more evidence
that RPW-izing external IP configuration would be useful.
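
To illustrate the per-NIC mapping described in the commit message above, here is a small sketch of the shape only; the real Nexus and OPTE types differ, and the ID type is a placeholder.

use std::collections::HashMap;
use std::net::IpAddr;

/// Placeholder for an internet gateway UUID.
type InternetGatewayId = u64;

/// Per-NIC map from external IP to the internet gateway(s) it is
/// associated with; the NAT layer can use this to split external IPs
/// into separate per-IGW rule sets.
type EipToIgwMap = HashMap<IpAddr, Vec<InternetGatewayId>>;

/// Collect the external IPs that belong to a particular gateway.
fn addrs_for_gateway(map: &EipToIgwMap, igw: InternetGatewayId) -> Vec<IpAddr> {
    map.iter()
        .filter(|(_, igws)| igws.contains(&igw))
        .map(|(ip, _)| *ip)
        .collect()
}

fn main() {
    let mut map: EipToIgwMap = HashMap::new();
    map.insert("192.168.20.33".parse().unwrap(), vec![1]);
    map.insert("192.168.20.49".parse().unwrap(), vec![2]);
    // Only the address linked to gateway 2 lands in its NAT rule set.
    assert_eq!(addrs_for_gateway(&map, 2).len(), 1);
}
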
Way, way snappier for updating IGW<->EIP mappings.
The place we were missing router version bumps was CRUD operations on
Internet Gateways themselves. The rationale here is that these are
resolved by name, so we need to indicate to the RPW that there may be a
new resource that can actually be resolved.

Returns one `continue` back into active service, since it appears to be
behaving well.
I noticed that bumping all routers' versions was awfully expensive in a
few places after adding it into IGW create/delete. This should hopefully
trim that down a little bit -- it was starting to bite on e.g. VPC
create.

From what I can tell, test_vpc_routers_custom_delivered_to_instance
still remains happy; I'm getting 17s per success on my home helios box.
It's currently possible for several IGWs in a VPC to be linked to the same IP
pool -- we now report all such mappings to OPTE and sled-agent.
OPTE now allows a given address to be selected for any IGW which would
admit it.

Previously, this led to a case where an OPTE port would lack usable NAT
entries for the IGW it was actually configured to use (due to the
presence of a second matching IGW).
@rcgoodfellow rcgoodfellow merged commit e516410 into main Oct 6, 2024
19 checks passed
@rcgoodfellow rcgoodfellow deleted the igw branch October 6, 2024 21:28
Successfully merging this pull request may close these issues.

Implement the Internet Gateway concept