NSX-T Edge Maintenance Mode 2

Edge Maintenance Mode Overview

The NSX-T Edge cluster is a logical grouping of NSX-T Edge virtual machines that provide North-South routing for the workloads in compute clusters. NSX-T Edges can be taken out of production by placing them in maintenance mode, if, for example, an Edge has become inoperable.

In the first post of this series, we looked at placing an NSX-T Edge in Maintenance Mode using the REST API and the CLI. In this post, we will try to get a better understanding of how disruptive Maintenance Mode operations are. I’m curious to see whether the outage will be on the order of milliseconds, tenths of a second, or seconds.

Lab Setup

I’ve introduced a couple of CentOS-based guest VMs to the lab that are well suited for the testing performed during this analysis.

[Figure: lab topology]

The goal with this topology is to have traffic that traverses either Edge Node nsxtedge01 or nsxtedge02. The traffic path will be controlled by selecting which cluster member is in Maintenance Mode. Notice here that traffic from VM6 to VM7 flows over Edge Node nsxtedge01:

[Figure: traffic path from VM6 to VM7 via Edge Node nsxtedge01]

Also, notice that the traffic from VM7 to VM6 is over Edge Node nsxtedge01 as well:

[Figure: traffic path from VM7 to VM6 via Edge Node nsxtedge01]

How disruptive are these Maintenance Mode operations?

The goal is to find out approximately how long it takes for traffic to recover after placing an Edge that houses active gateways into maintenance mode.

Method 1. Failover time measured using continuous ping

Here are the steps in this approach:

  • generate a flow of ICMP Echo Requests, from VM6 to VM7, using a continuous ping of 100/sec
    • root@vm6# ping 192.168.90.90 -i 0.01
  • capture traffic on the vDS uplink port to VM7
    • root@esxcna01-s1# pktcap-uw --switchport 67108880 --dir 1 -o fastping-capture-outbound.pcap
  • place the Edge that houses active gateways into maintenance mode
    • nsxtedge01> set maintenance-mode enabled
  • view the traffic capture in Wireshark, with Time Display Format set to Seconds Since Previous Displayed Packet
  • set the Wireshark display filter to icmp
  • sort on time to find the longest gap between packets
  • use the ICMP sequence numbers to identify the number of unreceived packets (a scripted alternative is sketched after this list)
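
As an alternative to eyeballing this in Wireshark, the longest gap and the number of missing sequence numbers can be pulled out of the capture with a short script. This is only a sketch, assuming scapy is installed and that the capture file carries the name used in the pktcap-uw command above; since the capture is single-direction, each ICMP sequence number should appear at most once.

 # Sketch: find the longest gap and the missing ICMP sequence numbers
 # in the capture taken above (assumes scapy is installed).
 from scapy.all import rdpcap, ICMP

 pkts = [p for p in rdpcap("fastping-capture-outbound.pcap") if ICMP in p]

 worst_gap, missing = 0.0, 0
 for prev, cur in zip(pkts, pkts[1:]):
     gap = float(cur.time) - float(prev.time)     # seconds between captured packets
     lost = cur[ICMP].seq - prev[ICMP].seq - 1    # sequence numbers never captured
     if gap > worst_gap:
         worst_gap, missing = gap, lost

 print(f"longest gap: {worst_gap:.6f} s, packets missing in that gap: {missing}")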

Notice that with “ping 192.168.90.90 -i 0.01” we are seeing 100 packets per second as expected, and that there is some packet loss when Edge nsxtedge01 is placed in maintenance mode:

[Screenshot: continuous ping output showing packet loss during the Maintenance Mode operation]

With the Time Display Format set to Seconds Since Previous Displayed Packet, sorting the capture on time shows that the longest gap between packets is 0.132859 seconds. Packet 1402 has an ICMP sequence number of 1405, while packet 1401 has an ICMP sequence number of 1398.

Results: failover time measured using continuous ping

  • packet 1401 has an ICMP sequence number of 1398
  • packet 1402 has an ICMP sequence number of 1405
  • ICMP Echo Requests with sequence numbers 1399 through 1404 were lost
  • this represents approximately 0.06 seconds of lost data (see the quick check after this list)
  • packet 1402 arrived 0.13 seconds after packet 1401
  • over this 0.13-second period, 0.06 seconds of data was lost
  • keep in mind that the overall system benefits from interfaces that can buffer packets
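
Here is the quick check mentioned above, simply spelling out the arithmetic behind the 0.06-second figure; the sequence numbers and the 100 packets-per-second rate come straight from the capture and the ping command.

 # Back-of-the-envelope check of the numbers above.
 lost = 1405 - 1398 - 1      # 6 echo requests (seq 1399-1404) never arrived
 rate = 100                  # packets per second (ping -i 0.01)
 gap = 0.132859              # seconds between captured packets 1401 and 1402

 print(f"lost traffic: {lost / rate:.2f} s out of a {gap:.3f} s gap")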

Method 2. Failover time measured with iPerf3 UDP Stream

I’ve used iPerf in the past to characterize bandwidth, jitter, and delay between two endpoints, but this is a new spin on using the tool for me.

Here are the steps in this approach:

  • generate a continuous flow of UDP packets, from VM6 to VM7 using iPerf3
  • run iPerf3 in client mode on VM6 and server mode on VM7
    • root@vm6# iperf3 -c 192.168.90.90 -t 60 -i 1 -V -u
    • root@vm7# iperf3 -s
  • capture traffic on the vDS uplink port to VM7
    • root@esxcna01-s1# pktcap-uw --switchport 67108880 --dir 1 -o iperf3-capture-outbound.pcap
  • place the Edge that houses active gateways into maintenance mode
  • determine how many datagrams were lost and never received on VM7 (a scripted way to pull this from iPerf3’s JSON output is sketched after the results below)

iPerf3 results viewed on the receiver, VM7:

[root@vm7 ~]# iperf3 -s
 Server listening on 5201
 Accepted connection from 192.168.70.70, port 57992
 [  5] local 192.168.90.90 port 5201 connected to 192.168.70.70 port 53231
 [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
 [  5]   0.00-1.00   sec   116 KBytes   950 Kbits/sec  0.104 ms  0/82 (0%)
 [  5]   1.00-2.00   sec   129 KBytes  1.05 Mbits/sec  0.128 ms  0/91 (0%)
 [  5]   2.00-3.00   sec   127 KBytes  1.04 Mbits/sec  0.148 ms  0/90 (0%)
 [  5]   3.00-4.00   sec   129 KBytes  1.05 Mbits/sec  0.072 ms  0/91 (0%)
 [  5]   4.00-5.00   sec   127 KBytes  1.04 Mbits/sec  0.102 ms  0/90 (0%)
 [  5]   5.00-6.00   sec   129 KBytes  1.05 Mbits/sec  0.123 ms  0/91 (0%)
 [  5]   6.00-7.00   sec   127 KBytes  1.04 Mbits/sec  0.186 ms  0/90 (0%)
 [  5]   7.00-8.00   sec   129 KBytes  1.05 Mbits/sec  0.143 ms  0/91 (0%)
 [  5]   8.00-9.00   sec   127 KBytes  1.04 Mbits/sec  0.111 ms  0/90 (0%)
 [  5]   9.00-10.00  sec   129 KBytes  1.05 Mbits/sec  0.213 ms  0/91 (0%)
 [  5]  10.00-11.00  sec   127 KBytes  1.04 Mbits/sec  0.186 ms  0/90 (0%)
 [  5]  11.00-12.00  sec   117 KBytes   961 Kbits/sec  0.159 ms  8/91 (8.8%)     <---- 8 packets lost over a 1 sec period
 [  5]  12.00-13.00  sec   127 KBytes  1.04 Mbits/sec  0.381 ms  0/90 (0%)
 [  5]  13.00-14.00  sec   129 KBytes  1.05 Mbits/sec  0.200 ms  0/91 (0%)
 [  5]  14.00-15.00  sec   127 KBytes  1.04 Mbits/sec  0.150 ms  0/90 (0%)
 [  5]  15.00-16.00  sec   129 KBytes  1.05 Mbits/sec  0.110 ms  0/91 (0%)
 [  5]  16.00-17.00  sec   127 KBytes  1.04 Mbits/sec  0.106 ms  0/90 (0%)
 [  5]  17.00-18.00  sec   129 KBytes  1.05 Mbits/sec  0.129 ms  0/91 (0%)
 [  5]  18.00-19.00  sec   127 KBytes  1.04 Mbits/sec  0.139 ms  0/90 (0%)
 [  5]  19.00-20.00  sec   129 KBytes  1.05 Mbits/sec  0.075 ms  0/91 (0%)
 [  5]  20.00-21.00  sec   127 KBytes  1.04 Mbits/sec  0.057 ms  0/90 (0%)
 [  5]  21.00-22.00  sec   129 KBytes  1.05 Mbits/sec  0.064 ms  0/91 (0%)
 ^C[  5]  22.00-22.85  sec   115 KBytes  1.11 Mbits/sec  0.085 ms  0/81 (0%)

 [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
 [  5]   0.00-22.85  sec  0.00 Bytes  0.00 bits/sec  0.085 ms  8/2064 (0.39%)
 iperf3: interrupt - the server has terminated
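
Counting the lost datagrams by eye works for a short run like this, but iPerf3 can also emit a machine-readable report. Below is a minimal sketch, assuming the server is restarted as iperf3 -s -1 -J > iperf3-server.json and that the file contains the per-interval fields iPerf3 3.x produces for UDP tests; it simply prints the intervals in which datagrams were lost.

 # Sketch: find the lossy interval(s) in an iPerf3 JSON report
 # (assumes "iperf3 -s -1 -J > iperf3-server.json" was used on the receiver).
 import json

 with open("iperf3-server.json") as f:
     report = json.load(f)

 for interval in report["intervals"]:
     s = interval["sum"]
     if s.get("lost_packets", 0):
         print(f"{s['start']:.0f}-{s['end']:.0f} s: "
               f"{s['lost_packets']}/{s['packets']} datagrams lost")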

Results: failover time measured with iPerf3 UDP stream

  • over each 1-second interval, there are alternately 90 and 91 packets
  • on average, that is 90.5 packets per second
  • 8 packets were lost during the maintenance mode operation
  • (8 packets lost) / (90.5 packets per second) ≈ 0.09 seconds of lost packets (see the quick check after this list)
  • we know that the system benefits from interfaces that can buffer packets, so the actual outage was longer
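
And the quick check mentioned above, using the same back-of-the-envelope conversion as in Method 1:

 # Back-of-the-envelope check of the iPerf3 numbers above.
 lost = 8                    # datagrams lost in the 11.00-12.00 s interval
 rate = (90 + 91) / 2        # intervals alternate between 90 and 91 datagrams

 print(f"lost traffic: {lost / rate:.2f} s")   # roughly 0.09 s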

Overall Result Summary

The goal of this post was to try and find approximately how long it takes for traffic to recover after placing an Edge that houses active gateways into maintenance mode.

  • testing was performed on a non-production NSX-T 2.4.1 lab environment
  • this lab occasionally suffers from resource constraints
  • this setup does not have any stateful services configured such as NAT or firewalling, which may impact the results
  • even in this underperforming lab environment, placing an Edge in Maintenance Mode is expected to impact traffic for 0.06 to 0.09 seconds

As a general guideline, it should be safe to approximate the service impact of placing an Edge in Maintenance Mode as something less than a tenth of a second.
