Saturday, January 17, 2015

Packet Loss and TCP Retransmissions

Seeing TCP retransmissions in network capture traces are a very common problem.
TCP retransmissions generally are not considered a good sign, as they happen most probably due to some packets being dropped; they are not always a bad sign, nor are they always the cause of an application slowness problem.

Packet drops can happen due to many reasons.
Network protocol designers and implementers deal with packet drop problems by coming up with many TCP recovery algorithms and mechanisms to quickly recover the lost packets in order improve the overall response time for upper-layer protocols and applications.
Effective and efficient TCP recovery algorithms and mechanisms are like good shock absorbers.
With good shock absorbers, the passengers in a car would feel calm and steady when the car passing through bumpy roads.
In layman-term, we know that the roads outside can never be 100% smooth and even and bumpy (packet drops can easily happen), and therefore people are making good shock absorbers (good TCP recovery algorithms and mechanisms) to make people feel comfortable inside the cars.

Some of the RFCs related to TCP recovery are as below:
  • RFC 2018 - TCP Selective Acknowledgment Options
  • RFC 2988 - Computing TCP's Retransmission Timer
  • RFC 2011 - TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms
The 1st screenshot below shows you how ideally a good TCP recovery mechanism works upon a packet drop scenario.
In this particular scenario, there are:
  • 8 x TCP Previous segment not captured are reported (which generally means 8 packets or more are dropped). Packet#2996, Packet#3001, Packet#3006, Packet#3010, Packet#3016, Packet#3022, Packet#3055, Packet#3061.
  • 13 x TCP Retransmission are reported. Packet#3072, Packet#3074, Packet#3077, Packet#3078, Packet#3079, Packet#3080, Packet#3085, Packet#3086, Packet#3087, Packet#3091, Packet#3092, Packet#3093, Packet#3094.
  • 36 x TCP Dup ACK are reported.
8 or more packets dropped.
Sounds like a bumpy road isn't it?
Does it caused slowness issues upon the application? (TCP/1433 - MS SQL in this case)

The answer is no.
Because we can see that packets traverse across the bumpy road starting on 19:35:27.405, and the road started to be smooth again on 19:35:27.411.
That is 6ms.
1 second consists of 1000ms.
The user would not feel slowness impact upon this.

Now let's have a look on how a lousy TCP recovery mechanism is being implemented, which means that it is unable to recovery lost packets due to packet drops in a fast manner, and has caused slowness impact upon the upper-layer application and eventually the user experience.

In this particular scenario, there are:
  • 1 x TCP Previous segment not captured is reported. (Packet #17655).
  • 1 x TCP Retransmission is reported. Packet#17999.
  • 3 x TCP Dup ACK are reported. Packet#17661, Packet#17662, Packet#18012.
That particular TCP retransmission confirmed due to a packet dropped across a Juniper NetScreen firewall, due to the TCP Sequence Number Checking security feature.

Whether to see TCP Previous segment not captured messages in Wireshark depends whether the packet trace files is captured near to the Sender or the Receiver. Assuming a packet is dropped along the path from the Sender to the Receiver, the packet would be seen in the packet trace file captured near the Sender, and the packet would not be seen in the packet trace file captured near the Receiver, in which Wireshark will flag the next packet arrived upon the Receiver with a TCP Previous segment not captured message.

In this particular scenario, the TCP retransmission process, as part of the TCP recovery mechanism for a single packet drop, has taken 329ms.

The packet drops issue happening frequently and consistently due to the firewall security feature.
Assuming a TCP recovery process took 333ms, generally it can take up to 1 second to recover 3 packet drops.

To answer why lousy TCP recovery implementations can still found on modern networks?
Most probably this particular application (MQ on IBM AS/400) are designed for LAN environments, which assuming packet loss are to be very minimal, and also the TCP/IP stack for the operating system is not being stress-tested in high packet loss environment.

For this particular case, disabling this particular feature using the set flow no-tcp-seq-check command resolved the packet drops issues, and eventually the application slowness issues, and end users are happy. :-)

Take Home Lesson:
Packet loss always happens.
If the TCP recovery upon packet loss occurs quickly enough, users would not feel anything.