Transmission Control Protocol: Difference between revisions
Revision as of 02:57, 9 December 2006
The Transmission Control Protocol (TCP) is a virtual circuit protocol that is one of the core protocols of the Internet protocol suite, often simply referred to as TCP/IP. Using TCP, applications on networked hosts can create connections to one another, over which they can exchange streams of data. The protocol guarantees reliable and in-order delivery of data from sender to receiver. TCP also distinguishes data for multiple connections by concurrent applications (e.g. Web server and e-mail server) running on the same host.
TCP supports many of the Internet's most popular application protocols and resulting applications, including the World Wide Web, e-mail and Secure Shell.
In the Internet protocol suite, TCP is the intermediate layer between the Internet Protocol (IP) below it, and an application above it. Applications often need reliable pipe-like connections to each other, whereas the Internet Protocol does not provide such streams, but rather only best effort delivery (i.e. unreliable packets). TCP does the task of the transport layer in the simplified OSI model of computer networks. The other main transport-level Internet protocol is UDP.
Applications send streams of octets (8-bit bytes) to TCP for delivery through the network, and TCP divides the byte stream into appropriately sized segments (usually delineated by the maximum transmission unit (MTU) size of the data link layer of the network the computer is attached to). TCP then passes the resulting packets to the Internet Protocol, for delivery through a network to the TCP module of the entity at the other end. TCP checks to make sure that no packets are lost by giving each packet a sequence number, which is also used to make sure that the data are delivered to the entity at the other end in the correct order. The TCP module at the far end sends back an acknowledgement for packets which have been successfully received; a timer at the sending TCP will cause a timeout if an acknowledgement is not received within a reasonable round-trip time (or RTT), and the (presumably lost) data will then be re-transmitted. The TCP checks that no bytes are damaged by using a checksum; one is computed at the sender for each block of data before it is sent, and checked at the receiver.
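The segmentation and in-order delivery described above can be illustrated with a minimal sketch. The function names `segment_stream` and `reassemble` and the `mss` parameter are hypothetical, chosen for this example; the sketch ignores 32-bit sequence number wraparound, retransmission timers and checksums.

```python
def segment_stream(data: bytes, mss: int, initial_seq: int = 0):
    """Split a byte stream into (sequence_number, payload) segments.

    Each segment carries the sequence number of its first byte.
    Wraparound at 2**32 is deliberately ignored in this sketch.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [
        (initial_seq + offset, data[offset:offset + mss])
        for offset in range(0, len(data), mss)
    ]


def reassemble(segments, initial_seq: int = 0) -> bytes:
    """Rebuild the stream from segments that may be reordered or duplicated.

    Stops at the first gap, since bytes after a gap cannot be delivered yet.
    """
    by_seq = {seq: payload for seq, payload in segments}  # duplicates collapse
    out = bytearray()
    next_seq = initial_seq
    while next_seq in by_seq and by_seq[next_seq]:
        payload = by_seq[next_seq]
        out += payload
        next_seq += len(payload)
    return bytes(out)
```

For instance, ten bytes split with `mss=4` starting at sequence number 100 yield segments numbered 100, 104 and 108, and `reassemble` restores the original bytes even if those segments arrive in reverse order.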
Protocol operation
Unlike TCP's traditional counterpart, User Datagram Protocol, which can immediately start sending packets, TCP provides connections that need to be established before sending data. TCP connections have three phases:
- connection establishment
- data transfer
- connection termination
Before describing these three phases, a note about the various states of a connection end-point or Internet socket:
- LISTEN
- SYN-SENT
- SYN-RECEIVED
- ESTABLISHED
- FIN-WAIT-1
- FIN-WAIT-2
- CLOSE-WAIT
- CLOSING
- LAST-ACK
- TIME-WAIT
- CLOSED
- LISTEN: waiting for a connection request from any remote TCP and port (usually set by TCP servers).
- SYN-SENT: waiting for the remote TCP to send back a TCP packet with the SYN and ACK flags set (usually set by TCP clients).
- SYN-RECEIVED: waiting for the remote TCP to send back an acknowledgment after having sent back a connection acknowledgment to the remote TCP (usually set by TCP servers).
- ESTABLISHED: the port is ready to receive data from, and send data to, the remote TCP (set by TCP clients and servers).
- TIME-WAIT: waiting for enough time to pass to be sure the remote TCP received the acknowledgment of its connection termination request. According to RFC 793 a connection can stay in TIME-WAIT for a maximum of four minutes.
Connection establishment
To establish a connection, TCP uses a three-way handshake. Before a client attempts to connect with a server, the server must first bind to a port and listen on it to open it up for connections: this is called a passive open. Once the passive open is established, a client may initiate an active open. The handshake then proceeds in three steps:
- The active open is performed by sending a SYN to the server.
- In response, the server replies with a SYN-ACK.
- Finally the client sends an ACK (usually called SYN-ACK-ACK) back to the server.
At this point, both the client and server have received an acknowledgement of the connection.
Example:
- The initiating host (client) sends a synchronization packet (SYN flag set) to initiate a connection. Every SYN packet carries a Sequence Number, which is a 32-bit field in the TCP segment header. For example, let the Sequence Number value for this session be x.
- The other host receives the packet, records the Sequence Number x from the client, and replies with an acknowledgment and synchronization (SYN-ACK). The Acknowledgment Number is a 32-bit field in the TCP segment header. It contains the next sequence number that this host is expecting to receive (x + 1). The host also initiates a return session by including its own initial Sequence Number value, y, in the same segment.
- The initiating host responds with the next Sequence Number (x + 1) and an Acknowledgment Number value of y + 1, which is the Sequence Number value of the other host plus 1.
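The numbering in the example above can be traced with a small sketch. It only models the Sequence Number and Acknowledgment Number fields of the three segments; the function name `three_way_handshake` and the dictionary layout are assumptions made for illustration, and no real packets are sent.

```python
import random

SEQ_MODULUS = 2**32  # sequence and acknowledgment numbers are 32-bit fields


def three_way_handshake(client_isn=None, server_isn=None):
    """Return the SYN, SYN-ACK and ACK segments as simple dictionaries."""
    x = random.getrandbits(32) if client_isn is None else client_isn
    y = random.getrandbits(32) if server_isn is None else server_isn

    # Step 1: client -> server, SYN with initial sequence number x.
    syn = {"flags": {"SYN"}, "seq": x, "ack": None}
    # Step 2: server -> client, SYN-ACK with its own ISN y, acknowledging x + 1.
    syn_ack = {"flags": {"SYN", "ACK"}, "seq": y, "ack": (x + 1) % SEQ_MODULUS}
    # Step 3: client -> server, ACK with sequence x + 1, acknowledging y + 1.
    ack = {
        "flags": {"ACK"},
        "seq": (x + 1) % SEQ_MODULUS,
        "ack": (y + 1) % SEQ_MODULUS,
    }
    return [syn, syn_ack, ack]
```

Calling `three_way_handshake(1000, 5000)` gives a SYN-ACK that acknowledges 1001 and a final ACK with sequence number 1001 acknowledging 5001.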
Data transfer
There are a few key features that set TCP apart from User Datagram Protocol:
- Error-free data transfer
- Ordered-data transfer
- Retransmission of lost packets
- Discarding duplicate packets
- Congestion throttling
In the first two steps of the 3-way handshaking, both computers exchange an initial sequence number (ISN). This number can be arbitrary. This sequence number identifies the order of the bytes sent from each computer so that the data transferred is in order regardless of any fragmentation or disordering that occurs during transmission. For every byte transmitted the sequence number must be incremented.
Conceptually, each byte sent is assigned a sequence number, and the receiver sends an acknowledgement back to the sender that effectively states that it received the byte. In practice, only the sequence number of the first data byte is carried, in the sequence number field, and the receiver sends an acknowledgement value naming the next byte it expects to receive.
For example, if computer A sends 4 bytes with a sequence number of 100 (conceptually, the four bytes would have sequence numbers 100, 101, 102 and 103 assigned) then the receiver would send back an acknowledgement of 104, since that is the next byte it expects to receive in the next packet. By sending an acknowledgement of 104, the receiver is signaling that it received bytes 100, 101, 102 and 103 correctly. If, hypothetically, only bytes 100 and 101 had arrived intact, the acknowledgement value would be 102.
A single 4-byte packet is accepted or discarded as a whole, so this would not happen there, but it can happen if, for example, 10,000 bytes are sent in 10 different TCP packets and a packet is lost during transmission. If the first packet is lost, the acknowledgement can only say that byte 0 is expected next, because bytes 0 through 999 were lost; it cannot say that bytes 1,000 to 9,999 were received. With purely cumulative acknowledgements the sender may therefore have to resend all 10,000 bytes. (SCTP addresses this issue with selective acknowledgement; the TCP selective acknowledgment option defined in RFC 2018 serves the same purpose.)
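A minimal sketch of the cumulative acknowledgement rule discussed above: the receiver acknowledges the first byte it has not yet received, regardless of what arrived after a gap. The function name `cumulative_ack` and the `(sequence_number, length)` representation are assumptions for illustration, and sequence number wraparound is ignored.

```python
def cumulative_ack(expected_seq, received):
    """Return the acknowledgement number for a set of received segments.

    `received` is an iterable of (sequence_number, length) pairs.
    The result is the sequence number of the next byte the receiver expects.
    """
    ack = expected_seq
    for seq, length in sorted(received):
        if seq > ack:
            break  # gap: everything from here on cannot be acknowledged yet
        ack = max(ack, seq + length)
    return ack
```

With ten 1,000-byte packets starting at byte 0, losing the first packet leaves the acknowledgement at 0 even though bytes 1,000 to 9,999 arrived, while the 4-byte example with sequence number 100 is acknowledged with 104.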
Sequence numbers and acknowledgments cover discarding duplicate packets, retransmission of lost packets, and ordered-data transfer. To assure correctness a checksum field is included (see the Packet structure section for details on checksumming).
The TCP checksum is a fairly weak check by modern standards. Data link layers with high bit error rates may require additional link error correction/detection capabilities. If TCP were to be redesigned today, it would most probably have a 32-bit cyclic redundancy check specified as an error check instead of the current checksum. The weak checksum is partially compensated for by the common use of a CRC or better integrity check at layer 2, below both TCP and IP, such as is used in PPP or the Ethernet frame. However, this does not mean that the 16-bit TCP checksum is redundant: remarkably, surveys of Internet traffic have shown that software and hardware errors that introduce errors in packets between CRC-protected hops are common, and that the end-to-end 16-bit TCP checksum catches most of these simple errors. This is the end-to-end principle at work.
Congestion avoidance
The final aspect of TCP is congestion throttling. Senders use acknowledgements for data sent, or the lack of acknowledgements, to infer network conditions between the TCP sender and receiver. Coupled with timers, TCP senders and receivers can alter the behavior of the flow of data. This is more generally referred to as flow control, congestion control and/or network congestion avoidance. TCP uses a number of mechanisms to achieve high performance and avoid congesting the network (that is, sending data faster than either the network or the host on the other end can handle it). These mechanisms include the use of a sliding window, the slow-start algorithm, the congestion avoidance algorithm, the fast retransmit and fast recovery algorithms, and more.
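As a rough illustration of how slow start and congestion avoidance interact, the following sketch advances a congestion window once per round trip. It is a heavily simplified model, not the algorithm of any particular TCP implementation: the function name `next_cwnd`, the per-round-trip granularity, the unit (whole segments) and the choice to resume at half the window after a loss are all assumptions made for this example.

```python
def next_cwnd(cwnd, ssthresh, loss):
    """Advance a simplified congestion window by one round trip.

    cwnd and ssthresh are counted in segments. Returns (cwnd, ssthresh).
    """
    if loss:
        # Multiplicative decrease: halve the window (never below 2 segments)
        # and continue from there in congestion avoidance.
        ssthresh = max(cwnd // 2, 2)
        return ssthresh, ssthresh
    if cwnd < ssthresh:
        # Slow start: the window doubles every round trip up to ssthresh.
        return min(cwnd * 2, ssthresh), ssthresh
    # Congestion avoidance: the window grows by one segment per round trip.
    return cwnd + 1, ssthresh


def trace(initial_cwnd, ssthresh, losses):
    """Return the window after each round trip; `losses` is a list of booleans."""
    cwnd, history = initial_cwnd, []
    for loss in losses:
        cwnd, ssthresh = next_cwnd(cwnd, ssthresh, loss)
        history.append(cwnd)
    return history
```

Starting from a window of 1 segment with a threshold of 8, five loss-free round trips followed by one loss and one more loss-free round trip give the windows 2, 4, 8, 9, 10, 5, 6: exponential growth, then linear growth, then a halving.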
Enhancing TCP to reliably handle loss, minimize errors, manage congestion and go fast in very high-speed environments are ongoing areas of research and standards development.
TCP window size
The TCP receive window size is the amount of received data (in bytes) that can be buffered during a connection. The sending host can send only that amount of data before it must wait for an acknowledgment and window update from the receiving host.
Window scaling
For more efficient use of high bandwidth networks, a larger TCP window size may be used. The TCP window size field controls the flow of data and is limited to between 2 and 65,535 bytes.
Since the size field cannot be expanded, a scaling factor is used. The TCP window scale option, as defined in RFC 1323, is an option used to increase the maximum window size from 65,535 bytes to 1 Gigabyte. Scaling up to larger window sizes is a part of what is necessary for TCP Tuning.
The window scale option is used only during the TCP 3-way handshake. The window scale value represents the number of bits to left-shift the 16-bit window size field. The window scale value can be set from 0 (no shift) to 14.
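The effect of the scale value is a simple left shift of the 16-bit window field, which a short sketch makes concrete. The function name `effective_window` is hypothetical; the limits (a 16-bit field and a shift of 0 to 14) are those stated above.

```python
def effective_window(window_field: int, shift: int) -> int:
    """Return the receive window in bytes for a given window field and scale."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= 14:
        raise ValueError("window scale must be between 0 and 14")
    return window_field << shift
```

With the largest field value and the largest shift, `effective_window(65535, 14)` is 1,073,725,440 bytes, just under 1 gigabyte.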
Connection termination
The connection termination phase uses, at most, a four-way handshake, with each side of the connection terminating independently. When an endpoint wishes to stop its half of the connection, it transmits a FIN packet, which the other end acknowledges with an ACK. Therefore, a typical teardown requires a pair of FIN and ACK segments from each TCP endpoint.
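A connection can be "half-closed", in which case one side has terminated its end, but the other has not. (RFC 793 reserves the term "half-open" for a different situation, where one side has closed or aborted the connection without the knowledge of the other.) The side that has terminated can no longer send any data into the connection, but the other side can.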
It is also possible to terminate the connection with a 3-way handshake, in which host A sends a FIN, host B replies with a combined FIN and ACK (merging two steps into one), and host A replies with an ACK. This is perhaps the most common method.
Finally, it is possible for both hosts to send FINs simultaneously, after which each only has to send an ACK. This could be considered a 2-way handshake, since the FIN/ACK sequence is done in parallel for both directions.
TCP ports
TCP uses the notion of port numbers to identify sending and receiving application end-points on a host, or Internet sockets. Each side of a TCP connection has an associated 16-bit unsigned port number (1-65535) reserved by the sending or receiving application. Arriving TCP data packets are identified as belonging to a specific TCP connection by its sockets, that is, the combination of source host address, source port, destination host address, and destination port. This means that a server computer can provide several clients with several services simultaneously, as long as a client takes care of initiating any simultaneous connections to one destination port from different source ports.
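The idea that a connection is identified by the combination of both addresses and both ports can be sketched with a lookup table keyed by that 4-tuple. The names `ConnKey` and `deliver` are hypothetical and the addresses are documentation examples; a real TCP implementation keeps far more state per connection.

```python
from typing import Dict, NamedTuple


class ConnKey(NamedTuple):
    """The four values that together identify one TCP connection."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int


def deliver(table: Dict[ConnKey, bytearray], key: ConnKey, payload: bytes) -> None:
    """Append arriving data to the buffer of the connection it belongs to."""
    table.setdefault(key, bytearray()).extend(payload)


connections: Dict[ConnKey, bytearray] = {}
# Two connections from the same client host to the same server port are
# kept apart only by their different source ports.
deliver(connections, ConnKey("192.0.2.10", 50001, "192.0.2.1", 80), b"GET /a")
deliver(connections, ConnKey("192.0.2.10", 50002, "192.0.2.1", 80), b"GET /b")
deliver(connections, ConnKey("192.0.2.10", 50001, "192.0.2.1", 80), b" HTTP/1.0")
```

After these three calls the table holds two separate buffers, one per source port.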
Port numbers are categorized into three basic categories: well-known, registered, and dynamic/private. The well-known ports are assigned by the Internet Assigned Numbers Authority (IANA) and are typically used by system-level or root processes. Well-known applications running as servers and passively listening for connections typically use these ports. Some examples include: FTP (21), TELNET (23), SMTP (25) and HTTP (80). Registered ports are typically used by end user applications as ephemeral source ports when contacting servers, but they can also identify named services that have been registered by a third party. Dynamic/private ports can also be used by end user applications, but are less commonly so. Dynamic/private ports do not contain any meaning outside of any particular TCP connection.
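The three categories can be expressed as a small classification function. The numeric boundaries used here (0 to 1023 well-known, 1024 to 49151 registered, 49152 to 65535 dynamic/private) are the IANA ranges and are not stated in the text above; the function name `port_category` is hypothetical.

```python
def port_category(port: int) -> str:
    """Classify a TCP port number into one of the three IANA categories."""
    if not 0 <= port <= 65535:
        raise ValueError("port must fit in 16 bits")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"
```

For example, `port_category(80)` returns `"well-known"` for HTTP, while `port_category(50000)` returns `"dynamic/private"`.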
Development of TCP
TCP is both a complex and evolving protocol. However, while significant enhancements have been made and proposed over the years, its most basic operation has not changed significantly since RFC 793, published in 1981. RFC 1122, Host Requirements for Internet Hosts, clarified a number of TCP protocol implementation requirements. RFC 2581, TCP Congestion Control, one of the most important TCP related RFCs in recent years, describes updated algorithms to be used in order to avoid undue congestion. In 2001, RFC 3168 was written to describe explicit congestion notification (ECN), a congestion avoidance signalling mechanism. In the early 21st century, TCP is typically used in approximately 95% of all Internet packets [citation needed]. Common applications that use TCP include HTTP (World Wide Web), SMTP (e-mail) and FTP (file transfer).
The original TCP congestion control was called TCP Reno, but recently, several alternative congestion control algorithms have been proposed:
- High Speed TCP proposed by Sally Floyd in RFC 3649
- TCP Vegas by Brakmo and Peterson at University of Arizona
- TCP Westwood by UCLA
- BIC TCP by Injong Rhee at North Carolina State University
- H-TCP by Hamilton Institute
- Fast TCP (Fast Active queue management Scalable Transmission Control Protocol) by Caltech.
- TCP Hybla by University of Bologna
An extension mechanism, TCP Interactive (iTCP), allows applications to subscribe to TCP events and respond accordingly, enabling various functional extensions to TCP including application-assisted congestion control.
TCP Over Wireless
TCP has been optimized for wired networks. Any packet loss is considered to be the result of congestion, and the window size is reduced dramatically as a precaution. However, wireless links are known to experience sporadic and usually temporary losses due to fading, shadowing, handoff and similar effects that cannot be considered congestion. An erroneous back-off of the window size due to wireless packet loss is followed by a congestion avoidance phase in which the window recovers only conservatively, which causes the radio link to be underutilized. Extensive research has been done on how to combat these harmful effects. Suggested solutions can be categorized as end-to-end solutions (which require modifications at the client and/or server), link layer solutions (such as RLP in CDMA2000), or proxy-based solutions (which require some changes in the network without modifying end nodes).
Hardware TCP Implementations
TCP Offload Engines
One way to overcome the processing power requirements of TCP is building hardware implementations of it, widely known as TCP Offload Engines (TOE). The main problem of TOEs is that they are hard to integrate into computing systems, requiring extensive changes in the operating system of the computer or device. The first company to develop such a device was Alacritech.
Debugging TCP
A packet sniffer, which intercepts TCP traffic on a network link, can be useful in debugging networks, network stacks and applications which use TCP by showing the user what packets are passing through a link.
Alternatives to TCP
For many applications TCP is not appropriate. One big problem (at least with normal implementations) is that the application cannot get at the packets coming after a lost packet until the retransmitted copy of the lost packet is received. This causes problems for real-time applications such as streaming multimedia (such as Internet radio), real-time multiplayer games and voice over IP (VoIP) where it is sometimes more useful to get most of the data in a timely fashion than it is to get all of the data in order.
Also for embedded systems, network booting and servers that serve simple requests from huge numbers of clients (e.g. DNS servers) the complexity of TCP can be a problem. Finally some tricks such as transmitting data between two hosts that are both behind NAT (using STUN or similar systems) are far simpler without a relatively complex protocol like TCP in the way.
Generally where TCP is unsuitable the User Datagram Protocol (UDP) is used. This provides the application multiplexing and checksums that TCP does, but does not handle building streams or retransmission giving the application developer the ability to code those in a way suitable for the situation and/or to replace them with other methods like forward error correction or interpolation.
SCTP is another IP protocol that provides reliable stream-oriented services not so dissimilar from TCP. It is newer and considerably more complex than TCP, so it has not yet seen widespread deployment; however, it is especially designed to be used in situations where reliability and near-real-time considerations are important.
TCP also has some issues in high bandwidth utilization environments. The TCP congestion avoidance algorithm works very well for ad-hoc environments where it is not known who will be sending data, but if the environment is predictable a timing based protocol such as ATM can avoid the overhead of retransmits that TCP needs.
Packet structure
A TCP packet consists of two sections:
- header
- data
The header consists of 11 fields, of which only 10 are required. The eleventh field is optional and aptly named: options.
Header
Bit offset | Bits 0–15 | Bits 16–31 |
---|---|---|
0 | Source Port | Destination Port |
32 | Sequence Number (bits 0–31) | |
64 | Acknowledgment Number (bits 0–31) | |
96 | Data Offset (bits 0–3), Reserved (bits 4–9), Flags (bits 10–15) | Window |
128 | Checksum | Urgent Pointer |
160 | Options (optional) | |
160/192+ | Data | |
- Source port
- This field identifies the sending port.
- Destination port
- This field identifies the receiving port.
- Sequence number
- The sequence number has a dual role. If the SYN flag is present then this is the initial sequence number and the first data byte is the sequence number plus 1. Otherwise if the SYN flag is not present then the first data byte is the sequence number.
- Acknowledgement number
- If the ACK flag is set then the value of this field is the sequence number the sender expects next.
- Data offset
- This 4-bit field specifies the size of the TCP header in 32-bit words. The minimum size header is 5 words and the maximum is 15 words thus giving the minimum size of 20 bytes and maximum of 60 bytes. This field gets its name from the fact that it is also the offset from the start of the TCP packet to the data.
- Reserved
- 6-bit reserved field for future use and should be set to zero.
- Flags (aka Control bits)
- This field contains 6 bit flags:
- URG: Urgent pointer field is significant
- ACK: Acknowledgement field is significant
- PSH: Push function
- RST: Reset the connection
- SYN: Synchronize sequence numbers
- FIN: No more data from sender
- Window
- The number of bytes the sender is willing to receive starting from the acknowledgement field value
- Checksum
- The 16-bit checksum field is used for error-checking of the header and data.
- With IPv4
- When TCP runs over IPv4, the method used to compute the checksum is defined in RFC 793:
- The checksum field is the 16 bit one's complement of the one's complement sum of all 16-bit words in the header and text. If a segment contains an odd number of header and text octets to be checksummed, the last octet is padded on the right with zeros to form a 16-bit word for checksum purposes. The pad is not transmitted as part of the segment. While computing the checksum, the checksum field itself is replaced with zeros.
- In other words, all 16-bit words are summed together using one's complement (with the checksum field set to zero). The sum is then one's complemented. This final value is then inserted as the checksum field. Algorithmically speaking, this is the same as for IPv4.
- The difference is in the data used to make the checksum. Included is a pseudo-header that mimics the IPv4 header:
Bit offset | Bits 0–15 | Bits 16–31 |
---|---|---|
0 | Source address (bits 0–31) | |
32 | Destination address (bits 0–31) | |
64 | Zeros (bits 0–7), Protocol (bits 8–15) | TCP length |
96 | Source Port | Destination Port |
128 | Sequence Number (bits 0–31) | |
160 | Acknowledgement Number (bits 0–31) | |
192 | Data Offset (bits 0–3), Reserved (bits 4–9), Flags (bits 10–15) | Window |
224 | Checksum | Urgent Pointer |
256 | Options (optional) | |
256/288+ | Data | |
- The source and destination addresses are those in the IPv4 header. The protocol is that for TCP (see List of IPv4 protocol numbers): 6. The TCP length field is the length of the TCP header and data.
- With IPv6
- When TCP runs over IPv6, the method used to compute the checksum is changed, as per RFC 2460:
- Any transport or other upper-layer protocol that includes the addresses from the IP header in its checksum computation must be modified for use over IPv6, to include the 128-bit IPv6 addresses instead of 32-bit IPv4 addresses.
- When computing the checksum, a pseudo-header that mimics the IPv6 header is included:
Bit offset | Fields (bit positions within each 32-bit word) |
---|---|
0–127 | Source address (four 32-bit words) |
128–255 | Destination address (four 32-bit words) |
256 | TCP length (bits 0–31) |
288 | Zeros (bits 0–23), Next Header (bits 24–31) |
320 | Source Port (bits 0–15), Destination Port (bits 16–31) |
352 | Sequence Number (bits 0–31) |
384 | Acknowledgement Number (bits 0–31) |
416 | Data Offset (bits 0–3), Reserved (bits 4–9), Flags (bits 10–15), Window (bits 16–31) |
448 | Checksum (bits 0–15), Urgent Pointer (bits 16–31) |
480 | Options (optional) |
480/512+ | Data |
- The source address is the one in the IPv6 header. The destination address is the final destination; if the IPv6 packet doesn't contain a Routing header, that will be the destination address in the IPv6 header, otherwise, at the originating node, it will be the address in the last element of the Routing header, and, at the receiving node, it will be the destination address in the IPv6 header. The Next Header value is the protocol value for TCP: 6. The TCP length field is the length of the TCP header and data.
- Urgent pointer
- If the URG flag is set, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte.
- Options
- Additional header fields (called options) may follow the urgent pointer. If any options are present then the total length of the option field must be a multiple of a 32-bit word and the data offset field adjusted appropriately.
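The IPv4 checksum procedure described under the Checksum field can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `ones_complement_sum` and `tcp_checksum_ipv4` are hypothetical, the caller is expected to pass a segment whose checksum field is already zeroed, and only the IPv4 pseudo-header is handled.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad an odd-length input on the right with zeros
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def tcp_checksum_ipv4(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Compute the TCP checksum over the IPv4 pseudo-header and the segment.

    `segment` is the TCP header plus data, with the checksum field set to zero.
    """
    pseudo_header = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, 6, len(segment))  # zeros, protocol 6, TCP length
    )
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF
```

A receiver can run the same function over a segment that still contains the transmitted checksum: if nothing was damaged, the result is 0.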
Data
The last field is not a part of the header. Its contents are whatever the upper layer protocol wants; that protocol is not identified in the header and is presumed from the port selection.
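To show how the data offset separates the header from the data, here is a minimal parsing sketch for the header layout described above. The function name `parse_tcp_segment` and the returned dictionary keys are assumptions for illustration; options are returned as raw bytes and not interpreted.

```python
import struct

# Flag names ordered from the least significant bit of the flags field upward.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_segment(segment: bytes) -> dict:
    """Split a raw TCP segment into its header fields, options and data."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack, offset_and_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])

    data_offset = offset_and_flags >> 12       # header size in 32-bit words
    header_len = data_offset * 4
    if data_offset < 5 or header_len > len(segment):
        raise ValueError("invalid data offset")

    flag_bits = offset_and_flags & 0x3F        # low 6 bits hold the flags
    flags = {name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)}

    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "data_offset": data_offset, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
        "options": segment[20:header_len],
        "data": segment[header_len:],          # everything after the header
    }
```

A segment with a data offset of 5 has no options, so its data starts at byte 20; with a data offset of 6, four bytes of options precede the data.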
See also
- TCP congestion avoidance algorithms for more on TCP Reno, TCP Vegas, TCP Westwood, BIC TCP and Hybla
- TCP and UDP port
- TCP and UDP port numbers for a complete (growing) list of ports/services
- Connection-oriented protocol
- TCP Tuning for high performance networks
- T/TCP variant of TCP
- Path MTU discovery
- TCP Sequence Prediction Attack
- SYN flood
External links
- RFC793 in plain-text format
- RFC1122 some error-corrections
- RFC1323 TCP-Extensions
- IANA Port Assignments
- John Kristoff's Overview of TCP (Fundamental concepts behind TCP and how it is used to transport data between two endpoints)
- Introduction to TCP/IP - with pictures
- The basics of Transmission Control Protocol
- TCP, Transmission Control Protocol
- TCP EFSM diagram - A detailed description of TCP states.
- The Law of Leaky Abstractions by Joel Spolsky
- Dissertation about TCP improvements in wired and wireless networks (Dissertation)