tcp_sendfile() reads data directly from a file and does not use NET_TCP_WRITE_BUFFERS data flow
even if CONFIG_NET_TCP_WRITE_BUFFERS option is enabled.
Despite this, tcp_sendfile relied on NET_TCP_WRITE_BUFFERS specific flow control variables that
were idle during sendfile operation. Thus it was a total inconsistency.
E.g. because of the issue, TCP socket used by sendfile() operation never issued
FIN packet on close() command, and the TCP connection hung up.
As a result of the fix, simultaneously enabled CONFIG_NET_TCP_WRITE_BUFFERS and
CONFIG_NET_SENDFILE options can coexist.
If the remote TCP receiver advertised TCP window size greater than 64 KB
and TCP ACK packets returned to the NuttX TCP sender with a significant delay,
tx_unacked variable overflowed and further TCP send stalled forever
(until TCP re-connection).
In case of enabled packet forwarding mode, packets were forwarded in a reverse order
because of LIFO behavior of the connection event list.
The issue exposed only during high network traffic. Thus the event list started to grow
that resulted in changing the order of packets inside of groups of several packets
like the following: 3, 2, 1, 6, 5, 4, 8, 7 etc.
Remarks concerning the connection event list implementation:
* Now the queue (list) is FIFO as it should be.
* The list is singly linked.
* The list has a head pointer (inside of outer net_driver_s structure),
and a tail pointer is added into outer net_driver_s structure.
* The list item is devif_callback_s structure.
It still has two pointers to two different list chains (*nxtconn and *nxtdev).
* As before the first argument (*dev) of the list functions can be NULL,
while the other argument (*list) is effective (not NULL).
* An extra (*tail) argument is added to devif_callback_alloc()
and devif_conn_callback_free() functions.
* devif_callback_alloc() time complexity is O(1) (i.e. O(n) to fill the whole list).
* devif_callback_free() time complexity is O(n) (i.e. O(n^2) to empty the whole list).
* devif_conn_event() time complexity is O(n).
* Do not accept the window in old segments.
Implement SND.WL1/WL2 things in the RFC.
* Do not accept the window in the segment w/o ACK bit set.
The window is an offset from the ack seq.
(maybe it's simpler to just drop segments w/o ACK though)
* Subtract snd_wnd by the amount of the ack advancement.
Consider a bi-directional TCP connection:
1. we use all IOBs for tx queue
2. we advertize zero recv window because we have no free IOBs
3. if the peer tcp does the same thing,
both sides advertize zero window and can not drain the tx queue.
For a similar stall to happen, the peer doesn't need to be
a naive tcp implementation like nuttx. A naive application blocking
on send() without draining its read buffer is enough.
(Probably such an application should be fixed to drain rx even
when tx is full. However, it's another story.)
This commit avoids the situation by prevent tx from grabbing
the all IOBs in the first place. (assuming CONFIG_IOB_THROTTLE > 0)
1. change all window relative value type to uint32_t
2. move window range validity check(UINT16_MAX) before assembling TCP header
Signed-off-by: chao.an <anchao@xiaomi.com>
Do not bother to preserve segment boundaries in the tcp
readahead queues.
* Avoid wasting the tail IOB space for each segments.
Instead, pack the newly received data into the tail space
of the last IOB. Also, advertise the tail space as
a part of the window.
* Use IOB chain directly. Eliminate IOB queue overhead.
* Allow to accept only a part of a segment.
* This change improves the memory efficiency.
And probably more importantly, allows less-confusing
recv window advertisement behavior.
Previously, even when we advertise N bytes window,
we often couldn't actually accept N bytes. Depending on
the segment sizes and IOB configurations, it was causing
segment drops.
Also, the previous code was moving the right edge of the
window back and forth too often, even when nothing in
the system was competing on the IOBs. Shrinking the
window that way is a kinda well known recipe to confuse
the peer stack.
* Fixes the case where the window was small but not zero.
* tcp_recvfrom: Remove tcp_ackhandler. Instead, simply schedule TX for
a possible window update and make tcp_appsend decide.
* Replace rcv_wnd (the last advertized window size value) with
rcv_adv. (the window edge sequence number advertized to the peer)
rcv_wnd was complicated to deal with because its base (rcvseq) is
also moving.
* tcp_appsend: Send a window update even if there are no other reasons
to send an ack.
Namely, send an update if it increases the window by
* 2 * mss
* or the half of the max possible window size
Request the TCP ACK to estimate the receive window after handle
any data already buffered in a read-ahead buffer.
Change-Id: Id998a1125dd2991d73ba4bef081ddcb7adea4f0d
Signed-off-by: chao.an <anchao@xiaomi.com>
RFC2001: TCP Slow Start, Congestion Avoidance, Fast Retransmit,
and Fast Recovery Algorithms
...
3. Fast Retransmit
Modifications to the congestion avoidance algorithm were proposed in
1990 [3]. Before describing the change, realize that TCP may
generate an immediate acknowledgment (a duplicate ACK) when an out-
of-order segment is received (Section 4.2.2.21 of [1], with a note
that one reason for doing so was for the experimental fast-
retransmit algorithm). This duplicate ACK should not be delayed.
The purpose of this duplicate ACK is to let the other end know that a
segment was received out of order, and to tell it what sequence
number is expected.
Since TCP does not know whether a duplicate ACK is caused by a lost
segment or just a reordering of segments, it waits for a small number
of duplicate ACKs to be received. It is assumed that if there is
just a reordering of the segments, there will be only one or two
duplicate ACKs before the reordered segment is processed, which will
then generate a new ACK. If three or more duplicate ACKs are
received in a row, it is a strong indication that a segment has been
lost. TCP then performs a retransmission of what appears to be the
missing segment, without waiting for a retransmission timer to
expire.
Change-Id: Ie2cbcecab507c3d831f74390a6a85e0c5c8e0652
Signed-off-by: chao.an <anchao@xiaomi.com>
Add a fallback mechanism to ensure that there are still available
iobs for an free connection, Guarantees all connections will have
a minimum threshold iob to keep the connection not be hanged.
Change-Id: I59bed98d135ccd1f16264b9ccacdd1b0d91261de
Signed-off-by: chao.an <anchao@xiaomi.com>
Fixes an issue where tcp sockets with activated keepalives stalled and
were not properly closed. Poll would not indicate a POLLHUP and therefore
locks down the application.
* tcp_conn_s.tcp_conn_s & tcp_conn_s.keepintvl changed to uint32_t
According RFC1122 keepidle MUST have a default of 2 hours.
It's enough to check the buffer available in the net event handler
Change-Id: I2d7c7a03675cf6eff6ffb42a81b7c7245253e92c
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
MSG_DONTWAIT (since Linux 2.2)
Enables nonblocking operation; if the operation would block, the
call fails with the error EAGAIN or EWOULDBLOCK. This provides
similar behavior to setting the O_NONBLOCK flag (via the fcntl(2)
F_SETFL operation), but differs in that MSG_DONTWAIT is a per-call
option, whereas O_NONBLOCK is a setting on the open file description
(see open(2)), which will affect all threads in the calling process
and as well as other processes that hold file descriptors referring
to the same open file description.
1.Consolidate absolute to relative timeout conversion into one place(_net_timedwait)
2.Drive the wait timeout logic by net_timedwait instead of devif_timer
This patch help us remove devif_timer(period tick) to save the power in the future.
Change-Id: I534748a5d767ca6da8a7843c3c2f993ed9ea77d4
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
Here is the email loop talk about why it is better to remove the option:
https://groups.google.com/forum/#!topic/nuttx/AaNkS7oU6R0
Change-Id: Ib66c037752149ad4b2787ef447f966c77aa12aad
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
Author: Gregory Nutt <gnutt@nuttx.org>
Run all .h and .c files modified in last PR through nxstyle.
Author: Xiang Xiao <xiaoxiang@xiaomi.com>
Net cleanup (#17)
* Fix the semaphore usage issue found in tcp/udp
1. The count semaphore need disable priority inheritance
2. Loop again if net_lockedwait return -EINTR
3. Call nxsem_trywait to avoid the race condition
4. Call nxsem_post instead of sem_post
* Put the work notifier into free list to avoid the heap fragment in the long run. Since the allocation strategy is encapsulated internally, we can even refine the implementation later.
* Network stack shouldn't allocate memory in the poll implementation to avoid the heap fragment in the long run, other modification include:
1. Select MM_IOB automatically since ICMP[v6] socket can't work without the read ahead buffer
2. Remove the net lock since xxx_callback_free already do the same thing
3. TCP/UDP poll should work even the read ahead buffer isn't enabled at all
* Add NET_ prefix for UDP_NOTIFIER and TCP_NOTIFIER option to align with other UDP/TCP option convention
* Remove the unused _SF_[IDLE|ACCEPT|SEND|RECV|MASK] flags since there are code to set/clear these flags, but nobody check them.
Squashed commit of the following:
net/tmp: Rename the unacked field of the tcp connection structure to tx_unacked. Too confusing with the implementation of delayed RX ACKs.
net/tcp: Initial implementation of TCP delayed ACKs.
net/tcp: Add delayed ACK configuration selection. Rename tcp_ack() to tcp_synack(). It may or may not send a ACK. It will always send SYN or SYN/ACK.
Iobinstrumentation
* mm/iob: Introduces producer/consumer id to every iob call. This is so that the calls can be instrumented to monitor the IOB resources.
* iob instrumentation - Merges producer/consumer enumeration for simpler IOB user.
* fs/procfs: Starts adding support for /proc/iobinfo
* fs/procfs: Finishes first pass of simple IOB user stastics and /proc/iobinfo entry
Approved-by: Gregory Nutt <gnutt@nuttx.org>
Squashed commit of the following:
net/: Fix some naming inconsistencies, Fix final compilation issies.
net/inet/inet_close(): Now that we have logic to drain the buffered TX data, we can implement a proper lingering close.
net/inet,tcp,udp: Add functions to wait for write buffers to drain.
net/udp: Add support for notification when the UDP write buffer becomes empty.
net/tcp: Add support for notification when the TCP write buffer becomes empty.
arch/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
sched/ audio/ crypto/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
Documentation/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
fs/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
graphics/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
net/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
drivers/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
include/, syscall/, wireless/: Removed all references to CONFIG_DISABLE_POLL. The standard POSIX poll() can not longer be disabled.
configs/: Remove all references to CONFIG_DISABLE_POLL. Standard POSIX poll can no longer be disabled.
Squashed commit of the following:
sched/sched/sched_getsockets.c: Fix an error in conditional compilation.
fs/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
Documentation/: Remove all references to CONFIG_NSOCKET_DESCRIPTORS == 0
include/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
libs/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
net/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
sched/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
syscall/: Remove all conditional logic based on CONFIG_NSOCKET_DESCRIPTORS == 0
tools/: Fixups for CONFIG_NSOCKET_DESCRIPTORS no longer used to disable sockets.
Squashed commit of the following:
Fix up some final compile isses.
net/netdev: Convert the network down notification logic to use the new wqueue-based notification factility.
net/udp: Convert the UDP readahead notification logic to use the new wqueue-based notification factility.
net/tcp: Convert the TCP readahead notification logic to use the new wqueue-based notification factility.
mm/iob: Convert the IOB notification logic to use the new wqueue-based notification factility.
sched/wqueue: Signals are not good IPCs to support the target poll functionality for several reasons including the amount of data that can be passed with a signal and in the fact that in protected and kernel modes, user threads executing signal handlers in protected, kernel memory is problematic. Instead, convert the same logic to perform the notifications via function callback on the high priority work queue.
Squashed commit of the following:
net/tcp: Add signal notification for the case when UDP read-ahead data is buffered. This is basically of clone of the TCP notification logic with naming adapted for UDP.
net/tcp: Add signal notification for the case when TCP read-ahead data is buffered.
Fix a few typo/compilation problems.
net/: Remove all CONFIG_NET_xxx_TCP_RECVWNDO configuration variables. They were used only to initialize the d_recwndo of the network device structure which no longer exists.
net/: Remove the device TCP receive window field (d_recvwndo) from the device structure. That value is no longer retained, but is calculated dynamically.
Remove some dangling references to CONFIG_NET_TCP_RWND_CONTROL.
net/tcp: Take read-ahead throttling into account when calculating the TCP receive window size.
net/tcp: tcp_get_recvwindow() now returns the receive window size directly (vs. indirectly via the device structure).
net/tcp: Remove CONFIG_NET_TCP_RWND_CONTROL. TCP window algorithm is now trigged only by CONFIG_NET_TCP_READAHEAD.
Squashed commit of the following:
sched: Rename all use of system_t to clock_t.
syscall: Rename all use of system_t to clock_t.
net: Rename all use of system_t to clock_t.
libs: Rename all use of system_t to clock_t.
fs: Rename all use of system_t to clock_t.
drivers: Rename all use of system_t to clock_t.
arch: Rename all use of system_t to clock_t.
include: Remove definition of systime_t; rename all use of system_t to clock_t.
net/tcp: Add logic to send probes when SO_KEEPALIVE is enabled.
net/tcp: TCP socket should not have to be connected to configure KeepAlive.
net/: Add a separate configuration to enable/disable KEEPALIVE socket options.
net/tcp: Arguments to TCP keep-alive timing functions probably should be struct timeval as are the times for other time-related socket options.
net/tcp: Fix a backward conditional
net/tcp: Add some more checks and debug output to TCP-protocol socket options.
net/tcp: Cosmetic changes to some alignment.
net/: Adds socket options needed to manage TCP-keepalive and TCP state machine logic to detect if that the remote peer is alive. Still missing the timer poll logic to send the keep-alive probes and the state machine logic to respond to probes.
User-space networking stack API allows user-space daemon to
provide TCP/IP stack implementation for NuttX network.
Main use for this is to allow use and seamless integration of
HW-provided TCP/IP stacks to NuttX.
For example, user-space daemon can translate /dev/usrsock
API requests to HW TCP/IP API requests while rest of the
user-space can access standard socket API, with socket
descriptors that can be used with NuttX system calls.
When a poll requesting POLLOUT happens, the poll should return
immediately if a write will not block. This change adds that, as
opposed to the old behaviour of blocking until a timer from the
Ethernet driver eventually triggers the poll to complete.
This is only implemented for buffered TCP. Unbuffered TCP should
behave as before.
The solution here is to eliminate the device devices semaphore altogether. This eliminates netdev_semtake() and netdev_semgive() and replaces them with net_lock() and net_unlock() which have larger scope as needed for this purpose.