l3/l4 stack will decouple the reference of d_buf gradually, Only legacy
devices still retain d_buf support, new net devices will use d_iob
Signed-off-by: chao an <anchao@xiaomi.com>
fix build break if enable CONFIG_NET_IPv6 only
In file included from tcp/tcp_sendfile.c:38:
tcp/tcp_sendfile.c: In function ‘sendfile_eventhandler’:
tcp/tcp_sendfile.c:173:27: error: ‘struct tcp_conn_s’ has no member named ‘domain’
173 | DEBUGASSERT(conn->domain == PF_INET6);
| ^~
Signed-off-by: chao an <anchao@xiaomi.com>
Issue:
recv return 0 means peer side has already closed the connection according to
man page's description.
0 is returned without set errno when TCP rx timeout happens with current
design.
Solution:
return result instead of ir_result when ir_result is 0, then -1 will be
returned with errno set to EAGAIN for tcp rx buffer empty case.
Signed-off-by: liangchaozhong <liangchaozhong@xiaomi.com>
devif/ipv6_input.c: In function 'ipv6_input':
devif/ipv6_input.c:59:33: error: 'TCPIPv6BUF' undeclared (first use in this function); did you mean 'IPv6BUF'?
59 | #define PAYLOAD ((FAR uint8_t *)TCPIPv6BUF)
| ^~~~~~~~~~
devif/ipv6_input.c:302:14: note: in expansion of macro 'PAYLOAD'
302 | payload = PAYLOAD; /* Assume payload starts right after IPv6 header */
|
Signed-off-by: chao an <anchao@xiaomi.com>
Fixed by commit:
| net: remove pvconn reference from all devif callback
|
| Do not use 'pvconn' argument to get the connection pointer since
| pvconn is normally NULL for some events like NETDEV_DOWN.
| Instead, the connection pointer can be reliably obtained from the
| corresponding private pointer.
Signed-off-by: chao an <anchao@xiaomi.com>
I noticed that the conn instance will leak during stress test,
The close work queued from tcp_close_eventhandler() will be canceled
by tcp_timer() immediately:
Breakpoint 1, tcp_close_eventhandler (dev=0x565cd338 <up_irq_restore+108>, pvpriv=0x5655e6ff <getpid+12>, flags=0) at tcp/tcp_close.c:71
(gdb) bt
| #0 tcp_close_eventhandler (dev=0x565cd338 <up_irq_restore+108>, pvpriv=0x5655e6ff <getpid+12>, flags=0) at tcp/tcp_close.c:71
| #1 0x5658bf1e in devif_conn_event (dev=0x5660bd80 <g_sim_dev>, flags=512, list=0x5660d558 <g_cbprealloc+312>) at devif/devif_callback.c:508
| #2 0x5658a219 in tcp_callback (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>, flags=512) at tcp/tcp_callback.c:167
| #3 0x56589253 in tcp_timer (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:378
| #4 0x5658dd47 in tcp_poll (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_devpoll.c:95
| #5 0x5658b95f in devif_poll_tcp_connections (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:601
| #6 0x5658b9ea in devif_poll (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:722
| #7 0x56577230 in netdriver_txavail_work (arg=0x5660bd80 <g_sim_dev>) at sim/up_netdriver.c:308
| #8 0x5655999e in work_thread (argc=2, argv=0xf3db5dd0) at wqueue/kwork_thread.c:178
| #9 0x5655983f in nxtask_start () at task/task_start.c:129
(gdb) c
Continuing.
Breakpoint 2, tcp_update_timer (conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:178
(gdb) bt
| #0 tcp_update_timer (conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:178
| #1 0x5658952a in tcp_timer (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_timer.c:708
| #2 0x5658dd47 in tcp_poll (dev=0x5660bd80 <g_sim_dev>, conn=0x5660c4a0 <g_tcp_connections>) at tcp/tcp_devpoll.c:95
| #3 0x5658b95f in devif_poll_tcp_connections (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:601
| #4 0x5658b9ea in devif_poll (dev=0x5660bd80 <g_sim_dev>, callback=0x565770f2 <netdriver_txpoll>) at devif/devif_poll.c:722
| #5 0x56577230 in netdriver_txavail_work (arg=0x5660bd80 <g_sim_dev>) at sim/up_netdriver.c:308
| #6 0x5655999e in work_thread (argc=2, argv=0xf3db5dd0) at wqueue/kwork_thread.c:178
| #7 0x5655983f in nxtask_start () at task/task_start.c:129
Since a separate work will add 24 bytes to each conn instance,
but in order to support the feature of asynchronous close(),
I can not find a better way than adding a separate work,
for resource constraints, I recommend the developers to enable
CONFIG_NET_ALLOC_CONNS, which will reduce the ram usage.
Signed-off-by: chao an <anchao@xiaomi.com>
Do not use 'pvconn' argument to get the connection pointer since
pvconn is normally NULL for some events like NETDEV_DOWN.
Instead, the connection pointer can be reliably obtained from the
corresponding private pointer.
Signed-off-by: chao.an <anchao@xiaomi.com>
Otherwise, when a long test triggers multiple timeout retransmissions,
the late timeout retransmissions are always delayed between 24 and 48 seconds
Signed-off-by: zhanghongyu <zhanghongyu@xiaomi.com>
since it is impossible to track producer and consumer
correctly if TCP/IP stack pass IOB directly to netdev
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
fix regression of invalid update the rexmit_seq in buffer mode
rexmit_seq should not be used instead of sndseq in fast retransmission,
sndseq of retransmission in the packet does not need to be re-updated
Signed-off-by: chao.an <anchao@xiaomi.com>
Net poll teardown is not protected by net lock, if the conn is released
before teardown, the assertion failure will be triggered during free dev
callback, this patch will add the net lock around net poll teardown to
fix race condition
nuttx/libs/libc/assert/lib_assert.c:36
nuttx/net/devif/devif_callback.c:85
nuttx/net/tcp/tcp_netpoll.c:405
nuttx/fs/vfs/fs_poll.c:244
nuttx/fs/vfs/fs_poll.c:500
Signed-off-by: chao.an <anchao@xiaomi.com>
When the free connection list is unenough to alloc a new instance,
the TCP stack will reuse the currently closed connection, but if
the handle is not released by the user via close(2), the reference
count of the connection remains in a non-zero value, it will cause
the assertion to fail, so when the handle is not released we should
not use such a conn instance when being actively closed, and ensure
that the reference count is assigned within the net lock protection
|(gdb) bt
|#0 up_assert (filename=0x565c78f7 "tcp/tcp_conn.c", lineno=771) at sim/up_assert.c:75
|#1 0x56566177 in _assert (filename=0x565c78f7 "tcp/tcp_conn.c", linenum=771) at assert/lib_assert.c:36
|#2 0x5657d620 in tcp_free (conn=0x565fb3e0 <g_tcp_connections>) at tcp/tcp_conn.c:771
|#3 0x5657d5a1 in tcp_alloc (domain=2 '\002') at tcp/tcp_conn.c:700
|#4 0x565b1f50 in inet_tcp_alloc (psock=0xf3dea150) at inet/inet_sockif.c:144
|#5 0x565b2082 in inet_setup (psock=0xf3dea150, protocol=0) at inet/inet_sockif.c:253
|#6 0x565b1bf0 in psock_socket (domain=2, type=1, protocol=0, psock=0xf3dea150) at socket/socket.c:121
|#7 0x56588f5f in socket (domain=2, type=1, protocol=0) at socket/socket.c:278
|#8 0x565b11c0 in hello_main (argc=1, argv=0xf3dfab10) at hello_main.c:35
|#9 0x56566631 in nxtask_startup (entrypt=0x565b10ef <hello_main>, argc=1, argv=0xf3dfab10) at sched/task_startup.c:70
|#10 0x565597fa in nxtask_start () at task/task_start.c:134
Signed-off-by: chao.an <anchao@xiaomi.com>
This reverts commit b88a1fd7fd. [1]
Because:
* It casues assertion failures like [2].
* I don't understand what it attempted to fix.
[1]
```
commit b88a1fd7fd
Author: chao.an <anchao@xiaomi.com>
Date: Sat Jul 2 13:17:41 2022 +0800
net/tcp: discard connect reference before free
connect reference should be set to 0 before free
Signed-off-by: chao.an <anchao@xiaomi.com>
```
[2]
```
#0 up_assert (filename=0x5516d0 "tcp/tcp_conn.c", lineno=771) at sim/up_assert.c:75
#1 0x000000000040a4bb in _assert (filename=0x5516d0 "tcp/tcp_conn.c", linenum=771) at assert/lib_assert.c:36
#2 0x000000000042a2ad in tcp_free (conn=0x597fe0 <g_tcp_connections+384>) at tcp/tcp_conn.c:771
#3 0x000000000053bdc2 in tcp_close_disconnect (psock=0x7f58d1abbd80) at tcp/tcp_close.c:331
#4 0x000000000053bc69 in tcp_close (psock=0x7f58d1abbd80) at tcp/tcp_close.c:366
#5 0x000000000052eefe in inet_close (psock=0x7f58d1abbd80) at inet/inet_sockif.c:1689
#6 0x000000000052eb9b in psock_close (psock=0x7f58d1abbd80) at socket/net_close.c:102
#7 0x0000000000440495 in sock_file_close (filep=0x7f58d1b35f40) at socket/socket.c:115
#8 0x000000000043b8b6 in file_close (filep=0x7f58d1b35f40) at vfs/fs_close.c:74
#9 0x000000000043ab22 in nx_close (fd=9) at inode/fs_files.c:544
#10 0x000000000043ab7f in close (fd=9) at inode/fs_files.c:578
```
Retransmit only one the earliest not acknowledged segment
(according to RFC 6298 (5.4)). The issue is the same as it was
in tcp_send_unbuffered.c and tcp_sendfile.c.
Signed-off-by: chao.an <anchao@xiaomi.com>
The time consuming of tcp waving hands(close(2)) will be affected
by network jitter, especially the wireless device cannot receive
the last-ack under worst environment, in this change we move the
tcp close callback into background and invoke the resource free
from workqueue, which will avoid the user application from being
blocked for a long time and unable to return in the call of close
Signed-off-by: chao.an <anchao@xiaomi.com>
since the code could map the unsupported work to the
supported one and remove select SCHED_WORKQUEUE from
Kconfig since SCHED_[L|H]PWORK already do the selection
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
1. remove the unnecessary interfaces tcp_close_monitor()
socket flags(s_flags) is a global state for net connection
remove the incorrect update for stop monitor
2. do not start the tcp monitor from duplicated psock
the tcp monitor has already registered in connect callback
------------------------------------------------------------
This patch also fix the telnet issue reported by:
https://github.com/apache/incubator-nuttx/pull/5434#issuecomment-1035600651
the orignal session fd is closed after dup, the connect state
has incorrectly migrated to close:
drivers/net/telnet.c:
977 static int telnet_session(FAR struct telnet_session_s *session)
...
1031 ret = psock_dup2(psock, &priv->td_psock);
...
1082 nx_close(session->ts_sd);
Signed-off-by: chao.an <anchao@xiaomi.com>
and removed two tcp_send_txnotify() calls from tcp_sendfile (they are not needed anymore).
As a result, the TX throughput of both the tcp_send_buffered and tcp_send_unbuffered
is significantly boosted in case of TUN/TAP network device.
According to RFC 5681 (3.2) the TCP Fast Retransmit algorithm should start
if the threshold of 3 duplicate ACKs is reached.
Thus the threshold should be a constant, not an integer option.
(conn->sndseq was updated in multiple places that was unreasonable and complicated).
This optimization is the same as it was done for tcp_send_unbuffered.
Wrong unackseq calculation locked conn->tx_unacked at non-zero values
even if all ACKs were received.
This issue is the same as it was with tcp_send_unbuffered.
Do not use pvconn argument to get the TCP connection pointer because pvconn is
normally NULL for some events like NETDEV_DOWN. Instead, the TCP connection pointer
can be reliably obtained from the corresponding TCP socket.
Both the snd_ackcb and snd_datacb callbacks were created and destroyed right after sending every packet.
Whenever TCP_REXMIT event occurred due to TCP send timeout, TCP_REXMIT was ignored because
snd_ackcb callback had been destroyed by the time.
The issue is fixed as follows:
- both the snd_ackcb and snd_datacb callbacks are combined into one snd_cb callback
(the same way as in tcp_send_unbuffered.c).
- the snd_cb callback lives until all requested data (via sendfile) is sent,
including all ACKs and possible retransmissions.
As a positive side effect of the code optimization / fix, sendfile TCP payload throughput is increased.
tcp_sendfile() reads data directly from a file and does not use NET_TCP_WRITE_BUFFERS data flow
even if CONFIG_NET_TCP_WRITE_BUFFERS option is enabled.
Despite this, tcp_sendfile relied on NET_TCP_WRITE_BUFFERS specific flow control variables that
were idle during sendfile operation. Thus it was a total inconsistency.
E.g. because of the issue, TCP socket used by sendfile() operation never issued
FIN packet on close() command, and the TCP connection hung up.
As a result of the fix, simultaneously enabled CONFIG_NET_TCP_WRITE_BUFFERS and
CONFIG_NET_SENDFILE options can coexist.
Wrong unackseq calculation locked conn->tx_unacked at non-zero values
even if all ACKs were received. Thus unbuffered psock_tcp_send() never completed.
If the remote TCP receiver advertised TCP window size greater than 64 KB
and TCP ACK packets returned to the NuttX TCP sender with a significant delay,
tx_unacked variable overflowed and further TCP send stalled forever
(until TCP re-connection).
While it's a neat idea, it doesn't work well in reality.
* Many of modern tcp stacks don't obey the "ack every other packet"
rule these days. (Linux, macOS, ...)
* Even if a traditional TCP implementation is assumed, we can't
predict/control which packets are acked reliably. For example,
window updates can easily mess up our strategy.
tcp_timer: eliminated false decrements of conn->timer in case of multiple network adapters.
The false timer decrements sometimes provoked TCP spurious retransmissions due to premature timeouts.
In case of enabled packet forwarding mode, packets were forwarded in a reverse order
because of LIFO behavior of the connection event list.
The issue exposed only during high network traffic. Thus the event list started to grow
that resulted in changing the order of packets inside of groups of several packets
like the following: 3, 2, 1, 6, 5, 4, 8, 7 etc.
Remarks concerning the connection event list implementation:
* Now the queue (list) is FIFO as it should be.
* The list is singly linked.
* The list has a head pointer (inside of outer net_driver_s structure),
and a tail pointer is added into outer net_driver_s structure.
* The list item is devif_callback_s structure.
It still has two pointers to two different list chains (*nxtconn and *nxtdev).
* As before the first argument (*dev) of the list functions can be NULL,
while the other argument (*list) is effective (not NULL).
* An extra (*tail) argument is added to devif_callback_alloc()
and devif_conn_callback_free() functions.
* devif_callback_alloc() time complexity is O(1) (i.e. O(n) to fill the whole list).
* devif_callback_free() time complexity is O(n) (i.e. O(n^2) to empty the whole list).
* devif_conn_event() time complexity is O(n).
Gregory Nutt has submitted the SGA
UVC Ingenieure has submitted the SGA
Max Holtzberg has submitted the ICLA
as a result we can migrate the licenses to Apache.
Signed-off-by: Alin Jerpelea <alin.jerpelea@sony.com>
* Do not accept the window in old segments.
Implement SND.WL1/WL2 things in the RFC.
* Do not accept the window in the segment w/o ACK bit set.
The window is an offset from the ack seq.
(maybe it's simpler to just drop segments w/o ACK though)
* Subtract snd_wnd by the amount of the ack advancement.
Fix a wrong assertion in:
```
commit 98ec46d726
Author: YAMAMOTO Takashi <yamamoto@midokura.com>
Date: Tue Jul 20 09:10:43 2021 +0900
tcp_send_buffered.c: fix iob allocation deadlock
Ensure to put the wrb back onto the write_q when blocking
on iob allocation. Otherwise, it can deadlock with other
threads doing the same thing.
```
I forget to submit this with https://github.com/apache/incubator-nuttx/pull/4257
With an applictation using mbedtls, I observed retransmitted segments
with corrupted user data, detected by the peer tls during mac processing.
Looking at the packet dump, I suspect that a wrb which has been put back
onto the write_q for retransmission was partially sent but fully acked.
Note: it's normal for a retransmission to be acked before sent.
In that case, the bug fixed in this commit would cause the wrb have
a wrong sequence number, possibly the same as the next wrb. It matches
what I saw in the packet dump. That is, the broken segments contain the
payload identical to one of the previous segment.
Consider a bi-directional TCP connection:
1. we use all IOBs for tx queue
2. we advertize zero recv window because we have no free IOBs
3. if the peer tcp does the same thing,
both sides advertize zero window and can not drain the tx queue.
For a similar stall to happen, the peer doesn't need to be
a naive tcp implementation like nuttx. A naive application blocking
on send() without draining its read buffer is enough.
(Probably such an application should be fixed to drain rx even
when tx is full. However, it's another story.)
This commit avoids the situation by prevent tx from grabbing
the all IOBs in the first place. (assuming CONFIG_IOB_THROTTLE > 0)
Since we do not have the Nagle's algorithm,
the TCP_NODELAY socket option is enabled by default.
Change-Id: I0c8619bb06cf418f7eded5bd72ac512b349cacc5
Signed-off-by: chao.an <anchao@xiaomi.com>
Since a SOL option IP_TTL exist, we should rename this IP_TTL
in netconfig.h to avoid confusion.
Signed-off-by: Huang Qi <huangqi3@xiaomi.com>
Change-Id: Ib04c36553f23bce8d362e97294a8b83eaa050cf3
quote from https://man7.org/linux/man-pages/man2/sendfile.2.html:
If offset is not NULL, then it points to a variable holding the
file offset from which sendfile() will start reading data from
in_fd. When sendfile() returns, this variable will be set to the
offset of the byte following the last byte that was read. If
offset is not NULL, then sendfile() does not modify the file
offset of in_fd; otherwise the file offset is adjusted to reflect
the number of bytes read from in_fd.
If offset is NULL, then data will be read from in_fd starting at
the file offset, and the file offset will be updated by the call.
The change also align with the implementation at:
libs/libc/misc/lib_sendfile.c
Signed-off-by: Xiang Xiao <xiaoxiang@xiaomi.com>
Change-Id: I607944f40b04f76731af7b205dcd319b0637fa04
1. change all window relative value type to uint32_t
2. move window range validity check(UINT16_MAX) before assembling TCP header
Signed-off-by: chao.an <anchao@xiaomi.com>
tcp_close disposes the connection immediately if it's called in
TCP_LAST_ACK. If it happens, we will end up with responding the
last ACK with a RST.
This commit fixes it by making tcp_close wait for the completion
of the passive close.
This fixes connection closing issues with CONFIG_NET_TCP_WRITE_BUFFERS.
Because TCP_CLOSE is used for both of input and output for tcp_callback,
the close callback and the send callback confuses each other as
the following. As it effectively disposes the connection immediately,
we end up with responding to the consequent ACK and FIN/ACK from the peer
with RSTs.
tcp_timer
-> tcp_close_eventhandler
returns TCP_CLOSE (meaning an active close)
-> psock_send_eventhandler
called with TCP_CLOSE from tcp_close_eventhandler, misinterpet as
a passive close.
-> tcp_lost_connection
-> tcp_shutdown_monitor
-> tcp_callback
-> tcp_close_eventhandler
misinterpret TCP_CLOSE from itself as
a passive close
The current code just leave the window value from the segment
from the peer. It doesn't make sense.
Instead, always use 0.
This matches what NetBSD and Linux do.
(As far as I read their code correctly.)
* It doesn't make sense to have this conditional on our own
SO_KEEPALIVE support. (CONFIG_NET_TCP_KEEPALIVE)
Actually we don't have a control on the peer tcp stack,
who decides to send us keep-alive probes.
* We should respond them for non ESTABLISHED states. eg. FIN_WAIT_2
See also:
https://github.com/apache/incubator-nuttx/pull/3919#issuecomment-868248576
Do not bother to preserve segment boundaries in the tcp
readahead queues.
* Avoid wasting the tail IOB space for each segments.
Instead, pack the newly received data into the tail space
of the last IOB. Also, advertise the tail space as
a part of the window.
* Use IOB chain directly. Eliminate IOB queue overhead.
* Allow to accept only a part of a segment.
* This change improves the memory efficiency.
And probably more importantly, allows less-confusing
recv window advertisement behavior.
Previously, even when we advertise N bytes window,
we often couldn't actually accept N bytes. Depending on
the segment sizes and IOB configurations, it was causing
segment drops.
Also, the previous code was moving the right edge of the
window back and forth too often, even when nothing in
the system was competing on the IOBs. Shrinking the
window that way is a kinda well known recipe to confuse
the peer stack.
* Move the code to advance rcvseq for user data from tcp_input
to receive handlers.
Motivation: allow partial ack.
* If we drop a segment, ignore FIN as well. Note than tcp FIN bit is
logically after the user data in the same segment.
I assume this was just an oversight because I couldn't
find any obvious reason to special-case only the first IOB.
The commit message of the original commit is cited below.
```
commit bf21056001
Author: chao.an <anchao@xiaomi.com>
Date: Fri Nov 27 09:50:38 2020 +0800
net/tcp: fallback to unthrottle pool to avoid deadlock
Add a fallback mechanism to ensure that there are still available
iobs for an free connection, Guarantees all connections will have
a minimum threshold iob to keep the connection not be hanged.
Change-Id: I59bed98d135ccd1f16264b9ccacdd1b0d91261de
Signed-off-by: chao.an <anchao@xiaomi.com>
```
* Fixes the case where the window was small but not zero.
* tcp_recvfrom: Remove tcp_ackhandler. Instead, simply schedule TX for
a possible window update and make tcp_appsend decide.
* Replace rcv_wnd (the last advertized window size value) with
rcv_adv. (the window edge sequence number advertized to the peer)
rcv_wnd was complicated to deal with because its base (rcvseq) is
also moving.
* tcp_appsend: Send a window update even if there are no other reasons
to send an ack.
Namely, send an update if it increases the window by
* 2 * mss
* or the half of the max possible window size