Trivially triggered with two or more threads write to regular syslog stream and to emergency stream. In this case, freeze happens because of mismatch of priv->ie (TXEIE == 0) and actually enabled interrupts in USART registers (TXEIE == 1), which leads to unhandled TXE interrupt
and causes interrupt storm for USART.
Task A holds an IOB. There are no further IOBs. The value of semcount is zero.
Task B calls iob_alloc(). Since there are not IOBs, it calls sem_wait(). The v
alue of semcount is now -1.
Task A frees the IOB. iob_free() adds the IOB to the free list and calls sem_post() this makes Task B ready to run and sets semcount to zero NOT 1. There is one IOB in the free list and semcount is zero. When Task B wakes up it would increment the sem_count back to the correct value.
But an interrupt or another task runs occurs before Task B executes. The interrupt or other tak takes the IOB off of the free list and decrements the semcount. But since semcount is then < 0, this causes the assertion because that is an invalid state in the interrupt handler.
So I think that the root cause is that there the asynchrony between incrementing the semcount. This change separates the list of IOBs: Currently there is only a free list of IOBs. The problem, I believe, is because of asynchronies due sem_post() post cause the semcount and the list content to become out of sync. This change adds a new 'committed' list: When there is a task waiting for an IOB, it will go into the committed list rather than the free list before the semaphore is posted. On the waiting side, when awakened from the semaphore wait, it will expect to find its IOB in the committed list, rather than free list.
In this way, the content of the free list and the value of the semaphore count always remain in sync.
The reason was that the code hit the line "canerr("ERROR: No available mailbox\n");" in stm32_cansend, so can_xmit thinks it has sent the packet to the hardware, but actually has not. Therefore the transmit interrupt never happens which would call can_txdone, and so the size of the FIFO size does not decrease.
The reason why the code actually hit the mentioned line above, is because stm32can_txready uses a different (incomplete) condition than stm32can_send to determine if the mailbox can be used for sending, and thus can_xmit forwards the packet to stm32can_send. stm32can_txready considered mailboxes OK for sending if the mailbox was empty, but did not consider that mailboxes may not yet be used if the request completed bit is set - stm32can_txinterrupt has to process these mailboxes first.
Note that I have also modified stm32can_txinterrupt - I removed the if condition, because the CAN controller retries to send the packet until it succeeds. Also if the condition would not evaluate to true, can_txdone would not be called and the FIFO size would not decrease also.
I don't like the idea that syslog_dev_write() still scans the entire output buffer to expand CR-LF sequence. This seems really wasteful, especially in this case where we can be sure that the is no CR or LF without a matching LF or CR. Bu, I think, the existing behavior in syslog_dev_write() must be retained because it is needed in other contexts.