4654 tests (200K * 4654 = 930M) 1234 tests (200K * 1234 = 247M)
As an experiment I changed the size of the default thread stack from 4096 to 16384 bytes. This value is in kyu.h and it is necessary to do a "make clean ; make" to rebuild Kyu after such a change. This did not solve the problem, the test hung after 1224 transfers. Looking at wireshark (which I have continually recording all traffic to the ethernet address of my test system) I see that the hang happened in mid transfer. I now have generate the data for the transfer so that the start and end blocks are clearly evident (they are filled with the strings "START" or "END--" repeated to fill a 500 byte block. Blocks in the middle are filled with "PK168" or some such, where the number counts which block of the transfer it is. This makes it sort of easy to diagnose where in the game we are when things do hang.
I do writes of 2000 bytes in Kyu, but wireshark tells me that all of the traffic is packets with 500 byte payloads, so I have taken care that the "data marking" is done with this quantization in mind.
A person might ask whether we are just rolling the dice for something like a race condition or whether we have something like a memory leak. It is hard to say with just a few data points on how frequent it is. But at any event, it would be sloppy work without excuse to ignore this.
Perform other tests using the network, find out what does or does not cause the system to hang. Try transfers with UDP (which has nothing to do with the new Kyu code). Think of diagnostics that will keep Kyu busy for days on end without having anything to do with the network. Try back and forth TCP testing with short packets (which is a common actual use case).
Get this running on other hardware (the BBB or even the Zynq) and see if the problem reveals itself in a different way.
Find out if Kyu is responding to interrupts when hung. We could use a GPIO interrupt (i.e. a button) or a serial interrupt. Currently kyu sits polling the uart when it is idle. If we implement a serial console, Kyu will sit in the idle loop waiting for interrupts.
How does Kyu act if we go into a useless loop in some thread running at higher priority than the console? We would expect the console to just "go away" as it will never get scheduled. Timer interrupts should still run. We could have a thread running at a higher priority than our "hog" thread and perhaps do something like "say hi" every 60 seconds. If this works, we could run such a thing while we do a long network test and see if during a hang we are still getting timer interrupts and scheduling the "hi" thread. Actually, we might as well avoid the complication of the "hi" thread and just send a greeting directly from the interrupt code, although both would be instructive.
Thread: name ( &tp ) state sp pri Thread: net (400a75a8) SEM J 40578000 10 Thread: dog (400a7---) SEM J 4057---- 10 Thread: net_slow (400a74b8) REPEAT C 4057c000 11 Thread: tcp-bsd (400a7788) SEM J 40570000 12 Thread: tcp-timer (400a7698) REPEAT C 40574000 13 Thread: wangdoodle (400a72d8) SEM J 40584000 30 * Thread: shell (400a7968) READY I 40568000 40 Thread: idle (400a7878) READY C 4056c000 1234 Thread Cur : (400a7968)The "dog" thread at priority 10 is what sends out the "alive and kicking" message. Our test server ("wangdoodle") at priority 30 is what doles out data in large quantities to our python script. The Kyu shell runs at priority 40.
As currently coded, the shell must be the lowest priority thread in the mix because it polls the uart endlessly waiting for characters to be typed. So, if wangdoodle finds some way to go into an infinite loop (ar any of the other tcp and network related threads), then the shell will be starved and the system will appear to be hung.
The good news is that nothing appears to be wrong with Kyu itself. What is tempting to do at this point is to recode the getchar routine that the shell uses so that it is interrupt driven. Then it could be run at a higher priority (and block on some semaphore related to characters being typed). This would allow the shell to be run at a higher priority than any network threads and remain responsive.
What I do next is to overhaul the orange pi serial driver so it can use interrupts. I have to do a bit of special work so that the console can come up initially using polling, then switch to using interrupts when Kyu is fully initialized. This is because I need Kyu semaphores for the driver, but it is also a good thing in general to be robust during initialization.
Once this is done, I can run the shell at priority 11 without it
hogging all the CPU resources to do polling, and I launch the
TCP test. Now if and when it hangs I get get some information.
The threads and priorities now look like this:
Thread: name ( &tp ) state sp pri * Thread: shell (400a7984) READY J 40568000 11 Thread: net (400a75c4) SEM J 40578000 12 Thread: net_slow (400a74d4) REPEAT C 4057c000 13 Thread: tcp-bsd (400a77a4) SEM J 40570000 14 Thread: tcp-timer (400a76b4) READY C 40574000 15 Thread: wangdoodle (400a7204) SEM J 40588000 30 Thread: wang_thr (400a73e4) READY I 40580000 31 Thread: idle (400a7894) READY C 4056c000 1234
I make this a topic of its own because I expect it to be a substantial can of works.
First of all, we need to get JTAG hardware connected to our board. It is not entirely clear how to do this with the Nanopi Neo board I am now working with (unless it can be done by repurposing GPIO pins, it simply isn't possible). A bigger Orange Pi board might do better, but even then it may or may not be possible.
Both the BBB and my Zynq based EBAZ boards have clearly labeled and designated connectors for JTAG. On one hand, the BBB uses an odd little surface mount connector. I have the connector and spent good money for it, expecting just such a day like this. On the other hand, the EBAZ has a nice connector ready to go with standard pins on 0.1 inch spacing.
Shifting my work to the EBAZ would have another benefit. I have not done a Kyu port to the Zynq, but intend to do so. It would provide a bit of a break from all this work on TCP, but the biggest part of the job would be writing an ethernet driver. This would be new hardware (albeit still a 32 bit ARM) which could expose the hang in a different way. I would like to learn how to use JTAG with the EBAZ for other reasons (like injecting a Kyu image via JTAG into NAND flash on board). This is tempting.
Kyu / [email protected]