Barnsley fern fractal

Thoughts on software architecture and development, and methods and techniques for improving the quality thereof.

David B. Robins (home)

Code Visions: Improving software quality
Asix (Ethernet over USB) Linux driver patches

By David B. Robins Saturday, October 17, 2015 12:19 EST (link)

Exacq's UNA (preview video posted to LinkedIn and Twitter) is a new video recorder with 8 or 16 built-in PoE ports for IP camera connection. Combined with auto-discovery and addressing, it makes for a great "plug and play" experience for businesses looking for video security.

Unfortunately, the driver for the Ethernet-over-USB device used, Asix's AX88772C, crashed our Linux test system during an "extreme debugging" session, leaving only a stack trace (photographed before the watchdog reset the system), and we haven't been able to reproduce it since. Fortunately, I was able to work backwards using the trace and the module source to determine what must have happened, and I made and tested a fix and submitted it to the kernel mailing lists (my first kernel patch). It was accepted; however, maybe a day later someone else submitted a set of patches superseding mine (since I don't believe in coincidence, I think at least my post caused him to post changes he'd been sitting on). I urged the maintainer to accept the 5-patch set over mine since it fixed the issue and also cleaned up the code in other ways.

The original issue was that if a memory allocation for a network buffer (sk_buff, or "skb") failed, a flag set earlier and not cleared properly in this case would cause a later call to assume the buffer had been allocated, and pass it to skb_put, which would dereference it and cause the crash.

Wanting to use what would (eventually) become the mainline kernel code, we tested the 5-patch set but it caused horrible throughput issues (down from tens of Mbit/s to tens of kbit/s). We then scaled back to my original patch, which seemed to work for a while, but then we found a machine that reliably crashed every few hours in the same part of the Asix driver. Investigation showed that this was a different issue related to how the Asix hardware passes data to the driver over USB. It includes a 32-bit header, which has some status bits and two 11-bit values specifying the size of the (network) packet, one bitwise inverted and used as a check. Due to other hardware/BIOS issues, we were getting a lot of bad data on that particular machine (tens of thousands of bad packets per hour), and with an 11-bit check, 1/2048 random 32-bit values will match by chance. This lead to accepting invalid size values that doesn't match the actual packets being sent, and if the size in the header—used to allocate an skb—is smaller than the actual packet, adding the packet using skb_put will cause an overflow that stops in skb_over_panic.

I wanted to take the 5-patch set and see if it fixed both crashes, so I examined the throughput problem and determined it was caused by patch 4/5. The "fixup" handler, asix_rx_fixup_internal, was intended to take one or more USB data packets and return a network packet; since it could be called multiple times for one network packet, it saved some state in the driver private data area, such as the remaining (network) packet size. If called with a continuation packet (i.e., remaining size > 0), this patch checks to see if the supposed continuation packet begins with what looks like a header (with the aformentioned 32-bit size/not-size), and if it did not, it would assume the continuation was actually the start of a new network packet (and the USB layer had gotten out of sync or lost the expected continuation). This is completely backwards: it would make some sense to assume it was a new packet if it did look like a header, but even then with a 1/2048 chance of any random continuation "looking like a header" even then it's dubious. And it easy to see why it affected throughput: any "continued" packet, which might well be most of them, would be lost—only small packets or those 1/2048 whose continuation "looked like a header" would get through!

I ended up removing that part of the patch set, and adding another check for the earlier mentioned size mismatch causing an overflow in skb_put; it had no throughput issues when we tested and has been stable thus far. I have contacted the author of the 5-patch set, but not received a reply; if he does not get back to me I will post to the appropriate kernel lists with my findings and suggesting not taking patch 4/5 and adding an additional check for skb overflow.

Content on this site is licensed under a Creative Commons Attribution 3.0 License and is copyrighted by the authors.