Barnsley fern fractal

Thoughts on software architecture and development, and methods and techniques for improving the quality thereof.

David B. Robins (home)

Code Visions: Improving software quality
Asix (Ethernet over USB) Linux driver patches

By David B. Robins Saturday, October 17, 2015 12:19 EST (link)

Exacq's UNA (preview video posted to LinkedIn and Twitter) is a new video recorder with 8 or 16 built-in PoE ports for IP camera connection. Combined with auto-discovery and addressing, it makes for a great "plug and play" experience for businesses looking for video security.

Unfortunately, the driver for the Ethernet-over-USB device used, Asix's AX88772C, crashed our Linux test system during an "extreme debugging" session, leaving only a stack trace (photographed before the watchdog reset the system), and we haven't been able to reproduce it since. Fortunately, I was able to work backwards using the trace and the module source to determine what must have happened, and I made and tested a fix and submitted it to the kernel mailing lists (my first kernel patch). It was accepted; however, maybe a day later someone else submitted a set of patches superseding mine (since I don't believe in coincidence, I think at least my post caused him to post changes he'd been sitting on). I urged the maintainer to accept the 5-patch set over mine since it fixed the issue and also cleaned up the code in other ways.

The original issue was that if a memory allocation for a network buffer (sk_buff, or "skb") failed, a flag set earlier and not cleared properly in this case would cause a later call to assume the buffer had been allocated, and pass it to skb_put, which would dereference it and cause the crash.

Wanting to use what would (eventually) become the mainline kernel code, we tested the 5-patch set but it caused horrible throughput issues (down from tens of Mbit/s to tens of kbit/s). We then scaled back to my original patch, which seemed to work for a while, but then we found a machine that reliably crashed every few hours in the same part of the Asix driver. Investigation showed that this was a different issue related to how the Asix hardware passes data to the driver over USB. It includes a 32-bit header, which has some status bits and two 11-bit values specifying the size of the (network) packet, one bitwise inverted and used as a check. Due to other hardware/BIOS issues, we were getting a lot of bad data on that particular machine (tens of thousands of bad packets per hour), and with an 11-bit check, 1/2048 random 32-bit values will match by chance. This lead to accepting invalid size values that doesn't match the actual packets being sent, and if the size in the header—used to allocate an skb—is smaller than the actual packet, adding the packet using skb_put will cause an overflow that stops in skb_over_panic.

I wanted to take the 5-patch set and see if it fixed both crashes, so I examined the throughput problem and determined it was caused by patch 4/5. The "fixup" handler, asix_rx_fixup_internal, was intended to take one or more USB data packets and return a network packet; since it could be called multiple times for one network packet, it saved some state in the driver private data area, such as the remaining (network) packet size. If called with a continuation packet (i.e., remaining size > 0), this patch checks to see if the supposed continuation packet begins with what looks like a header (with the aformentioned 32-bit size/not-size), and if it did not, it would assume the continuation was actually the start of a new network packet (and the USB layer had gotten out of sync or lost the expected continuation). This is completely backwards: it would make some sense to assume it was a new packet if it did look like a header, but even then with a 1/2048 chance of any random continuation "looking like a header" even then it's dubious. And it easy to see why it affected throughput: any "continued" packet, which might well be most of them, would be lost—only small packets or those 1/2048 whose continuation "looked like a header" would get through!

I ended up removing that part of the patch set, and adding another check for the earlier mentioned size mismatch causing an overflow in skb_put; it had no throughput issues when we tested and has been stable thus far. I have contacted the author of the 5-patch set, but not received a reply; if he does not get back to me I will post to the appropriate kernel lists with my findings and suggesting not taking patch 4/5 and adding an additional check for skb overflow.

State of the hotel mobile key world

By David B. Robins Sunday, October 11, 2015 11:44 EST (link)

There are, I found recently, a lot of companies with hotel mobile key solutions (i.e., sign up via an app earlier, bypass the front desk on arrival, open your hotel room door with your mobile phone). This doesn't even include the residentials, but I wouldn't be surprised if some of them are planning to enter the hotel market too (at least one of them already did). When I did search, I didn't find anywhere that had gathered together the current slate of competitors, so I've done my best to list them all together here, with links to each's solution:

(I should mention that I currently work for Tyco, although I've never even been in the same room with anyone involved in their work with Assa Abloy, and that I have worked and consulted for Hilton and Yikes in the past. I'm just gathering and providing information I've found online out of professional interest in the subject, and am attempting to provide an objective summation.)

Numbers-wise, the hotel chains might have it, although many hotels are franchised rather than company-owned and franchise owners may be able to opt out of the system (although if costs are reasonable the large-scale adoption of the system should mean their signing on would be low risk); second, the lock companies, for smaller chains and independents, and then the startups get to fight over the scraps. It's helpful that lock companies have realized that hotels are not usually going to want to upgrade all their locks and have provided options supporting existing locks similar to the startups' in-door dongles, but possibly less disruptive if they fit inside the lock itself: in particular, Assa Abloy, makers of the popular VingCard Classic, Signature, and Essence RFID locks, offers an upgrade by "[adding] a small Bluetooth Low Energy board to the lock" and also offers an upgrade path for their older locks.

Value-adds that might make a hotel owner consider doing business with one over another would include things like having a full solution with a "cloud" web portal to set up accounts, PMS (property management system) integration, ability to brand ("skin") the mobile app with their hotel brand, full installation, and technical and system support (24/7, if possible). But it looks like most of them already do all that. And in case of the smaller companies failing or the larger companies discontinuing their mobile key project (i.e., if a clear "winner" emerges), having the plans and code in escrow against such a possibility might be a requirement too. Note I didn't mention "security" as a feature, because it's more of "if you don't have it you're dead" and it can be laughable to see claims about number of bits or "the lock provides half of the security key!" (paraphrase) or other buzzwords that really mean nothing at all except that the blurb writer didn't have anyone that understood security to check with.

It would not surprise me if hotel owners are very much taking a "wait and see" attitude toward mobile key—to see which system or systems become dominant: let's face it, they can live without it and it's very much a novelty at this point. Almost certainly any electronic lock system that they have will work with at least one system, so there's no need to pay for a costly replacement of all of their locks. A New York Times article agreed, noting "Investing in a technology that might become obsolete or never take off is one of the risks hotels face in building a mobile key program. This is one reason not every brand is climbing on the mobile key bandwagon."

As the landscape evens out, a mobile key solution is becoming commoditized, which means that rather than competing on features, the providers are competing on price in a race to the bottom—an unenviable position to be in especially if that's your primary line of business. Profits are lower, and it's harder to recoup investment. It's not such a big deal if you're Assa Abloy and a mobile key solution is just a value-add, a catchup, something you might roll into the price of the lock and maintenance agreement; but it's a little like Microsoft crushing your startup by accident when they add a feature.

There are also a number of residential solutions in varying states of function: of the above, Unikey/Kwikset's Kevo is shipping; the rest don't appear to have shipping hotel solutions yet; others have done better at listing these and they aren't the focus: August, Lockitron, Schlage, Goji, Yale.

It will be interesting to see how this settles out; and unfortunately for sellers, hotels may take the same position, with adoption (outside of brands doing their own) being slow to start and then picking up as a direction emerges. And please let me know if I missed any other solutions out there!

Over the air debugging

By David B. Robins tags: C++, Python, Tools, Debugging Saturday, April 11, 2015 15:57 EST (link)

I had the idea a few months ago to debug our embedded device application over a Bluetooth (Low Energy, BLE) link. Three weeks ago I finished over the air firmware updates, and this last week I got a chance to try my over the air debugging plan; and as it happened it worked fairly well.

We use SEGGER's J-Link GDB server to debug our application over a USB tag cable. I figured that I could write a GDB server that could instead use Bluetooth to communicate. It would be able to leverage GDB for symbol lookup and examine and write to memory although not (at least at first) run and step (stopping execution would terminate the Bluetooth connection). I looked up the GDB remote serial protocol, but as it happened an open source Python GDB server project already existed: pyOCD. I made some fixes to make it work with Python 3 (and stop obnoxiously logging to console), and wrote BLE transport and target classes.

When the pyOCD GDB server is launched, it runs in its own thread. My implementations of the read and write memory functions (readBlock32, readMem, writeBlock32, and writeMem) leveraged my existing ayncio-based tools framework that communicates to our embedded device over BLE using Bluegiga USB dongles. The functions use loop.call_soon_threadsafe to invoke the BLE communication to read or write on the main thread, and wait on an event until it completes.

This is of course a dangerous facility to leave enabled on shipping code, which is why it is (until we have better general security) enabled only under #ifdef DEBUGGER; our builds are automatically tagged with these feature defines so it will be clear when it is enabled.

Having this over the air memory debugging support is more convenient than having to attach a cable, and allows doing some useful debugging of state at remote locations too. More importantly, usually the case of attaching the normal debugger disrupts the device state (losing any information that could be obtained), or after attaching gets disrupted by static, and this will not. One of the current issues we're tracking has been very hard to reproduce, and we don't have enough physical debuggers to attach to every lab device. This new facility will be a great aid in resolving down such issues.

The compiler optimized out my destructor (and it was right)

By David B. Robins tags: C++, Bugs Tuesday, February 10, 2015 12:02 EST (link)

I do most of my testing with a debug build, which has fewer optimizations than release to make debugging easier (not "none", because that makes the binary image too big for the embedded device; -Og, actually). I was doing some pre-release testing with a release build, and noticed that the destructor for one of my objects didn't appear to be called in release builds (using -Os). This object was somewhat of a special case, since it was created with placement new and destroyed with an explicit destructor call; and I always had a faint feeling of not-quite-rightness with it, and now I know why.

The destructor set a state in the object to mark it as free, since it was one of several slots that could be activated when associated with a Bluetooth advertisement or connection. This is where you should be having uneasy feelings, because looking at destroyed objects is not kosher at all; in fact, it is undefined behavior (C++ standard N3690, 3.8 Object lifetime [], paragraph 5). Thus, the compiler reasons that if you are not allowed to look at the object's fields, then there's no point in wasting good CPU time setting them! So, the state was never set to free, and (in this particular case) the advertisement never expired; although this defect was hidden by another issue for a long time.

Fixed by using an explicit Free function.

That's not a hash… that's an encoding!

By David B. Robins tags: Development, Hiring Wednesday, February 4, 2015 20:13 EST (link)

The Trello Android Developer job page requires applicants to solve a programming puzzle, the solution of which must be sent with an application. (In case the question changes or page moves, searching for the source string "acdegilmnoprstuw" or 680131659347, the "hash" of "leepadg" should bring up the question.) The question is presented as reversing the hash of a string, i.e., given a (numeric) hash value, producing the original string. For strong hashes used in the real-world, such as MD5 or SHA1, this is a difficult problem and almost certainly requires a degree of brute-forcing, i.e., enumerating all the possible strings (billions and billions…), putting them through the hash function, and checking if the result matches the given hash value. Doing it significantly more efficiently would make someone famous and (from generating Bitcoins alone) rich.

However, the Trello developer question is actually an encoding, and while that doesn't technically exclude it from being a hash, one desirable quality in a hash is that it's difficult to go from hash value to original string (or even a possible original string, accounting for collisions), whereas being able to do that is a necessary property of an encoding. (I expect the Trello problem writers know this, and want to filter brute force attempts, which are, given the smaller space than in-use hashes, possible.) Encoding vs. Encryption vs. Hashing is a well-written post that clarifies the differences. It is trivial for someone with the most basic understanding of algebra to reverse the encoding in the question (it's a "weeder" question; I'm certain they'll have harder ones to ask in an actual interview).

I'd refrain if others hadn't already posted some solutions on the web (since it's a pain to have your screening question's answer available for cutting and pasting, although a few quick live (phone or in person) questions to the candidate should suffice to detect ignorant copying), but here's a Perl one-liner that solves it:

perl -wle '$t=956446786872726; while($t) { unshift @a, $t % 37; $t -= $a[0]; $t /= 37; } shift @a; print map { substr("acdegilmnoprstuw", $_, 1) } @a'

Given that it's folded into one line, it won't win you any style points anyway. While the Android developer position doesn't specify a language, the Node.js developer opening also posted specifically requires JavaScript or CoffeeScript. And note that at least one of the answers posted online does present a brute force and ignorance "solution"; perhaps a trap for those that might ignorantly search and copy?

Wireless mouse problem

By David B. Robins tags: Tools, Bugs Wednesday, January 28, 2015 11:18 EST (link)

Just a minor observation today, but it was driving me nuts.

One day I found that my Microsoft wireless mouse was stalling and not responding after I stopped using it for a few minutes. I'd be typing, go to move the mouse again, and it would take a few seconds to respond, kind of stuttering and jumping around. It had never done this before, so I suspected a Windows update might have broken something, but when I searched I didn't see reports of similar problems.

It turns out that this time I had plugged the mouse wireless USB dongle into a "charging port" on my USB hub. Apparently it didn't like that, and it caused the observed behavior. Plugging it back into a regular port fixed it.

Buddy allocator

By David B. Robins tags: C++, Development, Embedded Saturday, January 24, 2015 09:32 EST (link)

Dynamic memory allocation is generally to be avoided on memory-constrained embedded devices, and we are very constrained: after the vendor's "soft device" we only get 6K of the total 16K RAM to use (a future chip upgrade will provide 32K, but we're months from getting it and will likely have to support the 16K chips for a while). However, we got to a point where it was necessary to allocate and deallocate buffers since we couldn't afford having them all exist statically at the same time. I had concerns about the built-in malloc being a first-fit allocator due to hard to predict fragmentation, so decided to go with a buddy allocator.

I have posted the source to GitHub with permission of my employer; it is buildable using SCons, and by default builds a project that runs a test suite using GoogleTest. Currently it is set to build for MinGW on Windows (patches to make more general welcome), and I build it using GNU tools for ARM embedded processors (GCC 4.8 and 4.9) on the device. The documentation there explains how to use in a project. I quote from the implementation notes:

The implementation sacrifices some speed to save memory by not using free lists; instead, a bitmap representing a contiguous array of the smallest block is used to track allocations, with 2 bits per smallest block (see above). The first (high) bit is set if the block is in-use; the second is set if it is the end of a logical block. For example, the initial state is 00 00 00 ... 00 01, meaning one free block the size of the whole arena. It can then be split recursively down to the minimum size by flagging the appropriate smallest-block as an end-block.

It does make use of C++11 features, and I appreciate some embedded projects may be stuck using ancient compilers stuck on C++98 or worse; I make no apologies for using modern tools; being able to use std::unique_ptr and RAII helps avoid leaks.

Note that I'm only using "copyright" to stop someone else from claiming the work and copyrighting it and causing problems for me; otherwise I'd make it public domain. I don't care what you do with the code, although attribution would be appreciated. I hope it is useful to someone out there.

Code review tools and SVNKit tricks

By David B. Robins tags: Tools, Bugs, Java Wednesday, January 21, 2015 11:44 EST (link)

For a few months now, with the expectation of adding a new developer to our embedded group (present population: me), I've been looking for a code review tool. Admittedly, I'd been spoiled by SmartBear Collaborator (formerly Code Collaborator, but they're pitching it as document review too) at Exacq/Tyco, which was teriffic but is also $500 per named user (which is not that bad, but while we're not in bad shape we're also not in "let's drop a few thousand on a development tool" mode). Before that, we did code review by email (not great) and at Microsoft a couple of the dev managers really liked reviews being in-person, so one would pack up a DPK (diff package), throw it on a share, and interrupt someone's flow to get a review.

The highest-recommended open source tool was Review Board, so I set it up on my Amazon EC2 server and monkeyed with it until it saw my SVN repository, and left it for a while. Then I tried using it for a review, and hoo boy, I know it's free (so's Linux), but it's all kinds of terrible. The command-line tools they required for pre-commit review didn't work, so we tried post-commit (in a branch) and while I could comment the comments were hard to see/open after creating, and I don't believe replies were supported. It also didn't support updating a changeset after review comments, something Collaborator did pretty seamlessly most of the time.

Another recommendation was Atlassian Crucible, which looked pretty good in their overview/screenshots, and we're already using a number of their OnDemand (now "Cloud") tools (JIRA bug database, Confluence wiki, HipChat) so I gave it a try. I first downloaded the Windows 64-bit version on my laptop, which installed easily (to give the Windows ecosystem credit, Windows installers are usually good), but it had trouble accessing my SVN repository. I poked around a little but since I was eventually going to host it on a Linux server anyway figured I'd eliminate Windows from the equation and installed the Linux version. I must say, I was very favorably impressed with how easy it was to install (and that it didn't require a "blessed" Linux distro, working fine on Arch); unlike Windows, Linux installers are hit and miss when they exist at all. In this case it was: unzip, make a data directory, run a script and tell it where to find the config, run a script to start the service, connect via web browser. However, I had the same SVN repo access issues.

My SVN repository is accessed via an SSH tunnel (svn+ssh protocol). Atlassian recommends using their provided jsvn script to analyze Subversion connectivity issues, so I tried that; same error. I cranked up the Java log level to "FINEST" and it showed me the internal protocol data, and at some point it gave me an SVN E1700001 "authorization failed" error. I cut and pasted the same commands to a local svnserve session I initiated with the same user, and—no error (same with the standard svn command-line utility). Logging svnserve is extremely hacky, but I set it up and saw that it was logging a different username than the SSH user—the client local user, in fact. I added the command-line parameters to the log file and saw it was passing a --tunnel-user argument. Why? Since it used SVNKit for Subversion access, I downloaded and built the current trunk and fired up jdb. I didn't find out why it sent the local username rather than the SSH username, but I did find an undocumented property (similar to the documented ones) that I could use to set the remote SVN username: (.username is the SSH user). Setting that let me in; in fact I didn't even need to set it in FISHEYE_OPTS since it seems to have been saved in the Fisheye/Crucible server user's ~/.subversion directory.

That made everything happy, and I could see something much like Trac's "timeline" view for commits, search the repository, etc.; I created a test review and it worked decently well except it doesn't have anything like Collaborator's defects. It does have a way to mark a comment as a defect, but there's no way to close a defect, just unmark it (and use a convention like inserting the text [RESOLVED]). The linked ticket is #6 in Fisheye/Crucible tickets by votes, and the one above is similar (checkbox for reviewer to acknowledge a comment); however, it's 2.5 years old (the other ticket is marked as duplicating it, and it's 7.5 years old), so I'm not getting my hopes up.

Since we do plan to get a license (using trial 30-day now), as it's only $10 each for Fisheye/Crucible for 1-5 users, and I believe that's perpetual and not recurring, and includes a year of support. I upgraded to PostgreSQL a few days after installing, since the built-in "HSQLDB" isn't supported outside evaluation. This was also super-painless following the migration page, a pleasant surprise.

Learning Android opens doors

By David B. Robins tags: Development, Mobile Tuesday, December 23, 2014 23:32 EST (link)

For a little while now I've been thinking I should learn a little about mobile development. The last time I considered it I was so put off by the thought of using Java that I gave up; but this time I gritted my teeth, installed Android Studio, and created a project. I went with a blank Activity, which I learned is sort of like the fundamental UI building block of an Android project, and used the visual editor to add some controls: some static text, a button (originally "Go!" but it later became "Unlock"), and a multiline text area to be used for status/logs. Later, a static image (of holly, for Christmas) was added, and I'm sure I violated all good positioning practices to get it where I wanted it without losing the button.

From there I started reading about the Android support for BLE (Bluetooth Low Energy) and added basic scan code to the onCreate handler and a handler for the button click event. Originally I connected on the button click event; eventually I did that in onCreate and had it send the authentication message and modified the button to (1) be enabled only when connected, and (2) send an unlock message to unlock the door when clicked. I also updated the read-only log area with certain events: seeing a door, including the remote address (take that, iOS!); connection; disconnection (disables the unlock button, too); and button unlock. Honestly, it wasn't much trouble at all; a fun little project: Java's a lot less painful than it once was.

Now obviously it's a proof of concept. There's buckets of things it doesn't handle that the real guest app does: it probably doesn't do backgrounding well; the UI isn't pretty; it doesn't handle log in to our cloud site; it's not very flexible about which doors it will unlock. Those are all easy to fix (and rather tedious); this is just "Look, I can unlock doors with Android!", and a "my first Android app". Am I likely to want to work in mobile development in future? No, I don't think so; but I can do it if pressed and if I see a need it's easier for me to whip up an app than it was before.

This experience also validates my preferred method of testing candidates: I use a problem that requires that they both consume and produce an interface; in the specific case for C and C++, the interface consumed is the Python C API (which tends to make people think it's a Python problem when it isn't). Learning and implementing new interfaces is a huge part of programming and when the skill is developed it allows for rapid expansion to new areas of programming. Even embedded programming is far more about new interfaces and constraints than it is hardware.

Fix one thing, break another

By David B. Robins tags: C++, Development, Embedded, Bugs Wednesday, November 19, 2014 15:58 EST (link)

I've been doing a bunch of work with the nRF51822's on-chip UART this week; I found an issue with the vendor's code and contributed a fix back. Their code has generally been good and I imagine closing down the UART while the other side is transmitting, and then expecting to be able to reopen and pick up where they left off, isn't something people usually do. I had suspected some random UART issues but hadn't isolated them yet, so I wrote a test that:

  1. sent an incrementing stream of bytes from one chip to the other and verified it arrived with no gaps;
  2. verified that the other side saw the same first number (had to make some RTS/CTS fixes there);
  3. then I went to full duplex;
  4. had one side shut down the UART randomly, wait, and re-enable and ensure nothing was lost (this is where I had to make the library code fix).

I also found an interesting case where a fix exposed a bug: I fixed a function that had no return value (which you'd expect GCC with -Wall to warn about, but it doesn't, at least if the function has branches):

bool FSendMessage(message::Header const &msg, size_t cb)
   if (!FAppendToQueue(msg, cb))
   return true;  // this was missing

elsewhere I had this code:

message::ShuttingDown msg;
APP_ERROR_CHECK_BOOL(FSendMessage(msg, sizeof(msg)));

   __WFE();  // "wait for events" instruction, basically uses less power than pure busy wait


APP_ERROR_CHECK_BOOL is basically an assert, and the while loop wasn't there before. When FSendMessage fell off the end of the function, it happened to return 0 and the assert fired. However, when it did that it went into an infinite loop (in the assert handler) and allowed the queued message to finish sending, and then since it was looping forever it behaved similarly to the chip being powered off: it stopped responding, so it seemed like it was working properly.

When I added the return true, the assert no longer fired, and (no while loop yet) Serial::Close shut down the serial port before the message finished sending. I added the while loop to wait for transmission to finish. (I could have put it into Serial::Close itself, but that wasn't the right thing; it's OK for it to shut down the port with data to send in the general case; it can resume later.)

/ previous entries /

Content on this site is licensed under a Creative Commons Attribution 3.0 License and is copyrighted by the authors.