Barnsley fern fractal

Thoughts on software architecture and development, and methods and techniques for improving the quality thereof.

David B. Robins (home)

Code Visions: Improving software quality
That's not a hash… that's an encoding!

By David B. Robins tags: Development, Hiring Wednesday, February 4, 2015 20:13 EST (link)

The Trello Android Developer job page requires applicants to solve a programming puzzle, the solution of which must be sent with an application. (In case the question changes or page moves, searching for the source string "acdegilmnoprstuw" or 680131659347, the "hash" of "leepadg" should bring up the question.) The question is presented as reversing the hash of a string, i.e., given a (numeric) hash value, producing the original string. For strong hashes used in the real-world, such as MD5 or SHA1, this is a difficult problem and almost certainly requires a degree of brute-forcing, i.e., enumerating all the possible strings (billions and billions…), putting them through the hash function, and checking if the result matches the given hash value. Doing it significantly more efficiently would make someone famous and (from generating Bitcoins alone) rich.

However, the Trello developer question is actually an encoding, and while that doesn't technically exclude it from being a hash, one desirable quality in a hash is that it's difficult to go from hash value to original string (or even a possible original string, accounting for collisions), whereas being able to do that is a necessary property of an encoding. (I expect the Trello problem writers know this, and want to filter brute force attempts, which are, given the smaller space than in-use hashes, possible.) Encoding vs. Encryption vs. Hashing is a well-written post that clarifies the differences. It is trivial for someone with the most basic understanding of algebra to reverse the encoding in the question (it's a "weeder" question; I'm certain they'll have harder ones to ask in an actual interview).

I'd refrain if others hadn't already posted some solutions on the web (since it's a pain to have your screening question's answer available for cutting and pasting, although a few quick live (phone or in person) questions to the candidate should suffice to detect ignorant copying), but here's a Perl one-liner that solves it:

perl -wle '$t=956446786872726; while($t) { unshift @a, $t % 37; $t -= $a[0]; $t /= 37; } shift @a; print map { substr("acdegilmnoprstuw", $_, 1) } @a'

Given that it's folded into one line, it won't win you any style points anyway. While the Android developer position doesn't specify a language, the Node.js developer opening also posted specifically requires JavaScript or CoffeeScript. And note that at least one of the answers posted online does present a brute force and ignorance "solution"; perhaps a trap for those that might ignorantly search and copy?

Wireless mouse problem

By David B. Robins tags: Tools, Bugs Wednesday, January 28, 2015 11:18 EST (link)

Just a minor observation today, but it was driving me nuts.

One day I found that my Microsoft wireless mouse was stalling and not responding after I stopped using it for a few minutes. I'd be typing, go to move the mouse again, and it would take a few seconds to respond, kind of stuttering and jumping around. It had never done this before, so I suspected a Windows update might have broken something, but when I searched I didn't see reports of similar problems.

It turns out that this time I had plugged the mouse wireless USB dongle into a "charging port" on my USB hub. Apparently it didn't like that, and it caused the observed behavior. Plugging it back into a regular port fixed it.

Buddy allocator

By David B. Robins tags: C++, Development, Embedded Saturday, January 24, 2015 09:32 EST (link)

Dynamic memory allocation is generally to be avoided on memory-constrained embedded devices, and we are very constrained: after the vendor's "soft device" we only get 6K of the total 16K RAM to use (a future chip upgrade will provide 32K, but we're months from getting it and will likely have to support the 16K chips for a while). However, we got to a point where it was necessary to allocate and deallocate buffers since we couldn't afford having them all exist statically at the same time. I had concerns about the built-in malloc being a first-fit allocator due to hard to predict fragmentation, so decided to go with a buddy allocator.

I have posted the source to GitHub with permission of my employer; it is buildable using SCons, and by default builds a project that runs a test suite using GoogleTest. Currently it is set to build for MinGW on Windows (patches to make more general welcome), and I build it using GNU tools for ARM embedded processors (GCC 4.8 and 4.9) on the device. The documentation there explains how to use in a project. I quote from the implementation notes:

The implementation sacrifices some speed to save memory by not using free lists; instead, a bitmap representing a contiguous array of the smallest block is used to track allocations, with 2 bits per smallest block (see above). The first (high) bit is set if the block is in-use; the second is set if it is the end of a logical block. For example, the initial state is 00 00 00 ... 00 01, meaning one free block the size of the whole arena. It can then be split recursively down to the minimum size by flagging the appropriate smallest-block as an end-block.

It does make use of C++11 features, and I appreciate some embedded projects may be stuck using ancient compilers stuck on C++98 or worse; I make no apologies for using modern tools; being able to use std::unique_ptr and RAII helps avoid leaks.

Note that I'm only using "copyright" to stop someone else from claiming the work and copyrighting it and causing problems for me; otherwise I'd make it public domain. I don't care what you do with the code, although attribution would be appreciated. I hope it is useful to someone out there.

Code review tools and SVNKit tricks

By David B. Robins tags: Tools, Bugs, Java Wednesday, January 21, 2015 11:44 EST (link)

For a few months now, with the expectation of adding a new developer to our embedded group (present population: me), I've been looking for a code review tool. Admittedly, I'd been spoiled by SmartBear Collaborator (formerly Code Collaborator, but they're pitching it as document review too) at Exacq/Tyco, which was teriffic but is also $500 per named user (which is not that bad, but while we're not in bad shape we're also not in "let's drop a few thousand on a development tool" mode). Before that, we did code review by email (not great) and at Microsoft a couple of the dev managers really liked reviews being in-person, so one would pack up a DPK (diff package), throw it on a share, and interrupt someone's flow to get a review.

The highest-recommended open source tool was Review Board, so I set it up on my Amazon EC2 server and monkeyed with it until it saw my SVN repository, and left it for a while. Then I tried using it for a review, and hoo boy, I know it's free (so's Linux), but it's all kinds of terrible. The command-line tools they required for pre-commit review didn't work, so we tried post-commit (in a branch) and while I could comment the comments were hard to see/open after creating, and I don't believe replies were supported. It also didn't support updating a changeset after review comments, something Collaborator did pretty seamlessly most of the time.

Another recommendation was Atlassian Crucible, which looked pretty good in their overview/screenshots, and we're already using a number of their OnDemand (now "Cloud") tools (JIRA bug database, Confluence wiki, HipChat) so I gave it a try. I first downloaded the Windows 64-bit version on my laptop, which installed easily (to give the Windows ecosystem credit, Windows installers are usually good), but it had trouble accessing my SVN repository. I poked around a little but since I was eventually going to host it on a Linux server anyway figured I'd eliminate Windows from the equation and installed the Linux version. I must say, I was very favorably impressed with how easy it was to install (and that it didn't require a "blessed" Linux distro, working fine on Arch); unlike Windows, Linux installers are hit and miss when they exist at all. In this case it was: unzip, make a data directory, run a script and tell it where to find the config, run a script to start the service, connect via web browser. However, I had the same SVN repo access issues.

My SVN repository is accessed via an SSH tunnel (svn+ssh protocol). Atlassian recommends using their provided jsvn script to analyze Subversion connectivity issues, so I tried that; same error. I cranked up the Java log level to "FINEST" and it showed me the internal protocol data, and at some point it gave me an SVN E1700001 "authorization failed" error. I cut and pasted the same commands to a local svnserve session I initiated with the same user, and—no error (same with the standard svn command-line utility). Logging svnserve is extremely hacky, but I set it up and saw that it was logging a different username than the SSH user—the client local user, in fact. I added the command-line parameters to the log file and saw it was passing a --tunnel-user argument. Why? Since it used SVNKit for Subversion access, I downloaded and built the current trunk and fired up jdb. I didn't find out why it sent the local username rather than the SSH username, but I did find an undocumented property (similar to the documented ones) that I could use to set the remote SVN username: (.username is the SSH user). Setting that let me in; in fact I didn't even need to set it in FISHEYE_OPTS since it seems to have been saved in the Fisheye/Crucible server user's ~/.subversion directory.

That made everything happy, and I could see something much like Trac's "timeline" view for commits, search the repository, etc.; I created a test review and it worked decently well except it doesn't have anything like Collaborator's defects. It does have a way to mark a comment as a defect, but there's no way to close a defect, just unmark it (and use a convention like inserting the text [RESOLVED]). The linked ticket is #6 in Fisheye/Crucible tickets by votes, and the one above is similar (checkbox for reviewer to acknowledge a comment); however, it's 2.5 years old (the other ticket is marked as duplicating it, and it's 7.5 years old), so I'm not getting my hopes up.

Since we do plan to get a license (using trial 30-day now), as it's only $10 each for Fisheye/Crucible for 1-5 users, and I believe that's perpetual and not recurring, and includes a year of support. I upgraded to PostgreSQL a few days after installing, since the built-in "HSQLDB" isn't supported outside evaluation. This was also super-painless following the migration page, a pleasant surprise.

Learning Android opens doors

By David B. Robins tags: Development, Mobile Tuesday, December 23, 2014 23:32 EST (link)

For a little while now I've been thinking I should learn a little about mobile development. The last time I considered it I was so put off by the thought of using Java that I gave up; but this time I gritted my teeth, installed Android Studio, and created a project. I went with a blank Activity, which I learned is sort of like the fundamental UI building block of an Android project, and used the visual editor to add some controls: some static text, a button (originally "Go!" but it later became "Unlock"), and a multiline text area to be used for status/logs. Later, a static image (of holly, for Christmas) was added, and I'm sure I violated all good positioning practices to get it where I wanted it without losing the button.

From there I started reading about the Android support for BLE (Bluetooth Low Energy) and added basic scan code to the onCreate handler and a handler for the button click event. Originally I connected on the button click event; eventually I did that in onCreate and had it send the authentication message and modified the button to (1) be enabled only when connected, and (2) send an unlock message to unlock the door when clicked. I also updated the read-only log area with certain events: seeing a door, including the remote address (take that, iOS!); connection; disconnection (disables the unlock button, too); and button unlock. Honestly, it wasn't much trouble at all; a fun little project: Java's a lot less painful than it once was.

Now obviously it's a proof of concept. There's buckets of things it doesn't handle that the real guest app does: it probably doesn't do backgrounding well; the UI isn't pretty; it doesn't handle log in to our cloud site; it's not very flexible about which doors it will unlock. Those are all easy to fix (and rather tedious); this is just "Look, I can unlock doors with Android!", and a "my first Android app". Am I likely to want to work in mobile development in future? No, I don't think so; but I can do it if pressed and if I see a need it's easier for me to whip up an app than it was before.

This experience also validates my preferred method of testing candidates: I use a problem that requires that they both consume and produce an interface; in the specific case for C and C++, the interface consumed is the Python C API (which tends to make people think it's a Python problem when it isn't). Learning and implementing new interfaces is a huge part of programming and when the skill is developed it allows for rapid expansion to new areas of programming. Even embedded programming is far more about new interfaces and constraints than it is hardware.

Fix one thing, break another

By David B. Robins tags: C++, Development, Embedded, Bugs Wednesday, November 19, 2014 15:58 EST (link)

I've been doing a bunch of work with the nRF51822's on-chip UART this week; I found an issue with the vendor's code and contributed a fix back. Their code has generally been good and I imagine closing down the UART while the other side is transmitting, and then expecting to be able to reopen and pick up where they left off, isn't something people usually do. I had suspected some random UART issues but hadn't isolated them yet, so I wrote a test that:

  1. sent an incrementing stream of bytes from one chip to the other and verified it arrived with no gaps;
  2. verified that the other side saw the same first number (had to make some RTS/CTS fixes there);
  3. then I went to full duplex;
  4. had one side shut down the UART randomly, wait, and re-enable and ensure nothing was lost (this is where I had to make the library code fix).

I also found an interesting case where a fix exposed a bug: I fixed a function that had no return value (which you'd expect GCC with -Wall to warn about, but it doesn't, at least if the function has branches):

bool FSendMessage(message::Header const &msg, size_t cb)
   if (!FAppendToQueue(msg, cb))
   return true;  // this was missing

elsewhere I had this code:

message::ShuttingDown msg;
APP_ERROR_CHECK_BOOL(FSendMessage(msg, sizeof(msg)));

   __WFE();  // "wait for events" instruction, basically uses less power than pure busy wait


APP_ERROR_CHECK_BOOL is basically an assert, and the while loop wasn't there before. When FSendMessage fell off the end of the function, it happened to return 0 and the assert fired. However, when it did that it went into an infinite loop (in the assert handler) and allowed the queued message to finish sending, and then since it was looping forever it behaved similarly to the chip being powered off: it stopped responding, so it seemed like it was working properly.

When I added the return true, the assert no longer fired, and (no while loop yet) Serial::Close shut down the serial port before the message finished sending. I added the while loop to wait for transmission to finish. (I could have put it into Serial::Close itself, but that wasn't the right thing; it's OK for it to shut down the port with data to send in the general case; it can resume later.)

Resurrection on demand

By David B. Robins tags: C++, Development, Architecture, Embedded Sunday, July 27, 2014 17:07 EST (link)

It's been a few months since I've written an entry here, and that's mainly due to the job change to Senior Development Manager at Yikes. But I've been writing a whole lot of C++ and Python using technologies and techniques new to me and have plenty to write about on sundry topics.

Let me first describe the system we're building here at Yikes (and by "here" I mean scattered all over North America). The near-term goal is a hotel check-in system whereby a guest can bypass the front desk, go directly to their room, and unlock it with their phone (I know, I know, first-world problems); longer-term, the plan is to expand into the proximity space generally. To make this system work there are several components (the existence of none of which is secret in itself, just don't expect to see any schematics; but you didn't, did you?): a mobile phone app, a cloud API/database server, some on-property ("property" is jargon for "hotel" here) small form-factor computers, and a device with radios that resides in each door and controls the lock. It is on this device we are going to focus, since although I've been architecting various parts of the system it is where I've been writing code.

This device consists of a custom board with two chips which each have Bluetooth (Low Energy or "Smart" 4.0) connectivity responsibilities: one talks to the other hotel systems, and the other chip talks to a guest's phone and the door lock, unlocking it if the right authentication is received.

The device runs on a battery (as well as the physical difficulty of getting a power cable into a door, there are fire and liability concerns). Obviously, the longer this battery lasts the better (due to both cost and the inconvenience of staff having to replace it). A measurement a month or so ago showed it only lasting two days, when we're hoping for it to last more along the lines of 18 months+; but I have reason to believe that measurement is spurious and am trying to get details of its circumstances, since I've been running my development device 8-10+ hours a day for weeks off the same battery.

Nonetheless, wherever it's at now if we can do a little work for a large return in power-saving, we should do it. And it struck me that while there are reasons the first chip needs to run continually (it has to scan for advertisements and keep a clock, which it already does in a low-power mode), if the second chip has no room authorizations pending (phones to authenticate and then allow in if matched), it can shut down completely. I read through the relevant sections of the (Nordic) API and the nRF51 chip manual, checked that we could pin-reset the second chip from the first, and started experimenting.

To give the chip a chance to finish current operations, the per-second "tick" handler checks if anything is pending and then invokes the API go to to System OFF mode, sd_system_power_off. I later realized that the first chip would need to know the other chip was shut down (so it didn't offer needless resets and so it could sync communications), so using an existing messaging system I had the second chip send a message first. The continually-running chip can set a flag and then knows that when it next needs to send a message to the other, it needs to send a reset and wait for boot-up.

Obviously… it's critical that the second chip check for messages on wake-up, and process them, before it finds it's not busy and decides to sleep again!

We have employed the services of a contractor to do proper power profiling for the device in various modes; it may turn out, for example, that booting up is expensive and usage patterns show that it makes sense to wait for, say, 2 minutes of no activity before shutting down rather than doing it immediately when not busy. As another interesting note, we had been building the code (using Keil's ARMCC) at debug optimization (-O0); I added a debug flag to our build (we use SCons); enables optimization (-O2): code size dropped about 30% each side. It remains to be evaluated what effect this has on power consumption. At some point I may try building with GCC, but I've heard that Keil does much better and we're still well under the free code size limit (32k) so it's not a high priority.

Generating GSSAPI tokens from arbitrary Kerberos logins

By David B. Robins tags: Development, Unix Saturday, April 12, 2014 23:22 EST (link)

Forums such as StackOverflow are very helpful for providing solutions and code samples for shallow issues; but when one needs to dig a little deeper, they tend not to have answers. Of course, this is why they pay us the big bucks: to do the required research to make all the various libraries play nicely together. This is such a story.

The problem: given a username, an associated Kerberos domain (e.g., a Windows/Active Directory/LDAP domain), and a password, obtain a GSSAPI token suitable for logging in to a service. The service will use gss_accept_sec_context (Linux/Mac, hence "Unix", GSSAPI) or (Windows SSPI) AcceptSecurityContext to validate the token. Existing code was able to fetch a token for an arbitrary username and password using the Windows APIs, which made it easier, but GSSAPI didn't support it directly so instead only "single sign-on" using credentials from a cache (set via kinit) would be supported on Linux/Mac. The accept code, however, would accept any valid token, if we could figure out a way to generate it.

Another dev made a first attempt at generating a token, going to the MIT Kereberos library (libkrb5) directly since GSSAPI didn't expose usable APIs. He did get Kerberos to cough up a token using krb5_get_init_creds_password (and the obvious supporting functions to create a context, parse the server name, etc.) but the server, using the API functions mentioned above, didn't accept it: Windows servers said SEC_E_INSUFFICIENT_MEMORY. He passed the problem on to me, and I got the same error. I made sure my Linux client machine was joined to the domain (using tools called "Centrify"); same result. I played around with different Kerberos calls, e.g., setting up a credentials context with krb5_init_creds_init and adding the server principal and password separately; no dice.

Eventually I asked if there were Linux servers among the servers that were configured for Kerberos logins; and it turned out there were. They coughed up a slightly different error, saying I had an invalid token. Well, that made more sense than E_INSUFFICIENT_MEMORY and might allow for some progress. I set up a stub program on the same machine as the server (wanting to vary as little as possible), using the same token validation code, and unsurprisingly it gave the same error: but it was a lot easier for me to build and link it with a debug version of libkrb5 that I could step into with gdb (our server's watchdog also likes to reboot the machine if one halts in the debugger too long; that can be disabled, but it's an obstacle). Being able to step into libkrb5, I could see it was rejecting the token because it didn't begin with the magic value 0x60, and I tracked down a function that could properly encapsulate the token with the required RFC 4121 framing: gss_encapsulate_token, by dint of grepping around the source. This got me further, but unforutnately gss_encapsulate_token purposely left off a field called the "token type" which was necessary, so I had to write my own version that passed the correct type to the internal wrapper, g_make_token_header, to make another internal function, g_verify_token_header, happy.

That got me further, but a function that decoded the payload, which was supposed to be ASN.1 DER formatted, threw back an error, since my payload was the raw Kerberos credentials. I poked around the source some more, and found something that looked like it might wrap things up properly, and even a relevant-seeming public interface: krb5_mk_req_extended. I still had some #if'd branched versions of earlier attempts; but it turned out that the krb5_init_creds_init version, with krb5_init_creds_set_password and krb5_init_creds_set_service, was the ticket, so to speak. I pared the code down to what was absolutely required, removing unnecessary calls and experimental branches, and passed it back.

I'll also be integrating it into the shared authentication code, now that the "magic spell" is known. It was certainly a great help to have the Kerberos source, given that the (protocol) documentation is so sparse and generally unhelpful. Basically, it was a framing issue: the raw Kerberos token needed to be wrapped in an appropriate ASN.1 description and then that in a GSSAPI header. That perhaps explains why the Windows implementation complained of insufficient memory: intepreting the credentials as a GSSAPI token might have meant interpreting some credentials data as length fields, which may have contained invalid data making the length appear impossibly long (hopefully SSPI just rejected it based on the token length rather than trying to actually allocate memory).

Spooky action at a distance Python bug

By David B. Robins tags: Python, Bugs Wednesday, March 12, 2014 21:56 EST (link)

This was a fun one: when used with a new kind of camera, our web service crashed, and only on Linux. Technically, not a new camera, but a new library (provided by the manufacturer) that "de-warps" a fisheye (very wide-angle) lens using software, straightening out lines curved by the lens and providing (in some ways) a more natural view. This is part of a feature known as "digital PTZ" (PTZ = Pan, Tilt, and Zoom), differentiated from physical PTZ which allows for physically moving cameras to alter their viewing angle by sending commands to internal motors.

The Exacq web service (link goes to our demo server) makes use of the Exacq SDK (evAPI), which I am responsible for; and it is the API code that wraps the manufacturer-specific dewarp libraries and invokes them for the appropriate cameras when requested. Thus, the web team asked us to investigate, since the crash happened within an API invocation. It took a bit of time to get setup to debug the web server, by attaching, but I eventually figured that out. I asked for, and the web team provided, a standalone Python repro, which made investigation easier.

Early on I figured out it had something to do with the Python ctypes built-in library, which makes it possible to invoke functions in C libraries (our API is natively a C library, provided as a DLL or shared object). The Python 2.6 from the Ubuntu "deadsnakes" PPA (my local Python was 2.7) did not repro the bug, and nor did 2.7; and it was possible to just switch out to make it go away. So I started looking in later Python 2.6 releases than the web server used to see if there was a fix. No dice: and with the Python 2.6 versions I built myself, the issue persisted. I tried using strace to see what the working Python did differently, and nm to see if there were any important different imports or exports, but got nowhere.

I went back to the debugger, gdb: when I had first looked at the crash location, it had looked like it was accessing unallocated memory below the current stack (via info proc mappings); but either I had calculated wrongly or miscopied something, because looking later showed the destination of the instruction (movapd %xmm0, 0(%eax)) as valid memory (verified by checking the process memory map, or p *(int *)$eax, although that's just the first 4 bytes of it). I ran the succeeding and failing cases in separate (GNU screen) windows, again. I wrote down the addresses being accessed, and a light came on: the SSE XMM registers are large: what are their alignment requirements?

And so it was: the SIGSEGV failure, coming from a processor general protection fault, was indeed an alignment error: the instruction required a 16-byte aligned memory location, and it wasn't. I had even gone to the trouble earlier of creating a Python C extension that called the same function, bypassing ctypes, which succeeded: and when it did, so did the call via ctypes (it fetched the dewarped image), most likely because the first call, judging by the names of the internal functions (we had symbols but not source), initialized the library by populating some tables that only needed to be set up once. Once I knew what to look for, I found the Python bug, which had only been fixed in 2.7 and 3.2+.

So, lessons? Check the reason for a segment violation (available in $_siginfo in gdb, although it might not have been helpful since alignment errors are expected as SIGBUS), and double-check diagnoses, like the initial belief that the memory address was bad. Isolate the fault as best possible (I found it was ctypes fairly quickly).

We resolved to leave this dewarping library, which was one of several others already in use, out of the Linux release until the web service upgraded to Python 2.7, which was already planned, and possibly issue a mid-cycle release update (outside the normal quarterly channel) if there is sufficient business justification (read: people willing to give us money for it).

All aboard the D-Bus!

By David B. Robins tags: Development, Python Tuesday, January 28, 2014 19:17 EST (link)

I'm sure the title's been repeated numerous times at this point all across the Interwebs: but this is my site, and I like it, so we're going with it.

A little while ago, I was looking at Celery, an interface to distributed message queues. But after ruminating on it for months and not making much progress, I have to confess that queues are a bad fit for the kind of RPC I was looking to do (COM would be a good fit if I were developing on Windows). I decided to give D-Bus a shot, using Python, since the service I wanted to share is in Python. An actually decent Python DBus Tutorial actually lived up to the name, and I wrote a server around it that wrapped some functions in a module I already had.

The reason I wanted to run the module as a service was that it cached some data that I didn't want to continually reload (even from a file). So I provided a simple interface (the details aren't terribly important, but it takes an approximate name and returns the best good match and a related id on success (a tuple), None on failure. It uses Levenshtein distance, or, rather, a ratio (to length) to avoid certain pathological results. To allow for auto-updates, when it is run it looks for an existing object of its kind on the bus and sends a quit request.

To start the local bus for a user, I used a solution which I can't find now that stores connection settings (DBUS_SESSION_BUS_ADDRESS and DBUS_SESSION_BUS_PID) in ~/.dbus_settings, stored there or executed by .bashrc. I then hooked up the new service to a frequently run program that usually had to hit the Internet and do a slow and inaccurate search; worked great; accuracy and time both improved. That isn't everything I wanted to do, but making decent progress after so long a disinterested hiatus at least indicates to me that this technology is a better fit for the problem rather than the queues rathole.

/ previous entries /

Content on this site is licensed under a Creative Commons Attribution 3.0 License and is copyrighted by the authors.