Statistically makes sense

Gerald Sussman, an MIT professor and a co-author of “Structure and Interpretation of Computer Programs” (SICP), once said that programming today is “more like science. You grab this piece of library and you poke at it. You write programs that poke it and see what it does. And you say, ‘Can I tweak it to do the thing I want?’” This quote, occasionally thrown in to emphasize the predicament of modern software development practices (I’m also guilty of this), refers to MIT’s decision to stop teaching the famous SICP course. Even now, I can picture Sussman talking to the interviewer: his voice a little sour, a wry smile fading on a tired, wrinkled face.

Admittedly, this is just my imagination reaching for a convenient stereotype of a disenchanted industry pioneer. “Yes! I’ve been there. I, too, have poked at some obscure code until it sort of worked, then moved on, but the shame follows me everywhere to this day.” While I might never know how Sussman really felt about present-day software development practices, my recent work experience made me re-evaluate this quote. Instead of a displeased remark, I choose to see it as a piece of advice.

In programming, there are few things more empowering than working on a codebase that fits in your head, ideally one of your own authorship. This is how most people get into programming: writing small programs, then bigger programs, then gluing those together with abstractions. This gets you far, and it works almost as well in small teams with several developers. But keep adding people, and suddenly there are competing abstractions, imperfect information, and so much code that any improvement you have in mind takes months or years to adopt, hurting consistency in the meantime. The skeptic’s view is “don’t write large programs”, “avoid big teams”, “slow down, review, and refactor.” I admit I find this lofty attitude appealing, though in practice we don’t always have control over every codebase we have to change. That likely means senior engineers will always need tools and skills for dealing with complexity.

Recently, I had to get up to speed with a large (more like gargantuan) codebase in an unfamiliar domain, and despite all my experience with the tech stack, despite using the most advanced Ruby IDE, at first I was paralyzed by the overwhelming complexity of that code. There was also the typical pressure to show progress immediately. Of course, it would have been straightforward enough to identify what needed to change, make a quick local fix that ignores the established patterns, and hope the review practices were lax enough to let it through. I, on the other hand, wanted the new code to feel like an organic extension of the existing “structure”. Unfortunately, that kind of integration requires a deep understanding of the domain, the established patterns, and the “house style”. I tried to dig around in my IDE, but at the end of the day I still had more questions than answers. What are the common naming patterns? What base class should I inherit from? Which mixins can I include?

This reminded me of my experiments with machine learning algorithms, where you have an enormous dataset that is impossible to comprehend by studying individual rows. But look at the distributions of the numerical fields, aggregate the categorical data, run some correlation tests, and suddenly you have a pretty good idea of what you’re dealing with. Then, the next day, while looking at some twenty classes that shared an expansive, loosely defined interface, I had a eureka moment. I had been trying to manually build a list of all their method names in a scratch buffer, constantly losing focus and hoping somebody would ping me in Slack and save me from the mind torture. To turn things around, I decided to channel my inner Nick Canzoneri, fired up the terminal, and started to write a chain of commands that began with grep:

$ grep --no-filename --only-matching "def [^\(]*" *
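
Run against the directory with those twenty classes, it prints one line per method definition, something like:

def invoke
def result
def build_description
def invoke
def calculate_score
...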

That’s a full list of the methods defined by these classes. Now aggregate, and voilà: you know which methods are implemented by nearly every class, and which are shared by only a few or are altogether unique.

$ grep --no-filename --only-matching "def [^\(]*" * | sort | uniq -c | sort --numeric-sort --reverse

  15 def invoke
  15 def result
  12 def build_description
  12 def calculate_score
   2 def alive?
   1 def dead?

By itself, this example is simplistic, and I wouldn’t be surprised if some IDE already does this for some programming language. Still, it does a good job of illustrating the method: make a hypothesis, then write a command to prove or disprove it. In the past two months, I’ve had a handful of chances to use it, and every time it helped me increase my confidence and refine my solutions.
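
For instance, to answer the mixin question from earlier, the same pipeline works with a different pattern (the module names below are made up for illustration):

$ grep --no-filename --only-matching "include [A-Z][A-Za-z:]*" * | sort | uniq -c | sort --numeric-sort --reverse

  18 include Comparable
   7 include Enumerable
   2 include Auditable

One caveat: a bare grep like this will happily match “include” inside comments and strings, so treat the counts as a quick signal rather than gospel.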

Fighting complexity is crucial for any software development team that wishes to succeed long-term. When possible, complex parts should be eliminated (or hidden away). Still, making sense of difficult, tangled code remains a useful skill for professional software developers. All I’m saying is: next time you find yourself lost in the depths of some unfriendly codebase, remember that you can improve your understanding by running some simple statistical experiments from the command line.