Atomic symlinks

Abstraction is a form of simplification at a cost of information loss. This is especially true for computer systems. Even the most ubiquitous abstractions: file systems, automatic garbage collection, or virtual memory, all take their toll. No abstraction is capable of overcoming the underlying limitations, but it easily obscures them, becoming a possible source of error. When that happens, the programmer’s job is to debug the abstraction, untangling it knot by knot, until the abstracted implementation is revealed. Not too long ago, while working on a bug fix for DeployBot, I had to grapple with an abstraction supported by most modern operating systems: symbolic links. It started out as a simple task but ultimately led me to improve my understanding of the subject.

A symbolic link is a special kind of file that contains a symbolic reference to another file (in other words, a file path). The OS then ensures that the referenced path is used any time the symlink is opened. A symbolic link has to be distinguished from a hard link, which is a directory entry that associates the name of the file with its metadata. The latter is normally stored in the inode, but more about that below. More importantly, the purpose of both symbolic and hard links is to abstract the file location from the user. Being an application programmer, I’m used to thinking about symbolic links as “file references”, and hard links as “just files”. This simplified mental model may seem harmless at first, but I’ll demonstrate how it can become problematic as soon as the concealed underlying limitations come into play.

1 operation ≠ atomicity

The bug in question was discovered in the shell script used for DeployBot’s atomic deployments. If you’re not familiar with the term, it’s a method of application deployment that keeps a number of the most recent application versions on the target server and atomically switches between them by updating a symbolic link. The target of the link determines the current version. This technique was popularized by Capistrano and enables zero-downtime deployments and easy rollbacks with no inconsistent state.

A customer pointed out that the symlink update operation wasn’t, in fact, atomic. After taking a look, I had to agree. The script was running the following commands:

rm -f $CURRENT
ln -s $RELEASE $CURRENT

Tip. If you have trouble remembering the order of arguments to ln, it’s consistent with cp and mv.

Most of the DeployBot users are developers themselves, so bug reports often come with a suggested solution. This one wasn’t an exception. Still surprised that the issue had gone unnoticed for so long, I decided to try the proposed ln -f. I fired up the terminal, created two nonce files foo and bar and ran ln -s foo buz followed by ln -sf bar buz (analogous to mv -f or rm -f). It seemed to work. There was a link buz pointing to bar, and I didn’t even have to use rm.

.
├── bar
├── buz -> bar
└── foo

Was this operation atomic? Updating a hard link with mv -f normally is, so it’s easy to assume that ln -f shouldn’t be an exception. If I’d been too entrenched in my mental model of symlinks, I would have stopped right at this step. After all, we‘re “just updating a file reference”. Indeed, if you search “update a symlink” on StackOverflow and similar resources, you’ll find no shortage of answers suggesting ln -f. In some contexts, it might be an acceptable solution, but any experienced programmer knows that optimistic assumptions and relying on things they don’t understand will eventually come back and haunt you.

First, I consulted the man page of ln. ln is among the utilities with options that differ significantly between platforms. The -f option is supported by both GNU/Linux and Mac (BSD).

GNU/Linux

     -f, --force
            remove existing destination files

Mac

     -f    If the target file already exists, then unlink it so that the link
           may occur.  (The -f option overrides any previous -i options.)

The man page stated that the destination file is removed. Still reluctant to leave my comfortable application programmer mindset, I turned to strace. Strace prints system calls executed by the specified program.

$ strace -e file ln -sf bar buz

execve("/bin/ln", ["ln", "-sf", "bar", "buz"], [/* 36 vars */]) = 0
…
stat("buz", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lstat("buz", {st_mode=S_IFLNK|0777, st_size=3, ...}) = 0
stat("bar", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
symlink("bar", "buz")                   = -1 EEXIST (File exists)
unlink("buz")                           = 0
symlink("bar", "buz")                   = 0
+++ exited with 0 +++

It was undeniable, ln -sf translates into two separate system calls: unlink(2) and symlink(2). The former deletes a hard link (file), and the latter creates a symlink. But is there an atomic call to update a symlink? Before I try to answer this question, let me go into more detail about the mechanism of symbolic links.

It’s all inodes

Above, I defined a symlink as “a special kind of file that contains a symbolic reference to another file”. The OS determines whether a file is a symlink by reading its inode, which is linked via the file name (remember, hard links?) from the directory entry. You can view inode numbers of all files in a directory by running ls -i.

$ ls -il

total 0
661168 -rw-rw-r-- 1 artem artem 0 Feb 14 16:04 bar
661169 lrwxrwxrwx 1 artem artem 3 Feb 14 16:05 buz -> bar
661167 -rw-rw-r-- 1 artem artem 0 Feb 14 16:04 foo

Some common information stored in the file’s inode can be viewed by using the standard utility stat.

$ stat buz

  File: 'buz' -> 'bar'
  Size: 3               Blocks: 0          IO Block: 4096   symbolic link
Device: 801h/2049d      Inode: 661169      Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1000/   artem)   Gid: ( 1000/   artem)
Access: 2017-02-14 16:13:17.122017000 -0500
Modify: 2017-02-14 16:05:04.610017000 -0500
Change: 2017-02-14 16:05:04.610017000 -0500
 Birth: -

Here, the 3-byte size corresponds to the length of the linked file name (bar), and the “Links” counter shows the number of hard links pointing to the file. This counter is incremented by each link syscall and decremented by unlink. Once there are no links left, the file is considered removed and its inode and blocks can be recycled.

Based on this knowledge, one can assume that updating a symlink would entail simply overwriting the target destination. Unfortunately, this is not as easy as it sounds. Even if a symlink was stored in the blocks listed by its inode, the operating system proxies any open(2) calls to the destination. Not only that, but it isn’t usually stored this way. More often it’s attached as an extra field to the inode. We can see that by viewing the symlink’s inode in debugfs on GNU/Linux, which displays a more detailed information than stat.

# debugfs

debugfs 1.42.13 (17-May-2015)
debugfs:  open /dev/sda1
debugfs:  stat <661169>

Inode: 661169   Type: symlink    Mode:  0777   Flags: 0x0
Generation: 1368090758    Version: 0x00000000:00000001
User:  1000   Group:  1000   Size: 3
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x58a37100:91707ba0 -- Tue Feb 14 16:05:04 2017
 atime: 0x58a372ed:1d1753a0 -- Tue Feb 14 16:13:17 2017
 mtime: 0x58a37100:91707ba0 -- Tue Feb 14 16:05:04 2017
crtime: 0x58a37100:91707ba0 -- Tue Feb 14 16:05:04 2017
Size of extra inode fields: 32
Fast_link_dest: bar

Note the Fast_link_dest field in the output below. “Fast” refers to symbolic links with the destination path stored in their inode. However, because the symbolic link’s destination can be as long as 4096 characters, the file system might decide to fall back to using a “slow link”, which puts the target path in a file. In this case, retrieving the link’s value becomes more expensive in terms of I/O.

At this point, it should be clear that the OS holds a monopoly on any operations on symlinks and doesn’t provide a system call to modify one. All is not lost, though. As we learned above, symlinks are linked by hard links and for those there is the rename(2) syscall. Thus, if we create a new symlink and “replace” the old one with it, this operation should be atomic. Let’s verify this.

$ ln -s bar newbuz
$ tree

.
├── bar
├── buz -> foo
├── foo
└── newbuz -> bar

$ strace -e file mv newbuz buz

execve("/bin/mv", ["mv", "newbuz", "buz"], [/* 38 vars */]) = 0
…
stat("buz", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lstat("newbuz", {st_mode=S_IFLNK|0777, st_size=3, ...}) = 0
lstat("buz", {st_mode=S_IFLNK|0777, st_size=3, ...}) = 0
rename("newbuz", "buz")                 = 0
+++ exited with 0 +++

This worked exactly as anticipated. Switching a symlink this way requires just a single syscall.

“The Capistrano Trick”

One problem remained. Throughout the examples above, I was working with symlinks to files. However, for atomic deployments, one needs to update the symlinks pointing to directories. It isn’t immediately obvious, but mv -s newcurrent current will just move newcurrent to current/newcurrent if current is a symlink to a directory. This is where the inconsistent command-line utilities’ options come to the fore once again. The GNU/Linux version of mv has a designated -T flag that makes mv treat the destination as a regular file.

Linux

       -T, --no-target-directory
              treat DEST as a normal file

Unfortunately, there’s no corresponding flag on BSD/Mac. DeployBot aims to work with most UNIX-like environments, so this could possibly cause some complications. Fortunately, this problem has already been solved by Capistrano, which uses an ingenious trick: they create a new symlink in a subdirectory and then move it via a relative path.

To demonstrate how it works, let’s first create a Capistrano-like directory structure.

$ mkdir -p releases releases/1 releases/2

$ ln -s `realpath releases`/1 current

$ tree
.
├── current -> /tmp/test/releases/1
└── releases
    ├── 1
    └── 2
4 directories, 0 files

And here’s how Capistrano updates the current symlink.

$ ln -s `realpath releases`/2 releases/current

$ tree
.
├── current -> /tmp/test/releases/1
└── releases
    ├── 1
    ├── 2
    └── current -> /tmp/test/releases/2

5 directories, 0 files

$ mv -f releases/current .

$ tree
.
├── current -> /tmp/test/releases/2
└── releases
    ├── 1
    └── 2
4 directories, 0 files

This approach still relies on rename(2) to update symlinks, but works equally well on most UNIX systems.

Verifying the theory

To gather more empirical proof, I performed several synthetic tests. For that purpose, I wrote a simple Ruby program that reads a given file’s mtime at a specified frequency (rps) and a Bash script that performs 2000 link swap operations non-stop while the Ruby program is running. I experimented with different configurations, including reducing I/O priority of the symlink update operations with ionice, but that didn’t change the overall figures. The table below shows the percent loss of reads under different workloads.

Method 10s, 10 rps 10s, 100 rps 10s, 1000 rps 5s, 1000rps
rm; ln -s 18.0% 15.8% 16.9% 33.2%
ln -sf 0.0% 0.1% 0.4% 0.5%
mv 0.0% 0.0% 0.0% 0.0%

While it’s unlikely that anybody is going to conduct 2000 deployments within a 10 second interval under a steady stream of reads, it was reassuring seeing that the mv theory survived the experiment.

Recap

So, symbolic links are “file references”, and hard links are “just files”. What’s wrong with that? For one, symbolic links are files, and thus they, too, are linked via hard links. This is pretty obvious when you think about it, but I’ve never heard anyone say that aloud. I even asked my Twitter friends the following question (please, please don’t use it to torture candidates at a job interview):

The answer is, of course, both. I received several replies, but they either ignored the trick nature of the question or assumed that I can’t read man pages. I wouldn’t draw any far-reaching conclusions from this joke, but — all things considered — I do get a feeling that at least some programmers don’t think about files in directories as hard links by default (I don’t).

One may argue that already abstract concepts shouldn’t be further simplified. However, this ignores the subliminal nature of their emergence. In a way, they are just a verbal approximation of our intuition within the domain. We might be able to abstain from explaining monads as burritos to others (because why would you? It’s ridiculous!), but we can’t dictate to the brain how the formal abstraction should be interpreted. And if we’re unlikely to change the way our brains work, the best we can do is test our conclusions thoroughly.

Rely only on reliable things. Don’t depend on accidents or assumptions. If you can’t tell the difference in particular circumstances, assume the worst.

— “The Pragmatic Programmer: From Journeyman to Master”, Andrew Hunt, David Thomas