Ensuring That a Linux Program Is Running at Most Once by Using Abstract Sockets

It is often useful to have a way of ensuring that a program is running at most once (e.g. a system daemon or Cron job). Unfortunately, most commonly used solutions are not without problems. In this post, I show a simple, reliable, Linux-only solution that utilizes Unix domain sockets and the abstract socket namespace. The post includes a sample implementation in the Rust programming language.

Introduction

When you have a daemon or a program that is periodically run via Cron, it is often desirable to ensure that at most one instance of the program is running at any given time. For example, having two web servers running on the same port would certainly result in a clash. Or, consider a Cron job that regularly runs a program to process new data. Sometimes the processing may take more time than expected, in which case there might accidentally be two running instances handling the same data. This may result in duplication or corruption of data.

There are multiple solutions to this problem. Arguably, the most common one is to use so-called PID files. When a program starts, it writes its process identifier (PID) into a chosen file. Before it exits, it removes the file. The single-instance test is then done by checking for the presence of the file. However, the mere presence of the file does not guarantee that the program is running. Indeed, the program might have been killed without a chance to do a proper cleanup. To solve this, one might consider checking whether a program with the PID stored in the file is running. Unfortunately, PIDs may be recycled, so by then, a totally different application might be running under the stored PID. Other problems with this method include race conditions when manipulating the file (what if two processes simultaneously find out that the PID file does not exist?) and the need to use a file in the filesystem.
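To make the stale-file problem concrete, here is a minimal sketch using only the Rust standard library. The function name `try_claim_pid_file` and the file location are hypothetical, chosen for illustration. Creating the file with `create_new(true)` closes the check-then-create race, because the existence check and the creation happen in one atomic step, but it does nothing about a file left behind by a killed process:

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Tries to claim the PID file at `path` (hypothetical helper).
/// Returns Ok(true) if we created the file, Ok(false) if it already existed.
fn try_claim_pid_file(path: &Path) -> io::Result<bool> {
    // create_new(true) fails if the file exists, so the check and the
    // creation cannot be interleaved with another process.
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            writeln!(file, "{}", std::process::id())?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // Assumed location, for the demo only.
    let path = std::env::temp_dir().join(format!("pidfile-demo-{}.pid", std::process::id()));

    println!("first claim:  {}", try_claim_pid_file(&path)?);
    // Nothing removed the file in between, which is exactly the situation
    // after a crash: the claim fails although no other instance is running.
    println!("second claim: {}", try_claim_pid_file(&path)?);

    fs::remove_file(&path)?;
    Ok(())
}
```

The second claim fails for as long as the file exists, regardless of whether its creator is still alive.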

The race condition in the above approach can be alleviated by using file locking (e.g. by calling fcntl(), flock(), or lockf()). A neat thing is that file locks are automatically released when your process dies for any reason. Nevertheless, file locks are not a panacea as they have known issues (1, 2). Also, they require the use of files in a filesystem. What if the program is always executed in a chroot jail (sandbox), with no or only very limited access to the filesystem?

Luckily, there are also solutions that do not rely on files. One of them is the use of semaphores. However, POSIX named semaphores have kernel persistence (see the Persistence section in the overview). So, if your process dies before explicitly calling sem_unlink(), you will be unable to run the program again until you manually remove the semaphore (on Linux, named semaphores appear as entries under /dev/shm) or reboot.

Another idea is to utilize sockets. One can create a socket and bind it to a port. After that, when another process tries to bind a socket to the same port, it will fail. The problem with this approach is that there is only a limited number of ports available. Thus, it may easily happen that another application using the same port as your program will prevent it from starting.
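As a rough illustration of the port-based approach, the sketch below uses `std::net::TcpListener`. The helper name `try_lock_port` is hypothetical, and the demo asks the kernel for a free port (port 0) only so that it does not collide with anything on your machine; a real program would have to hard-code a port, which is precisely the weakness described above.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};

/// Tries to take the "lock" by binding a listener to a loopback port
/// (hypothetical helper). Returns Ok(None) when the port is already in use.
fn try_lock_port(port: u16) -> io::Result<Option<TcpListener>> {
    match TcpListener::bind((Ipv4Addr::LOCALHOST, port)) {
        Ok(listener) => Ok(Some(listener)),
        Err(e) if e.kind() == io::ErrorKind::AddrInUse => Ok(None),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // Port 0 lets the kernel pick any free port (demo only).
    let first = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = first.local_addr()?.port();

    // While `first` is alive, nobody else can bind the same port,
    // no matter whether it is our program or an unrelated one.
    let second = try_lock_port(port)?;
    println!("second bind succeeded: {}", second.is_some());
    Ok(())
}
```

Note that the check cannot tell a second instance of your program apart from an unrelated application that happens to use the same port.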

What we would like is a simple, reliable solution that does not require the use of files. And, fortunately, there is one (well, at least on Linux).

Unix Domain Sockets and the Abstract Namespace

Unix domain sockets are sockets created by passing AF_UNIX (or its alias AF_LOCAL) as the domain (address family) when calling socket(). They are useful for exchanging data between processes running on the same host. Ordinarily, Unix domain sockets are bound to files. So, one process creates a socket, binds it to a file somewhere in the filesystem, and waits for connections. However, this brings us back to files, which we deemed undesirable.

Luckily, Unix domain sockets support another addressing mode, via the so-called abstract namespace. This allows us to bind sockets to names rather than to files. What we get are the following benefits (see page 1175 in The Linux Programming Interface):

  • There is no need to worry about possible collisions with existing files in the filesystem.
  • There is no socket file to be removed upon program termination.
  • We do not need to create a file for the socket at all. This removes the need to ensure that a target directory exists and to check permissions, and it reduces filesystem clutter. Also, it works in chrooted environments.

All we need is to generate a unique name for our program and pass it as the address when calling bind(). The trick is that instead of specifying a file path as the address, we pass a null byte followed by the name of our choosing (e.g. "\0my-unique-name"). The initial null byte is what distinguishes abstract socket names from conventional Unix domain socket path names, which consist of a string of one or more non-null bytes terminated by a null byte.
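The difference between the two kinds of addresses can be inspected with the Rust standard library alone. This is a small sketch that assumes Rust 1.70 or newer on Linux, where `std::os::linux::net::SocketAddrExt` is available; the wrapper `abstract_addr` is a hypothetical helper. As with the nix crate used later, the leading null byte is added for us:

```rust
use std::io;
use std::os::linux::net::SocketAddrExt;
use std::os::unix::net::SocketAddr;

/// Builds an address in the abstract namespace (hypothetical helper).
/// `name` is given WITHOUT the leading null byte; the library prepends it.
fn abstract_addr(name: &[u8]) -> io::Result<SocketAddr> {
    SocketAddr::from_abstract_name(name)
}

fn main() -> io::Result<()> {
    let abstract_name = abstract_addr(b"my-unique-name")?;
    // A conventional, path-based address for comparison (nothing is created
    // on disk here; we only construct the address).
    let path_based = SocketAddr::from_pathname("/tmp/my-program.sock")?;

    // An abstract address has a name but no path ...
    println!("{:?}", abstract_name.as_abstract_name());
    println!("{:?}", abstract_name.as_pathname());
    // ... and a path-based address has a path but no abstract name.
    println!("{:?}", path_based.as_abstract_name());
    println!("{:?}", path_based.as_pathname());
    Ok(())
}
```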

For more details, see this article or Section 57.6 in The Linux Programming Interface.

Sample Implementation in Rust

Let's put the gained knowledge to use and build a program prototype in Rust. A link to the complete source code is available at the end of this post.

We start by defining the main function:

fn main() {
    if let Err(e) = run() {
        eprintln!("error: {}", e);
        std::process::exit(1);
    }
}

Since the use of the ? operator (described later) in main() was not yet available at the time of writing (see RFC 1937; it was later stabilized in Rust 1.26), we create a separate run() function that returns a Result (either it finishes successfully or returns an error). When it fails with an error, we print it to the standard error by using the eprintln!() macro (available since Rust 1.19) and terminate the program with exit code 1. If run() succeeds, the program automatically returns zero, so there is no need for an explicit std::process::exit(0).

The run() function, which represents the heart of our program, has the following signature:

fn run() -> Result<(), Box<Error>>;

The Result type encodes the type of values returned upon success (() denotes no value) and upon error (Error provides base functionality for all errors in Rust). Inside the function, we will perform the single-instance test, do some work, and return.

For working with sockets, we will use the nix crate. It contains Rust-friendly bindings to low-level Unix APIs. This allows us to use a neat, higher-level interface.

To create a Unix domain socket, we call socket() with the following arguments:

let s = socket(AddressFamily::Unix, SockType::Stream, SockFlag::empty(), 0)?;

The trailing zero indicates that the default protocol should be used. The above socket() call roughly corresponds to the following C call:

socket(AF_UNIX, SOCK_STREAM, 0) // in C

Before we move on, let me explain the question mark after the socket() call. The ? operator (see error handling in Rust) represents a shortcut for propagating errors. Roughly speaking, the compiler converts the above piece of code into the following one:

let s = match socket(/*same arguments as above*/) {
    Ok(s) => s,
    Err(e) => return Err(Box::new(e)),
};

Next, we create a socket address in the abstract namespace using a unique ID. The nix crate provides a handy function for this, which frees us from prepending the null byte by ourselves:

let addr = SockAddr::Unix(UnixAddr::new_abstract(b"some-unique-id")?);

Finally, we try binding the socket to the above address in the abstract namespace. When bind() fails with EADDRINUSE, it means that we should quit as another instance of our program is already running. In code, this looks as follows:

if let Err(e) = bind(s, &addr) {
    match e {
        nix::Error::Sys(nix::Errno::EADDRINUSE) => {
            eprintln!("program is already running");
            return Ok(());
        },
        _ => {
            // bind() failed because of an unexpected reason.
            return Err(Box::new(e));
        }
    };
}

Also, I would like to point out that when the process is terminated, the kernel automatically closes the socket. Hence, there is no need for explicitly closing it before returning from the function. This is handy as programs may be abruptly killed via SIGINT, SIGKILL, etc.

Finally, we do our work (represented by a call to an imaginary do_work() function) and return with success:

do_work();

Ok(())

And that is it. If you compile and execute the program, it will start successfully. When you try to run two instances of it at the same time (put a sleep or an infinite loop inside do_work()), the second one will print "program is already running" to the standard error and immediately quit.
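For readers who prefer to avoid an external crate, the same check can be sketched with the standard library only. This is not the code from the repository linked below but an alternative under stated assumptions: Rust 1.70 or newer on Linux, and a hypothetical helper named `acquire_instance_lock`. One difference from the walkthrough above is that `UnixListener::bind_addr()` also puts the socket into the listening state, which is harmless for our purpose.

```rust
use std::io;
use std::os::linux::net::SocketAddrExt;
use std::os::unix::net::{SocketAddr, UnixListener};

/// Tries to become the single running instance (hypothetical helper).
/// Returns Ok(Some(listener)) on success; keep the listener alive for as
/// long as the lock should be held. Returns Ok(None) if the name is taken.
fn acquire_instance_lock(name: &[u8]) -> io::Result<Option<UnixListener>> {
    let addr = SocketAddr::from_abstract_name(name)?;
    match UnixListener::bind_addr(&addr) {
        Ok(listener) => Ok(Some(listener)),
        // EADDRINUSE: another socket is already bound to this name.
        Err(e) if e.kind() == io::ErrorKind::AddrInUse => Ok(None),
        // Anything else is unexpected and is passed on to the caller.
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // The binding must stay in scope; dropping it closes the socket
    // and thereby releases the name.
    let _lock = match acquire_instance_lock(b"some-unique-id")? {
        Some(lock) => lock,
        None => {
            eprintln!("program is already running");
            return Ok(());
        }
    };

    // Stand-in for do_work().
    std::thread::sleep(std::time::Duration::from_secs(1));
    Ok(())
}
```

Holding the listener in a named binding such as `_lock` (rather than `_`) matters here, because `let _ = ...` would drop the socket immediately and release the name.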

Limitations and Caveats

Alas, every rose has its thorns. When using Unix domain sockets and the abstract namespace, be aware of the following caveats:

  • The abstract socket namespace is a non-portable Linux extension. This is fine when your code runs only on Linux, but you will need a different solution if the code has to run elsewhere.
  • You have to find a way to get a unique name. If two programs accidentally use the same name, you are out of luck.
  • There are no permissions. Anyone who guesses your unique name might start using it. If this matters, you will have to implement access control by yourself.

Complete Source Code

The complete source code is available on GitHub.

Discussion

Apart from the comments below, you can also discuss this post at /r/rust and /r/linux.

3 Comments

  1. Hi, excellent write-up. I’m going to use it for my personal project. Could you extend this example by demonstrating inter-process communication with it? Before exiting, I would love to send a message to the previous instance. And I would like the previous instance to read the message passed by the second instance and do something with it. Any help is highly appreciated.

    • Hi. Unix domain sockets are regular sockets, so you can use them for inter-process communication. You just need the first instance to listen on that socket and let all the other instances send a message there (after they fail to bind to that socket, which means an instance is already running).

  2. Hi, I implemented the above Unix abstract socket solution in Python, and it has been running very well for 2 years.

    However, I recently noticed a socket still appearing in netstat for 2 days after the program that created it was no longer running, although its PID was still displayed by the ‘ps -ef’ command.
    Any clue why this happened and how to prevent it?
    Thanks
