A Little Story About the `yes` Unix Command

What’s the simplest Unix command you know?
There’s echo, which prints a string to stdout and true, which always terminates with an exit code of 0.

Among the series of simple Unix commands, there’s also yes. If you execute it without arguments, you get an infinite stream of y’s, separated by a newline:

y
y
y
y
(...you get the idea)

What seems to be pointless in the beginning turns out to be pretty helpful :

yes | sh boring_installation.sh

Ever installed a program, which required you to type “y” and hit enter to keep going? yes to the rescue! It will carefully fulfill its duty, so you can keep watching Pootie Tang.

Writing yes

Here’s a basic version in… uhm… BASIC.

10 PRINT "y"
20 GOTO 10

And here’s the same thing in Python:

while True:
    print("y")

Simple, eh? Not so quick!
Turns out, that program is quite slow.

python yes.py | pv -r > /dev/null
[4.17MiB/s]

Compare that with the built-in version on my Mac:

yes | pv -r > /dev/null
[34.2MiB/s]

So I tried to write a quicker version in Rust. Here’s my first attempt:

use std::env;

fn main() {
  let expletive = env::args().nth(1).unwrap_or("y".into());
  loop {
    println!("{}", expletive);
  }
}

Some explanations:

The string we want to print in a loop is the first command line parameter and is named expletive. I learned this word from the yes manpage.
I use unwrap_or to get the expletive from the parameters. In case the parameter is not set, we use “y” as a default.
The default parameter gets converted from a string slice (&str) into an owned string on the heap (String) using into().

Let’s test it.

cargo run --release | pv -r > /dev/null
   Compiling yes v0.1.0
    Finished release [optimized] target(s) in 1.0 secs
     Running `target/release/yes`
[2.35MiB/s]

Whoops, that doesn’t look any better. It’s even slower than the Python version! That caught my attention, so I looked around for the source code of a C implementation.

Here’s the very first version of the program, released with Version 7 Unix and famously authored by Ken Thompson on Jan 10, 1979:

main(argc, argv)
char **argv;
{
  for (;;)
    printf("%s\n", argc>1? argv[1]: "y");
}

No magic here.

Compare that to the 128-line-version from the GNU coreutils, which is mirrored on Github. After 25 years, it is still under active development! The last code change happened around a year ago. That’s quite fast:

# brew install coreutils
gyes | pv -r > /dev/null
[854MiB/s]

The important part is at the end:

/* Repeatedly output the buffer until there is a write error; then fail.  */
while (full_write (STDOUT_FILENO, buf, bufused) == bufused)
  continue;

Aha! So they simply use a buffer to make write operations faster. The buffer size is defined by a constant named BUFSIZ, which gets chosen on each system so as to make I/O efficient (see here). On my system, that was defined as 1024 bytes. I actually had better performance with 8192 bytes.

I’ve extended my Rust program:

use std::env;
use std::io::{self, BufWriter, Write};

const BUFSIZE: usize = 8192;

fn main() {
    let expletive = env::args().nth(1).unwrap_or("y".into());
    let mut writer = BufWriter::with_capacity(BUFSIZE, io::stdout());
    loop {
        writeln!(writer, "{}", expletive).unwrap();
    }
}

The important part is, that the buffer size is a multiple of four, to ensure memory alignment.

Running that gave me 51.3MiB/s. Faster than the version, which comes with my system, but still way slower than the results from this Reddit post that I found, where the author talks about 10.2GiB/s.

Update

Once again, the Rust community did not disappoint.
As soon as this post hit the Rust subreddit, user nwydo pointed out a previous discussion on the same topic. Here’s their optimized code, that breaks the 3GB/s mark on my machine:

use std::env;
use std::io::{self, Write};
use std::process;
use std::borrow::Cow;

use std::ffi::OsString;
pub const BUFFER_CAPACITY: usize = 64 * 1024;

pub fn to_bytes(os_str: OsString) -> Vec<u8> {
  use std::os::unix::ffi::OsStringExt;
  os_str.into_vec()
}

fn fill_up_buffer<'a>(buffer: &'a mut [u8], output: &'a [u8]) -> &'a [u8] {
  if output.len() > buffer.len() / 2 {
    return output;
  }

  let mut buffer_size = output.len();
  buffer[..buffer_size].clone_from_slice(output);

  while buffer_size < buffer.len() / 2 {
    let (left, right) = buffer.split_at_mut(buffer_size);
    right[..buffer_size].clone_from_slice(left);
    buffer_size *= 2;
  }

  &buffer[..buffer_size]
}

fn write(output: &[u8]) {
  let stdout = io::stdout();
  let mut locked = stdout.lock();
  let mut buffer = [0u8; BUFFER_CAPACITY];

  let filled = fill_up_buffer(&mut buffer, output);
  while locked.write_all(filled).is_ok() {}
}

fn main() {
  write(&env::args_os().nth(1).map(to_bytes).map_or(
    Cow::Borrowed(
      &b"y\n"[..],
    ),
    |mut arg| {
      arg.push(b'\n');
      Cow::Owned(arg)
    },
  ));
  process::exit(1);
}

Now that’s a whole different ballgame!

We prepare a filled string buffer, which will be reused for each loop.
Stdout is protected by a lock. So, instead of constantly acquiring and releasing it, we keep it all the time.
We use a the platform-native std::ffi::OsString and std::borrow::Cow to avoid unnecessary allocations.

The only thing that I could contribute was removing an unnecessary mut. 😅

Lessons learned

The trivial program yes turns out not to be so trivial after all. It uses output buffering and memory alignment to improve performance. Re-implementing Unix tools is fun and makes me appreciate the nifty tricks, which make our computers fast.