A wise man once said:

SSBuZXZlciBzYWlkIHRoYXQ=

Now I know that looks like Cat just ran all over my keyboard-

I did not!

Calm down, I’m setting a scene

But I assure you, that means something. And I’m gonna help you figure out what. All it’ll take is a little bit of luck & a whole lotta fun bit tricks along the way.

But first…

Base64 for Dummies

The text above is encoded in Base64, an encoding system defined in RFC 4648 alongside the lesser-used Base16 & Base32. But don’t go googling “base64 decoder” just yet, because we’re gonna make one together so that you can have something to brag about to your friends learn some bit manipulation tricks.

Base64 uses 65 US-ASCII characters to represent arbitrary byte sequences as text. The 1st 64 are used for actual encoding, while the 65th (=) is used for padding the encoded string so that its length is a multiple of 4. Its use of exclusively ASCII characters makes it ideal for communicating arbitrary data in scenarios where non-text is suboptimal, as it can encode literally any data and be successfully decoded so long as the intended format is known (the bytes will always be the same, but you can’t just say a PNG is an MP3 & have it work). The main character set uses the characters ['A','Z']+['a','z']+['0','9']+'+'+'/' in that order.

For the unaware, I’m using mathematical range notation there. The square brackets mean the range includes the end it’s nearest to (so [1,10] is 1, 2, 3 ... 10). If I had used parenthesis, then the edge items would have been excluded ([1,10) is 1, 2, 3 ... 9).

I’ll explain everything else when we come to it whilst implementing a translator for base64, starting with the easier part.

Encoding

Encoding base64 is rather straightforward. Data is encoded in chunks of 3 bytes.

00000000 00000000 00000000

Actually, they’re octets

Actually, it doesn’t matter

These clusters are then concatenated into 24-bit blocks, like so

00000000_00000000_00000000

and finally, into 4 sextets1

000000 000000 000000 000000

from here, each sextet is encoded by using its value as an index into the character array (that’s why order matters) & pushed to the output.

Let’s try doing that in Rust

cargo new based64

And in main.rs

mod encode; // Add this line

fn main() {
    println!("Hello, world!");
}

After a quick cargo run to check that everything is set up correctly, we can start implementing the encoding algorithm.

// `src/encode.rs`
pub fn encode(bytes: &[u8]) -> String {
    todo!()
}

Now that we have our encode function to take a sequence of bytes & encode it to a base64 String, how do we implement the algorithm in Rust? Because I can see the future, I say we create a second function, encode_triplet, to do the work of encoding a set of 3 bytes.

fn encode_triplet(triplet: [u8; 3]) -> [char; 4] {
    todo!()
}

You may be curious as to why I’m using the array types [u8; 3] & [char; 4] here instead of a &[u8] & String like before. This is so that there’s no chance of ever being given something that isn’t a triplet, and so that it’s clear & enforced by the type system that the function returns exactly 4 characters.

Now, we could go about the encoding completely manually, but I’m gonna start us off simple & use a crate2 to help us. Specifically, bitreader’s BitReader struct. This will allow us to read bits directly from our array instead of needing to use manual bit operations.

Figuring out what those operations are is left as an exercise to the reader

cargo add bitreader
# `Cargo.toml`

# ...

[dependencies]
bitreader = "0.3.6"

Now we need to import it

// `src/encode.rs` (at the very top)

use bitreader::BitReader

Now we can use the BitReader struct to read however many bits into our array we want.

// in `encode_triplet`
// ...
let mut reader = BitReader::new(&triplet);
let first = reader.read_u8(6).unwrap();
let second = reader.read_u8(6).unwrap();
let third = reader.read_u8(6).unwrap();
let fourth = reader.read_u8(6).unwrap();

We use unwrap everywhere because read_u8 returns a Result since it could fail if, for example, there weren’t enough bits available to read from the slice. Since we know we’ll always have an input with 24 bits, we can just pretend it isn’t fallible by unwraping it to either give us the Ok variant’s value or panic (which is a fancy way of saying “crash the program”). This is another reason for using a fixed-size array as our input type instead of a variable-length slice - we don’t have to handle errors related to the size of our input.

But just how are we going to decide what each 6-bit block encodes to? I said earlier

each sextet is encoded by using it’s value as an index into the character array

so, let’s make an array!

// in `src/encode.rs`
const ENCODE_MAP: [char; 64] = [
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
        'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1',
        '2', '3', '4', '5', '6', '7', '8', '9', '+', '/',
    ];

From here, we just need to go ahead & index that array with whatever we read from the input array

// in `encode_triplet`
// ...
[
    ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]
]

We need to use as usize on the values because arrays can only be indexed by numbers of type usize, but we have u8s from read_u8.

and our complete encode_triplet function looks like this:

fn encode_triplet(triplet: [u8, 3]) -> [char; 4] {
    let mut reader = BitReader::new(&triplet);
    let first = reader.read_u8(6).unwrap();
    let second = reader.read_u8(6).unwrap();
    let third = reader.read_u8(6).unwrap();
    let fourth = reader.read_u8(6).unwrap();

    [ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]]
}

You can pat yourself on the back now, you’re one step closer to a fully functioning base64 encoder! But there are still some things in the way: edge cases. Luckily, they aren’t too difficult, and there are only 2 of them! So let’s get back to work & finish this encoder.

Edge Case 1: Only 2 Bytes are Available

If we only have 2 bytes available, then we can only make 3 encoded characters. If we insist on having a multiple-of-4-length encoded string, then that’s where our padding character, =, comes in. We’ll simply encode as much as we can from what we have & fill the remainder with = characters. In the case of a 2-byte remainder, the result will look like [c, c, c, '='].

First, let’s make a constant for the padding character for the sake of never making a typo

// At the top of `src/encode.rs`
const PADDING: char = '='

Next, we can scaffold our encode_doublet function

// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six1 = reader.read_u8(6).unwrap();
    let six2 = reader.read_u8(6).unwrap();

    todo!()
}

From here we run into a problem. After reading 12 bytes from rem, we’re only left with 4 bits. Just using read_u8(4) will give us the bits, but in the wrong place and our encoded character will be off. How far off? Well, only by a factor of 4!

Why would it do that?!

I have no goddamn clue. It just does.

This does mean it’s an easy problem to fix though. All we have to do is left-shift it by 2 & we’ll have quadrupled our value, solving the problem.

Translated into Rust, we get

// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    // ...
    let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2)

    [ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}

We use overflowing_shl to make it clear that overflows are intentional. After all, the upper 4 bits of our nibble will all be 0 anyways, so it doesn’t matter. We need to assign to (nibble, _ because overflowing_shl returns a tuple where the 2nd value is a bool for whether or not the shift overflowed. Using _ lets us discard the bool since we don’t care about it & cargo would warn us that it’s unused otherwise (we could also prefix the name with a _ (like _bool) to make cargo shut up but we have no reason to store it so it’s just a wasted variable).

With that, our full encode_doublet function is as follows:

fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    let mut  reader = BitReader::new(&rem);
    let six1 = reader.read_u8(6).unwrap();
    let six2 = reader.read_u8(6).unwrap();

    let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2);

    [ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}

Edge Case 2: Only 1 Byte is Available

At last, we reach the final edge case, where we only have 1 full byte remaining. So let’s just get right to the code:

// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six = reader.read_u8(6).unwrap();

    todo!();
}

Wait, why are we taking rem as a [u8; 1] instead of just a u8?

Glad you asked, cat. Because BitReader::new takes a slice (a reference to part of a list type), we need to be borrowing an array anyways. If we tried to do it like this

// Note: `rem` is now a `u8`
let mut reader = BitReader::new(&[rem])

Then we’d get an error!

$ cargo check
error[E0716]: temporary value dropped while borrowed
   --> src/encode.rs:131:42
    |
131 |         let mut reader = BitReader::new(&[rem]);
    |                                          ^^^^^ - temporary value is freed at the end of this statement
    |                                          |
    |                                          creates a temporary which is freed while still in use
132 |         let six = reader.read_u8(6).unwrap();
    |                   ----------------- borrow later used here
    |
help: consider using a `let` binding to create a longer lived value
    |
131 ~         let binding = [rem];
132 ~         let mut reader = BitReader::new(&binding);
    |

For more information about this error, try `rustc --explain E0716`.
error: could not compile `based` due to previous error

Thankfully, cargo check is helpful enough to tell us exactly how to fix that; simply create an extra variable for our single-length array, then borrow that. I, however, happen to think that looks ugly & prefer to simply make rem a [u8; 1] in the first place so there’s no need for the extra variable. But you can do it cargo’s way if you want, I won’t mind.

Anyway, we now have just 2 bits left of our byte after snagging the 1st six, so what should we do with them? We learnt last time that simply taking a partial input as-is will lead to our encoding being off, but will it still be off by a factor of 4, or will it be by some other amount?

Turns out that in the case of 2 bits, the input is actually off by a factor of 16! This time we’ll need to left-shift our input by 4 in order to correct it.

// `encode_singlet` in `src/encode.rs`
// ...
let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);

And finally, the mapping

// `encode_singlet` in `src/encode.rs`
[ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]

Thanks to everything we’ve learnt, encode_singlet was done quite quickly! Here’s what it looks like:

// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six = reader.read_u8(6).unwrap();
    let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);

    [ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]
}

Don’t pat yourself on the back just yet though, we still have the encode function to write!

Tying Together our encoding functions

Currently, encode looks like this:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    todo!()
}

Before we start filling it out, let’s make a game plan. What we need is to:

  1. Split the input into 3-byte chunks
  2. Encode all the triplets
  3. Encode the remainder (if any) using either encode_doublet or encode_singlet

What’s nice about these 3 objectives is that we only need 2 blocks to do them - 2. & 3. can be done in one block. For 1., we’ll use the chunks method on a slice. This will give us all the nonoverlapping chunks with a length less than or equal to the one we specify - 3, in our case.

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    let chunks = bytes.chunks(3);
}

We also need to initialise an array to push our encoded characters into, so let’s do that now:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    let mut encoded: Vec<[char; 4]> = vec![]; // New
    let chunks = bytes.chunks(3);
}

encoded has the type [char; 4] because that’s what our encoding functions return. Don’t worry, we can get it to a String at the very end when we need it.

Moving on, all we’ll need from here is a for loop with a match statement.

// `encode` in `src/encode.rs`
// ...
    for chunk in chunks {
        match chunk.len() {
            3 => todo!(),
            2 => todo!(),
            1 => todo!(),
            _ => todo!(),
        }
    }

All a match statement does is execute different blocks depending on the structure of what you give it. chunk.len() gives us a usize, so we can match against different possible values. We use _ in the final match arm because a match statement must be exhaustive - it has to cover every possible case - and _ is a wildcard so we can catch everything else.

The match arms themselves are very simple since we’ve done all the hard work already while writing our encode_ functions. All we need to do now is fill them in.

For encode_triplet, it looks like this:

// `encode` in `src/encode.rs`
// ...
for chunk in chunks {
        match chunk.len() {
            // Other arms omitted
            3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])),
            _ => unreachable!("Mathematically impossible"),
        }
    }

Notice that we also fill out the _ arm with an unreachable!, signalling that it will never actually execute (if it does, then there’s a pretty big problem).

The other match arms are similar

// `encode` in `src/encode.rs`
// ...
match chunk.len() {
    // Other arms omitted
    2 => encoded.push(encode_doublet([chunk[0], chunk[1]])),
    1 => encoded.push(encode_singlet([chunk[0]])),
}

And with that, we have everything we need to fully encode a byte sequence! After the for loop finishes, encoded will be filled with the encoded characters.

But we can’t just return it as-is. Remember, the type of encoded is currently Vec<[char; 4]> but we need it to be String, so how do we do that?

We can collect an iterator over chars (An Iterator<Item = char) into a String, so let’s try & get our Vec<[char; 4]> to that. If we just call iter on encoded now, we’ll be given an Iterator<Item = [char; 4]>, which is very much not what we want. We need to somehow take the items in the arrays out of the arrays & into the Vec itself. Almost like squishing the nested list into a list with only one level.

So, kinda like flattening it?

Right on the money

Iterator’s flatten method is our friend here. It’ll take an Iterator over a collection (such as an array), & return an iterator over the items of that collection. Now we just need to do

// `encode` in `src/encode.rs`
// ...
encoded.iter().flatten().collect()

and we’ll be returning the correct type. We don’t need to specify the type to collect into because the compiler can guess that we want a String from the function’s signature telling it that’s what it returns. We could specify it like so

encoded.iter().flatten().collect::<String>()

(known as turbofish syntax) but that’s unnecessary.

Our final encode function is now the following:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> Vec<String> {
    let chunks = bytes.chunks(3);
    let mut encoded = vec![];

    for chunk in chunks {
        match chunk.len() {
            3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])?),
            2 => encoded.push(encode_doublet([chunk[0], chunk[1]])?),
            1 => encoded.push(encode_singlet([chunk[0]])?),
            _ => unreachable!("Mathematically impossible"),
        }
    }

    encoded.iter().flatten().collect()
    }

With that, we’ve completed all the encoding functionality we need! To test it out, let’s make a quick CLI

A Quick & Dirty CLI

Since this isn’t the main focus, I won’t spend too long explaining the CLI. We’ll just throw together something that works for now so that we can see how our encoding works & we can touch it up later.

First, we need to fetch the command-line arguments provided to our program

// `src/main.rs`
mod encode;

fn main() {
    let Some(input) = std::env::args().nth(2) else {
        eprintln!("Must be called with an input string");
        std::process::exit(1);
    };
}

std::env::args will give us all the arguments supplied to the program, with the first being the filename. Since we don’t care about that, we use .nth(2) to grab the 2nd argument, which would be the input string to encode. We assign to input using the new let-else syntax. It’s like a match statement (note how we use Some(input) as we’re matching against the result being Some), but the else branch must terminate the program. We just print an error message & then exit the program with an exit code of 1 to indicate failure.

Next, we encode the input & print it

// `src/main.rs`
// ...
use encode::encode;

fn main() {
    // ...
    let encoded = encode(input.as_bytes());
    println!("{encoded}");
}

This is much more simple. We pass the byte representation of the input using as_bytes() because input is a String, which isn’t what we need. Then, we print the encoded string by surrounding it in curly braces within the string we send to println!. We could also do

println!("{}", encoded);

but I prefer the first option, as they both do the same thing (in this case) and the first is shorter. It’s entirely personal preference though.

Now, we can test out our encoder

$ cargo run --quiet -- 'Hello world!'
SGVsbG8gd29ybGQh

Success! Now, all we have to do is implement decoding & based64 will be the best base64 encoder/decoder around!

No it won’t, he’s lying to you