Learning Bit Magic by Translating Keysmashes

A wise man once said:

SSBuZXZlciBzYWlkIHRoYXQ=

Now I know that looks like Cat just ran all over my keyboard-

I did not!

Calm down, I’m setting a scene

But I assure you, that means something. And I’m gonna help you figure out what. All it’ll take is a little bit of luck & a whole lotta fun bit tricks along the way.

But first…

Base64 for Dummies⌗

The text above is encoded in Base64, an encoding system defined in RFC 4648 alongside the lesser-used Base16 & Base32. But don’t go googling “base64 decoder” just yet, because we’re gonna make one together so that you can ~~have something to brag about to your friends~~ learn some bit manipulation tricks.

Base64 uses 65 US-ASCII characters to represent arbitrary byte sequences as text. The 1st 64 are used for actual encoding, while the 65th (=) is used for padding the encoded string so that its length is a multiple of 4. Its use of exclusively ASCII characters makes it ideal for communicating arbitrary data in scenarios where non-text is suboptimal, as it can encode literally any data and be successfully decoded so long as the intended format is known (the bytes will always be the same, but you can’t just say a PNG is an MP3 & have it work). The main character set uses the characters ['A','Z']+['a','z']+['0','9']+'+'+'/' in that order.

For the unaware, I’m using mathematical range notation there. The square brackets mean the range includes the end it’s nearest to (so [1,10] is 1, 2, 3 ... 10). If I had used parenthesis, then the edge items would have been excluded ([1,10) is 1, 2, 3 ... 9).

I’ll explain everything else when we come to it whilst implementing a translator for base64, starting with the easier part.

Encoding⌗

Encoding base64 is rather straightforward. Data is encoded in chunks of 3 bytes.

00000000 00000000 00000000

Actually, they’re octets

Actually, it doesn’t matter

These clusters are then concatenated into 24-bit blocks, like so

00000000_00000000_00000000

and finally, into 4 sextets¹

000000 000000 000000 000000

from here, each sextet is encoded by using its value as an index into the character array (that’s why order matters) & pushed to the output.

Let’s try doing that in Rust

cargo new based64

And in main.rs

mod encode; // Add this line

fn main() {
    println!("Hello, world!");
}

After a quick cargo run to check that everything is set up correctly, we can start implementing the encoding algorithm.

// `src/encode.rs`
pub fn encode(bytes: &[u8]) -> String {
    todo!()
}

Now that we have our encode function to take a sequence of bytes & encode it to a base64 String, how do we implement the algorithm in Rust? Because I can see the future, I say we create a second function, encode_triplet, to do the work of encoding a set of 3 bytes.

fn encode_triplet(triplet: [u8; 3]) -> [char; 4] {
    todo!()
}

You may be curious as to why I’m using the array types [u8; 3] & [char; 4] here instead of a &[u8] & String like before. This is so that there’s no chance of ever being given something that isn’t a triplet, and so that it’s clear & enforced by the type system that the function returns exactly 4 characters.

Now, we could go about the encoding completely manually, but I’m gonna start us off simple & use a crate² to help us. Specifically, bitreader’s BitReader struct. This will allow us to read bits directly from our array instead of needing to use manual bit operations.

Figuring out what those operations are is left as an exercise to the reader

cargo add bitreader

# `Cargo.toml`

# ...

[dependencies]
bitreader = "0.3.6"

Now we need to import it

// `src/encode.rs` (at the very top)

use bitreader::BitReader

Now we can use the BitReader struct to read however many bits into our array we want.

// in `encode_triplet`
// ...
let mut reader = BitReader::new(&triplet);
let first = reader.read_u8(6).unwrap();
let second = reader.read_u8(6).unwrap();
let third = reader.read_u8(6).unwrap();
let fourth = reader.read_u8(6).unwrap();

We use unwrap everywhere because read_u8 returns a Result since it could fail if, for example, there weren’t enough bits available to read from the slice. Since we know we’ll always have an input with 24 bits, we can just pretend it isn’t fallible by unwraping it to either give us the Ok variant’s value or panic (which is a fancy way of saying “crash the program”). This is another reason for using a fixed-size array as our input type instead of a variable-length slice - we don’t have to handle errors related to the size of our input.

But just how are we going to decide what each 6-bit block encodes to? I said earlier

each sextet is encoded by using it’s value as an index into the character array

so, let’s make an array!

// in `src/encode.rs`
const ENCODE_MAP: [char; 64] = [
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
        'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
        'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1',
        '2', '3', '4', '5', '6', '7', '8', '9', '+', '/',
    ];

From here, we just need to go ahead & index that array with whatever we read from the input array

// in `encode_triplet`
// ...
[
    ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]
]

We need to use as usize on the values because arrays can only be indexed by numbers of type usize, but we have u8s from read_u8.

and our complete encode_triplet function looks like this:

fn encode_triplet(triplet: [u8, 3]) -> [char; 4] {
    let mut reader = BitReader::new(&triplet);
    let first = reader.read_u8(6).unwrap();
    let second = reader.read_u8(6).unwrap();
    let third = reader.read_u8(6).unwrap();
    let fourth = reader.read_u8(6).unwrap();

    [ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]]
}

You can pat yourself on the back now, you’re one step closer to a fully functioning base64 encoder! But there are still some things in the way: edge cases. Luckily, they aren’t too difficult, and there are only 2 of them! So let’s get back to work & finish this encoder.

Edge Case 1: Only 2 Bytes are Available⌗

If we only have 2 bytes available, then we can only make 3 encoded characters. If we insist on having a multiple-of-4-length encoded string, then that’s where our padding character, =, comes in. We’ll simply encode as much as we can from what we have & fill the remainder with = characters. In the case of a 2-byte remainder, the result will look like [c, c, c, '='].

First, let’s make a constant for the padding character for the sake of never making a typo

// At the top of `src/encode.rs`
const PADDING: char = '='

Next, we can scaffold our encode_doublet function

// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six1 = reader.read_u8(6).unwrap();
    let six2 = reader.read_u8(6).unwrap();

    todo!()
}

From here we run into a problem. After reading 12 bytes from rem, we’re only left with 4 bits. Just using read_u8(4) will give us the bits, but in the wrong place and our encoded character will be off. How far off? Well, only by a factor of 4!

Why would it do that?!

I have no goddamn clue. It just does.

This does mean it’s an easy problem to fix though. All we have to do is left-shift it by 2 & we’ll have quadrupled our value, solving the problem.

Translated into Rust, we get

// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    // ...
    let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2)

    [ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}

We use overflowing_shl to make it clear that overflows are intentional. After all, the upper 4 bits of our nibble will all be 0 anyways, so it doesn’t matter. We need to assign to (nibble, _ because overflowing_shl returns a tuple where the 2nd value is a bool for whether or not the shift overflowed. Using _ lets us discard the bool since we don’t care about it & cargo would warn us that it’s unused otherwise (we could also prefix the name with a _ (like _bool) to make cargo shut up but we have no reason to store it so it’s just a wasted variable).

With that, our full encode_doublet function is as follows:

fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
    let mut  reader = BitReader::new(&rem);
    let six1 = reader.read_u8(6).unwrap();
    let six2 = reader.read_u8(6).unwrap();

    let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2);

    [ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}

Edge Case 2: Only 1 Byte is Available⌗

At last, we reach the final edge case, where we only have 1 full byte remaining. So let’s just get right to the code:

// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six = reader.read_u8(6).unwrap();

    todo!();
}

Wait, why are we taking rem as a [u8; 1] instead of just a u8?

Glad you asked, cat. Because BitReader::new takes a slice (a reference to part of a list type), we need to be borrowing an array anyways. If we tried to do it like this

// Note: `rem` is now a `u8`
let mut reader = BitReader::new(&[rem])

Then we’d get an error!

$ cargo check
error[E0716]: temporary value dropped while borrowed
   --> src/encode.rs:131:42
    |
131 |         let mut reader = BitReader::new(&[rem]);
    |                                          ^^^^^ - temporary value is freed at the end of this statement
    |                                          |
    |                                          creates a temporary which is freed while still in use
132 |         let six = reader.read_u8(6).unwrap();
    |                   ----------------- borrow later used here
    |
help: consider using a `let` binding to create a longer lived value
    |
131 ~         let binding = [rem];
132 ~         let mut reader = BitReader::new(&binding);
    |

For more information about this error, try `rustc --explain E0716`.
error: could not compile `based` due to previous error

Thankfully, cargo check is helpful enough to tell us exactly how to fix that; simply create an extra variable for our single-length array, then borrow that. I, however, happen to think that looks ugly & prefer to simply make rem a [u8; 1] in the first place so there’s no need for the extra variable. But you can do it cargo’s way if you want, I won’t mind.

Anyway, we now have just 2 bits left of our byte after snagging the 1st six, so what should we do with them? We learnt last time that simply taking a partial input as-is will lead to our encoding being off, but will it still be off by a factor of 4, or will it be by some other amount?

Turns out that in the case of 2 bits, the input is actually off by a factor of 16! This time we’ll need to left-shift our input by 4 in order to correct it.

// `encode_singlet` in `src/encode.rs`
// ...
let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);

And finally, the mapping

// `encode_singlet` in `src/encode.rs`
[ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]

Thanks to everything we’ve learnt, encode_singlet was done quite quickly! Here’s what it looks like:

// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
    let mut reader = BitReader::new(&rem);
    let six = reader.read_u8(6).unwrap();
    let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);

    [ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]
}

Don’t pat yourself on the back just yet though, we still have the encode function to write!

Tying Together our encoding functions⌗

Currently, encode looks like this:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    todo!()
}

Before we start filling it out, let’s make a game plan. What we need is to:

Split the input into 3-byte chunks
Encode all the triplets
Encode the remainder (if any) using either encode_doublet or encode_singlet

What’s nice about these 3 objectives is that we only need 2 blocks to do them - 2. & 3. can be done in one block. For 1., we’ll use the chunks method on a slice. This will give us all the nonoverlapping chunks with a length less than or equal to the one we specify - 3, in our case.

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    let chunks = bytes.chunks(3);
}

We also need to initialise an array to push our encoded characters into, so let’s do that now:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
    let mut encoded: Vec<[char; 4]> = vec![]; // New
    let chunks = bytes.chunks(3);
}

encoded has the type [char; 4] because that’s what our encoding functions return. Don’t worry, we can get it to a String at the very end when we need it.

Moving on, all we’ll need from here is a for loop with a match statement.

// `encode` in `src/encode.rs`
// ...
    for chunk in chunks {
        match chunk.len() {
            3 => todo!(),
            2 => todo!(),
            1 => todo!(),
            _ => todo!(),
        }
    }

All a match statement does is execute different blocks depending on the structure of what you give it. chunk.len() gives us a usize, so we can match against different possible values. We use _ in the final match arm because a match statement must be exhaustive - it has to cover every possible case - and _ is a wildcard so we can catch everything else.

The match arms themselves are very simple since we’ve done all the hard work already while writing our encode_ functions. All we need to do now is fill them in.

For encode_triplet, it looks like this:

// `encode` in `src/encode.rs`
// ...
for chunk in chunks {
        match chunk.len() {
            // Other arms omitted
            3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])),
            _ => unreachable!("Mathematically impossible"),
        }
    }

Notice that we also fill out the _ arm with an unreachable!, signalling that it will never actually execute (if it does, then there’s a pretty big problem).

The other match arms are similar

// `encode` in `src/encode.rs`
// ...
match chunk.len() {
    // Other arms omitted
    2 => encoded.push(encode_doublet([chunk[0], chunk[1]])),
    1 => encoded.push(encode_singlet([chunk[0]])),
}

And with that, we have everything we need to fully encode a byte sequence! After the for loop finishes, encoded will be filled with the encoded characters.

But we can’t just return it as-is. Remember, the type of encoded is currently Vec<[char; 4]> but we need it to be String, so how do we do that?

We can collect an iterator over chars (An Iterator<Item = char) into a String, so let’s try & get our Vec<[char; 4]> to that. If we just call iter on encoded now, we’ll be given an Iterator<Item = [char; 4]>, which is very much not what we want. We need to somehow take the items in the arrays out of the arrays & into the Vec itself. Almost like squishing the nested list into a list with only one level.

So, kinda like flattening it?

Right on the money

Iterator’s flatten method is our friend here. It’ll take an Iterator over a collection (such as an array), & return an iterator over the items of that collection. Now we just need to do

// `encode` in `src/encode.rs`
// ...
encoded.iter().flatten().collect()

and we’ll be returning the correct type. We don’t need to specify the type to collect into because the compiler can guess that we want a String from the function’s signature telling it that’s what it returns. We could specify it like so

encoded.iter().flatten().collect::<String>()

(known as turbofish syntax) but that’s unnecessary.

Our final encode function is now the following:

// `src/encode.rs`
fn encode(bytes: &[u8]) -> Vec<String> {
    let chunks = bytes.chunks(3);
    let mut encoded = vec![];

    for chunk in chunks {
        match chunk.len() {
            3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])?),
            2 => encoded.push(encode_doublet([chunk[0], chunk[1]])?),
            1 => encoded.push(encode_singlet([chunk[0]])?),
            _ => unreachable!("Mathematically impossible"),
        }
    }

    encoded.iter().flatten().collect()
    }

With that, we’ve completed all the encoding functionality we need! To test it out, let’s make a quick CLI

A Quick & Dirty CLI⌗

Since this isn’t the main focus, I won’t spend too long explaining the CLI. We’ll just throw together something that works for now so that we can see how our encoding works & we can touch it up later.

First, we need to fetch the command-line arguments provided to our program

// `src/main.rs`
mod encode;

fn main() {
    let Some(input) = std::env::args().nth(2) else {
        eprintln!("Must be called with an input string");
        std::process::exit(1);
    };
}

std::env::args will give us all the arguments supplied to the program, with the first being the filename. Since we don’t care about that, we use .nth(2) to grab the 2nd argument, which would be the input string to encode. We assign to input using the new let-else syntax. It’s like a match statement (note how we use Some(input) as we’re matching against the result being Some), but the else branch must terminate the program. We just print an error message & then exit the program with an exit code of 1 to indicate failure.

Next, we encode the input & print it

// `src/main.rs`
// ...
use encode::encode;

fn main() {
    // ...
    let encoded = encode(input.as_bytes());
    println!("{encoded}");
}

This is much more simple. We pass the byte representation of the input using as_bytes() because input is a String, which isn’t what we need. Then, we print the encoded string by surrounding it in curly braces within the string we send to println!. We could also do

println!("{}", encoded);

but I prefer the first option, as they both do the same thing (in this case) and the first is shorter. It’s entirely personal preference though.

Now, we can test out our encoder

$ cargo run --quiet -- 'Hello world!'
SGVsbG8gd29ybGQh

Success! Now, all we have to do is implement decoding & based64 will be the best base64 encoder/decoder around!

No it won’t, he’s lying to you

More of a teensy weensy exaggeration, but will still be good enough.

Decoding⌗

Since I’ve already explained encoding, decoding won’t need too much explanation. It’s basically the same as encoding, but in reverse.

We start with an input sequence

ABCDEFGHIJK=

Split it into 4-character chunks

ABCD EFGH IJK=

And decode each character as its offset into the base64 character array (A has an offset of 0, B has an offset of 1, etc.) as a 6-bit number. If the chunk ends in a padding character, then we need to set those bits to 0.

Then, we concatenate the sextets into one long 24-bit group, then split that into 3 bytes. The base64 string ABCD gives us 00000000 00010000 10000011 (underscores added for easier reading).

Now that we know roughly what we’re doing, let’s get to it!

Outlining the Main Decoding Function⌗

// `src/main.rs`
mod decode;
// ...

// `src/decode.rs`
pub fn decode(b64: &str) -> Vec<u8> {
    // Something...
}

Since our algorithm requires that we have 4 base64 characters per decoding group, we need to make sure that’s the case before we begin decoding.

We could also just assume that the input is the right shape, but that would be a bit irresponsible

To do that, we’ll check that the length of our input string is divisible by 4, and pad it if not:

// `decode` in `src/decode.rs`
    // ...
    let padded = if b64.len() % 4 == 0 {
        b64.to_string()
        } else { 
            let mut tmp = b64.to_string();
            while tmp.len() % 4 != 0 {
                tmp.push('=');
            }
            tmp
    };

This takes advantage of the fact that Rust’s if-else statements are expressions - they can return a value.

Most things in rust are!

We need to call .to_string on b64 so that we can get a String instead of a &str that we can mutate by .pushing to, otherwise, this would be a good deal more tedious.

Next, we initialise a Vec to put our decoded output into.

// `decode` in `src/decode.rs`
    // ...
    let mut decoded: Vec<u8> = vec![];

Unlike with encode, our output Vec isn’t nested, since we append items to it a little differently than before.

We can then scaffold our decoding loop, as well as a helper function we’ll use to do the real dirty work.

// `decode` in `src/decode.rs`
    // ...
    for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
        if chunk.ends_with(&['=', '=']) {
            todo!()
        } else if chunk.ends_with(&['=']) {
            todo!()
        } else { // Exact multiple of 4
            todo!()
        }
    }

A Fun Helper Function⌗

// `src/decode.rs`
fn decode_quad([a, b, c, d]: [char; 4]) -> [u8; 3] {
    todo!()
}

decode_quad’s parameters may look weird to you, but the explanation is rather simple. Rust allows us to pattern match within a function’s input parameters, like in a match statement, so we can match a, b, c, & d to the 1st, 2nd, 3rd, & 4th items in a 4-length array of chars. This lets us access the array items without having to index the array, which is marginally less typing.

As for how we’ll get a [char; 4] to a [u8; 3], we’re gonna do some bit-manipulation!

Finally, the post’s title makes sense. It only took 17 minutes

That’s only an estimate, cat. It may well have taken longer

But before we do that, we have one last bit of boilerplate to do - we need a decoding map just like we did for encoding.

// `src/decode.rs`
const DECODE_MAP: [u8; 128] = [
        0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64,
        0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64,
        0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x3E, 0x64,
        0x64, 0x64, 0x3F, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x64, 0x64,
        0x64, 0x64, 0x64, 0x64, 0x64, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09,
        0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,
        0x19, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20, 0x21,
        0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F, 0x30,
        0x31, 0x32, 0x33, 0x64, 0x64, 0x64, 0x64, 0x64,
    ];

This map will let us use the char’s numerical value as an index into it to get its offset into the encoding map.

Some things don’t have to be understood, just copied

Now, we’re actually gonna do some bit manipulation!

Our mission is simple:

Concatenate the sextets into one 24-bit group
Split the 24-bit group into 3 8-bit groups

We could do these with bit_reader like before, but it’s more fun to do it this way.

To concatenate the 4 sextets, we can essentially shift each one over to the position it should be in the concatenated sequence. For a, this means shifting it so that its most significant bit is in position 24 instead of position 8 - which can be done by a shift-left of 18 places.

We could also do this by multiplying its numerical value by 39, but that’s a lot less clear what our intentions are

In Rust, this looks like

// `decode_quad` in `src/decode.rs`
let concat_bytes = DECODE_MAP[a as usize] as u32 << 18;

The indexed value is converted to a 32-bit integer using as u32 because that’s the smallest available type that will contain a 24-bit value.

Getting the other sextets into position is rather simple (just decrease the left-shift amount by 6 each time), but how do we join them together? The answer is with a bitwise OR operation (expressed in Rust (and most programming languages) as |). This will look at the bits of each operand³ in the same positions and create a new number where every position that has a 1 in either input bit is a 1 in the output, and a 0 otherwise. 1 OR 1 = 1, 1 OR 0 = 1, and 0 OR 0 = 1 (0 or 1 skipped because OR is commutative⁴, like +)

This allows us to combine the bits of our operands, which will allow us to join our 4 shifted sextets into one number since none of them overlap.

The Rust code for this looks like the following:

// `decode_quad` in `src/decode.rs`
let concat_bytes = (DECODE_MAP[a as usize] as u32 << 18)
    | (DECODE_MAP[b as usize] as u32 << 12)
    | (DECODE_MAP[c as usize] as u32 << 6)
    | DECODE_MAP[d as usize] as u32;

Now that step 1 is complete, we just need to split our 24-bit sequence into 3 bytes. We can do this by first shifting each byte into position, then using a logical AND with the byte 1111 1111 to make sure that all other bits are cleared. logical AND will produce a 1 bit only if both of the operands are 1 - 1 AND 1 = 1, 1 AND 0 = 0, & 0 AND 0 = 0. Most programming languages use & for bitwise AND. This process is called masking, & is a pretty big part of messing with bits.

Translating this into Rust code, we get

// `decode_quad` in `src/decode.rs`
// ...
[
    ((concat_bytes >> 16) & 0b1111_1111) as u8,
    ((concat_bytes >> 8) & 0b1111_1111) as u8,
    (concat_bytes & 0b1111_1111) as u8
]

And that’s all we need to be able to decode a quadruplet! Bit manipulation really isn’t as scary as it may at first seem; you just need a way to visualise what you want to manipulate the bits into doing & the ability to google what operations to use to make it happen.

Our full decode_quad function looks like this:

fn decode_quad([a, b, c, d]: [char; 4]) -> [u8; 3] {
    let concat_bytes = (DECODE_MAP[a as usize] as u32 << 18)
    | (DECODE_MAP[b as usize] as u32 << 12)
    | (DECODE_MAP[c as usize] as u32 << 6)
    | DECODE_MAP[d as usize] as u32;

    [
    ((concat_bytes >> 16) as u8) & 0b1111_1111,
    ((concat_bytes >> 8) as u8) & 0b1111_1111,
    (concat_bytes as u8) & 0b1111_1111
    ]
}

Finishing the Job⌗

Now, we return to the decode function, which currently looks like this:

pub fn decode(b64: &str) -> Vec<u8> {
    let padded = if b64.len() % 4 == 0 {
        b64.to_string()
        } else { 
            let mut tmp = b64.to_string();
            while tmp.len() % 4 != 0 {
                tmp.push('=');
            }
            tmp
    };
    let mut decoded: Vec<u8> = vec![];

    for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
        if chunk.ends_with(&['=', '=']) {
            todo!()
        } else if chunk.ends_with(&['=']) {
            todo!()
        } else { // Exact multiple of 4
            todo!()
        }
    }
}

Filling out these if blocks is rather simple - for every = in the chunk, we instead pass a null (0) character to decode_quad.

// `decode` in `src/decode.rs`
    // ...
        if chunk.ends_with(&['=', '=']) {
            let tri = Self::decode_quad([seg[0], seg[1], 0 as char, 0 as char])?;
            decoded.push(tri[0]);
        } else if chunk.ends_with(&['=']) {
            let tri = Self::decode_quad([seg[0], seg[1], seg[2], 0 as char])?;
            decoded.push(tri[0]);
            decoded.push(tri[1]);
        } else { // Exact multiple of 4
            let tri = Self::decode_quad([seg[0], seg[1], seg[2], seg[3]])?;
            decoded.extend_from_slice(&tri)
        }

Now we just need to return decoded, and decode is complete!

pub fn decode(b64: &str) -> Vec<u8> {
    let padded = if b64.len() % 4 == 0 {
        b64.to_string()
        } else { 
            let mut tmp = b64.to_string();
            while tmp.len() % 4 != 0 {
                tmp.push('=');
            }
            tmp
    };
    let mut decoded: Vec<u8> = vec![];

    for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
        if chunk.ends_with(&['=', '=']) {
            let tri = Self::decode_quad([seg[0], seg[1], 0 as char, 0 as char])?;
            decoded.push(tri[0]);
        } else if chunk.ends_with(&['=']) {
            let tri = Self::decode_quad([seg[0], seg[1], seg[2], 0 as char])?;
            decoded.push(tri[0]);
            decoded.push(tri[1]);
        } else { // Exact multiple of 4
            let tri = Self::decode_quad([seg[0], seg[1], seg[2], seg[3]])?;
            decoded.extend_from_slice(&tri)
        }
    }

    decoded
}

Now, we can put together a real CLI.

A Proper CLI⌗

Currently, this is our CLI:

// `src/main.rs`
use encode::encode;

fn main() {
    let Some(input) = std::env::args().nth(2) else {
        eprintln!("Must be called with an input string");
        std::process::exit(1);
    };
    let encoded = encode(input.as_bytes());
    println!("{encoded}");
}

This certainly works - we pass our program a string & it returns our string encoded to base64 - but it’s hardly very good. For a normal program, we would use clap to make handling command-line arguments easy - it’s the defacto default argument parser for a reason - but this is no normal program. This is based64. We can’t just do things the easy way!

Please don’t say what I think you’re going to say

Instead of using clap, like sane people, we’re going to parse the arguments manually.

This is already 20 minutes long. What are you doing?!

I’m having fun

The CLI architecture we’ll go with will feel sort of like clap’s derive API, but with absolutely none of the features so that we can reasonably make it before next year.

A Command enum⌗

We’ll represent our valid commands as an enum, with each variant storing everything it needs to actually execute.

// `src/main.rs`
#[derive(Debug, Clone)]
enum Command {
    Encode { src: String },
    Decode { src: String },
    Help,
}

These don’t need to be struct-like variants; we could have used tuple variants, like so:

enum Command {
    Encode(String),
    Decode(String),
    Help,
}

but doing that would make it slightly harder to add fields later on since we’d have to then come up with a name for the .0 field and change every variant.0 to be variant.field. This is also how clap handles things, and we’re apparently trying to make a crappy version of that.

Now, we need to figure out how we’re going to get from an Args (what std::env::args gives us) to either a Command::Encode or a Command::Decode with the src field correctly filled in. We’re gonna need to change our main function.

Parsing Commands⌗

Let’s first grab our arguments

// `src/main.rs`
fn main() {
    let args = std::env::args().skip(1).collect::<Vec<String>>;
}

We use .skip(1) to skip the first argument since we really don’t care about the file path to our program. We then .collect it into a Vec<String>, since we can play with that more than we can play with an Args iterator.

Now, we can add a super simple error condition: args is an empty Vec. This means that the user hasn’t supplied any arguments, and we should politely ask them to do that next time

//`main` in `src/main.rs`
    if args.is_empty() {
        eprintln!("No arguments supplied. Use `help` for usage instructions");
        std::process::exit(1);
    }

Now, let’s make a helper function to do the real parsing & give us an instance of the Command enum.

// `src/main.rs`
fn parse_args(args: &[String]) -> Result<Command, /* Hmmm... */> {
    todo!()
}

But wait! What should we use as the error variant? The good rustacean⁵ thing to do would be to create an error type to encapsulate all the ways in which parse_args could fail, but that’s a little time-consuming. We’re the only ones that could be harmed by our bad practice, so this is one of the few situations where it’s probably ok to just throw a String in there & forget about it. At least for the time being.

// `src/main.rs`
fn parse_args(args: &[String]) -> Result<Command, String> {
    todo!()
}

Now we can get to the real meat of our parsing. First, we can easily match against the first item in the args parameter,

// `parse_args` in `src/main.rs`
    match args[0].as_str() {
        "help" => Ok(Command::Help),
        "encode" => todo!(),
        "decode" => todo!(),
        _ => Err("Invalid argument".into())
    }

We need to use .as_str() here because, unfortunately, matching against a String (or any other owned wrapper type, like a Vec) isn’t possible. Maybe it’ll be fixed someday, but until then this is the workaround.

I already filled in the match arm for Command::Help, since it doesn’t have any extra information tied to it. We just need to remember to wrap the return values in Ok or Err since our function returns a Result rather than the type itself.

The _ wildcard case is equally simple: just return an error saying that we encountered an invalid argument & everything’s fine!

Because Command::Encode & Command::Decode are so similar (by “similar” I mean “exactly the same”) the logic for one will work pretty much verbatim for the other. I’ll start with Command::Encode.

For the src field, we’re gonna need a little finesse. We can’t just index into args, since index 1 might not exist and the indexing would panic. Instead, we can use Vec::get to receive an Option<String>, allowing us to return a proper error.

// `parse_args` in `src/main.rs`
    match args[0].as_str() {
        "encode" => {
            let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
            Ok(Command::Encode { src })
        }
        // Other arms omitted
    }

We use a let-else statement to immediately return an error in the case that args.get(1) returns None, allowing us to easily return a Command::Encode using the src when it’s Some. We don’t need to write

Command::Encode {src: src}

because Rust allows us to simplify it whenever a variable shares a name with a field of a struct or a struct-like enum.

The "decode" match arm is largely the same, except that we use Command::Decode instead of Command::Encode

// `parse_args` in `src/main.rs`
    match args[0].as_str() {
        "decode" => {
            let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
            Ok(Command::Decode { src })
        }
    }

With that, our final parse_args function is as follows:

fn parse_args(args: &[String]) -> Result<Command, String> {
    match args[0].as_str() {
        "help" => Ok(Command::Help),
        "encode" =>  {
            let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
            Ok(Command::Encode { src })
        }
        "decode" => {
            let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
            Ok(Command::Decode { src })
        }
        _ => Err("Invalid argument".into())
    }
}

In case you’re wondering why sometimes we use return & sometimes we don’t, Rust allows (encourages, even) the return keyword to be omitted in the last expression of a function. Since there’s nothing else to go, it just assumes we’re returning that instead. This means that return is really only used for early returns from a function, such as in the else branches of our let-elses

Now, we need to actually execute them from our main function.

Running the Commands⌗

In main, we can simply match against the result of a parse_args call to decide what to do for each command

// `main` in `src/main.rs`
    // ...
    match parse_args(&args) {
        Ok(cmd) => todo!(),
        Err(e) => todo!(),
    }

The Err case is easy, just print the error & exit with a nonzero code:

// `main` in `src/main.rs`
    // ...
    match parse_args(&args) {
        Err(e) => {
            eprintln!("Error: {e}");
            std::process::exit(1)
        }
        // other arms omitted
    }

For the Ok case, we’re gonna need some more matching.

// `main` in `src/main.rs`
    // ...
    match parse_args(&args) {
        Ok(cmd) => match cmd {
            Command::Help => todo!(),
            Command::Encode { src } => todo!(),
            Command::Decode { src } => todo!(),
        }
        // other arms omitted
    }

These secondary match arms aren’t difficult at all, though. We can just call the right function & print out the result. All the hard work has been done already

// `main` in `src/main.rs`
// Don't forget the imports!
use encode::encode;
use decode::decode;
    // ...
    match parse_args(&args) {
        Ok(cmd) => match cmd {
            Command::Help => println!("Use either `encode` or `decode` with an input string"),
            Command::Encode { src } => println!("{}", encode(src.as_bytes())),
            Command::Decode { src } => println!("{}", decode(src.as_bytes())),
        }
    }

Don’t you just love it when you get a bunch of one-liners? So beautiful…

That leaves our final main.rs looking like this:

mod encode;
mod decode;

use encode::encode;
use decode::decode;

fn main() {
    let args = std::env::args().skip(1).collect::<Vec<String>>();
    
    match parse_args(&args) {
        Ok(cmd) => match cmd {
            Command::Help => println!("Use either `encode` or `decode` with an input string"),
            Command::Encode { src } => println!("{}", encode(src.as_bytes())),
            Command::Decode { src } => println!("{}", decode(src.as_bytes())),
        }
        Err(e) => {
            eprintln!("Error: {e}");
            std::process::exit(1)
        }
    }
}

fn parse_args(args: &[String]) -> Result<Command, String> {
    match args[0].as_str() {
        "help" => Ok(Command::Help),
        "encode" =>  {
            let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
            Ok(Command::Encode { src })
        }
        "decode" => {
            let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
            Ok(Command::Decode { src })
        }
        _ => Err("Invalid argument".into())
    }
}

With that, we can finally translate the quote from the beginning,

What was that again? It’s been so long that I forgot all about it

It was SSBuZXZlciBzYWlkIHRoYXQ=

$ cargo run --quiet -- decode 'SSBuZXZlciBzYWlkIHRoYXQ='
I never said that

Ah, I see.

We had to read half an hour for that?

Well, not everything in life can be extravagant. But hey, we just created the world’s best worst base64 translator! You should pat yourself on the back.

But, there is one thing still bothering me about based64…

Nononononononononono NO-

Zero Dependencies?⌗

Currently, we have a problem. The [dependencies] section in our Cargo.toml isn’t empty. We just went through the trouble of parsing command line arguments manually to avoid clap, yet we still have something there.

# `Cargo.toml`
[dependencies]
bitreader = "0.3.6"

This is absolutely unforgivable. The greatest base64 translator known to man couldn’t possibly rely on someone else’s code!

Thankfully, since we just did decoding without making use of bitreader, we have all the knowledge necessary to remove it from the encoding portion of our code.

Recall that, to encode a byte triplet, we need to concatenate them into one 24-bit chunk, then split that into octets. With our newfound knowledge of bitwise operations, this becomes a simple task.

Let’s replace our encode_triplet function

// `src/encode.rs`
fn encode_triplet([a, b, c]: [u8; 3]) -> [char; 4] {
    todo!()
}

We use destructuring syntax like in decode_quad because we’re gonna need to play with the individual items in the input array.

Now, we can shift each byte into place using << and OR them together to concatenate them. The first byte needs to be shifted left by 16, and each one after that is shifted 8 places less.

// `encode_triplet` in `src/main.rs`
    let concat_bytes = ((a as u32) << 16) | ((b as u32) << 8) | (c as u32);

We also need to cast them to u32s so that the final number doesn’t overflow.

Next, we can shift the sextets into place and use them to index our encoding array.

// `encode_triplet` in `src/main.rs`
    let first = ENCODE_MAP[((concat_bytes >> 18) & 0b0011_1111) as usize];
    let second = ENCODE_MAP[((concat_bytes >> 12) & 0b0011_1111) as usize];
    let third = ENCODE_MAP[((concat_bytes >> 6) & 0b0011_1111) as usize];
    let fourth = ENCODE_MAP[(concat_bytes & 0b0011_1111) as usize];

All that’s left is to return the final encoded array, and the full function is now this:

fn encode_triplet([a, b, c]: [u8; 3]) -> [char; 4] {
    let concat_bytes = ((a as u32) << 16) | ((b as u32) << 8) | (c as u32);

    let first = ENCODE_MAP[((concat_bytes >> 18) & 0b0011_1111) as usize];
    let second = ENCODE_MAP[((concat_bytes >> 12) & 0b0011_1111) as usize];
    let third = ENCODE_MAP[((concat_bytes >> 6) & 0b0011_1111) as usize];
    let fourth = ENCODE_MAP[(concat_bytes & 0b0011_1111) as usize];

    [first, second, third, fourth]
}

While we technically do still need to rewrite encode_doublet & encode_singlet, we can actually just cheat & simply have them call encode_triplet with their arguments & 0 bytes for the remainder. That way it’s quick & easy and we don’t have to change code anywhere else

// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
        let res = encode_triplet([rem[0], rem[1], 0x00]);

        [res[0], res[1], res[2], A::PADDING]
}

fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
    let res = Self::encode_triplet([rem[0], 0x00, 0x00]);

    [res[0], res[1], A::PADDING, A::PADDING]
}

If you want to change the code elsewhere & remove these functions entirely, then that’s up to you.

At last, our Cargo.toml file can be pure and good, with a completely empty [dependencies] section

# ...
[dependencies]
# Nothing!

VGhlIEVuZA==⌗

Is that it? Is it finally over?

Well, there is one more thing that could be improved-

NO. I refuse to let you drag this out any longer.

Alright, alright, I’ll just mention it quickly.

Currently, if we pass in an invalid character to the decode function, our program panics⁶

$ cargo run -- decode £$%^
thread 'main' panicked at 'index out of bounds: {some more info}'

This is fine for application code but absolutely unacceptable for library code. If you want to make based64 even better, then you can replace all the indexing operations with some proper error handling.

Take a look at the Result type in the standard library & the chapter on Error Handling in The Book if you want extra help with that.

There are a bunch of other things about this implementation that are suboptimal, but I’ll leave you to do more research yourself to find & fix them if you want.

With all that said, this has gone on long enough. Hope you enjoyed the ride & learnt something along the way.

RW5qb3kgeW91ciBicmFnZ2luZyByaWdodHMh

Groups of six ↩︎
A Rust library/package/stolen-code-sack ↩︎
The thing being operated on. There are 2 in the case of OR ↩︎
The order of its operands doesn’t matter ↩︎
Someone who uses Rust. It’s a pun on ‘crustacean’, since the mascot is a crab (named Ferris) ↩︎
Crashes, but in a sort of controlled way ↩︎