Learning Bit Magic by Translating Keysmashes
Contents
A wise man once said:
SSBuZXZlciBzYWlkIHRoYXQ=
Now I know that looks like Cat just ran all over my keyboard-
I did not!
Calm down, I’m setting a scene
But I assure you, that means something. And I’m gonna help you figure out what. All it’ll take is a little bit of luck & a whole lotta fun bit tricks along the way.
But first…
Base64 for Dummies⌗
The text above is encoded in Base64
, an encoding system defined in RFC 4648
alongside the lesser-used Base16
& Base32
. But don’t go googling “base64 decoder” just yet, because
we’re gonna make one together so that you can have something to brag about to your friends learn some
bit manipulation tricks.
Base64 uses 65 US-ASCII characters to represent arbitrary byte sequences as text.
The 1st 64 are used for actual encoding, while the 65th (=
) is used for padding the encoded string so
that its length is a multiple of 4. Its use of exclusively ASCII characters makes it ideal for communicating
arbitrary data in scenarios where non-text is suboptimal, as it can encode literally any data and be successfully
decoded so long as the intended format is known (the bytes will always be the same, but you can’t just say a PNG
is an MP3 & have it work). The main character set uses the characters ['A','Z']+['a','z']+['0','9']+'+'+'/'
in
that order.
For the unaware, I’m using mathematical range notation there. The square brackets mean the range includes the end
it’s nearest to (so [1,10]
is 1, 2, 3 ... 10
). If I had used parenthesis, then the edge items would have been
excluded ([1,10)
is 1, 2, 3 ... 9
).
I’ll explain everything else when we come to it whilst implementing a translator for base64, starting with the easier part.
Encoding⌗
Encoding base64 is rather straightforward. Data is encoded in chunks of 3 bytes.
00000000 00000000 00000000
Actually, they’re octets
Actually, it doesn’t matter
These clusters are then concatenated into 24-bit blocks, like so
00000000_00000000_00000000
and finally, into 4 sextets1
000000 000000 000000 000000
from here, each sextet is encoded by using its value as an index into the character array (that’s why order matters) & pushed to the output.
Let’s try doing that in Rust
cargo new based64
And in main.rs
mod encode; // Add this line
fn main() {
println!("Hello, world!");
}
After a quick cargo run
to check that everything is set up correctly, we can start implementing the
encoding algorithm.
// `src/encode.rs`
pub fn encode(bytes: &[u8]) -> String {
todo!()
}
Now that we have our encode
function to take a sequence of bytes
& encode it to a base64 String
, how
do we implement the algorithm in Rust? Because I can see the future, I say we create a second function,
encode_triplet
, to do the work of encoding a set of 3 bytes.
fn encode_triplet(triplet: [u8; 3]) -> [char; 4] {
todo!()
}
You may be curious as to why I’m using the array types [u8; 3]
& [char; 4]
here instead of a &[u8]
& String
like before. This is so that there’s no chance of ever being given something that isn’t a triplet, and so that it’s
clear & enforced by the type system that the function returns exactly 4 characters.
Now, we could go about the encoding completely manually, but I’m gonna start us off simple & use a crate2 to help us.
Specifically, bitreader’s BitReader
struct. This will allow us to read bits directly from our array instead
of needing to use manual bit operations.
Figuring out what those operations are is left as an exercise to the reader
cargo add bitreader
# `Cargo.toml`
# ...
[dependencies]
bitreader = "0.3.6"
Now we need to import it
// `src/encode.rs` (at the very top)
use bitreader::BitReader
Now we can use the BitReader
struct to read however many bits into our array we want.
// in `encode_triplet`
// ...
let mut reader = BitReader::new(&triplet);
let first = reader.read_u8(6).unwrap();
let second = reader.read_u8(6).unwrap();
let third = reader.read_u8(6).unwrap();
let fourth = reader.read_u8(6).unwrap();
We use unwrap
everywhere because read_u8
returns a Result
since it could fail if, for example, there
weren’t enough bits available to read from the slice. Since we know we’ll always have an input with 24 bits,
we can just pretend it isn’t fallible by unwrap
ing it to either give us the Ok
variant’s value or panic
(which
is a fancy way of saying “crash the program”). This is another reason for using a fixed-size array as our input type
instead of a variable-length slice - we don’t have to handle errors related to the size of our input.
But just how are we going to decide what each 6-bit block encodes to? I said earlier
each sextet is encoded by using it’s value as an index into the character array
so, let’s make an array!
// in `src/encode.rs`
const ENCODE_MAP: [char; 64] = [
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R',
'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1',
'2', '3', '4', '5', '6', '7', '8', '9', '+', '/',
];
From here, we just need to go ahead & index that array with whatever we read from the input array
// in `encode_triplet`
// ...
[
ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]
]
We need to use as usize
on the values because arrays can only be indexed by numbers of type usize
, but we have u8
s
from read_u8
.
and our complete encode_triplet
function looks like this:
fn encode_triplet(triplet: [u8, 3]) -> [char; 4] {
let mut reader = BitReader::new(&triplet);
let first = reader.read_u8(6).unwrap();
let second = reader.read_u8(6).unwrap();
let third = reader.read_u8(6).unwrap();
let fourth = reader.read_u8(6).unwrap();
[ENCODE_MAP[first as usize], ENCODE_MAP[second as usize], ENCODE_MAP[third as usize], ENCODE_MAP[fourth as usize]]
}
You can pat yourself on the back now, you’re one step closer to a fully functioning base64 encoder! But there are still some things in the way: edge cases. Luckily, they aren’t too difficult, and there are only 2 of them! So let’s get back to work & finish this encoder.
Edge Case 1: Only 2 Bytes are Available⌗
If we only have 2 bytes available, then we can only make 3 encoded characters. If we insist on having
a multiple-of-4-length encoded string, then that’s where our padding character, =
, comes in. We’ll simply
encode as much as we can from what we have & fill the remainder with =
characters. In the case of a 2-byte
remainder, the result will look like [c, c, c, '=']
.
First, let’s make a constant for the padding character for the sake of never making a typo
// At the top of `src/encode.rs`
const PADDING: char = '='
Next, we can scaffold our encode_doublet
function
// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
let mut reader = BitReader::new(&rem);
let six1 = reader.read_u8(6).unwrap();
let six2 = reader.read_u8(6).unwrap();
todo!()
}
From here we run into a problem. After reading 12 bytes from rem
, we’re only left with 4 bits. Just using
read_u8(4)
will give us the bits, but in the wrong place and our encoded character will be off. How far off?
Well, only by a factor of 4!
Why would it do that?!
I have no goddamn clue. It just does.
This does mean it’s an easy problem to fix though. All we have to do is left-shift it by 2 & we’ll have quadrupled our value, solving the problem.
Translated into Rust, we get
// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
// ...
let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2)
[ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}
We use overflowing_shl
to make it clear that overflows are intentional. After all, the upper 4 bits of
our nibble will all be 0 anyways, so it doesn’t matter. We need to assign to (nibble, _
because overflowing_shl
returns a tuple
where the 2nd value is a bool
for whether or not the shift overflowed. Using _
lets us discard the bool
since we don’t care about it & cargo
would warn us that it’s unused otherwise (we could also prefix the name with a _
(like _bool
) to make cargo
shut up but we have no reason to store it so it’s just a wasted variable).
With that, our full encode_doublet
function is as follows:
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
let mut reader = BitReader::new(&rem);
let six1 = reader.read_u8(6).unwrap();
let six2 = reader.read_u8(6).unwrap();
let (nibble, _) = reader.read_u8(4).unwrap().overflowing_shl(2);
[ENCODE_MAP[six1 as usize], ENCODE_MAP[six2 as usize], ENCODE_MAP[nibble as usize], PADDING]
}
Edge Case 2: Only 1 Byte is Available⌗
At last, we reach the final edge case, where we only have 1 full byte remaining. So let’s just get right to the code:
// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
let mut reader = BitReader::new(&rem);
let six = reader.read_u8(6).unwrap();
todo!();
}
Wait, why are we taking rem
as a [u8; 1]
instead of just a u8
?
Glad you asked, cat. Because BitReader::new
takes a slice (a reference to part of a list type), we need to be borrowing
an array anyways. If we tried to do it like this
// Note: `rem` is now a `u8`
let mut reader = BitReader::new(&[rem])
Then we’d get an error!
$ cargo check
error[E0716]: temporary value dropped while borrowed
--> src/encode.rs:131:42
|
131 | let mut reader = BitReader::new(&[rem]);
| ^^^^^ - temporary value is freed at the end of this statement
| |
| creates a temporary which is freed while still in use
132 | let six = reader.read_u8(6).unwrap();
| ----------------- borrow later used here
|
help: consider using a `let` binding to create a longer lived value
|
131 ~ let binding = [rem];
132 ~ let mut reader = BitReader::new(&binding);
|
For more information about this error, try `rustc --explain E0716`.
error: could not compile `based` due to previous error
Thankfully, cargo check
is helpful enough to tell us exactly how to fix that; simply create an extra variable for our
single-length array, then borrow that. I, however, happen to think that looks ugly & prefer to simply make rem
a [u8; 1]
in the first place so there’s no need for the extra variable. But you can do it cargo
’s way if you want, I won’t mind.
Anyway, we now have just 2 bits left of our byte after snagging the 1st six, so what should we do with them? We learnt last time that simply taking a partial input as-is will lead to our encoding being off, but will it still be off by a factor of 4, or will it be by some other amount?
Turns out that in the case of 2 bits, the input is actually off by a factor of 16! This time we’ll need to left-shift our input by 4 in order to correct it.
// `encode_singlet` in `src/encode.rs`
// ...
let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);
And finally, the mapping
// `encode_singlet` in `src/encode.rs`
[ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]
Thanks to everything we’ve learnt, encode_singlet
was done quite quickly! Here’s what it looks like:
// `src/encode.rs`
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
let mut reader = BitReader::new(&rem);
let six = reader.read_u8(6).unwrap();
let (half_nib, _) = reader.read_u8(2).unwrap().overflowing_shl(4);
[ENCODE_MAP[six as usize], ENCODE_MAP[half_nib as usize], PADDING, PADDING]
}
Don’t pat yourself on the back just yet though, we still have the encode
function to write!
Tying Together our encoding functions⌗
Currently, encode
looks like this:
// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
todo!()
}
Before we start filling it out, let’s make a game plan. What we need is to:
- Split the input into 3-byte chunks
- Encode all the triplets
- Encode the remainder (if any) using either
encode_doublet
orencode_singlet
What’s nice about these 3 objectives is that we only need 2 blocks to do them - 2.
& 3.
can be
done in one block. For 1.
, we’ll use the chunks
method on a slice. This will give us all the nonoverlapping
chunks with a length less than or equal to the one we specify - 3, in our case.
// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
let chunks = bytes.chunks(3);
}
We also need to initialise an array to push our encoded characters into, so let’s do that now:
// `src/encode.rs`
fn encode(bytes: &[u8]) -> String {
let mut encoded: Vec<[char; 4]> = vec![]; // New
let chunks = bytes.chunks(3);
}
encoded
has the type [char; 4]
because that’s what our encoding functions return. Don’t worry,
we can get it to a String
at the very end when we need it.
Moving on, all we’ll need from here is a for
loop with a match
statement.
// `encode` in `src/encode.rs`
// ...
for chunk in chunks {
match chunk.len() {
3 => todo!(),
2 => todo!(),
1 => todo!(),
_ => todo!(),
}
}
All a match
statement does is execute different blocks depending on the structure of what you give it.
chunk.len()
gives us a usize
, so we can match against different possible values. We use _
in the
final match arm because a match
statement must be exhaustive - it has to cover every possible case -
and _
is a wildcard so we can catch everything else.
The match arms themselves are very simple since we’ve done all the hard work already while writing our
encode_
functions. All we need to do now is fill them in.
For encode_triplet
, it looks like this:
// `encode` in `src/encode.rs`
// ...
for chunk in chunks {
match chunk.len() {
// Other arms omitted
3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])),
_ => unreachable!("Mathematically impossible"),
}
}
Notice that we also fill out the _
arm with an unreachable!
, signalling that it will never actually
execute (if it does, then there’s a pretty big problem).
The other match arms are similar
// `encode` in `src/encode.rs`
// ...
match chunk.len() {
// Other arms omitted
2 => encoded.push(encode_doublet([chunk[0], chunk[1]])),
1 => encoded.push(encode_singlet([chunk[0]])),
}
And with that, we have everything we need to fully encode a byte sequence! After the for
loop finishes,
encoded
will be filled with the encoded characters.
But we can’t just return it as-is. Remember, the type of encoded
is currently Vec<[char; 4]>
but we need
it to be String
, so how do we do that?
We can collect an iterator over char
s (An Iterator<Item = char
) into a String
, so let’s try & get our Vec<[char; 4]>
to that.
If we just call iter
on encoded
now, we’ll be given an Iterator<Item = [char; 4]>
, which is very much
not what we want. We need to somehow take the items in the arrays out of the arrays & into the Vec
itself. Almost like
squishing the nested list into a list with only one level.
So, kinda like flattening it?
Right on the money
Iterator
’s flatten
method is our friend here. It’ll take an Iterator
over a collection (such as an array),
& return an iterator over the items of that collection. Now we just need to do
// `encode` in `src/encode.rs`
// ...
encoded.iter().flatten().collect()
and we’ll be returning the correct type. We don’t need to specify the type to collect
into because the compiler
can guess that we want a String
from the function’s signature telling it that’s what it returns. We could
specify it like so
encoded.iter().flatten().collect::<String>()
(known as turbofish syntax) but that’s unnecessary.
Our final encode
function is now the following:
// `src/encode.rs`
fn encode(bytes: &[u8]) -> Vec<String> {
let chunks = bytes.chunks(3);
let mut encoded = vec![];
for chunk in chunks {
match chunk.len() {
3 => encoded.push(encode_triplet([chunk[0], chunk[1], chunk[2]])?),
2 => encoded.push(encode_doublet([chunk[0], chunk[1]])?),
1 => encoded.push(encode_singlet([chunk[0]])?),
_ => unreachable!("Mathematically impossible"),
}
}
encoded.iter().flatten().collect()
}
With that, we’ve completed all the encoding functionality we need! To test it out, let’s make a quick CLI
A Quick & Dirty CLI⌗
Since this isn’t the main focus, I won’t spend too long explaining the CLI. We’ll just throw together something that works for now so that we can see how our encoding works & we can touch it up later.
First, we need to fetch the command-line arguments provided to our program
// `src/main.rs`
mod encode;
fn main() {
let Some(input) = std::env::args().nth(2) else {
eprintln!("Must be called with an input string");
std::process::exit(1);
};
}
std::env::args
will give us all the arguments supplied to the program, with the first being the filename.
Since we don’t care about that, we use .nth(2)
to grab the 2nd argument, which would be the input string
to encode. We assign to input
using the new let-else
syntax. It’s like a match
statement (note how we use
Some(input)
as we’re matching against the result being Some
), but the else
branch must terminate the program.
We just print an error message & then exit the program with an exit code of 1
to indicate failure.
Next, we encode the input & print it
// `src/main.rs`
// ...
use encode::encode;
fn main() {
// ...
let encoded = encode(input.as_bytes());
println!("{encoded}");
}
This is much more simple. We pass the byte representation of the input using as_bytes()
because
input
is a String
, which isn’t what we need. Then, we print the encoded string by surrounding it in
curly braces within the string we send to println!
. We could also do
println!("{}", encoded);
but I prefer the first option, as they both do the same thing (in this case) and the first is shorter. It’s entirely personal preference though.
Now, we can test out our encoder
$ cargo run --quiet -- 'Hello world!'
SGVsbG8gd29ybGQh
Success! Now, all we have to do is implement decoding & based64
will be the best base64 encoder/decoder around!
No it won’t, he’s lying to you
More of a teensy weensy exaggeration, but will still be good enough.
Decoding⌗
Since I’ve already explained encoding, decoding won’t need too much explanation. It’s basically the same as encoding, but in reverse.
We start with an input sequence
ABCDEFGHIJK=
Split it into 4-character chunks
ABCD EFGH IJK=
And decode each character as its offset into the base64 character array (A
has an offset of 0,
B
has an offset of 1, etc.) as a 6-bit number. If the chunk ends in a padding character, then we need to set those bits to 0.
Then, we concatenate the sextets into one long 24-bit group, then split that into 3 bytes. The base64 string ABCD
gives us
00000000 00010000 10000011
(underscores added for easier reading).
Now that we know roughly what we’re doing, let’s get to it!
Outlining the Main Decoding Function⌗
// `src/main.rs`
mod decode;
// ...
// `src/decode.rs`
pub fn decode(b64: &str) -> Vec<u8> {
// Something...
}
Since our algorithm requires that we have 4 base64 characters per decoding group, we need to make sure that’s the case before we begin decoding.
We could also just assume that the input is the right shape, but that would be a bit irresponsible
To do that, we’ll check that the length of our input string is divisible by 4, and pad it if not:
// `decode` in `src/decode.rs`
// ...
let padded = if b64.len() % 4 == 0 {
b64.to_string()
} else {
let mut tmp = b64.to_string();
while tmp.len() % 4 != 0 {
tmp.push('=');
}
tmp
};
This takes advantage of the fact that Rust’s if-else
statements are expressions - they can return a value.
Most things in rust are!
We need to call .to_string
on b64
so that we can get a String
instead of a &str
that we can mutate by .push
ing to,
otherwise, this would be a good deal more tedious.
Next, we initialise a Vec
to put our decoded output into.
// `decode` in `src/decode.rs`
// ...
let mut decoded: Vec<u8> = vec![];
Unlike with encode
, our output Vec
isn’t nested, since we append items to it a little differently than before.
We can then scaffold our decoding loop, as well as a helper function we’ll use to do the real dirty work.
// `decode` in `src/decode.rs`
// ...
for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
if chunk.ends_with(&['=', '=']) {
todo!()
} else if chunk.ends_with(&['=']) {
todo!()
} else { // Exact multiple of 4
todo!()
}
}
A Fun Helper Function⌗
// `src/decode.rs`
fn decode_quad([a, b, c, d]: [char; 4]) -> [u8; 3] {
todo!()
}
decode_quad
’s parameters may look weird to you, but the explanation is rather simple. Rust allows us to
pattern match within a function’s input parameters, like in a match
statement, so we can match a
, b
,
c
, & d
to the 1st, 2nd, 3rd, & 4th items in a 4-length array of char
s. This lets us access the array
items without having to index the array, which is marginally less typing.
As for how we’ll get a [char; 4]
to a [u8; 3]
, we’re gonna do some bit-manipulation!
Finally, the post’s title makes sense. It only took 17 minutes
That’s only an estimate, cat. It may well have taken longer
But before we do that, we have one last bit of boilerplate to do - we need a decoding map just like we did for encoding.
// `src/decode.rs`
const DECODE_MAP: [u8; 128] = [
0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64,
0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64,
0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x3E, 0x64,
0x64, 0x64, 0x3F, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x64, 0x64,
0x64, 0x64, 0x64, 0x64, 0x64, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09,
0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18,
0x19, 0x64, 0x64, 0x64, 0x64, 0x64, 0x64, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20, 0x21,
0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F, 0x30,
0x31, 0x32, 0x33, 0x64, 0x64, 0x64, 0x64, 0x64,
];
This map will let us use the char
’s numerical value as an index into it to get its offset into
the encoding map.
Some things don’t have to be understood, just copied
Now, we’re actually gonna do some bit manipulation!
Our mission is simple:
- Concatenate the sextets into one 24-bit group
- Split the 24-bit group into 3 8-bit groups
We could do these with bit_reader
like before, but it’s more fun to do it this way.
To concatenate the 4 sextets, we can essentially shift each one over to the position it should
be in the concatenated sequence. For a
, this means shifting it so that its most significant bit
is in position 24 instead of position 8 - which can be done by a shift-left of 18 places.
We could also do this by multiplying its numerical value by 39, but that’s a lot less clear what our intentions are
In Rust, this looks like
// `decode_quad` in `src/decode.rs`
let concat_bytes = DECODE_MAP[a as usize] as u32 << 18;
The indexed value is converted to a 32-bit integer using as u32
because that’s the smallest available type that will contain a 24-bit value.
Getting the other sextets into position is rather simple (just decrease the left-shift amount by 6 each time),
but how do we join them together? The answer is with a bitwise OR
operation (expressed in Rust (and most programming languages) as |
).
This will look at the bits of each operand3 in the same positions and create a new number where every position that has a 1
in either
input bit is a 1
in the output, and a 0
otherwise. 1 OR 1
= 1
, 1 OR 0
= 1
, and 0 OR 0
= 1
(0 or 1
skipped because OR
is
commutative4, like +
)
This allows us to combine the bits of our operands, which will allow us to join our 4 shifted sextets into one number since none of them overlap.
The Rust code for this looks like the following:
// `decode_quad` in `src/decode.rs`
let concat_bytes = (DECODE_MAP[a as usize] as u32 << 18)
| (DECODE_MAP[b as usize] as u32 << 12)
| (DECODE_MAP[c as usize] as u32 << 6)
| DECODE_MAP[d as usize] as u32;
Now that step 1 is complete, we just need to split our 24-bit sequence into 3 bytes. We can do this by
first shifting each byte into position, then using a logical AND
with the byte 1111 1111
to make
sure that all other bits are cleared. logical AND
will produce a 1
bit only if both of the operands
are 1
- 1 AND 1
= 1
, 1 AND 0
= 0
, & 0 AND 0
= 0
. Most programming languages use &
for bitwise
AND
. This process is called masking, & is a pretty big part of messing with bits.
Translating this into Rust code, we get
// `decode_quad` in `src/decode.rs`
// ...
[
((concat_bytes >> 16) & 0b1111_1111) as u8,
((concat_bytes >> 8) & 0b1111_1111) as u8,
(concat_bytes & 0b1111_1111) as u8
]
And that’s all we need to be able to decode a quadruplet! Bit manipulation really isn’t as scary as it may at first seem; you just need a way to visualise what you want to manipulate the bits into doing & the ability to google what operations to use to make it happen.
Our full decode_quad
function looks like this:
fn decode_quad([a, b, c, d]: [char; 4]) -> [u8; 3] {
let concat_bytes = (DECODE_MAP[a as usize] as u32 << 18)
| (DECODE_MAP[b as usize] as u32 << 12)
| (DECODE_MAP[c as usize] as u32 << 6)
| DECODE_MAP[d as usize] as u32;
[
((concat_bytes >> 16) as u8) & 0b1111_1111,
((concat_bytes >> 8) as u8) & 0b1111_1111,
(concat_bytes as u8) & 0b1111_1111
]
}
Finishing the Job⌗
Now, we return to the decode
function, which currently looks like this:
pub fn decode(b64: &str) -> Vec<u8> {
let padded = if b64.len() % 4 == 0 {
b64.to_string()
} else {
let mut tmp = b64.to_string();
while tmp.len() % 4 != 0 {
tmp.push('=');
}
tmp
};
let mut decoded: Vec<u8> = vec![];
for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
if chunk.ends_with(&['=', '=']) {
todo!()
} else if chunk.ends_with(&['=']) {
todo!()
} else { // Exact multiple of 4
todo!()
}
}
}
Filling out these if
blocks is rather simple - for every =
in the chunk, we instead
pass a null
(0) character to decode_quad
.
// `decode` in `src/decode.rs`
// ...
if chunk.ends_with(&['=', '=']) {
let tri = Self::decode_quad([seg[0], seg[1], 0 as char, 0 as char])?;
decoded.push(tri[0]);
} else if chunk.ends_with(&['=']) {
let tri = Self::decode_quad([seg[0], seg[1], seg[2], 0 as char])?;
decoded.push(tri[0]);
decoded.push(tri[1]);
} else { // Exact multiple of 4
let tri = Self::decode_quad([seg[0], seg[1], seg[2], seg[3]])?;
decoded.extend_from_slice(&tri)
}
Now we just need to return decoded
, and decode
is complete!
pub fn decode(b64: &str) -> Vec<u8> {
let padded = if b64.len() % 4 == 0 {
b64.to_string()
} else {
let mut tmp = b64.to_string();
while tmp.len() % 4 != 0 {
tmp.push('=');
}
tmp
};
let mut decoded: Vec<u8> = vec![];
for chunk in padded.chars().collect::<Vec<_>>.chunks_exact(4) {
if chunk.ends_with(&['=', '=']) {
let tri = Self::decode_quad([seg[0], seg[1], 0 as char, 0 as char])?;
decoded.push(tri[0]);
} else if chunk.ends_with(&['=']) {
let tri = Self::decode_quad([seg[0], seg[1], seg[2], 0 as char])?;
decoded.push(tri[0]);
decoded.push(tri[1]);
} else { // Exact multiple of 4
let tri = Self::decode_quad([seg[0], seg[1], seg[2], seg[3]])?;
decoded.extend_from_slice(&tri)
}
}
decoded
}
Now, we can put together a real CLI.
A Proper CLI⌗
Currently, this is our CLI:
// `src/main.rs`
use encode::encode;
fn main() {
let Some(input) = std::env::args().nth(2) else {
eprintln!("Must be called with an input string");
std::process::exit(1);
};
let encoded = encode(input.as_bytes());
println!("{encoded}");
}
This certainly works - we pass our program a string & it returns our string encoded to base64 - but it’s hardly very
good. For a normal program, we would use clap
to make handling command-line arguments easy -
it’s the defacto default argument parser for a reason - but this is no normal program. This is based64
. We can’t just do
things the easy way!
Please don’t say what I think you’re going to say
Instead of using clap
, like sane people, we’re going to parse the arguments manually.
This is already 20 minutes long. What are you doing?!
I’m having fun
The CLI architecture we’ll go with will feel sort of like clap
’s derive
API, but with absolutely none of the features so that
we can reasonably make it before next year.
A Command enum⌗
We’ll represent our valid commands as an enum, with each variant storing everything it needs to actually execute.
// `src/main.rs`
#[derive(Debug, Clone)]
enum Command {
Encode { src: String },
Decode { src: String },
Help,
}
These don’t need to be struct-like variants; we could have used tuple variants, like so:
enum Command {
Encode(String),
Decode(String),
Help,
}
but doing that would make it slightly harder to add fields later on since we’d have
to then come up with a name for the .0
field and change every variant.0
to be variant.field
.
This is also how clap
handles things, and we’re apparently trying to make a crappy version of that.
Now, we need to figure out how we’re going to get from an Args
(what std::env::args
gives us) to either a
Command::Encode
or a Command::Decode
with the src
field correctly filled in. We’re gonna need to change our
main
function.
Parsing Commands⌗
Let’s first grab our arguments
// `src/main.rs`
fn main() {
let args = std::env::args().skip(1).collect::<Vec<String>>;
}
We use .skip(1)
to skip the first argument since we really don’t care about the file path to our
program. We then .collect
it into a Vec<String>
, since we can play with that more than we can play with an
Args
iterator.
Now, we can add a super simple error condition: args
is an empty Vec
. This means that the user hasn’t supplied
any arguments, and we should politely ask them to do that next time
//`main` in `src/main.rs`
if args.is_empty() {
eprintln!("No arguments supplied. Use `help` for usage instructions");
std::process::exit(1);
}
Now, let’s make a helper function to do the real parsing & give us an instance of the Command
enum.
// `src/main.rs`
fn parse_args(args: &[String]) -> Result<Command, /* Hmmm... */> {
todo!()
}
But wait! What should we use as the error variant? The good rustacean5 thing to do would be to create
an error type to encapsulate all the ways in which parse_args
could fail, but that’s a little time-consuming.
We’re the only ones that could be harmed by our bad practice, so this is one of the few situations where it’s probably ok
to just throw a String
in there & forget about it. At least for the time being.
// `src/main.rs`
fn parse_args(args: &[String]) -> Result<Command, String> {
todo!()
}
Now we can get to the real meat of our parsing. First, we can easily match against the first item in the args
parameter,
// `parse_args` in `src/main.rs`
match args[0].as_str() {
"help" => Ok(Command::Help),
"encode" => todo!(),
"decode" => todo!(),
_ => Err("Invalid argument".into())
}
We need to use .as_str()
here because, unfortunately, matching against a String
(or any other owned wrapper type, like a Vec
) isn’t
possible. Maybe it’ll be fixed someday, but until then this is the workaround.
I already filled in the match
arm for Command::Help
, since it doesn’t have any extra information tied to it. We just need to remember
to wrap the return values in Ok
or Err
since our function returns a Result
rather than the type itself.
The _
wildcard case is equally simple: just return an error saying that we encountered an invalid argument & everything’s fine!
Because Command::Encode
& Command::Decode
are so similar (by “similar” I mean “exactly the same”) the logic for one will work
pretty much verbatim for the other. I’ll start with Command::Encode
.
For the src
field, we’re gonna need a little finesse. We can’t just index into args
, since index 1 might not exist and the indexing
would panic. Instead, we can use Vec::get
to receive an Option<String>
, allowing us to return a proper error.
// `parse_args` in `src/main.rs`
match args[0].as_str() {
"encode" => {
let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
Ok(Command::Encode { src })
}
// Other arms omitted
}
We use a let-else
statement to immediately return an error in the case that args.get(1)
returns None
, allowing us
to easily return a Command::Encode
using the src
when it’s Some
. We don’t need to write
Command::Encode {src: src}
because Rust allows us to simplify it whenever a variable shares a name with a field of a struct
or a struct
-like enum
.
The "decode"
match arm is largely the same, except that we use Command::Decode
instead of Command::Encode
// `parse_args` in `src/main.rs`
match args[0].as_str() {
"decode" => {
let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
Ok(Command::Decode { src })
}
}
With that, our final parse_args
function is as follows:
fn parse_args(args: &[String]) -> Result<Command, String> {
match args[0].as_str() {
"help" => Ok(Command::Help),
"encode" => {
let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
Ok(Command::Encode { src })
}
"decode" => {
let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
Ok(Command::Decode { src })
}
_ => Err("Invalid argument".into())
}
}
In case you’re wondering why sometimes we use return
& sometimes we don’t, Rust allows (encourages, even)
the return
keyword to be omitted in the last expression of a function. Since there’s nothing else to go,
it just assumes we’re returning that instead. This means that return
is really only used for early returns
from a function, such as in the else
branches of our let-else
s
Now, we need to actually execute them from our main
function.
Running the Commands⌗
In main
, we can simply match against the result of a parse_args
call to decide what to do for each command
// `main` in `src/main.rs`
// ...
match parse_args(&args) {
Ok(cmd) => todo!(),
Err(e) => todo!(),
}
The Err
case is easy, just print the error & exit with a nonzero code:
// `main` in `src/main.rs`
// ...
match parse_args(&args) {
Err(e) => {
eprintln!("Error: {e}");
std::process::exit(1)
}
// other arms omitted
}
For the Ok
case, we’re gonna need some more match
ing.
// `main` in `src/main.rs`
// ...
match parse_args(&args) {
Ok(cmd) => match cmd {
Command::Help => todo!(),
Command::Encode { src } => todo!(),
Command::Decode { src } => todo!(),
}
// other arms omitted
}
These secondary match arms aren’t difficult at all, though. We can just call the right function & print out the result. All the hard work has been done already
// `main` in `src/main.rs`
// Don't forget the imports!
use encode::encode;
use decode::decode;
// ...
match parse_args(&args) {
Ok(cmd) => match cmd {
Command::Help => println!("Use either `encode` or `decode` with an input string"),
Command::Encode { src } => println!("{}", encode(src.as_bytes())),
Command::Decode { src } => println!("{}", decode(src.as_bytes())),
}
}
Don’t you just love it when you get a bunch of one-liners? So beautiful…
That leaves our final main.rs
looking like this:
mod encode;
mod decode;
use encode::encode;
use decode::decode;
fn main() {
let args = std::env::args().skip(1).collect::<Vec<String>>();
match parse_args(&args) {
Ok(cmd) => match cmd {
Command::Help => println!("Use either `encode` or `decode` with an input string"),
Command::Encode { src } => println!("{}", encode(src.as_bytes())),
Command::Decode { src } => println!("{}", decode(src.as_bytes())),
}
Err(e) => {
eprintln!("Error: {e}");
std::process::exit(1)
}
}
}
fn parse_args(args: &[String]) -> Result<Command, String> {
match args[0].as_str() {
"help" => Ok(Command::Help),
"encode" => {
let Some(src) = args.get(1) else { return Err("No text provided to encode".into()) };
Ok(Command::Encode { src })
}
"decode" => {
let Some(src) = args.get(1) else { return Err("No text provided to decode".into()) };
Ok(Command::Decode { src })
}
_ => Err("Invalid argument".into())
}
}
With that, we can finally translate the quote from the beginning,
What was that again? It’s been so long that I forgot all about it
It was SSBuZXZlciBzYWlkIHRoYXQ=
$ cargo run --quiet -- decode 'SSBuZXZlciBzYWlkIHRoYXQ='
I never said that
Ah, I see.
We had to read half an hour for that?
Well, not everything in life can be extravagant. But hey, we just created the world’s best worst base64 translator! You should pat yourself on the back.
But, there is one thing still bothering me about based64
…
Nononononononononono NO-
Zero Dependencies?⌗
Currently, we have a problem. The [dependencies]
section in our Cargo.toml
isn’t empty. We just went through
the trouble of parsing command line arguments manually to avoid clap
, yet we still have something there.
# `Cargo.toml`
[dependencies]
bitreader = "0.3.6"
This is absolutely unforgivable. The greatest base64 translator known to man couldn’t possibly rely on someone else’s code!
Thankfully, since we just did decoding without making use of bitreader
, we have all the knowledge necessary
to remove it from the encoding portion of our code.
Recall that, to encode a byte triplet, we need to concatenate them into one 24-bit chunk, then split that into octets. With our newfound knowledge of bitwise operations, this becomes a simple task.
Let’s replace our encode_triplet
function
// `src/encode.rs`
fn encode_triplet([a, b, c]: [u8; 3]) -> [char; 4] {
todo!()
}
We use destructuring syntax like in decode_quad
because we’re gonna need to play with the individual items
in the input array.
Now, we can shift each byte into place using <<
and OR
them together to concatenate them. The first byte needs to be shifted
left by 16, and each one after that is shifted 8 places less.
// `encode_triplet` in `src/main.rs`
let concat_bytes = ((a as u32) << 16) | ((b as u32) << 8) | (c as u32);
We also need to cast them to u32
s so that the final number doesn’t overflow.
Next, we can shift the sextets into place and use them to index our encoding array.
// `encode_triplet` in `src/main.rs`
let first = ENCODE_MAP[((concat_bytes >> 18) & 0b0011_1111) as usize];
let second = ENCODE_MAP[((concat_bytes >> 12) & 0b0011_1111) as usize];
let third = ENCODE_MAP[((concat_bytes >> 6) & 0b0011_1111) as usize];
let fourth = ENCODE_MAP[(concat_bytes & 0b0011_1111) as usize];
All that’s left is to return the final encoded array, and the full function is now this:
fn encode_triplet([a, b, c]: [u8; 3]) -> [char; 4] {
let concat_bytes = ((a as u32) << 16) | ((b as u32) << 8) | (c as u32);
let first = ENCODE_MAP[((concat_bytes >> 18) & 0b0011_1111) as usize];
let second = ENCODE_MAP[((concat_bytes >> 12) & 0b0011_1111) as usize];
let third = ENCODE_MAP[((concat_bytes >> 6) & 0b0011_1111) as usize];
let fourth = ENCODE_MAP[(concat_bytes & 0b0011_1111) as usize];
[first, second, third, fourth]
}
While we technically do still need to rewrite encode_doublet
& encode_singlet
, we can actually just
cheat & simply have them call encode_triplet
with their arguments & 0 bytes for the remainder. That way
it’s quick & easy and we don’t have to change code anywhere else
// `src/encode.rs`
fn encode_doublet(rem: [u8; 2]) -> [char; 4] {
let res = encode_triplet([rem[0], rem[1], 0x00]);
[res[0], res[1], res[2], A::PADDING]
}
fn encode_singlet(rem: [u8; 1]) -> [char; 4] {
let res = Self::encode_triplet([rem[0], 0x00, 0x00]);
[res[0], res[1], A::PADDING, A::PADDING]
}
If you want to change the code elsewhere & remove these functions entirely, then that’s up to you.
At last, our Cargo.toml
file can be pure and good, with a completely empty [dependencies]
section
# ...
[dependencies]
# Nothing!
VGhlIEVuZA==⌗
Is that it? Is it finally over?
Well, there is one more thing that could be improved-
NO. I refuse to let you drag this out any longer.
Alright, alright, I’ll just mention it quickly.
Currently, if we pass in an invalid character to the decode function, our program panics6
$ cargo run -- decode £$%^
thread 'main' panicked at 'index out of bounds: {some more info}'
This is fine for application code but absolutely unacceptable for library code. If you want to make
based64
even better, then you can replace all the indexing operations with some proper error handling.
Take a look at the Result
type in the standard library
& the chapter on Error Handling in The Book
if you want extra help with that.
There are a bunch of other things about this implementation that are suboptimal, but I’ll leave you to do more research yourself to find & fix them if you want.
With all that said, this has gone on long enough. Hope you enjoyed the ride & learnt something along the way.
RW5qb3kgeW91ciBicmFnZ2luZyByaWdodHMh
-
Groups of six ↩︎
-
A Rust library/package/stolen-code-sack ↩︎
-
The thing being operated on. There are 2 in the case of
OR
↩︎ -
The order of its operands doesn’t matter ↩︎
-
Someone who uses Rust. It’s a pun on ‘crustacean’, since the mascot is a crab (named Ferris) ↩︎
-
Crashes, but in a sort of controlled way ↩︎