I was originally planning to tweet this by itself:
STARTUPS HATE HER: DATA STRUCTURES TO NAME-DROP WHEN YOU WANT TO SOUND SMART IN AN INTERVIEW
- bloom filter
- prefix trie
- ring buffer
But I realized I actually wanted to say some earnest, not-shitposty things about each of these data structures, so I figured I should take it to my neglected blog instead. If you just wanted the clickbait version, you can stop reading now.
So what makes a data structure good to name-drop in an interview? I would say that it has to be mildly obscure, so that you sound like an erudite data structures hipster, but it can’t be too obscure, lest your interviewer ask you to really explain the implementation details or trade-offs to the point that you reveal your ignorance. It’s best when you mention a data structure that’s somewhat obscure, but that your interviewer has heard of, because then your name-dropping validates their knowledge. They want to think of themselves as smart, and they’ve heard of this data structure, so when you show your knowledge of it, they deem you smart by the transitive principle. To this end, I’m not going to cover any truly obscure data structures, because that would defeat the purpose of why I’m writing this blog post to begin with.
Other than that, it should have real-world use cases so that there’s a legitimate reason for you to mention it in the context of a technical interview. It shouldn’t be too pedestrian either, or you won’t sound impressive for knowing something deemed too “undergrad” (pffttt you only know linked lists? get out of my startup, we only do blockchain here).
A bloom filter is a probabilistic version of a set. Sets contain elements and can tell you in O(1) time and O(N) space whether or not it contains that element. A bloom filter can tell you whether it probably contains an element, but in O(1) time and O(1) space!
Who would really use this?
✨Google Chrome! ✨Chrome needs to protect you from visiting spam websites without sacrificing speed or space. Imagine if every time you clicked on a link, Chrome had to make a network call to check its massive database of spam URLs before allowing you to visit the page. Further, imagine if Chrome’s solution for improving latency was to store that entire list of spam URLs locally. That’s not feasible! Instead, Chrome stores a bloom filter of potential spam URLs locally. Bloom filters are both time- and space-efficient, so it can quickly check for whether the given URL is probably spam. For normal URLs, the bloom filter’s response of “not spam” is sufficient. If a URL gets flagged as “maybe spam”, then Google can check its real database before moving forward. It turns out you can do great things when you’re willing to sacrifice absolutes! (Yeah, yeah, only a Sith deals in absolutes.)
Implementation details that you can scroll past
The Wikipedia article for bloom filters describes the implementation details with a whole lot of jargon, so I’m going to quickly describe the implementation in plain English here. You should check out Wikipedia if you want more precise details; I’m going to gloss over a lot of information because this blog post is quickly turning out to not be clickbait.
Let’s say you want to insert an element into your bloom filter. First, imagine you have N distinct, deterministic hash functions. When you use each hash function on an element, you get a different value (collisions are okay). You use the output of each hash function as an index into an array1, 2 and for each index i, you set the array[i] to true. You’re done! Insertion is O(1) because the only work you do on each insertion is running a constant number of hash functions and setting a constant number of array indices.3
How would you check whether your bloom filter contains that element? Run it through all of the same hash functions again! Your hash functions are deterministic, so the same input should return the same output. So now, for each index you have, you can check if your bloom filter’s array is set to true at that index. If every slot of the array for your hash functions’ outputs is true, then you can say with some high probability that the element is likely to have been inserted into your bloom filter in the past. However, there’s always a chance of a false positive, which would happen if those array slots were all set to true because the indices were used when some other elements were inserted. The great feature of bloom filters is that there will never be false negatives, though: there’s no way to find an array slot that’s false when that element had previously been inserted.
You have to do some cool math to figure out how many hash functions and how big of an array you will need to guarantee certain probabilities. Wikipedia goes into greater detail here and I think their proof is worth reading.
1 In case you’re wondering what I mean by using the output of a hash function as an index: This algorithm assumes your hash function outputs some integer value, let’s call it M. You take the length N. N % M (pronounced N modulo M) gives you a value Q such that 0 ≤ Q < N. It’s a convenient way to take arbitrary values and distribute them evenly within a range. If you haven’t come across this before, you should read about the modulo operator, draw out some example arrays, and play with different values of M to see what you get for N % M.
2 Well ackshually, you should use a bit array instead of a normal array. Each element of an array takes at least 1 byte and you only need to store true/false for each element of the “array”. So you can save space by storing it as a bit array, which is the whole point of this data structure. I just didn’t want to distract from the main discussion of this data structure with implementation details that aren’t conceptually important. I will say, though, that bit arrays (aka bit vectors) are also worthwhile to bring up interviews if you want to sound smart. The real interview pro tip is always in the footnotes.
3 mouth breathing intensifies Well ackchually, insertion is only O(1) if all of your hash functions run in O(1) time but honestly we’re at the 3rd footnote here so go fuck yourself if you were gonna comment on this?
A prefix trie is a data structure that allows you to quickly look up a string by its prefix and also find strings that share a common prefix.
My first pro tip for this data structure is to refer to it as a “prefix trie” as opposed to just a “trie”. That way, you suggest to the interviewer that you are the type of person who knows about algorithms related to both prefixes and suffixes, and also you like to be precise about your hipster data structures. Suffix trees are also a pretty interesting topic, but the implementation details are so gory that I wouldn’t be able to do it justice. That’s why I just talk about prefix tries and bluff knowing about suffix trees.
Who would really use this?
✨Genomics researchers!✨It turns out that modern genomic research relies heavily on string algorithms and data structures because you’re trying to find insights from the millions of nucleotides that make up a genome sequence. With genome data, you often want to align sequences, find differences, or find repeated patterns. If you want to learn more about this, you can start by reading up on bioinformatics, and then look into courses such as “Algorithms for DNA Sequencing” or “Algorithms for Bioinformatics” (offered at multiple schools).
If you want some really exciting bonus reading, I’d highly recommend reading about pharmacogenomics. With advances in genome sequencing and string algorithms, we can actually predict use an individual’s genome to determine whether they have the right genes to react properly to a medication. For example, if their genome is missing a gene for producing an enzyme that processes a certain drug, they might experience side effects. If we knew what genes were important, we could give them a different drug! We currently do exactly this for warfarin, a blood thinning medication.
(I have to confess that the connection between prefix tries and genomics is somewhat tenuous, but I wanted to motivate string algorithms in general. If you want the canonical use case for prefix tries, then yeah whatever, a prefix trie can be used to implement a dictionary. I’m bored again.)
Imagine you have a tree where every node has an array of 26 children, one for each letter of the alphabet. (You can change 26 to be a different value if you want to include other characters. After all, you are the Ashton Kutcher of your string data structure.) To represent a word in your trie, you would walk down the tree and add a node for each letter of the word. For example, here’s this image I stole from Wikipedia:
For the word “tea”, you start at the root, navigate to the t node, then e, and finally a. So searching for a word takes O(N) time (where N is the length of the word), and if the word’s prefix doesn’t exist, you can bail early. If I look up “zzzzzzzz”, the trie can stop looking for my term after “zz”.
A ring buffer is more of a nifty way to use a normal array, but in a clever way that makes it optimized for data streaming.
Who would really use this?
✨Maybe Netflix?! ✨I googled “netflix ring buffer” and found this open source ring buffer code they published, but when has a company ever used their open source code internally in production, anyway?
Totally unrelated, but Veneur, Stripe’s metrics pipeline that we use in production and just happens to be open source, uses a ring buffer for collecting tracing data. #weirdflexbutokay #sponsored
Is anyone even reading these sections? Wikipedia has a cool gif and their descriptions are sufficient for this data structure, so I won’t rewrite what they’ve said. This blog post is already getting pretty long! (If enough people complain, I’ll come back and edit this section.)
With this blog post, I like to think I’ve saved you about $20,000 in university tuition, gotten you excited about at least one real-world use case that you might not have heard of before, and, most importantly, helped you win your next computer science penis measurement contest.
If you’re looking for more practical interview advice, you might enjoy Last Minute Interview Tips for First-Time Interviewers and My Coding Interview Style. Both are tailored for an entry-level audience.
I already have a job as a new grad but this will be super helpful for my next job search. Thanks for writing this :)
That was really great! As a higher level developer, most of my headspace is taken up by business requirements and frameworks, but I still love reading about lower level stuff like this and getting my head around data structure basics.
This was interesting, clear and useful! Thank you so much!
gave me a good laugh
But which datastructure fits best for computer science penis measurement I wonder?
Where you write “N % M”, I think you actually meant “M % N”
Ring buffer implementation pls?
Thanks Amy, best explanation on bloom filters for a new bie
I am so namedropping these on my next interview!
So maybe you know this already but didn’t mention it for literary reasons, so I’m going to go ahead and comment on it in case anyone is wondering: ring buffers are pretty much the staple data structure of real-time processing. Virtually every device that does real-time signal processing or packet processing feeds data from the external sensors/converters (temperature sensors, ADCs, CCDs, microphone, Ethernet card, you name it) into a ring buffer, and the processing function reads data from the same ring buffer. It’s like a conveyor belt (you feed data on one end and process data on the other), except it wraps around.
It’s useful for a bunch of reasons. If you can’t process data quickly enough, you will end up overflowing the “head” of the buffer (and you’ll skip frames, drop packets, whatever), but your memory consumption won’t blow up. However, you still get a “linear” view of your buffer — you can treat it as a regular, 1D vector. It wraps around after some number of samples, but as long as you don’t process that many samples at once, it’s like a hamster wheel — if you’re tiny enough, it looks like a road of infinite length.
It’s important enough that many (most?) DSPs have hardware acceleration features for ring buffers, and there’s also a neat trick to (sort of) emulate that on systems that have a MMU (and, thus, virtual memory) but no hardware ring buffer acceleration.
(By the way, you just got featured on lobste.rs. This article may end up on HN eventually — maybe brace your server?)
Hullo! ^.^ I’m not much of a programmer, my health affects my abilities, so obviously I’m not looking for a job, but I read this just for fun today. (Found it via Lobsters.) I like to think about programming, actually to plan stuff and see if I could do it. (Sometimes I can.) I had to laugh when I saw that what I came up with for fast dictionary search in an interpreted programming language is something that would impress an interviewer because it’s used in genomics! :D I just didn’t know the name before today.
That was pretty awesome .. thanks :)
Hey Amy! I loved your blog post (found from the Programming Digest). I was wondering if you have any recommendations for places to read up on more cool data structures (preferably written in a similar fashion, I’m not very bright).
This was awesome. I read every word by the way!
You have in fact got me excited about a few possibile use cases for bloom filters that I hadn’t considered before so thank you!
Just want to applaud the layout of this post – click bait first, then well-labeled paragraph headers that give the casual reader permission to scroll past implementation details. Oh, enjoyed the content, too.
Unrelated, but I love you’re writing style, specifically how you manage to make something as dry as data structures fun to read about! :)
A good example of a ring buffer would be the one used by the Linux kernel to store log information.
Cool article, I’m a huge fan of algorithms and the bloom filter was a new one for me!
I would definitely add Merkle Tree to this list, especially given the popularity of the blockchain buzz.