Paul Shen
Exploring Codebases
It's the first day at your new job and you're trying to familiarize yourself with the codebase. You're going to be spending a good amount of time here so you down that coffee and dive in. Where do you start?
You've already been added to the company's GitHub organization (your manager is prepared). On github.com, you click around directories and files in your browser. Which file do you open?
GitHub's navigation is relatively limited so you clone the repo locally. You activate your personal toolkit by opening your IDE of choice. You have shortcuts for finding in project and jumping to definition. You can easily blame a line of code and browse source control. See also github1s and Sourcegraph.
Still, the question remains. Which file do you open first? App.tsx seems like a good starting place.
Scavenger Hunt
Rather than randomly walking through the codebase, I find it useful to have a goal, even if it's artificial. I come up with a statement ("there MUST be a place where X happens") and then try to find it.
  • In React, there must be a place where it manipulates the DOM, probably with DOM APIs (appendChild? innerHTML?). How does it decide what DOM manipulations to perform?
  • In this app, there must be a place where it fetches data from the server. Maybe a fetch call? GraphQL? Or some other abstraction?
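A scavenger hunt like this often boils down to a search with a few guessed patterns. Here is a minimal sketch, not a real tool: `hunt`, `Hit`, and the sample files are all made up for illustration, and the patterns are just guesses at what a data-fetching call might look like. In practice you'd reach for your IDE's find-in-project or a grep-like tool.

```typescript
// Hypothetical helper: look for "there must be a place where X happens"
// by scanning source text for a handful of guessed patterns.
interface Hit {
  file: string;
  line: number; // 1-based line number
  text: string; // the matching line, trimmed
}

function hunt(files: Record<string, string>, patterns: RegExp[]): Hit[] {
  const hits: Hit[] = [];
  for (const file of Object.keys(files)) {
    files[file].split("\n").forEach((text, index) => {
      // Patterns are non-global, so test() carries no state between lines.
      if (patterns.some((p) => p.test(text))) {
        hits.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return hits;
}

// Made-up files standing in for a real checkout.
const sampleFiles: Record<string, string> = {
  "src/App.tsx": "export function App() {\n  return <UserList />;\n}",
  "src/api/users.ts":
    'export async function loadUsers() {\n  const res = await fetch("/api/users");\n  return res.json();\n}',
};

// Guesses for "where does it fetch data?": fetch calls, gql tags, useQuery hooks.
const dataFetchHits = hunt(sampleFiles, [/\bfetch\(/, /\bgql`/, /useQuery\(/]);
for (const hit of dataFetchHits) {
  console.log(`${hit.file}:${hit.line}  ${hit.text}`);
}
```

Each hit is a lead, not an answer. The interesting part is following it outward: who calls this, and when?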
A lot of engineering is looking for things.
There's a reason
Why does this line of code exist? Every line of code exists for a (good or bad) reason. A good codebase makes this an easier question to answer. Maybe there are useful comments nearby. Maybe the associated commits are well structured. Or the code is well structured.
Jump to definition and find references are table-stakes language service features at this point. What other language features are useful for reading code?
Trace the code
What happens when the code runs? What functions are being called? What are the key operations? Stepping through a debugger line by line is too tedious though.
I like this idea of interactive flamegraphs. Today, reading codebases is usually bottom-up (start at lines of code). What if we had a top-down approach? Provide an overview and drill into the interesting bits.
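To make the top-down idea concrete, here is a rough sketch of recording a call tree at runtime and printing it as an outline. This is not how any particular flamegraph tool works. `traced`, `CallNode`, and `outline` are names I made up, it only handles synchronous functions, and it records structure without timings.

```typescript
// Hypothetical call-tree recorder: wrap functions, then print who called whom.
interface CallNode {
  name: string;
  children: CallNode[];
}

const root: CallNode = { name: "(root)", children: [] };
const stack: CallNode[] = [root];

// Wraps a synchronous function so each call is recorded under its caller.
function traced<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R
): (...args: A) => R {
  return (...args: A): R => {
    const node: CallNode = { name, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
    try {
      return fn(...args);
    } finally {
      stack.pop(); // restore the caller even if fn throws
    }
  };
}

// Renders the tree as an indented outline, one line per call.
function outline(node: CallNode, indent = ""): string[] {
  const lines = [indent + node.name];
  for (const child of node.children) {
    lines.push(...outline(child, indent + "  "));
  }
  return lines;
}

// Made-up functions standing in for real application code.
const parse = traced("parse", (s: string) => s.split(","));
const render = traced("render", (items: string[]) => items.join(" | "));
const handleRequest = traced("handleRequest", (s: string) => render(parse(s)));

handleRequest("a,b");
const callOutline = outline(root).join("\n");
console.log(callOutline);
```

The outline is the overview. Drilling in would mean jumping from a line of it to the function's source.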
Documentation
The repo you're browsing has been blessed with a quality README.md. There are instructions on how to set up and get the code running. There's even a sentence about each folder and an architectural overview. This is your first exposure to a massive codebase. These small investments pay dividends.
When I think of "code documentation", I think of docs written for consumers of an API or library. Documentation for the codebase itself is usually inline and autogenerated. See examples like rustdoc and TypeDoc. I was pleasantly surprised by the wiki pages on the TypeScript codebase, although non-inline documentation inevitably gets stale.
Guides
It'd be nice if someone sat down and gave you a tour of the codebase. A video would be great but impossible to update. Though there's value in wandering and figuring things out yourself, pointers can really accelerate the process. There are ideas that are not visible at the lines of code level, like how pieces of the system interact with each other.
I like the approach of CodeTour, a VS Code extension that allows you to record and play back guided walkthroughs. The tours are just JSON files that live within the codebase.
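For a sense of what a tour-as-JSON might contain, here is a sketch. As I understand the format, a tour has a title and a list of steps, each pointing at a file and line with a description. Treat the exact field names as an assumption and check the CodeTour docs before relying on them. The file paths and descriptions are invented.

```typescript
// Assumed shape of a tour file; field names are my understanding of the
// CodeTour format, not a verified schema.
interface TourStep {
  file: string; // path relative to the repo root
  line: number; // 1-based line the step points at
  description: string; // text shown to the reader at this step
}

interface Tour {
  title: string;
  steps: TourStep[];
}

// Hypothetical tour through a made-up app.
const tour: Tour = {
  title: "How a page gets its data",
  steps: [
    {
      file: "src/App.tsx",
      line: 1,
      description: "The root component. Everything on screen hangs off this.",
    },
    {
      file: "src/api/users.ts",
      line: 2,
      description: "The one place the app talks to the server.",
    },
  ],
};

// Serialized, this is the kind of JSON that would be checked in with the code.
const tourJson = JSON.stringify(tour, null, 2);
console.log(tourJson);
```

Because the tour sits next to the code, it shows up in code review and can be updated in the same commit that moves the lines it points at.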
At Facebook, I designed the first assignment of bootcamp, the first few weeks for new engineers. It involved building an internal webpage and something with Pusheen. You went through the code review process and shipped it. I don't know if this is still around but I thought it was a fun and gentle way to approach a codebase.
Of course, you can also do a real first task. Instead of randomly browsing around, a task provides a lens and direction to understanding a codebase. The figure-it-out-as-you-go approach.
People
Working in codebases with contributors > 1 is a social activity.
An investigation that takes you hours may take someone else minutes to answer. Maybe you wander around a codebase jotting down questions and ask in batch. These questions/answers can be recorded for future explorers.
All future contributors go through the same process: they see the codebase for the first time and build a mental map of how it works. This learning happens asynchronously, but can we make tools to turn it into a shared experience?
Pair programming is cool. It often results in knowledge transfer (tips, commentary) that wouldn't be prompted otherwise.
Tooling
All this is in the context of my exploration of tools for working with code. We know that there are better tools than Notepad for reading code. Syntax highlighting is nice. Jump to definition. Source control integration. What else can we leverage? Are there interesting user interfaces?
What if we viewed code as data? See Glamorous Toolkit and Pete Vilter's post on Codebase as Database.
What would a map of the codebase look like? Google Maps? A graphviz diagram? A spreadsheet?
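As one small take on both questions, here is a sketch that treats source files as data and turns their imports into a Graphviz diagram. It's a toy: `importEdges` and `toDot` are made-up names, the regex only catches `from "..."` style imports, and specifiers are not resolved to real file paths. A real version would lean on the language's parser.

```typescript
type Edge = [from: string, to: string];

// Extracts (file, import specifier) pairs with a deliberately naive regex.
function importEdges(files: Record<string, string>): Edge[] {
  const edges: Edge[] = [];
  for (const file of Object.keys(files)) {
    // A fresh global regex per file, so lastIndex always starts at 0.
    const importPattern = /from\s+["']([^"']+)["']/g;
    let match: RegExpExecArray | null;
    while ((match = importPattern.exec(files[file])) !== null) {
      edges.push([file, match[1]]);
    }
  }
  return edges;
}

// Emits Graphviz DOT text: one directed edge per import.
function toDot(edges: Edge[]): string {
  const lines = edges.map(([from, to]) => `  "${from}" -> "${to}";`);
  return ["digraph codebase {", ...lines, "}"].join("\n");
}

// Made-up files standing in for a real checkout.
const mapFiles: Record<string, string> = {
  "src/App.tsx":
    'import { UserList } from "./UserList";\nimport React from "react";',
  "src/UserList.tsx": 'import { loadUsers } from "./api/users";',
};

const dot = toDot(importEdges(mapFiles));
console.log(dot);
```

Once the edges are plain data, the same array could feed a spreadsheet or any other view. Rendering the DOT text with Graphviz gives the diagram.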
I'd love to hear from you on Twitter, where I spend time on questions like these.