In this article we’ll explore how software evolution helps us make sense of large codebases. We’ll use the Linux Kernel as a practical case study. By analyzing patterns in the evolution of the Linux Kernel we’re able to break down million lines of code, authored by thousands of developers, into a set of specific and focused refactoring tasks that are most likely to give you the most bang for your efforts.
Our primary tool for exploring the Linux kernel is a Hotspot analysis. As we’ve learned earlier, Hotspots are complicated code that we have to work with often. We’ll find Hotspots on sub-system-, file- and function-levels. We’ll also learn how we can act upon our findings and use them as a driver to improve. Before we go there I’d like to talk a bit about the challenges of a large-scale codebase like Linux as a motivation to why it’s important to support decisions with data.
I spent six years of my career studying psychology at the university. During those years I also worked as a software consultant, which gave me the opportunity to relate the two fields. People I worked with knew about my studies (how could they not - I probably talked non-stop about brains, odd experiments and unexpected social biases) and were curious about psychology. The single most common question I got was why it was so hard to write good code. But I’d like to re-frame the question; The more I learned about cognitive psychology, the more surprised I got that we’re able to code at all. Given all the cognitive bottlenecks and biases of the brain, coding should be too hard for us. The human brain didn’t evolve to program.
Of course, even if programming should be too hard for us, we do it anyway. The reason we manage to pull this off is because we humans are great at workarounds. A lot of the practices we use to structure code are tailor made for this purpose. For example, abstraction, cohesion and good naming help us stretch the amount of information we can hold in our working memory. We use similar mechanisms to structure our code at a system level; Functions are grouped in modules, modules are aggregated into sub-systems that in turn are composed into a system.
But even when we manage to follow all these principles and practices, large codebases still present their own set of challenges. Take Linux as an example. The current kernel consists of +15 million lines of code and grows at a rapid rate. I’ll go out on a limb here and claim that there are few people in the world who can fit 15 million lines of code in their head and reason efficiently about it. In addition, a system under active development presents a moving target; Even if you knew how something worked last week, that code might have changed twice since then. Detailed knowledge in the solution domain gets outdated fast.
The picture becomes even more complex once we add the social dimension of software development. As a project grows beyond 15-20 developers, coordination, motivation and communication issues tend to give a significant overhead. We’ve known that since Fred Brooks wrote The Mythical Man-Month, yet we, as an industry, are still not up to the challenge. For example, we have tons of tools that let us measure technical aspects like coupling, cohesion and complexity. While these are all important facets of a codebase, it’s often even more important to know if a specific part of the code is a coordination bottleneck. And in this area supporting tools have been sadly absent.
I set out to change that by dedicating a whole part in Your Code As A Crime Scene to the topic of Master the Social Aspects of Code. Now we’ll apply some of those techniques to the Linux Kernel. Anything we can do to lower the bar of entry and provide an up-to-date overview of such a massive codebase is a huge win.
My first step is to run an analysis with CodeScene. CodeScene is a tool that analyzes the evolution of a codebase. All I have to do is point CodeScene to the GitHub mirror of the Linux repository and press play. Since I don’t have any previous knowledge of the Linux codebase, I chose to analyze the complete history of the code to get an overview. In this case, we get every commit since early 2006 when Linux Git history starts:
We noted earlier that GitHub gave up on counting the number of authors in Linux. In the CodeScene dashboard above we see that Linux has more than 15.000 contributors! We also see that there are more than half a million commits (CodeScene filters away merge commits by default) and that the majority of the code is, surprise, surprise, written in C with +14 million lines of code.
I’ve analyzed commercial systems of similar scale with CodeScene, but those closed source codebases never had that much development activity. Linux is indeed a unique snowflake. Trying to make sense of all that data requires a strategy. On large-scale systems divide and conquer has served me well. Here’s what I do when faced with a large legacy codebase:
The reason I split a large codebase into multiple analysis projects is because each sub-system will have a different audience. People tend to specialize. My goal here is to partition the analysis information on a level where each analysis result is tailored to the people working on that part.
There’s another advantage with separate analysis projects. CodeScene helps us prioritize the code we need to inspect (and probably improve) by applying machine learning algorithms that look for deeper patterns. This prioritized code, our Main Suspects as shown in the figure above, is always relative to the rest of the code. With separate and focused analysis projects we get prioritized Main Suspects for each sub-system. Let’s see how we do this.
CodeScene lets you configure your architectural components and sub-systems using a simple pattern language. In this case I provided a one to one mapping of the top level source code folders to components. However, since I noted in the initial Hotspot analysis that the
drivers package is huge, I chose to split it into several components as well. So for
drivers, I map each of its sub folders to a unique component.
Once we’ve defined the architectural boundaries we just kick-off a new analysis. Only this time we limit the analysis period to the evolution over the past year to avoid having historic data that obscures more recent trends. Now we just have to wait for the analysis to finish. Ok, here we go:
As you see in the picture above, the top architectural Hotspot in Linux is the
drivers/gpu package. That means the Linux authors have spent most development effort during 2016 on code inside that package. Armed with this information I configure an analysis of just the GPU component by white listing its content. Once that analysis finishes we’re down to a much more manageable amount of code for our investigation. Let’s start by investigating the architectural trends of the
drivers/gpu package content:
As you see in the stacked area chart above, there is a significant growth in this sub-system. The
amd package in particular exhibits a dramatic increase. We could use this information to configure even more specific analyses, but based on my experience we should already be on a level of scale where we can act. Let’s look at the overall numbers for the
An interesting result here is that the Main Suspects in
drivers/gpu only make up 1.4% of the code, but those 1.4% attract 15.9% of all commits. That part of the code is where our primary refactoring candidates are. Let’s look inside the
drivers/gpu component to reveal the top Hotspots on a file level:
This is an interesting finding: the file
intel_display.c is our top Hotspot based on the evolution of Linux during 2016. If you paid close attention you probably heard an alarm go off;
intel_display.c was also our main suspect as we explored the complete Git history of Linux initially. That is, the same file that is a Hotspot in the recent development history has been a Hotspot over the past decade too.
Now, let’s explore the implications of that and see how we can get information we can act upon.
So what does it mean when a file is identified as a “Hotspot”? Is that necessarily a bad thing? No, it isn’t, although there’s usually a correlation here. A Hotspot just means that we’ve identified a part of the code that requires our attention. The more often something is changed, the more important that this something is of high quality. So our next step is to find out how
intel_display.c, our main suspect, evolves over time. Does is change a lot because we make improvements to it? Or is it code that keeps degrading in quality?
A Complexity Trend analysis let us answer these questions. In a complexity trend analysis, we pick each historic revision of the Hotspot, measure the code complexity at that point in time, and plot a trend. The analysis is run automatically by CodeScene. Here’s how it looks on
Before we move on I just want to tie back to my earlier claim that Linux was a unique snowflake in terms of development activity. Remember that? However, our Hotspot investigation reveals that Linux evolves like most other codebases: most Hotspots tend to stay where they are and they also keep accumulating complexity over time. Most code doesn’t get refactored and our
intel_display.c is just one example out of many.
At this point I’d claim that we have all the information we need to suggest a refactoring of our main suspect: It’s a large unit, we need to change the code often, and as we do so we keep adding even more complexity to the code, which makes it harder and harder to understand.
Now, let’s pretend for a moment that this was your project and that you agree with the analysis result. Where would you start your refactorings? We see that
intel_display.c consists of 12.500 lines of C code (comments and blanks stripped away). A file of that size becomes like a system in itself. Parts of the code have probably been stable for years while others keep changing. We want to focus on the latter. This is precisely what CodeScene’s X-Ray analysis does. We saw X-Ray at work in a previous part of this series. Now we’ll put it to work on our Hotspot in Linux.
CodeScene’s X-Ray analysis runs the same Hotspot algorithm as we’ve already seen. The difference is the scope. X-Ray operates on a function level to detect Hotspot functions inside Hotspot files. That means we get to use the same analysis concepts as we travel down the abstraction levels of out codebase. Let’s see what X-Ray reveals about
As you see in the figure above, X-Ray ranks the functions based on how often you’ve made modifications to them. In most cases, this information serves as a prioritized list of refactoring candidates. Sure, there may be severe structural problems with a Hotspot, but in a large system with Hotspots consisting of thousands of lines of code you need to start somewhere. Refactoring that kind of code has to be an iterative process; The first refactoring addresses the most critical piece of code in terms of maintenance effort, then you move on to the next target in the list. After a series of such refactorings you’re likely to discover new design boundaries and can continue from there.
Back in the first part of this series I showed that the change distribution of code, on a file level, follows a power law shape. We learned that’s the reason why Hotspots serve so well as a tool to prioritize. Now that we’ve met X-Ray we can see that the modifications of individual functions inside a file also tend to form a power law:ish distribution. As an example, here’s the distribution of changes across the functions in our
This is why I recommend that we guide our refactorings by data. Data that is based on our own behavioral patterns in code. You see, refactoring legacy code is both expensive and high risk. With X-Ray as our guide we know that we spend our efforts where they are likely to be needed the most.
After this detour into a discussion of change frequency distributions we may still struggle with the refactoring of our main Hotspot. I briefly mentioned that Hotspots often have structural problems. In my previous blog post I showed how X-Ray detects patterns in how a Hotspot grows. More specific, I showed how temporal coupling lets you identify the functions in a Hotspot that tend to be modified together in the same commit. A temporal coupling analysis like that may help you detect some structural problems. Let’s see how it looks on our
As you see in the picture above, there are several functions inside
intel_display.c that are modified together all the time. For example, the top row shows that
intel_finish_page_flip_mmio have been modified together in every single commit that touched one of them. That implies that these two functions are intimately related.
The Similarity column in the table above is a clone detection algorithm. You see, a common reason that code changes together is because it contains duplication (either on the code level or in terms of knowledge). In our case, we note a similarity of 99% between the functions
intel_finish_page_flip_mmio. Let’s click on the Compare button in CodeScene to inspect the code:
You need to look carefully at the code above since there’s only a single character that differs between the clones. A single negation(
!) is all the difference there is. That leaves us with a clear case of code duplication. We also know that the duplication matters since the temporal coupling tells us that these two clones evolve together. Software clones like these are good in the sense that it’s a low-hanging fruit; Factor out the commonalities and you get an immediate drop in the amount of code you have in your Hotspot.
Once we’ve inspected our top Hotspot we just rinse and repeat the process with the other main suspects. And here you’ll note another advantage of configuring separate analyses for the different sub-systems. Inspecting a Hotspot is so much easier if you’re familiar with the domain and the code. So if we manage to align the scope of the analysis with the expertise of the team that act upon them we’re in a good place.
This is something I experience each time I present an analysis of a commercial codebase to its developers. As I present the Hotspot analyses of the different parts of the codebase I usually get approving nods from different parts of the audience depending on their area of expertise. Hotspots put numbers on your gut feelings.
So far we’ve learned the basics of how software evolution helps us uncover potential technical problems in a large codebase. But software evolution also helps you understand the social dimension of code. Since CodeScene uses version-control data for the analyses, CodeScene is able to detect patterns in how people work and collaborate.
Now, there’s a difference between the open source model used in the Linux project as compared to the collaboration mechanisms you usually see in closed source commercial projects. In the latter case you typically have several distinct teams, often co-located on the same site. And improving the coordination and communication between these teams is often even more important than addressing the immediate technical debt in your code.
CodeScene comes with a set of analyses that help you uncover such team-productivity bottlenecks. An example is code that has to be concurrently worked on by members of different teams. Since the Linux project doesn’t have a formal organization we’ll limit our social analyses to individuals.
The way we chose to organize influences the kind of code we write. There’s a strong difference between code developed by a single individual versus code that’s more of shared effort by multiple programmers. The quality risks are not so much about how many developers that have to work with a particular piece of code; It’s more about how diffused their contributions are.
In Your Code As A Crime Scene I wrote that “[..]the ownership proportion of the main developer is a good predictor of the quality of the code! The higher the ownership proportion of the main developer, the fewer defects in the code”.
Again, open source development may be different and encourage contributions to all parts of the code. However, there’s evidence that suggests that this comes with another cost. One study on Linux itself claims that code written by many developers is more likely to have security flaws (A. Meneely & L. Williams, 2009. Secure open source collaboration: an empirical study of Linus’ law). Wouldn’t it be great if we had an analysis that helps us identify those parts of the code?
Detecting code written by many developers is precisely what CodeScene’s Parallel Development analysis does. In particular, it doesn’t look at the number of authors, but on how diffused their contributions to each file are. Let’s see it in action:
You interpret the visualization above by looking at the color of each file; The more red, the more diffused work on that code. And if you’d like more information, you just click on one of the files to reveal its Fractal Figure;
The Fractal Figure above is based on a simple model; Each developer is assigned a unique color and the more that developer has contributed to the code, the larger their area of the fractal.
This concludes our exploration of the evolution of the Linux codebase for this time. My goal was to show you how to make sense of a large codebase by utilizing the information of how the system was built. By embracing the history of the code, we were able to identify patterns like Hotspots and implicit dependencies. That is, information that is invisible in the code itself. We also had a brief look at how we can uncover social and organizational information that helps us understand another important dimension of large-scale systems.
The best way to learn more is to try CodeScene on your own codebases. CodeScene is available as a service that’s free for open source projects: https://codescene.io/. CodeScene as a service evolves at a rapid rate and you can read more about upcoming features here.
Empear also provides CodeScene on-premise. The on-premise version is feature complete with all analyses used in this article. You get an on-premise version here.