Free Software and Code Reuse

I first wrote about this back in 2006, but in Portuguese, so I'll have to repeat a good deal of it here before following up with what has changed since then.

Just the fact that, five years later, what I'm about to say still isn't obvious could mean that it is wrong. Or maybe I'm just too impatient, on the brink of being proved right, who knows? Anyway, let's get to the matter, starting with some (short, I hope) background.

Exponential Growth

Humans aren't great at math, and one of our most visible shortcomings is our inability to deal with exponential growth. By now that inability has become so famous that everybody and their dog have heard about it, and everybody knows that exponential growth is fast. The problem is that, well... it isn't quite so. An image says it better:

[image linXexp1_0.png: a linear and an exponential curve over an early range, where the exponential still looks flat]

What's fast there? Of course, one could argue that I cut the graph just a little before the exponential curve starts growing fast. That is right; the exponential will shortly overtake the linear curve. On the other hand, one could also argue that I cut the graph just a little after the exponential curve became distinguishable from zero. That is also right. Let's extend the curves a bit, to get the more familiar picture:

[image linXexp2_0.png: the same curves over a longer range, where the exponential dwarfs the linear]

I hope the feature I want to show is now clear. Shortly after the exponential becomes barely visible, it becomes huge, and that after an eternity of being almost zero. It is not that it is fast (which it is), but that it is sudden.

Knowing that, I'd ask you to take a look at the image above and try to determine where the exponential started to grow fast. You'll probably see a region, just right of the middle of the graph, where its slope changed quickly. That is your eyes betraying you. There is nothing special about that region, and to confirm it, let's zoom into other regions and see what they look like.

[image exp-zoom_0.png: zooms into different regions of the exponential, each rescaled to fit the graph area]

Once you zoom into any region of the curve and rescale it to fit the graph area, they all look the same! The beginning grows slowly and the end grows fast, yet the overall shape doesn't change: the beginning was always slower, the future will always be faster; a one-line identity, below, makes this precise. Now that we know that exponentials are fast, when they aren't slow, and change fast while never seeming to change, let's get into why I'm talking about misbehaving functions in a text about Free Software. Let's see why exponentials are important.
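
The promised identity, in notation of my own (not in the original post): for f(x) = e^{kx} and any starting point a,

    \frac{f(a+x)}{f(a)} = \frac{e^{k(a+x)}}{e^{ka}} = e^{kx} = f(x)

so any window of the curve, rescaled by its value at the start, reproduces the whole curve; no region of an exponential is special.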

Finally, Software

I'll start here with a couple of assumptions. I'll assume free software developers reuse code developed by third parties to a high degree. There is evidence for that in the huge number of libraries required by any normal Linux desktop installation, for example. In fact, that used to be a problem, known as dependency hell, now solved by the modern package managers available on any distro. Mature software continuing to grow, instead of forking, is also evidence of reuse. I'll also assume that proprietary software developers use only a limited set of code developed by third parties, mainly restricted to the toolkit distributed with the compiler of whatever language they are using. Finally, I'll assume in-house code reuse to be usual, despite evidence that it is limited in big organizations.

That last assumption is conservative; the first one needs to be tested; the middle one, limited code reuse by proprietary software developers, I'll state as a widely accepted truth and feel no need to test (if you want some kind of evidence, just look at the size of the market for proprietary libraries, and how it grew, or rather vanished, since GUI and networking toolkits started being offered for free).

There is one outstanding argument against those assumptions: both worlds couldn't be that different, because the developers in them aren't that different. In fact, the same people often develop both proprietary and free software; how can they work so differently on each? The answer can be found in The Tale of J Random Newbie (reading the rest of the chapter, no, the rest of the book, is recommended). It's as simple as that: proprietary software isn't packaged for reuse, and for it to be so, it would need to be free software. In fact, that statement is beginning to sound old. That is good news; maybe I'm right after all ;)

Now, everybody knows software reuse is important. If all of the above is true, free software has a clear advantage, but how big can it be? Well, a programmer with a set of tools has a certain productivity; if his tools get better, he gets more productive, and so on. The distinguishing feature of code reuse is that the "tool" is code already written, and the programmer's product is new code: one uses tools to create new tools, to create new tools, to create new tools... and each new set of tools is built with higher productivity.

I'll call the "speedup" due to an available library the time needed to create some code without using the library, divided by the time needed using it. That means, for example, that if the speedup due to a certain library is 2, a developer codes 2 times faster with it than without it. I propose that the speedup is proportional to the amount of code available for reuse; the proportionality constant may change with the quality of the code, the capability of the programmer, the amount of coffee he drinks a day, or anything else, but the relation with the amount of code is proportional. That sounds like a safe proposition, since you can reuse more code when you have more code available, but it would be nice to test. Unfortunately, you won't find that test here.
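
In symbols (my notation): writing T_0 for the time to produce some code without the library, T for the time with it, and C for the amount of code available for reuse, the proposition is

    S = \frac{T_0}{T}, \qquad S = k\,C

with a constant k that absorbs code quality, programmer skill, daily coffee intake, and the rest.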

If that relation is proportional, then with constant programmer effort the speedup grows along an exponential curve in time. Notice that no other kind of improvement can lead to exponential growth of programmer productivity at constant effort; code reuse does so because code is the very thing getting produced faster, and only tools made of code can take advantage of that.
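
To spell out the step from "proportional" to "exponential" (again my notation): if constant effort p turns the speedup into new code at rate dC/dt = p S(t), and S(t) = k C(t), then

    \frac{dC}{dt} = p\,k\,C(t) \quad\Longrightarrow\quad C(t) = C(0)\,e^{pkt}

so the code base, and with it the speedup S(t), grows exponentially while the effort stays constant.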

The conclusion here is that any body of software should grow at a speed proportional to the amount of software available to its owner, and so grow exponentially under constant effort; thus whoever owns the most software has the most productive programmers and sees the biggest exponent. Since free software is effectively "owned" by everybody who codes under a compatible license, free software written under popular licenses should show an exponential speedup over time when compared with any single proprietary software owner. That gap can be narrowed by hiring more coders on the proprietary side and by using software under compatible licenses, like BSD, but it will always remain exponential. Remember, one can outperform an exponential for some time, even for a very long time, but for every exponential there comes a time when it starts to grow fast, and that happens just after the curve starts to become relevant, surprising people.
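
One way to see why the gap itself stays exponential (my notation again): if the free code base grows as e^{r_f t} and a proprietary one as e^{r_p t} with r_f > r_p, their ratio is

    \frac{e^{r_f t}}{e^{r_p t}} = e^{(r_f - r_p)t}

which is again an exponential; more hiring can raise r_p, but as long as r_f stays larger the gap keeps compounding.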

Now, the Data

So, there is one hypothesis here to test: that free software developers reuse code often, and that the reuse isn't limited to a constant set of code. There is also a stricter version: that reuse is proportional to the amount of code available.

What better data to run that test on than Debian's package repository? It is simply a huge collection of useful code, with annotated dependencies, sizes, and everything else, that constantly grows with time and purges anything that isn't useful. Within Debian, Sid is the best source: it is a rolling distribution, not subject to freezes, and not limited in time. Thus, since 2006 I've been collecting Sid's package index.
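
As an aside, here is a minimal sketch of how such an index can be read. This is my reconstruction, not the code actually used for the experiment: a Packages index is a series of RFC822-style stanzas separated by blank lines, and the sketch keeps only the fields the metric below needs, reading "uncompressed size" as the Installed-Size field and simplifying away version constraints and alternative dependencies.

    # Sketch: parse a Debian Packages index into
    # {name: {"size": ..., "depends": [...]}}.
    def parse_packages(text):
        index = {}
        for stanza in text.split("\n\n"):
            fields = {}
            for line in stanza.splitlines():
                # field lines look like "Key: value"; continuation lines
                # start with whitespace and none of the fields we need span lines
                if line[:1] not in ("", " ", "\t") and ":" in line:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
            if "Package" not in fields:
                continue
            depends = []
            for clause in fields.get("Depends", "").split(","):
                clause = clause.strip()
                if clause:
                    # keep the first alternative, drop any "(>= version)" part
                    depends.append(clause.split("|")[0].split("(")[0].strip())
            index[fields["Package"]] = {
                "size": int(fields.get("Installed-Size", 0)),  # KiB
                "depends": depends,
            }
        return index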

Now, what information should we take from that data? For this experiment, I created a simple metric. Let's define the "size" of a package as its uncompressed size, as recorded in the index. Let's also define the "full size" of a package as its size plus the full size of every package it depends on. Summing those numbers over all packages, the estimated speedup for the entire distro is "full size" divided by "size". This estimate ignores the base system, which every package depends upon, but since I'm interested in trends, that is not a problem. If there is code reuse that isn't limited to a specific set, that number must be growing. If code reuse is proportional to the amount of code available, that number must be growing exponentially. So how does it behave?
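
In code, under one interpretation I'm adding myself (the definition above, taken literally, would count a shared dependency many times and loop on dependency cycles, so this sketch uses a deduplicated transitive closure instead), and assuming the index shape produced by the parser sketched earlier:

    # "Full size" of a package: its own size plus the size of everything it
    # transitively depends on, each dependency counted once.
    def transitive_deps(name, index, seen=None):
        if seen is None:
            seen = set()
        for dep in index[name]["depends"]:
            if dep in index and dep not in seen:
                seen.add(dep)
                transitive_deps(dep, index, seen)
        return seen

    # Estimated speedup for the whole distro: sum of full sizes over sum of
    # plain sizes, base system ignored just as in the text.
    def estimated_speedup(index):
        total_size = sum(p["size"] for p in index.values())
        total_full = sum(
            index[name]["size"]
            + sum(index[dep]["size"] for dep in transitive_deps(name, index))
            for name in index
        )
        return total_full / total_size

    # Tiny invented example: app depends on libfoo, so its full size is
    # 50 + 100, and the estimate is (150 + 100) / (50 + 100), about 1.67.
    example = {
        "libfoo": {"size": 100, "depends": []},
        "app":    {"size": 50,  "depends": ["libfoo"]},
    }
    print(estimated_speedup(example))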

It behaves in an absolutely unexpected way. Data from 2006 is noisy, very noisy, and the estimated speedup is huge: by that metric, in 2006 the speedup was around 4000! That is a flaw of the metric, which doesn't work well with the way the index was organized at the time. A reorganization around 2007 made that noise go away, so I ignore everything from before it. The result is the following:

Despite still being noisy, the metric is clearly trending up. Free Software really does have that advantage. The value is also low, and there isn't enough change yet to decide on the shape of the curve. In case it really is an exponential, there was one doubling in 4 years; for a comparative metric that is very fast: it means that in 16 years Free Software developers will be at least 2^(16/4) = 16 times faster, and in 32 years 2^(32/4) = 256 times faster than proprietary ones (of course, there must be some limit on how big it can grow). But that is a big if, with little supporting evidence so far. To see whether a function is exponential, we put it on a semilog scale. Here is the same data, with a logarithmic scale on the vertical axis:

It doesn't make much difference, since the change is small. An exponential curve would show up as a straight line on that graph, and the current data seems to grow even faster than that. But again, there isn't enough change to make a good fit.
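
To illustrate why the log scale is the right test (with invented numbers, not the actual measurements): taking logs turns S(t) = S_0 e^{rt} into log S(t) = log S_0 + rt, a straight line in t, so any curvature on a semilog plot means the data grows slower or faster than an exponential.

    # Illustrative only: a hypothetical speedup series doubling every ~4
    # years, plotted on a semilog axis, where an exponential is a line.
    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(0, 4, 50)      # years since the 2007 reorganization
    s = 1.2 * np.exp(0.17 * t)     # invented: roughly one doubling in 4 years

    plt.semilogy(t, s)             # exponential -> straight line here
    plt.xlabel("years")
    plt.ylabel("estimated speedup (log scale)")
    plt.show()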

Conclusions

The current data suggests my hypothesis is too conservative, but the data itself is not very trustworthy. A better picture may come from studying other sources, and it is important to devise a metric that is less sensitive to the internal structure of the data.

But it is worth repeating: the current data suggests my hypothesis is too conservative.
