Linux Versus E. coli

By Carl Zimmer | May 3, 2010 5:07 pm


In 1991, a 21-year-old Finnish computer science student named Linus Torvalds got annoyed. He had bought a personal computer to use at home, but he couldn’t find an operating system for it that was as robust as Unix, the system he used on the computers at the University of Helsinki. So he wrote one. He posted it online, free for anyone to download. But he required that anyone who figured out a way to make it better would have to share the improvement with everyone else who used the system. Torvalds would later tell Wired that his motives were not noble. “I didn’t want the headache of trying to deal with parts of the operating system that I saw as the crap work,” he said. “I wanted help.”

In his quest to avoid crap work, Torvalds unleashed a monster. People began to download the system, dubbed Linux, all over the world. Within a few weeks, Torvalds was getting emails from hundreds of users, explaining how to fix bugs and how to add new bells and whistles. People began to write programs that would only work on Linux computers. They founded companies around Linux-based software. Millions of people chose Linux for their computers, and major computer companies like Microsoft and Dell began to support the system. Along the way, Linux evolved. Torvalds’s first version contained 10,000 lines of code. Linux now holds over 12 million lines.

Those 12 million lines may seem like a hopeless thicket of code, but they actually have a hidden structure. The code is divided up into chunks, each of which carries out a particular task. All told, they carry out 12,391 separate functions. The functions are also connected. If Linux carries out one function, the system will direct the computer to carry out other functions. You can think of Linux as a network, with the functions joined together by links of control. Computer programmers can map out that network as a so-called “call graph.”
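For readers who want to see what a call graph looks like in practice, here is a minimal sketch in Python. The function names are made up for illustration and are not taken from the Linux source. The graph is simply a dictionary that maps each function to the functions it calls.

```python
# Toy call graph: each key is a function, each value lists the functions it calls.
# Every name here is hypothetical, chosen only to illustrate the idea.
call_graph = {
    "sys_open": ["check_permissions", "alloc_file"],
    "sys_read": ["check_permissions", "copy_to_user"],
    "check_permissions": ["log_event"],
    "alloc_file": [],
    "copy_to_user": [],
    "log_event": [],
}


def reachable(graph, start):
    """Return every function that may end up running once `start` is called."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def count_links(graph):
    """Count the links of control (caller -> callee) in the graph."""
    return sum(len(callees) for callees in graph.values())
```

Calling `reachable(call_graph, "sys_open")` follows the links of control downward from one function, which is the sense in which carrying out one function leads the computer to carry out others.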

Linux bears an uncanny resemblance to the genes in a living cell. Many genes make proteins that act as switches for other genes. The proteins clamp onto DNA near a target gene, allowing the cell to read the gene and make a new protein. And that new protein may, in turn, grab onto many other genes. Thanks to this hierarchy of switches, cells can respond to changes in their environment and quickly carry out complex behaviors, such as reorganizing themselves to feed on a new kind of food.

A number of scientists have begun to compare natural and manmade networks. A lot of the same rules appear to be at work in the growth of the Internet, airport connections, brain wiring, ecosystem food webs, and gene networks. But very often, scientists are finding, it’s the differences between natural and manmade networks that are most revealing, offering clues to the different ways in which people and evolution build complex things.

In the Proceedings of the National Academy of Sciences this week, Koon-Kiu Yang of Yale and his colleagues present the first detailed comparison of Linux’s network to a gene network. (The paper will be here.) Thanks to the open-source nature of Linux, the scientists could look at every line of code in every version of the system over the past two decades, from Torvalds’s first primitive stab to its current sophisticated form. And for a living cell, Yang and his colleagues turned to the living equivalent of Linux–a biological network they could analyze from top to bottom. They chose E. coli, since it is the best-studied species on Earth. (Why E. coli? There’s a certain book that will explain it to you.)

Over the past fifty years, scientists have mapped 1,378 interactions among E. coli genes. Out of that research, Yang and his colleagues built a microbial call graph. They assigned each gene to one of three categories. If a gene switched on one or more genes, but was not itself switched on by another gene, they called it a “master regulator.” If a gene was switched on by a different gene and then, in turn, switched on other genes, the scientists dubbed it a “middle manager.” And if the gene was switched on but did not then switch on any other genes, they called it a “workhorse.” The scientists drew the network of master regulators, middle managers, and workhorses.
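The three-way sorting rule can be written down in a few lines. What follows is a rough sketch in Python, not the scientists' actual method. The gene names are placeholders, and ignoring a gene's regulation of itself is my own assumption, based on the phrase "switched on by a different gene."

```python
def classify(edges):
    """Sort nodes into tiers from (regulator, target) pairs.

    A node that only switches others on is a master regulator, one that is
    switched on and switches others on is a middle manager, and one that is
    only switched on is a workhorse.
    """
    regulators = set()
    targets = set()
    for src, dst in edges:
        if src == dst:
            continue  # assumption: self-regulation does not count as a link
        regulators.add(src)
        targets.add(dst)

    tiers = {}
    for node in regulators | targets:
        if node in regulators and node not in targets:
            tiers[node] = "master regulator"
        elif node in regulators:
            tiers[node] = "middle manager"
        else:
            tiers[node] = "workhorse"
    return tiers


# Hypothetical interactions, for illustration only.
toy_edges = [
    ("geneA", "geneA"),  # geneA regulates itself; ignored by the rule above
    ("geneA", "geneB"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
]
tiers = classify(toy_edges)
```

The same function works for a call graph if each pair is read as (caller, callee), which is how the identical rules could be applied to Linux.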

The scientists sorted all the functions in Linux by the same rules. Here is the picture that emerged.


(N.B.: for the sake of clarity, the scientists only used 10% of the nodes in the full Linux call graph. But the complete picture would look the same.)

Both Linux and E. coli are organized into hierarchies. But their hierarchies have different shapes. E. coli’s genome is dominated by workhorses. Middle managers and master regulators make up less than 5% of the total number of genes. In Linux, by contrast, over 80% of the functions are in the upper echelons. Each workhorse in Linux is controlled by many middle managers. In E. coli, on the other hand, each workhorse gene is typically controlled either by a few genes or just one. And so in E. coli it’s the higher levels where genes have the most links, not the workhorses.
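The two shapes come down to two simple measurements: what share of the nodes sits in each tier, and how many controllers point at a typical workhorse. Here is a hedged sketch in Python using two tiny invented networks. The numbers they produce are toy values, not the figures from the paper.

```python
from collections import Counter


def tier_shares(tiers):
    """Fraction of nodes in each tier, given a {node: tier} mapping."""
    counts = Counter(tiers.values())
    total = len(tiers)
    return {tier: counts[tier] / total for tier in counts}


def mean_controllers_per_workhorse(edges, tiers):
    """Average number of distinct incoming links per workhorse node."""
    in_degree = Counter(dst for src, dst in set(edges) if src != dst)
    workhorses = [node for node, tier in tiers.items() if tier == "workhorse"]
    return sum(in_degree[w] for w in workhorses) / len(workhorses)


# Invented "bottom-heavy" network: one regulator, four workhorses.
bottom_edges = [("reg", "w1"), ("reg", "w2"), ("reg", "w3"), ("reg", "w4")]
bottom_tiers = {"reg": "master regulator", "w1": "workhorse",
                "w2": "workhorse", "w3": "workhorse", "w4": "workhorse"}

# Invented "top-heavy" network: four callers sharing one utility function.
top_edges = [("f1", "util"), ("f2", "util"), ("f3", "util"), ("f4", "util")]
top_tiers = {"f1": "master regulator", "f2": "master regulator",
             "f3": "master regulator", "f4": "master regulator",
             "util": "workhorse"}
```

In the bottom-heavy toy network each workhorse answers to a single controller, while in the top-heavy one a lone workhorse answers to four, which mirrors the contrast described above in miniature.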

Once Yang and his colleagues had drawn the two networks, they looked at the paths information takes as it flows from master regulators down to workhorses. E. coli’s genes are organized into relatively distinct modules. When a master regulator swings into action–in response, say, to a spike in temperature–it switches on a set of other genes with relatively little overlap with the genes switched on by other master regulators. Linux, by contrast, has blurry boundaries. Four out of five Linux modules overlap, in contrast to 5% of E. coli’s.
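One way to put a number on "blurry boundaries" is sketched below in Python. It assumes that a module is everything reachable downstream of a single master regulator, and that a module overlaps if it shares at least one node with some other module. Both definitions are my reading of the paragraph above and may differ from the ones used in the paper. All node names are invented.

```python
def downstream(graph, start):
    """Every node reachable from `start` by following links of control."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def overlapping_fraction(graph, masters):
    """Fraction of modules sharing at least one node with another module."""
    if not masters:
        return 0.0
    modules = {m: downstream(graph, m) for m in masters}
    overlapping = 0
    for m in masters:
        # Pool the members of every other module, then test for shared nodes.
        others = set().union(*(modules[o] for o in masters if o != m))
        if modules[m] & others:
            overlapping += 1
    return overlapping / len(masters)


# Invented network with distinct modules.
distinct = {"m1": ["a", "b"], "m2": ["c"], "a": [], "b": [], "c": []}

# Invented network in which two modules reuse the same workhorse.
shared = {"m1": ["util", "a"], "m2": ["util"], "m3": ["b"],
          "util": [], "a": [], "b": []}
```

With these toy inputs, `overlapping_fraction(distinct, ["m1", "m2"])` finds no overlap at all, while in `shared` two of the three modules overlap because both depend on `util`.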

The networks in E. coli and Linux don’t just look different. They also grew in different ways. The oldest genes in E. coli’s network–the ones shared by many other species of microbes–are its workhorses. The genes higher up in the E. coli hierarchy have emerged more recently. Those higher-ranking genes have also been undergoing a lot of evolutionary change since they first emerged. The old genes, by contrast, have changed little.

The history of Linux has played out differently. A lot of the oldest functions in Linux are middle managers or master regulators, not workhorses as in E. coli. And while old genes in E. coli haven’t evolved much, programmers have heavily rewritten Linux’s old functions.

Both networks developed, step by step, as increasingly sophisticated systems for operating things–computers or cells. But the Linux network was the work of programmers, while E. coli is the product of four billion years of evolution. The differences in the history and shape of the two networks emerge from the ways in which they developed. The programmers who built Linux did not have the time to invent entirely new workhorse functions. It was simpler for them to just use the old workhorse functions in new modules. But this strategy leaves Linux a lot more fragile than a biological network. Its modules overlap, so that in many cases, a workhorse function is essential for many different modules at once. As a result, Linux gets buggy and prone to crashing. And so as programmers improve Linux, they’ve had to fine-tune its all-purpose functions at every step of the way.

E. coli is far more rugged. Mutations crop up all the time as the bacteria multiply, and yet they generally don’t suffer a catastrophic network crash. One reason E. coli is so robust is that its modules have evolved to be distinct. Overlapping modules make cells particularly vulnerable to mutations, because a single mutation can shut down a lot of their essential biology. Natural selection favors organisms with a more rugged network.

Because E. coli is the product of evolution, rather than of programmers, parts of its genome have changed relatively little over billions of years. The oldest parts of the network are the workhorse genes–the ones that encode primitive proteins that do the fundamental work of life, like building new pieces of DNA. They can tolerate very little change. It’s much easier instead for E. coli to evolve new ways of controlling those workhorses.

This kind of comparison is very new, and it’s not clear yet what scientists will find when they compare Linux to other genomes–particularly to the genomes of more complex species like ourselves. E. coli has only about 4,300 genes. We have 20,000 protein-coding genes. A lot of those genes control other genes. Indeed, a typical human gene has a lot of switches, all of which have to be thrown in order for the gene to make a protein in a certain situation. The human genome is also packed with thousands of genes that don’t encode proteins, but which may encode RNA molecules that also switch genes on and off. Scientists just don’t know enough yet about the human genome to map its network the way they’ve mapped E. coli. But it’s possible that when they finally do, it will be a lot more top-heavy, with a lot more overlapping modules and multi-tasking workhorses.

If that turns out to be the case, biologists will have a new question to keep them busy for a long time to come: how did Linus get to be so much like Linux?

[Update: Fixed Torvalds’s name and other typos. Thanks for the proofing!]


Comments (30)

  1. Owlmirror

    major computer companies like Microsoft and Dell begn

    No, not (operating system competitor) Microsoft there. That should probably read “IBM” (computer server hardware manufacturer)(although it’s more complicated than that since IBM also has their own version of Unix, but does also support Linux).

    Also, note minor typo: “began”.

  2. ncbi rofl

    “scientits”, eh?

  3. Owlmirror

    If that turns out to be the case, biologists will have a new question to keep them busy for a long time to come: how did Linus get to be so much like Linux?

    …And we just know how evolution-denialists are going to answer this one.

    Oh, well. I suppose it’s inevitable.

  4. Well, these ideas are certainly interesting and trendy (think about Davidson and Boulouri’s biotapestry). However, working myself in that field, I am not convinced at all that bacteria are more or less equivalent to computers or that biology eventually will be a sub field of engineering. I actually think the real “out of the box” thinking today is to envision in what way living things and designed things are different. Evolution has absolutely no reason to find the same solution as human engineers, it is not clear at all that incremental evolutionary process is comparable to anything like the more global optimization principle used by computer scientist, etc…

    On the other hand, if you really believe this, you need to go a little bit further as you tell. If Linus is designed as Linux, you must conclude that evolution actually is rather constrained in the solutions it can find, therefore it means that evolution, as a theory, is much more predictable than what biologists usually think, and that we actually do not really understand much about it.

  5. Bob Carlson

    Linus Torvalds, not “Torvald.” I encountered the same error in a book chapter titled “The Future of Openness.” In this case, though, the book was neither about computer science nor biology but rather The Secular Conscience.

  6. Brian Too

    This is extremely cool. Having worked on numerous proprietary systems, I believe that they would look much like the Linux system graph, assuming we are talking about systems that have had some time to evolve.

    Many computer systems generally, and Linux particularly, are designed with a fundamental bias towards efficient hardware utilization. This may have something to do with the proportionately higher number of control functions.

    There was a marvellous article years ago in Byte. It discussed the fundamental reasons why small computer operating systems were so different than large computer OS’s. In particular they focused on the aspect of quality control issues, and overall susceptibility to system crashes.

    It turned out that the fundamental issue at work really was fundamental. It boiled down to, what are your priorities? What do you spend your time on, and why? Large system OS’s place a premium on stability, backwards compatibility, security, and so forth. Small system OS’s place a premium on rapid development, new features, new programming methodologies, and so on. This leads to very different outcomes in what is delivered as finished product.

    If I had to recast that to E. coli and biology, here’s what I’d guess. The E. coli organism places a premium on living, reproducing, and adapting. Graceful degradation as opposed to collapse (death) would be actively selected for.

    That’s an area where computers still don’t compare well to biological systems.

  7. Kaleberg

    A big reason for the structural difference is that cells don’t have anything that works quite like a subroutine call. Any number of genes can activate a particular gene, but if genes A and B activate gene C, the information about which gene started C is generally lost. So if gene C needs to be activated as part of two possible gene sequences (or cascades or complexes), it will typically be copied so there will be a gene C used by A and its ilk and a gene C’ used by B and its ilk. In fact, this copying is the raw material for evolution, since once cloned, C and C’ can mutate and evolve separately.

    This means that gene invocation structure is not at all like subroutine structure. It is more like macro expansion. In a macro expansion based system, if two routines called another routine to open a file, for example, the compiler would expand the code for opening the file twice, and we would get a chart more like that for E. coli. This would make the code incredibly hard to maintain. A minor bug in the file open routine, or the addition of a new feature, would require editing every routine that includes it. Of course, if we expect the system to maintain itself, this makes sense. There is no point in modifying the routines that use the old file open if they are working well enough. Only the ones that need to be changed to keep the system running will change. Otherwise, the system will die having failed to reproduce.

    It might be interesting to look at the LINUX chart from a macro expansion point of view. Even more interesting would be a chart based on E coli, but considering homologous genes to be the same for the purposes of usage charting. Given how much of life as we know it is cut and paste, I expect to find the large number of workhorse genes to fall into relatively few categories.

  8. There’s a subtle observation to be made here:

    E. coli is far more rugged. Mutations crop up all the time as the bacteria multiply, and yet they generally don’t suffer a catastrophic network crash.

    No, E. coli actually suffer catastrophic crashes all the time — those are the E. coli that died. And they do frequently die. Only the mutations that don’t cause network crashes have a chance of being preserved.

    [CZ: I think we’re saying the same thing. The robustness to mutations has evolved through natural selection.]

  9. “Evolution has absolutely no reason to find the same solution as human engineers, it is not clear at all that incremental evolutionary process is comparable to anything like the more global optimization principle used by computer scientist, etc…” — Tom Roud

    Yes and no. In the present context, yes, evolution is a very different (though vaguely analogous) process from Linux kernel development. Generalizing to engineering in the economy at large, the artificial process becomes less dependent on human design but still not evolutionary per se. There is some impetus for using more genuinely evolutionary models (i.e. far beyond Agile or XP methodology) in software engineering for very large, complex projects (Yaneer Bar-Yam comes to mind), but it’s on the periphery.

    As a computer scientist working in AI and complex systems, however, when I hear “global optimization principle,” I do not think of the social/design process in conventional software engineering, but the all-too-often-biology-inspired optimization algorithms that we apply to analyze and/or design complex nonlinear systems– systems more akin to E. Coli than Linux. In that arena research like this is very exciting, because the more we understand the similarities (and the differences too, as you rightly point out), the more hope there is that we can design algorithms that adapt in sophisticated ways — i.e. hope that the sort of convergent evolution you allude to belies a set of universal organizing principles that can be used to our advantage.

    I am “convinced that bacteria are equivalent to computers.” Whether we will ever find the right simplifying assumptions to implement such adaptivity and complexity on our limited hardware, however, is a different story. I for one hope research like this is a step in the right direction.


  10. Hey Carl, liking the connection. Are you talking about distributed systems? Not sure what Linux is. Reminds me of the opposite of Penrose’s Emperor’s New Mind. It would question the whole of body and mind.

    Are E.coli not analogue?

  11. I read the article wrong – sorry about that. I am really tired. I see where the article is going now.

  12. ida

    All the best operating systems these days have minimal functions. They add modules which can be loaded and unloaded at any time. I guess the kernel modules do look like genes in a way. And as always, humans try to mimic nature.

  13. Steve P.

    If that turns out to be the case, biologists will have a new question to keep them busy for a long time to come: how did Linus get to be so much like Linux?

    …And we just know how evolution-denialists are going to answer this one.

    Yep, but we already know how design deniers have answered the question, right? “Hey, hey, emergence man, emergence!”

  14. […] Linux vs. E.coli
    (‘Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks’) […]

  15. I’m wondering how linguistic systems would fit into this schema. What are the ‘workhorses’ and ‘master regulators’ of language? There are many more ‘low-level’ words that refer to things than ‘higher level’ syntactic structures. This would make it like e-coli.

    On the other hand, there are relatively few ‘low level’ phonemes and very many ‘high level’ concepts. This would make it more like Linux.

    Maybe language has more ‘middle managers’ than anything else?

    Answering this may give an insight into how ‘designed’ language is, as opposed to ‘evolved’.

  16. MPL

    One thing that can happen in the evolution of a cell is the creation of new proteins outright, whereas Linux is mostly stuck working with a fixed piece of hardware (or rather, must work on a variety of hardware, over which the programmers have very little control).

    So the narrowing back down is somewhat of an inherent feature of trying to drive a very small system in many ways.

    What does the control graph look like for something where the designers can produce a wide hardware base, e.g. the network of controllers in a automobile’s electrical system, or the hardware inside a CPU?

  17. Steve

    This comparison gave me an idea for improving the linux code. One could apply evolution to linux by simulating mutations through swapping letters & numbers in the source code and benchmarking each “mutated” version then finally combining the code of the faster versions (one could use other criteria) … We could run many iterations on a supercomputer to simulate the necessary millions of years. I can imagine a lot of new linux distributions emerging from an experiment like this. Unfortunately in reality there is a higher probability for randomness to create chaos than order, so evolution is out of the question. I’m afraid Linux would go nowhere without Linus.

  18. Shubhashish

    Thanks CZ for this blog, this comparison is brilliant indeed, though I’m not sure what kind of assembly language and decoding an OS like Linux uses. But, looking at the genome side of it, this can surely help us to think in a new way about why non-coding RNAs are preserved in human and other mammalian genomes (Ponting, 2009: Cell 136) and may help us discover the why(s) of many epigenetic regulators. If E. coli can excel in this, I feel optimistic that our genome will have these capabilities too; it’s rather a matter of when and how we get to know it and start to use it for our benefit.

    I also wonder what kind of splice sites does the Linux OS have, in some ways, can they have the mutations like genomes and evolve like living beings. There certainly is a difference, but similarities are entertaining as well.



The Loom

A blog about life, past and future. Written by DISCOVER contributing editor and columnist Carl Zimmer.

About Carl Zimmer

Carl Zimmer writes about science regularly for The New York Times and magazines such as DISCOVER, which also hosts his blog, The Loom. He is the author of 12 books, the most recent of which is Science Ink: Tattoos of the Science Obsessed.

