Yes, it’s the return of Friday Fisking. My first target is a chap calling himself William Wallace over at ERV. Some time ago, William left a comment in which he expressed skepticism that ERV data is justifiably used to support common descent. About a week later, he announced that he had made a model based on random insertion, and then asked for some help in creating an equivalent common descent model, with an aside that the results of the random model “doesn’t look good for your side.” This announcement was met with great derision, with calls for William to explain his conclusion. I directly addressed his question about his common descent model, pointing out where his assumption was incorrect. Even so, I remained puzzled as to how he achieved the results he claimed for both models.
Recently, the model came up again in relation to another post about ERVs. Although William refused to give any details about his model, he claimed that his results were unsurprising if you understood the math. He eventually offered to give me details via email. After a series of exchanges, I received enough information to be confident about the basics of his model.
At issue was whether, using 14 ERVs, a nested hierarchy can be created solely by random insertion, without common descent. What William did was generate a series of datasets using a pseudo-random number generator. Each dataset consisted of ten species, each assigned what he calls an ERV ID hash. Basically, for each species, it sounds like he is using the pRNG to generate an integer from 0 to 16,383. Those familiar with binary numbers should immediately see what is going on. That range of integers can be represented using 14 bits. Each bit, therefore, represents whether or not a given ERV is present in the species. For example, say the pRNG picks 14,849. Let's convert that to binary, reading the bits left to right as ERVs 1 through 14: 11101000000001
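For readers who want to check the conversion, here is a minimal Python sketch. The function name `ervs_present` and the convention that the leftmost bit stands for ERV 1 are my assumptions, chosen so that the result agrees with the ERV list given for this example; they are not taken from William's code.

```python
def ervs_present(erv_hash, n_ervs=14):
    """Return the 1-based ERV numbers whose bit is set.

    Assumption: the leftmost of the n_ervs bits is ERV 1.
    """
    # Zero-pad to n_ervs binary digits so every ERV has a position.
    bits = format(erv_hash, f"0{n_ervs}b")
    return [i for i, bit in enumerate(bits, start=1) if bit == "1"]


print(format(14849, "014b"))   # 11101000000001
print(ervs_present(14849))     # [1, 2, 3, 5, 14]
```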
This species would have ERVs 1, 2, 3, 5, and 14, but none of the others. Repeat this for the remaining species, and you have one dataset. William sent me a sample output of his model:
ERV ID | Species 0-9         | (hits)
-------|---------------------|-------
1      | 0 . 2 3 4 5 . 7 . 9 | (7)
2      | 0 . 2 3 4 5 . 7 . 9 | (7)
3      | 0 . 2 3 4 5 . 7 . 9 | (7)
4      | . . 2 3 4 5 . 7 . 9 | (6)
5      | . . 2 3 4 5 . 7 . 9 | (6)
6      | . . . 3 4 5 . 7 . 9 | (5)
7      | . . . 3 4 5 . 7 . 9 | (5)
8      | . . . . 4 5 . 7 . 9 | (4)
9      | . . . . 4 5 . 7 . . | (3)
10     | . . . . 4 . . 7 . . | (2)
11     | . . . . 4 . . 7 . . | (2)
12     | . . . . 4 . . . . . | (1)
13     | . . . . 4 . . . . . | (1)
14     | . . . . 4 . . . . . | (1)
As you can see, a distinct nested hierarchy is present. This result is not surprising, nor is his admission that he had to look closely to find this hierarchy (chosen because it mimics the seven-species nested hierarchy he was emulating) and others in the data. With small numbers of species and binary traits, nested hierarchies can occasionally arise from purely random assortments, but the probability drops quickly as the number of species and traits increases. This is well known, and it is part of the reason phylogenies, particularly those based on morphological features, try to include as many traits as possible. It is also why we talk about consensus phylogenies: we compare numerous phylogenies and try to find the best fit.
So William’s results are, on their face, unsurprising. They also illustrate that we need to exercise some caution when discussing phylogenies. Then again, this is not particularly noteworthy, and it’s nothing we didn’t already have a firm grasp on.
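The drop-off with size can be explored with a rough Monte Carlo sketch like the one below. It is not William's model. It assumes each species gets an independent random bit string, and it assumes a "nested hierarchy" means that for any two ERVs, the sets of species carrying them are either nested or disjoint. All function names and the sizes tried are hypothetical choices for illustration.

```python
import random
from itertools import combinations


def is_nested_hierarchy(trait_sets):
    """True if every pair of species sets is nested or disjoint (assumed definition)."""
    for a, b in combinations(trait_sets, 2):
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True


def random_dataset(n_species, n_traits, rng):
    """Give each species a random n_traits-bit integer and return,
    for each trait, the set of species whose bit is set."""
    hashes = [rng.getrandbits(n_traits) for _ in range(n_species)]
    return [
        frozenset(s for s, h in enumerate(hashes) if (h >> t) & 1)
        for t in range(n_traits)
    ]


def nested_fraction(n_species, n_traits, trials=5000, seed=0):
    """Estimate the fraction of random datasets that pass is_nested_hierarchy."""
    rng = random.Random(seed)
    hits = sum(
        is_nested_hierarchy(random_dataset(n_species, n_traits, rng))
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Illustrative sizes only; the last pair matches the 10 species x 14 ERVs setup.
    for n_species, n_traits in [(3, 3), (4, 4), (5, 5), (6, 6), (10, 14)]:
        print(n_species, n_traits, nested_fraction(n_species, n_traits))
```

Note that this tests the whole dataset at once. Searching each dataset for a subset of species and ERVs that happens to nest, as William describes doing, will turn up hierarchies far more often.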
The problem for William is that this doesn’t actually model what he claims it models. Contrary to his claim, this does not model random insertion of ERVs. An early response to his announcement nailed his error:
And Willy, what assumptions are you making about _where ERVs integrate in the genome_?
The answer is, William assumed that each of the 14 ERVs can only insert into a single location. But that is not true. ERVs randomly insert into the genome, though there is often a bias for where a certain ERV can insert. But there are millions and even billions of potential sites. When we talk about different species having the same ERVs, they are not considered the same ERV unless they are sited in the exact same spot as well as being nearly identical in structure. And there’s some hidden information in that last sentence. How do we know that these insertions are in the exact same spot? Because we have already done a phylogeny, one in which almost all the genome matches, and found some spots that differ in a specific manner. His model, if he wished to make it even somewhat realistic, should have used 50,000 bits instead of 14. The probability of finding a nested hierarchy with that many bits is astronomically low when all you have is random insertion, let alone one that matches the consensus.
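To put a rough number on how unlikely a shared site is under random insertion alone, here is a back-of-the-envelope sketch. It assumes every site is equally likely, which is only an approximation given the insertion bias mentioned above. The figure of one million sites is an illustrative value at the low end of the range, not a measurement, and the function name is hypothetical.

```python
import math


def log10_prob_same_site(n_sites, n_species):
    """log10 of the chance that independent insertions in n_species species
    all land on the same one of n_sites equally likely sites.

    Assumption: uniform choice of site, with no insertion bias.
    """
    if n_sites < 1 or n_species < 1:
        raise ValueError("n_sites and n_species must be positive")
    # The first insertion may land anywhere; each of the others must match it.
    return -(n_species - 1) * math.log10(n_sites)


if __name__ == "__main__":
    # Illustrative values only: 1,000,000 sites, 2 to 7 species.
    for k in range(2, 8):
        print(k, "species: about 10^%.0f" % log10_prob_same_site(10**6, k))
```

Under these assumptions, each additional species sharing the site multiplies the probability by another factor of one in a million, and that is for a single ERV.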
That he used only 14 bits also explains why ‘recent’ insertions were dominating ‘early’ insertions. When there are only 14 places to mutate, it doesn’t take long for a back-mutation to occur.
A model is only as good as its assumptions. Unfortunately for William, his assumptions were so erroneous that they rendered his model useless for its intended purpose.