<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[Azimuth]]></provider_name><provider_url><![CDATA[https://johncarlosbaez.wordpress.com]]></provider_url><author_name><![CDATA[John Baez]]></author_name><author_url><![CDATA[https://johncarlosbaez.wordpress.com/author/johncarlosbaez/]]></author_url><title><![CDATA[The Mathematics of Biodiversity (Part&nbsp;2)]]></title><type><![CDATA[link]]></type><html><![CDATA[<p>How likely is it that the next thing we see is one of a brand new kind?   That sounds like a hard question.  <a href="https://johncarlosbaez.wordpress.com/2012/06/21/the-mathematics-of-biodiversity-part-1/">Last time</a> I told you about the Good&#8211;Turing rule for answering this question.   </p>
<p>The discussion that blog entry triggered has been very helpful!   Among other things, it got Lou Jost more interested in this subject.  Two days ago, he showed me the following simple argument for the Good&#8211;Turing estimate.</p>
<p>Suppose there are finitely many species of orchid.  Suppose the fraction of orchids belonging to the <img src='https://s0.wp.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='i' title='i' class='latex' />th species is <img src='https://s0.wp.com/latex.php?latex=p_i.&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p_i.' title='p_i.' class='latex' /></p>
<p>Suppose we start collecting orchids.  Suppose each time we find one, the chance that it&#8217;s an orchid of the <img src='https://s0.wp.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='i' title='i' class='latex' />th species is <img src='https://s0.wp.com/latex.php?latex=p_i.&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p_i.' title='p_i.' class='latex' />   Of course this is not true in reality!  For example, it&#8217;s harder to find a tiny orchid, like this:  </p>
<div align="center"><a href="http://en.wikinews.org/wiki/American_botanist_Lou_Jost_discovers_world%27s_smallest_orchid"><img width="50" src="https://i1.wp.com/upload.wikimedia.org/wikipedia/commons/thumb/8/85/Platystele_P5313ruler2.jpg/220px-Platystele_P5313ruler2.jpg" /></a></div>
<p>than a big one.  But never mind.</p>
<p>Say we collect a total of <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N' title='N' class='latex' /> orchids.   What is the probability that we find no orchids of the <img src='https://s0.wp.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='i' title='i' class='latex' />th species?  It is</p>
<p><img src='https://s0.wp.com/latex.php?latex=%281+-+p_i%29%5EN&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='(1 - p_i)^N' title='(1 - p_i)^N' class='latex' />  </p>
<p>Similarly, the probability that we find exactly one orchid of the <img src='https://s0.wp.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='i' title='i' class='latex' />th species is</p>
<p><img src='https://s0.wp.com/latex.php?latex=N+p_i+%281+-+p_i%29%5E%7BN-1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N p_i (1 - p_i)^{N-1}' title='N p_i (1 - p_i)^{N-1}' class='latex' />  </p>
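<p>And so on: these are the first two terms of a binomial distribution with <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N' title='N' class='latex' /> trials and success probability <img src='https://s0.wp.com/latex.php?latex=p_i.&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p_i.' title='p_i.' class='latex' /></p>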
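<p>To make the pattern concrete, here is a minimal Python sketch of the general term: the chance of finding exactly k orchids of one species.  The function name <code>prob_exactly_k</code> and the numbers used are hypothetical, picked only for illustration.</p>

```python
from math import comb

def prob_exactly_k(p_i, N, k):
    """Chance of finding exactly k orchids of a species with frequency p_i among N."""
    return comb(N, k) * p_i**k * (1 - p_i) ** (N - k)

p_i, N = 0.02, 100  # hypothetical values, not taken from any real survey

none_found = prob_exactly_k(p_i, N, 0)  # same as (1 - p_i)**N
one_found = prob_exactly_k(p_i, N, 1)   # same as N * p_i * (1 - p_i)**(N - 1)
total = sum(prob_exactly_k(p_i, N, k) for k in range(N + 1))  # all terms add up to 1

print(none_found, one_found, total)
```

<p>The cases k = 0 and k = 1 reproduce the two formulas above, and the terms sum to 1 because some number of orchids of that species, possibly zero, must be found.</p>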
<p>Let <img src='https://s0.wp.com/latex.php?latex=n_1&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n_1' title='n_1' class='latex' /> be the expected number of <b>singletons</b>: species for which we find exactly one orchid.  By linearity of expectation, this is the sum over all species of the probability of finding exactly one orchid of that species:</p>
<p><img src='https://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B+n_1+%3D+%5Csum_i+N+p_i+%281+-+p_i%29%5E%7BN-1%7D+%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;displaystyle{ n_1 = &#92;sum_i N p_i (1 - p_i)^{N-1} }' title='&#92;displaystyle{ n_1 = &#92;sum_i N p_i (1 - p_i)^{N-1} }' class='latex' /></p>
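<p>As a sanity check on this formula, the following sketch compares it with a brute-force simulation.  The species frequencies, the sample size and the function names are all assumptions made up for the example.</p>

```python
import random
from collections import Counter

def expected_singletons(p, N):
    """Exact n_1: sum over species of N * p_i * (1 - p_i)**(N - 1)."""
    return sum(N * p_i * (1 - p_i) ** (N - 1) for p_i in p)

def simulated_singletons(p, N, trials, seed=0):
    """Average number of species seen exactly once over many samples of size N."""
    rng = random.Random(seed)
    species = range(len(p))
    total = 0
    for _ in range(trials):
        counts = Counter(rng.choices(species, weights=p, k=N))
        total += sum(1 for c in counts.values() if c == 1)
    return total / trials

p = [0.4, 0.3, 0.15, 0.1, 0.05]  # hypothetical species frequencies, summing to 1
N = 10                           # hypothetical sample size

exact = expected_singletons(p, N)
estimate = simulated_singletons(p, N, trials=20000)
print(exact, estimate)
```

<p>With enough trials the two numbers should agree to within sampling noise.</p>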
<p>Let <img src='https://s0.wp.com/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='D' title='D' class='latex' /> be the <b>coverage deficit</b>: the expected fraction of the total population consisting of species that remain undiscovered.  Given our assumptions, this is the same as the chance that the <i>next</i> orchid we find will be of a brand new species.</p>
<p>Then </p>
<p><img src='https://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B+D+%3D+%5Csum_i+p_i+%281-p_i%29%5EN+%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;displaystyle{ D = &#92;sum_i p_i (1-p_i)^N }' title='&#92;displaystyle{ D = &#92;sum_i p_i (1-p_i)^N }' class='latex' /></p>
<p>since <img src='https://s0.wp.com/latex.php?latex=p_i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='p_i' title='p_i' class='latex' /> is the fraction of orchids belonging to the <img src='https://s0.wp.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='i' title='i' class='latex' />th species and <img src='https://s0.wp.com/latex.php?latex=%281-p_i%29%5EN+&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='(1-p_i)^N ' title='(1-p_i)^N ' class='latex' /> is the chance that this species remains undiscovered.</p>
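<p>The same kind of check works for the coverage deficit.  This is only a sketch under the sampling assumption made above, with hypothetical frequencies and function names: it averages the total frequency of the species that a sample misses and compares the result with the formula.</p>

```python
import random

def coverage_deficit(p, N):
    """Exact D: sum over species of p_i * (1 - p_i)**N."""
    return sum(p_i * (1 - p_i) ** N for p_i in p)

def simulated_deficit(p, N, trials, seed=0):
    """Average total frequency of the species missing from a sample of size N."""
    rng = random.Random(seed)
    species = range(len(p))
    total = 0.0
    for _ in range(trials):
        seen = set(rng.choices(species, weights=p, k=N))
        total += sum(p[i] for i in species if i not in seen)
    return total / trials

p = [0.4, 0.3, 0.15, 0.1, 0.05]  # hypothetical species frequencies, summing to 1
N = 10                           # hypothetical sample size

exact_D = coverage_deficit(p, N)
estimated_D = simulated_deficit(p, N, trials=20000)
print(exact_D, estimated_D)
```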
<p>Lou Jost pointed out that the formulas for <img src='https://s0.wp.com/latex.php?latex=n_1&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n_1' title='n_1' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='D' title='D' class='latex' /> are very similar!  In particular, </p>
<p><img src='https://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B+%5Cfrac%7Bn_1%7D%7BN%7D+%3D+%5Csum_i+p_i+%281+-+p_i%29%5E%7BN-1%7D+%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;displaystyle{ &#92;frac{n_1}{N} = &#92;sum_i p_i (1 - p_i)^{N-1} } ' title='&#92;displaystyle{ &#92;frac{n_1}{N} = &#92;sum_i p_i (1 - p_i)^{N-1} } ' class='latex' /></p>
<p>should be very close to </p>
<p><img src='https://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B+D+%3D+%5Csum_i+p_i+%281+-+p_i%29%5EN+%7D+&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;displaystyle{ D = &#92;sum_i p_i (1 - p_i)^N } ' title='&#92;displaystyle{ D = &#92;sum_i p_i (1 - p_i)^N } ' class='latex' /></p>
<p>when <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N' title='N' class='latex' /> is large.  So, we should have</p>
<p><img src='https://s0.wp.com/latex.php?latex=%5Cdisplaystyle%7B+D+%5Capprox+%5Cfrac%7Bn_1%7D%7BN%7D+%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;displaystyle{ D &#92;approx &#92;frac{n_1}{N} }' title='&#92;displaystyle{ D &#92;approx &#92;frac{n_1}{N} }' class='latex' /></p>
<p>In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of the orchids collected so far that are singletons, meaning the only specimen of their species in our sample.</p>
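<p>One way to see how close the two quantities are is to subtract the formulas directly.  The short calculation below is a sketch added for illustration, using only the two formulas above; it is not claimed to be the sharpest bound available.</p>

```latex
\[
\frac{n_1}{N} - D
  = \sum_i p_i (1 - p_i)^{N-1} - \sum_i p_i (1 - p_i)^{N}
  = \sum_i p_i (1 - p_i)^{N-1} \bigl( 1 - (1 - p_i) \bigr)
  = \sum_i p_i^2 (1 - p_i)^{N-1} \ge 0 .
\]
% For 0 \le p \le 1 the function p (1 - p)^{N-1} is largest at p = 1/N, so
\[
p_i (1 - p_i)^{N-1} \le \frac{1}{N} \Bigl( 1 - \frac{1}{N} \Bigr)^{N-1} \le \frac{1}{N},
\qquad\text{hence}\qquad
0 \le \frac{n_1}{N} - D \le \frac{1}{N} \sum_i p_i = \frac{1}{N} .
\]
```

<p>So, for the expected values defined above, the singleton fraction never underestimates the coverage deficit, and it overestimates it by at most 1/N.</p>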
<p>Of course it would be nice to turn these &#8216;shoulds&#8217; into precise theorems!  Theorem 1 in this paper does that:</p>
<p>&bull; David McAllester and Robert E. Schapire, <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.7209&amp;rep=rep1&amp;type=pdf" rel="nofollow">On the convergence rate of Good&#8211;Turing estimators</a>, February 17, 2000.</p>
<p>By the way: the only difference between the formulas for <img src='https://s0.wp.com/latex.php?latex=n_1%2FN&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n_1/N' title='n_1/N' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='D' title='D' class='latex' /> is that the first contains the exponent <img src='https://s0.wp.com/latex.php?latex=N-1%2C&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N-1,' title='N-1,' class='latex' /> while the second contains the exponent <img src='https://s0.wp.com/latex.php?latex=N.&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='N.' title='N.' class='latex' />  So, Lou Jost&#8217;s argument is a version of <a href="https://johncarlosbaez.wordpress.com/2012/06/21/the-mathematics-of-biodiversity-part-1/#comment-15963">Boris Borcic&#8217;s &#8216;time-reversal&#8217; idea</a>:</p>
<blockquote><p>
Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.
</p></blockquote>
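<p>The time-reversed picture is easy to try out.  In the sketch below, the sample and the function names are hypothetical: removing a specimen changes the number of species exactly when that specimen is a singleton, so the chance of a change equals the singleton fraction.</p>

```python
import random
from collections import Counter

def removal_change_probability(sample):
    """Chance that deleting one uniformly random specimen lowers the species count."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def simulated_removal(sample, trials, seed=0):
    """Estimate the same chance by actually removing random specimens."""
    rng = random.Random(seed)
    before = len(set(sample))
    changed = 0
    for _ in range(trials):
        j = rng.randrange(len(sample))
        after = len(set(sample[:j] + sample[j + 1:]))
        changed += after != before
    return changed / trials

# Hypothetical sample of 10 orchids; species b, d and f are singletons.
sample = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]

exact_chance = removal_change_probability(sample)
estimated_chance = simulated_removal(sample, trials=20000)
print(exact_chance, estimated_chance)
```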
]]></html><thumbnail_url><![CDATA[https://i1.wp.com/upload.wikimedia.org/wikipedia/commons/thumb/8/85/Platystele_P5313ruler2.jpg/220px-Platystele_P5313ruler2.jpg?fit=440%2C330]]></thumbnail_url><thumbnail_height><![CDATA[313]]></thumbnail_height><thumbnail_width><![CDATA[220]]></thumbnail_width></oembed>