<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Searching, Marketing, and Statistics &#187; Probability</title>
	<atom:link href="http://chuckpaulson.com/category/probability/feed/" rel="self" type="application/rss+xml" />
	<link>http://chuckpaulson.com</link>
	<description>A collection of thoughts, essays, and reviews</description>
	<lastBuildDate>Tue, 30 Jun 2009 20:40:17 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Estimating the Size of Google</title>
		<link>http://chuckpaulson.com/2009/06/30/estimating-the-size-of-google/</link>
		<comments>http://chuckpaulson.com/2009/06/30/estimating-the-size-of-google/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 20:34:12 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Probability]]></category>
		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://chuckpaulson.com/?p=5</guid>
		<description><![CDATA[Ever wonder how many pages are indexed by Google? Getting an exact answer is impossible, but with a little bit of statistics and some common sense we can get an estimate.
Let’s suppose we have two words A and B. Among all pages indexed by Google, let’s say that the number of pages that contain the [...]]]></description>
			<content:encoded><![CDATA[<p>Ever wonder how many pages are indexed by Google? Getting an exact answer is impossible, but with a little bit of statistics and some common sense we can get an estimate.</p>
<p>Let’s suppose we have two words A and B. Among all pages indexed by Google, let’s say that the number of pages that contain the word A is N(A) and the number of pages that contain the word B is N(B). If the total number of pages in Google is N then the probability of a random page having the word A is N(A)/N</p>
<p>(1) P(A) = N(A)/N</p>
<p>and for word B it is N(B)/N.</p>
<p>(2) P(B) = N(B)/N</p>
<p>Similarly, if N(AB) is the number of pages that contain both A and B, then the probability that a random page contains both words is N(AB)/N.</p>
<p>(3) P(AB) = N(AB)/N</p>
<p>In our case, let’s assume that words A and B are independent of each other. That is, the fact that A appears on a web page in no way influences whether B appears on the same web page. For many words that go together, such as “real estate”, this is clearly false. But for now let’s suppose we can find such a pair of words. Then we have</p>
<p>(4) P(AB) = P(A) * P(B)</p>
<p>where P(AB) is the probability that a page has both word A and word B, P(A) is the probability that a page has word A, and similarly for P(B).</p>
<p>Putting equations (1), (2), and (3) into (4) we end up with</p>
<p>N(AB)/N = N(A)/N * N(B)/N</p>
<p>Multiply both sides by N^2, divide by N(AB), and we end up with</p>
<p>(5) N = (N(A) * N(B))/N(AB)</p>
<p>What this equation means is that, if we can find a pair of words that occur independently of each other, then we can use Google to find N(A) (the number of pages that contain word A) and N(B) (the number of pages that contain word B) and N(AB) (the number of pages containing both word A and B). Once we find those 3 numbers from Google, then we can estimate N, the number of pages indexed by Google.</p>
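<p>As a rough check on equation (5), here is a minimal Python sketch that builds a synthetic index in which the two words really are independent and then tries to recover its size from the three counts. The function names, the page count, and the word probabilities are all made up for illustration; nothing here queries Google.</p>

```python
import random


def estimate_index_size(n_a, n_b, n_ab):
    """Equation (5): N = N(A) * N(B) / N(AB)."""
    if n_ab == 0:
        raise ValueError("N(AB) must be positive")
    return n_a * n_b / n_ab


def simulate_counts(n_pages, p_a, p_b, seed=0):
    """Count pages containing A, B, and both in a synthetic index
    where each word appears independently of the other."""
    rng = random.Random(seed)
    n_a = n_b = n_ab = 0
    for _ in range(n_pages):
        has_a = p_a > rng.random()
        has_b = p_b > rng.random()
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    return n_a, n_b, n_ab


# Hypothetical index size and word frequencies, chosen only for illustration.
true_n = 200_000
n_a, n_b, n_ab = simulate_counts(true_n, p_a=0.2, p_b=0.1)
estimate = estimate_index_size(n_a, n_b, n_ab)
print(f"true N = {true_n:,}, estimated N = {estimate:,.0f}")
```

<p>Because the counts are random, the estimate should land close to the true size but not exactly on it.</p>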
<p>However, there are some caveats.</p>
<ol>
<li>We’ll never really know if words A and B are independent.</li>
<li>Google’s count of the number of pages containing a given word is just an estimate.</li>
<li>If word A or word B doesn’t occur very often, then N(AB) is very small, and because N(AB) is the denominator of equation (5) the estimate becomes very sensitive to it. Small changes to N(AB) can change the estimate wildly.</li>
</ol>
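<p>The first and third caveats can be illustrated with a short Python sketch. It assumes a synthetic index with made-up probabilities in which word B is three times as likely on pages that contain word A, and then a pair of hypothetical rare words whose counts are invented for the example. Only the large counts at the end come from the (baby, quantum) row of the table.</p>

```python
import random


def estimate_index_size(n_a, n_b, n_ab):
    """Equation (5): N = N(A) * N(B) / N(AB)."""
    return n_a * n_b / n_ab


# Caveat 1: dependent words. Word B appears on 30% of pages that
# contain A but only 10% of pages that do not (made-up probabilities).
rng = random.Random(0)
true_n = 200_000
n_a = n_b = n_ab = 0
for _ in range(true_n):
    has_a = 0.2 > rng.random()
    has_b = (0.3 if has_a else 0.1) > rng.random()
    n_a += has_a
    n_b += has_b
    n_ab += has_a and has_b
dependent_estimate = estimate_index_size(n_a, n_b, n_ab)
print(f"true N = {true_n:,}, estimate with dependent words = {dependent_estimate:,.0f}")

# Caveat 3: rare words. Hypothetical counts, chosen only for illustration.
rare_a, rare_b = 40_000, 25_000
rare_estimates = {k: estimate_index_size(rare_a, rare_b, k) for k in (1, 2, 3)}
for k, n in rare_estimates.items():
    print(f"N(AB) = {k}: N = {n:,.0f}")

# The same one-page change barely matters when N(AB) is in the millions.
big = estimate_index_size(707_000_000, 106_000_000, 4_030_000)
big_plus_one = estimate_index_size(707_000_000, 106_000_000, 4_030_001)
relative_change = abs(big - big_plus_one) / big
print(f"relative change for one extra page: {relative_change:.2e}")
```

<p>Words that tend to appear together push the estimate well below the true size. For the rare words, going from one co-occurring page to two cuts the estimate in half, while one extra page out of about four million barely moves it.</p>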
<p>Here’s an example I did with 3 word pairs: (jackson, centauri), (computer, algae), and (baby, quantum). The first 2 are fairly consistent, at about 5.5 billion and 6.2 billion pages indexed. However, the third pair goes all the way up to about 18.6 billion pages. Words that tend to appear together inflate N(AB) and pull the estimate down, so, given the caveats mentioned earlier, I think it’s probably safe to say that the third pair is the most independent word pair and that there are at least 18 billion pages indexed by Google.</p>
<table border="1">
<caption>Google result counts and the estimate N = N(A) * N(B) / N(AB)</caption>
<tbody>
<tr align="center">
<td>Word A</td>
<td>N(A)</td>
<td>Word B</td>
<td>N(B)</td>
<td>N(AB)</td>
<td>N</td>
</tr>
<tr align="center">
<td>jackson</td>
<td align="right">232,000,000</td>
<td>centauri</td>
<td align="right">1,830,000</td>
<td align="right">77,700</td>
<td align="right">5,464,092,664</td>
</tr>
<tr align="center">
<td>computer</td>
<td align="right">1,030,000,000</td>
<td>algae</td>
<td align="right">11,900,000</td>
<td align="right">1,990,000</td>
<td align="right">6,159,296,482</td>
</tr>
<tr align="center">
<td>baby</td>
<td align="right">707,000,000</td>
<td>quantum</td>
<td align="right">106,000,000</td>
<td align="right">4,030,000</td>
<td align="right">18,596,029,777</td>
</tr>
</tbody>
</table>
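<p>The N column can be reproduced from the other columns with equation (5). The Python sketch below uses the counts from the table as they stand; the function name and the choice to round to the nearest whole page are assumptions made for the example.</p>

```python
# (word A, N(A), word B, N(B), N(AB)) copied from the table above.
rows = [
    ("jackson", 232_000_000, "centauri", 1_830_000, 77_700),
    ("computer", 1_030_000_000, "algae", 11_900_000, 1_990_000),
    ("baby", 707_000_000, "quantum", 106_000_000, 4_030_000),
]


def estimate_index_size(n_a, n_b, n_ab):
    """Equation (5), rounded to the nearest whole page.

    Integer arithmetic keeps the result exact for counts this large.
    """
    return (n_a * n_b + n_ab // 2) // n_ab


estimates = [
    estimate_index_size(n_a, n_b, n_ab) for _, n_a, _, n_b, n_ab in rows
]
for (word_a, _, word_b, _, _), n in zip(rows, estimates):
    print(f"({word_a}, {word_b}): N = {n:,}")
```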
]]></content:encoded>
			<wfw:commentRss>http://chuckpaulson.com/2009/06/30/estimating-the-size-of-google/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

