Give or take a billion

by Eszter Hargittai on September 26, 2005

Inspired by this post on Digg, I started running searches on Google to see what would yield a really high number of results. A search on “www” yields results “of about 9,160,000,000”. This is curious given that according to Google’s homepage, the engine is “Searching 8,168,684,336 web pages”. Perhaps they are extrapolating to sites that they are not searching. Or perhaps those “of about” figures are not very accurate. In general, those numbers are hard to verify since Google won’t display more than 1000 results to any query. The figures may be helpful in establishing relative popularity, although it’s unclear whether the system can be trusted to be reliable even to that extent.

{ 1 trackback }

riting on the wall » Blog Archive » tech and usage
09.27.05 at 1:21 pm

{ 21 comments }

1

Kieran Healy 09.26.05 at 10:52 pm

I bet if you did the same search again (or on a different computer) it would give you a different number. “Give or take a billion” might not be so bad if you’re dealing with very large numbers anyway.

2

KCinDC 09.27.05 at 12:29 am

Technologies du Langage was looking into that sort of thing a while back.

3

Kenny Easwaran 09.27.05 at 12:44 am

Here’s a similar Language Log post discussing some of this stuff. It’s been impeding their “research” that they post on the blog.

4

Kenny Easwaran 09.27.05 at 1:13 am

Also, it looks like they’re taking down that number in a couple days, according to the NYTimes.

5

Alejandro 09.27.05 at 3:33 am

It seems “the” has even more hits than “www”, at least on my computer: 9,350,000,000.

6

dp 09.27.05 at 4:19 am

Véronis’ piece about missing pages is interesting – if a bit beyond my grasp.

But his remark about using the same word twice prompts a question. Does anyone provide a website that tracks the most popular site for a given set of words? For example, when I searched on ‘non’, the top result is for a comics page called non sequitur, and the following three results relate to nonprofit. These latter three results make me think that nonprofit is the most widely searched word containing ‘non’. I would not have guessed that, and am now wondering what results I’ll get for searching other words, like ‘happy’, ‘action’, ‘home’, ‘result’, and ‘liquid’.

Most of these words have ambiguous meanings, so the results should show some variation if searched over time. A larger set of such words, tracked over time, might reveal interesting patterns of meaning. It would be interesting to see what patterns show up on any given day.

7

Seth Finkelstein 09.27.05 at 6:10 am

See the Search Engine Watch article

“In talking with Google over the past weeks, it turned out that the home page count was its most conservative estimate of pages in the index. Beyond what was officially claimed were other documents, including “partially indexed” ones. These were documents that while not actually indexed might still come up in results and add toward a count.”

8

Chris 09.27.05 at 8:21 am

Crikey, what a nerdy question – who cares?
“Search engine finds more pages than it claims” – what kind of story is that??
Does the search get you what you want – generally either a specific site or the answer to a specific question? Answer – mostly yes. Does it cost me anything – no. Do I have plenty of other search engines I can use – yes.
Job done.

9

Eszter 09.27.05 at 8:44 am

Kieran, so did you try on your computer?:) What did you get? Yes, these numbers do tend to change (this computer is giving me 9,650,000,000), which adds to the concern that those numbers are unreliable.

It seems that being off by a billion is quite a bit if your total figure is just in the single billions. If you were off by a million and you’re dealing with billions, that’s another thing. (Or if you’re off by a billion, but really dealing with trillions that may be a different matter as well.)

I know there are various reasons for it and that’s fine. Part of the point of mentioning all this is to note that drawing conclusions from the numbers that come up in Google searches (or searches on other engines) may be quite problematic. People seem to use those numbers for various things yet it’s important to remember how inexact those figures can be.
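To put rough numbers on the point about scale, here is a minimal Python sketch of the relative error in each of the cases mentioned above. The totals are illustrative assumptions (the 9,160,000,000 figure from the post, plus a round trillion for the hypothetical case), not measurements.

```python
def relative_error(error, total):
    """Return the error as a fraction of the total."""
    return error / total

# (size of the error, size of the total); all values are assumed for illustration.
cases = {
    "off by a billion, total in single billions": (1_000_000_000, 9_160_000_000),
    "off by a million, total in billions": (1_000_000, 9_160_000_000),
    "off by a billion, total in trillions": (1_000_000_000, 1_000_000_000_000),
}

for label, (error, total) in cases.items():
    print(f"{label}: {relative_error(error, total):.4%}")
```

Under these assumptions the first case is an error of roughly a tenth of the total, while the other two are a small fraction of a percent.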

10

Andrew Gelman 09.27.05 at 9:23 am

We had some discussion here and here about using googlefighting to estimate the frequency of Godwin’s Law violations regarding Bush and Clinton. The consensus was that the 1000-link limit would make it a difficult statistical problem–excellent for class discussion, maybe not such an easy research tool.

Some example results:

bush hitler: 1.5 million
clinton hitler: 0.7 million

bush: 83 million
clinton: 25 million

bush mcdonalds: 440,000
clinton mcdonalds: 200,000

But I haven’t read this reference which maybe gets around the counting problem somehow.

11

Sean 09.27.05 at 9:30 am

Each search queries the high-page-rank entries first, and estimates the total number of pages found from that number. So it’s not perfectly accurate, and it can even change as you click “Next.”

“The” also gets about 9 billion, and it’s very funny to see what comes up first. Here are some other enlightening results.

12

Seth Finkelstein 09.27.05 at 9:45 am

andrew: To perhaps be tedious, mentions of Bush and Hitler don’t tell you if they’re left-wingers who mean it (“Bush is like Hitler”), or right-wingers who are crying persecution over a fiction (“Left-wingers always say Bush is like Hitler”). This is a big problem with Godwin’s Law – there’s a substantial incentive to make a phony accusation as a debate tactic (highly ironic …).

There’s an amazing amount of what I call “Googery”, abuse of meaningless numbers.

My favorite involved the size of “free porn”

13

bza 09.27.05 at 9:55 am

Also, given the explosion of blogging in the last couple of years as well as the general increase in computer use since the Clinton era, there are going to be many more mentions of Bush online, period, no matter what other search terms you use. E.g.,

bush gandhi: 1,570,000
clinton gandhi: 769,000

Running this kind of Googlefight is a valid method only when the two figures are roughly contemporaneous, and in Internet time Clinton and Bush are decidedly not so.

14

Matt Weiner 09.27.05 at 10:54 am

One way of controlling for the problem bza points out might be to search “bush” and “clinton” as denominators. But that’s still going to be too unreliable for serious work.
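As a rough illustration of the denominator idea, here is a minimal Python sketch that divides each paired count by the count for the name alone. It uses only the approximate figures quoted in comment 10; the dictionary layout and the `share` helper are hypothetical names chosen for this example, and the caveats about unreliable counts still apply.

```python
# Approximate hit counts as quoted in comment 10 (not re-verified).
BASE = {"bush": 83_000_000, "clinton": 25_000_000}
PAIRED = {
    ("bush", "hitler"): 1_500_000,
    ("clinton", "hitler"): 700_000,
    ("bush", "mcdonalds"): 440_000,
    ("clinton", "mcdonalds"): 200_000,
}

def share(name, term):
    """Fraction of a name's hits that also mention the given term."""
    return PAIRED[(name, term)] / BASE[name]

for term in ("hitler", "mcdonalds"):
    for name in ("bush", "clinton"):
        print(f"{name} {term}: {share(name, term):.2%} of all '{name}' hits")
```

With these particular figures the normalized shares come out higher for Clinton than for Bush on both terms, the reverse of what the raw counts suggest, which is one more reason not to lean on either version for serious work.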

15

Eszter 09.27.05 at 11:07 am

Since I posted this entry yesterday, Google has taken down the exact number of claimed coverage. Conveniently, however, Google caches its own homepage so I was able to retrieve a screen shot that way. Here it is.

On a different note, check out their homepage today for a cute birthday logo. (I almost never go to the homepage since I use it from the toolbar. Today the logo is not copied on search results pages so I would’ve missed it had I not been checking for their coverage estimate.)

16

joeblow 09.27.05 at 11:15 am

Reportedly, if you search for “http” you get all the pages, sorted into Page Rank order.

17

Jeff R. 09.27.05 at 11:18 am

I did a search on ‘the’ and got the same number as the original post (9,160,000,000).

Interestingly, the top three hits for ‘the’, in order, are The Onion, the White House, and The Weather Channel.

18

Tombstone 09.27.05 at 11:25 am

Although I’ve been a long time Google fan, I’ve recently begun using Yahoo search again – with great success.

I’ve found that Yahoo is crawling sites extensively. When looking for specific long phrases, coding examples, error discussions, and other detailed search strings, Yahoo is coming up as the winner.

19

Bro. Bartleby 09.27.05 at 2:07 pm

“tits” scores 57,900,000. Now what does that tell us about humans? Or I should say, what does it tell us about males?

20

Anthony 09.27.05 at 6:22 pm

Bro. Bartleby, I’m not sure what searching for tits tells you about men.

21

Bro. Bartleby 09.28.05 at 4:41 pm

Bro. Anthony, it tells us that even though long ago we each were weaned from said glands, we still have the desire to return to the comfort that said glands provided upon our entry into the world. The birth trauma was quickly forgotten when warmth and tenderness and milk comforted us, and so too in this often traumatic world, we still find comfort with mere photographs of the said glands. So to find over 57 million websites devoted to tits, I take comfort in our collective honor of the remembrance of that very first suck.

Comments on this entry are closed.