Language in the Blogosphere

by Henry Farrell on August 14, 2004

Wandering around the blogosphere, I came across this rather interesting page. It seems to be a little outdated, but it provides an approximate count of the relative importance of different languages in the blogosphere. English comes first, unsurprisingly, then French. Portuguese is third, and Farsi fourth. This may seem a little surprising to those who aren’t familiar with the proliferation of Portuguese and Farsi blogs – both linguistic communities have also made substantial inroads into social network services like “”: too. This leads to an interesting sociological question: why have these communities, and not other linguistic communities of similar size, reached takeoff in the blogosphere? Equally interesting is the lack of any Arabic-language blogs on the list. This may be a result of how the authors have seeded their survey or parsed their results – but it may also quite possibly reflect reality. As far as I know, there are fewer than 70 Iraqi blogs (many of which are in English). I’m not aware of any substantial blogging communities in other Arabic-speaking countries – but I’m happy to be enlightened if I’m wrong. The root causes may perhaps include cultural factors – but I would bet that restrictions on Internet access and poor technological infrastructures also play a very important role.



nick 08.14.04 at 12:39 am

Most of the Portuguese blogs, I imagine, are located in Brazil; one of Blogger’s first deals was with a Brazilian company.


Anon Again 08.14.04 at 3:51 am

Interesting that so many Arab blogs are in English.

is a good example, though it is an interesting English.

I think it may have to do with the availability of native language typefaces and programs and a desire to reach outside of one’s own culture.

The French feel no need. Most Arabic English Language blogs intend to reach beyond Arabs.


Omri 08.14.04 at 4:11 am

Arabs aren’t making big inroads into the net for several reasons. Arabic dialects are not mutually intelligible, and Modern Standard Arabic sucks. Connectivity isn’t great, but censorship is, and Arabic social mores also play in. You might not speak candidly on a blog about much of anything if you’re afraid that a loose comment will cost your cousin his promotion in the civil service.


digamma 08.14.04 at 4:35 am

Wouldn’t network effects have something to do with it? If you’re an Iraqi educated enough to start a blog, and you’re interested in blogs, you probably have decent English. There are a lot of good reasons for you to blog in the same language as most other blogs out there, particularly blogs by citizens of the countries currently occupying Iraq.


eszter 08.14.04 at 4:50 am

Although they give us all sorts of details about their methodology, I still feel like we don’t know enough to try to make too much of the findings. They talk about a “certainty level” attached to the sites they crawl as to whether they are really blogs or not, but it’s not fully clear how that certainty level may (or may not) be included in calculations of general blog stats and classifications (e.g. language groupings). I wonder, if we randomly picked cases from their list, how many we’d end up classifying as blogs (setting aside the issue that it’s nearly impossible to come up with a common agreement about what constitutes a blog).

German and French rank quite high, but is that at least in part thanks to having blog lists like as starting points?

Also, it seems one may want to take into consideration the number of Internet users who speak these languages or at least number of users from countries where the various languages are commonly used to get a per capita type of score. (Sure, sheer numbers are interesting and relevant as well, but if you want to explain why one language is more represented than another, supply of possible bloggers would be relevant.) Of course, having done research in this area, I know how difficult it is to come up with such figures, but I still think it’s important to be conscious about the issue.
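The per capita idea in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `blogs_per_million_users` is hypothetical, and the counts are made-up placeholders, not figures from the survey or from any user statistics.

```python
def blogs_per_million_users(blog_counts, internet_users):
    """Normalize raw blog counts by the number of Internet users per language.

    Languages with no (or zero) user estimate are skipped rather than guessed.
    """
    return {
        lang: blog_counts[lang] * 1_000_000 / internet_users[lang]
        for lang in blog_counts
        if internet_users.get(lang)
    }


# Placeholder values for illustration only; NOT figures from the survey.
blog_counts = {"lang_a": 5_000, "lang_b": 2_000}
internet_users = {"lang_a": 50_000_000, "lang_b": 4_000_000}

rates = blogs_per_million_users(blog_counts, internet_users)
# Rank languages by blogs per million users instead of by raw count.
ranked = sorted(rates, key=rates.get, reverse=True)
```

With these placeholder numbers the language with fewer blogs in absolute terms ranks first once the supply of possible bloggers is taken into account, which is the distinction the comment draws between sheer numbers and a per capita score.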


eszter 08.14.04 at 4:59 am

I think digamma makes a good point in that, depending on your intended audience, you may decide to blog in a language other than your own, so the language of a blog does not equal the primary language of the blogger (not that anyone was arguing this, I guess ;).

Also, regarding network effects, since following links was part of the methodology to find other blogs (very reasonable, of course), you’re likely going to find more of the same language blogs since those in language x will likely link to other blogs in language x. This was the point I was making above regarding the possible effects of something like

I realize the people behind the project have done a lot to circumvent relying solely on links from blogs they already know about, but this like-links-to-like issue could still have implications for what blogs are found and included in the stats in the end.


Alan K. Henderson 08.14.04 at 6:13 am

The Farsi end of the blogosphere is no doubt being fed by the huge dissident population in Iran.

Are most of the Portuguese-language blogs in Portugal or in South America? What percentage of French-language blogs are outside France?


Peter Murphy 08.14.04 at 8:46 am

The methodology is broken, I believe. They rely on a Perl script called “textcat”, which I believe has not been updated since 1998. The textcat site states that some languages are only supported in certain encodings. The ensuing confusion between the language and the character encoding is shown clearly by including Chinese twice – once as “Chinese-gb2312” and once as “Chinese-big5”.

A better methodology would be to take notice of the character encoding markup (which the poll does not do), translate it to a common Unicode base (which should represent everything!), and then do the trigraph analysis there. You could do it with Perl, but Python is more international-friendly.

At least the methodology is not as broken as the map. Frankly, I never knew there were so many people blogging from the Somalian continental shelf. Now I do.
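The alternative described in this comment (honor the declared character encoding, convert everything to Unicode, then compare character trigrams) can be sketched roughly as follows. This is a minimal illustration, not the survey's code and not textcat itself: the function names are hypothetical, the rank-distance comparison is one common way of comparing n-gram profiles, and real language profiles would need far more training text than a sentence or two.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Return the most frequent character trigrams of a Unicode string, ranked."""
    # Normalize case and whitespace, and pad so word edges form trigrams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top_n)]


def out_of_place(doc_profile, lang_profile):
    """Rank distance between two profiles; lower means more similar."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # charged for trigrams the language lacks
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect_language(raw_bytes, declared_encoding, lang_profiles):
    """Decode bytes using the declared encoding, then pick the closest profile.

    lang_profiles maps a language label to a profile built by trigram_profile.
    Because comparison happens on decoded Unicode text, one language needs
    only one profile no matter which byte encoding a page uses.
    """
    text = raw_bytes.decode(declared_encoding, errors="replace")
    doc_profile = trigram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )
```

In use, one would build `lang_profiles` from sample text for each language, read the encoding from the page's markup or HTTP headers, and pass both to `detect_language`. The same text encoded as Latin-1 or as UTF-8 then yields the same answer, since the encoding is resolved before any trigram counting happens.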


phnk 08.14.04 at 9:37 am

Alas, it looks like their data is completely wrong. The low scores for Japanese and Chinese should throw most readers into doubt.

According to Global Reach statistics (in millions of speakers):
– English: 230
– Non-English: 403
– incl. Chinese: 40
– incl. Japanese: 61
– incl. Spanish: 41

These results are far more logical: broad and cheap access to the Internet arrived much earlier in Japan than in France. The demographic potential of China is huge, while French represents part of Switzerland, part of France (digital divide), and Quebec.

Those stats come from the following WIPO report. You can find them in other UNCTAD reports.


Henry 08.14.04 at 6:15 pm

I think Eszter and Digamma’s concerns are to be taken seriously – the concerns about Unicode are real (although I think it’s fair enough to report Chinese twice, if two different forms of character coding are involved). I also suspect that Chinese has experienced an enormous surge since the survey was conducted. However, I simply don’t see how the WIPO survey proves or disproves anything – it’s not concerned with bloggers. Also, I think that the findings for high representation of Farsi and Portuguese are very likely robust. If you look at the Technorati top 100, they are pretty well the only foreign languages represented. Further, as I mentioned in the post, has been disproportionately colonized by Portuguese and Farsi speakers. This suggests to me that, as one of the commenters noted, uptake of blogging is largely a community thing – if there is a substantial community of individuals out there who you can speak to, it becomes a self-perpetuating phenomenon driven by positive feedback. And it is interesting that, say, Portuguese bloggers have a bigger presence in the blogosphere than German-language bloggers. Some of it is clearly due to early innovations in Unicode – Hossein Derakhshan’s early Farsi adaptation of Unicode played a huge role in getting Iranians to start blogging.

The discussion of Arabic-language blogging in the comments is also quite interesting. I frankly suspect that cultural factors are less important than the key variables: poor access to infrastructure, and the risks of government punishment for either (a) ‘political’ blogging, or (b) ‘personal’ blogging by people with active sex lives outside of marriage etc. Government filters which prevent access to blogspot, typepad and other popular blogging tools also play a part.


David Tiley 08.14.04 at 7:21 pm

This is anecdotal but intriguing. I have been told that one reason for the prevalence of Portuguese on the internet is the strong contact maintained between former Portuguese colonies. And it may be that a lot of Farsi blogging is within the US as the community develops its interconnections.


phnk 08.14.04 at 9:32 pm

The WIPO stats do not focus on blogs, indeed – but don’t you find it strange that Asian languages score so badly?

The difference in alphabets is hiding many blogs from Technorati, and many pages from Google. The Internet was built around the Latin (Roman) alphabet, and although Unicode and new algorithms are meant to help all languages coexist, there is still a gap between Cyrillic, Hellenic, Asian and Western encodings.

The point about Portuguese colonies is a nice and curious anecdote (although I’m not sure Angola and Mozambique blog a lot, plus the digital divide is terrible in Brazil).

I am just trying to point out that the survey discussed here builds on technical mistakes that can be at least spotted, at best proved, by a bit of logic. The Asian community (which is extremely walled off, on the Web as IRL) is underestimated; I do not (cannot) believe the numerical proxies given by the survey.


john b 08.16.04 at 12:32 pm

More evidence for the irrelevance of the WIPO stats in this context: Japan doesn’t have widespread cheap web access in the sense that a US or UK observer would understand. The popular medium is via mobile handsets, with landline access a minority sport.

This makes the lack of blogging unsurprising – blogs aren’t a good small-screen crap-keypad medium. While I’ve made one or two blog posts via my cellphone, this is more for curiosity than ease of use.


Des von Bladet 08.16.04 at 6:54 pm

Nobody has mentioned Russian yet, but I am told (I don’t savvy the lingo) that all the Russkophones are on LiveJournal, and that this is usually beneath the radar for these kinds of undertakings.


phnk 08.17.04 at 2:22 am

Japan’s “minority sport” :

Check Figs 8.6/8.7.

