
English@Web: an historical bias

What is the real percentage of English on the Web?

Are you surprised to see our results placing the percentage of web pages in English at around 20%, when the media and search engine answers report values between 50% and 60%?

If so, you are witnessing deep and long-standing misinformation that deserves attention and scrutiny. This page offers the information needed to better understand the issues at stake.

This misinformation is the result of a combination of factors:

  • A biased source: W3Techs.
  • The fact that this source is a very competent company offering useful and reliable statistics on web technologies, which also offers, on the side, statistics on content per language.
  • The fact that this source has been operating since 2011 and was entirely alone in providing data on content per language until 2017.
  • The marketing and media success of this source, amplified by the ranking systems of search engines, Wikipedia and Statista, without the required prudence in the last two cases.
  • The comparatively low visibility, relative to W3Techs, of our work and website (except in scientific search engines such as Google Scholar).

W3TECHS: a biased source when referring to languages

W3Techs is a company which offers information about the usage of various types of technologies on the web and claims to provide the most reliable, the most extensive and the most relevant source of information on web technology usage.

This claim of reliability is largely justified, except for one item, which is not exactly a web technology: the percentage of languages in content. The method used by W3Techs is to crawl a sample of websites daily and survey the presence of the different web technologies analyzed:

Content Management
Server-side Languages
Client-side Languages
JavaScript Libraries
CSS Frameworks
Web Servers
Web Panels
Operating Systems
Web Hosting
Data Centers
Reverse Proxies
DNS Servers
Email Servers
SSL Certificate Authorities
Content Delivery
Traffic Analysis Tools
Advertising Networks
Tag Managers
Social Widgets
Site Elements
Structured Data
Markup Languages
Character Encodings
Image File Formats
Top Level Domains
Server Locations
Content Languages

The last item on the list is Content Languages; as a matter of fact, languages are not just another web technology, and this has implications that need to be understood. Unlike the other web technologies, it is not a binary question of being used or not in a specific website: several different languages can be used in the same website, and this makes a strong difference that we will explore further.

The methodology used by W3Techs for its survey is to use Tranco (the list of the one million most visited websites) as the sample and to apply its presence-detection algorithm daily on that sample. Note that previously, and until 2022 when it was discontinued, they used the list of 10 million websites provided by Alexa. Note also that while extrapolating from the one million most visited websites to the whole web (which holds more than 200 million websites) is probably valid for most web technologies, this is not the case for website languages, as the most visited websites probably concentrate the use of the most widely used languages, especially English, creating a loop effect.

So let us now focus on the different biases involved in using the web technology survey method for the languages of content. Several biases need to be considered:

  • The language recognition bias. Language recognition algorithms have an error rate considered to be below 10% and tend to recognize English above its real prevalence. This is a marginal bias, although it does favor English.
  • The selection bias. One million websites represent less than 0.5% of the total web universe and can in no way be considered representative of the whole. As a matter of fact, this sample favors the most widely used languages on the Web, especially English.
  • Computing languages on websites instead of web pages. The correct computation of the languages of content must be done on web pages, dividing the number of web pages in a given language by the total number of web pages. Admittedly, the size of the Web, estimated at more than 40 billion web pages, makes that computation almost impossible, and simplifying the method is understandable. But one condition must be respected for the simplification to work: paying due attention to the fact that many websites are multilingual, a property that has to be taken into account in the method to avoid a huge bias favoring English (the multilingualism bias, which is the main bias of W3Techs; see below and the sketch after this list).
  • There is another bias in the scenario of computing per website instead of per web page: the home page bias. The choice, as made by W3Techs, to apply the recognition algorithm to the home page of the website entails high risks. Many non-English websites make the effort to include some English text on their home page, either as a summary (as done routinely for the abstracts of scientific papers written in languages other than English) or incidentally (buttons in English, for instance). There is then a chance that the algorithm erroneously identifies the home page as English.
  • The last bias in that scenario is the erroneous classification of non-English websites as English. This may happen because the home page holds some text in English (the previous case), but also because the website is not working and is still classified as an English website due to an error message in English. The size of this bias could be around 10%.
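
To make the combined effect of the last three biases concrete, here is a minimal sketch in Python using invented toy data (the sites, page counts and languages are hypothetical, not W3Techs data). It contrasts a per-website count that assigns each site to the language detected on its home page with a per-page count of the same sites.

    # Toy illustration of the per-website / home-page bias (hypothetical data).
    from collections import Counter

    sites = [
        {"home": "en", "pages": ["en"] * 50},                              # monolingual English site
        {"home": "en", "pages": ["fr"] * 40 + ["en"] * 10},                # French site with an English home page
        {"home": "en", "pages": ["en"] * 30 + ["es"] * 30 + ["ar"] * 30},  # multilingual site, English by default
        {"home": "hi", "pages": ["hi"] * 80},                              # monolingual Hindi site
    ]

    # (a) Per-website count: one language per site, taken from the home page.
    per_site = Counter(site["home"] for site in sites)
    total_sites = sum(per_site.values())
    print({lang: f"{100 * n / total_sites:.0f}%" for lang, n in per_site.items()})

    # (b) Per-page count: the reference measure advocated in this section.
    per_page = Counter(lang for site in sites for lang in site["pages"])
    total_pages = sum(per_page.values())
    print({lang: f"{100 * n / total_pages:.0f}%" for lang, n in per_page.items()})

On this toy data, the per-website count credits English with 75% of the sample, while the per-page count gives it 33%: that is the kind of gap the home page and multilingualism biases can produce.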

There is another point that calls the results of W3Techs into question. W3Techs presents its results as percentages for 38 languages, followed by a sorted list of some 200 languages for which it reports a figure lower than 0.1% each.

However, if the percentages of those first 38 languages are summed (as of 04/23/2024), the result is exactly 99.90%. This means that the remaining 200 or so languages together represent less than 0.1% of the Web, which is hard to believe, as it implies that on average each of those 200 languages represents some 0.0005%, which looks implausible.

By comparison, if we check the percentage of content remaining outside the 361 languages we compute in our model, it still represents 1.76% in our latest version, a figure consistent with the fact that the remaining languages correspond to 3.42% of the L1+L2 population and 2.65% of the connected population.
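
As a quick back-of-the-envelope check, the figures quoted above (the 99.90% total and the roughly 200 residual languages) lead directly to that implausible average; the arithmetic is simply:

    # Sanity check of the W3Techs long tail, using the figures quoted above.
    top_38_total = 99.90        # % summed over the 38 listed languages (04/23/2024)
    residual_languages = 200    # approximate number of languages reported below 0.1%

    residual_share = 100.0 - top_38_total
    print(f"Share left for the other languages: {residual_share:.2f}%")                   # 0.10%
    print(f"Average per residual language: {residual_share / residual_languages:.4f}%")   # 0.0005%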

The main bias of W3TECHS: not paying due attention to websites' multilingualism

The methodology of W3Techs is described here: https://w3techs.com/technologies. It is not as detailed and transparent a description as a scientific paper would offer. From what can be read, it is understood that each website is attributed a single language. It also appears that English is assumed to be the default language, which implies that multilingual websites (such as Facebook) are classified as English websites. The methodology refers to “relevant websites”, but there is no description of how websites that do not respond with the expected content are treated, and we can also assume that a proportion of websites with errors may be recognized as English websites (note that in the Tranco sample we have identified as many as 20% of websites with errors, either 404 or other).

We have further analyzed the consequences of this bias and have even found a way to correct it; the analysis is published in the preprint paper “Is it True that More Than Half of Web Contents are in English? If Web Multilingualism is Paid Due Attention, then No!”. Note that the bias correction leads to a figure for English comparable to OBDILCI's figures. Note also that the paper refers to two other sets of figures for English content in the same range (20%-30%). One comes from a scientific paper focusing on websites within European Union top-level domains (before Brexit, and thus including English-speaking countries); the second comes from another software company, Netsweeper, which seems to proceed by analyzing web pages instead of websites (which prevents the multilingualism bias) and reports a huge sample of 15 billion web pages.

If you prefer a simple demonstration illustrated by an example instead of reading a scientific paper, go ahead and check:

English@Web: an historical bias

Over-evaluating the real percentage of English content is nothing new. As a matter of fact, it has existed almost since the birth of the Web. The following curves show the difference between what our observatory has measured and what the media have reported on that subject.

Percentage of English Pages in the Web

This situation has been documented in many peer-reviewed articles since the beginning. You can check for instance:

The fact that authorities like Wikipedia or Statista prefer citing non-scientific figures over peer-reviewed figures is clearly part of the problem. In both cases we have tried to open the eyes of these providers of sources. In the case of Statista, our well-documented mail remained unanswered; in the case of Wikipedia, some progress has finally been made.

A LOOK AT THE CONTEXT AND THE USE OF SOME RATIONALITY

What is the context of this topic? On the one hand, W3Techs shows, on its history page, a percentage of English on the Web that has been almost stable from 2011 to 2024, between 50% and 60%. Note that until recently this histogram started in 2011. On the other hand, the Internet has evolved from 2011 to 2024 with the following figures (sources: InternetWorldStats for 2011 and OBDILCI for 2024; note that these are L1+L2 figures):

In 2011

  • The number of connected people worldwide was 1 966 514 816 (28.7% of the world population)
  • Of which the number of connected English speakers was 536.6 million (27.3% of connected persons)

In 2024

  • The number of connected people worldwide is 6 800 888 461 (63.19% of the world population)
  • Of which the number of connected English speakers is 1 186 451 052 (15.79% of connected persons)

This represents a sharp relative decline in the share of connected English speakers: the 2011 share (27.3%) is 72.8% higher than the 2024 share (15.79%). Of course, the number of English-speaking internet users has increased considerably in 13 years, but much less than the sum of internet users of all the other languages (for information, based on the Ethnologue #27 dataset, L1+L2 English speakers represent 1 515 231 760 persons, which is 14.13% of all L1+L2 speakers). For English content to have remained stable in proportion in this context, English-speaking users would have had to compensate for that relative decline by increasing their content production per user by 72.8%!
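
The 72.8% figure can be recomputed from the data quoted in this section; the sketch below simply reproduces that arithmetic (the input numbers are those given above, and no other source is used).

    # Reproducing the 72.8% figure from the connectivity data quoted above.
    english_2011, connected_2011 = 536_600_000, 1_966_514_816
    share_2011 = 100 * english_2011 / connected_2011     # ~27.3% of connected persons in 2011
    share_2024 = 15.79                                    # % of connected persons in 2024 (figure above)

    # To keep the same proportion of content, per-user production would have to grow by this factor.
    required_increase = 100 * (share_2011 / share_2024 - 1)
    print(f"English share of connected persons: {share_2011:.1f}% (2011) -> {share_2024}% (2024)")
    print(f"Required increase in content production per English-speaking user: {required_increase:.1f}%")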

In 2024, as a result of the evolution of Internet demography, the percentage of connected Chinese speakers has overtaken that of English speakers and now stands at 17.41%, while the corresponding value for Hindi speakers is 4.34%. Together, they represent more than 20% of Internet users, yet according to W3Techs' figures they would represent less than 1.4% of content: this is nonsense.

As of our latest measurement (V5.1, March 2024), if Chinese, Hindi, Arabic, Malay, Bengali, Turkish, Vietnamese, Urdu, Persian and Marathi internet users are summed up, they reach 35.81% of all internet users. If the corresponding content percentages are summed up, they reach 34.45%.

For the same languages, W3Techs adds up to less than 8% of content: Chinese 1.3%, Hindi <0.1%, Arabic 0.6%, Malay 1.2% (Indonesian), Bengali <0.1%, Turkish 1.9%, Vietnamese 1.2%, Urdu <0.1%, Persian 1.4%, Marathi <0.1%, for a total below 8%. More than one third of users having access to less than 8% of content in their languages: this is nonsense.

OBDILCI's consistent observation since 1998 has been that there is some sort of relation between the percentage of users of a given language and the percentage of content in that language, a kind of economic law linking supply to demand; moreover, the ratio between the content and user percentages for a given language is rarely outside the window 0.5–2. For W3Techs, this ratio is 1.3/17.41 ≈ 0.07 for Chinese and 0.1/4.34 ≈ 0.02 for Hindi, which is totally implausible.
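
As a concrete check of this ratio argument, the snippet below computes the content-to-users ratio from the W3Techs content shares and the user shares quoted above, and flags values outside the 0.5–2 window.

    # Content-share / user-share ratios, using the figures quoted above.
    figures = {                  # language: (W3Techs content %, share of connected speakers %)
        "Chinese": (1.3, 17.41),
        "Hindi":   (0.1, 4.34),  # "<0.1%" taken as an upper bound of 0.1%
    }

    for lang, (content_pct, users_pct) in figures.items():
        ratio = content_pct / users_pct
        verdict = "within" if 0.5 <= ratio <= 2 else "far outside"
        print(f"{lang}: ratio = {ratio:.2f} ({verdict} the 0.5-2 window)")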

Between 2011 and 2024, the Internet has more than tripled in size, and today many countries one would not suspect have percentages of connected persons higher than, for example, Portugal or France (both just around 85%): Azerbaijan 88.18%, Bhutan 86.84%, Brunei Darussalam 99%, Kazakhstan 92%, Libya 88.4%.

No wonder the lingua franca of the Internet is no longer English but multilingualism, supported by ever more efficient machine translation.

In conclusion, the correct approach to measuring languages on the Web is to measure the percentage of web pages per language. The difficulty of dealing with such a gigantic space justifies opting for a per-website measurement approach, provided it takes into account that the same site can have pages in several languages, which the most widely used source, W3Techs, does not. The most natural approach to this measurement work is to use a language recognition algorithm and apply it directly to the space of websites. As this space is itself gigantic, a smaller, more manageable space is generally selected for applying the method: this inevitably introduces a selection bias that needs to be analyzed, and which in the context of languages can be major.
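
The sketch below illustrates this per-website compromise under a simplifying assumption: the language identification step is taken as already done, so each page is represented directly by a language label, and the only point shown is how a site's unit weight can be split across the languages it contains instead of being attributed entirely to a single language.

    # Sketch of a per-website measurement that respects multilingualism.
    # Pages are represented by language labels already produced by a language
    # identification step, which is not modeled here.
    from collections import Counter, defaultdict

    def site_language_shares(page_languages: list[str]) -> dict[str, float]:
        """Split one website's unit weight across the languages of its sampled pages."""
        counts = Counter(page_languages)
        total = sum(counts.values())
        return {lang: n / total for lang, n in counts.items()}

    def web_language_shares(sites: list[list[str]]) -> dict[str, float]:
        """Aggregate the fractional per-site weights into Web-wide percentages."""
        weights: defaultdict[str, float] = defaultdict(float)
        for page_languages in sites:
            for lang, share in site_language_shares(page_languages).items():
                weights[lang] += share        # each site contributes a total weight of 1
        total = sum(weights.values())         # equals the number of sites
        return {lang: round(100 * w / total, 1) for lang, w in weights.items()}

    # Toy usage: one multilingual site and one monolingual site.
    print(web_language_shares([["en", "fr", "fr", "fr"], ["hi", "hi"]]))
    # -> {'en': 12.5, 'fr': 37.5, 'hi': 50.0}

Under this scheme, a multilingual site no longer counts as a single English site, which is precisely the correction a per-website simplification needs in order to avoid the multilingualism bias.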

In this context, alternative methods, such as those proposed by OBDILCI, have their place (which of course does not exempt them from analyzing the biases they themselves entail). As with all complex statistics, it is preferable to trust methods that are subject to scientific publication, which guarantees, first of all, full transparency of the method and a systematic effort to analyze biases, and, moreover, a better probability that errors or biases have been detected by the competent colleagues in charge of reviewing the publications or commenting on them afterwards.

Projects by OBDILCI

  • Indicators for the Presence of Language in the Internet
  • The Languages of France in the Internet
  • French in the Internet
  • Portuguese in the Internet
  • Spanish in the Internet
  • AI and Multilingualism
  • DILINET
  • Pre-historic Projects…