W3Techs Multilingualism Bias

Let’s imagine that the web consists of 5 websites (W1 to W5) with a total of 62 web pages, and that only 3 languages exist: English, French and Spanish.

Suppose the 62 web pages are distributed across languages as shown in this table:

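| WEB PAGES | W1 | W2 | W3 | W4 | W5 | TOTAL | %      |
|-----------|----|----|----|----|----|-------|--------|
| English   | 10 | 5  | 10 | 0  | 0  | 25    | 40.32% |
| Spanish   | 0  | 0  | 10 | 0  | 10 | 20    | 32.26% |
| French    | 0  | 5  | 10 | 2  | 0  | 17    | 27.42% |
| TOTAL     | 10 | 10 | 30 | 2  | 10 | 62    | 100%   |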

In other words, we have:

  • W1: a website with 10 pages in English
  • W2: a website with 5 pages in English and 5 pages in French
  • W3: a website with 10 pages in each language
  • W4: a website with 2 pages in French
  • W5: a website with 10 pages in Spanish

This gives:

  • 25 pages in English
  • 20 pages in Spanish
  • 17 pages in French
  • for a total of 62 pages.

Therefore, the correct percentages of languages on the Web are:

  • English = 25/62 = 40.32%
  • Spanish = 20/62 = 32.26%
  • French = 17/62 = 27.42%
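The page-level computation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `pages` dictionary and the `page_shares` function are hypothetical names introduced here, and the data is copied from the table above.

```python
# Pages per language for each website, copied from the first table.
pages = {
    "W1": {"English": 10},
    "W2": {"English": 5, "French": 5},
    "W3": {"English": 10, "Spanish": 10, "French": 10},
    "W4": {"French": 2},
    "W5": {"Spanish": 10},
}


def page_shares(pages):
    """Return each language's share of all web pages, in percent."""
    totals = {}
    for site in pages.values():
        for lang, n in site.items():
            totals[lang] = totals.get(lang, 0) + n
    grand_total = sum(totals.values())  # 62 pages in this example
    return {lang: 100 * n / grand_total for lang, n in totals.items()}


shares = page_shares(pages)
for lang, pct in shares.items():
    print(f"{lang}: {pct:.2f}%")
```

With the example data this prints 40.32% for English, 27.42% for French and 32.26% for Spanish.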

Suppose we cannot count web pages and must count websites instead. If we take website multilingualism into account, the appropriate approach is to divide the number of websites available in a given language by the total number of linguistic versions of websites (a website available in 3 languages counts as 3 linguistic versions). The table then becomes:

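| WEB PAGES           | W1 | W2 | W3 | W4 | W5 | WEBSITES IN THIS LANGUAGE | % PER LANGUAGE |
|---------------------|----|----|----|----|----|---------------------------|----------------|
| English             | 10 | 5  | 10 |    |    | 3                         | 37.50%         |
| Spanish             |    |    | 10 |    | 10 | 2                         | 25.00%         |
| French              |    | 5  | 10 | 2  |    | 3                         | 37.50%         |
| Linguistic versions | 1  | 2  | 3  | 1  | 1  | 8                         | 100%           |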

And the results are:

  • English = 3/8 = 37.5%
  • Spanish = 2/8 = 25%
  • French = 3/8 = 37.5%
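As a rough sketch of this website-level method, the snippet below counts one linguistic version per language present on each website, then divides by the total number of versions. The names `pages` and `version_shares` are hypothetical and only serve this illustration.

```python
# Pages per language for each website, copied from the example.
pages = {
    "W1": {"English": 10},
    "W2": {"English": 5, "French": 5},
    "W3": {"English": 10, "Spanish": 10, "French": 10},
    "W4": {"French": 2},
    "W5": {"Spanish": 10},
}


def version_shares(pages):
    """Share of each language among all linguistic versions, in percent."""
    versions = {}
    for site in pages.values():
        for lang, n in site.items():
            if n > 0:  # a language counts once per website, whatever its page count
                versions[lang] = versions.get(lang, 0) + 1
    total_versions = sum(versions.values())  # 8 in this example
    shares = {lang: 100 * c / total_versions for lang, c in versions.items()}
    return shares, total_versions


shares, total_versions = version_shares(pages)
print(total_versions, shares)
```

Because page counts are ignored, the 2 French pages of W4 weigh as much as the 10 English pages of W1, which is where the approximation comes from.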

Clearly this method does not give the correct answer, only an approximation, because it favors the languages found on websites with fewer pages, in this case French. Note that with the huge numbers of the real Web, this bias is probably less significant.

Now, if, like W3Techs, we ignore multilingualism and treat English as the default language, we compute from this table:

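| WEB PAGES                   | W1      | W2      | W3      | W4     | W5      | Language count | %   |
|-----------------------------|---------|---------|---------|--------|---------|----------------|-----|
| English                     | 10      | 5       | 10      |        |         | 3              | 60% |
| Spanish                     |         |         | 10      |        | 10      | 1              | 20% |
| French                      |         | 5       | 10      | 2      |         | 1              | 20% |
| Home page language detected | English | English | English | French | Spanish | 5              |     |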

The final results are very far from reality, with a huge bias in favor of English:

  • English = 60%
  • French = 20%
  • Spanish = 20%
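The single-language count can be mimicked as follows. This is a simplified model of the example, not a description of how W3Techs actually detects languages: it assumes that any website with an English version is detected as English, and the fallback for sites without English (the language with the most pages) is a hypothetical choice made for this sketch.

```python
# Pages per language for each website, copied from the example.
pages = {
    "W1": {"English": 10},
    "W2": {"English": 5, "French": 5},
    "W3": {"English": 10, "Spanish": 10, "French": 10},
    "W4": {"French": 2},
    "W5": {"Spanish": 10},
}


def detected_language(site, default="English"):
    """Assign exactly one language to a website (assumed rule, see lead-in)."""
    if site.get(default, 0) > 0:
        return default
    # Hypothetical fallback: pick the language with the most pages.
    return max(site, key=site.get)


counts = {}
for site in pages.values():
    lang = detected_language(site)
    counts[lang] = counts.get(lang, 0) + 1

# Each website is counted once, so the denominator is the number of websites.
shares = {lang: 100 * c / len(pages) for lang, c in counts.items()}
print(counts, shares)
```

Under these assumptions W1, W2 and W3 are all counted as English, while the French and Spanish versions of W2 and W3 disappear from the count.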

The formula we defined in the referenced paper to unbias the W3Techs results is: Correct English Percentage = W3Techs Result for English / Rate of Multilingualism of the sample.

The Rate of Multilingualism is defined as the total number of linguistic versions of websites divided by the total number of websites.

In this example the rate of multilingualism is 8/5 = 1.6.

If you divide the last result for English, 60%, by 1.6, you get 37.50%, the same result as in the second computation.
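The correction can be checked numerically with a short sketch. The function names `multilingualism_rate` and `corrected_english_share` are hypothetical; they simply restate the two definitions above using the example data.

```python
# Pages per language for each website, copied from the example.
pages = {
    "W1": {"English": 10},
    "W2": {"English": 5, "French": 5},
    "W3": {"English": 10, "Spanish": 10, "French": 10},
    "W4": {"French": 2},
    "W5": {"Spanish": 10},
}


def multilingualism_rate(pages):
    """Total number of linguistic versions divided by the number of websites."""
    versions = sum(1 for site in pages.values() for n in site.values() if n > 0)
    return versions / len(pages)


def corrected_english_share(w3techs_english_pct, rate):
    """Apply the correction: W3Techs result for English / rate of multilingualism."""
    return w3techs_english_pct / rate


rate = multilingualism_rate(pages)               # 8 versions / 5 websites
corrected = corrected_english_share(60.0, rate)  # 60% is the single-language count
print(f"rate = {rate}, corrected English share = {corrected:.2f}%")
```

On this toy web the corrected value matches the 37.5% obtained by counting linguistic versions directly.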

Note that this equation cannot be applied to languages other than English, since English is the only language whose W3Techs count is valid and can therefore be corrected. The percentages W3Techs reports for the other languages in fact measure only the websites in those languages that have no English version.

We have tried to approximate the rate of multilingualism of the Tranco sample used by W3Techs and found it could be around 2. Note that the rate of multilingualism of humanity is currently estimated at 1.42, and it is no surprise that the rate of multilingualism of the World Wide Web is considerably higher. Based on that formula, when W3Techs says English represents 50% of the web, the unbiased result is around 25%. In other words, by not taking the multilingualism of the Web into account, W3Techs could be more than doubling the real figure!

If you would like to explore these considerations further, go ahead and read the paper!
