METHODOLOGY – V3.0 (March 2022)

Indicators for the Presence of Language on the Internet

NOTE: This is an archived version of the study. A more up-to-date version is available.

Basic Methodological Process – V3.0 (March 2022)

The model uses Ethnologue as the source for demo-linguistic data (distribution of L1 and L2 speakers per country), ITU and World Bank for connectivity data (% of persons connected to the Internet per country), and a large set of other data sources (*) to produce 5 indicators:

  • Internauts: % of connected persons per language
  • Traffic: % of traffic per language (statistical work based on Alexa and SimilarWeb data applied to several hundred selected websites) (**)
  • Usage: % of Internet usage per language, from data divided between subscribers to the main social networks, connecting infrastructure (World Bank data), open applications, streaming and e-commerce (T-Index from Translated)
  • Interfaces and translation languages: counting the presence of languages in a wide range of application interfaces and online translation applications
  • Indexes: measuring the strength of countries in terms of Information Society indicators and converting it into per-language figures (24 different indicators)

The average of those indicators is assumed to be a fair approximation of contents, within a confidence interval of -20% to +20%.

(*)   Most sources offer data per country. The data per language is obtained by weighting with demo-linguistic data.

(**) Most sources do not cover all countries; extrapolation techniques (weighting with the % of connected people, or a quartile approach) are used to fill the gaps.
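The averaging step described above can be sketched as follows. This is a minimal illustration, not the model itself: the indicator values are invented, the function name `content_estimate` is hypothetical, and the -20% to +20% interval is assumed here to be relative to the central value.

```python
from statistics import mean

# Hypothetical indicator values (% share) for one language; illustrative only.
indicators = {
    "internauts": 10.0,
    "traffic": 12.0,
    "usage": 9.0,
    "interfaces": 11.0,
    "indexes": 8.0,
}

def content_estimate(values, tolerance=0.20):
    """Return (low, central, high): the plain mean of the indicators and
    the band of -20% / +20% around it (assumed relative to the mean)."""
    central = mean(values)
    return central * (1 - tolerance), central, central * (1 + tolerance)

low, central, high = content_estimate(indicators.values())
print(f"estimated content share: {central:.1f}% (between {low:.1f}% and {high:.1f}%)")
```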

Why would the mean of previous indicators be a fair approximation of Web contents?

The logical method to measure the presence of languages on the Web would seem to be applying a reliable language recognition algorithm to all existing webpages and counting…

Yes… but the Web is too large for that method to be practically applicable, and the attempts lose meaning for 2 main reasons:

  1. The sample supposed to represent the whole universe is biased
  2. Multilingualism is not taken into consideration

and the results are extremely biased for those reasons.

Only two possibilities remain:

  1. For whoever uses the logical method, focus on biases and pay due attention to multilingualism
  2. For other players, use alternative methods.

The rationale of our alternative method

The data that can be relied upon, because of their limited biases, are:

  • demo-linguistic data (distribution of L1 and L2 speakers by country)
  • Internet connecting rate data (% of persons connected to the Internet per country).

From those 2 sources, and a working hypothesis stating that within each country all language speakers have the same connecting rate, it is possible to compute the connecting rate per language.
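A minimal sketch of this computation is given below. The country names, language names and figures are invented for illustration, and the function `connected_per_language` is a hypothetical helper, not part of the model's actual tooling. It applies each country's connecting rate uniformly to all of its speakers, as the working hypothesis states.

```python
from collections import defaultdict

# Hypothetical inputs, for illustration only (not data from the study).
# speakers[country][language] = number of L1+L2 speakers in that country
speakers = {
    "A": {"lang1": 8_000_000, "lang2": 2_000_000},
    "B": {"lang1": 1_000_000, "lang2": 4_000_000},
}
# connectivity[country] = fraction of persons connected to the Internet
connectivity = {"A": 0.9, "B": 0.4}

def connected_per_language(speakers, connectivity):
    """Return (rates, shares) per language.

    rates[lang]  = connected speakers / all speakers of that language
    shares[lang] = % of all connected speakers who speak that language
    """
    total = defaultdict(float)
    connected = defaultdict(float)
    for country, by_language in speakers.items():
        rate = connectivity[country]  # same rate for every language in the country
        for language, count in by_language.items():
            total[language] += count
            connected[language] += count * rate
    all_connected = sum(connected.values())
    rates = {lang: connected[lang] / total[lang] for lang in total}
    shares = {lang: 100 * connected[lang] / all_connected for lang in total}
    return rates, shares

rates, shares = connected_per_language(speakers, connectivity)
for lang in rates:
    print(f"{lang}: connecting rate {rates[lang]:.1%}, share of internauts {shares[lang]:.1f}%")
```

Since L2 speakers are counted under every language they speak, the shares add up to 100% only because they are normalized by the total of connected speakers across languages.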

In the absence of further data, this would be a fair first approximation of web contents per language, as experience has shown that the percentage of contents seems to be linked to the percentage of Internet users by some sort of natural economic law.

To improve on this and take into account that some languages do better (or worse) than average in terms of content production, the previous figures can be modulated using other indirect parameters.

This is exactly what our model does, considering factors such as traffic, use of applications, existence of interfaces or translation programs, scope of e-government, open data and other Information Society attributes.

Beyond the main indicator of speakers connected to the Internet, it can be considered that languages generate more or less content, for economic, social, cultural, network, education or other reasons, as a consequence of:

  • more or less Internet traffic, resulting from tariff, cultural or education reasons,
  • more or fewer speakers subscribed to applications,
  • more or less Information Society support where speakers live (e.g. e-government),
  • their absence from (or presence in) application interfaces or translation programs,
  • and, in general, their level of technological support for digital life, which can drastically limit or foster their use.

As a general rule, contents are produced by L1 speakers; however, L2 speakers of a given language may also decide to generate contents for economic reasons (no wonder the productivity of some major languages is so high compared to others!).

Our indirect method obviously cannot replace a real measurement. However, in the absence of such a measurement, and in the context of extremely biased results from incomplete measurements, it is a fair, or even better, approximation, as long as it duly reflects those different factors.

The method is basically a way of obtaining the distribution of contents per language as a modulation of the distribution of connected speakers per language, as a function of various measured parameters.

Obviously, as with every statistical approach, all biases need to be exposed, made explicit and analyzed…

Basic Biases Evolution Across the Versions

Demo-linguistic source
  • Version 1: Yoshua (2017)
  • Version 2: Ethnologue #24 (2021). Experts may disagree with some of the data, but it is the best data available.
  • Version 3: Ethnologue #24 (2021)

L2 extrapolation
  • Version 1: L2 results are computed by extrapolation from L1. A strong bias favors languages with a high presence in developing countries (mainly English and French).
  • Version 2: Solved. Ethnologue provides L2 data, therefore this bias disappeared.
  • Version 3: Same

Main weighting hypothesis
  • Version 1: All speakers of each country are computed with the same connected %. Light bias against European languages in developing countries and in favor of immigration languages in developed countries.
  • Version 2: Same. As long as the model is not used to compare languages within a country and is limited to speaker populations over one million, the bias is acceptable.
  • Version 3: Same. This working hypothesis is the basis of the model, as it allows most of the computing to be done as a modulation of the value around the % of connected persons per country.

Extrapolation techniques for sources
  • Version 1: The bias favors the most connected countries, but the effects are considered marginal (especially when the source covers more than 70% of the total).
  • Version 2: Same
  • Version 3: Same

Sources Biases: 0 = totally biased – 20 = total absence of bias

Internauts
  • Version 1: 18. ITU is a fair source with yearly updates.*
  • Version 2: 15. ITU stopped updating its estimates when no data is given by country officials.
  • Version 3: 19. World Bank took over the data and updates are frequent.

Traffic
  • Version 1: 13. Alexa is strongly biased against Asian languages and lightly biased in favor of European languages (except Portuguese). Selection bias is somewhat controlled by using the truncated mean at 20%.
  • Version 2: 11. The Alexa bias against Asian countries seems to be overcome, but a new bias and an error now affect European countries.
  • Version 3: 16. Technique implemented to cancel the selection bias. Uses a mix of error-filtered Alexa data and SimilarWeb data. A small bias remains, which affects many European languages. (*) The tool's biases are reflected in a result for Chinese that is out of proportion.

Usage
  • Version 1: 12. Relies on data from the main social networks. Biased against non-Western languages.
  • Version 2: 12. Same
  • Version 3: 15. Integration of non-Western social networks. Some improvements are still possible for V4.

Interface
  • Version 1: 19. These are objective data and the sampling is wide.
  • Version 2: 19. Same
  • Version 3: 19. Same

Indexes
  • Version 1: 15. The sampling needs to be enlarged.
  • Version 2: 18. Sampling close to exhaustive.
  • Version 3: 18. Same

Contents
  • Versions 1 and 2: Depends strongly on Wikimedia statistics, which are excellent but strongly biased against non-Western languages and highly favor some languages (French, Hebrew, Swedish…). Techniques are used to control the biases of the Wikimedia statistics.
  • Version 3: OUT. After an intensive effort to include all online encyclopedias beyond Wikipedia, it was concluded that it is better to suppress this indicator, as the goal is not reachable as an input.

(*)  The use of top-ranked websites disadvantages countries with a higher information literacy rate, where a larger portion of traffic goes to non-top websites.
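The truncated mean mentioned for the Version 1 Traffic indicator can be illustrated with the sketch below. The source does not say how the 20% is applied; this example assumes that 20% of the observations are dropped at each end of the sorted values, and both the function name and the sample figures are hypothetical.

```python
def truncated_mean(values, proportion=0.20):
    """Mean of the values left after dropping `proportion` of the
    observations from each end of the sorted list (an assumption:
    the source only says "truncated mean at 20%")."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Illustrative per-website percentages for one language, with one outlier.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = sum(sample) / len(sample)
trimmed = truncated_mean(sample)
print(f"plain mean: {plain}, truncated mean: {trimmed}")
```

The outlier pulls the plain mean far above the typical value, while the truncated mean ignores it, which is the kind of control over selection bias the table entry describes.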

Bias Summary

V1 was strongly biased against non-European languages and, at the same time, biased in favor of the few European languages with a high presence in developing countries with low connectivity rates (mainly English and French).

V2 solved the second main bias and reduced the negative bias against non-European languages, but not enough, as the content input indicator remained strongly biased.

V3 solved the content bias by suppressing it as an input and removed almost all negative biases against non-European languages. Overall, a slight negative bias against European languages now remains, but the reliability of the results has improved and reached a new quality threshold.

The evolution of the method has switched from strong negative biases against non-European languages to light negative biases against European languages… and a possible positive bias towards Chinese due to the new Traffic indicator process.

That said, the data are to be taken with caution, as they are reliable only within a -20% to +20% confidence interval, especially when comparing raw results that lie within this interval of each other (as shown in the inverted pyramid of the main content per language for the 4 languages in position 4).
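As a rough illustration of this caution, the sketch below checks whether two raw results can be ranked against each other or whether their intervals overlap. It assumes the -20% to +20% interval is relative to each value; the function names and the figures are hypothetical.

```python
def interval(value, tolerance=0.20):
    """Band of -20% / +20% around a raw result (assumed relative)."""
    return value * (1 - tolerance), value * (1 + tolerance)

def distinguishable(a, b, tolerance=0.20):
    """True only if the two bands do not overlap, i.e. the ranking
    of the two raw results can be trusted."""
    low_a, high_a = interval(a, tolerance)
    low_b, high_b = interval(b, tolerance)
    return high_a < low_b or high_b < low_a

# Illustrative content shares (%), not results of the study.
print(distinguishable(5.0, 5.5))   # False: bands (4.0, 6.0) and (4.4, 6.6) overlap
print(distinguishable(5.0, 10.0))  # True: bands (4.0, 6.0) and (8.0, 12.0) are disjoint
```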

Potential Improvements for Version 4

Content productivity is measured on the basis of L1+L2 figures. It would be useful to check the value of another content productivity factor based only on L1; as Version 3 of the model computes everything on an L1+L2 basis, this would require another version of the model.

The USAGE indicator can still be improved and its biases reduced by focusing on:

  • Its video streaming component, adding other sources to YouTube and Netflix
  • Its open data component, adding to the single source and focusing stats on open data, MOOCs, etc.
  • Its biases, which have evolved from high against non-European languages to low against European languages; this needs to be addressed.

The TRAFFIC indicator yields a result for Chinese that is out of proportion compared to the other languages. This needs to be investigated. The impact on the final result is however marginal: a more proportionate value would leave Chinese equal to English and, in any case, within the same confidence interval.


The Graphic View of the Evolution of the Method from V1 to V3

Published Article About Our Methodology

The method behind the unprecedented production of indicators of the presence of languages in the Internet, Sept. 2022
