MAIN PROJECT – V3.0 (March 2022)

Indicators of the Presence of Languages in the Internet

NOTE: This is an archived version of the study. A more up-to-date version is available.

Introduction – V3.0 (March 2022)


Version 3 (March 2022), with comprehensive bias reduction and redefinition of some outputs

More than a new version, this release marks the maturity of the method: all the biases are now controlled to an acceptable threshold, and the indicators produced are reliable within a ±20% confidence interval.

The Observatory is pleased to share the results of version 3 of its model for computing indicators of the presence of languages on the Internet. Like version 2, announced in 2021, it covers the 329 languages with more than one million native (L1) speakers.

A confidence interval of ±20% may seem wide by the standards of other statistical work. For data on the place of languages on the Internet, a subject that has always been very difficult to measure and is prone to chronic misinformation, it is a feat.

All the results are available under the CC-BY-SA 4.0 license.

What do the results tell us? The winner is multilingualism.

Project Summary

Read a short peer-reviewed, open article presenting the results of V3 in terms of indicators and a synthesis of the method:

Resource: Indicators on the Presence of Languages in Internet, Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, a workshop of LREC2022


Methodological Note

This is an indirect approximation of the space occupied by languages on the Internet, built from different data sources using statistical techniques. All computations and results are based on L1+L2 speakers, where L1 is the mother tongue and L2 is the second language(s).

According to our main demo-linguistic source (Ethnologue #24), the world L1 population and the L1+L2 speaker population are:

L1 = 7 231 699 136     L1+L2 = 10 361 716 756     (L1+L2)/L1 = 1.4328
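
As a quick sanity check, the ratio can be recomputed from the two population figures. This is only a sketch: it assumes the larger figure is the combined L1+L2 total, which is the reading the stated ratio of 1.4328 supports, and the variable names are illustrative.

```python
# Population figures quoted above (source: Ethnologue #24).
l1_speakers = 7_231_699_136       # first-language (L1) speakers
l1_l2_speakers = 10_361_716_756   # assumed to be the combined L1+L2 total

# Average number of languages counted per person under this reading.
ratio = l1_l2_speakers / l1_speakers

# Second-language counts implied by the two figures (derived, not published).
l2_only = l1_l2_speakers - l1_speakers

print(f"(L1+L2)/L1 = {ratio:.4f}")   # (L1+L2)/L1 = 1.4328
print(f"Implied L2 speaker count = {l2_only:,}")
```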

The confidence interval of all the figures produced is estimated at ±20%.
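
To make the ±20% window concrete, the sketch below turns a published figure into a range. It assumes the margin is relative to each figure (20% of the value) rather than 20 percentage points, which the text does not spell out. The helper names `interval` and `overlaps` are hypothetical, and the two input values are the content shares reported for English and Chinese in the results table.

```python
def interval(value, rel_margin=0.20):
    """Return (low, high) bounds for a value under a symmetric relative margin."""
    return value * (1 - rel_margin), value * (1 + rel_margin)


def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


english = interval(19.60)   # English share of web contents, in %
chinese = interval(21.60)   # Chinese share of web contents, in %

print(f"English: {english[0]:.2f}% to {english[1]:.2f}%")   # English: 15.68% to 23.52%
print(f"Chinese: {chinese[0]:.2f}% to {chinese[1]:.2f}%")   # Chinese: 17.28% to 25.92%
print("Ranges overlap:", overlaps(english, chinese))        # Ranges overlap: True
```

Under this reading, the ranges for the two leading languages overlap, so small differences between neighbouring ranks are best read with the margin in mind.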

The detailed methodology has been published in a peer-reviewed, open-access journal: The method behind the unprecedented production of indicators of the presence of languages in the Internet. Frontiers in Research Metrics and Analytics, Volume 8, 2023. doi: 10.3389/frma.2023.1149347

Results of the March 2022 Study (V3.0)

Results of the LC2022 (March 2022, V3.0) Study

All Indicators for the 30 Languages with the Highest Content Percentage

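| RANK (CONTENTS L1+L2) | ISO CODE | LANGUAGE | INTERNAUTS L1+L2 % | WORLD POPULATION L1+L2 % | CONN. SPEAKERS % | CONTENTS L1+L2 % | VIRTUAL PRESENCE L1+L2 | CONTENT PRODUCTIVITY L1+L2 |
|---|---|---|---|---|---|---|---|---|
| 1 | zho | Chinese Macro | 18,46% | 14,72% | 71,38% | 21,60% | 1,47 | 1,17 |
| 2 | eng | English | 14,83% | 13,01% | 64,86% | 19,60% | 1,51 | 1,32 |
| 3 | spa | Spanish | 6,79% | 5,24% | 73,72% | 7,85% | 1,50 | 1,16 |
| 4 | hin | Hindi | 4,19% | 5,80% | 41,16% | 3,76% | 0,65 | 0,90 |
| 5 | rus | Russian | 3,51% | 2,49% | 80,32% | 3,76% | 1,51 | 1,07 |
| 6 | fra | French | 2,98% | 2,58% | 65,80% | 3,33% | 1,29 | 1,12 |
| 7 | por | Portuguese | 2,99% | 2,49% | 68,43% | 3,13% | 1,26 | 1,05 |
| 8 | ara | Arabic Macro | 3,97% | 3,53% | 63,99% | 3,09% | 0,87 | 0,78 |
| 9 | jpn | Japanese | 1,99% | 1,22% | 92,63% | 2,66% | 2,18 | 1,34 |
| 10 | deu | German, Standard | 2,04% | 1,30% | 89,17% | 2,37% | 1,82 | 1,16 |
| 11 | msa | Malay Macro | 2,36% | 2,36% | 56,93% | 1,96% | 0,83 | 0,83 |
| 12 | tur | Turkish | 1,17% | 0,85% | 78,05% | 1,14% | 1,35 | 0,98 |
| 13 | ita | Italian | 0,87% | 0,66% | 75,83% | 1,00% | 1,53 | 1,14 |
| 14 | kor | Korean | 0,90% | 0,79% | 65,16% | 0,98% | 1,24 | 1,09 |
| 15 | fas | Persian Macro | 1,08% | 0,81% | 75,91% | 0,88% | 1,09 | 0,82 |
| 16 | ben | Bengali | 1,11% | 2,58% | 24,55% | 0,88% | 0,34 | 0,79 |
| 17 | vie | Vietnamese | 0,92% | 0,74% | 70,96% | 0,85% | 1,15 | 0,92 |
| 18 | urd | Urdu | 0,95% | 2,22% | 24,38% | 0,66% | 0,30 | 0,70 |
| 19 | tha | Thai | 0,80% | 0,59% | 77,95% | 0,65% | 1,12 | 0,82 |
| 20 | pol | Polish | 0,60% | 0,39% | 87,09% | 0,63% | 1,59 | 1,04 |
| 21 | mar | Marathi | 0,69% | 0,96% | 41,06% | 0,58% | 0,60 | 0,83 |
| 22 | tel | Telugu | 0,68% | 0,92% | 41,69% | 0,56% | 0,60 | 0,82 |
| 23 | tam | Tamil | 0,61% | 0,82% | 42,15% | 0,51% | 0,62 | 0,83 |
| 24 | jav | Javanese | 0,62% | 0,66% | 53,76% | 0,44% | 0,66 | 0,70 |
| 25 | nld | Dutch | 0,38% | 0,24% | 91,14% | 0,41% | 1,73 | 1,08 |
| 26 | guj | Gujarati | 0,44% | 0,60% | 41,47% | 0,36% | 0,61 | 0,83 |
| 27 | ukr | Ukrainian | 0,40% | 0,32% | 71,02% | 0,35% | 1,09 | 0,88 |
| 28 | kan | Kannada | 0,41% | 0,57% | 41,11% | 0,33% | 0,59 | 0,82 |
| 29 | ron | Romanian | 0,32% | 0,23% | 79,57% | 0,30% | 1,29 | 0,93 |
| 30 | aze | Azerbaijani Macro | 0,33% | 0,23% | 81,54% | 0,28% | 1,21 | 0,85 |
| | | REMAIN | 22,60% | 30,10% | | 15,13% | | |
| | | TOTAL | 100,00% | 100,00% | | 100,00% | | |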

LEGEND

ISO = 3-letter ISO 639 language code
L1+L2 = first- and second-language speakers
INTERNAUTS = % of the language's connected speakers over the world total of L1+L2 connected persons
WORLD POPULATION = % of the language's speakers over the world total of L1+L2 speakers
CONN. SPEAKERS = % of the language's L1+L2 speakers who are connected to the Internet
CONTENTS = % of Web contents in each language over the total of Internet webpages (NOT over the total of websites!)
VIRTUAL PRESENCE = the ratio of CONTENTS over WORLD POPULATION for each language
CONTENT PRODUCTIVITY = the ratio of CONTENTS over INTERNAUTS for each language
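
The two ratio indicators can be recomputed from the percentage columns. The sketch below does so for three languages, parsing the decimal-comma notation used in the table. It assumes the denominator for content productivity is the language's share of the world's connected L1+L2 population (the first percentage column of the table), since that reading reproduces the published values. The helper names `parse_pct` and `indicators` are hypothetical.

```python
def parse_pct(text):
    """Convert a decimal-comma percentage such as '21,60%' to a float (21.6)."""
    return float(text.strip().rstrip("%").strip().replace(",", "."))


def indicators(contents, world_population, internauts):
    """Return (virtual presence, content productivity), rounded to 2 decimals."""
    c = parse_pct(contents)
    virtual_presence = c / parse_pct(world_population)
    content_productivity = c / parse_pct(internauts)
    return round(virtual_presence, 2), round(content_productivity, 2)


# (contents %, world population %, share of world internauts %) from the table.
rows = {
    "zho": ("21,60%", "14,72%", "18,46%"),
    "eng": ("19,60%", "13,01%", "14,83%"),
    "hin": ("3,76%", "5,80%", "4,19%"),
}

for iso, values in rows.items():
    vp, cp = indicators(*values)
    print(f"{iso}: virtual presence {vp:.2f}, content productivity {cp:.2f}")
# zho: virtual presence 1.47, content productivity 1.17
# eng: virtual presence 1.51, content productivity 1.32
# hin: virtual presence 0.65, content productivity 0.90
```

A value above 1 means the language's share of web contents exceeds its share of speakers (virtual presence) or of connected speakers (content productivity). Because the published percentages are rounded, recomputed ratios for other rows may differ from the table in the last digit.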

Complete Results

Comparison of Results with Other Providers

Download Complete Results for All 329 Languages

Videos

The Method Behind the Unprecedented Production of Indicators of the Presence of Languages in the Internet

Release Date: March 2023

Duration: 39min



Credits

OBDILCI
La Francophonie
Unesco Chair on Language Policies for Multilingualism
Instituto Internacional da Língua Portuguesa (IILP)
Gov.BR

Projects by OBDILCI

  • Indicators for the Presence of Language in the Internet
  • The Languages of France in the Internet
  • French in the Internet
  • Portuguese in the Internet
  • Spanish in the Internet
  • AI and Multilingualism
  • Digital Languages Death
  • DILINET
  • Pre-historic Projects…