An important milestone for OBDILCI: our article “Is it True that More Than Half of Web Contents are in English? Not If Multilingualism is Paid Due Attention!” has finally been published in the peer-reviewed journal Forum for Linguistic Studies.
We hope the importance of the subject will be sensed by more people. There are interesting lessons to be learnt from the long and cumbersome process that led to this publication, and we would like to share them here.
Why so? Because the claim that “English has steadily represented more than 50% of Web contents since 2011” is a lie or, more diplomatically put, a half-truth, hiding the formidable multilingual reality of the Internet.
A half-truth? Yes, English is, and will remain for a while, together with Chinese, the first language in terms of Web contents. However, the real percentage today is half of what is repeated in the media on the strength of a biased source. And that missing half means more than 25% of Web contents in a large variety of other languages. The Internet is the most multilingual realm ever built on earth, a new Tower of Babel, but one with mutual understanding, thanks to translation assisted by applications.
Today some 750 languages have a digital existence. That is less than 10% of the existing wealth of languages, but it is much more than the fewer than 100 localized in 2000, and far more still than the very few tens of the Web’s first years. There is still a long road to full multilingualism, but today more than 90% of the world’s language speakers can use either their first or second language on the Net; the challenge now mainly concerns minority and endangered languages.
This article was first written a year and a half ago and made available in open access as a preprint. However, getting it published in a serious peer-reviewed journal has been a long and difficult journey. Analyzing that journey may reveal something more than our acknowledged difficulty in conveying a scientific message clearly.
The article was rejected by the editors, with the argument that the subject did not fit, at the Journal of Computational Linguistics and PLOS ONE. It was submitted to, and rejected after peer review by, Languages@Internet, the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, and the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL). It was finally accepted, after a required full revision, by the Forum for Linguistic Studies.
The point is absolutely not to dispute the decisions made by those journals, which were all documented, fair and valid (in the SIGUL case the reviews were rather positive, but the number of slots was limited and a threshold was applied). The point is to analyze that process and draw some lessons from it.
First lesson: peer review is, like democracy, cumbersome (the text must be adapted to various formats and rules) and sometimes frustrating (some reviewers do not know enough about the subject), but it is the least bad method we have for making science, or society. In that sense, both are key inventions of humanity! Both need to be appreciated and protected from existing threats.
In this particular case, the 3 × 4 reviews, even if they did not affect the core of the demonstration, forced a series of 4 revisions focused on clarity, readability and solid argumentation about stakes and impacts. This process progressively and drastically improved the product (which does not mean it could not still receive further criticism and enhancement).
Reviewers not only validate (or reject) scientific production; they also participate actively, and in a spirit of solidarity, in its improvement. Acting as a reviewer means offering valuable professional time for the sake of good science.
Second lesson: the subject of the space of languages on the Internet has so far been treated by a very limited number of researchers; it is largely underrated by the language technology community and beyond, and often misunderstood. This is in spite of the fact that it has implications for many societal matters: public policies for languages, business/e-commerce, cultural industries, geopolitics, cyber-geography…
Years of misinformation (overestimation) about the real place of English on the Web have shaped deep misconceptions, even in the minds of serious scientists used to evidence-based reasoning.
The third lesson comes from the fascinating fact that the 4 rounds of review, each involving 3 reviewers, followed exactly the same pattern:
One of the 3 reviewers stated, in essence, that “this is a non-subject and anyway the text is confused and lacks clarity; it does not look like a scientific contribution”. The part of the statement about clarity was obviously correct, was often supported by examples, and helped a lot. The first part reflected the limited knowledge of those reviewers, but it also confirmed the second lesson, in the sense that some misconceptions completely prevent some reviewers from grasping the rationale of the article and, as a consequence, from giving the demonstration itself the attention it requires. Below are some symptomatic sentences extracted from those reviews:
– I don’t see the relation to language resources.
– The submission offers no real new understandings about online multilingualism.
– Why it is important that English is not the majority language of webpages?
– English could be seen as lingua franca, a lot of people on the world understand (at least as second language)
– For me the problem is not the majority of English, but missing text-material of the under-resourced languages.
– In the reviewer’s opinion, it should not matter what proportion of languages exist on the Internet. This does not provide implications for the field.
Rejected.
A second group of reviewers, although not familiar with the subject, looked at the concepts with an open mind and tried to work through them, often with some difficulty. Some of those reviewers sometimes missed the point, but they did provide concrete recommendations for making the document clearer and conveying the stakes better.
Request for deep revision on the form.
A third group of reviewers, probably more knowledgeable about the stakes, praised the subject and went through the demonstration with no special difficulty, while offering advice on improving the discussion, asking for more references to other works on the subject (which is nearly impossible, given how few exist) and providing valuable suggestions for a better exposition of the ideas and the development of the demonstration.
Accepted. Request for revision on the form.
After 12 accumulated sets of advice, all taken seriously, about readability, clarity, smoothness of the demonstration and better exposition of the stakes and impacts, this article, also updated with new information along the way, could only get better! Nobody can expect perfection on those grounds, but each version in the series of 4 is definitely better than the previous one, and the whole process generated solid improvements. It remains that the subject of “what proportion of languages exist on the Internet”, to borrow one reviewer’s expression, is widely misunderstood and underestimated.
Beyond the fact that extremely few researchers have treated the subject, this is partly a consequence of the fact that misinformation is truly harmful when repeated year after year, because it shapes and closes minds. Misinformation taken as evidence prevents some genuine scientists from approaching this specific subject without emotional or unconscious biases, rooted in preconceived and false evidence.
The lingua franca of today’s Internet is translation aided by artificial intelligence. Multilingualism is both an extraordinary and unique reality of today’s Internet and still a pressing objective, to be extended to more and more languages in the coming years.
If you don’t have time to read the article but are still interested in the matter and need a fast path to get the point, check: