Given Spain’s formidable investment in the cross-cutting themes of artificial intelligence and language [1], the annual release of the Cervantes Institute yearbook was awaited with great interest, as a chance to learn about the first impact of these plans on the digital dimension of the Spanish language.
[1] The PERTE “New Language Economy” announced budgets of more than 2 billion euros for initiatives, and it was followed by the strategy for artificial intelligence, which has a clear focus on language.
An appetizer, both digestible and nourishing, had been served earlier with the excellent book “Los futuros del español. Horizonte de una lengua internacional” by José Antonio Alonso, Juan Carlos Jiménez and José Luis García Delgado. This work applies the intellectual tools of economics to the Spanish language, producing original data based on reliable sources and clear reasoning, with a view to the future of the language. It concludes with a series of very inspiring recommendations that could well serve other languages in addition to Spanish.
The announcement of the 2024 edition of “El español en el mundo”, headlined by the strong presence of Spanish in world music, might have led one to suspect a cultural rather than digital orientation. Even so, the disappointment could not be greater on consulting it and discovering how empty the digital dimension of the work is. In a work of more than 600 pages, the subchapter “Digital dimension” has fewer than 20 lines, and more than half of them explain the source used for the only data shown: “Percentages of language usage on websites”.
The Cervantes Institute persists, year after year, in using W3Techs as its source, in spite of the fact that it is seriously biased (see the demo), and the description of the method that follows is totally fanciful. Quote: “Data is gathered by looking at websites that are considered relevant. A website is relevant if it has meaningful content or functionality”. What a tautology!
The reality, very easy to check here, is that W3Techs produces its data by scanning the one million most visited websites, using this source to locate them. The relevance to which W3Techs refers is simply a filter that excludes empty sites and duplicate sites.
W3Techs is a very reliable commercial source for the 20-plus web technologies it explores, but not for language, a “web technology” with the particularity that, unlike the others, a single website may use more than one. The major bias in these data is that, by not taking into account the potential multilingualism of websites, the method multiplies the proportion of English by a factor of 2 (see the link for details).
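To make the mechanism of this bias concrete, here is a minimal sketch in Python. It is not W3Techs' code or data: the four sites, their languages and the function names are all hypothetical, chosen only to show how crediting each site with a single language raises the share of English compared with counting every language version a site offers.

```python
from collections import Counter

# Hypothetical toy sample, invented for illustration only:
# each entry is (language credited to the site, all languages the site offers).
SITES = [
    ("en", ["en", "es", "fr"]),
    ("en", ["en", "es"]),
    ("en", ["en"]),
    ("es", ["es"]),
]


def shares(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def single_language_shares(sites):
    """One language per site: multilingual versions are ignored."""
    return shares(Counter(primary for primary, _ in sites))


def multilingual_shares(sites):
    """Every language version of every site counts once."""
    return shares(Counter(lang for _, langs in sites for lang in langs))


if __name__ == "__main__":
    print("one language per site:", single_language_shares(SITES))
    print("all language versions:", multilingual_shares(SITES))
```

In this invented sample, English is credited with three sites out of four under the first method, but with only three language versions out of seven under the second. The size of the gap depends entirely on the sample, so the sketch illustrates the direction of the bias and says nothing about its real magnitude.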
As for AI, the book offers an excellent pedagogical contribution on AI and languages, certainly very interesting and informative about the options for Spanish. However, now that Spain has invested colossal sums in this subject, one would expect this yearbook to provide data on the progress of these projects, not a course on AI and language.
The extensive and intense communication around the work focuses on the predominance of the Spanish language in the world of music, which is certainly relevant, but the issue of language and AI technologies could today be an order of magnitude more relevant and crucial.
Its absence raises questions about the purpose of this yearbook. The problem clearly lies in the frustrated expectation of obtaining a compendium of reliable statistical data on Spanish in all possible aspects, with a solid focus on the digital aspect, which has become the most strategic one for the future.
It is, without a doubt, a remarkable work on the cultural themes associated with the Spanish language, but it is not the compilation of useful data for researchers that one would expect. In that sense, it does not stand comparison with the effort made every four years by the Francophonie under a similar title, “Le français dans le monde”, which gathers an impressive body of original and useful data on all aspects, with a particular effort on the digital dimension.