MECILDI LANGUAGE DETECTOR

Research Information for Webmasters

Summary

MECILDI is a non-profit research bot operated by OBDILCI. We visit the homepages of websites at most 1 to 4 times per year to measure how multilingual those websites are. Our bot is designed to be ultra-lightweight: it requests only 2 files per domain (robots.txt and the homepage) to distinguish monolingual from multilingual web presence. Websites are selected at random, and the probability of visiting the same site more than once per year is very low.

Full Methodology and Crawler Behavior

The MECILDI project (Targeted Measurement of Languages in the Internet, an acronym from the French version) is a scientific initiative led by the non-profit research organization OBDILCI (www.obdilci.org). This work is supported by various governmental and funding organizations and is strictly dedicated to academic research.

Our goal is to create a statistical map of linguistic diversity on the World Wide Web. Unlike search engine bots or data miners, our crawler follows a “minimal footprint” protocol:

  1. Frequency: In its standard operational mode, our crawler visits any given domain 1 to 4 times per year at most, and most sites are visited only once or twice.

    • Development Note: During our current active development and testing phase, some domains may be visited more frequently (every few days). This testing period is expected to conclude by early Q2 2026, after which the bot will revert to its standard cycle.

  2. Depth: We do not crawl internal pages. We only request the robots.txt file and the root homepage of a domain.

  3. Efficiency: To minimize server load, our crawler attempts to identify the correct protocol (HTTP vs HTTPS) and subdomain (WWW vs non-WWW) in parallel. As soon as the first connection is successful, all remaining “probe” connections to that specific domain are immediately canceled.

  4. Politeness: Our bot strictly respects robots.txt directives.

  5. Traffic Control: We limit our global crawl to a small number of concurrent requests (typically 10) to ensure we do not put any strain on hosting provider networks.
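
For illustration only, points 3 and 5 above can be sketched in Python as follows. This is not the actual MECILDI code: the function names (candidate_urls, first_successful, crawl) and the injected fetch coroutine are assumptions made for the example, and the cap of 10 is applied here to individual requests, which is one possible reading of "concurrent requests".

```python
import asyncio
from typing import Awaitable, Callable, Optional

# "Typically 10" comes from the text; applying it per request is an assumption.
MAX_CONCURRENT_REQUESTS = 10

# Hypothetical fetcher type: takes a URL, returns the body, raises on failure.
Fetch = Callable[[str], Awaitable[str]]


def candidate_urls(domain: str) -> list[str]:
    """Four probe targets: HTTPS/HTTP crossed with non-WWW/WWW."""
    return [
        f"{scheme}://{host}/"
        for scheme in ("https", "http")
        for host in (domain, f"www.{domain}")
    ]


async def first_successful(domain: str, fetch: Fetch) -> Optional[tuple[str, str]]:
    """Race all probes; return (url, body) of the first success, else None."""

    async def probe(url: str) -> tuple[str, str]:
        return url, await fetch(url)

    pending = {asyncio.create_task(probe(u)) for u in candidate_urls(domain)}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            winner = None
            for task in done:
                # Reading exception() also marks failed probes as handled.
                if task.exception() is None and winner is None:
                    winner = task.result()
            if winner is not None:
                return winner
        return None  # every probe failed
    finally:
        # Cancel the remaining probes as soon as one connection succeeds.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)


async def crawl(
    domains: list[str], fetch: Fetch, limit: int = MAX_CONCURRENT_REQUESTS
) -> dict[str, Optional[tuple[str, str]]]:
    """Probe every domain while keeping at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def limited_fetch(url: str) -> str:
        async with semaphore:
            return await fetch(url)

    results = await asyncio.gather(
        *(first_successful(d, limited_fetch) for d in domains)
    )
    return dict(zip(domains, results))
```

The fetch coroutine is passed in so that the racing and cancellation logic can be exercised without any network access; a real crawler would supply an HTTP client call there and would request robots.txt before the homepage.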

Why was my site visited?

Your domain was randomly selected as part of a large-scale scientific research sample. Our datasets are derived from either Country Code Top-Level Domains (ccTLDs) or the TRANCO list of top-ranked global websites. This random sampling allows us to help researchers, NGOs, and governments better understand the complex characteristics of multilingualism and linguistic diversity in the WWW, which is the core of our mission as a non-profit.

Contact us

If you have questions, concerns, or wish to exclude your domain from future research cycles, please contact us at contact@obdilci.org.
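
Because the bot respects robots.txt directives, a robots.txt rule is another possible way to keep it away from a site. The sketch below uses Python's standard urllib.robotparser to check how a given rule set would be interpreted. The user-agent token "MECILDI" is hypothetical: this page does not state the bot's real User-Agent string, so the token should be confirmed (for example in server logs or by email) before relying on such a rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical token; the bot's real User-Agent string is not given on this page.
BOT_TOKEN = "MECILDI"

# Example rules: block the (assumed) bot token, allow everyone else.
ROBOTS_TXT = """\
User-agent: MECILDI
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, BOT_TOKEN, "https://example.org/"))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.org/"))
```

Since the crawler requests only the homepage, a rule covering the root path is the relevant one to test.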
