What happens when you pick a language?
When you choose Japanese as a target language, the sliders move to show how Japanese people typically communicate -- more polite, less direct. Pick Dutch, and they shift the opposite way -- very direct, informal. These starting positions are not random. They come from decades of cross-cultural research on how people in different countries actually communicate.
Where do the numbers come from?
Hofstede's Cultural Dimensions
In the late 1960s and 1970s, researcher Geert Hofstede analyzed surveys of over 100,000 IBM employees about their values and workplace behavior; the current edition of his framework scores 76 countries and regions. He discovered measurable patterns that differ consistently between cultures. His work, published in Cultures and Organizations: Software of the Mind (3rd ed., 2010), remains the most widely used framework in cross-cultural research. We use three of his dimensions:
Power Distance (PDI) -- How much do people accept hierarchy? In Malaysia (PDI=104), you would never call your boss by their first name. In Israel (PDI=13), everyone does -- even in the military. High PDI cultures use more polite, deferential language when speaking to superiors.
Individualism (IDV) -- Do people think of themselves as "I" or "we"? In the US (IDV=91), people say "I think we should..." In South Korea (IDV=18), people say "Our team feels that..." Collectivist cultures avoid singling out individuals and prefer indirect, face-saving communication.
Uncertainty Avoidance (UAI) -- How comfortable are people with ambiguity? In Greece (UAI=112), people follow strict protocols and formal procedures. In Singapore (UAI=8), people adapt on the fly and keep things informal. High UAI cultures prefer structured, formal language.
Hall's Context Theory
Some cultures say exactly what they mean. Others expect you to read between the lines. Anthropologist Edward T. Hall described this in Beyond Culture (1976):
- Low-context cultures (Dutch, German, American) -- Communication is explicit and direct. "The report is due Friday" means exactly that.
- High-context cultures (Japanese, Persian, Arab) -- Meaning is wrapped in context, tone, and relationship. "It might be nice to have the report soon" could mean the same thing.
Politeness Theory
Brown and Levinson's Politeness: Some Universals in Language Usage (1987) established that all cultures use strategies to protect "face" -- the social image people want to maintain. However, which strategies they use varies dramatically. Some cultures soften requests to protect the listener's autonomy (negative politeness), while others emphasize warmth and solidarity (positive politeness). Recent research confirms that politeness markers are among the most culturally variable aspects of language -- and the most frequently lost in translation (Masoud et al., 2024).
How we calculate the defaults
We combine these research scores into four output dimensions using weighted formulas. Each formula reflects the research consensus on which cultural factors most influence that communication style:
Politeness = 0.35 × PDI + 0.30 × (100 - IDV) + 0.35 × CTX
Directness = 0.40 × IDV + 0.35 × (100 - CTX) + 0.25 × (100 - PDI)
Formality = 0.35 × PDI + 0.35 × UAI + 0.30 × CTX
Attribution = 0.55 × IDV + 0.25 × (100 - PDI) + 0.20 × (100 - CTX)
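As a minimal sketch, the calculation can be expressed in a few lines of Python. The weights are the ones used in this section; the function name, the rounding, and the clamp to the 0-100 scale are our assumptions.

```python
def culture_defaults(pdi: float, idv: float, uai: float, ctx: float) -> dict:
    """Map Hofstede scores (PDI, IDV, UAI) and an estimated Hall context
    score (CTX), all on a 0-100 scale, to the four communication defaults."""
    raw = {
        "politeness":  0.35 * pdi + 0.30 * (100 - idv) + 0.35 * ctx,
        "directness":  0.40 * idv + 0.35 * (100 - ctx) + 0.25 * (100 - pdi),
        "formality":   0.35 * pdi + 0.35 * uai + 0.30 * ctx,
        "attribution": 0.55 * idv + 0.25 * (100 - pdi) + 0.20 * (100 - ctx),
    }
    # Round to whole points and keep every value inside 0-100.
    return {k: max(0, min(100, round(v))) for k, v in raw.items()}

# Japan: PDI=54, IDV=46, UAI=92 (published Hofstede scores) and CTX=90
# (a high-context estimate in line with the Japanese example below).
japan = culture_defaults(pdi=54, idv=46, uai=92, ctx=90)
# japan["politeness"] == 67 and japan["directness"] == 33, matching the
# pre-adjustment values in the Japanese worked example.
```

Running this for any language in the data table should reproduce the published output columns before the language-specific adjustments are applied.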
In plain English
Politeness -- Strong hierarchy + group-oriented + read-between-the-lines = more polite, deferential language. Think: Japanese, Korean, Persian.
Directness -- Individualist + say-what-you-mean + egalitarian = more direct, explicit speech. Think: Dutch, Israeli, American.
Formality -- Strong hierarchy + need for structure + high-context = more formal register and vocabulary. Think: Arabic, Japanese, Greek.
Attribution -- Individualist + egalitarian + explicit communication = "I did this" instead of "mistakes were made." Think: American, Australian, Dutch.
Example: Japanese
Politeness = 0.35 × 54 (PDI) + 0.30 × 54 (100 - IDV) + 0.35 × 90 (CTX) ≈ 67, plus keigo adjustment +10 = 77
Directness = 0.40 × 46 (IDV) + 0.35 × 10 (100 - CTX) + 0.25 × 46 (100 - PDI) ≈ 33, adjustment -5 = 28
This matches reality: Japanese communication is famously polite and indirect.
Language-specific features
Some languages have built-in politeness systems that go beyond what country-level culture scores capture. For example, Japanese has keigo -- honorific registers (respectful, humble, and polite forms) baked into the grammar itself. Persian has ta'arof -- an elaborate courtesy ritual where you might refuse a compliment three times before accepting. Korean has six speech levels that change verb endings based on who you're talking to.
Because these features are part of the language structure (not just cultural preference), we add small corrections on top of the formula -- a few points in either direction, ranging from -8 to +10:
| Language | What makes it special | Adjustment |
|---|---|---|
| Japanese | Keigo -- grammatical honorific registers (respectful, humble, polite) in verb conjugation | Politeness +10, Directness -5, Formality +5 |
| Persian | Ta'arof -- elaborate courtesy system with ritualized offers and refusals | Politeness +10, Directness -8 |
| Korean | Six speech levels that change verb endings based on social hierarchy | Politeness +8, Directness -3, Formality +5 |
| Thai | Pronoun and particle system that encodes social status in every sentence | Politeness +8, Directness -5 |
| Urdu | Adab -- etiquette system with formal/informal verb forms | Politeness +8, Directness -3 |
| Arabic | Elaborate honorific forms of address tied to religious and social norms | Politeness +5, Formality +3 |
| Hindi | Three-level honorific verb system (intimate, neutral, respectful) | Politeness +5 |
| Vietnamese | Dozens of pronouns that encode age, gender, and social relationship | Politeness +5, Directness -3 |
| Dutch | Cultural norm of directness that exceeds what Hofstede scores predict | Directness +5, Politeness -3 |
| Hebrew | Dugri culture -- "straight talk" valued as honesty, not rudeness | Directness +5, Politeness -3 |
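As a minimal sketch, the table's corrections can be layered on top of the formula defaults like this. The adjustment values are copied from the table (only a few languages shown); the dict layout, the function name, and the 0-100 clamp are our assumptions.

```python
ADJUSTMENTS = {
    "japanese": {"politeness": +10, "directness": -5, "formality": +5},
    "persian":  {"politeness": +10, "directness": -8},
    "korean":   {"politeness": +8,  "directness": -3, "formality": +5},
    "dutch":    {"directness": +5,  "politeness": -3},
}

def apply_adjustments(defaults: dict, language: str) -> dict:
    """Apply per-language corrections, keeping each value in 0-100."""
    adjusted = dict(defaults)
    for dimension, delta in ADJUSTMENTS.get(language, {}).items():
        adjusted[dimension] = max(0, min(100, adjusted[dimension] + delta))
    return adjusted

# Japanese example: the formula gives politeness 67 and directness 33;
# the keigo adjustments move them to 77 and 28. (The formality 78 and
# attribution 39 here are our own computations from the same formulas.)
japanese = apply_adjustments(
    {"politeness": 67, "directness": 33, "formality": 78, "attribution": 39},
    "japanese",
)
```

A language with no entry in the table simply keeps its formula defaults unchanged.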
Why underrepresented languages matter
There are over 7,000 languages spoken in the world. Most AI models are trained on roughly 100 of them. That means billions of people are effectively left out of the AI revolution.
The problem comes down to data. AI language models learn from text on the internet -- and the internet is overwhelmingly in English. When a model has seen millions of English sentences but only a few thousand in Yoruba or Khmer, it simply cannot produce the same quality of output. This creates a cycle: less data leads to worse tools, which leads to less digital content, which leads to even less data.
What "low-resource" means -- A language is considered low-resource when there is not enough digitized text data to train AI models effectively. This can happen because: (1) the language has fewer speakers, (2) speakers have limited internet access, or (3) the language uses a script or structure that existing tools handle poorly.
Why this matters for cultural translation
Cultural translation is especially important for underrepresented languages -- and especially hard. Here is why:
- Cultural norms are less documented. For English or French, there are thousands of studies on politeness, formality, and communication styles. For Amharic, Yoruba, or Khmer, this research is sparse. Our Hofstede-based defaults help fill this gap with a principled starting point.
- Translation quality is lower. When AI models have less training data for a language, they are more likely to produce translations that are grammatically correct but culturally wrong -- exactly the problem Conteranto addresses.
- The stakes are higher. A mistranslation between English and Dutch is usually caught quickly. A mistranslation into a language with fewer bilingual speakers may go unnoticed and cause real misunderstanding.
- Every translation generates data. When users translate into underrepresented languages with Conteranto, they create examples of culturally adapted text that did not exist before -- contributing to the research on how these cultures actually communicate.
What Conteranto does differently
In the language selector, we mark each language with its NLP representation level -- high, medium, low, or very low. Languages are sorted within each region so that the most underrepresented ones appear first. This is not just labeling; it is a deliberate design choice to draw attention to the languages that need cultural translation the most and benefit from it the most.
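The selector ordering described above can be sketched as a simple two-key sort: group by region, then list the least-represented languages first. The sample entries and their resource labels below are illustrative assumptions, not our actual data.

```python
# Lower rank = less NLP representation = listed earlier within a region.
RESOURCE_RANK = {"very low": 0, "low": 1, "medium": 2, "high": 3}

languages = [
    {"name": "German",  "region": "Europe", "resource": "high"},
    {"name": "Frisian", "region": "Europe", "resource": "low"},
    {"name": "Maltese", "region": "Europe", "resource": "very low"},
]

# Sort by region, then from least to most represented, then by name.
ordered = sorted(
    languages,
    key=lambda lang: (lang["region"], RESOURCE_RANK[lang["resource"]], lang["name"]),
)
# ordered: Maltese ("very low"), then Frisian ("low"), then German ("high")
```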
Full data table
Below are all 76 languages with their research inputs (PDI, IDV, UAI from Hofstede; CTX estimated by us) and the computed output values. Every number is transparent -- you can verify the formula yourself.
| Language | Region | PDI | IDV | UAI | CTX* | Adj. | Pol. | Dir. | For. | Att. |
|---|---|---|---|---|---|---|---|---|---|---|
| Loading data from API... | ||||||||||
* CTX (Context) values are our estimates based on Hall's qualitative framework, not published scores.
References and further reading
Foundational works
- Hofstede, G., Hofstede, G. J., & Minkov, M. (2010). Cultures and Organizations: Software of the Mind. 3rd ed. New York: McGraw-Hill.
- Hall, E. T. (1976). Beyond Culture. New York: Anchor Books.
- Brown, P. & Levinson, S. C. (1987). Politeness: Some Universals in Language Usage. Cambridge: Cambridge University Press.
- Meyer, E. (2014). The Culture Map: Breaking Through the Invisible Boundaries of Global Business. New York: PublicAffairs.
Recent research on cultural alignment in AI (2024-2025)
- Masoud, R. I., Liu, Z., Ferianc, M., Treleaven, P., & Rodrigues, M. (2024). Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions. COLING 2025. arXiv:2309.12342
- Li, C., Chen, M., Wang, J., Sitaram, S., & Xie, X. (2024). CultureLLM: Incorporating Cultural Differences into Large Language Models. NeurIPS 2024. proceedings.neurips.cc
- Kharchenko, J., Roosta, T., Chadha, A., & Shah, C. (2024). How Well Do LLMs Represent Values Across Cultures? Empirical Analysis Based on Hofstede Cultural Dimensions. arXiv:2406.14805
Underrepresented languages and AI (2024-2025)
- Stanford HAI (2025). Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts. hai.stanford.edu
- Liu, Y. (2025). Improving Machine Translation Accuracy for Underrepresented Languages Using Transformer Models. International Journal of Bilingualism. doi:10.1177/14727978251337995
- Cambridge NLP (2024). Natural Language Processing Applications for Low-Resource Languages. Natural Language Processing, 31, 183-197. doi:10.1017/nlp.2024.33
- Cohere for AI (2024). Aya: A Massively Multilingual Language Model Covering 101 Languages. cohere.com
Data sources
- Clearly Cultural. Geert Hofstede Cultural Dimensions. clearlycultural.com
- Hofstede Insights. Country Comparison Tool. hofstede-insights.com