Why AI Writes Better Code in Some Languages Than Others
A Research-Based Analysis of LLM Code Generation Quality
January 2026
The Theory
After months of working with AI coding assistants across different technology stacks, I've developed a theory: the quality of AI-generated code depends not just on how "smart" the model is, but on the intersection of three factors — training data quality, language constraint mechanisms, and framework establishment.
In practical terms: Go produces more reliable AI-generated code than C#. Orleans outperforms custom actor frameworks. Python beats JavaScript for consistency. These aren't random observations — they're predictable outcomes once you understand how LLMs actually learn to write code.
This post digs into the research behind these observations and proposes a model for predicting where AI coding assistance will excel (and where it will struggle). For teams making technology decisions in 2026, understanding this dynamic is becoming increasingly important.
The Quality-Over-Quantity Paradigm
Academic Consensus
Recent research has fundamentally shifted our understanding of what drives LLM code generation quality. A 2024 study presented at the IEEE/ACM Automated Software Engineering conference found that practitioners rank Reliability, Relevance, and Accuracy as the most important dataset characteristics, while sheer volume ranked significantly lower.^[1]
Li et al. (2023) in their paper From Quantity to Quality introduced the Instruction-Following Difficulty (IFD) score, demonstrating that careful selection of high-quality training samples can outperform larger datasets of mixed quality.^[2] A 2025 arXiv study on training data optimization found that "nearly all optimization techniques improve LLM-based code generation, underscoring data quality as a primary performance driver."^[3]
Critically, the same study revealed that "combining multiple techniques rarely produces additive gains in functional correctness, revealing a clear upper bound."^[3] Beyond a certain quality threshold, additional data provides diminishing returns.
The Python vs JavaScript Question
Consider why Python training data may be higher quality than JavaScript data. Python benefits from several quality advantages: extensive documentation in scientific computing, consistent coding standards (PEP 8), and a machine-learning ecosystem in which much of the public code is written by researchers who use Python professionally.
JavaScript's ecosystem fragmentation — multiple frameworks, rapid evolution, varying quality of npm packages — introduces noise into training data. The Continue.dev analysis of multilingual LLM performance noted that while JavaScript has one of the largest presences on GitHub and Stack Overflow, benchmark performance does not linearly correlate with dataset size.^[4]
The Constraint Hypothesis: Why "Opinionated" Languages Excel
Functional Programming and Predictability
Some of the most interesting observations come from functional programming communities. As one practitioner analysis noted: "In functional programming, everything is immutable, side effects are discouraged, and you don't have to worry about distant or abstract values hiding somewhere in your codebase... When an AI is trying to understand what a piece of code does, this predictability is invaluable."^[5]
Chris McCord, creator of Phoenix Framework, argued at a 2025 conference that Elixir's "cohesive tooling and language design" make it well-suited for AI coding agents. Unlike fragmented ecosystems like JavaScript, "we have Mix, your build tool" providing a unified experience.^[6]
Go's Design Philosophy
Go's strong alignment with AI code generation is rooted in its deliberate design constraints. Go enforces a single canonical code format (gofmt), explicit error handling, a small feature set (no inheritance, and no generics until Go 1.18), and strong static typing. These constraints shrink the "solution space" an LLM must navigate.
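The error-handling constraint is a good concrete illustration: Go has essentially one idiomatic way to propagate a failure, so a model has little room to improvise. A minimal sketch (`parsePort` is a hypothetical helper, not from any real codebase):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parsePort illustrates Go's single canonical error-handling idiom:
// every fallible call returns (value, error), and the caller must
// check err explicitly before using the value.
func parsePort(s string) (int, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("invalid port %q: %w", s, err)
	}
	if p < 1 || p > 65535 {
		return 0, fmt.Errorf("port %d out of range", p)
	}
	return p, nil
}

func main() {
	port, err := parsePort("8080")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("listening on port", port)
}
```

There is no exception hierarchy to choose from, no optional try/catch style, and no way to silently ignore the failure without it being visible in review — exactly the kind of narrowed solution space the constraint hypothesis describes.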
Research on type-constrained code generation (Chaudhuri et al., 2025) demonstrates that "leveraging type systems to guide code generation" significantly reduces compilation errors and increases functional correctness.^[7] Go's strong type system provides exactly this kind of guidance.
The C# Paradox
C# presents an interesting paradox: it has strong typing and excellent documentation, yet AI often struggles more with it than Go. The likely explanation lies in C#'s extensive flexibility. C# supports multiple paradigms (OOP, functional, procedural), numerous ways to accomplish the same task (LINQ vs loops, async patterns, nullable reference types), and constant language evolution adding new syntax.
Research on design patterns and LLMs found that "LLMs often fail to properly understand existing design patterns and coding styles of a project, leading to generated code that does not meet project requirements."^[8] C#'s rich pattern library actually becomes a liability when the AI must choose among many valid approaches.
Established Frameworks vs Custom Code: The Orleans Evidence
BaxBench: Empirical Framework Comparison
The most direct evidence for why established frameworks outperform custom ones comes from BaxBench (Vero et al., 2025), a benchmark testing LLM backend generation across 14 frameworks and 6 languages. The key finding: "in less popular backend frameworks, models further struggle to generate correct and secure applications."^[9]
Even the best model (OpenAI o1) achieved only 62% correctness on established frameworks like Django and Express. Performance dropped significantly for less common frameworks. A custom framework, by definition, has zero training examples — placing it at maximum disadvantage.
At Infonuncio Consulting, we've seen this firsthand. When we work with Orleans — a Microsoft-backed, well-documented actor framework — AI assistance is genuinely helpful. When we've experimented with custom actor implementations, the AI produces code that looks plausible but fundamentally misunderstands the architecture.
The Library Bias Effect
A March 2025 paper, "LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries," found that LLMs "heavily favour well established libraries over high-quality alternatives." NumPy was used unnecessarily in up to 48% of cases when better alternatives existed.^[10]
This bias extends to frameworks: Orleans, as a Microsoft-backed actor framework with substantial training data, benefits from this preference. Custom frameworks, regardless of technical merit, cannot compete with Orleans' representation in training corpora.
Counterarguments and Limitations
The Benchmark Criticism
Most coding benchmarks (HumanEval, MBPP) are Python-centric, potentially inflating perceived Python performance.^[11] When Tencent's AutoCodeBench tested 20 languages equally, they found that "models showed small differences in popular languages like Python and JavaScript, but huge differences in less common languages."^[12]
The Elixir Paradox
Despite Elixir's theoretical advantages from functional programming, practitioners note that "LLMs sometimes mix syntaxes from different languages or suggest functions that don't exist" when generating Elixir code due to limited training data.^[13] The constraint benefits can be overwhelmed by data scarcity.
Productivity vs Quality
One counterintuitive finding: languages where AI assistance seems less impressive may actually be more productive overall. As one analysis noted: "Maybe the reason 'AI doesn't help much' with Elixir isn't because AI is bad at Elixir — maybe it's because Elixir problems are already well-structured enough that we don't need as much help."^[5]
A Revised Model for Predicting AI Code Quality
Based on the research, I propose a model for predicting LLM code generation quality:
LLM Code Generation Quality = f(Training Data Quality × Language Constraints × Framework Establishment)
| Factor | Description |
|---|---|
| Training Data Quality | Not just volume, but consistency, documentation quality, and adherence to best practices in training examples |
| Language Constraints | Strong typing, enforced conventions, limited idioms reduce the solution space and guide the model toward correct outputs |
| Framework Establishment | Well-documented, widely-used frameworks have more training examples and benefit from the LLM's library bias |
Practical Recommendations
For optimal AI-assisted development:
- Prefer established frameworks over custom solutions when AI assistance is important
- Choose languages with strong conventions (Go > Python > JavaScript for AI consistency)
- Document custom code extensively to help the AI understand your patterns
- Consider semantic search tools to surface relevant code context for the AI
Composite Scoring Results
The following table applies this model to common language/framework combinations found in fintech companies (startups and mid-stage). Each factor is scored 1-10, and the composite is the geometric mean of the three scores — a form chosen because all three factors matter: a very low score in any one category drags down the whole.
Scoring Methodology:
- Data Quality (DQ): Volume and quality of training examples, documentation, Stack Overflow presence
- Language Constraints (LC): Type system strength, enforced conventions, idiom consistency
- Framework Establishment (FE): Adoption rate, documentation depth, years in production use
- Composite: Geometric mean of the three factors, (DQ × LC × FE)^(1/3); since each factor is on a 1-10 scale, the result is too
| Language | Framework | Data Quality | Lang. Constraints | Framework Est. | Composite | Notes |
|---|---|---|---|---|---|---|
| Java | Spring Boot | 9 | 7 | 10 | 8.6 | Enterprise standard; massive training corpus |
| Go | Standard Library | 9 | 9 | 10 | 9.3 | Canonical examples; gofmt enforces consistency |
| Python | Django | 9 | 5 | 10 | 7.7 | Excellent docs; Django conventions help offset Python flexibility |
| Go | Gin | 8 | 9 | 8 | 8.3 | Strong constraints; popular API framework |
| Kotlin | Spring Boot | 7 | 8 | 9 | 8.0 | Leverages Java ecosystem; null safety helps |
| Python | FastAPI | 8 | 6 | 7 | 7.0 | Growing fast; type hints improve outcomes |
| Ruby | Rails | 8 | 6 | 9 | 7.6 | "Convention over configuration" aids AI |
| C# | Orleans | 7 | 6 | 7 | 6.6 | Good MS docs; actor model patterns established |
| TypeScript | NestJS | 7 | 7 | 7 | 7.0 | Decorators provide structure; growing adoption |
| C# | Dapr | 6 | 6 | 6 | 6.0 | Newer; sidecar pattern less common in training |
| Scala | Akka | 6 | 7 | 7 | 6.6 | Niche but quality; actor patterns well-documented |
| Elixir | Phoenix | 5 | 8 | 7 | 6.5 | Functional benefits offset by data scarcity |
| Rust | Actix | 6 | 10 | 6 | 7.1 | Compiler catches errors AI would make; smaller corpus |
| TypeScript | Express + Custom | 6 | 5 | 5 | 5.3 | Framework established but custom patterns fragment |
| C# | Custom Framework | 2 | 4 | 1 | 2.0 | Zero public examples; AI has nothing to learn from |
Key Observations:
Go + Standard Library achieves the highest score (9.3). The combination of canonical, high-quality examples with strict language enforcement creates an ideal environment for AI code generation.
The C# gradient is instructive. Orleans (6.6) → Dapr (6.0) → Custom (2.0) demonstrates how framework establishment dominates within a single language: C#'s constraints barely change across the three rows, yet the composite collapses as training-data representation disappears.
Rust's constraint advantage partially compensates for smaller corpus. Despite less training data than Python or Java, Rust's compiler enforcement (LC=10) helps the AI avoid errors it would otherwise make.
Elixir's paradox is visible. Strong constraints (LC=8) but limited data (DQ=5) results in a middle-tier score, explaining the mixed experiences developers report.
TypeScript + Express + Custom patterns score poorly (5.3) despite TypeScript's popularity — the "custom" element fragments the solution space.
Java Spring Boot rivals Go due to sheer training data volume and enterprise standardization offsetting Java's moderate constraint level.
Conclusion
The research suggests that constraint reduction (fewer valid ways to write code), data quality (not quantity), and ecosystem cohesion (unified tooling and conventions) are the key drivers of AI code generation quality.
The Orleans vs custom framework observation is particularly well-supported by the BaxBench findings. The Go vs C# experience aligns with the constraint hypothesis. And the Elixir observations from the functional programming community, while complicated by data scarcity, are consistent with functional programming's predictability advantages.
The practical implication: when planning AI-assisted development, optimize for constraint and convention over raw language power or flexibility.
Appendix: Sources and References
[1] Liu et al., "What Makes a High-Quality Training Dataset for Large Language Models: A Practitioners' Perspective," Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE'24), October 2024. https://dl.acm.org/doi/10.1145/3691620.3695061
[2] Li et al., "From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning," arXiv:2308.12032, 2023. https://arxiv.org/html/2308.12032v4
[3] Anonymous, "On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study," arXiv:2512.24570, December 2025. https://arxiv.org/html/2512.24570v1
[4] Continue.dev, "LLMs are helpful with Python, but what about all of the other programming languages?" Continue Blog, November 2023. https://blog.continue.dev/programming-languages/
[5] Revelry, "Which Language Is Best For AI Code Generation?" Revelry Insights, October 2025. https://revelry.co/insights/artificial-intelligence/which-language-is-best-for-ai-code-generation/
[6] Yolanda, L., "Phoenix Creator Argues Elixir Is AI's Best Language," The New Stack, October 2025. https://thenewstack.io/phoenix-creator-argues-elixir-is-ais-best-language/
[7] Chaudhuri et al., "Type-Constrained Code Generation with Language Models," arXiv:2504.09246, 2025. https://arxiv.org/pdf/2504.09246
[8] Anonymous, "Do Code LLMs Understand Design Patterns?" arXiv:2501.04835, January 2025. https://arxiv.org/html/2501.04835v1
[9] Vero et al., "BaxBench: Can LLMs Generate Correct and Secure Backends?" arXiv:2502.11844, ICML 2025, February 2025. https://baxbench.com/
[10] Twist et al., "LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries," arXiv:2503.17181, March 2025. https://arxiv.org/html/2503.17181v1
[11] Chen et al., "Evaluating Large Language Models Trained on Code," arXiv:2107.03374 (HumanEval), 2021. https://github.com/openai/human-eval
[12] Chou et al. (Tencent Hunyuan), "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators," arXiv:2508.09101, August 2025. https://arxiv.org/html/2508.09101v1
[13] Eberhardt, J., "Writing Elixir with LLMs: Maximizing Efficiency and Avoiding Pitfalls," Medium, November 2024. https://medium.com/@jonnyeberhardt7/writing-elixir-with-llms-maximizing-efficiency-and-avoiding-pitfalls-141a1b65374b
Tags: AI, LLM, Code Generation, Architecture, Orleans, Go, Python, Microservices