Gauging LLM Struggles With Math, Humanities Challenges
March 17, 2026
D.A.D. today covers 12 stories from 4 sources. What's New, What's Innovative, What's Controversial, What's in the Lab, and What's in Academe.
D.A.D. Joke of the Day: My AI wrote a 2,000-word email for me. My boss replied "Got it, thanks." Perfectly balanced, as all automation should be.
What's New
AI developments from the last 24 hours
Mistral's Theorem-Proving Model Delivers Strong Results at 1/15th Claude's Cost
Mistral released Leanstral, an open-source model designed for Lean 4, the proof assistant used in formal mathematics and software verification. The 120B-parameter model uses a sparse architecture (only 6B parameters active at once), which Mistral says delivers strong performance at dramatically lower cost. On the FLTEval benchmark, Leanstral scored 26.3 at $36 versus Claude Sonnet's 23.7 at $549—and reached 31.9 at $290 compared to Claude Opus's 39.6 at $1,650. Released under Apache 2.0.
Why it matters: Formal verification—mathematically proving code is correct—has been too expensive and specialized for most teams; a capable open-source option at a fraction of proprietary costs could make verified software more accessible for safety-critical applications.
Discuss on Hacker News · Source: mistral.ai
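For readers unfamiliar with Lean 4, the system Leanstral targets: a model like this must emit proofs that the Lean kernel can mechanically verify, so a wrong answer simply fails to compile. A minimal illustrative example in plain Lean 4 (not Leanstral output; the theorem names are invented here):

```lean
-- Two toy goals and machine-checkable proofs: one citing a
-- core-library lemma, one built as an explicit proof term.
-- A prover model must generate terms or tactic scripts like
-- these that the Lean kernel then certifies.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

theorem my_and_intro (p q : Prop) (hp : p) (hq : q) : p ∧ q :=
  ⟨hp, hq⟩
```

Because verification is mechanical, benchmark scores like FLTEval's measure how often the model produces proofs that actually check, not how plausible its output looks.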
Nvidia Unveils CPU It Says Doubles Efficiency for AI Agent Workloads
NVIDIA announced the Vera CPU, which it calls the first processor designed specifically for agentic AI—systems where AI agents run many parallel tasks autonomously. The company claims 2x efficiency and 50% faster performance than traditional rack-scale CPUs, with a single rack sustaining over 22,500 concurrent AI environments. Major cloud providers including Oracle, CoreWeave, and Lambda are signed on as partners, alongside hardware makers Dell, HPE, and Lenovo. No independent benchmarks or comparison methodology were provided.
Why it matters: If the performance claims hold up, dedicated AI agent hardware could make running large-scale autonomous AI workflows significantly cheaper—relevant as enterprises explore agents for customer service, research, and operations.
Discuss on Hacker News · Source: nvidianews.nvidia.com
Kagi Adds "LinkedIn Speak" Translation for Corporate Jargon Parody
Kagi, the paid search engine company, added "LinkedIn Speak" as a tongue-in-cheek output option in its translation tool. The feature converts plain text into the inflated corporate vernacular common on the platform—turning phrases like "I need a drink" into musings about self-care and leadership. Users discovered the URL parameter also accepts other novelty outputs like "Pirate speak." The feature doesn't work in reverse, so there's no relief for those drowning in synergy-speak.
Why it matters: It's a joke feature, but it reflects growing fatigue with performative LinkedIn culture—and shows smaller AI companies using humor to differentiate from the big labs.
Discuss on Hacker News · Source: translate.kagi.com
What's Innovative
Clever new use cases for AI
Text-to-Game Tool Generates Playable Godot Projects From Prompts
A developer released Godogen, an open-source tool that generates complete, playable Godot 4 games from text prompts—including game architecture, 2D/3D assets, and working code. The system uses Claude for code generation and a separate Gemini agent for visual quality checks, comparing screenshots against reference images. The project required a year of development to solve challenges like GDScript's limited training data. Early reactions on Hacker News were more positive than expected, with commenters particularly impressed by the spatial reasoning for asset placement.
Why it matters: This signals AI game development tools are moving from code assistance to full project generation—potentially useful for rapid prototyping, though production-quality output remains unproven.
Discuss on Hacker News · Source: github.com
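The generate-render-critique pipeline described above can be sketched generically. This is an illustrative Python skeleton, not Godogen's actual code: every name is invented, and the lambdas are toy stand-ins for where the real system would call Claude, render in Godot, and query Gemini:

```python
# Hypothetical sketch of a generate-and-verify loop: one model proposes
# game code, a second model scores a rendered screenshot against a
# reference, and the loop retries (feeding critique back in) until the
# score clears a threshold or the round budget runs out.

def generate_until_approved(generate, render, critique,
                            threshold=0.8, max_rounds=5):
    """Run propose -> render -> critique rounds; return the best attempt."""
    best = (None, 0.0)
    feedback = None
    for _ in range(max_rounds):
        code = generate(feedback)                 # stand-in for a Claude call
        score, feedback = critique(render(code))  # stand-in for a Gemini check
        if score > best[1]:
            best = (code, score)
        if score >= threshold:
            break
    return best

# Toy stubs so the loop runs without any model APIs or a Godot install.
attempts = iter([0.4, 0.6, 0.9])
code_out = generate_until_approved(
    generate=lambda fb: "extends Node2D",        # placeholder GDScript
    render=lambda code: "screenshot.png",        # placeholder render step
    critique=lambda img: (next(attempts), "move the player sprite up"),
)
print(code_out)  # ('extends Node2D', 0.9)
```

The design point is that the critic closes the loop: code generation alone cannot see whether assets landed sensibly on screen, so a vision model comparing screenshots supplies the missing signal.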
Ex-Google Maps Engineers Build Location Data Service for AI Agents
Voygr, a Y Combinator startup founded by former Google Maps, Apple, and Meta engineers, launched a maps API designed specifically for AI agents. Their pitch: traditional maps APIs offer static snapshots, but roughly 25-30% of business locations change yearly through closures, moves, or rebranding. Voygr claims to combine place data with live web signals—news, events, articles—to detect these real-world changes. They're processing tens of thousands of places daily for enterprise customers and now offering a Business Validation API to developers.
Why it matters: If you're building AI tools that need current local business information—appointment booking, delivery logistics, lead generation—stale data creates real problems; this is early-stage infrastructure worth watching.
Discuss on Hacker News · Source: news.ycombinator.com
What's Controversial
Stories sparking genuine backlash, policy fights, or heated disagreement in the AI community
Journalist Reports Death Threats From Prediction Market Users Over War Coverage
Emanuel Fabian, military correspondent for The Times of Israel, reports receiving death threats from Polymarket users after his coverage of a March 2026 Iranian ballistic missile strike near Beit Shemesh. Fabian says the threats demand he rewrite his reporting to claim the impact came from interceptor debris rather than a missile—apparently because accurate reporting affects prediction market bets. He cites Israeli military confirmation, rescue services reports, and explosion footage supporting the missile assessment.
Why it matters: This signals a troubling intersection of prediction markets and press freedom—when gamblers have money riding on how events are characterized, journalists covering contested facts may face coordinated harassment campaigns aimed at manipulating the historical record.
Discuss on Hacker News · Source: timesofisrael.com
What's in the Lab
New announcements from major AI labs
AI Security Startup Says It Can Outperform Traditional Code Scanners
Codex Security published a technical explanation of why its AI-powered security tool skips traditional Static Application Security Testing (SAST)—the scanning approach used by established tools like Snyk and Checkmarx. The company claims AI-driven "constraint reasoning" can find real vulnerabilities while generating fewer false positives than rule-based SAST, which is notorious for alert fatigue. No benchmarks or comparative data were provided.
Why it matters: If validated, this could signal a shift in how enterprises approach code security—but without evidence, it's a marketing position, not a proven alternative.
What's in Academe
New papers on AI and its effects from researchers
Deep AI Models Get Better at Retaining Information Across Layers
Researchers developed mixture-of-depths attention (MoDA), a technique that lets AI models pull information from both the current processing layer and earlier layers simultaneously. The approach addresses a known problem where signals degrade as they pass through very deep neural networks. In tests on 1.5-billion-parameter models, MoDA improved accuracy by about 2% on downstream tasks while adding only 3.7% to computational costs. The team claims their implementation runs at 97% of the speed of FlashAttention-2, a widely used fast-attention implementation. Code is available on GitHub.
Why it matters: This is research infrastructure—relevant primarily to teams training custom models, but it signals that efficiency gains in deep networks are still being found without major cost increases.
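The core idea—letting attention draw on representations from more than one depth—can be shown schematically. This is not the paper's formulation; it is a minimal single-head NumPy sketch with invented names and shapes, just to make the mechanism concrete:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_depth_attention(h_current, h_earlier, Wq, Wk, Wv):
    """Schematic: queries from the current layer attend over keys/values
    built from BOTH the current layer and an earlier layer, giving the
    signal a direct path around intermediate layers."""
    q = h_current @ Wq                                        # (T, d)
    kv_src = np.concatenate([h_current, h_earlier], axis=0)   # (2T, d)
    k, v = kv_src @ Wk, kv_src @ Wv                           # (2T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (T, 2T)
    return softmax(scores) @ v                                # (T, d)

T, d = 4, 8
rng = np.random.default_rng(0)
h_now, h_prev = rng.normal(size=(T, d)), rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_depth_attention(h_now, h_prev, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because deep layers can attend directly to shallow-layer states, information no longer has to survive every intermediate transformation—the degradation problem the summary describes.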
Most AI Models Score Near Zero on Unsolved Math Problems, but GPT-5.4 Reportedly Solved Two
Researchers released HorizonMath, a benchmark of over 100 predominantly unsolved mathematical problems designed to measure whether AI can contribute to genuine mathematical discovery rather than just solve textbook problems. Most frontier models score near 0%. The notable exception: GPT-5.4 Pro reportedly proposed solutions to two problems that improve on the best-known published results—potential novel mathematical contributions now pending expert review. The benchmark includes an open-source framework for automated verification.
Why it matters: If the GPT-5.4 Pro results survive peer review, this would represent AI moving from solving known problems to contributing original mathematical research—a qualitative shift in capability.
Distillation Method Promises Faster AI Models for Long Documents
Researchers developed a method to convert standard large language models into xLSTM, a recurrent architecture whose processing cost grows linearly with input length, making it comparatively faster on long inputs than transformer attention. The technique distills knowledge from models like Llama and Qwen into smaller, structurally different students. The team claims the converted models recover most of the original's capability and occasionally outperform it on certain tasks. This is early research without production implementations.
Why it matters: If the approach scales, it could eventually let companies run capable AI models faster and cheaper, particularly for long-document processing—though this remains academic work for now.
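Cross-architecture distillation of this kind generally trains the student to match the teacher's output distribution over next tokens. A generic soft-target loss sketch in NumPy (not the paper's exact objective; the temperature, shapes, and data here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    the standard soft-target objective for knowledge distillation.
    The student (e.g. an xLSTM) is pushed toward the teacher's
    (e.g. a transformer's) next-token distribution at every position."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return float(np.sum(t * (np.log(t) - np.log(s)), axis=-1).mean())

rng = np.random.default_rng(1)
teacher = rng.normal(size=(5, 100))   # 5 positions, 100-token toy vocab
student = teacher + rng.normal(scale=0.1, size=(5, 100))
# A student close to the teacher scores a much lower loss than a random one.
print(distill_loss(student, teacher) <
      distill_loss(rng.normal(size=(5, 100)), teacher))
```

Matching full distributions rather than single correct tokens is what lets knowledge transfer across architectures: the student never needs the teacher's internal structure, only its outputs.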
Pokémon Battles Emerge as AI Decision-Making Benchmark
Researchers launched the PokeAgent Challenge, using Pokémon's battle system as a benchmark for AI decision-making. The framework includes two tracks: strategic battling under incomplete information and long-term planning through RPG speedrunning. A NeurIPS 2025 competition drew over 100 teams, with results showing significant performance gaps between general-purpose LLMs, specialized reinforcement learning agents, and elite human players. The researchers claim Pokémon battling tests capabilities that standard AI benchmarks miss entirely, supported by a dataset of 20 million+ battle trajectories.
Why it matters: This signals growing recognition that current AI benchmarks may not capture real-world decision-making skills—and that game environments, with their complex strategy and uncertainty, could become serious research tools rather than novelties.
Frontier AI Models Struggle to Design Valid Social Science Experiments
Researchers built InterveneBench, a benchmark testing whether AI models can design valid social science experiments—the kind of causal reasoning behind policy evaluation and A/B testing. Drawing from 744 peer-reviewed studies across policy domains, the benchmark reveals that current frontier models struggle with intervention design: figuring out what to change, what to measure, and how to isolate cause from correlation. The researchers' multi-agent framework, STRIDES, reportedly outperforms standard reasoning models, though specific numbers weren't released.
Why it matters: For organizations using AI to design experiments, evaluate programs, or inform policy decisions, this suggests current models may confidently propose flawed research designs—a gap worth knowing before you trust AI-generated study plans.
What's Happening on Capitol Hill
Upcoming AI-related committee hearings
Tuesday, March 17 · DeepSeek and Unitree Robotics: Examining the National Security Risks of PRC Artificial Intelligence, Robotics, and Autonomous Technologies and Building a Secure U.S. Technology Base · House Homeland Security Subcommittee on Cybersecurity and Infrastructure Protection (Hearing) · Room 310, Cannon House Office Building
What's On The Pod
Some new podcast episodes
The Cognitive Revolution — AI Scouting Report: the Good, Bad, & Weird @ the Law & AI Certificate Program, by LexLab, UC Law SF
How I AI — From journalist to iOS developer: How LinkedIn’s editor builds with Claude Code | Daniel Roth
AI in Business — Why Supply Chain Design Becomes the Differentiator as AI Automates Planning - with Don Hicks of Optilogic