Since the introduction of the o1 reasoning model, there have been significant advances in AI reasoning. DeepSeek shared the RL post-training process used to instill reasoning into DeepSeek-R1, and this year many papers have presented refined RL post-training algorithms for AI reasoning.
This week’s AI research review covers papers that expose limits to these methods and push AI reasoning capabilities further – taking reasoning into the visual domain using RL techniques, building structured reasoning from the ground up, instilling causal reasoning, and examining strategic reasoning capabilities:
GLM-4.1V-9B-Thinking
ASTRO: Teaching LLMs to Reason with Search
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
Causal Reasoning in LLMs: Reality or Mirage?
Researchers from Zhipu AI & Tsinghua University have introduced GLM-4.1V-Thinking, a vision-language model (VLM) engineered for general-purpose multimodal reasoning. Addressing the challenge of achieving broad-spectrum reasoning capabilities in VLMs, the researchers present a novel reasoning-centric training framework for GLM-4.1V-Thinking in the paper “GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.”
The GLM-4.1V-Thinking architecture leverages a ViT Encoder (AIMv2-Huge) for visual processing, an MLP Projector for feature alignment, and the GLM4 LLM as the decoder. It handles native image and video resolutions and incorporates 3D-RoPE for enhanced spatial and temporal awareness.
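To make that layout concrete, here is a minimal PyTorch sketch of the encoder–projector–decoder arrangement described above. The module names and dimensions are illustrative placeholders, not the released GLM-4.1V code, and positional handling such as 3D-RoPE is only noted in a comment.

```python
# Minimal sketch of the described VLM layout: a ViT-style vision encoder,
# an MLP projector that maps visual features into the LLM embedding space,
# and a language-model decoder that consumes the concatenated visual and
# text tokens. All names and shapes are illustrative placeholders.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(visual_feats)

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: MLPProjector,
                 llm_decoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. an AIMv2-style ViT
        self.projector = projector
        self.llm_decoder = llm_decoder         # e.g. a GLM-style decoder

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        visual_feats = self.vision_encoder(images)    # patch-level features
        visual_tokens = self.projector(visual_feats)  # aligned to LLM space
        # Prepend visual tokens to the text embeddings and decode jointly;
        # positional handling (e.g. 3D-RoPE) would live inside the decoder.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm_decoder(inputs)
```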
The training pipeline progresses through three stages: multimodal pre-training on a diverse, knowledge-intensive corpus; supervised fine-tuning on meticulously curated long Chain-of-Thought (CoT) data to establish a reasoning style; and a critical RL phase that utilizes both Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Human Feedback (RLHF), underpinned by a robust, multi-domain reward system crucial for preventing training collapse.
They further boost RL results using a novel technique called Reinforcement Learning with Curriculum Sampling (RLCS). RLCS dynamically adjusts sampling difficulty based on the model’s evolving competence, significantly boosting learning efficiency.
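The paper’s exact RLCS weighting is not reproduced here, but a minimal sketch of the underlying curriculum idea might look like the following, assuming the common heuristic of over-sampling problems the model currently solves at an intermediate rate, where rollouts carry the most learning signal.

```python
# A minimal curriculum-sampling sketch: weight each problem by p * (1 - p),
# which peaks at a 50% solve rate and vanishes for problems the model
# always or never solves. This is an assumed heuristic, not the RLCS
# algorithm as implemented for GLM-4.1V-Thinking.
import random

def sample_batch(problems, pass_rates, batch_size=32):
    """problems: list of task ids; pass_rates: dict id -> recent solve rate in [0, 1]."""
    weights = [pass_rates.get(p, 0.5) * (1.0 - pass_rates.get(p, 0.5)) + 1e-3
               for p in problems]
    return random.choices(problems, weights=weights, k=batch_size)

def update_pass_rate(pass_rates, problem_id, solved, momentum=0.9):
    """Track the model's evolving competence with an exponential moving average."""
    prev = pass_rates.get(problem_id, 0.5)
    pass_rates[problem_id] = momentum * prev + (1.0 - momentum) * float(solved)
```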
Figure 2. Domain-specific reward design in the RL reward system for GLM-4.1V-Thinking, used to improve its reasoning across a broad range of tasks in several modalities.
Their use of RLCS and domain-specific reward-system innovations in RL post-training substantially boosts the model’s performance, with gains of up to 7.3% on reasoning benchmarks. As a result, GLM-4.1V-9B-Thinking demonstrates state-of-the-art performance among models of comparable size:
In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities.
Figure 3. (A) GLM-4.1V-9B-Thinking matches or outperforms the much larger Qwen2.5-VL-72B and the closed-source GPT-4o across a range of tasks. (B) Reinforcement learning substantially boosts the model’s performance, with gains of up to +7.3%.
Research from Meta AI, published in “ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context,” addresses the challenge of systematically teaching LLMs to internalize structured reasoning. They introduce the ASTRO (“Autoregressive Search-Taught Reasoner”) framework, which teaches an LLM to reason like a classical search algorithm, imbuing self-reflection, backtracking, and exploration through the reasoning training process.
The ASTRO framework instills robust reasoning capabilities into LLMs by inserting search processes into training inputs. An ASTRO-trained model generates the entire search trajectory, complete with its twists and turns, as a coherent stream of thought. The key innovation is to make the model internalize the entire search process—including exploration, self-reflection on intermediate steps, and backtracking from errors—within a single, continuous autoregressive generation.
Training a model with ASTRO operates in three key stages. First, ASTRO generates a synthetic dataset of search trajectories by applying Monte Carlo Tree Search (MCTS) to mathematical problem-solving. These search traces are then linearized and converted into natural language Chain-of-Thoughts (CoTs), which crucially injects explicit self-reflection and backtracking phrases into training from the search.
This dataset subsequently informs a supervised fine-tuning (SFT) stage, bootstrapping models with a rich prior for autoregressive search. Finally, reinforcement learning (RL) with verifiable rewards further optimizes the model’s search and reasoning proficiencies.
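A minimal sketch of the linearization step is shown below, assuming a simplified trace format; the reflection and backtracking phrases are illustrative stand-ins for the natural-language templates ASTRO injects, not the authors’ actual pipeline.

```python
# Sketch of turning a search trace (including dead ends) into a single
# natural-language chain of thought with explicit reflection and
# backtracking. The node format and wording templates are illustrative.
from dataclasses import dataclass

@dataclass
class SearchStep:
    thought: str   # the reasoning step explored at this node
    correct: bool  # whether the search judged this step promising

def linearize_trace(steps: list[SearchStep]) -> str:
    """Convert a sequence of explored steps into one coherent CoT string."""
    lines = []
    for step in steps:
        lines.append(step.thought)
        if step.correct:
            lines.append("Let me check this step... it looks consistent, so I continue.")
        else:
            # Dead ends stay in the trace so the model learns to backtrack.
            lines.append("Wait, this doesn't lead anywhere. Let me go back and try a different approach.")
    return "\n".join(lines)

# Example: one failed branch followed by a successful one.
trace = [
    SearchStep("Try factoring the quadratic as (x-2)(x-3).", correct=False),
    SearchStep("Instead, apply the quadratic formula with a=1, b=-4, c=2.", correct=True),
]
print(linearize_trace(trace))
```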
Applying ASTRO to the Llama 3 family of models yielded significant performance improvements on challenging mathematical reasoning benchmarks. Llama-3.1-70B-ASTRO-RL achieved absolute gains of 16% on MATH-500, 26.9% on AMC 2023, and 20% on AIME 2024, surpassing other advanced baselines. A critical finding is that search-based reasoning traces are essential: Models trained with explicit self-reflection and backtracking significantly outperformed those without.
This paper rigorously investigates whether LLMs can be trained for strategic reasoning by applying RL in the domain of chess. It reaches the surprising conclusion that while RL with dense, expert-derived rewards improves tactical performance, LLMs consistently plateau far below human expert levels:
Our experiments show that our distillation-based dense rewards often outperform sparse binary rewards. However, surprisingly, all models plateau far below expert levels.
The experimental setup was designed to isolate the impact of RL on strategic reasoning. The methodology involved fine-tuning Qwen 2.5 and Llama 3.1 models with Group Relative Policy Optimization (GRPO) on a Lichess puzzle dataset. A novel aspect was employing a pre-trained chess expert network to provide dense, continuous reward signals based on move quality, effectively a knowledge distillation process.
Figure 4. Overview of the chess training process. (a) A data sample from the Lichess puzzle dataset is formatted into a prompt that includes the current board state and the set of legal moves. (b) At each GRPO step, the policy model generates multiple rollouts of predicted actions. A reward model evaluates these rollouts with dense feedback, including sub-optimal actions.
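As a rough sketch of the reward designs being compared, the snippet below contrasts a sparse binary reward with a dense, expert-distilled reward, and shows the kind of group-relative advantage normalization GRPO uses; the `expert_value` interface is an assumption rather than the authors’ code.

```python
# Sketch of sparse vs. dense rewards for a predicted chess move, plus
# GRPO-style group-relative advantages. `expert_value` stands in for the
# pre-trained expert network described in the paper (assumed interface).
import statistics

def sparse_reward(predicted_move: str, solution_move: str) -> float:
    """Binary reward: credit only for reproducing the puzzle's solution move."""
    return 1.0 if predicted_move == solution_move else 0.0

def dense_reward(board_fen: str, predicted_move: str, legal_moves: set,
                 expert_value) -> float:
    """Dense reward distilled from an expert: sub-optimal but reasonable
    moves still receive partial credit from the expert's evaluation."""
    if predicted_move not in legal_moves:          # illegal moves earn nothing
        return 0.0
    return expert_value(board_fen, predicted_move)  # continuous score in [0, 1]

def group_relative_advantages(rewards: list) -> list:
    """GRPO normalizes each rollout's reward against the other rollouts
    sampled for the same prompt, rather than using a learned critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```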
The performance of this approach was compared against training with sparse binary rewards. Key results are that distillation-based dense rewards substantially outperform sparse binary rewards, yet all models plateau at 25-30% puzzle accuracy, well below expert performance (60-80%). Even with additional supervised fine-tuning on expert reasoning traces, performance did not improve, as models struggled with basic chess rules and board state comprehension.
This leads to the paper’s critical insight that this failure stems from a deficit in the pretrained model’s internal world model:
“RL alone may not be able to fully overcome [the] deficit in the pretrained models’ internal understanding of chess.”
RL primarily amplifies existing capabilities in pre-trained LLMs rather than teaching new domain knowledge; it cannot create strategic understanding that does not already exist in the foundation model. This suggests that adequate domain-specific exposure during pre-training is essential for developing advanced strategic reasoning in complex new environments.
The research paper “Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?” critically assesses whether LLMs exhibit genuine human-like causal reasoning or merely leverage memorized knowledge. LLMs often appear to demonstrate causal reasoning, correctly identifying cause-and-effect relationships in text, but a critical open question is whether this is genuine reasoning or a “mirage” created by retrieving causal patterns memorized from training data.
The authors propose a distinction between “level-1” (shallow, knowledge-retrieval based) and “level-2” (genuine, deduction-based, new knowledge generation) causal reasoning, arguing that current LLMs primarily operate at level-1.
They first empirically validate this hypothesis with a new causal Q&A benchmark called CausalProbe-2024. They find that:
The LLMs exhibit a significant performance drop on CausalProbe-2024 compared to earlier benchmarks, indicating the fact that they primarily engage in level-1 causal reasoning.
Figure 5. A diagram illustrating how autoregression fails to capture the correct causal knowledge. The paper argues that the autoregressive, next-token-prediction mechanism at the heart of transformer-based LLMs is not inherently causal.
To bridge this gap in causal reasoning, the paper proposes G2-Reasoner, a framework inspired by human reasoning that integrates external general knowledge via Retrieval-Augmented Generation (RAG) and goal-oriented thinking via prompts to guide LLMs. These steer the model towards a causal inference process.
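A minimal sketch of that RAG-plus-goal-prompting recipe is shown below, assuming a generic `retrieve` function for the external knowledge; the prompt wording is illustrative rather than the paper’s exact template.

```python
# Sketch of combining retrieved general knowledge with a goal-oriented
# prompt, in the spirit of G2-Reasoner. The `retrieve` callable and the
# prompt text are assumptions, not the paper's implementation.
def build_g2_prompt(question: str, retrieve, k: int = 3) -> str:
    snippets = retrieve(question, k=k)   # external general knowledge via RAG
    knowledge = "\n".join(f"- {s}" for s in snippets)
    return (
        "You are answering a causal question. Use the general knowledge below, "
        "reason step by step about causes and effects, and stay focused on the "
        "goal of identifying the true causal relationship rather than surface "
        "associations.\n\n"
        f"General knowledge:\n{knowledge}\n\n"
        f"Question: {question}\n"
        "Answer with the cause-effect relationship and a brief justification."
    )
```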
In evaluations, G2-Reasoner demonstrated that it “significantly enhances LLMs’ causal reasoning capability,” particularly in fresh and counterfactual contexts, outperforming vanilla, CoT, and RAG baselines. This suggests that while LLMs may not possess innate causal reasoning, their capabilities can be substantially enhanced by augmenting them with external knowledge and more structured reasoning frameworks.
While G2-Reasoner offers a promising initial step towards fostering more genuine, deductive causal reasoning, achieving full human-like level-2 capability remains a significant challenge requiring further exploration into broader knowledge integration and sophisticated reasoning mechanisms.
These research papers show that current AI reasoning models are limited in deeper forms of reasoning, such as strategic and causal reasoning. Today’s reasoning models learn to reason by being trained with RL on reasoning traces, a form of “cognitive distillation” that trains the model on specific patterns of thinking. This teaches models to follow known chains of thought, but it is not sufficient to build models that can think in more complex, deeper, and more powerful ways.
We’ll need more breakthroughs to get to AGI-level reasoning. However, these results give possible directions on what those breakthroughs might include:
Breadth: GLM-4.1V-Thinking improves its reasoning with cross-domain reasoning challenges covering a broad range of tasks in several modalities.
Self-learning: ASTRO distills the algorithmic process of MCTS into a natural language format, effectively teaching the model how to bootstrap its own learning.
Knowledge context support: The causal reasoning paper proposes a new framework, G2-Reasoner, which integrates external knowledge via RAG to augment the model’s core capabilities. This may help overcome the barrier identified in the chess paper, which showed that RL training cannot overcome gaps in a foundation model’s domain knowledge.
A new study shows that the language used to prompt AI chatbots can steer them toward different cultural mindsets, even when the question stays the same. Researchers at MIT and Tongji University found that large language models like OpenAI’s GPT and China’s ERNIE change their tone and reasoning depending on whether they’re responding in English or Chinese.
The results indicate that these systems do more than translate between languages; they also reflect cultural patterns. These patterns appear in how the models provide advice, interpret logic, and handle questions related to social behavior.
Same Question, Different Outlook
The team tested both GPT and ERNIE by running identical tasks in English and Chinese. Across dozens of prompts, they found that when GPT answered in Chinese, it leaned more toward community-driven values and context-based reasoning. In English, its responses tilted toward individualism and sharper logic.
Take social orientation, for instance. In Chinese, GPT was more likely to favor group loyalty and shared goals. In English, it shifted toward personal independence and self-expression. These patterns matched well-documented cultural divides between East and West.
When it came to reasoning, the shift continued. The Chinese version of GPT gave answers that accounted for context, uncertainty, and change over time. It also offered more flexible interpretations, often responding with ranges or multiple options instead of just one answer. In contrast, the English version stuck to direct logic and clearly defined outcomes.
No Nudging Needed
What’s striking is that these shifts occurred without any cultural instructions. The researchers didn’t tell the models to act more “Western” or “Eastern.” They simply changed the input language. That alone was enough to flip the models’ behavior, almost like switching glasses and seeing the world in a new shade.
To check how strong this effect was, the researchers repeated each task more than 100 times. They tweaked prompt formats, varied the examples, and even changed gender pronouns. No matter what they adjusted, the cultural patterns held steady.
Real-World Impact
The study didn’t stop at lab tests. In a separate exercise, GPT was asked to choose between two ad slogans, one that stressed personal benefit, another that highlighted family values. When the prompt came in Chinese, GPT picked the group-centered slogan most of the time. In English, it leaned toward the one focused on the individual.
This might sound small, but it shows how language choice can guide the model’s output in ways that ripple into marketing, decision-making, and even education. People using AI tools in one language may get very different advice than someone asking the same question in another.
Can You Steer It?
The researchers also tested a workaround. They added cultural prompts, telling GPT to imagine itself as a person raised in a specific country. That small nudge helped the model shift its tone, even in English, suggesting that cultural context can be dialed up or down depending on how the prompt is framed.
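As an illustration of the kind of persona nudge described (not the study’s actual protocol), a prompt along these lines could be sent through the OpenAI Python SDK; the model name and wording here are assumptions.

```python
# Illustrative cultural-persona prompt via the OpenAI Python SDK.
# The model name, persona wording, and task are assumptions for the sketch.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Imagine you are a person who was raised in China and answer accordingly."},
        {"role": "user",
         "content": "Which ad slogan works better: one stressing personal benefit, "
                    "or one highlighting family values? Explain briefly."},
    ],
)
print(response.choices[0].message.content)
```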
Why It Matters
The findings concern how language affects the way AI models present information. Differences in response patterns suggest that the input language influences how content is structured and interpreted. As AI tools become more integrated into routine tasks and decision-making processes, language-based variations in output may influence user choices over time.
Indonesia’s Mount Lewotobi Laki-laki has begun erupting again – at one point shooting an ash cloud 18km (11mi) into the sky – as residents flee their homes once more.
There have been no reports of casualties since Monday morning, when the volcano on the island of Flores began spewing ash and lava again. Authorities have placed it on the highest alert level since an earlier round of eruptions three weeks ago.
At least 24 flights to and from the neighbouring resort island of Bali were cancelled on Monday, though some flights had resumed by Tuesday morning.
The initial column of hot clouds that rose at 11:05 (03:05 GMT) Monday was the volcano’s highest since November, said geology agency chief Muhammad Wafid.
“An eruption of that size certainly carries a higher potential for danger, including its impact on aviation,” Wafid told The Associated Press.
Monday’s eruption, which was accompanied by a thunderous roar, led authorities to enlarge the exclusion zone to a 7km radius from the central vent. They also warned of potential lahar floods – a type of mud or debris flow of volcanic materials – if heavy rain occurs.
The twin-peaked volcano erupted again at 19:30 on Monday, sending ash clouds and lava up to 13km into the air. It erupted a third time at 05:53 on Tuesday at a reduced intensity.
Videos shared overnight show glowing red lava spurting from the volcano’s peaks as residents get into cars and buses to flee.
More than 4,000 people have been evacuated from the area so far, according to the local disaster management agency.
Residents who have stayed put are facing a shortage of water, food and masks, local authorities say.
“As the eruption continues, with several secondary explosions and ash clouds drifting westward and northward, the affected communities who have not been relocated… require focused emergency response efforts,” says Paulus Sony Sang Tukan, who leads Pululera village, about 8km from Lewotobi Laki-laki.
“Water is still available, but there’s concern about its cleanliness and whether it has been contaminated, since our entire area was blanketed in thick volcanic ash during yesterday’s [eruptions],” he said.
Indonesia sits on the Pacific “Ring of Fire” where tectonic plates collide, causing frequent volcanic activity as well as earthquakes.
Lewotobi Laki-laki has erupted multiple times this year – no casualties have been reported so far.
As tools such as ChatGPT, Copilot and other generative artificial intelligence (AI) systems become part of everyday workflows, more companies are looking for employees who can answer “yes” when asked whether they can work effectively with AI – in other words, people who can prompt effectively, think with AI, and use it to boost productivity.
In fact, in a growing number of roles, being “AI fluent” is quickly becoming as important as being proficient in office software once was.
But we’ve all had that moment when we’ve asked an AI chatbot a question and received what feels like the most generic, surface-level answer. The problem isn’t the AI – you just haven’t given it enough to work with.
Think of it this way. During training, the AI will have “read” virtually everything on the internet. But because it makes predictions, it will give you the most probable, most common response. Without specific guidance, it’s like walking into a restaurant and asking for something good. You’ll likely get the chicken.
Your solution lies in understanding that AI systems excel at adapting to context, but you have to provide it. So how exactly do you do that?
Crafting better prompts
You may have heard the term “prompt engineering”. It might sound as though you need to design some kind of technical script to get results, but in practice it simply means communicating clearly with the model.
To get the most out of your AI conversations, it’s important that you convey a few basics about what you want, and how you want it. Our approach follows the acronym CATS – context, angle, task and style.
Context means providing the setting and background information the AI needs. Instead of asking “How do I write a proposal?” try “I’m a nonprofit director writing a grant proposal to a foundation that funds environmental education programs for urban schools”. Upload relevant documents, explain your constraints, and describe your specific situation.
Angle (or attitude) leverages AI’s strength in role-playing and perspective-taking. Rather than getting a neutral response, specify the attitude you want. For example, “Act as a critical peer reviewer and identify weaknesses in my argument” or “Take the perspective of a supportive mentor helping me improve this draft”.
Task is specifically about what you actually want the AI to do. “Help me with my presentation” is vague. But “Give me three ways to make my opening slide more engaging for an audience of small business owners” is actionable.
Style harnesses AI’s ability to adapt to different formats and audiences. Specify whether you want a formal report, a casual email, bullet points for executives, or an explanation suitable for teenagers. Tell the AI what voice you want to use – for example, a formal academic style, technical, engaging or conversational.
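If you prefer to think in code, here is a small, illustrative helper that assembles a prompt from the four CATS components; the function is hypothetical, and the example wording reuses the examples above.

```python
# Illustrative helper: build a prompt from the four CATS components
# (context, angle, task, style). The helper and wording are examples only.
def cats_prompt(context: str, angle: str, task: str, style: str) -> str:
    return "\n".join([context, angle, task, style])

prompt = cats_prompt(
    context="I'm a nonprofit director writing a grant proposal to a foundation "
            "that funds environmental education programs for urban schools.",
    angle="Act as a critical peer reviewer and identify weaknesses in my argument.",
    task="Give me three ways to make my opening section more compelling.",
    style="Respond as concise bullet points in a formal but approachable tone.",
)
print(prompt)
```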
Context is everything
Besides crafting a clear, effective prompt, you can also focus on managing the surrounding information – that is, on “context engineering”, which refers to everything that surrounds the prompt.
That means thinking about the environment and information the AI has access to: its memory function, instructions leading up to the task, prior conversation history, documents you upload, or examples of what good output looks like.
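A minimal sketch of what that assembly might look like in code, using a generic chat-message format; the field names and structure are illustrative rather than tied to any particular product.

```python
# Illustrative "context engineering": gather instructions, prior turns,
# documents and examples into the message list sent alongside the prompt.
def build_messages(instructions, history, documents, examples, user_prompt):
    messages = [{"role": "system", "content": instructions}]
    messages += history                                    # prior conversation turns
    for doc in documents:                                  # uploaded reference material
        messages.append({"role": "user", "content": f"Reference document:\n{doc}"})
    for ex in examples:                                    # examples of good output
        messages.append({"role": "user", "content": f"Example of the output I want:\n{ex}"})
    messages.append({"role": "user", "content": user_prompt})
    return messages
```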
You should think about prompting as a conversation. If you’re not happy with the first response, push for more, ask for changes, or provide more clarifying information.
Don’t expect the AI to give a ready-made response. Instead, use it to trigger your own thinking. If you feel the AI has produced a lot of good material but you get stuck, copy the best parts into a fresh session and ask it to summarise and continue from there.
Always retain your professional distance and remind yourself that you are the only thinking part in this relationship. And always make sure to check the accuracy of anything an AI produces – errors are increasingly common.
AI systems are remarkably capable, but they need you – and human intelligence – to bridge the gap between their vast generic knowledge and your particular situation. Give them enough context to work with, and they might surprise you with how helpful they can be.