AI Insights
How AI Simulations Match Up to Real Students—and Why It Matters

AI-simulated students consistently outperform real students—and make different kinds of mistakes—in math and reading comprehension, according to a new study.
That could cause problems for teachers, who increasingly use general prompt-based artificial intelligence platforms to save time on daily instructional tasks. Sixty percent of K-12 teachers report using AI in the classroom, according to a June Gallup study, with more than 1 in 4 regularly using the tools to generate quizzes and more than 1 in 5 using AI for tutoring programs. Even when prompted to cater to students of a particular grade or ability level, the findings suggest underlying large language models may create inaccurate portrayals of how real students think and learn.
“We were interested in finding out whether we can actually trust the models when we try to simulate any specific types of students. What we are showing is that the answer is in many cases, no,” said Ekaterina Kochmar, co-author of the study and an assistant professor of natural-language processing at the Mohamed bin Zayed University of Artificial Intelligence in the United Arab Emirates, the first university dedicated entirely to AI research.
How the study tested AI “students”
Kochmar and her colleagues prompted 11 large language models (LLMs), including those underlying generative AI platforms like ChatGPT, Qwen, and SocraticLM, to answer 249 mathematics and 240 reading grade-level questions from the National Assessment of Educational Progress, using the persona of typical students in grades 4, 8, and 12. The researchers then compared the models’ answers to NAEP’s database of real student answers to the same questions to measure how closely the AI-simulated students’ answers mirrored actual student performance.
The LLMs that underlie AI tools do not think but generate the most likely next word in a given context based on massive pools of training data, which might include real test items, state standards, and transcripts of lessons. By and large, Kochmar said, the models are trained to favor correct answers.
“In any context, for any task, [LLMs] are actually much more strongly primed to answer it correctly,” Kochmar said. “That’s why it’s very difficult to force them to answer anything incorrectly. And we’re asking them to not only answer incorrectly but fall in a particular pattern—and then it becomes even harder.”
For example, while a student might miss a math problem because he misunderstood the order of operations, an LLM would have to be specifically prompted to misuse the order of operations.
None of the tested LLMs created simulated students that aligned with real students’ math and reading performance in 4th, 8th, or 12th grades. Without specific grade-level prompts, the proxy students performed significantly higher than real students in both math and reading—scoring, for example, 33 percentile points to 40 percentile points higher than the average real student in reading.
Kochmar also found that simulated students “fail in different ways than humans.” While specifying grades in prompts did make simulated students perform more like real students in how many answers they got correct, the simulated students did not necessarily follow patterns related to particular human misconceptions, such as order of operations in math.
The researchers found no prompt that fully aligned simulated and real student answers across different grades and models.
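The difference between matching a score and matching error patterns can be illustrated with a small sketch. The responses below are made up for illustration, and total variation distance is an assumed stand-in for comparing answer patterns, not the metric the researchers used.

```python
from collections import Counter

def accuracy(answers, correct):
    """Fraction of responses that match the correct option."""
    return sum(a == correct for a in answers) / len(answers)

def option_distribution(answers, options):
    """Share of responses that chose each option."""
    counts = Counter(answers)
    return {opt: counts[opt] / len(answers) for opt in options}

def total_variation(p, q):
    """Total variation distance between two distributions over the same options."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

# Hypothetical responses to one multiple-choice item; "B" is the correct answer.
options = ["A", "B", "C", "D"]
real = ["B"] * 6 + ["C"] * 4                   # real errors cluster on one distractor
simulated = ["B"] * 6 + ["A"] * 2 + ["D"] * 2  # same score, different errors

real_acc = accuracy(real, "B")
sim_acc = accuracy(simulated, "B")
gap = total_variation(
    option_distribution(real, options),
    option_distribution(simulated, options),
)
print(f"real={real_acc:.2f} simulated={sim_acc:.2f} pattern gap={gap:.2f}")
```

In this toy case both groups answer 60% of items correctly, yet the pattern gap is well above zero because the wrong answers land on different options.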
What this means for teachers
For educators, the findings highlight both the potential and the pitfalls of relying on AI-simulated students, underscoring the need for careful use and professional judgment.
“When you think about what a model knows, these models have probably read every book about pedagogy, but that doesn’t mean that they know how to make choices about how to teach,” said Robbie Torney, the senior director of AI programs at Common Sense Media, which studies children and technology.
Torney was not connected to the current study, but last month released a study of AI-based teaching assistants that similarly found alignment problems. AI models produce answers based on their training data, not professional expertise, he said. “That might not be bad per se, but it might also not be a good fit for your learners, for your curriculum, and it might not be a good fit for the type of conceptual knowledge that you’re trying to develop.”
This doesn’t mean teachers shouldn’t use general prompt-based AI to develop tools or tests for their classes, the researchers said, but that educators need to prompt AI carefully and use their own professional judgement when deciding if AI outputs match their students’ needs.
“The great advantage of the current technologies is that it is relatively easy to use, so anyone can access [them],” Kochmar said. “It’s just at this point, I would not trust the models out of the box to mimic students’ actual ability to solve tasks at a specific level.”
Torney said educators need more training to understand not just the basics of how to use AI tools but their underlying infrastructure. “To be able to optimize use of these tools, it’s really important for educators to recognize what they don’t have, so that they can provide some of those things to the models and use their professional judgement.”
We’re Entering a New Phase of AI in Schools. How Are States Responding?

Artificial intelligence topped the list of state technology officials’ priorities for the first time, according to an annual survey released by the State Educational Technology Directors’ Association on Wednesday.
More than a quarter of respondents—26%—listed AI as their most pressing issue, compared to 18% in a similar survey conducted by SETDA last year. AI supplanted cybersecurity, which state leaders previously identified as their No. 1 concern.
About 1 in 5 state technology officials—21%—named cybersecurity as their highest priority, and 18% identified professional development and technology support for instruction as their top issues.
Forty percent of respondents reported that their state had issued guidance on AI. That’s a considerable increase from just two years ago, when only 2% of respondents to the same survey reported their state had released AI guidance.
State officials’ heightened attention on AI suggests that even though many more states have released some sort of AI guidance in the past year or two, officials still see a lot left on their to-do lists when it comes to supporting districts in improving students’ AI literacy, offering professional development about AI for educators, and crafting policies around cheating and proper AI use.
“A lot of guidance has come out, but now the rubber’s hitting the road in terms of implementation and integration,” said Julia Fallon, SETDA’s executive director, in an interview.
SETDA, along with Whiteboard Advisors, surveyed state education leaders—including ed-tech directors, chief information officers, and state chiefs—receiving more than 75 responses across 47 states. It conducted interviews with state ed-tech teams in Alabama, Delaware, Nebraska, and Utah and did group interviews with ed-tech leaders from 14 states.
AI professional development is a rising priority
States are taking a myriad of approaches to responding to the AI challenge, the report noted.
Some states—such as North Carolina and Utah—designated an AI point person to help support districts in puzzling through the technology. For instance, Matt Winters, who leads Utah’s work, has helped negotiate statewide pricing for AI-powered ed-tech tools and worked with an outside organization to train 4,500 teachers on AI, according to the report.
Wyoming, meanwhile, has developed an “innovator” network that pays teachers to offer AI professional development to colleagues across the state. Washington hosted two statewide AI summits to help district and school leaders explore the technology.
And North Carolina and Virginia have used state-level competitive grant programs to support activities such as AI-specific professional development or AI-infused teaching and learning initiatives.
“As AI continues to evolve, developing connections with those in tech, in industry, and in commerce, as well as with other educators, will become more important than ever,” wrote Sydnee Dickson, formerly Utah’s state superintendent of public instruction, in an introduction to the report. “The technology is advancing too quickly for any one person or state to have all the answers.”
General-purpose LLMs can be used to track true critical findings

General-purpose large language models (LLMs), such as GPT-4, can be adapted to detect and categorize multiple critical findings within individual radiology reports, using minimal data annotation, researchers have reported.
A team led by Ish Talati, MD, of Stanford University, with colleagues from the Arizona Advanced AI and Innovation (A3I) Hub and Mayo Clinic Arizona, retrospectively evaluated two “out-of-the-box” LLMs, GPT-4 and Mistral-7B, to see how well they classify findings that indicate a medical emergency or require immediate action, among others. Their results were published on September 10 in the American Journal of Roentgenology.
Timely critical findings communication can be challenging due to the increasing complexity and volume of radiology reports, the authors noted. “Workflow pressures highlight the need for automated tools to assist in critical findings’ systematic identification and categorization,” they said.
The study demonstrated that few-shot prompting, incorporating a small number of examples for model guidance, can aid general-purpose LLMs in adapting to the medical task of complex categorization of findings into distinct actionable categories.
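Few-shot prompting can be sketched as plain string assembly: an instruction, a handful of labeled examples, and then the new report. The prompt wording, the example reports, and the function name below are all hypothetical; the study's actual prompts are not reproduced in this article.

```python
def build_few_shot_prompt(examples, report):
    """Join an instruction, labeled examples, and the new report into one prompt."""
    parts = [
        "Classify each critical finding in the radiology report as "
        "true, known/expected, or equivocal."
    ]
    for ex in examples:
        parts.append(f"Report:\n{ex['report']}\nFindings:\n{ex['findings']}")
    # The final block leaves "Findings:" open for the model to complete.
    parts.append(f"Report:\n{report}\nFindings:")
    return "\n\n".join(parts)

# Made-up examples, written only to show the shape of the input.
examples = [
    {
        "report": "New right pneumothorax compared with the prior study.",
        "findings": "true: pneumothorax",
    },
    {
        "report": "Known subdural hematoma, unchanged from the prior study.",
        "findings": "known/expected: subdural hematoma",
    },
]

prompt = build_few_shot_prompt(
    examples, "Possible small pulmonary embolus; study limited by motion."
)
print(prompt)
```

The resulting string would be sent to whichever model is being evaluated; no model call is shown here because the API used depends on the deployment.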
To that end, Talati and colleagues evaluated GPT-4 and Mistral-7B on more than 400 radiology reports selected from the MIMIC-III database of deidentified health data from patients in the intensive care unit (ICU) at Beth Israel Deaconess Medical Center from 2001 to 2012.
Analysis included 252 radiology reports of varying modalities (56% CT, ~30% radiography, 9% MRI, for example) and anatomic regions (mostly chest, pelvis, and head).
The reports were divided into a prompt engineering tuning set of 50, a holdout test set of 125, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set consisted of 180 chest x-ray reports extracted from the CheXpert Plus database.
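The three-way division of the 252 reports can be sketched as a seeded shuffle followed by slicing. Random assignment is an assumption here, since the article does not say how reports were allocated, and the split names are hypothetical.

```python
import random

# Sizes taken from the article: 50 + 125 + 77 = 252 reports.
SIZES = {"tuning": 50, "holdout_test": 125, "few_shot_pool": 77}

def split_reports(report_ids, sizes, seed=0):
    """Shuffle report IDs reproducibly and cut them into named, disjoint splits."""
    ids = list(report_ids)
    if len(ids) != sum(sizes.values()):
        raise ValueError("split sizes must add up to the number of reports")
    random.Random(seed).shuffle(ids)
    splits, start = {}, 0
    for name, n in sizes.items():
        splits[name] = ids[start:start + n]
        start += n
    return splits

splits = split_reports(range(252), SIZES)
print({name: len(ids) for name, ids in splits.items()})
```

Keeping the few-shot pool separate from the holdout set means no report used as a prompt example is also used to score the model.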
Manual reviews of the reports, conducted separately by a board-certified radiologist and with software, classified the findings by consensus into one of three categories:
- True critical finding (new, worsening, or increasing in severity since prior imaging)
- Known/expected critical finding (a critical finding that is known and unchanged, improving, or decreasing in severity since prior imaging)
- Equivocal critical finding (an observation that is suspicious for a critical finding but that is not definitively present based on the report)
The models analyzed the submitted report and provided structured output containing multiple fields, listing model-identified critical findings within each of the three categories, according to the group. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall) in the three categories.
Precision and recall comparison for LLMs tracking true critical findings

| Metric | Type of test set and classification | GPT-4 | Mistral-7B |
|---|---|---|---|
| Precision | Holdout test set, true critical findings | 90.1% | 75.6% |
| Precision | Holdout test set, known/expected critical findings | 80.9% | 34.1% |
| Precision | Holdout test set, equivocal critical findings | 80.5% | 41.3% |
| Precision | External test set, true critical findings | 82.6% | 75% |
| Precision | External test set, known/expected critical findings | 76.9% | 33.3% |
| Precision | External test set, equivocal critical findings | 70.8% | 34% |
| Recall | Holdout test set, true critical findings | 86.9% | 77.4% |
| Recall | Holdout test set, known/expected critical findings | 85% | 70% |
| Recall | Holdout test set, equivocal critical findings | 94.3% | 74.3% |
| Recall | External test set, true critical findings | 98.3% | 93.1% |
| Recall | External test set, known/expected critical findings | 71.4% | 92.9% |
| Recall | External test set, equivocal critical findings | 85% | 80% |
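For readers less familiar with the two metrics in the table above: precision is the share of model-flagged findings that were correct, and recall is the share of reference findings the model caught. A minimal sketch, using made-up counts rather than the study's data:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for one category: 45 findings flagged correctly,
# 5 flagged wrongly, and 15 reference findings missed.
p, r = precision_recall(tp=45, fp=5, fn=15)
print(p, r)  # 0.9 0.75
```

A model can score high on one metric and low on the other, as in the table's Mistral-7B rows where recall is much higher than precision for known/expected findings.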
“GPT-4, when optimized with just a small number of in-context examples, may offer new capabilities compared to prior approaches in terms of nuanced context-dependent classifications,” Talati and colleagues wrote. “This capability is crucial in radiology, where identification of findings warranting referring clinician alerts requires differentiation of whether the finding is new or already known.”
Though promising, the approach needs further refinement and technical development before clinical implementation or other real-world use, the group noted. The group also highlighted a role for electronic health record (EHR) integration to inform more nuanced categorization in future implementations.