Last month, educational psychologist Ronald Beghetto asked a group of graduate students and teaching professionals to discuss their work in an unusual way. As well as talking to each other, they conversed with a collection of creativity-focused chatbots that Beghetto had designed and that will soon be hosted on a platform run by his institution, Arizona State University (ASU).
The bots are based on the same artificial-intelligence (AI) technology that powers the famous and conversationally fluent ChatGPT. Beghetto prompts the bots to take on various personas to encourage creativity — for example, by deliberately challenging someone’s assumptions. One student discussed various dissertation topics with the chatbots. Lecturers talked about how to design classes.
The feedback was overwhelmingly positive. One participant said that they had previously tried to use ChatGPT to support learning but had not found it useful — unlike Beghetto’s chatbots. Another asked: “When are these things going to be available?” The bots helped participants to generate more possibilities than they would have considered otherwise.
Many educators fear that the rise of ChatGPT will make it easier for students to cheat on assignments. Yet Beghetto, who is based in Tempe, and others are exploring the potential of large language models (LLMs), such as ChatGPT, as tools to enhance education.
Using LLMs to read and summarize large stretches of text could save students and teachers time, helping them to focus instead on discussion and learning. ChatGPT’s ability to lucidly discuss nearly any topic raises the prospect of using LLMs to create a personalized, conversational educational experience. Some educators see them as potential ‘thought partners’ that might cost less than a human tutor and — unlike people — are always available.
“One-on-one tutoring is the single most effective intervention for teaching, but it’s very expensive and not scalable,” says Theodore Gray, co-founder of Wolfram Research, a technology company in Champaign, Illinois. “People have tried software, and it generally doesn’t work very well. There’s now a real possibility that one could make educational software that works.” Gray told Nature that Wolfram Research is currently working on an LLM-based tutor but gave few details.
Such AI partners could be used to lead students through a problem step by step, stimulate critical thinking or — as in the case of Beghetto’s experiment — enhance users’ creativity and broaden the possibilities being considered. Jules White, director of the Initiative on the Future of Learning and Generative AI at Vanderbilt University in Nashville, Tennessee, calls ChatGPT “an exoskeleton for the mind”.
The risks are real
Since the California firm OpenAI launched ChatGPT in November 2022, much of the attention regarding its use in education has been negative. LLMs work by learning how words and phrases relate to each other from training data containing billions of examples. In response to user prompts, they then produce text that can include the answer to an assignment question or even a whole essay.
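To make that mechanism concrete, here is a toy sketch in Python. Real LLMs use transformer neural networks with billions of parameters; this tiny bigram model, with an invented practice corpus, only illustrates the underlying idea of generating text by repeatedly predicting a likely next word.

```python
import random
from collections import Counter, defaultdict

# A miniature corpus standing in for the billions of examples a real LLM sees.
corpus = ("the cell membrane controls what enters the cell "
          "the cell nucleus stores the genetic material of the cell").split()

# Count how often each word follows each other word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word(prev: str) -> str:
    """Sample a plausible next word, weighted by how often it followed 'prev'."""
    candidates = follows[prev]
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights)[0]

# Generate a short continuation one predicted word at a time, which is,
# in miniature, what an LLM does when answering a prompt.
word, output = "the", ["the"]
for _ in range(8):
    word = next_word(word)
    output.append(word)
print(" ".join(output))
```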
Unlike previous AI systems, ChatGPT’s answers are often well written and seemingly well researched. This raises concerns that students will simply be able to get ChatGPT to do their homework for them, or at least that they might become reliant on a chatbot to get quick answers, without understanding the rationale.
ChatGPT could also lead students astray. Despite excelling in a host of business, legal and academic exams1, the bot is notoriously brittle, getting things wrong if a question is phrased slightly differently, and it even makes things up, an issue known as hallucination.
Wei Wang, a computer scientist at the University of California, Los Angeles, found that GPT-3.5 — which powers the free version of ChatGPT — and its successor, GPT-4, got a lot wrong when tested on questions in physics, chemistry, computer science and mathematics taken from university-level textbooks and exams2. Wang and her colleagues experimented with different ways to query the two GPT bots. They found that the best method used GPT-4, which could then answer around one-third of the textbook questions correctly, although it scored 80% in one exam.
Privacy is another hurdle: students might be put off working regularly with LLMs once they realize that everything they type into them is being stored by OpenAI and might be used to train the models.
Embracing LLMs
But despite the challenges, some researchers, educators and companies see huge potential in ChatGPT and its underlying LLM technology. Like Beghetto and Wolfram Research, they are now experimenting with how best to use LLMs in education. Some use alternatives to ChatGPT, some find ways to minimize inaccuracies and hallucinations, and some improve the LLMs’ subject-specific knowledge.
“Are there positive uses?” asks Collin Lynch, a computer scientist at North Carolina State University in Raleigh who specializes in educational systems. “Absolutely. Are there risks? There are huge risks and concerns. But I think there are ways to mitigate those.”
Society needs to help students to understand LLMs’ strengths and risks, rather than just forbidding them to use the technology, says Sobhi Tawil, director of the future of learning and innovation at UNESCO, the United Nations’ agency for education, in Paris. In September, UNESCO published a report entitled Guidance for Generative AI in Education and Research. One of its key recommendations is that educational institutions validate tools such as ChatGPT before using them to support learning3.
Companies are marketing commercial assistants, such as MagicSchool and Eduaide, that are based on OpenAI’s LLM technology and help schoolteachers to plan lesson activities and assess students’ work. Academics have produced other tools, such as PyrEval4, created by computer scientist Rebecca Passonneau’s team at Pennsylvania State University in State College, to read essays and extract the key ideas.
With help from educational psychologist Sadhana Puntambekar at the University of Wisconsin–Madison, PyrEval has scored physics essays5 written during science classes by around 2,000 middle-school students a year for the past three years. The essays are not given conventional grades, but PyrEval enables teachers to quickly check whether assignments include key themes and to provide feedback during the class itself, something that would otherwise be impossible, says Puntambekar.
PyrEval’s scores also help students to reflect on their work: if the AI doesn’t detect a theme that the student thought they had included, it could indicate that the idea needs to be explained more clearly or that they made small conceptual or grammatical errors, she says. The team is now asking ChatGPT and other LLMs to do the same task and is comparing the results.
Introducing the AI tutor
Other organizations use AI to help students directly. That’s the approach of perhaps the most widely used LLM-based education tool other than ChatGPT itself: the AI tutor and teaching assistant Khanmigo. The tool is the result of a partnership between OpenAI and the education non-profit organization Khan Academy in Mountain View, California. Using GPT-4, Khanmigo offers students tips as they work through an exercise, saving teachers time.
Khanmigo works differently from ChatGPT. It appears as a pop-up chatbot on a student’s computer screen. Students can discuss the problem that they are working on with it. The tool automatically adds a prompt before it sends the student’s query to GPT-4, instructing the bot not to give away answers and instead to ask lots of questions.
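Khan Academy has not published Khanmigo’s exact instructions, but the wrapping technique itself is straightforward. A minimal sketch using OpenAI’s Python library might look like this, with a hypothetical stand-in for the tutoring prompt:

```python
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

# Hypothetical tutoring instructions; Khanmigo's real prompt is not public.
TUTOR_PROMPT = (
    "You are a patient tutor. Never state the final answer. "
    "Instead, ask one guiding question at a time that helps the "
    "student work out the next step themselves."
)

def tutor_reply(student_message: str) -> str:
    """Prepend the tutoring prompt to the student's query before calling the model."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": TUTOR_PROMPT},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content

print(tutor_reply("How do I solve 3x + 5 = 20?"))
```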
Kristen DiCerbo, the academy’s chief learning officer, calls this process a “productive struggle”. But she acknowledges that Khanmigo is still in a pilot phase and that there is a fine line between a question that aids learning and one that’s so difficult that it makes students give up. “The trick is to figure out where that line is,” she says.
Khanmigo was first introduced in March, and more than 28,000 US teachers and 11–18-year-old students are piloting the AI assistant this school year, according to Khan Academy. Users include private subscribers as well as more than 30 school districts. Individuals pay US$99 a year to cover the computing costs of LLMs, and school districts pay $60 a year per student for access. To protect student privacy, OpenAI has agreed not to use Khanmigo data for training.
But whether Khanmigo can truly revolutionize education is still unclear. LLMs are trained to predict the next most likely word in a sentence, not to check facts. They therefore sometimes get things wrong. To improve its accuracy, the prompt that Khanmigo sends to GPT-4 now includes the right answers for guidance, says DiCerbo. It still makes mistakes, however, and Khan Academy asks users to let the organization know when it does.
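In code terms, that guidance amounts to extending the system prompt from the sketch above with the known answer (an assumption about the format; Khan Academy has not published the details):

```python
# Extending the earlier sketch: the known answer is added to the system prompt
# so the model can check the student's reasoning. The wording is hypothetical.
def grounded_tutor_prompt(problem: str, correct_answer: str) -> str:
    return (
        "You are a patient tutor. The correct answer to the problem "
        f"'{problem}' is '{correct_answer}'. Use it to check the student's "
        "reasoning, but never reveal it; respond with guiding questions instead."
    )

system_prompt = grounded_tutor_prompt("3x + 5 = 20", "x = 5")
```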
Lynch says Khanmigo seems to be doing well. But he cautions: “I haven’t seen a clear validation yet.”
More generally, Lynch stresses that it’s crucial that any chatbot used in education is carefully checked for tone as well as accuracy — and that it does not insult or belittle students, or make them feel lost. “Emotion is key to learning. You can legitimately destroy somebody’s interest in learning by helping them in a bad way,” Lynch says.
DiCerbo notes that Khanmigo responds differently to each student in each situation, which she hopes makes the bot more engaging than previous tutoring systems. Khan Academy expects to share its research on Khanmigo’s efficacy in late 2024 or early 2025.
Other tutoring companies are offering LLMs as assistants for students or are experimenting with them. The education technology firm Chegg in Santa Clara, California, launched an assistant based on GPT-4 in April. And TAL Education Group, a Chinese tutoring company based in Beijing, has created an LLM called MathGPT that it claims is more accurate than GPT-4 at answering maths-specific questions. MathGPT also aims to help students by explaining how to solve problems.
Augmenting retrieval
Another approach to creating an AI learning partner integrates the LLM with external, focused corpora of knowledge — such as a textbook or a set of scientific papers — that have been rigorously verified. The goal of this retrieval-augmented generation (RAG) method is to sidestep the impossibility of verifying the billions of sources of text that give an LLM its conversational power.
AI company Merlyn Mind in New York City is using RAG in its open-source Corpus-qa LLM, which is aimed at education. Like ChatGPT, Merlyn Mind’s LLM is initially trained on a big body of text not related to education specifically — this gives it its conversational ability.
But unlike ChatGPT, when the LLM answers a query, it does not rely just on what it has learnt in its training. Instead, it also refers to a specific corpus of information, which minimizes hallucinations and other errors, says Satya Nitta, chief executive of the company. Merlyn Mind also fine-tunes its LLMs to “confess” when they don’t have a high-quality response and to work on producing a better answer, thereby resisting hallucination in many cases, says Nitta.
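In outline, RAG works by retrieving the passages most relevant to a query and handing only those to the model as context. The sketch below illustrates the idea; production systems score relevance with learned embeddings and vector databases rather than the crude word overlap used here, and the file names and passages are invented for the example.

```python
import string

# Invented course materials standing in for a verified corpus.
CORPUS = {
    "notes_ch3.pdf": "Dipole-dipole interactions occur between polar molecules: "
                     "the positive end of one molecule attracts the negative "
                     "end of another.",
    "notes_ch1.pdf": "Atoms consist of a nucleus of protons and neutrons "
                     "surrounded by electrons.",
}

def words(text: str) -> set[str]:
    """Lower-case the text and strip punctuation to get a crude set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    """Return the k passages sharing the most words with the query."""
    ranked = sorted(CORPUS.items(),
                    key=lambda item: len(words(query) & words(item[1])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Ground the model in the retrieved passages and ask it to cite them."""
    context = "\n".join(f"[{name}] {text}" for name, text in retrieve(query))
    return ("Answer using only the sources below, citing them by name.\n"
            f"{context}\n\nQuestion: {query}")

# The assembled prompt would then be sent to an LLM such as GPT-3.5.
print(build_prompt("What are dipole-dipole interactions?"))
```

Because each retrieved passage carries a source label, the assembled prompt also allows the model to cite where its answer came from.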
RAG is also being used by ASU, which is one of the most progressive universities for LLM adoption, says Claire Zau, vice-president of GSV Ventures, an investor in educational-technology companies in New York City. After an initial narrow release for testing, ASU launched a toolbox in October that enables its faculty members to experiment with LLMs in education through a web interface. This includes access to six LLMs, including GPT-3.5, GPT-4 and Google’s Bard, as well as RAG capabilities.
The tools will allow more researchers, such as Beghetto, to construct chatbots for their students to interact with. After his initial workshop, Beghetto plans to use the bots in a course that he is developing. ASU hosts secure versions of the LLMs in its private cloud to minimize privacy concerns, says Elizabeth Reilley, ASU’s executive director of AI acceleration, who is based in Phoenix.
Reilley says that the bots are already having a positive impact on education at ASU. For example, she says, a bot created for use in ASU’s introductory chemistry course uses RAG to combine GPT-3.5 with PDF and PowerPoint course materials. She gives an example of a test that imagined a baseball-loving student asking the LLM for an explanation of dipole–dipole interactions in molecules based on that sport. The response was an accurate explanation, she says, that wove in “a baseball metaphor to make that a little bit more meaningful”.
Using a general LLM combined with RAG differs from previous machine-learning approaches, which sought to train an AI system to simulate a science expert, says Danielle McNamara, executive director of ASU’s learning engineering institute in Tempe. Those tools lacked the generalized capabilities that could help students, such as the capacity to incorporate baseball into chemistry concepts. McNamara and her colleagues now plan to study the effectiveness of the chatbots and LLM tools that ASU uses.
Other institutions are also embracing LLMs. Vanderbilt University has given students on certain courses access to a paid version of ChatGPT, along with specialized plug-in tools. Researchers at East China Normal University in Shanghai have created a dedicated educational LLM called EduChat that combines essay assessment, dialogue-based tutoring and emotional support in one chatbot6. The team has shared the tool as open-source code. Although EduChat is still at an early stage, it is notable for being built specifically for education rather than adapted from an existing, general-purpose model, such as ChatGPT or Bard.
Will it catch on?
An important question around the use of AI in education is who will have access to it, and whether paid services such as Khanmigo will exacerbate existing inequalities in educational resources. DiCerbo says Khan Academy is now looking for philanthropists and grants to help to pay for computing power and to provide access for under-resourced schools, having prioritized such schools in the pilot phase. “We are working to make sure that digital divide doesn’t happen,” she says.
Another challenge is how to ensure that the information LLMs provide is not biased, and that the models consider knowledge and viewpoints from under-represented groups. Such information is absent from much of the text that LLMs are trained on. Sean Dudley, ASU’s associate vice-president for research technology, based in Tempe, says that RAG allows ASU’s LLM platform to provide users with the sources of its answers. This doesn’t remove the problem of bias, but he hopes that it will at least provide transparency and a chance for students to critically consider where the information has come from. “Part of our mission is asking who’s been left out,” Dudley says.
Whether LLMs’ promise for education will ultimately outweigh the risks is still not clear. Lynch accepts that they are powerful tools, but also seeks to keep their shortcomings in focus. “It isn’t like overnight we’ve learnt to fly,” he says.
He likens the attention that they’re attracting to that previously lavished on massive open online courses and on educational uses of the 3D virtual worlds known as the metaverse. Neither has the transformative power that some once predicted, but both have their uses. “In a sense, this is going to be the same. It’s not bad. It’s not perfect. It’s not everything. It’s a new thing,” he says.
Tawil, who has worked in education at UNESCO for more than two decades, says that understanding AI’s limitations is crucial. At the same time, LLMs are now so bound up in human endeavours that he says it is essential to rethink how to teach and assess learning. “It’s redefining what makes us human, what is unique about our intelligence.”