New Jersey uses AI to score standardized written tests

(TNS) — Artificial intelligence will be used to grade most essays written by New Jersey students on a new statewide standardized test scheduled to be introduced this spring, state education officials announced.

The AI system would be used to score student essays and short answers on the English language arts section of the statewide exam, according to the state-approved test proposal. The system will be trained using scores generated by human graders on mock exams administered to students in October and November.

New Jersey will debut a new state test this spring called the New Jersey Student Learning Assessment – Adaptive. The test will be given to students in grades 3 through 10 to measure their knowledge of English, math, and science.

A new version of the state’s high school exit exam, now called the New Jersey Graduation Proficiency Assessment – Adaptive, will also be administered to high school seniors.

Like the previous version of the test, known as NJSLA, the exam is administered by computer. However, the new version will be “adaptive,” meaning students will be asked different questions based on their previous answers on the exam. The approach is believed to make test scoring more accurate.

State Department of Education spokesman Michael Yaple said AI systems will be used to grade essays and written questions, but there will still be human graders.

If a student’s written response is identified as “anomalous” or “borderline,” it “will be flagged for human review,” Yaple said.

“The system performs periodic quality assurance checks to ensure that the scores assigned by the automated scoring engine match human scores through rigorous quality controls,” he added.

Yaple said Cambium, the company overseeing the new test, is not using generative AI, the type of artificial intelligence behind ChatGPT-style platforms, which can create new content and is known for sometimes “hallucinating” false or inaccurate information.

Instead, the automated grading system “has rigorous parameters with proven consistency, and human grading will continue to be the foundation of the process, validating accuracy at multiple checkpoints throughout the grading workflow,” state education officials said in a statement.

Computer-based scoring of state tests in New Jersey is nothing new. Last year, about 90 percent of student essays on the NJSLA and state high school exit exams were graded solely by automated scoring systems, Yaple said.

But some education officials are concerned about the widespread use of AI to score the new version of the test, which will eventually be taken by nearly all of New Jersey’s 1.3 million public school students.

Steve Beatty, president of the New Jersey Education Association, the state’s largest teachers’ union, says using versions of AI to grade student writing is dangerous.

He said he didn’t want to see “some students fail a computer-scored test and then find out there was some kind of mistake.”

NJEA opposes high-stakes testing in general, Beatty said. However, if the tests are to continue, “we would like them to be scored by trained educators, i.e. humans.”

If a student fails the AI-graded section of the exam, there should be a plan for a human to re-evaluate the writing, he said.

“They should go back and check on themselves,” Beatty said.

New test contract

New Jersey students will begin taking the new NJSLA-Adaptive exam during a month-long testing period from April 27 to May 29. Exams are typically conducted over several consecutive days.

According to the state Department of Education’s testing schedule, the testing period for the new NJGPA-Adaptive High School Exit Exam for high school seniors will run from March 16 to April 1.

The new statewide NJSLA and NJGPA tests were developed by Cambium Assessment, a company that won a $58.7 million, two-year contract with the state.

Under Cambium’s proposal, Measurement Incorporated, a company based in Durham, North Carolina, would be responsible for providing and training the staff who perform “hand scoring” when AI-generated scores for essays and written responses are reviewed.

In its proposal to the state, Cambium said the company envisions “25% of all responses being routed to trained hand scoring.”

New Jersey officials said no AI was used to create test items for the new version of the test, and no artificial intelligence will be used to determine which questions students see on adaptive assessments.

Jeffrey Hauger, director of assessment for the state Department of Education from 2010 to 2018, said New Jersey has a long history of using computers to grade the written portion of state exams. He later worked as a consultant for Pearson, which had previously been contracted to provide the state’s NJSLA tests.

Around 2016, Hauger said, the state began implementing a system that uses one human and one automated grader to evaluate each student’s writing.

If a significant difference is found between the two scores, the essay will be read by a second person, he said.

“It was an efficiency tool, but at the time there was always a human involved throughout the process,” Hauger said.

AI scoring is becoming more sophisticated, he said.

“Technology has advanced, so now it’s not as big a leap as people think,” Hauger said.

During Gov. Phil Murphy’s tenure, the department began relying more on automated grading and moving away from having both machines and humans evaluate each passage.

Signs of a problem

AI scoring has been controversial in other states as well.

Last year, Massachusetts blamed AI-based scoring errors for 1,400 incorrect scores on the state’s Massachusetts Comprehensive Assessment System, known as MCAS.

In Texas, several school districts have questioned whether AI-based scoring on recent statewide tests is fair.

Over the past two years, the Dallas Independent School District has challenged thousands of AI-generated essay scores on STAAR, the statewide standardized test in Texas.

Cambium and Pearson, the companies involved in New Jersey’s testing, both contributed to Texas’ standardized testing system.

In 2024, the Dallas school district sent 4,600 tests back to the state to be rescored by humans.

About 44% of rescored tests returned higher scores after being read by humans, said Jacob Cortez, Dallas’ assistant superintendent for assessment and evaluation.

The district also submitted thousands of AI-scored tests for rescoring last year, and nearly 40% came back with higher scores after being read by humans, the district said.

Scoring accuracy on AI-scored third-grade tests was the biggest problem, with 85 percent of the tests sent back receiving higher scores when humans read the students’ work.

“That’s not right,” Cortez said.

The Dallas school district, which serves about 139,000 students, limited the number of tests it sent back for rescoring because it had to pay $50 for each test whose score did not improve, local officials said.

Cambium officials did not respond to requests for comment about Dallas’ accuracy issues or the company’s AI scoring methodology.

New Jersey officials declined to comment on questions about the accuracy of AI scoring in other states.

“New Jersey cannot comment on another state’s evaluation and scoring process,” Yaple said.

New Jersey’s new education commissioner, Lily Roe, also did not respond to a request for comment. According to her LinkedIn profile, her previous job was as deputy commissioner for school programs for the state of Texas, where she helped design the state’s standardized testing system.

Scott Marion, principal learning associate at the Center for Assessment, a nonprofit, nonpartisan consulting firm, said the problems with Dallas’ AI scoring raise questions about the system.

“Isn’t there enough training? Aren’t we training enough diverse populations?” Marion asked.

He said that while AI scoring makes financial sense, states need to be careful not to rely too heavily on it. He said he is used to about 80 percent of writing being scored by AI, with the system still requiring human backup.

“We’ve been doing this for a long time,” he said, referring to the use of AI to grade student writing.

Education advocates say many students, teachers and parents may be surprised to learn how much of their school writing is already graded by AI.

“A lot of parents don’t understand that this is an issue,” said Julie Borst, executive director of community organizing for the statewide advocacy group Save Our Schools New Jersey.

She worries that students with unique writing styles may receive lower scores on tests because the AI is looking for specific words or phrases, or a standard number of sentences to earn a high score.

Borst’s organization has long opposed high-stakes standardized testing, and she said it is ultimately up to teachers to know where their students are doing well and where they are struggling.

“Teachers are going to know where their weaknesses are. They’re going to know where their strengths are,” she said. “At the student level, you can’t tell that from a standardized test.”
