The competition to create more powerful artificial intelligence applications has also created a great demand in China for high-quality training data.
Scott Detrow, host:
The competition to create more powerful artificial intelligence applications has created a great demand for high-quality training data and a contest over who gets to use that data, as NPR's Emily Feng and Aowen Cao report.
(Mouse click sound bite)
Emily Feng, byline: In this brand-spanking-new office building in northeast China, rows and rows of people quietly click away on computer screens. This is the fuel that powers much of generative AI: raw data. And this data processing center is the brainchild of this man.
Henry Chen: My name is Henry, Henry Chen.
Feng: He is the founder of Sapien AI. It hires people from all over the world to collect, tag and organize data that can be used to train a variety of artificial intelligence applications. China is a big market.
Chen: Especially after DeepSeek came out.
Feng: DeepSeek is a Chinese chatbot that performs on par with American-made chatbots but was trained at a fraction of the cost. The demand for that data is why Chen's company has about 60 employees in China labeling street maps today. This data is used to train autonomous driving programs.
Aowen Cao, byline: It looks very abstract.
Feng: That's NPR producer Aowen Cao.
Cao: I'm watching someone work in front of a computer, but on the computer screen, it's just squares on a black background.
Feng: Squares and green dots. It looks most like, Aowen laughs, the TV show “Severance.” The data may seem abstract, but it is a valuable product, says Rogier Creemers. He is a professor at Leiden University in the Netherlands and studies China's digital technology policy.
Rogier Creemers: They believe that data is an economic input, and in that sense, they see it as akin to a raw material.
Feng: Chatbots like ChatGPT require literally trillions of data points, and who owns that data is a matter of competition between businesses and between countries such as the US and China. Each wants an edge over the other in AI, which means hoarding data. Data is such a chokepoint that, since last year, China's cyberspace regulators have had to approve mass exports of data abroad.
Chen: For AI models trained here, the data must be processed domestically and cannot leave the country.
Feng: The race to create and protect data is on because the data AI needs is becoming more complicated. Olga Megorskaya is the founder of Toloka, a data processing company registered in Amsterdam that currently specializes in creating datasets in highly technical science and engineering fields. She uses an analogy that compares early AI models to human infants.
Olga Megorskaya: Say that person is two years old. He or she is taught from a children's book with very bright pictures.
Feng: And the more advanced AI models are like college students.
Megorskaya: When she goes to college, there are many textbooks that she needs to read.
Feng: For AI models, that means gobbling up increasingly advanced datasets. The data industry is so important that local Chinese governments, once reliant on dying industries such as steel manufacturing and coal mining, are actively courting AI data processing companies. Here again is Creemers of Leiden University.
Creemers: China wants to make a lot of money by developing the industries of the future.
Feng: The Rust Belt city of Shenyang, where Sapien AI chose to locate one of its offices, is one of seven Chinese cities that say they want to become AI data hubs. The city offers low-interest loans and flexible, affordable office space. Here again is Chen of Sapien AI, which benefited from this help.
Chen: So they give us a lot of help, so we find a really good environment to set up an office here.
Feng: Data processing also employs many young people. China's economy has not fully recovered from the global coronavirus pandemic, and youth unemployment so concerned policymakers that the country temporarily suspended the release of its statistics.
(Mouse click sound bite)
Feng: One of the young people working at Sapien AI is 21-year-old Huang Rui. She is a data quality specialist.
Huang Rui: (Speaking a language other than English).
Feng: She says data processing work suits people with strong obsessive tendencies because it requires a high level of attention to detail. Data processing certainly isn't the most exciting job, says her boss, Chen.
Chen: Imagine yourself sitting at your desk and drawing bounding boxes around cars for 40 hours a week.
Feng: But innovation actually requires a lot of people to do boring jobs. Emily Feng, NPR News.
Copyright © 2025 NPR. Unauthorized reproduction is prohibited. For more information, please see the Terms of Use and Permissions pages at www.npr.org.
The accuracy and availability of NPR transcripts may vary. Transcript text may be revised to correct errors or to match updates to the audio. Audio on npr.org may be edited after the original broadcast or publication. The authoritative record of NPR programming is the audio record.
