
As of Wednesday, at least eleven IP addresses have actively tried to exploit the vulnerability, with thousands more addresses possibly doing reconnaissance work.


As of Wednesday, at least eleven IP addresses have actively tried to exploit the vulnerability, with thousands more addresses possibly doing reconnaissance work.
Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions. The study that details the breach was published on arXiv earlier this month.
The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”
The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers did not have time to validate the documents or were unable to because of issues like image clarity.)
A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use as well.
CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.
While commercial models often do not disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool researchers did not respond to emailed questions.
And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are]many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those would duplicate similar privacy risks.
“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found).
Indeed, the curators of DataComp CommonPool were themselves aware it was likely that PII would appear in the data set and did take some measures to preserve privacy, including automatically detecting and blurring faces. But in their limited data set, Hong’s team found and validated over 800 faces that the algorithm had missed, and they estimated that overall, the algorithm had missed 102 million faces in the entire data set. On the other hand, they did not apply filters that could have recognized known PII character strings, like emails or Social Security numbers.
“Filtering is extremely hard to do well,” says Agnew. “They would have had to make very significant advancements in PII detection and removal that they haven’t made public to be able to effectively filter this.”

There are other privacy issues that the face blurring doesn’t address. While the blurring filter is automatically applied, it is optional and can be removed. Additionally, the captions that often accompany the photos, as well as the photos’ metadata, often contain even more personal information, such as names and exact locations.
Another privacy mitigation measure comes from Hugging Face, a platform that distributes training data sets and hosts CommonPool, which integrates with a tool that theoretically allows people to search for and remove their own information from a data set. But as the researchers note in their paper, this would require people to know that their data is there to start with. When asked for comment, Florent Daudens of Hugging Face said that “maximizing the privacy of data subjects across the AI ecosystem takes a multilayered approach, which includes but is not limited to the widget mentioned,” and that the platform is “working with our community of users to move the needle in a more privacy-grounded direction.”
In any case, just getting your data removed from one data set probably isn’t enough. “Even if someone finds out their data was used in a training data sets and … exercises their right to deletion, technically the law is unclear about what that means,” says Tiffany Li, an associate professor of law at the University of San Francisco School of Law. “If the organization only deletes data from the training data sets—but does not delete or retrain the already trained model—then the harm will nonetheless be done.”
The bottom line, says Agnew, is that “if you web-scrape, you’re going to have private data in there. Even if you filter, you’re still going to have private data in there, just because of the scale of this. And that’s something that we [machine-learning researchers], as a field, really need to grapple with.”
CommonPool was built on web data scraped between 2014 and 2022, meaning that many of the images likely date to before 2020, when ChatGPT was released. So even if it’s theoretically possible that some people consented to having their information publicly available to anyone on the web, they could not have consented to having their data used to train large AI models that did not yet exist.
And with web scrapers often scraping data from each other, an image that was originally uploaded by the owner to one specific location would often find its way into other image repositories. “I might upload something onto the internet, and then … a year or so later, [I] want to take it down, but then that [removal] doesn’t necessarily do anything anymore,” says Agnew.
The researchers also found numerous examples of children’s personal information, including depictions of birth certificates, passports, and health status, but in contexts suggesting that they had been shared for limited purposes.
“It really illuminates the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous to people who have been using the internet with one framework of risk, never assuming it would all be hoovered up by a group trying to create an image generator,” says Ben Winters, the director of AI and privacy at the Consumer Federation of America.
Ultimately, the paper calls for the machine-learning community to rethink the common practice of indiscriminate web scraping and also lays out the possible violations of current privacy laws represented by the existence of PII in massive machine-learning data sets, as well as the limitations of those laws’ ability to protect privacy.
“We have the GDPR in Europe, we have the CCPA in California, but there’s still no federal data protection law in America, which also means that different Americans have different rights protections,” says Marietje Schaake, a Dutch lawmaker turned tech policy expert who currently serves as a fellow at Stanford’s Cyber Policy Center.
Besides, these privacy laws apply to companies that meet certain criteria for size and other characteristics. They do not necessarily apply to researchers like those who were responsible for creating and curating DataComp CommonPool.
And even state laws that do address privacy, like California’s consumer privacy act, have carve-outs for “publicly available” information. Machine-learning researchers have long operated on the principle that if it’s available on the internet, then it is public and no longer private information, but Hong, Agnew, and their colleagues hope that their research challenges this assumption.
“What we found is that ‘publicly available’ includes a lot of stuff that a lot of people might consider private—résumés, photos, credit card numbers, various IDs, news stories from when you were a child, your family blog. These are probably not things people want to just be used anywhere, for anything,” says Hong.
Hopefully, Schaake says, this research “will raise alarm bells and create change.”
This article previously misstated Tiffany Li’s affiliation. This has been fixed.
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.
How to run an LLM on your laptop
In the early days of large language models, there was a high barrier to entry: it used to be impossible to run anything useful on your own computer without investing in pricey GPUs. But researchers have had so much success in shrinking down and speeding up models that anyone with a laptop, or even a smartphone, can now get in on the action.
For people who are concerned about privacy, want to break free from the control of the big LLM companies, or just enjoy tinkering, local models offer a compelling alternative to ChatGPT and its web-based peers. Here’s how to get started running a useful model from the safety and comfort of your own computer. Read the full story.
—Grace Huckins
This story is part of MIT Technology Review’s How To series, helping you get things done. You can check out the rest of the series here.
A brief history of “three-parent babies”
This week we heard that eight babies have been born in the UK following an experimental form of IVF that involves DNA from three people. The approach was used to prevent women with genetic mutations from passing mitochondrial diseases to their children.
But these eight babies aren’t the first “three-parent” children out there. Over the last decade, several teams have been using variations of this approach to help people have babies. But the procedure is not without controversy. Read the full story.
—Jessica Hamzelou
This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.
The must-reads
I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.
1 OpenAI has launched ChatGPT Agent
It undertakes tasks on your behalf by building its own “virtual computer.” (The Verge)
+ It may take a while to actually complete them. (Wired $)
+ Are we ready to hand AI agents the keys? (MIT Technology Review)
2 The White House is going after “woke AI”
It’s preparing an executive order preventing companies with “liberal bias” in their models from landing federal contracts. (WSJ $)
+ Why it’s impossible to build an unbiased AI language model. (MIT Technology Review)
3 A new law in Russia criminalizes certain online searches
Looking up LGBT content, for example, could land Russians in big trouble. (WP $)
+ Dozens of Russian regions have been hit with cellphone internet shutdowns. (ABC News)
4 Elon Musk wants to detonate SpaceX rockets over Hawaii’s waters
Even though the proposed area is a sacred Hawaiian religious site. (The Guardian)
+ Rivals are rising to challenge the dominance of SpaceX. (MIT Technology Review)
5 Meta’s privacy violation trial is over
The shareholders suing Mark Zuckerberg and other officials have settled for a (likely very hefty) payout. (Reuters)
6 Inside ICE’s powerful facial recognition app
Mobile Fortify can check a person’s face against a database of 200 million images. (404 Media)
+ The department has unprecedented access to Medicaid data, too. (Wired $)
7 DOGE has left federal workers exhausted and anxious
Six months in, workers are struggling to cope with the fall out. (Insider $)
+ DOGE’s tech takeover threatens the safety and stability of our critical data. (MIT Technology Review)
8 Netflix has used generative AI in a show for the first time
To cut costs, apparently. (BBC)
9 Does AI really spell the end of loneliness?
Virtual companions aren’t always what they’re cracked up to be. (New Yorker $)
+ The AI relationship revolution is already here. (MIT Technology Review)
10 Flip phones are back with a vengeance
At least they’re more interesting to look at than a conventional smartphone. (Vox)
+ Triple-folding phones might be a bridge too far, though. (The Verge)
Quote of the day
“It is far from perfect.”
—Kevin Weil, OpenAI’s chief product officer, acknowledges that its new agent still requires a lot of work, Bloomberg reports.
One more thing
GMOs could reboot chestnut trees
Living as long as a thousand years, the American chestnut tree once dominated parts of the Eastern forest canopy, with many Native American nations relying on them for food. But by 1950, the tree had largely succumbed to a fungal blight probably introduced by Japanese chestnuts.
As recently as last year, it seemed the 35-year effort to revive the American chestnut might grind to a halt. Now, American Castanea, a new biotech startup, has created more than 2,500 transgenic chestnut seedlings— likely the first genetically modified trees to be considered for federal regulatory approval as a tool for ecological restoration. Read the full story.
—Anya Kamenetz
We can still have nice things
A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)
+ This stained glass embedded into a rusted old Porsche is strangely beautiful.
+ Uhoh: here comes the next annoying group of people to avoid, the Normans.
+ I bet Dolly Parton knows a thing or two about how to pack for a trip.
+ Aww—orcas have been known to share food with humans in the wild.
This week we heard that eight babies have been born in the UK following an experimental form of IVF that involves DNA from three people. The approach was used to prevent women with genetic mutations from passing mitochondrial diseases to their children. You can read all about the results, and the reception to them, here.
But these eight babies aren’t the first “three-parent” children out there. Over the last decade, several teams have been using variations of this approach to help people have babies. This week, let’s consider the other babies born from three-person IVF.
I can’t go any further without talking about the term we use to describe these children. Journalists, myself included, have called them “three-parent babies” because they are created using DNA from three people. Briefly, the approach typically involves using the DNA from the nuclei of the intended parents’ egg and sperm cells. That’s where most of the DNA in a cell is found.
But it also makes use of mitochondrial DNA (mtDNA)—the DNA found in the energy-producing organelles of a cell—from a third person. The idea is to avoid using the mtDNA from the intended mother, perhaps because it is carrying genetic mutations. Other teams have done this in the hope of treating infertility.
mtDNA, which is usually inherited from a person’s mother, makes up a tiny fraction of total inherited DNA. It includes only 37 genes, all of which are thought to play a role in how mitochondria work (as opposed to, say, eye color or height).
That’s why some scientists despise the term “three-parent baby.” Yes, the baby has DNA from three people, but those three can’t all be considered parents, critics argue. For the sake of argument, this time around I’ll use the term “three-person IVF” from here on out.
So, about these babies. The first were reported back in the 1990s. Jacques Cohen, then at Saint Barnabas Medical Center in Livingston, New Jersey, and his colleagues thought they might be able to treat some cases of infertility by injecting the mitochondria-containing cytoplasm of healthy eggs into eggs from the intended mother. Seventeen babies were ultimately born this way, according to the team. (Side note: In their paper, the authors describe potential resulting children as “three-parental individuals.”)
But two fetuses appeared to have genetic abnormalities. And one of the children started to show signs of a developmental disorder. In 2002, the US Food and Drug Administration put a stop to the research.
The babies born during that study are in their 20s now. But scientists still don’t know why they saw those abnormalities. Some think that mixing mtDNA from two people might be problematic.
Newer approaches to three-person IVF aim to include mtDNA from just the donor, completely bypassing the intended mother’s mtDNA. John Zhang at the New Hope Fertility Center in New York City tried this approach for a Jordanian couple in 2016. The woman carried genes for a fatal mitochondrial disease and had already lost two children to it. She wanted to avoid passing it on to another child.
Zhang took the nucleus of the woman’s egg and inserted it into a donor egg that had had its own nucleus removed—but still had its mitochondria-containing cytoplasm. That egg was then fertilized with the woman’s husband’s sperm.
Because it was still illegal in the US, Zhang controversially did the procedure in Mexico, where, as he told me at the time, “there are no rules.” The couple eventually welcomed a healthy baby boy. Less than 1% of the boy’s mitochondria carried his mother’s mutation, so the procedure was deemed a success.
There was a fair bit of outrage from the scientific community, though. Mitochondrial donation had been made legal in the UK the previous year, but no clinic had yet been given a license to do it. Zhang’s experiment seemed to have been conducted with no oversight. Many questioned how ethical it was, although Sian Harding, who reviewed the ethics of the UK procedure, then told me it was “as good as or better than what we’ll do in the UK.”
The scandal had barely died down by the time the next “three-person IVF” babies were announced. In 2017, a team at the Nadiya Clinic in Ukraine announced the birth of a little girl to parents who’d had the treatment for infertility. The news brought more outrage from some quarters, as scientists argued that the experimental procedure should only be used to prevent severe mitochondrial diseases.
It wasn’t until later that year that the UK’s fertility authority granted a team in Newcastle a license to perform mitochondrial donation. That team launched a trial in 2017. It was big news—the first “official” trial to test whether the approach could safely prevent mitochondrial disease.
But it was slow going. And meanwhile, other teams were making progress. The Nadiya Clinic continued to trial the procedure in couples with infertility. Pavlo Mazur, a former embryologist who worked at that clinic, tells me that 10 babies were born there as a result of mitochondrial donation.
Mazur then moved to another clinic in Ukraine, where he says he used a different type of mitochondrial donation to achieve another five healthy births for people with infertility. “In total, it’s 15 kids made by me,” he says.
But he adds that other clinics in Ukraine are also using mitochondrial donation, without sharing their results. “We don’t know the actual number of those kids in Ukraine,” says Mazur. “But there are dozens of them.”
In 2020, Nuno Costa-Borges of Embryotools in Barcelona, Spain, and his colleagues described another trial of mitochondrial donation. This trial, performed in Greece, was also designed to test the procedure for people with infertility. It involved 25 patients. So far, seven children have been born. “I think it’s a bit strange that they aren’t getting more credit,” says Heidi Mertes, a medical ethicist at Ghent University in Belgium.
The newly announced UK births are only the latest “three-person IVF” babies. And while their births are being heralded as a success story for mitochondrial donation, the story isn’t quite so simple. Three of the eight babies were born with a non-insignificant proportion of mutated mitochondria, ranging between 5% and 20%, depending on the baby and the sample.
Dagan Wells of the University of Oxford, who is involved in the Greece trial, says that two of the seven babies in their study also appear to have inherited mtDNA from their intended mothers. Mazur says he has seen several cases of this “reversal” too.
This isn’t a problem for babies whose mothers don’t carry genes for mitochondrial disease. But it might be for those whose mothers do.
I don’t want to pour cold water over the new UK results. It was great to finally see the results of a trial that’s been running for eight years. And the births of healthy babies are something to celebrate. But it’s not a simple success story. Mitochondrial donation doesn’t guarantee a healthy baby. We still have more to learn, not only from these babies, but from the others that have already been born.
This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here.

The “hard data” signals that Ether is not due for a correction anytime soon, according to Felix Xu, a partner at crypto hedge fund ZX Squared Capital.

White House spokesman Kush Desai told Cointelegraph that “No decisions should be deemed official,” unless it comes directly from President Trump himself.