Child sexual abuse material found in the training dataset for Stable Diffusion

The true number of such materials may be much higher than the figure confirmed with Microsoft's PhotoDNA tool.

A new study from the Stanford Internet Observatory (SIO) reveals the presence of Child Sexual Abuse Material (CSAM) in the large public LAION-5B dataset, which was used to train popular generative neural networks, including Stable Diffusion. In an analysis of more than 32 million entries, Microsoft's PhotoDNA tool confirmed 1,008 CSAM images. The researchers stressed that the true number of such materials is likely considerably larger.

It is important to note that LAION-5B does not contain the images themselves; it is a collection of metadata in which each record includes (a sketch of such a record follows the list):
  • a hash of the image;
  • a description (caption);
  • the language of the caption;
  • a flag indicating whether the image may be unsafe;
  • the URL of the image.
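
For illustration, a single metadata record of this kind could be modeled as shown below. The field names are assumptions chosen for readability and do not necessarily match the column names used in the released dataset files.

```python
from dataclasses import dataclass

# Illustrative sketch of one LAION-5B metadata record as described above.
# Field names are assumptions; the actual released columns may be named differently.
@dataclass
class LaionRecord:
    url: str         # link to the remotely hosted image (the image itself is not stored)
    caption: str     # description / alt-text scraped alongside the image
    language: str    # detected language of the caption
    unsafe_flag: str # classifier output indicating whether the image may be unsafe
    image_hash: str  # hash of the image content

example = LaionRecord(
    url="https://example.com/picture.jpg",
    caption="a photo of a cat on a sofa",
    language="en",
    unsafe_flag="UNLIKELY",
    image_hash="d41d8cd98f00b204e9800998ecf8427e",
)
print(example)
```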

Some of the links to CSAM images in LAION-5B led to sites such as Reddit, X, Blogspot, and WordPress, as well as to adult sites such as xHamster and XVideos.

To identify suspicious images in the dataset, the SIO team focused on entries marked as "unsafe" (a rough outline of this screening step is sketched below). These images were checked for CSAM with PhotoDNA, and the results were sent to the Canadian Centre for Child Protection (C3P) for confirmation. The identified source material is now being removed: the image URLs have been passed to C3P and to the National Center for Missing and Exploited Children (NCMEC) in the United States.
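
As a rough illustration of that screening step, the sketch below narrows a local copy of the metadata to entries flagged as unsafe before their hashes would be checked against a hash-matching service. The file name, column names, and the check_hash() helper are hypothetical placeholders; PhotoDNA itself is a restricted service and is not actually called here.

```python
import pandas as pd

def check_hash(image_hash: str) -> bool:
    """Placeholder for a hash-matching lookup (e.g. PhotoDNA); always returns False in this sketch."""
    return False

# Hypothetical local copy of the metadata; column names are assumptions.
metadata = pd.read_parquet("laion5b_metadata.parquet")

# Keep only the entries flagged as potentially unsafe.
suspects = metadata[metadata["unsafe_flag"].isin(["NSFW", "UNSURE"])]

# Collect candidate URLs whose hashes match, to be forwarded for expert confirmation.
matches = [row["url"] for _, row in suspects.iterrows() if check_hash(row["image_hash"])]
print(f"{len(matches)} candidate URLs to forward for confirmation")
```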

Stable Diffusion version 1.5, trained on LAION-5B data, is known for its ability to generate explicit images. Although no direct link has been shown between this dataset and the use of AI to create pornographic images of minors, it is precisely such technologies that have facilitated deepfake blackmail and other crimes.

Stable Diffusion 1.5 remains popular for generating explicit images, in part because of widespread community dissatisfaction with Stable Diffusion 2.0 and its additional safety filters. It is unclear whether Stability AI, the developer of Stable Diffusion, was aware that its models might contain CSAM as a result of using LAION-5B; the company did not respond to the researchers' questions.

The German non-profit organization LAION, which creates datasets for training generative AI, has previously been criticized for including questionable content. Google used LAION-5B's predecessor, LAION-400M, to train its Imagen model, but decided not to release the tool after an audit of LAION-400M revealed a wide range of inappropriate content, including pornographic images, racist slurs, and social stereotypes. In September 2022, private medical photos posted without permission were also found in the LAION-5B dataset.

In response to the findings, LAION announced "regular maintenance procedures" to remove links to suspicious and potentially illegal content. The organization said it has a zero-tolerance policy for illegal content, adding that its public datasets have been temporarily taken down and will be republished once the filtering has been updated. The datasets are scheduled to return to public use in the second half of January.
 