Mistral AI caught up in suspicions of massive looting

The Anh

Long presented as the European alternative to the American AI giants, Mistral AI is now seeing its image tarnished.

After Perplexity's dispute with Britannica, and similar cases involving OpenAI and Anthropic, a recent investigation accuses the French startup of having used copyrighted works to train its language models…

Iconic works reproduced by the model

Published by Mediapart, the investigation focuses on Mistral Large 3-2512, the company's latest model, as well as its chatbot Le Chat. The journalists worked with a researcher specializing in algorithmic auditing to assess the model's ability to reproduce copyrighted content. According to Mediapart, when the paragraphs were submitted to it one by one, the system was able to reproduce 35% of the English version of Harry Potter and the Philosopher's Stone, as well as 58% of The Little Prince and 25% of The Hobbit. The opening passages of "1984" and "Game of Thrones" were also obtained without difficulty. For several researchers, this level of accuracy is strong evidence that the complete works were used to train the model.

The subject is all the more sensitive given that, in November 2025, a Munich court set a threshold of 15 consecutive words reproduced identically as constituting copyright infringement. The investigation also mentions songs, such as Elton John's "Rocket Man," whose reproduction exceeds this threshold. When questioned, the company invoked a "principle of reality," suggesting that this content, widely disseminated online, was collected indirectly.

Transparency, opt-outs, and the AI Act: a framework under pressure

Beyond literary and musical works, the investigation raises the question of compliance with opt-out mechanisms. The 2019 European directive authorizes text and data mining of protected content, provided that rights holders can object, notably via the robots.txt file for websites. However, Mediapart claims to have observed requests originating from servers linked to Mistral despite an explicit prohibition. For its part, the startup maintains that these robots are used solely to enhance user responses, and not to create training datasets.
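For readers unfamiliar with how a robots.txt opt-out works in practice, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the crawler name ExampleAIBot and the example.com URL are hypothetical and chosen purely for illustration; they are not taken from the investigation and do not describe Mistral's actual crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site opts out of one crawler and allows all others.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The opted-out crawler is refused, any other crawler is allowed.
print(is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/article"))  # False
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))      # True
```

A well-behaved crawler would run a check of this kind before each request. The file itself only expresses the site's wishes: nothing technically prevents a server from ignoring it, which is why request logs of the sort Mediapart describes matter.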

It should be noted that this case comes at a time when the European AI Act imposes new transparency obligations regarding the data used. From now on, Mistral is expected to publish a detailed list of its training sources.

Already challenged at the end of January by a publisher, the startup thus faces mounting legal and political pressure. And at a time when the race for model size determines competitiveness, the line between technological innovation and respect for copyright is becoming more of a minefield than ever…
