Microsoft collaborates with News Corp-B's HarperCollins to train AI models on selected non-fiction books

Microsoft has reached an agreement with News Corp-B's HarperCollins to use its extensive non-fiction catalog to train artificial intelligence models. The collaboration is limited to selected backlist titles and does not involve the creation of new books, and authors can choose whether to participate. HarperCollins confirmed the agreement, emphasizing that it respects authors' rights and protects the value of their works and the income from them. The deal aims to improve model quality and accuracy, as technology companies broadly seek high-quality text sources for AI training.

According to people familiar with the matter, Microsoft Corporation (MSFT.US) has reached an agreement with HarperCollins Publishers, a subsidiary of News Corporation-B (NWS.US), to use the publisher's extensive non-fiction catalog to train its artificial intelligence models, with the aim of improving model quality and performance. The collaboration is limited to using selected backlist titles for model training and does not involve the creation of new books, and authors have the right to choose whether to participate.

Specifically, Microsoft hopes to use HarperCollins books to train a yet-to-be-announced AI model, expanding its high-quality text sources and improving the model's accuracy and expertise. Although Microsoft declined to comment, HarperCollins has confirmed the agreement, stating that it will "allow limited use of selected non-fiction backlist titles to train AI models."

At the same time, HarperCollins emphasized that the agreement is limited in scope, sets clear guardrails on model output that respect authors' rights, and lets authors choose whether to participate.

"One of our tasks is to present opportunities for authors to consider while ensuring that the core value of their works and the revenue and royalties we share are protected," HarperCollins stated. "This agreement, with its limited scope and clear guardrails on model output that respect authors' rights, achieves this goal."

It is understood that technology companies have been seeking more high-quality text sources to train AI models, and companies like Microsoft are no exception. They obtain licenses to use a range of data from social media sites to news articles to make their programs more accurate and better at answering questions or providing expertise on specific topics.

Notably, News Corporation had previously signed an agreement with OpenAI, allowing the AI company to use content from several News Corporation publications. Microsoft has also collaborated with multiple publishers on AI projects.

Additionally, earlier this year, Google reached a $60 million agreement with Reddit, enabling the search giant to utilize a large number of subreddits to train its AI models.

However, some publishers have objected to AI companies using their content without permission and have filed lawsuits. For example, The New York Times has sued OpenAI and Microsoft, accusing them of copyright infringement.

In summary, the agreement between Microsoft and HarperCollins marks another significant step for technology companies seeking high-quality text sources to train AI models. However, respecting authors' rights while using these resources remains a challenge that publishers and technology companies need to face together.