
Evaluation of reliability criteria for news publishers with large language models / Pratelli, Manuel; Bianchi, John; Pinelli, Fabio; Petrocchi, Marinella. - (2025), pp. 179-188. (WebSci '25 - 17th ACM Web Science Conference 2025, New Brunswick, USA, 20-24/05/2025) [10.1145/3717867.3717924].

Evaluation of reliability criteria for news publishers with large language models

Pratelli, Manuel; Bianchi, John; Pinelli, Fabio; Petrocchi, Marinella
2025

Abstract

In this study, we investigate the use of a large language model to assist in evaluating the reliability of the vast number of existing online news publishers, addressing the impracticality of relying solely on human expert annotators for this task. In the context of the Italian news media market, we first task the model with evaluating expert-designed reliability criteria on a representative sample of news articles. We then compare the model's answers with those of human experts. The dataset consists of 352 news articles annotated by three human experts and the LLM. Examining 6,081 annotations over six criteria, we observe good agreement between the LLM and the human annotators on three of the evaluated criteria, including the critical ability to detect instances where a text negatively targets an entity or individual. On two additional criteria, namely the detection of sensational language and the recognition of bias in news content, the LLM generates fair annotations, albeit with certain trade-offs. Furthermore, we show that the LLM can help resolve disagreements among human experts, especially in tasks such as identifying cases of negative targeting.
2025
ISBN: 9798400714832
Keywords: Inter-annotator agreement; Reliability evaluation; Good editorial practices; Generative question answering; LLMs
Files in this item:

File: 3717867.3717924.pdf
Description: Evaluation of Reliability Criteria for News Publishers with Large Language Models
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 1.19 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11771/40058