Authors:
Josue Daniel Caldas Velasquez, Elvis de Souza
Abstract:Retrieval-Augmented Generation (RAG) has emerged as an effective strategy to ground Large Language Models (LLMs) with reliable, real-time information. This paper investigates the trade-off between cost and performance by evaluating 13 LLMs within a RAG pipeline for the Question Answering (Q&A) task under noisy retrieval conditions. We assess four extractive and nine generative models—spanning both open- and closed-source ones of varying sizes—on a journalistic benchmark specifically designed for RAG. By systematically varying the level of noise injected into the retrieved context, we analyze not only which models perform best, but also their robustness to noisy input. Results show that large open-source generative models (approx. 70B parameters) achieve performance and noise tolerance on par with top-tier closed-source models. However, their computational demands limit their practicality in resource-constrained settings. In contrast, medium-sized open-source models (approx. 7B parameters) emerge as a compelling compromise, balancing efficiency, robustness, and accessibility.
Authors:
Pakhapoom Sarapat, Trapoom Ukarapol, Tatsunori Hashimoto
Abstract:This paper presents a comprehensive study on the multilingual adaptability of large language models (LLMs), with a focus on the interplay between training strategies and prompt design. Using Thai as a case study, we examine: (RQ1) the extent to which pre-trained models (Base) can adapt to another language through additional fine-tuning; (RQ2) how continual pre-training (CPT) compares to multilingual pre-training (MLLM) in terms of performance on downstream tasks; and (RQ3) how language variation within different components of a structured prompt--task instruction, context input, and output instruction--influences task performance in cross-lingual settings. Our findings reveal that CPT proves to be a promising strategy for enhancing model performance in languages other than English like Thai in monolingual settings, particularly for models that initially lack strong linguistic capabilities. Its effectiveness, however, is highly task-dependent and varies based on the base model's initial proficiency. In cross-lingual scenarios, MLLMs exhibit superior robustness compared to Base and CPT models, which are more susceptible to context-output language mismatches. Considering the high cost of training multilingual models from scratch, MLLMs remain a critical component for downstream tasks in multilingual settings due to their strong cross-lingual performance.
Authors:
Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon
Abstract:Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs. All data is publicly available at https://huggingface.co/datasets/wrom/Language-Vision-Hallucinations.
Authors:
David Kletz, Sandra Mitrovic, Ljiljana Dolamic, Fabio Rinaldi
Abstract:In this paper, we explore the potential of Open-source Small Language Models (OSLMs) for localizing hallucinations related to factual accuracy. We first present Lucifer, a dataset designed to enable proper and consistent evaluation of LMs, composed of an automatically constructed portion and a manually curated subset intended for qualitative analysis. We then assess the performance of five OSLMs using four carefully designed prompts. Results are evaluated either individually or merged through a voting-based merging approach. While our results demonstrate that the merging method yields promising performance even with smaller models, our manually curated dataset highlights the inherent difficulty of the task, underscoring the need for further research.
Authors:
Chrisanna Cornish, Anna Rogers
Abstract:Chain-of-Thought (CoT) ‘reasoning’ promises to enhance the performance and transparency of Large Language Models (LLMs). Models, such as Deepseek R1, are trained via reinforcement learning to automatically generate CoT explanations in their outputs. Their faithfulness, i.e. how well the explanations actually reflect their internal reasoning process, has been called into doubt by recent studies (Chen et al., 2025a; Chua and Evans, 2025). This paper extends previous work by probing Deepseek R1 with 445 logical puzzles under zero- and few-shot settings. We find that whilst the model explicitly acknowledges a strong harmful hint in 94.6% of cases, it reports less than 2% of helpful hints. Further analysis reveals implicit unfaithfulness as the model significantly reduces answer-rechecking behaviour for helpful hints (p<0.01) despite rarely mentioning them in its CoT, demonstrating a discrepancy between its reported and actual decision process. In line with prior reports for GPT, Claude, Gemini and other models, our results for DeepSeek raise concerns about the use of CoT as an explainability technique.
Authors:
Sri Yanamandra, Vivek Sekhadia, James Vaisman, Yuliah Louis, Kingston Huynh
Abstract:Retrieval-Augmented Generation (RAG) enhances language models by grounding outputs in retrieved, relevant documents, effectively addressing key limitations such as hallucinations, updated knowledge, and restricted domain expertise. We propose CREAM-RAG, which stabilizes self-reward signals by applying consistency regularization, reducing hallucinations, and improving factual accuracy. Our framework integrates retrieval, DPO-based self-reward reinforcement learning, and consistency loss into a unified system for more reliable output. Experimental results demonstrated significant performance improvements on the LLaMA 2 7B model. It improved the results on average by 35.04% compared to the base model, suggesting that CREAM-RAG successfully enhanced semantic reasoning and decreased hallucinations.
Authors:
Karin Sim, Lisa Vasileva
Abstract:LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would- enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.
Authors:
Patrícia Schmidtová, Ondrej Dusek, Saad Mahamood
Abstract:We examine evaluation of faithfulness to input data in the context of hotel highlights – brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (r=0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
Authors:
Patrícia Schmidtová, Eduardo Calò, Simone Balloccu, Dimitra Gkatzia, Rudali Huidrom, Mateusz Lango, Fahime Same, Vilém Zouhar, Saad Mahamood, Ondrej Dusek
Abstract:Hallucinations are one of the most pressing challenges for large language models (LLMs). While numerous methods have been proposed to detect and mitigate them automatically, human evaluation continues to serve as the gold standard. However, these human evaluations of hallucinations show substantial variation in definitions, terminology, and evaluation practices. In this paper, we survey 64 studies involving human evaluation of hallucination published between 2019 and 2024, to investigate how hallucinations are currently defined and assessed. Our analysis reveals a lack of consistency in definitions and exposes several concerning methodological shortcomings. Crucial details, such as evaluation guidelines, user interface design, inter-annotator agreement metrics, and annotator demographics, are frequently under-reported or omitted altogether.
Authors:
Pranava Madhyastha
Abstract:Public leaderboards for large language models often rely on aggregate scores that conceal critical information about model behavior. In this paper, we present a methodology for task-aware evaluation that combines (i) correctness metrics aligned with task semantics compliance checks for instruction-following and numeric equivalence for mathematics with (ii) pairwise error-overlap analysis to identify complementary model pairs. We apply this methodology to 17 outputs of recent state of the art and frontier LLMs across multiple-choice QA, instruction-following, and mathematical reasoning tasks. We observe that task-aware metrics can reorder model rankings relative to generic lexical metrics, and that error-overlap patterns vary substantially across model pairs and scenarios. We finally conclude by discussing implications for model selection, routing strategies, and LLM-as-judge calibration, and release our analysis pipeline to support further investigation.
Authors:
Harsh Rathwa, Pruthwik Mishra, Shrikant Malviya
Abstract:The detection of hallucinations in multilingual scientific text generated by Large Language Models (LLMs) presents significant challenges for reliable AI systems. This paper describes our submission to the SHROOM-CAP 2025 shared task on scientific hallucination detection across 9 languages. Unlike most approaches that focus primarily on model architecture, we adopted a data-centric strategy that addressed the critical issue of training data scarcity and imbalance. We unify and balance five existing datasets to create a comprehensive training corpus of 124,821 samples (50% correct, 50% hallucinated), representing a 172x increase over the original SHROOM training data. Our approach fine-tuned XLM-RoBERTa-Large with 560 million parameters on this enhanced dataset, achieves competitive performance across all languages, including extbf{2nd place in Gujarati} (zero-shot language) with Factuality F1 of 0.5107, and rankings between 4th-6th place across the remaining 8 languages. Our results demonstrate that systematic data curation can significantly outperform architectural innovations alone, particularly for low-resource languages in zero-shot settings.
Authors:
Anjali R, Anand Kumar M
Abstract:One of the key challenges of deploying Large Language Models (LLMs) in multilingual scenarios is maintaining output quality across two conditions: factual correctness and linguistic fluency. LLMs are liable to produce text with factual hallucinations, solid-sounding but false information, and fluency errors that take the form of grammatical mistakes, repetition, or unnatural speech patterns. In this paper, we address a two-framework solution for the end-to-end quality evaluation of LLM-generated text in low-resource languages. (1) For hallucination detection, we introduce a retrieval-augmented classification model that utilizes hybrid document retrieval, along with gradient boosting.(2) For fluency detection, we introduce a deep learning model that combines engineered statistical features with pre-trained semantic embeddings using an attention-based mechanism.
Authors:
Timur Ionov, Evgenii Nikolaev, Artem Vazhentsev, Mikhail Chaichuk, Anton Korznikov, Elena Tutubalina, Alexander Panchenko, Vasily Konovalov, Elisei Rykov
Abstract:Large Language Models (LLMs) often generate hallucinations, a critical issue in domains like scientific communication where factual accuracy and fluency are essential. The SHROOM-CAP shared task addresses this challenge by evaluating Factual Mistakes and Fluency Mistakes across diverse languages, extending earlier SHROOM editions to the scientific domain. We present Smurfcat, our system for SHROOM-CAP, which integrates three complementary approaches: uncertainty estimation (white-box and black-box signals), encoder-based classifiers (Multilingual Modern BERT), and decoder-based judges (instruction-tuned LLMs with classification heads). Results show that decoder-based judges achieve the strongest overall performance, while uncertainty methods and encoders provide complementary strengths. Our findings highlight the value of combining uncertainty signals with encoder and decoder architectures for robust, multilingual detection of hallucinations and related errors in scientific publications.
Authors:
TBU
Abstract:TBU
Accepted papers
|
|
|