From Rabat to Reliable AI: Rethinking Retrieval-Augmented Generation Evaluation


At EACL 2026 in Rabat, Morocco, presenting RAGVUE became more than a conference milestone. It was a journey through research conversations, culture, and one central question that matters across research domains: How can we trust LLM-generated answers?


Author: Keerthana Murugaraj

Keerthana Murugaraj is a doctoral researcher in computer science at the Faculty of Science, Technology, and Medicine (FSTM). She works on retrieval-augmented generation and explainable RAG evaluation, focusing on how to make AI systems more transparent, diagnosable, and reliable.


A conference, a city, and a question

Some conferences are remembered for the paper you presented. Others stay with you because of the city, the people, the conversations, and the way the whole experience quietly changes how you see your own work. For me, EACL 2026 in Rabat, Morocco, was one of those conferences.

At the end of March, I traveled to Rabat to present our system demonstration paper, RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation, co-authored with Dr. Salima Lamsiyah and Prof. Martin Theobald. RAGVUE is a tool for evaluating Retrieval-Augmented Generation systems, or RAG systems, which answer questions by retrieving external documents and generating responses from them.

But the trip was not only about presenting a system. It became a reminder of why this research matters.

Rabat had a rhythm of its own. Between conference sessions, demo presentations, and research discussions, there was the warmth of the city, the colors of the streets, the kindness of people, the mint tea, the food, and the feeling of being in a place where cultures and histories meet naturally. Morocco gave the conference a special atmosphere. It made the experience feel alive, not just academic.

The conference itself was very well organized. The sessions, demonstrations, posters, and informal discussions created many spaces for exchange. There were moments of intense technical discussion, but also lighter moments during coffee breaks, meals, and walks through the city, where research ideas carried on in a more relaxed, human way.

That is one of the best parts of a conference: you arrive with your own paper, but you leave with new questions.

Why evaluation needs a diagnosis

For me, the central question was this: how do we know when an LLM-generated answer is reliable? On paper, RAG systems offer a promising answer. Instead of relying only on what a language model has learned during training, they retrieve relevant evidence from external sources and use it to generate a more grounded response.

In practice, it is much more complicated.

A system may retrieve the wrong document. It may retrieve the right document but miss the most important passage. It may generate an answer that sounds confident but is only partly supported by the evidence. It may also receive a score that tells us something went wrong, without explaining what failed.

This is the gap that we try to address.
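
To make these failure modes concrete, here is a minimal Python sketch of what a diagnostic check could look like. It is not RAGVUE's actual metrics or API: the names diagnose, Diagnosis, and coverage, the 0.6 threshold, and the toy archive example are all assumptions made for illustration, and plain word overlap stands in for the semantic checks a real evaluator would need. The point is only that separate checks can report where a failure happened, which a single score cannot.

```python
import re
from dataclasses import dataclass


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(needle: set[str], haystacks: list[set[str]]) -> float:
    """Best fraction of `needle` tokens found in any single haystack."""
    if not needle or not haystacks:
        return 0.0
    return max(len(needle & h) / len(needle) for h in haystacks)


@dataclass
class Diagnosis:
    retrieval_hit: bool     # did some retrieved passage cover the gold evidence?
    supported: list[str]    # answer sentences backed by a retrieved passage
    unsupported: list[str]  # answer sentences with no backing passage


def diagnose(answer: str, passages: list[str], gold_evidence: str,
             threshold: float = 0.6) -> Diagnosis:
    """Hypothetical helper: report retrieval and grounding separately."""
    passage_tokens = [tokens(p) for p in passages]

    # Retrieval check: is the gold evidence covered by at least one passage?
    retrieval_hit = coverage(tokens(gold_evidence), passage_tokens) >= threshold

    # Grounding check: test each answer sentence against the passages.
    supported, unsupported = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not tokens(sentence):
            continue  # skip empty fragments
        if coverage(tokens(sentence), passage_tokens) >= threshold:
            supported.append(sentence)
        else:
            unsupported.append(sentence)

    return Diagnosis(retrieval_hit, supported, unsupported)


# Made-up toy data: the retriever found the right passage,
# but the answer adds a claim the passage does not contain.
report = diagnose(
    answer="The letter is dated 1887. It was addressed to a famous poet.",
    passages=["The archive letter is dated 1887 and was written in Luxembourg."],
    gold_evidence="The letter is dated 1887",
)
print(report)
```

In this toy case the retrieval check passes while the second sentence of the answer is flagged as unsupported, so the report points to generation rather than retrieval as the place to look.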

Instead of treating evaluation as one final number, RAGVUE looks at the behavior of the system more carefully. It helps users inspect retrieval quality, answer faithfulness, grounding, and judge stability. The goal is not only to ask, “Is this answer good?” but also, “Why did the system behave this way?”

During the demo sessions, I realized that this question resonates with the concerns of many researchers. People were not only interested in the technical details of the tool. They were interested in the bigger issue behind it: how can we make AI systems more transparent, more explainable, and more useful for real users?

Why this matters for Digital Humanities

This question is especially important for the digital humanities. When we work with historical documents, archives, literary texts, or cultural heritage collections, evidence is rarely simple. Sources can be incomplete, multilingual, ambiguous, or shaped by historical context. In such settings, an LLM-generated answer should not simply sound fluent. It should be traceable. It should show its evidence. It should make uncertainty visible.

Reliable AI is not only AI that gives a correct answer. Reliable AI is AI that can be inspected. It helps users understand when to trust, when to question, and when to look deeper.

What I brought back

Rabat made this lesson clearer for me. Being in a new place, surrounded by researchers from different countries and perspectives, reminded me that research does not grow only through publications. It grows through conversations, feedback, curiosity, and the courage to ask better questions. I returned from EACL 2026 with more than the memory of presenting RAGVUE. I returned with renewed motivation.

The future of AI will not only depend on building more powerful systems. It will also depend on building evaluation systems that we can understand, question, and trust.

That is one of the directions I want my work to contribute to: moving beyond scores and towards AI systems that are transparent, explainable, and responsible.

Published Research Paper Link: https://aclanthology.org/2026.eacl-demo.35/

GitHub Repo: https://github.com/KeerthanaMurugaraj/RAGVue-Diagnostic

Thank you for your valuable time.

Keerthana