Overview

This tutorial demonstrates how to use Ragas to evaluate the quality of the Retrieval-Augmented Generation (RAG) agents you build with Contextual AI. The purpose of this notebook is to show how flexibly Contextual AI’s platform supports external evaluation approaches; the workflow shown here with Ragas can be applied in much the same way with other evaluation tools.

What is Ragas?

Ragas is an open-source evaluation framework specifically designed for RAG systems. It provides several important metrics to assess the quality of both the retrieval and generation components:
  • Faithfulness - Measures if the generated answer is factually consistent with the retrieved context
  • Context Relevancy - Evaluates if the retrieved passages are relevant to the question
  • Context Recall - Checks if all information needed to answer the question is present in the context
  • Answer Relevancy - Assesses if the generated answer is relevant to the question
A key advantage of Ragas is that it can perform reference-free evaluations, meaning you don’t need ground truth answers to evaluate your Contextual AI RAG pipeline. This makes it particularly useful for evaluating production systems built with Contextual AI where labeled data may not be available.
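For example, a reference-free check needs only the question, the generated answer, and the retrieved passages. The snippet below is a minimal sketch of scoring a single interaction for faithfulness; the question, answer, and context strings are made-up placeholders, and it assumes an OpenAI API key is already set in your environment.
from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Wrap any LangChain chat model as the judge LLM for RAGAS
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# One RAG interaction -- note there is no ground-truth reference answer
sample = SingleTurnSample(
    user_input="What was revenue growth in FY2023?",        # placeholder question
    response="Revenue grew 12% year over year in FY2023.",  # placeholder agent answer
    retrieved_contexts=["FY2023 revenue increased 12% compared to FY2022."],  # placeholder passage
)

# Faithfulness returns a 0-1 score for factual consistency with the retrieved context
score = await Faithfulness(llm=evaluator_llm).single_turn_ascore(sample)
print(score)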
This tutorial assumes you already have a Contextual AI Agent set up. If you haven’t, please follow the Contextual AI Platform Quickstart first.

Scope

This tutorial can be completed in under 30 minutes and covers:
  • Setting up the Ragas evaluation environment
  • Preparing evaluation datasets
  • Querying Contextual AI RAG agents
  • Calculating RAGAS metrics:
    • Faithfulness: Measures factual consistency with retrieved context
    • Context Recall: Evaluates completeness of retrieved information
    • Answer Accuracy: Assesses match with reference answers
  • Analyzing and exporting evaluation results

Prerequisites

  • Contextual AI API Key
  • OpenAI API Key (for RAGAS evaluation)
  • Python 3.8+
  • Required dependencies (listed in requirements.txt)

Environment Setup

Before running the notebook, install the required packages below. These libraries provide tracking, evaluation, export capabilities, and LLM access.

Required Packages

  • langfuse – Tracking and observability
  • ragas – Core evaluation framework
  • openpyxl – Enables exporting results to Excel
  • openai – Provides LLM access (used internally by RAGAS)
  • langchain-openai – LangChain integration with OpenAI
  • langchain-contextual – Connects LangChain to Contextual AI
Together, these packages form a complete evaluation pipeline. Installation may take a few minutes depending on your connection speed and whether you already have some dependencies installed.
%pip install langfuse ragas openpyxl openai langchain-openai langchain-contextual --upgrade --quiet
You may need to restart the kernel to use updated packages.

Import Dependencies

With the environment ready, we can import the necessary libraries and initialize our clients. The imports are grouped for clarity and maintainability:

Import Structure

  • Standard library imports — Core Python functionality
  • Third-party imports — Data processing and API interaction
  • RAGAS imports — Evaluation metrics and utilities
  • Client initialization — Contextual AI client and the evaluator LLM
This tutorial uses GPT-4o as the evaluator model, since high-quality evaluation depends on the model’s ability to understand nuance, interpret context, and accurately compare textual information.
# Standard library imports
import os
import random
import time
import asyncio
import uuid
from typing import List, Dict, Any
import requests

# Third party imports
import pandas as pd
import tqdm
import openai
from langchain_openai import ChatOpenAI
from contextual import ContextualAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness, ContextRecall, AnswerAccuracy

# API keys (replace the placeholders with your own keys)
os.environ["OPENAI_API_KEY"] = "API_KEY"
os.environ["CONTEXTUAL_API_KEY"] = "API_KEY"

# Initialize clients
client = ContextualAI()
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
def fetch_file(filepath):
    directory = os.path.dirname(filepath)
    if directory:  # Create the target directory if it doesn't exist yet
        os.makedirs(directory, exist_ok=True)

    print(f"Fetching {filepath}")
    response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/01-getting-started/{filepath}")
    response.raise_for_status()  # Fail fast on a bad download instead of silently writing an error page

    with open(filepath, 'wb') as f:
        f.write(response.content)

fetch_file('data/eval_short.csv')
Fetching data/eval_short.csv
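With the file downloaded, a quick sanity check is to load it with pandas (imported above) and inspect its shape and column names before wiring it into the evaluation. This is a small optional sketch rather than a required step:
# Load the evaluation set and confirm its structure before using it
eval_df = pd.read_csv('data/eval_short.csv')
print(eval_df.shape)             # number of evaluation rows and columns
print(eval_df.columns.tolist())  # check the column names rather than assuming them
eval_df.head()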

Running The Notebook

Once you’ve set up your environment and dependencies, continue preparing your evaluation data by following Steps 3-8 in the example notebook, which cover testing your RAG pipeline and applying RAGAS metrics to your evaluation samples. A rough sketch of that core loop, querying the agent and scoring a single sample, is shown below.
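The sketch reuses the client, evaluator_llm, and metric imports from the setup above. The agent ID, prompt, and reference answer are placeholders, and the exact query method and response fields (agents.query.create, include_retrieval_content_text, message.content, retrieval_contents[...].content_text) are assumptions about the Contextual AI Python SDK; check the SDK reference for your installed version.
# Placeholder values -- replace with your agent's ID and a real evaluation row
agent_id = "YOUR_AGENT_ID"
prompt = "What was revenue growth in FY2023?"
reference = "Revenue grew 12% year over year in FY2023."

# Query the Contextual AI agent (method and field names assumed; verify against the SDK docs)
result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{"role": "user", "content": prompt}],
    include_retrieval_content_text=True,
)
answer = result.message.content
contexts = [rc.content_text for rc in result.retrieval_contents]

# Package the interaction as a RAGAS sample; `reference` is only needed by reference-based metrics
sample = SingleTurnSample(
    user_input=prompt,
    response=answer,
    retrieved_contexts=contexts,
    reference=reference,
)

# Score the sample with the three metrics used in this tutorial
faithfulness_score = await Faithfulness(llm=evaluator_llm).single_turn_ascore(sample)
context_recall_score = await ContextRecall(llm=evaluator_llm).single_turn_ascore(sample)
answer_accuracy_score = await AnswerAccuracy(llm=evaluator_llm).single_turn_ascore(sample)
print(faithfulness_score, context_recall_score, answer_accuracy_score)
In the full notebook, this per-sample scoring is repeated over the evaluation set and the scores are collected for analysis and export.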