Overview

This tutorial demonstrates how to use Ragas to evaluate the quality of the Retrieval-Augmented Generation (RAG) agents you build with Contextual AI. The purpose of this notebook is to show how flexibly Contextual AI’s platform supports external evaluation approaches; the workflow shown here with Ragas can be applied in much the same way with other evaluation tools.

What is Ragas?

Ragas is an open-source evaluation framework specifically designed for RAG systems. It provides several important metrics to assess the quality of both the retrieval and generation components:
  • Faithfulness - Measures if the generated answer is factually consistent with the retrieved context
  • Context Relevancy - Evaluates if the retrieved passages are relevant to the question
  • Context Recall - Checks if all information needed to answer the question is present in the context
  • Answer Relevancy - Assesses if the generated answer is relevant to the question
A key advantage of Ragas is that it can perform reference-free evaluations, meaning you don’t need ground truth answers to evaluate your Contextual AI RAG pipeline. This makes it particularly useful for evaluating production systems built with Contextual AI where labeled data may not be available.
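For example, a reference-free check needs only the question, the generated answer, and the retrieved passages. The snippet below is a minimal sketch of scoring a single interaction for faithfulness; the question, answer, and context strings are made-up placeholders, and it assumes an OpenAI API key is already set in your environment.
from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Wrap any LangChain chat model as the judge LLM for RAGAS
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# One RAG interaction -- note there is no ground-truth reference answer
sample = SingleTurnSample(
    user_input="What was revenue growth in FY2023?",        # placeholder question
    response="Revenue grew 12% year over year in FY2023.",  # placeholder agent answer
    retrieved_contexts=["FY2023 revenue increased 12% compared to FY2022."],  # placeholder passage
)

# Faithfulness returns a 0-1 score for factual consistency with the retrieved context
score = await Faithfulness(llm=evaluator_llm).single_turn_ascore(sample)
print(score)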
This tutorial assumes you already have a Contextual AI Agent set up. If you haven’t, please follow the Contextual AI Platform Quickstart first.

Scope

This tutorial can be completed in under 30 minutes and covers:
  • Setting up the Ragas evaluation environment
  • Preparing evaluation datasets
  • Querying Contextual AI RAG agents
  • Calculating RAGAS metrics:
    • Faithfulness: Measures factual consistency with retrieved context
    • Context Recall: Evaluates completeness of retrieved information
    • Answer Accuracy: Assesses match with reference answers
  • Analyzing and exporting evaluation results

Prerequisites

  • Contextual AI API Key
  • OpenAI API Key (for RAGAS evaluation)
  • Python 3.8+
  • Required dependencies (listed in requirements.txt)

Environment Setup

Before running the notebook, install the required packages below. These libraries provide tracking, evaluation, export capabilities, and LLM access.

Required Packages

  • langfuse – Tracking and observability
  • ragas – Core evaluation framework
  • openpyxl – Enables exporting results to Excel
  • openai – Provides LLM access (used internally by RAGAS)
  • langchain-openai – LangChain integration with OpenAI
  • langchain-contextual – Connects LangChain to Contextual AI
Together, these packages form a complete evaluation pipeline. Installation may take a few minutes depending on your connection speed and whether you already have some dependencies installed.
%pip install langfuse ragas openpyxl openai langchain-openai langchain-contextual --upgrade --quiet
You may need to restart the kernel to use updated packages.

Import Dependencies

With the environment ready, we can import the necessary libraries and initialize our clients. The imports are grouped for clarity and maintainability:

Import Structure

  • Standard library imports — Core Python functionality
  • Third-party imports — Data processing and API interaction
  • RAGAS imports — Evaluation metrics and utilities
  • Client initialization — Contextual AI client and the evaluator LLM
This tutorial uses GPT-4o as the evaluator model, since high-quality evaluation depends on the model’s ability to understand nuance, interpret context, and accurately compare textual information.
# Standard library imports
import os
import random
import time
import asyncio
import uuid
from typing import List, Dict, Any
import requests

# Third party imports
import pandas as pd
import tqdm
import openai
from langchain_openai import ChatOpenAI
from contextual import ContextualAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness, ContextRecall, AnswerAccuracy

# API keys (replace the placeholders with your own keys)
os.environ["OPENAI_API_KEY"] = "API_KEY"
os.environ["CONTEXTUAL_API_KEY"] = "API_KEY"

# Initialize clients
client = ContextualAI()
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
def fetch_file(filepath):
    directory = os.path.dirname(filepath)
    if directory:  # Create the target directory if it doesn't exist yet
        os.makedirs(directory, exist_ok=True)

    print(f"Fetching {filepath}")
    response = requests.get(f"https://raw.githubusercontent.com/ContextualAI/examples/main/01-getting-started/{filepath}")
    response.raise_for_status()  # Fail fast on a bad download instead of silently writing an error page

    with open(filepath, 'wb') as f:
        f.write(response.content)

fetch_file('data/eval_short.csv')
Fetching data/eval_short.csv
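With the file downloaded, a quick sanity check is to load it with pandas (imported above) and inspect its shape and column names before wiring it into the evaluation. This is a small optional sketch rather than a required step:
# Load the evaluation set and confirm its structure before using it
eval_df = pd.read_csv('data/eval_short.csv')
print(eval_df.shape)             # number of evaluation rows and columns
print(eval_df.columns.tolist())  # check the column names rather than assuming them
eval_df.head()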

Running The Notebook

Once you’ve set up your environment and dependencies, continue preparing your evaluation data by following Steps 3-8 in the example notebook, which cover testing your RAG pipeline and applying RAGAS metrics to your evaluation samples. A rough sketch of that core loop, querying the agent and scoring a single sample, is shown below.
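The sketch reuses the client, evaluator_llm, and metric imports from the setup above. The agent ID, prompt, and reference answer are placeholders, and the exact query method and response fields (agents.query.create, include_retrieval_content_text, message.content, retrieval_contents[...].content_text) are assumptions about the Contextual AI Python SDK; check the SDK reference for your installed version.
# Placeholder values -- replace with your agent's ID and a real evaluation row
agent_id = "YOUR_AGENT_ID"
prompt = "What was revenue growth in FY2023?"
reference = "Revenue grew 12% year over year in FY2023."

# Query the Contextual AI agent (method and field names assumed; verify against the SDK docs)
result = client.agents.query.create(
    agent_id=agent_id,
    messages=[{"role": "user", "content": prompt}],
    include_retrieval_content_text=True,
)
answer = result.message.content
contexts = [rc.content_text for rc in result.retrieval_contents]

# Package the interaction as a RAGAS sample; `reference` is only needed by reference-based metrics
sample = SingleTurnSample(
    user_input=prompt,
    response=answer,
    retrieved_contexts=contexts,
    reference=reference,
)

# Score the sample with the three metrics used in this tutorial
faithfulness_score = await Faithfulness(llm=evaluator_llm).single_turn_ascore(sample)
context_recall_score = await ContextRecall(llm=evaluator_llm).single_turn_ascore(sample)
answer_accuracy_score = await AnswerAccuracy(llm=evaluator_llm).single_turn_ascore(sample)
print(faithfulness_score, context_recall_score, answer_accuracy_score)
In the full notebook, this per-sample scoring is repeated over the evaluation set and the scores are collected for analysis and export.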