
Overview

Evaluating LLM outputs is critical, but it remains underexplored for generative AI use cases. Natural language unit tests provide a systematic approach to evaluating LLM response quality. Contextual AI’s LMUnit is a specialized model that achieves state-of-the-art performance in creating and applying unit tests to evaluate LLM outputs.

Why Natural Language Unit Testing?

Traditional LLM evaluation methods often face several challenges:
  • Human evaluations are inconsistent and costly, while metrics like ROUGE fail to capture nuanced quality measures.
  • General-purpose LLMs may not provide fine-grained feedback
  • Simple yes/no evaluations miss important nuances
Natural language unit tests address these challenges by:
  • Breaking down evaluation into specific, testable criteria
  • Providing granular feedback on different quality aspects
  • Enabling systematic improvement of LLM outputs
  • Supporting domain-specific quality requirements
For example, financial compliance often requires precise regulatory phrasing, which is hard to assess with a generic style evaluation.
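For instance, a domain-specific criterion can be written directly as a natural-language question. The strings below are illustrative examples only and are not part of the dataset used later in this tutorial:
# A generic test is hard to act on
generic_test = "Is the response well written?"

# A domain-specific test pins down one quality that matters in this domain
compliance_test = "Does the response use the precise regulatory phrasing required for this disclosure?"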

1. Set Up Development Environment

Set up your environment to start using LMUnit to evaluate LLM responses. This example uses LMUnit as provided through the Contextual AI Python client, so install it first.
%pip install contextual-client tqdm
You’ll need several Python packages for data handling and visualization:
import os
import pandas as pd
from contextual import ContextualAI

# polar plots
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Optional, Union, Tuple

# clustering analysis
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns
To use LMUnit, you’ll need an API key from Contextual AI. Follow these instructions to get your API key.
client = ContextualAI(api_key="ADD YOUR KEY HERE")
# Consider using environment variables for production environments.
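For production, a safer pattern is to read the key from an environment variable rather than hard-coding it. This is a minimal sketch that assumes you have exported a variable named CONTEXTUAL_API_KEY:
import os
from contextual import ContextualAI

# Assumes: export CONTEXTUAL_API_KEY="your-key" was run in your shell
client = ContextualAI(api_key=os.environ["CONTEXTUAL_API_KEY"])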

2. Load Evaluation Dataset

LMUnit evaluates query–response pairs, which means we need:
  • The original query/prompt
  • The LLM’s response
This example uses synthetic financial data. The dataset contains 10 financial questions and responses designed to highlight different aspects of response-quality evaluation.
df = pd.read_csv(
    'data/financial_qa_pairs.csv' if os.path.exists('data/financial_qa_pairs.csv')
    else "https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/03-lmunit/data/financial_qa_pairs.csv"
)
df.head()

3. Identify Unit Tests

Unit tests offer deeper insight than simply asking an LLM whether a response is “high quality.” When writing effective unit tests, strive to ensure they are:
  • Specific and focused on a single aspect
  • Clear and unambiguous
  • Measurable and consistent
  • Relevant to the domain
  • Framed positively
You should create unit tests tailored to your own use case. For this example, we use six global unit tests across all responses. These reflect critical dimensions of high-quality communication in financial services.

Context

Question: “Are relevant market conditions or external factors acknowledged?”
Why: Ensures responses consider the broader financial environment.

Clarity

Question: “Is complex financial information presented in an accessible way?”
Why: Tests whether technical concepts are explained effectively.

Precision

Question: “Is terminology used accurately and consistently?”
Why: Validates the correct use of financial terms.

Compliance

Question: “Does the response adhere to relevant financial regulations and disclosure requirements?”
Why: Ensures regulatory alignment.

Actionable

Question: “Does the response provide clear next steps or implications?”
Why: Tests practical usefulness.

Risks

Question: “Are potential risks clearly identified and explained?”
Why: Verifies appropriate risk disclosure.

unit_tests = [
    "Are relevant market conditions or external factors acknowledged?",
    "Is complex financial information presented in an accessible way?",
    "Is terminology used accurately and consistently?",
    "Does the response adhere to relevant financial regulations and disclosure requirements?",
    "Does the response provide clear next steps or implications?",
    "Are potential risks clearly identified and explained?"
]

4. Evaluate Unit Tests Using LMUnit

LMUnit is specifically trained to evaluate natural-language unit tests and provides several advantages:
  • Scores on a continuous 1–5 scale
  • Consistent evaluation across different criteria
  • Stronger performance than general-purpose LLMs like GPT-4
  • Support for custom scoring rubrics
  • Threshold-based binary scoring, e.g., score > 2.5 → 1, else 0
Below is a simple example demonstrating how LMUnit evaluates a response against a single unit test.
response = client.lmunit.create(
    query="What material is used in N95 masks?",
    response=(
        "N95 masks are made primarily of polypropylene. This synthetic material is created "
        "through a melt-blowing process that creates multiple layers of microfibers. "
        "The material was chosen because it can be electrostatically charged to attract "
        "particles. Particles are the constituents of the universe"
    ),
    unit_test="Does the response avoid unnecessary information?"
)

print(response)
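If you need the threshold-based binary scoring mentioned above, you can apply a cutoff to the continuous score yourself; the 2.5 threshold here is just an example:
# Convert the continuous 1-5 score into a pass/fail signal (example cutoff)
binary_score = 1 if response.score > 2.5 else 0
print(f"Continuous score: {response.score}, binary score: {binary_score}")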
Here is a more complex example with a custom scoring rubric:
response = client.lmunit.create(
    query="How effectively can the company's current AI system handle customer service inquiries?",
    response=(
        "Our AI system currently handles 70% of customer inquiries without human intervention. "
        "It excels at processing returns and tracking orders, but struggles with complex billing disputes. "
        "Response times average 30 seconds, though this increases to 2 minutes during peak hours. "
        "The system successfully resolves basic inquiries but often fails to understand context-dependent questions "
        "or multiple requests within the same conversation."
    ),
    unit_test="""
Does the response provide specific, measurable performance metrics?
Scoring Scale:
Score 1: No specific metrics provided; vague or general statements only
Score 2: Limited metrics provided; either strengths or limitations discussed, but not both
Score 3: Basic metrics provided with surface-level analysis of strengths and limitations
Score 4: Clear metrics provided with detailed analysis of both strengths and limitations
Score 5: Comprehensive metrics with in-depth analysis of strengths, limitations, and contextual factors
"""
)
print(response)
For this use case, you will apply each global unit test to the query/response pairs in the evaluation data. Here is a helper function that iterates over the dataset with progress tracking and error handling:
from typing import List, Dict
from tqdm import tqdm

def run_unit_tests_with_progress(
    df: pd.DataFrame,
    unit_tests: List[str]
) -> List[Dict]:
    """
    Run unit tests with progress tracking and error handling.

    Args:
        df: DataFrame with prompt-response pairs
        unit_tests: List of unit test strings

    Returns:
        List of test results, one entry per row of df
    """
    results = []

    # Iterate over each response with a progress bar
    for idx in tqdm(range(0, len(df)), desc="Processing responses"):
        row = df.iloc[idx]
        row_results = []

        for test in unit_tests:
            try:
                result = client.lmunit.create(
                    query=row['prompt'],
                    response=row['response'],
                    unit_test=test
                )
                row_results.append({
                    'test': test,
                    'score': result.score,
                    'metadata': result.metadata if hasattr(result, 'metadata') else None
                })
            except Exception as e:
                print(f"Error with prompt {idx}, test '{test}': {e}")
                row_results.append({
                    'test': test,
                    'score': None,
                    'error': str(e)
                })

        results.append({
            'prompt': row['prompt'],
            'response': row['response'],
            'test_results': row_results
        })

    return results
results = run_unit_tests_with_progress(df, unit_tests)
Now examine the results; each unit test is scored on a continuous scale of 1–5.
for result in results[:2]:  # Slice to get the first two entries
    print(f"\nPrompt: {result['prompt']}")
    print(f"Response: {result['response']}")
    print("Test Results:")
    for test_result in result['test_results']:
        print(f"- {test_result['test']}: {test_result['score']}")
Save the results to a CSV file:
rows = [(r['prompt'], r['response'], t['test'], t['score']) for r in results for t in r['test_results']]
pd.DataFrame(rows, columns=['prompt', 'response', 'test', 'score']).to_csv("unit_test_results.csv", index=False)
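As a quick, optional summary, you can pivot the saved results into one row per response and average each unit test across the dataset. This sketch reads the CSV written above:
results_df = pd.read_csv("unit_test_results.csv")

# One row per response, one column per unit test
score_matrix = results_df.pivot_table(index="prompt", columns="test", values="score")

# Average score for each unit test across all responses
print(score_matrix.mean().round(2))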

5. Visualize Individual Results

Visualizations are helpful for understanding unit test results. Start with radar (polar) plots of individual responses, showing performance across all six dimensions:
def map_test_to_category(test_question: str) -> str:
    """Map the full test question to its category."""
    category_mapping = {
        'Are relevant market conditions or external factors': 'CONTEXT',
        'Is complex financial information presented': 'CLARITY',
        'Is terminology used accurately': 'PRECISION',
        'Does the response adhere to relevant financial regulations': 'COMPLIANCE',
        'Does the response provide clear next steps': 'ACTIONABLE',
        'Are potential risks clearly identified': 'RISK'
    }

    for key, value in category_mapping.items():
        if key.lower() in test_question.lower():
            return value
    return None

def create_unit_test_plots(results: List[Dict],
                          test_indices: Optional[Union[int, List[int]]] = None,
                          figsize: tuple = (10, 10)):
    """
    Create polar plot(s) for unit test results. Can plot either a single test,
    specific multiple tests, or all tests in a row.

    Args:
        results: List of dictionaries containing test results
        test_indices: Optional; Either:
            - None (plots all results)
            - int (plots single result)
            - List[int] (plots multiple specific results)
        figsize: Tuple specifying the figure size (width, height)
    """
    # Handle different input cases for test_indices
    if test_indices is None:
        indices_to_plot = list(range(len(results)))
    elif isinstance(test_indices, int):
        if test_indices >= len(results):
            raise IndexError(f"test_index {test_indices} is out of range. Only {len(results)} results available.")
        indices_to_plot = [test_indices]
    else:
        if not test_indices:
            raise ValueError("test_indices list cannot be empty")
        if max(test_indices) >= len(results):
            raise IndexError(f"test_index {max(test_indices)} is out of range. Only {len(results)} results available.")
        indices_to_plot = test_indices

    # Categories in desired order
    categories = ['CONTEXT', 'CLARITY', 'PRECISION',
                 'COMPLIANCE', 'ACTIONABLE', 'RISK']

    # Set up the angles for the polar plot
    angles = np.linspace(0, 2*np.pi, len(categories), endpoint=False)
    angles = np.concatenate((angles, [angles[0]]))  # Close the plot

    # Calculate figure size based on number of plots
    num_plots = len(indices_to_plot)
    fig_width = figsize[0] * num_plots
    fig = plt.figure(figsize=(fig_width, figsize[1]))

    # Create a subplot for each result
    for plot_idx, result_idx in enumerate(indices_to_plot):
        result = results[result_idx]

        # Create subplot
        ax = plt.subplot(1, num_plots, plot_idx + 1, projection='polar')

        # Get scores for this result
        scores = []
        for category in categories:
            score = None
            for test_result in result['test_results']:
                mapped_category = map_test_to_category(test_result['test'])
                if mapped_category == category:
                    score = test_result['score']
                    break
            scores.append(score if score is not None else 0)

        # Close the scores array
        scores = np.concatenate((scores, [scores[0]]))

        # Plot the scores
        ax.plot(angles, scores, 'o-', linewidth=2)
        ax.fill(angles, scores, alpha=0.25)

        # Set the labels
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(categories)

        # Set the scale
        ax.set_ylim(0, 5)

        # Add grid
        ax.grid(True)

        # Add score values as annotations
        for angle, score, category in zip(angles[:-1], scores[:-1], categories):
            ax.text(angle, score + 0.2, f'{score:.2f}',
                    ha='center', va='bottom')

        # Add title for each subplot
        prompt = result['prompt']
        ax.set_title(f"Test {result_idx}\n{prompt}", pad=20)

    plt.tight_layout()
    return fig
Radar plots are a great way to visualize the different dimensions that the unit tests provide. Try changing the index to view other plots:
# Plot the first test result
fig = create_unit_test_plots(results, test_indices=0)
To compare multiple plots, pass a list of indices:
fig = create_unit_test_plots(results, test_indices=[0, 1, 2])
fig = create_unit_test_plots(results, test_indices=[3, 4, 5])
fig = create_unit_test_plots(results, test_indices=[6, 7, 8])

6. Visualize Group Results

For analyzing larger sets of results, clustering methods are useful. Let’s walk through using clustering to analyze a dataset of 40 unit test results.
df = pd.read_csv(
    'data/synthetic_financial_responses.csv' if os.path.exists('data/synthetic_financial_responses.csv')
    else "https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/03-lmunit/data/synthetic_financial_responses.csv"
)
df.head()
Start by using k-means to cluster the responses into four groups. For your own analysis you may need fewer or more clusters; the silhouette sweep shown after the clustering results can help you decide.
def cluster_responses(df: pd.DataFrame, n_clusters: int = 4) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Perform clustering on response evaluation data.

    Args:
        df: DataFrame containing evaluation scores
        n_clusters: Number of clusters to identify

    Returns:
        tuple: (DataFrame with cluster assignments, DataFrame of cluster centers)
    """
    categories = ['CONTEXT', 'CLARITY', 'PRECISION',
                 'COMPLIANCE', 'ACTIONABLE', 'RISK']
    # Prepare and perform clustering
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    df_clustered = df.copy()
    df_clustered['cluster'] = kmeans.fit_predict(X_scaled)

    # Calculate cluster centers
    cluster_centers = pd.DataFrame(
        scaler.inverse_transform(kmeans.cluster_centers_),
        columns=categories
    )
    return df_clustered, cluster_centers
Look at how each of our samples is now clustered:
df_clustered, centers = cluster_responses(df)
df_clustered.head()
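If you are unsure how many clusters fit your data, one option is to sweep n_clusters and compare silhouette scores. This is a sketch that assumes df still contains only the six score columns:
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(df)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette score = {silhouette_score(X_scaled, labels):.3f}")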
You can visualize these clusters, both in terms of how the cluster centers vary and how the individual points are distributed.

def visualize_clusters(df: pd.DataFrame, cluster_centers: pd.DataFrame):
    """
    Create visualizations for cluster analysis.

    Args:
        df: DataFrame with cluster assignments
        cluster_centers: DataFrame of cluster centers
    """
    # 1. Heatmap of cluster centers
    plt.figure(figsize=(12, 8))
    sns.heatmap(cluster_centers, annot=True, cmap='RdYlGn', fmt='.2f')
    plt.title('Response Pattern Cluster Centers')
    plt.ylabel('Cluster')
    plt.tight_layout()
    plt.show()

    # 2. Scatter plot of key dimensions
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(df['CONTEXT'], df['ACTIONABLE'],
                         c=df['cluster'], cmap='viridis')
    plt.xlabel('CONTEXT Score')
    plt.ylabel('ACTIONABLE Score')
    plt.title('Cluster Distribution (Context vs Actionable)')
    plt.colorbar(scatter, label='Cluster')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

visualize_clusters(df_clustered, centers)
The following code helps analyze the clusters by the categories used for unit tests:

def explain_clusters(df: pd.DataFrame, cluster_centers: pd.DataFrame):
    """
    Provide detailed explanation of cluster characteristics.

    Args:
        df: DataFrame with cluster assignments
        cluster_centers: DataFrame of cluster centers
    """
    expected_categories = ['CONTEXT', 'CLARITY', 'PRECISION',
                         'COMPLIANCE', 'ACTIONABLE', 'RISK']

    print("\nCluster Analysis:")
    print("-----------------")

    # Print cluster centers
    print("\nCluster Centers:")
    print(cluster_centers.round(2))

    # Print cluster sizes
    print("\nCluster Sizes:")
    print(df['cluster'].value_counts().sort_index())

    # Analyze each cluster
    print("\nCluster Characteristics:")
    for i in range(len(cluster_centers)):
        cluster_df = df[df['cluster'] == i]
        print(f"\nCluster {i}:")

        # Calculate average scores
        avg_scores = cluster_df[expected_categories].mean()
        sorted_scores = avg_scores.sort_values(ascending=False)

        # Get top and bottom categories
        top_cats = list(sorted_scores.head(2).items())
        bottom_cats = list(sorted_scores.tail(2).items())

        # Print characteristics
        print(f"Size: {len(cluster_df)} responses")
        print(f"Strongest areas: {top_cats[0][0]} ({top_cats[0][1]:.2f}), "
              f"{top_cats[1][0]} ({top_cats[1][1]:.2f})")
        print(f"Weakest areas: {bottom_cats[0][0]} ({bottom_cats[0][1]:.2f}), "
              f"{bottom_cats[1][0]} ({bottom_cats[1][1]:.2f})")


explain_clusters(df_clustered, centers)

Interpret the clusters

Now that you have a better understanding of the response clusters, you can identify the patterns and characteristics of each cluster. After looking at the clusters, these patterns should emerge:
  • Cluster 0: Compliance Blind Spot. High CLARITY/PRECISION, low COMPLIANCE/RISK: clear communication but missing regulatory elements.
  • Cluster 1: Clarity Gap. High CONTEXT/RISK, low CLARITY/PRECISION: strong context awareness but poor explanation clarity.
  • Cluster 2: Theory-Practice Gap. High PRECISION/CLARITY, low ACTIONABLE: strong theoretical understanding but impractical.
  • Cluster 3: Surface Analysis. Medium CLARITY but low CONTEXT/RISK: basic understanding without depth.
Your clusters may be less straightforward to interpret, so use the patterns that do emerge to guide error analysis and refinement.

Best Practices for Using LMUnit

Unit Test Design:

  • Keep tests focused and specific
  • Avoid compound criteria (see the example after this list)
  • Use clear, unambiguous language
  • Assess a desirable quality, such as “Is the response coherent?” rather than “Is the response incoherent?”
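For example, a compound criterion is clearer when split into focused tests; the focused versions below reuse tests from this tutorial, while the compound string is an illustrative counterexample:
# Compound: mixes two qualities, so a single score is ambiguous
compound_test = "Is terminology accurate and is the information easy to understand?"

# Focused: one quality per test
focused_tests = [
    "Is terminology used accurately and consistently?",
    "Is complex financial information presented in an accessible way?",
]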

Evaluation Strategy:

  • Start with global tests
  • Add query-level tests as needed
  • Monitor patterns across responses

Score Interpretation:

  • 5: Excellent - Fully satisfies criteria
  • 4: Good - Minor issues
  • 3: Acceptable - Some issues
  • 2: Poor - Significant issues
  • 1: Unacceptable - Fails criteria
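If you want to attach these labels to numeric scores in a report, a minimal mapping might look like the sketch below; the rounding to the nearest band is an illustrative choice, not part of LMUnit itself:
def score_label(score: float) -> str:
    """Map a 1-5 LMUnit score to a qualitative label (illustrative cutoffs)."""
    labels = {5: "Excellent", 4: "Good", 3: "Acceptable", 2: "Poor", 1: "Unacceptable"}
    return labels[min(5, max(1, round(score)))]

print(score_label(4.2))  # Good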
You can provide a custom rubric for scoring.

Next Steps

  • Customize unit tests for your use case
  • Integrate with your evaluation pipeline (see the sketch below)
  • Monitor and adjust based on results
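As a starting point for pipeline integration, you could gate a CI run on the averaged results saved earlier. The file name and the 3.5 threshold are assumptions carried over from this walkthrough, not requirements:
import pandas as pd

results_df = pd.read_csv("unit_test_results.csv")
per_test_means = results_df.groupby("test")["score"].mean()

# Fail the run if any unit test's average score falls below the threshold
THRESHOLD = 3.5
failing = per_test_means[per_test_means < THRESHOLD]
assert failing.empty, f"Unit tests below threshold:\n{failing}"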

Additional Resources