RAG Evaluation Application

About Evaluating RAGs

A RAG pipeline has two components: a retriever and a generator. To quantify the performance of a RAG pipeline, you have to evaluate these components separately as well as together.

This RAG evaluation application measures RAG performance using RAGAS metrics and a Likert score. The RAGAS metrics are faithfulness, context relevancy, answer similarity, answer relevancy, context precision, and context recall. The Likert score is a value from 1 to 5 based on the helpfulness, relevancy, accuracy, and level of detail of the generated answer.

Comparing the metrics for different RAG pipelines can provide insights and help you choose better parameters for the pipeline. You can evaluate the pipelines on a standard or a synthetically generated question-and-answer dataset.

Prerequisites

Clone the Generative AI examples Git repository using Git LFS:

$ sudo apt -y install git-lfs
$ git clone git@github.com:NVIDIA/GenerativeAIExamples.git
$ cd GenerativeAIExamples/
$ git lfs pull

A host with an NVIDIA A100, H100, or L40S GPU.

Verify NVIDIA GPU driver version 535 or later is installed and that the GPU is in compute mode:

$ nvidia-smi -q -d compute

Example Output

==============NVSMI LOG==============

Timestamp                                 : Sun Nov 26 21:17:25 2023
Driver Version                            : 535.129.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:CA:00.0
    Compute Mode                          : Default

If the driver is not installed or below version 535, refer to the NVIDIA Driver Installation Quickstart Guide.
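
The driver check above can also be scripted. The following is a minimal sketch, assuming you have captured the text printed by `nvidia-smi -q -d compute` in a string; the function names and the `MIN_DRIVER_MAJOR` constant are illustrative and not part of the repository.

```python
import re

# Minimum driver branch stated in the prerequisites (assumed constant name).
MIN_DRIVER_MAJOR = 535


def driver_major_version(nvsmi_log: str) -> int:
    """Return the major driver version found in `nvidia-smi -q` output."""
    match = re.search(r"^Driver Version\s*:\s*(\d+)", nvsmi_log, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Driver Version' line found in the log")
    return int(match.group(1))


def driver_is_supported(nvsmi_log: str) -> bool:
    """True when the reported driver is version 535 or later."""
    return driver_major_version(nvsmi_log) >= MIN_DRIVER_MAJOR


# Excerpt of the example output shown above.
sample = """\
Timestamp                                 : Sun Nov 26 21:17:25 2023
Driver Version                            : 535.129.03
CUDA Version                              : 12.2
"""
print(driver_is_supported(sample))
```

In practice the log text could come from running the command with `subprocess.run(..., capture_output=True, text=True)`.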

Install Docker Engine and Docker Compose. Refer to the instructions for Ubuntu.

Install the NVIDIA Container Toolkit.

Refer to the installation documentation.
When you configure the runtime, set the NVIDIA runtime as the default:
```
$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
```
If you did not set the runtime as the default, you can reconfigure the runtime by running the preceding command.

Verify the NVIDIA container toolkit is installed and configured as the default container runtime:

$ cat /etc/docker/daemon.json

Example Output

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
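
If you prefer to check the file programmatically, the sketch below parses the contents of `daemon.json` and confirms that `nvidia` is both registered as a runtime and set as the default. The function name is hypothetical, and the check covers only the two keys shown in the example output.

```python
import json


def nvidia_is_default_runtime(daemon_json_text: str) -> bool:
    """Check that the Docker daemon config registers and defaults to the NVIDIA runtime."""
    config = json.loads(daemon_json_text)
    has_runtime = "nvidia" in config.get("runtimes", {})
    return has_runtime and config.get("default-runtime") == "nvidia"


# Same structure as the example output above.
sample = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"args": [], "path": "nvidia-container-runtime"}
    }
}
"""
print(nvidia_is_default_runtime(sample))
```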

Run the nvidia-smi command in a container to verify the configuration:

$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -L

Example Output

GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-d8ce95c1-12f7-3174-6395-e573163a2ace)

Generating Data with the Synthetic Data Generator

To generate a synthetic Q&A pair dataset from custom documents, perform the following steps:

In the Generative AI Examples repository, edit the deploy/compose/eval-app-compose.env file and specify the input and output paths:
- Update DATASET_DIRECTORY with the path to a directory with the documents to ingest.
  
  Copy PDF files to analyze into the specified directory. You can use the notebooks/dataset.zip file in the repository for sample PDF files.
- Update RESULT_DIRECTORY with the path for the output Q&A pair dataset.

Set your NVIDIA API key in an environment variable:
```
$ export NVIDIA_API_KEY='nvapi-*'
```
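
Before building the containers, it can help to confirm that the three settings above are in place. The following is a rough sketch that assumes the env file uses plain `KEY=VALUE` lines; the helper names and the sample paths are hypothetical.

```python
# Settings the steps above ask you to fill in.
REQUIRED_KEYS = ("DATASET_DIRECTORY", "RESULT_DIRECTORY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_settings(env_text: str, environ: dict[str, str]) -> list[str]:
    """Return the names of settings that are absent or empty."""
    env = parse_env(env_text)
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if not environ.get("NVIDIA_API_KEY"):
        missing.append("NVIDIA_API_KEY")
    return missing


# Hypothetical file contents, for illustration only.
sample_env = """
# paths are placeholders
DATASET_DIRECTORY=/tmp/docs
RESULT_DIRECTORY=
"""
print(missing_settings(sample_env, {"NVIDIA_API_KEY": "nvapi-example"}))
```

In real use, you would pass the contents of deploy/compose/eval-app-compose.env and `os.environ` instead of the sample values.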

From the root of the repository, build and run the synthetic data generator:

$ docker compose \
    --env-file deploy/compose/eval-app-compose.env \
    -f deploy/compose/docker-compose-evaluation-application.yaml \
    build synthetic_data_generator

$ docker compose \
    --env-file deploy/compose/eval-app-compose.env \
    -f deploy/compose/docker-compose-evaluation-application.yaml \
    up synthetic_data_generator

Example Output

[+] Running 1/0
 ✔ Container data-generator  Created
Attaching to data-generator
data-generator  | INFO:data_generator:1/1
data-generator  | INFO:pikepdf._core:pikepdf C++ to Python logger bridge initialized
data-generator  | INFO:matplotlib.font_manager:generated new fontManager
data-generator  | [nltk_data] Downloading package punkt to /root/nltk_data...
data-generator  | [nltk_data]   Unzipping tokenizers/punkt.zip.
data-generator  | [nltk_data] Downloading package averaged_perceptron_tagger to
data-generator  | [nltk_data]     /root/nltk_data...
data-generator  | [nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
data-generator  | INFO:__main__:\DATA GENERATED
data-generator  |
data-generator exited with code 0

Generating Answers and Evaluating a RAG Pipeline

Start an instance of the Chain Server.

You can run an example, such as Using the NVIDIA API Catalog, to start a Chain Server.

From the root of the repository, build and run the RAG evaluator:

$ docker compose \
    --env-file deploy/compose/eval-app-compose.env \
    -f deploy/compose/docker-compose-evaluation-application.yaml \
    build rag_evaluator

$ docker compose \
    --env-file deploy/compose/eval-app-compose.env \
    -f deploy/compose/docker-compose-evaluation-application.yaml \
    run rag_evaluator

Example Output

INFO:llm_answer_generator:1/1
INFO:llm_answer_generator:1/6
INFO:llm_answer_generator:data: {"id":"e7262f2b-0753-4b6c-813d-a38cd4a5954c","choices":[{"index":0,"message":{"role":"assistant","content":""},"finish_reason":""}]}
...
Evaluating:  94%|███████████████████████████████████████████████████████████████████    | 34/36 [00:18<00:00,  2.10it/s]
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
Evaluating: 100%|███████████████████████████████████████████████████████████████████████| 36/36 [00:22<00:00,  1.62it/s]
INFO:evaluator:Results written to /result_dir/result.json and /result_dir/result.parquet
INFO:__main__:
RAG EVALUATED WITH RAGAS METRICS

Results and Conclusion

Running the evaluation application on the qna.json dataset creates two new files in the RESULT_DIRECTORY path:

A JSON file, result.json, with aggregated performance metrics, like the following example:

{
  "answer_similarity": 0.7944183243305074,
  "faithfulness": 0.25,
  "context_precision": 0.249999999975,
  "context_relevancy": 0.4837612078324153,
  "answer_relevancy": 0.6902010104258721,
  "context_recall": 0.5,
  "ragas_score": 0.4203451750317139
}
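
To compare two pipelines, you can diff their aggregated metrics. The sketch below is one possible approach, not part of the evaluator: `metric_deltas` and `weakest_metric` are hypothetical helpers, the baseline is the example above, and the candidate values are made up purely to show the call.

```python
# ragas_score is an aggregate, so it is skipped when looking for the weakest metric.
AGGREGATE_KEYS = {"ragas_score"}


def metric_deltas(baseline: dict, candidate: dict) -> dict:
    """Return candidate minus baseline for every metric present in both results."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


def weakest_metric(result: dict) -> str:
    """Return the name of the lowest individual metric."""
    individual = {k: v for k, v in result.items() if k not in AGGREGATE_KEYS}
    return min(individual, key=individual.get)


# Aggregated metrics from the example result.json above.
baseline = {
    "answer_similarity": 0.7944183243305074,
    "faithfulness": 0.25,
    "context_precision": 0.249999999975,
    "context_relevancy": 0.4837612078324153,
    "answer_relevancy": 0.6902010104258721,
    "context_recall": 0.5,
    "ragas_score": 0.4203451750317139,
}
# Hypothetical second run: invented numbers, not real results.
candidate = {"faithfulness": 0.5, "answer_relevancy": 0.6}

print(weakest_metric(baseline))
print(metric_deltas(baseline, candidate))
```

A positive delta means the candidate pipeline scored higher on that metric than the baseline.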

A Parquet file, result.parquet, with performance metrics for each Q&A pair, like the following example:

{
  "question": "What is the contact email for Jordan Dodge who works in the SHIELD and GeForce NOW division at NVIDIA Corporation?",
  "answer": " jdodge@nvidia.com",
  "contexts": [
  "products and technologies or enhancements to our existing product and technologies ; market acceptance of our products or our partners ’ products ; design, manufacturing or software defects ; changes in consumer preferences or demands ; changes in industry standards and interfaces ; unexpected loss of performance of our products or technologies when integrated into systems ; as well as other factors detailed from time to time in the most recent reports nvidia files with the securities and exchange commission, or sec, including, but not limited to, its annual report on form 10 - k and quarterly reports on form 10 - q. copies of reports filed with the sec are posted on the company ’ s website and are available from nvidia without charge. these forward - looking statements are not guarantees of future performance and speak only as of the date hereof, and, except as required by law, nvidia disclaims any obligation to update these forward - looking statements to reflect future events or circumstances. © 2023 nvidia corporation. all rights reserved. nvidia, the nvidia logo, bluefield and connectx are trademarks and / or registered trademarks of nvidia corporation in the u. s. and other countries. all other trademarks and copyrights are the property of their respective owners. features, pricing, availability and specifications are subject to change without notice. alexa korkos director, product pr ampere computing + 1 - 925 - 286 - 5270 akorkos @ amperecomputing. com jordan dodge shield, geforce now nvidia corp. + 1 - 408 - 506 - 6849 jdodge @ nvidia. com"
  ],
  "ground_truth": "jdodge@nvidia.com",
  "answer_similarity": 1,
  "faithfulness": 0,
  "context_precision": 0.9999999999,
  "context_relevancy": 0.35714285714285715,
  "answer_relevancy": 0.7686588526523409,
  "context_recall": 1,
  "ragas_score": 0
}
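
The per-pair records are useful for finding questions where the pipeline struggles. The following sketch flags records whose metrics fall below a threshold. The helper names and the 0.5 threshold are arbitrary choices for illustration. It works on plain dictionaries; assuming pandas and a Parquet engine are installed, `pandas.read_parquet(path).to_dict("records")` is one way to obtain them from result.parquet.

```python
# Metric columns that appear in the example record above.
METRIC_KEYS = (
    "answer_similarity", "faithfulness", "context_precision",
    "context_relevancy", "answer_relevancy", "context_recall", "ragas_score",
)


def low_metrics(record: dict, threshold: float = 0.5) -> list[str]:
    """Return the sorted names of metrics in the record that are below the threshold."""
    return sorted(k for k in METRIC_KEYS if k in record and record[k] < threshold)


def flag_records(records: list[dict], threshold: float = 0.5) -> list[tuple[str, list[str]]]:
    """Pair each question with its low-scoring metrics, skipping records with none."""
    flagged = []
    for record in records:
        low = low_metrics(record, threshold)
        if low:
            flagged.append((record["question"], low))
    return flagged


# Metric values copied from the example record above (contexts omitted for brevity).
example = {
    "question": "What is the contact email for Jordan Dodge who works in the "
                "SHIELD and GeForce NOW division at NVIDIA Corporation?",
    "answer_similarity": 1,
    "faithfulness": 0,
    "context_precision": 0.9999999999,
    "context_relevancy": 0.35714285714285715,
    "answer_relevancy": 0.7686588526523409,
    "context_recall": 1,
    "ragas_score": 0,
}
print(low_metrics(example))
```

For the example record, the answer matches the ground truth, yet faithfulness is 0, which is the kind of discrepancy this filter is meant to surface.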
