Using HuggingFace Datasets in evaluations with preprocess_model_input
Note: This is a temporary workaround
This guide demonstrates a workaround for using HuggingFace Datasets with Weave evaluations.
We are actively working on more seamless integrations that will simplify this process.
While this approach works, expect updates in the near future that will make working with external datasets more straightforward.
Setup and imports
First, we initialize Weave and connect to Weights & Biases for tracking experiments.
!pip install datasets wandb weave
# Initialize variables
HUGGINGFACE_DATASET = "wandb/ragbench-test-sample"
WANDB_KEY = ""
WEAVE_TEAM = ""
WEAVE_PROJECT = ""
# Init weave and required libraries
import asyncio
import nest_asyncio
import wandb
from datasets import load_dataset
import weave
from weave import Evaluation
# Login to wandb and initialize weave
wandb.login(key=WANDB_KEY)
client = weave.init(f"{WEAVE_TEAM}/{WEAVE_PROJECT}")
# Apply nest_asyncio to allow nested event loops (needed for some notebook environments)
nest_asyncio.apply()
Load and prepare HuggingFace dataset
- We load a HuggingFace dataset.
- Create an index mapping to reference the dataset rows.
- This index approach allows us to maintain references to the original dataset.
Note:
In the index, we encode the hf_hub_name along with the hf_id to ensure each row has a unique identifier.
This unique digest value is used for tracking and referencing specific dataset entries during evaluations.
# Load the HuggingFace dataset
ds = load_dataset(HUGGINGFACE_DATASET)
row_count = ds["train"].num_rows
# Create an index mapping for the dataset
# This creates a list of dictionaries with the HF dataset index and hub name
# Example: [{"hf_id": 0, "hf_hub_name": HUGGINGFACE_DATASET}, {"hf_id": 1, ...}, ...]
hf_index = [{"hf_id": i, "hf_hub_name": HUGGINGFACE_DATASET} for i in range(row_count)]
Define processing and evaluation functions
Processing pipeline
- preprocess_example: Transforms the index reference into the actual data needed for evaluation.
- hf_eval: Defines how to score the model outputs.
- function_to_evaluate: The actual function/model being evaluated.
@weave.op()
def preprocess_example(example):
"""
Preprocesses each example before evaluation.
Args:
        example: Dict containing hf_id and hf_hub_name from the index mapping
    Returns:
        Dict containing the prompt and reference answer from the HF dataset row
"""
hf_row = ds["train"][example["hf_id"]]
return {"prompt": hf_row["question"], "answer": hf_row["response"]}
@weave.op()
def hf_eval(hf_id: int, output: dict) -> dict:
"""
Scoring function for evaluating model outputs.
Args:
hf_id: Index in the HF dataset
output: The output from the model to evaluate
Returns:
Dict containing evaluation scores
"""
hf_row = ds["train"][hf_id]
return {"scorer_value": True}
@weave.op()
def function_to_evaluate(prompt: str):
"""
The function that will be evaluated (e.g., your model or pipeline).
Args:
prompt: Input prompt from the dataset
Returns:
Dict containing model output
"""
return {"generated_text": "testing "}
Create and run evaluation
- For each index in hf_index, preprocess_example gets the corresponding data from the HF dataset.
- The preprocessed data is passed to function_to_evaluate.
- The output is scored using hf_eval.
- Results are tracked in Weave.
# Create evaluation object
evaluation = Evaluation(
dataset=hf_index, # Use our index mapping
scorers=[hf_eval], # List of scoring functions
preprocess_model_input=preprocess_example, # Function to prepare inputs
)
# Run evaluation asynchronously
async def main():
await evaluation.evaluate(function_to_evaluate)
asyncio.run(main())
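evaluation.evaluate also returns a summary of the aggregated scorer results, so if you want the numbers in the notebook as well as in the Weave UI, a minimal variant is:
# Capture the summary returned by Evaluation.evaluate
async def main_with_summary():
    return await evaluation.evaluate(function_to_evaluate)

summary = asyncio.run(main_with_summary())
print(summary)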