Scoring Overview
In Weave, Scorers are used to evaluate AI outputs and return evaluation metrics. They take the AI's output, analyze it, and return a dictionary of results. Scorers can use your input data as a reference if needed, and can also return extra information such as explanations or reasoning from the evaluation.
- Python
- TypeScript
Scorers are passed to a weave.Evaluation object during evaluation. There are two types of Scorers in Weave:
- Function-based Scorers: Simple Python functions decorated with @weave.op.
- Class-based Scorers: Python classes that inherit from weave.Scorer for more complex evaluations.
Scorers must return a dictionary and can return multiple metrics, nested metrics, and non-numeric values such as text returned from an LLM evaluator about its reasoning.
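For example, a single scorer can mix boolean, nested, and free-text results in one dictionary. The sketch below is illustrative only: the answer column and the hard-coded reasoning string are stand-ins, not part of the Weave API.

import weave

@weave.op
def evaluate_answer(answer: str, output: str) -> dict:
    # "answer" comes from the dataset row; "output" is the model's output.
    reasoning = "The output restates the expected answer verbatim."  # stand-in for LLM-generated reasoning
    return {
        "exact_match": output == answer,   # boolean metric
        "lengths": {                       # nested metrics
            "answer_chars": len(answer),
            "output_chars": len(output),
        },
        "reasoning": reasoning,            # non-numeric value
    }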
Scorers are special ops passed to a weave.Evaluation object during evaluation.
Create your own Scorers
While this guide shows you how to create custom scorers, Weave also comes with a variety of predefined scorers and local SLM scorers that you can use right away.
Function-based Scorers
- Python
- TypeScript
These are functions decorated with @weave.op that return a dictionary. They're great for simple evaluations like:
import weave

@weave.op
def evaluate_uppercase(text: str) -> dict:
    return {"text_is_uppercase": text.isupper()}

my_eval = weave.Evaluation(
    dataset=[{"text": "HELLO WORLD"}],
    scorers=[evaluate_uppercase],
)
When the evaluation is run, evaluate_uppercase checks if the text is all uppercase.
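To actually run it, pass a model (any weave.op or weave.Model) to evaluate. The snippet below is a minimal sketch reusing my_eval from the example above; the project name and the trivial echo_model are placeholders, not part of Weave.

import asyncio
import weave

weave.init("scorer-demo")  # placeholder project name

@weave.op
def echo_model(text: str) -> str:
    # Trivial stand-in for a real model; receives the dataset row's "text" column.
    return text

asyncio.run(my_eval.evaluate(echo_model))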
These are functions wrapped with weave.op that accept an object with modelOutput and optionally datasetRow. They're great for simple evaluations like:
import * as weave from 'weave'

const evaluateUppercase = weave.op(
    ({modelOutput}) => modelOutput.toUpperCase() === modelOutput,
    {name: 'textIsUppercase'}
);

const myEval = new weave.Evaluation({
    dataset: [{text: 'HELLO WORLD'}],
    scorers: [evaluateUppercase],
})
Class-based Scorers
- Python
- TypeScript
For more advanced evaluations, especially when you need to keep track of additional scorer metadata, try different prompts for your LLM evaluators, or make multiple function calls, you can use the Scorer class.

Requirements:
- Inherit from weave.Scorer.
- Define a score method decorated with @weave.op.
- The score method must return a dictionary.
Example:
import weave
from openai import OpenAI
from weave import Scorer
llm_client = OpenAI()
class SummarizationScorer(Scorer):
    model_id: str = "gpt-4o"
    system_prompt: str = "Evaluate whether the summary is good."

    @weave.op
    def some_complicated_preprocessing(self, text: str) -> str:
        processed_text = "Original text: \n" + text + "\n"
        return processed_text

    @weave.op
    def call_llm(self, summary: str, processed_text: str) -> dict:
        res = llm_client.chat.completions.create(
            model=self.model_id,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": (
                    "Analyse how good the summary is compared to the original text.\n"
                    f"Summary: {summary}\n{processed_text}"
                )},
            ],
        )
        # Return the evaluator's text response rather than the raw API object.
        return {"summary_quality": res.choices[0].message.content}

    @weave.op
    def score(self, output: str, text: str) -> dict:
        """Score the summary quality.

        Args:
            output: The summary generated by an AI system
            text: The original text being summarized
        """
        processed_text = self.some_complicated_preprocessing(text)
        return self.call_llm(summary=output, processed_text=processed_text)

summarization_scorer = SummarizationScorer()
evaluation = weave.Evaluation(
    dataset=[{"text": "The quick brown fox jumps over the lazy dog."}],
    scorers=[summarization_scorer],
)
This class evaluates how good a summary is by comparing it to the original text.
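To run this evaluation, you would also need a model that produces the summaries being scored. A minimal sketch, continuing the example above and assuming a placeholder project name and a trivial simple_summarizer op (not part of Weave):

import asyncio

weave.init("summarization-demo")  # placeholder project name

@weave.op
def simple_summarizer(text: str) -> str:
    # Naive stand-in for a real summarization model.
    return " ".join(text.split()[:5])

asyncio.run(evaluation.evaluate(simple_summarizer))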
This feature is not available in TypeScript yet. Stay tuned!