Local Weave Scorers

Weave's local scorers are a suite of small language models that run locally on your machine with minimal latency. These models evaluate the safety and quality of your AI system’s inputs, context, and outputs.

Some of these models are fine-tuned by Weights & Biases, while others are state-of-the-art open-source models trained by the community. Weights & Biases (W&B) Reports were used for training and evaluation. You can find the full details in this list of W&B Reports.

The model weights are publicly available in W&B Artifacts and are automatically downloaded when you instantiate the scorer class. The artifact paths can be found here if you'd like to download them yourself: weave.scorers.default_models

The object returned from calling these scorers contains a passed boolean attribute that indicates whether the input text is safe or high quality, as well as a metadata attribute that contains more detail, such as the raw score from the model.
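As a rough sketch of how this result shape can be consumed, the helper below reads only the passed and metadata attributes described above. The name summarize_result, the SimpleNamespace stand-in, and the score key are assumptions made for illustration and are not part of Weave; metadata is assumed to behave like a dictionary.

```python
from types import SimpleNamespace


def summarize_result(result) -> str:
    """Build a one-line summary from any object with `passed` and `metadata`."""
    status = "passed" if result.passed else "flagged"
    # `metadata` is treated as a plain mapping; the keys it holds vary by scorer.
    details = ", ".join(f"{key}={value}" for key, value in result.metadata.items())
    return f"{status} ({details})" if details else status


# Hypothetical stand-in for a scorer result; the `score` key is an assumption.
fake_result = SimpleNamespace(passed=False, metadata={"score": 0.91})
print(summarize_result(fake_result))
```

A result returned by one of the scorers below could be passed to the same helper in place of the stand-in.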


While local scorers can be run on CPUs and GPUs, use GPUs for best performance.


Before you can use Weave local scorers, install additional dependencies:

pip install weave[scorers]

Select a scorer​

The following local scorers are available. Select a scorer based on your use case.

  • WeaveToxicityScorerV1: Identify toxic or harmful content in your AI system's inputs and outputs, including hate speech or threats.
  • WeaveBiasScorerV1: Detect biased or stereotypical content in your AI system's inputs and outputs. Ideal for reducing harmful biases in generated text.
  • WeaveHallucinationScorerV1: Identify whether your RAG system generates hallucinations in its output based on the input and context provided.
  • WeaveContextRelevanceScorerV1: Measure whether the AI system's output is relevant to the input and context provided.
  • WeaveCoherenceScorerV1: Evaluate the coherence and logical structure of the AI system's output.
  • WeaveFluencyScorerV1: Measure whether the AI system's output is fluent.
  • WeaveTrustScorerV1: An aggregate scorer that leverages the toxicity, hallucination, context relevance, fluency, and coherence scorers.
  • PresidioScorer: Detect Personally Identifiable Information (PII) in your AI system's inputs and outputs using the Presidio library from Microsoft.


The WeaveBiasScorerV1 scorer assesses the input text for bias along two dimensions:

  • Race and Origin: Racism and bias against a country or region of origin, immigration status, ethnicity, etc.
  • Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, sexual harassment, etc.

WeaveBiasScorerV1 uses a fine-tuned deberta-small-long-nli model. For more details on the model, dataset, and calibration process, see the WeaveBiasScorerV1 W&B Report.

Usage notes​

  • The score method expects a string to be passed to the output parameter.
  • A higher score means that there is a stronger prediction of bias in the text.
  • The threshold parameter has a default value, which can be overridden on initialization.

Usage example​

import weave
from weave.scorers import WeaveBiasScorerV1

bias_scorer = WeaveBiasScorerV1()
result = bias_scorer.score(output="Martian men are terrible at cleaning")

print(f"The text is biased: {not result.passed}")


The WeaveToxicityScorerV1 scorer assesses the input text for toxicity along five dimensions:

  • Race and Origin: Racism and bias against a country or region of origin, immigration status, ethnicity, etc.
  • Gender and Sexuality: Sexism, misogyny, homophobia, transphobia, sexual harassment, etc.
  • Religious: Bias or stereotype against someone's religion.
  • Ability: Bias according to someone's physical, mental, or intellectual ability or disability.
  • Violence and Abuse: Overly graphic descriptions of violence, threats of violence, or incitement of violence.

The WeaveToxicityScorerV1 uses the open source Celadon model from PleIAs. For more information, see the WeaveToxicityScorerV1 W&B Report.

Usage notes​

  • The score method expects a string to be passed to the output parameter.
  • The model returns scores from 0 to 3 across 5 different categories:
    • If the sum of these scores is above total_threshold (default value 5), then the input is flagged as toxic.
    • If any single category has a score higher than category_threshold (default 2), then the input is flagged as toxic. Default values were fine-tuned to decrease false positives and improve recall.
  • For more aggressive filtering, override the category_threshold parameter or the total_threshold parameter in the scorer constructor.
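The two threshold rules above can be sketched in plain Python. This is an illustration of the described decision logic only, not the scorer's actual implementation: the function name is_flagged_as_toxic and the sample score lists are hypothetical, and the strict "greater than" comparisons are an assumption based on the wording "above" and "higher than".

```python
def is_flagged_as_toxic(category_scores, total_threshold=5, category_threshold=2):
    """Return True if the scores trip either threshold rule described above."""
    # Rule 1: the sum of all category scores is above total_threshold.
    total_exceeded = sum(category_scores) > total_threshold
    # Rule 2: any single category score is higher than category_threshold.
    category_exceeded = any(score > category_threshold for score in category_scores)
    return total_exceeded or category_exceeded


# Hypothetical per-category scores, each in the 0 to 3 range.
print(is_flagged_as_toxic([1, 1, 1, 1, 1]))  # sum is 5, no category above 2
print(is_flagged_as_toxic([3, 0, 0, 0, 0]))  # one category above 2
print(is_flagged_as_toxic([2, 2, 2, 0, 0]))  # sum is 6
```

Under these assumptions, lowering either threshold makes more inputs trip a rule, which is why smaller values give more aggressive filtering.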

Usage example​

import weave
from weave.scorers import WeaveToxicityScorerV1

toxicity_scorer = WeaveToxicityScorerV1()
result = toxicity_scorer.score(output="people from the south pole of mars are the worst")

print(f"Input is toxic: {not result.passed}")


This scorer checks if your AI system's output contains any hallucinations based on the input data.

The WeaveHallucinationScorerV1 uses the open source HHEM 2.1 model from Vectara. For more information, see the WeaveHallucinationScorerV1 W&B Report.

Usage notes​

  • The score method expects data to be passed to the query, context, and output parameters. The context can be passed as a string or list of strings.
  • A higher output score means that there is a stronger prediction of hallucination in the output given the query and context.
  • The threshold parameter has a default value, which can be overridden on initialization.

Usage example​

import weave
from weave.scorers import WeaveHallucinationScorerV1

hallucination_scorer = WeaveHallucinationScorerV1()

result = hallucination_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="While Antarctica is known for its sea life, penguins aren't liked there.",
)

print(f"Output is hallucinated: {not result.passed}")


This scorer is designed to be used when evaluating RAG systems. It scores the relevance of the context to the query.

The WeaveContextRelevanceScorerV1 scorer uses a fine-tuned deberta-small-long-nli model from tasksource. For more details, see the WeaveContextRelevanceScorerV1 W&B Report.

Usage notes​

  • The score method expects data to be passed to the query and output parameters. The context should be passed to the output parameter as a string or list of strings.
  • A higher output score means that there is a stronger prediction that the context is relevant to the query.
  • The threshold parameter is automatically set, but can also be overridden on initialization.
  • Passing verbose=True to the score method will return scores for each relevant chunk of text in the context.

Usage example​

import weave
from weave.scorers import WeaveContextRelevanceScorerV1

context_relevance_scorer = WeaveContextRelevanceScorerV1()

result = context_relevance_scorer.score(
    query="What is the capital of Antarctica?",
    output="The Antarctic has the happiest penguins.",  # the context is passed to the output parameter
)

print(f"Output is relevant: {result.passed}")


This scorer checks that the input text is coherent.

The WeaveCoherenceScorerV1 scorer uses a fine-tuned deberta-small-long-nli model from tasksource. For more information, see the WeaveCoherenceScorerV1 W&B Report.

Usage notes​

  • The score method expects text to be passed to the query and output parameters.
  • A higher output score means that there is a stronger prediction of coherence in the input text.

Usage example​

import weave
from weave.scorers import WeaveCoherenceScorerV1

coherence_scorer = WeaveCoherenceScorerV1()

result = coherence_scorer.score(
    query="What is the capital of Antarctica?",
    output="but why not monkey up day",
)

print(f"Output is coherent: {result.passed}")


This scorer checks that the input text is fluent; that is, easy to read and understand, similar to human language. The scorer assesses input along dimensions such as grammar, syntax, and overall readability.

The WeaveFluencyScorerV1 scorer uses a fine-tuned ModernBERT-base model from AnswerDotAI. For more information, see the WeaveFluencyScorerV1 W&B Report.

Usage notes​

  • The score method expects text to be passed to the output parameter.
  • A higher output score indicates higher input text fluency.

Usage example​

import weave
from weave.scorers import WeaveFluencyScorerV1

fluency_scorer = WeaveFluencyScorerV1()

result = fluency_scorer.score(
    output="The cat did stretching lazily into warmth of sunlight.",
)

print(f"Output is fluent: {result.passed}")


The WeaveTrustScorerV1 is a composite scorer for RAG systems that evaluates the trustworthiness of model outputs by grouping the outputs of other scorers into two logical categories, Critical and Advisory. Based on the composite score, WeaveTrustScorerV1 returns a trust level score. The values for the trust level score are:

  • high: No issues detected
  • medium: Only Advisory issues detected
  • low: Critical issues detected or empty input

Any input that does not pass a Critical scorer will automatically cause the WeaveTrustScorerV1 to return low, while input that doesn't pass Advisory scorers will return medium.

  • Critical:

    • WeaveToxicityScorerV1: Detects harmful, offensive, or inappropriate content
    • WeaveHallucinationScorerV1: Identifies fabricated or unsupported information
    • WeaveContextRelevanceScorerV1: Ensures output relevance to provided context
  • Advisory:

    • WeaveFluencyScorerV1: Evaluates language quality and coherence

    • WeaveCoherenceScorerV1: Checks for logical consistency and flow
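The mapping from failed scorers to a trust level can be sketched as follows. This is a simplified illustration of the rules above, not the scorer's actual implementation: the function name trust_level, the empty_input flag, and the idea of passing in a collection of failed scorer names are assumptions made for the example.

```python
# Scorer groupings taken from the Critical and Advisory lists above.
CRITICAL = {
    "WeaveToxicityScorerV1",
    "WeaveHallucinationScorerV1",
    "WeaveContextRelevanceScorerV1",
}
ADVISORY = {"WeaveFluencyScorerV1", "WeaveCoherenceScorerV1"}


def trust_level(failed_scorers, empty_input=False):
    """Map the names of scorers that did not pass to 'high', 'medium', or 'low'."""
    failed = set(failed_scorers)
    # Any Critical failure, or empty input, yields the lowest trust level.
    if empty_input or failed & CRITICAL:
        return "low"
    # Only Advisory failures remain at this point.
    if failed & ADVISORY:
        return "medium"
    return "high"


print(trust_level([]))
print(trust_level(["WeaveFluencyScorerV1"]))
print(trust_level(["WeaveFluencyScorerV1", "WeaveHallucinationScorerV1"]))
```

Checking the Critical group first means a single Critical failure outweighs any number of Advisory failures.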

Usage notes​

  • This scorer is intended for evaluating RAG pipelines.
  • WeaveTrustScorerV1 requires query, context, and output keys to score correctly.

Usage example​

import weave
from weave.scorers import WeaveTrustScorerV1

trust_scorer = WeaveTrustScorerV1()

# A helper function to print the results of the trust scorer
def print_trust_scorer_result(result):
    print(f"Output is trustworthy: {result.passed}")
    print(f"Trust level: {result.metadata['trust_level']}")
    if not result.passed:
        print("Triggered scorers:")
        for scorer_name, scorer_data in result.metadata['raw_outputs'].items():
            if not scorer_data.passed:
                print(f" - {scorer_name} did not pass")

    # Print the per-scorer scores collected by the trust scorer
    print(f'WeaveToxicityScorerV1 scores: {result.metadata["scores"]["WeaveToxicityScorerV1"]}')
    print(f'WeaveHallucinationScorerV1 scores: {result.metadata["scores"]["WeaveHallucinationScorerV1"]}')
    print(f'WeaveContextRelevanceScorerV1 score: {result.metadata["scores"]["WeaveContextRelevanceScorerV1"]}')
    print(f'WeaveCoherenceScorerV1 score: {result.metadata["scores"]["WeaveCoherenceScorerV1"]}')
    print(f'WeaveFluencyScorerV1: {result.metadata["scores"]["WeaveFluencyScorerV1"]}')

# There are 2 issues with the input data: irrelevant context, hallucinated output
result = trust_scorer.score(
    query="What is the capital of Antarctica?",
    context="People in Antarctica love the penguins.",
    output="The cat stretched lazily in the warm sunlight.",
)

print_trust_scorer_result(result)



This scorer uses the Presidio library to detect Personally Identifiable Information (PII) in your AI system's inputs and outputs.

Usage notes​

  • To specify specific entity types, such as emails or phone numbers, pass a list of Presidio entities to the selected_entities parameter. Otherwise, Presidio will detect all entity types in its default entities list.
  • Pass custom recognizers to the scorer as a list of type presidio.EntityRecognizer via the custom_recognizers parameter.
  • To pass non-English input to the scorer, use the language parameter to specify the language of the text.

Usage example​

import weave
from weave.scorers import PresidioScorer

presidio_scorer = PresidioScorer()

result = presidio_scorer.score(
    output="Mary Jane is a software engineer at XYZ company and her email is"
)

print(f"Output contains PII: {not result.passed}")