Scores & Evaluation
Scores are a metric for evaluating individual executions or traces. A variety of scores can be used; the most common metrics assess quality, tonality, factual accuracy, completeness, and relevance.
If a score relates to a specific step of a trace, for example a single LLM call, a message in a chat conversation, or a step in an agent, it can be attached directly to that observation. This enables targeted evaluation of that particular component.
The Score object in Langfuse:
Attribute | Type | Description |
---|---|---|
name | string | Name of the score, e.g. user_feedback, hallucination_eval |
value | number | Value of the score |
traceId | string | Id of the trace the score relates to |
observationId | string | Optional: Observation (e.g. LLM call) the score relates to |
comment | string | Optional: Evaluation comment, commonly used for user feedback, eval output or internal notes |
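For illustration, this is how a score with these attributes might be ingested via the Python SDK. It is a minimal sketch, assuming the SDK's score method and placeholder IDs from your application:

```python
from langfuse import Langfuse

# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY (and optionally
# LANGFUSE_HOST) are set in the environment.
langfuse = Langfuse()

# Attach a score to a whole trace ...
langfuse.score(
    trace_id="trace-id-from-your-application",  # placeholder id
    name="user_feedback",
    value=1,
    comment="User clicked thumbs-up",
)

# ... or to a single observation (e.g. one LLM call) within the trace.
langfuse.score(
    trace_id="trace-id-from-your-application",
    observation_id="observation-id-of-the-llm-call",  # placeholder id
    name="hallucination_eval",
    value=0,
    comment="Output contained an unsupported claim",
)
```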
Kinds of scores
Scores in Langfuse are flexible and can be tailored to the requirements of your specific LLM application. They typically measure the following aspects:
- Quality
  - Factual accuracy
  - Completeness of the information provided
  - Verification against hallucinations
- Style
  - Sentiment portrayed
  - Tonality of the content
  - Potential toxicity
- Security
  - Similarity to prevalent prompt injections
  - Instances of model refusals (e.g., "as a language model, ...")
This flexible scoring system enables a comprehensive evaluation of the aspects that matter most to your application; a simple run-time check for one of these aspects is sketched below.
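For example, one of the security aspects above (detecting model refusals) could be computed with a lightweight run-time heuristic. This is a hypothetical sketch, not a built-in Langfuse evaluator; the patterns and the refusal_score helper are assumptions for the example:

```python
import re

# Hypothetical refusal phrases (assumption; tune for your application).
REFUSAL_PATTERNS = [
    r"as an? (AI )?language model",
    r"I('m| am) (sorry|afraid),? (but )?I (can('|no)t|am unable to)",
]

def refusal_score(completion: str) -> int:
    """Return 1 if the model output looks like a refusal, else 0."""
    return int(any(re.search(p, completion, re.IGNORECASE) for p in REFUSAL_PATTERNS))
```

The resulting 0/1 value can then be ingested as a score (e.g. named refusal), as shown in the SDK example above.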
Ingesting scores
We currently run a private beta of our newest evaluation service on Langfuse Cloud. Learn more here and ping us via the chat widget if you are interested in joining the beta.
Most users of Langfuse ingest scores programmatically. These are common sources of scores:
Source | Examples |
---|---|
Manual evaluation (UI) | Review traces/generations and add scores manually in the UI |
User feedback | Explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output) |
Model-based evaluation | OpenAI Evals, Whylabs Langkit, Langchain Evaluators (cookbook), RAGAS for RAG pipelines (cookbook), custom model outputs |
Custom via SDKs/API | Run-time quality checks (e.g. valid structured output format), custom workflow tool for human evaluation |
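As an example of the last row, a run-time quality check for valid structured output could be computed and ingested in one step. This is a sketch; the valid_json score name and the helper function are assumptions:

```python
import json

from langfuse import Langfuse

langfuse = Langfuse()

def score_json_validity(trace_id: str, model_output: str) -> None:
    """Score whether the model returned valid JSON (1) or not (0)."""
    try:
        json.loads(model_output)
        value = 1
    except json.JSONDecodeError:
        value = 0
    langfuse.score(
        trace_id=trace_id,
        name="valid_json",  # hypothetical score name
        value=value,
    )
```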
Using scores across Langfuse
Scores can be used in multiple ways across Langfuse:
- Displayed on traces to provide a quick overview
- Segment all execution traces by score, e.g. to find all traces with a low quality score
- Analytics: Detailed score reporting with drill downs into use cases and user segments
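For custom analytics outside the UI, ingested scores can also be retrieved programmatically. A minimal sketch, assuming the public REST API's GET /api/public/scores endpoint with Basic Auth (public key as username, secret key as password):

```python
import os

import requests

# LANGFUSE_HOST should include the scheme, e.g. https://cloud.langfuse.com
response = requests.get(
    f"{os.environ['LANGFUSE_HOST']}/api/public/scores",
    auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
    params={"name": "user_feedback"},  # filter by score name
)
response.raise_for_status()
print(response.json())
```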