Hacking DSPy into doing Automatic System Prompt Optimization

Maxime Rivest

2025-07-21

For this tutorial, you will only need to install dspy and set up an LLM connection. I will be using several LLMs to demonstrate how easy it is to switch between them and to show the student/teacher concept. You can, however, set up only one if you want. If you use a locally hosted model (you can!), simply skip setting up the API key.

For this tutorial, I will use Kimi-K2 hosted by Groq (click here to get a Groq API key) and Llama models from OpenRouter (click here to get an OpenRouter key).

I like to use uv to install my libraries.

!uv pip install "dspy>=2.6.27"

I generally set up my keys permanently, but you can also do this to set them up just for here and now.

```{python}
import os
os.environ["GROQ_API_KEY"] = "[REDACTED]"
os.environ["OPENROUTER_API_KEY"] = "[REDACTED]"
```

To set a key permanently on your system, use the commands below (replace GROQ_API_KEY with OPENROUTER_API_KEY to do the same for your OpenRouter key).

Linux / macOS

Append to your shell start-up file (pick the one you actually use):

echo "export GROQ_API_KEY='gsk_[REDACTED]'" >> ~/.bashrc
# or ~/.zshrc, ~/.profile, etc.
source ~/.bashrc   # reload once

Windows – CMD

setx GROQ_API_KEY "gsk_[REDACTED]"

Close and reopen the terminal.

Windows – PowerShell

[Environment]::SetEnvironmentVariable("GROQ_API_KEY", "gsk_[REDACTED]", "User")

Refresh with refreshenv or open a new window.

Making an automatic System Prompt tool

In this tutorial, I’ll show you how I’ve modified and customized DSPy to make it handle system prompt optimization. Usually, DSPy does program optimization. DSPy is very much batteries-included, giving you tons of tools for everything. It’s general, and it gives you a framework for how to do things, which is powerful and useful. But that framework is about AI programming, not about system prompt optimization. That is why we will need to do some customization. Don’t worry, DSPy was built in a way that lets us do it without too much work.

The nice thing about having to customize DSPy is that by the end you’ll walk away with two things. First, a way to automatically optimize system prompts. Second, you’ll have opened the hood: you’ll understand better how DSPy works, which will help you use it more proficiently when you actually want to build AI programs.

So by the end of this tutorial we will have built this simple yet powerful automatic system prompt optimization utility and understood why we had to do what we did.

optimized_system_prompt = optimize(
    training_inputs = ["User prompt example 1", "...", "User prompt example n"],
    training_outputs = ["Desirable assistant example 1", "...", "Desirable assistant example n"],
    llm_judge = "Return a 1 if it's good and a 0 if it's bad."
)

Our optimize function will also be able to optionally take a starting system prompt, a max few-shots, and teacher and student model identifiers. Here is a mock-up of that:

optimized_system_prompt = optimize(
    training_inputs = ["User prompt example 1", "...", "User prompt example n"],
    training_outputs = ["Desirable assistant example 1", "...", "Desirable assistant example n"],
    llm_judge = "Return a 1 if it's good and a 0 if it's bad.",
    system_prompt = "You are a model that performs well...",
    max_few_shots = 2,
    teacher_model = "a-smart-model",
    student_model = "a-cheap-model"
)

Now that we have our vision, let’s get going!

The task

Throughout this tutorial, our task will be to make an English to Quebec-French translator.

The first DSPy optimizer that we want to use is dspy.MIPROv2. This optimizer can write (or improve) a program’s instructions. Let’s analyze the code below to learn what parts we must prepare to reach our goal of running MIPROv2 on this task.

First, we pass translation_judge to the optimizer’s initialisation. This must be a function that returns a score between 0 (bad) and 1 (good). In DSPy, these are called metrics. Almost every DSPy optimizer requires a metric. Next come the two max_..._demos arguments, which are set to 0; this is because, for a first run, we would like to optimise only the text of the system prompt without adding few-shot examples. MIPROv2 can search for few-shot examples that would improve a program’s performance.

optimizer = dspy.MIPROv2(translation_judge, max_bootstrapped_demos = 0, max_labeled_demos = 0)
my_program_optimized = optimizer.compile(my_program, trainset=trainset)

On the second line, inside the compile method, we must pass a DSPy program. This is not a string; it cannot be a system prompt. We will thus need to wrap our system prompt + user/assistant simple LLM call into a lightweight program. And, finally, we have the trainset. In DSPy, this must be a list of dspy.Example objects. This is the object that all of DSPy’s internals use, so there is no way around it; we must format our input/output training set as dspy.Examples.

In summary, we need:

1. a metric
2. a program
3. a training set

and we must format those appropriately.

Let’s first tackle the training set, as it is quite straightforward.

Training set

The Example() object can take any arguments. You can think of those as column names in a dataframe or “keys” in JSON. It is usually pretty important to choose these names thoughtfully, as DSPy will normally present them to the LLM as part of the prompts. In our case, that is a behavior of DSPy that we will change, so it does not matter much what we call them. I decided to go with something very general: the prompt will be the user message and the generation will be the assistant message.

import dspy

examples = [
    dspy.Example(prompt="I'm going to the convenience store.", generation="Je m'en vais au dépanneur."),
    dspy.Example(prompt="It's really cold out today.", generation="Il fait frette en maudit aujourd'hui."),
    dspy.Example(prompt="Can you help me move this weekend?", generation="Tu peux m'aider à déménager ce weekend?"),
    dspy.Example(prompt="We were stuck in traffic for two hours.", generation="On était pognés dans le trafic pendant deux heures."),
    dspy.Example(prompt="She's my girlfriend.", generation="C'est ma blonde."),
    dspy.Example(prompt="That car is so cool!", generation="C'est ben l'fun ce char-là!"),
    dspy.Example(prompt="I'll call you tonight.", generation="Je vais t'appeler ce soir."),
    dspy.Example(prompt="He's always bragging.", generation="Il se vente tout l'temps."),
    dspy.Example(prompt="We grabbed a coffee at Tim's.", generation="On a pris un café au Tim."),
    dspy.Example(prompt="Close the window, it's chilly.", generation="Ferme la fenêtre, y fait frette."),
    dspy.Example(prompt="I have an appointment at 3.", generation="J'ai un rendez-vous à trois heures."),
    dspy.Example(prompt="They're celebrating their birthday.", generation="Ils fêtent leur fête."),
    dspy.Example(prompt="I parked in the back.", generation="J'ai stationné dans l'fond."),
    dspy.Example(prompt="The metro is packed.", generation="Le métro est plein à craquer."),
    dspy.Example(prompt="We watched a movie last night.", generation="On a écouté un film hier soir."),
    dspy.Example(prompt="I need to do my groceries.", generation="J'dois faire mon épicerie."),
    dspy.Example(prompt="Don't forget your boots.", generation="Oublie pas tes bottes."),
    dspy.Example(prompt="It's snowing again.", generation="Il neige encore."),
    dspy.Example(prompt="I'll take the bus.", generation="J'va prendre l'bus."),
    dspy.Example(prompt="We're out of milk.", generation="On est à court de lait."),
]

Before we are done with our training set, we must do one more little thing:

trainset = [x.with_inputs('prompt') for x in examples]

This, again, is something we have to do because of DSPy’s general, powerful nature. Briefly, it is used by DSPy’s internals to know which fields of the Example object are input fields for the LLM; separating inputs from outputs this way helps DSPy’s machinery. In our case, we just need to know that we have to do it, and so we do.
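To see what with_inputs changed, dspy.Example exposes inputs() and labels(), which split each example into its input and label fields. A quick check:

```{python}
print(trainset[0].inputs())  # only the input field:  prompt
print(trainset[0].labels())  # only the label field:  generation
```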

Let’s move on to the Metric now!

Metric

Our first metric will be somewhat dumb and a little bit bad. That is because it is hard to write code that measures the quality of a translation. Despite that, we will get pretty good results, you will see.

In essence, all this code does is search for some very common French words that are not also common English words. If any of the words are found, the function returns a 1; otherwise it returns a 0.

import re

def is_french(text):
    # Naive French detector: check for common French words/accents
    french_markers = [
        r"\b(le|la|les|un|une|des|du|de|et|à|est|sont|avec|pour|sur|par|mais|ou|où|que|qui|quand|comment|nous|vous|ils|elles|ça|ce|cette|ces)\b",
        r"[éèêàùçîôâœëïü]",
    ]
    return any(re.search(marker, text.lower()) for marker in french_markers)

def translation_judge(example, prediction, trace=None):
    """
    Return 1.0 if the output looks French, else 0.0.
    Doing the cast explicitly guarantees we never hand DSPy a None.
    """
    output = prediction.get("generation", "") or ""
    try:
        return float(is_french(output))
    except Exception:
        # Anything weird is just a miss
        return 0.0
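A quick sanity check before moving on (any object with a .get method works as the prediction here, so a plain dict is fine):

```{python}
print(translation_judge(examples[0], {"generation": "Je m'en vais au dépanneur."}))  # 1.0 – accented characters found
print(translation_judge(examples[0], {"generation": "Sounds good!"}))                # 0.0 – no French markers
```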

Notice how translation_judge takes 3 arguments: example, prediction, and trace.

  • example will essentially be an instance of the Example() object as we defined above.
  • prediction will be the parsed LLM output. Usually DSPy can do a lot here, but we will modify and simplify that part too.
  • trace can be ignored except when we want models to generate good examples themselves. This is called bootstrapping, and in that case, if trace is not None, we must return a boolean for whether the LLM-generated example is good (1) or not (0). This could be used, for instance, to make our list of translation pairs longer; see the sketch below.
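For illustration, here is a minimal sketch of what a trace-aware version of our metric could look like (same is_french helper; the function name is just for this example):

```{python}
def translation_judge_bootstrap(example, prediction, trace=None):
    score = float(is_french(prediction.get("generation", "") or ""))
    if trace is not None:
        # Bootstrapping mode: accept or reject the LLM-generated demo outright.
        return score >= 1.0
    return score
```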

Moving on to the program now!

Program

The simplest program you can build in DSPy is one with only one input, one output, and empty instructions, using Predict. A core concept of DSPy is the signature, but since we do not want to do program optimization, I’ll not go into it (see this post for a simple introduction to DSPy).

class signature(dspy.Signature):
    """ 
    
    """
    prompt = dspy.InputField()
    generation = dspy.OutputField()

initial_program = dspy.Predict(signature)

The most interesting part for you to note is that initial_program is now callable, and if we call it, we will get an LLM response, provided we set up an LLM like this:

kimi = dspy.LM("groq/moonshotai/kimi-k2-instruct")
dspy.configure(lm = kimi)
initial_program(prompt = "Hello, how are you?")
Prediction(
    generation="I'm doing well, thank you for asking! How can I help you today?"
)

But we have a few problems.

initial_program.inspect_history()




[2025-07-23T09:37:55.419674]

System message:

Your input fields are:
1. `prompt` (str):
Your output fields are:
1. `generation` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## prompt ## ]]
{prompt}

[[ ## generation ## ]]
{generation}

[[ ## completed ## ]]
In adhering to this structure, your objective is:


User message:

[[ ## prompt ## ]]
Hello, how are you?

Respond with the corresponding output fields, starting with the field `[[ ## generation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


Response:

[[ ## generation ## ]]
I'm doing well, thank you for asking! How can I help you today?

[[ ## completed ## ]]




The above command prints the previous interaction we had with the LLM. In that interaction, the system prompt was:

Your input fields are:
1. `prompt` (str):
Your output fields are:
1. `generation` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## prompt ## ]]
{prompt}

[[ ## generation ## ]]
{generation}

[[ ## completed ## ]]
In adhering to this structure, your objective is:

And the user message was:

[[ ## prompt ## ]]
Hello, how are you?

Respond with the corresponding output fields, starting with the field `[[ ## generation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

And the assistant was:

[[ ## generation ## ]]
I'm doing well, thank you for asking! How can I help you today?

[[ ## completed ## ]]

A lot of stuff was added, and if we run an optimizer as it is, we will be optimizing the LLM’s performance in that prompt template. This is a little too different from the vanilla setup we would have expected, which is: sp = "", user = "Hello, how are you?", and the assistant response could have been something like assistant = "I'm doing well, thank you for asking! How can I help you today?". The culprit for the additions is DSPy’s adapter. The adapter is amazing at turning a DSPy signature into an AI program, but right now, it’s in the way.
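Concretely, the exchange we want to optimize looks like this (a sketch of the target message list; the system content is what the optimizer will write for us):

```{python}
messages = [
    {"role": "system", "content": ""},  # the system prompt under optimization
    {"role": "user", "content": "Hello, how are you?"},
]
# desired assistant reply: "I'm doing well, thank you for asking! How can I help you today?"
```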

Let’s replace DSPy’s adapter with our own simplified version.

Making a Simple Custom Adapter

Adapters are DSPy’s interface to the LLMs. They are called with a few pieces of information, and DSPy expects a parsed LLM generation to be returned. The following is the simplest we can make an adapter. We are taking in the LM that DSPy’s internals want us to use, keyword arguments if any, a signature, demos, and inputs.

The signature can contain only 3 things: instructions, inputs, and outputs. In our case, we have “canned” the signature, so we also know that the input is named prompt and the output is named generation, simplifying our requirements for our adapter substantially from what DSPy usually has to worry about.

# Define the SimplestAdapter
class SimplestAdapter(dspy.Adapter):
    def __call__(self, lm, lm_kwargs, signature, demos, inputs):
        print(inputs)  # debug: show the inputs DSPy hands us
        system_content = signature.instructions
        # (demos are ignored for now; we add few-shot support later)
        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": inputs["prompt"]},
        ]
        outputs = lm(messages=messages, **lm_kwargs)
        return [{"generation": outputs[0]}]

# Do NOT call dspy.configure(adapter=SimplestAdapter())
# Subclass Predict to use the custom adapter only for this instance
class MyPredict(dspy.Predict):
    def forward(self, **kwargs):
        adapter = SimplestAdapter()
        with dspy.settings.context(adapter=adapter):
            return super().forward(**kwargs)

We also have to subclass dspy.Predict so that we can make a program that uses our adapter. Usually in DSPy, the adapter is set globally or within a scoped context, but in both cases the adapter is applied recursively. That would make the DSPy programs running inside the optimizer use our simple adapter too, causing them all to break. And breaking everything is generally not good…

my_program = MyPredict(signature)
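Before optimizing, it is worth a quick check that the new adapter really sends bare messages; the printed history should now show just our (empty) system prompt and the raw user message:

```{python}
my_program(prompt = "Hello, how are you?")
my_program.inspect_history()
```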

Automatically Generating a System Prompt

We are now ready to run the optimizer!!!

Letting MIPROv2 write the System Prompt

optimizer = dspy.MIPROv2(translation_judge, max_bootstrapped_demos = 0, max_labeled_demos = 0)
my_program_optimized = optimizer.compile(my_program, trainset=trainset, requires_permission_to_run = False)

Let’s test the program right away:

my_program_optimized(prompt = "Hello, how are you?")
{'prompt': 'Hello, how are you?'}
Prediction(
    generation='Salut, ça va-tu ben?'
)

Good! It’s a translation and not a response to our salutation. Let’s inspect the messages.

my_program_optimized.inspect_history()




[2025-07-23T09:37:56.428724]

System message:

You are a seasoned Québécois street linguist who grew up in Montréal’s Plateau-Mile End. Your job is to translate the user’s colloquial North-American English sentence into equally relaxed, idiomatic Québec French. Preserve every ounce of slang, contraction, and colourful swear word that a native speaker would use at the dépanneur counter on a Friday night. Keep the same tone, brevity, and punch as the original—no extra formality, no explanations, just the straight-up spoken French you’d hear in a Montréal alleyway.


User message:

Hello, how are you?


Response:

Salut, ça va-tu ben?




And this confirms that our adapter works! This is a completely ‘vanilla’ set of messages.

Using an LLM in the Metric

Here we redefine our translation_judge so that, instead of using Python code to calculate a score between 0 and 1, we ask an LLM to do it.

In this case, we are using DSPy in its most natural way! So first we define a signature by subclassing dspy.Signature.

The docstring there is the instruction that the adapter will put in the system prompt. The InputFields are those we will pass to the program, and the OutputFields are those that the program will return. In the case of score: int = dspy.OutputField(desc="A single integer from 1 to 5."), DSPy will parse and validate the integer out of the LLM-generated string for us. If a valid integer is not produced, DSPy will even retry for us and try different adapters.

class QuebecTranslationJudge(dspy.Signature):
    """You are an expert Quebec French linguist. For each English sentence and its proposed French translation, evaluate the translation on a scale of 1 to 5 based on the following criteria, with 5 being a perfect, natural-sounding translation.

1.  **Accuracy**: Does the French convey the same meaning as the English?
2.  **Register**: Is the tone appropriately informal/colloquial (not formal textbook French)?
3.  **Regional Vocabulary**: Does it use authentic Quebec French terms (e.g., "dépanneur", "frette", "char")?
4.  **Contractions**: Are natural Quebec French contractions used (e.g., "j'va", "t'sais", "y fait")?
5.  **Proper Nouns & Anglicisms**: Are names (e.g., "Tim's") and common anglicisms (e.g., "weekend") handled appropriately for Quebec French?

Provide brief feedback on any issues and output only the final numerical score.

IMPORTANT IF MEANING IS CHANGED SET TO 0.
"""

    english_sentence = dspy.InputField(desc="The original sentence in English.")
    french_translation = dspy.InputField(desc="The proposed translation in Quebec French.")
    feedback = dspy.OutputField(desc="Brief feedback on the translation's quality.")
    score: int = dspy.OutputField(desc="A single integer from 1 to 5.")

# If you have a capable model configured globally, just do this:
llm_judge = dspy.Predict(QuebecTranslationJudge)

def translation_judge(example, prediction, trace=None):
    """
    An LLM-based metric that judges translation quality.
    It robustly parses the score and normalizes it to a 0.0-1.0 scale.
    """
    english_sentence = example.prompt
    # Ensure the prediction's output is not empty
    french_translation = prediction.get("generation", "")
    if not french_translation:
        return 0.0

    try:
        # Call the LLM judge to get a score
        result = llm_judge(
            english_sentence=english_sentence,
            french_translation=french_translation
        )
        # Parse the score and normalize it to a 0.0-1.0 range
        # (e.g., a score of 5 becomes 1.0, 1 becomes 0.2)
        score = float(result.score)
        return score / 5.0
    except (ValueError, AttributeError, TypeError):
        # If the LLM fails to output a valid score, return 0.0
        return 0.0
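As before, a quick sanity check helps (this one actually calls the configured LM, so the exact feedback and score will vary from run to run):

```{python}
print(translation_judge(examples[0], {"generation": "Je m'en vais au dépanneur."}))  # e.g. 1.0 for a 5/5 judgement
print(translation_judge(examples[0], {"generation": "Je vais au magasin."}))         # likely lower: wrong register and vocabulary
```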

Now that we have overwritten translation_judge, let’s run the optimization again.

optimizer = dspy.MIPROv2(translation_judge, max_bootstrapped_demos = 0, max_labeled_demos = 0)
my_program_optimized = optimizer.compile(my_program, trainset=trainset, requires_permission_to_run = False)

Let’s test the program right away:

my_program_optimized(prompt = "Hello, how are you?")
{'prompt': 'Hello, how are you?'}
Prediction(
    generation='Salut, ça va-tu?'
)

Good! It’s again a translation and not a response to our salutation. Let’s inspect the messages.

my_program_optimized.inspect_history()




[2025-07-23T09:37:57.484071]

System message:

Translate the given colloquial North-American English sentence into natural, spoken Québec French. Preserve the tone, brevity, and regional flavor—use contractions, slang like “dépaneur”, and expressive modifiers such as “en maudit” where they fit naturally. Return only the French translation.


User message:

Hello, how are you?


Response:

Salut, ça va-tu?




And this confirms that our adapter works! This is a completely ‘vanilla’ set of messages.

Optimizing with Few-Shot Examples too

Let’s now make it possible for MIPROv2 to also add few-shot examples into the system prompt.

For this, we need to improve our simple adapter to have a way to format the demos. So we first define format_demos. This is a normal Python function that will expect a list of DSPy Examples and turn that into a simple string with a light XML structure.

def format_demos(demos):
    """
    Wrap every demo once – no duplicated header lines.
    """
    parts = ["Here are examples of your expected behavior.",
             "<examples>"]
    for i, demo in enumerate(demos, 1):
        parts += [
            f"<example_{i}>",
            "User:",
            demo["prompt"],
            "Assistant:",
            demo["generation"],
            f"</example_{i}>",
        ]
    parts.append("</examples>")
    return "\n".join(parts)

Let’s try it:

examples = [
    dspy.Example(prompt="She's my girlfriend.", generation="C'est ma blonde."),
    dspy.Example(prompt="It's snowing again.", generation="Il neige encore."),
]

print(format_demos(examples))
Here are examples of your expected behavior.
<examples>
<example_1>
User:
She's my girlfriend.
Assistant:
C'est ma blonde.
</example_1>
<example_2>
User:
It's snowing again.
Assistant:
Il neige encore.
</example_2>
</examples>

And we need to update our SimplestAdapter with this line: system_content += "\n" + format_demos(demos).

# Redefine the SimplestAdapter, now with few-shot demo support
class SimplestAdapter(dspy.Adapter):
    def __call__(self, lm, lm_kwargs, signature, demos, inputs):
        print(inputs)
        system_content = signature.instructions
        if demos:
            system_content += "\n" + format_demos(demos)
        messages = [
            {"role": "system", "content": system_content},
            {"role": "user", "content": inputs["prompt"]},
        ]
        outputs = lm(messages=messages, **lm_kwargs)
        return [{"generation": outputs[0]}]

Let’s run the optimization again, but with max_bootstrapped_demos = 3 and max_labeled_demos = 3 this time.

optimizer = dspy.MIPROv2(translation_judge, max_bootstrapped_demos = 3, max_labeled_demos = 3)
my_program_optimized = optimizer.compile(my_program, trainset=trainset, requires_permission_to_run = False)

Let’s test the program right away:

my_program_optimized(prompt = "Hello, how are you?")
{'prompt': 'Hello, how are you?'}
Prediction(
    generation='Salut, ça va?'
)

Good! It’s again a translation and not a response to our salutation. Let’s inspect the messages.

my_program_optimized.inspect_history()




[2025-07-23T09:37:59.087720]

System message:

Translate the following English sentence into colloquial Québec French exactly as it would be spoken in daily conversation. Preserve the tone, brevity, and any slang or contractions typical of spoken Québécois. Return only the French translation—no explanations, no quotation marks, no extra formatting.
Here are examples of your expected behavior.
<examples>
<example_1>
User:
I'm going to the convenience store.
Assistant:
J'm'en va au dépanneur.
</example_1>
<example_2>
User:
We were stuck in traffic for two hours.
Assistant:
On était pognés dans le trafic pendant deux heures.
</example_2>
<example_3>
User:
Can you help me move this weekend?
Assistant:
Tu peux m'aider à déménager ce weekend?
</example_3>
</examples>


User message:

Hello, how are you?


Response:

Salut, ça va?




Now our system prompt also contains a few examples.

Changing Optimizer

We can easily change the optimizer. Look at this code that now uses SIMBA instead of MIPROv2.

optimizer = dspy.SIMBA(metric = translation_judge, bsize = 8)
my_program_optimized = optimizer.compile(my_program, trainset=trainset)

Let’s check the results:

my_program_optimized(prompt = "Hello, how are you?")
my_program_optimized.inspect_history()
{'prompt': 'Hello, how are you?'}




[2025-07-23T09:38:14.219315]

System message:

If the prompt contains a simple declarative sentence in English (like "They're celebrating their birthday."), then you should output a direct French translation of that exact sentence, not conversational filler or advice. Focus on translating the meaning word-for-word while maintaining proper French grammar and vocabulary.

If the prompt is an enthusiastic English sentence about something being cool or awesome, then you should translate it into informal, spoken Québec French rather than standard European French. Use colloquial expressions like "ben l'fun", "char" for "car", and "ce …-là" constructions to sound natural to a Québec audience.

If the prompt is any English sentence (even a neutral declarative one like "He's always bragging."), then you should translate it into informal Québec French: contract "il" to "y'", drop the article "le" to "l'", use colloquial expressions such as "tout l'temps" and "en train de", and pick Québec-specific verbs like "fanfaronner" instead of standard French "se vanter".

If the prompt is a neutral declarative sentence that mentions a specific clock time (e.g., "at 3", "at 7:30"), then you should render that time in the informal Québec-French way: use numerals followed by "h" (e.g., "3h", "7h30") instead of spelling out the hour in words.
Here are examples of your expected behavior.
<examples>
<example_1>
User:
Can you help me move this weekend?
Assistant:
Tu peux m’aider à déménager ce week-end ?
</example_1>
</examples>


User message:

Hello, how are you?


Response:

Salut, ça va ?




In the case of SIMBA, we can see that it gradually added instructions to the system prompt.

Teacher-Student optimization

Let’s now optimize for a smaller model while still using Kimi to generate the system prompt.

We must now create another LM connection. Let’s stay with Groq for speed and to keep things simple.

llama8b = dspy.LM("groq/llama-3.1-8b-instant")
my_program.set_lm(lm = llama8b)

Here, I have to add the teacher argument to compile: .compile(..., teacher=kimi, ...).

optimizer = dspy.MIPROv2(translation_judge, max_bootstrapped_demos = 0, max_labeled_demos = 0)
my_program_optimized = optimizer.compile(my_program, trainset=trainset, teacher = kimi, requires_permission_to_run = False)

Let’s confirm that my_program_optimized is set to use Llama.

my_program_optimized.lm.model
'groq/llama-3.1-8b-instant'

Indeed it is!

Let’s try it:

my_program_optimized(prompt = "Hello, how are you?")
my_program_optimized.inspect_history()
{'prompt': 'Hello, how are you?'}




[2025-07-23T09:38:15.394238]

System message:

Translate the following informal North-American English sentence into equally informal, colloquial Québec French. Preserve the brevity, tone, and intent of the original. Use authentic Québécois phrasing, contractions, regional slang (e.g., “dépaneur”, “char”), and swear intensifiers (“en maudit”) where appropriate. Do not add or omit information.


User message:

Hello, how are you?


Response:

Salut, comment ça va?




Cool, so now we have Llama 3.1 8b as our translator :)

Bringing It All Together

Let’s now make the optimize() function we envisioned at the beginning.

Here, I asked o3-pro to bring it all together for us. You’ll recognize its comment style.

def optimize(
    *,
    training_inputs: list[str],
    training_outputs: list[str],
    llm_judge,
    system_prompt: str = "",
    max_few_shots: int = 0,
    teacher_model=None,
    student_model=None,
):
    """
    One‑stop helper that (1) turns parallel input / output lists into a DSPy
    training‑set, (2) builds / optimises a tiny translation programme, and
    (3) returns the auto‑generated system‑prompt (with optional few‑shot
    examples baked‑in).

    Parameters
    ----------
    training_inputs, training_outputs : list[str]
        Parallel lists of user prompts and the desired assistant replies.
    llm_judge : str | Callable
        Either a *string* with judging instructions **or** a fully‑formed
        `metric(example, prediction, trace)->float` callable.
    system_prompt : str, optional
        A starting prompt to improve upon (default empty).
    max_few_shots : int, optional
        Upper‑bound on examples the optimiser may add to the prompt.
    teacher_model, student_model : str | dspy.LM | None
        Identifiers *or* `dspy.LM` objects.  If only one is given, we fall
        back gracefully to the globally configured LM.

    Returns
    -------
    str
        The final system‑prompt text, ready to feed any chat‑completion API.
    """

    # ------------------------------------------------------------------ #
    # 0 .  Basic validation                                              #
    # ------------------------------------------------------------------ #
    if len(training_inputs) != len(training_outputs):
        raise ValueError("`training_inputs` and `training_outputs` must "
                         "have the same length.")

    # ------------------------------------------------------------------ #
    # 1 .  Build the training set                                        #
    # ------------------------------------------------------------------ #
    examples = [
        dspy.Example(prompt=inp, generation=out)
        for inp, out in zip(training_inputs, training_outputs, strict=True)
    ]
    trainset = [ex.with_inputs("prompt") for ex in examples]

    # ------------------------------------------------------------------ #
    # 2 .  Build (or wrap) the metric                                    #
    # ------------------------------------------------------------------ #
    if callable(llm_judge):
        translation_judge = llm_judge
    else:
        # Dynamically build a judge signature around the instruction string.
        judge_instructions = str(llm_judge).strip()

        class _AutoJudge(dspy.Signature):
            english_sentence = dspy.InputField()
            french_translation = dspy.InputField()
            score: int = dspy.OutputField(desc="0 = bad, 1 = good")

        # A .format() call is not a real docstring (the class would end up
        # with __doc__ = None), so attach the instructions explicitly.
        _AutoJudge = _AutoJudge.with_instructions(judge_instructions)

        judge_predict = dspy.Predict(_AutoJudge)

        def translation_judge(example, prediction, trace=None):
            try:
                result = judge_predict(
                    english_sentence=example.prompt,
                    french_translation=prediction.get("generation", "")
                )
                return float(result.score)
            except Exception:
                return 0.0

    # ------------------------------------------------------------------ #
    # 3 .  Prepare the LM objects                                        #
    # ------------------------------------------------------------------ #
    def _to_lm(obj):
        if obj is None:
            return None
        return obj if isinstance(obj, dspy.LM) else dspy.LM(obj)

    teacher_lm = _to_lm(teacher_model)
    student_lm = _to_lm(student_model)

    # If the reader supplied no student, fall back to whatever DSPy is
    # already configured with; otherwise bind the student to our programme.
    if student_lm is not None:
        active_lm = student_lm
    else:
        active_lm = getattr(dspy.settings, "lm", None)  # may still be None → DSPy default

    # ------------------------------------------------------------------ #
    # 4 .  Build the programme                                           #
    # ------------------------------------------------------------------ #
    class OptimSignature(dspy.Signature):
        prompt = dspy.InputField()
        generation = dspy.OutputField()

    # Same docstring caveat as above: bake the starting prompt in explicitly.
    OptimSignature = OptimSignature.with_instructions(system_prompt)

    programme = MyPredict(OptimSignature)
    if active_lm is not None:
        programme.set_lm(active_lm)

    # ------------------------------------------------------------------ #
    # 5 .  Run MIPRO‑v2                                                  #
    # ------------------------------------------------------------------ #
    optimiser = dspy.MIPROv2(
        translation_judge,
        max_bootstrapped_demos=max_few_shots,
        max_labeled_demos=max_few_shots,
    )

    compile_kwargs = dict(
        trainset=trainset,
        requires_permission_to_run=False
    )
    if teacher_lm is not None:
        compile_kwargs["teacher"] = teacher_lm

    tuned_prog = optimiser.compile(programme, **compile_kwargs)

    # ------------------------------------------------------------------ #
    # 6 .  Extract the finished prompt (+ optional demos)                #
    # ------------------------------------------------------------------ #
    final_prompt = tuned_prog.signature.instructions.strip()

    if getattr(tuned_prog, "demos", None):
        final_prompt += "\n" + format_demos(tuned_prog.demos)

    return final_prompt

Let’s use it:

optimized_system_prompt = optimize(
    training_inputs=[
        "I'm going to the convenience store.",
        "It's really cold out today."
    ],
    training_outputs=[
        "Je m'en vais au dépanneur.",
        "Il fait frette en maudit aujourd'hui."
    ],
    llm_judge="Return 1 if the French looks natural and 0 otherwise."
)

Let’s see what system prompt we got:

print(optimized_system_prompt)
You are an expert Canadian-French translator who specializes in ultra-casual, idiomatic language.  
Given a short English sentence or phrase (the `prompt`), produce its Canadian-French equivalent (`generation`) that is just as informal and succinct. Preserve slang, contractions, and the exact tone of the original—no extra formality, no extra words.

Not bad! Let’s test with all the parameters:

optimized_system_prompt = optimize(
    training_inputs=[
        "I'm going to the convenience store.",
        "It's really cold out today."
    ],
    training_outputs=[
        "Je m'en vais au dépanneur.",
        "Il fait frette en maudit aujourd'hui."
    ],
    llm_judge="Return 1 if the French looks natural and French Canadian and 0 otherwise.",
    system_prompt = "Translate from english to french",
    max_few_shots = 2,
    teacher_model = "groq/moonshotai/kimi-k2-instruct",
    student_model = "groq/llama-3.1-8b-instant"
)

And let’s see what we got:

print(optimized_system_prompt)
You are a bilingual Canadian-French speaker who translates casual English into the colloquial, idiomatic French used in everyday Québec conversations.  Given the prompt (an English sentence or short paragraph), return the generation: its natural-sounding, equally informal Canadian-French equivalent, keeping the same register, brevity, and tone.
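And since the result is just a string, you can drop it into any chat-completion call. A minimal sketch, reusing dspy.LM directly (the user sentence is only an example):

```{python}
lm = dspy.LM("groq/llama-3.1-8b-instant")
messages = [
    {"role": "system", "content": optimized_system_prompt},
    {"role": "user", "content": "I missed the bus this morning."},
]
print(lm(messages=messages)[0])  # the Quebec-French translation
```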