DSPy Optimizers – Parameter Structure Analysis (by deep research)

Maxime Rivest

2025-07-17

One of my favorite things about deep research from OpenAI is that it was fine-tuned to produce long reports, so I like using it to write long reports almost more than to do actual deep research. One very useful thing I just discovered: instead of giving it the internet, give it a GitHub connector to a repo, then ask it a question or ask it to document your repository, and it will write a very long report about it.

Limiting deep research from OpenAI to a single repository (using the connectors) turns out to be very effective at focusing it on producing a complete report about your code. On this page I am sharing the results I got by applying this to DSPy’s optimizers. Fun fact: deep research was fine-tuned to write longer output, so it is quite a ‘different’ model from others you would find out there. I like it for this application.

Summary Table

Below is an overview of each optimizer class in the dspy.teleprompt module, including each class's constructor (__init__) parameters and primary optimization method (usually compile), with a breakdown of positional vs. keyword-only arguments:

Optimizer Class __init__ Parameters (positional vs. keyword-only) Core Method & Parameters (positional vs. keyword-only)
Teleprompter (base class) __init__(self) – no parameters (just self). compile(self, student, *, trainset, teacher=None, valset=None) – student is positional; trainset is required keyword-only; teacher and valset are optional keyword-only.
LabeledFewShot __init__(self, k=16) – one parameter k (int) with default 16 (may be given positionally or by name). compile(self, student, *, trainset, sample=True) – student is positional; trainset required keyword-only; sample optional keyword-only (default True).
BootstrapFewShot __init__(self, metric=None, metric_threshold=None, teacher_settings={}, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, max_errors=5) – all parameters have defaults (callable metric and optional metric_threshold for success cutoff; teacher_settings dict for teacher LM config; numeric defaults for demos and rounds; max_errors tolerates errors). These can be passed as keywords (order is not enforced by * in the constructor). compile(self, student, *, teacher=None, trainset, valset=None) – student positional; trainset required keyword-only; teacher optional (default None, keyword-only); valset optional keyword-only.
BootstrapFewShotWithRandomSearch __init__(self, metric, teacher_settings=None, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, num_candidate_programs=16, num_threads=None, max_errors=None, stop_at_score=None, metric_threshold=None) – extends BootstrapFewShot with additional parameters for random search. metric (callable) is required (no default); others are optional (teacher_settings default None; defaults for demos and rounds as in BootstrapFewShot; num_candidate_programs controls number of candidate prompt sets; optional num_threads for parallelism; max_errors default None uses global setting; stop_at_score optional early stopping threshold; metric_threshold optional filter threshold). compile(self, student, *, teacher=None, trainset, valset=None, restrict=None, labeled_sample=True) – student positional; trainset required keyword-only; teacher optional keyword-only; valset optional keyword-only; restrict optional keyword-only (to restrict which candidate seeds to run); labeled_sample optional keyword-only (default True, whether to sample labeled demos in candidate generation).
Ensemble __init__(self, *, reduce_fn=None, size=None, deterministic=False) – all arguments are keyword-only (enforced by *). reduce_fn is a function to combine outputs (e.g. majority vote) defaulting to None; size is an optional int to sample subset of programs; deterministic is a bool (must be False for now, as deterministic mode not implemented). compile(self, programs) – takes a list of programs as a single positional argument. No trainset or metric is used here; the method returns an ensembled program that calls all (or a sampled subset of) given programs and reduces their outputs.
FinetuneTeleprompter (base for fine-tuning optimizers) __init__(self, train_kwargs=None) – one optional parameter train_kwargs which can be a dict of training arguments (or a dict mapping specific LM objects to their training args). Defaults to None (internally converted to a default dict). This base class doesn’t implement compile itself (inherits Teleprompter.compile which raises NotImplementedError) – it is meant to be subclassed for fine-tuning behavior. No direct compile method in this base class – subclasses implement the optimization logic. (It inherits the abstract compile signature from Teleprompter but does not override it, so it cannot be used standalone.)
BootstrapFinetune __init__(self, metric=None, multitask=True, train_kwargs=None, adapter=None, exclude_demos=False, num_threads=None) – extends FinetuneTeleprompter. All arguments have defaults: metric (evaluation metric, default None), multitask (bool, True to fine-tune on combined data vs. per-predictor), train_kwargs (dict for training hyperparams, default None), adapter (optional Adapter or mapping for fine-tuning, default None), exclude_demos (bool, default False, whether to clear prompt demos after fine-tuning), num_threads (int, default None for using global default threads). These can be given as keywords or positionally (no * in signature). compile(self, student, trainset, teacher=None, valset=None, target="t5-large", bsize=12, accumsteps=1, lr=5e-5, epochs=1, bf16=False, int8=False, peft=False, path_prefix=None) – student and trainset are accepted as positional args (unlike others, this method does not strictly enforce keyword-only for trainset in the code). teacher is optional (default None, can be passed by name); valset optional (default None); and a series of fine-tuning hyperparameters are provided as keyword options with defaults (target model name, batch size bsize, gradient accumulation steps accumsteps, learning rate lr, epochs, and flags for bf16, int8, PEFT usage, plus path_prefix for saving checkpoints). (In practice, these would be passed as keywords; the lack of * means trainset and teacher could technically be given positionally, which is an inconsistency in interface.)
COPRO (Co-Prompt Optimization) __init__(self, prompt_model=None, metric=None, breadth=10, depth=3, init_temperature=1.4, track_stats=False) – all parameters have defaults. prompt_model is an LM used to generate prompt variations (defaults to the globally configured LM if None); metric is the evaluation metric (default None, meaning it will optimize without a specific metric filter unless provided); breadth (int) is how many new prompt candidates to generate per iteration (default 10); depth is how many iterations of prompt refinement to perform (default 3); init_temperature (float) for prompt generation randomness (default 1.4); track_stats (bool) whether to record optimization statistics (default False). compile(self, student, *, trainset, eval_kwargs) – student program is positional; trainset is required keyword-only; eval_kwargs is also required keyword-only (a dict of extra arguments for evaluation). No teacher parameter in this optimizer – instead it uses prompt_model internally for generating new instructions, and evaluates the student on trainset using the provided metric and eval settings.
MIPROv2 (Multiprompt Instruction Proposal Optimizer) __init__(self, metric, prompt_model=None, task_model=None, teacher_settings=None, max_bootstrapped_demos=4, max_labeled_demos=4, auto="light", num_candidates=None, num_threads=None, max_errors=None, seed=9, init_temperature=0.5, verbose=False, track_stats=True, log_dir=None, metric_threshold=None) – a large number of parameters. Notably, metric is required (no default) – the primary evaluation metric. prompt_model and task_model are optional LM instances (if None, defaults to global settings for prompt generation and for executing the task, respectively). teacher_settings is an optional dict of LM settings for any teacher model usage (default None -> {}). max_bootstrapped_demos and max_labeled_demos default to 4 each (controls how many few-shot examples of each type to use initially). auto can be "light", "medium", "heavy" or None, controlling an automatic configuration of search effort (default “light”). num_candidates (int, optional) specifies how many candidate prompt variations to generate (if auto is None, this must be set along with num_trials). num_threads optional (for parallel eval, default None). max_errors optional (max allowed errors during eval, default None to use global). seed default 9 (random seed for reproducibility). init_temperature (float) default 0.5 for initial prompt variation. verbose (bool) default False for logging. track_stats default True to record detailed stats. log_dir optional path for logging. metric_threshold optional float to early-discard prompts below this score threshold. compile(self, student, *, trainset, teacher=None, valset=None, num_trials=None, max_bootstrapped_demos=None, max_labeled_demos=None, seed=None, minibatch=True, minibatch_size=35, minibatch_full_eval_steps=5, program_aware_proposer=True, data_aware_proposer=True, view_data_batch_size=10, tip_aware_proposer=True, fewshot_aware_proposer=True, requires_permission_to_run=True, provide_traceback=None) – student is positional; all other parameters are keyword-only. trainset (list of examples) is required; teacher optional (defaults None, a teacher program/LM for bootstrapping if needed); valset optional (if provided, used for evaluation phases). This method exposes many tuning knobs: num_trials (total search iterations, required if auto mode is None), the ability to override max_bootstrapped_demos/max_labeled_demos for this run, a seed (if not given, uses the seed from init), and several boolean flags controlling different proposer strategies (minibatch evaluation vs full dataset, with minibatch_size and how often to fully evaluate minibatch_full_eval_steps; whether the prompt proposal is aware of the program structure, data distribution, etc. via program_aware_proposer, data_aware_proposer, tip_aware_proposer, fewshot_aware_proposer – all True by default). view_data_batch_size (int, default 10) controls how much data a proposal sees at once. requires_permission_to_run (bool, default True) will prompt the user before a potentially expensive run. provide_traceback (bool or None) toggles including stack traces in logged errors. All of these are meant to be supplied as keywords when needed (there is a * enforcing keyword-only) to fine-tune the search behavior.

Table Legend: Positional parameters are those that must be supplied in order (or by name), before any *. Keyword-only parameters (shown after *) can only be supplied by name (and have default values if not marked required). Defaults are shown where applicable. Each class’s core method (usually compile) is listed with its signature and the nature of its arguments.

Detailed Method Argument Analysis

Below we provide a class-by-class breakdown of the constructor and primary method parameters, explaining each argument, default values, and usage conventions:

Teleprompter (Base Class)

  • Constructor Teleprompter.__init__: Takes no arguments besides self (no parameters to configure). It’s essentially an abstract base, so no initialization parameters are needed.

  • Method compile(self, student, *, trainset, teacher=None, valset=None): This is meant to be overridden by subclasses. It accepts a student program (the DSPy program to optimize) as a positional argument. The datasets are keyword-only:

    • trainset (required, list of Example): the training examples on which to optimize.
    • teacher (optional, default None): an optional teacher program used to guide optimization (if not provided, many optimizers default to using the student itself or an internal strategy).
    • valset (optional, default None): an optional validation set of examples to evaluate generalization or for early stopping. All parameters after student are marked with * in the signature, making them keyword-only for clarity. The base implementation raises NotImplementedError (since Teleprompter itself doesn’t define a specific optimization strategy).
  • Method get_params(self): (Minor utility) Returns a dictionary of the Teleprompter’s internal attributes (simply self.__dict__). This is a common interface to retrieve the configuration of any Teleprompter.
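
To make the base interface concrete, here is a minimal sketch of a custom subclass. This is a toy example of mine, not a class that ships with DSPy; it assumes the base class is importable from dspy.teleprompt and that programs expose deepcopy() and predictors() as DSPy modules do.

```python
import dspy
from dspy.teleprompt import Teleprompter  # assumed import path for the base class

class FirstDemoOptimizer(Teleprompter):
    """Toy optimizer (hypothetical): attach the first training example to every predictor."""

    def compile(self, student, *, trainset, teacher=None, valset=None):
        compiled = student.deepcopy()        # leave the caller's program untouched
        for predictor in compiled.predictors():
            predictor.demos = trainset[:1]   # one labeled demo, purely illustrative
        return compiled
```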

LabeledFewShot

  • Constructor LabeledFewShot.__init__(self, k=16): This optimizer’s only parameter is k – the number of labeled examples from the trainset to use as demonstrations per predictor. It defaults to 16. This parameter is positional-or-keyword (not forced to keyword-only), so one could call LabeledFewShot(10) to use 10 examples, or LabeledFewShot(k=10). The value of k sets an upper bound on how many examples will be taken from the training data to insert as prompt demonstrations.

  • Method compile(self, student, *, trainset, sample=True): Optimizes the given student program by attaching labeled examples to it:

    • student – the program to optimize (positional).
    • trainset – required keyword-only list of examples to draw demonstrations from.
    • sample – keyword-only bool (default True): if True, it randomly samples min(k, len(trainset)) examples for each predictor in the student; if False, it simply takes the first k examples (in order) from the trainset.

    The compile method returns a new compiled program where each predictor in the student has up to k example demos in its prompt. If the trainset is empty, it returns the student unchanged. This optimizer does not use any “teacher” or iterative improvement – it’s a one-step assignment of labeled data. All arguments after student are keyword-only as indicated by the * in the signature.
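
A minimal usage sketch (the tiny dataset and one-predictor program below are placeholders of my own, not taken from DSPy’s docs):

```python
import dspy

# Tiny placeholder dataset; with_inputs() marks which fields are inputs.
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

class QA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.answer(question=question)

qa = QA()

fewshot = dspy.LabeledFewShot(k=8)  # k may also be passed positionally: LabeledFewShot(8)
compiled_qa = fewshot.compile(qa, trainset=trainset, sample=True)
```

Since it samples min(k, len(trainset)) examples per predictor, a trainset smaller than k is simply used in full.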

BootstrapFewShot

  • Constructor BootstrapFewShot.__init__: This optimizer automatically “bootstraps” new prompt demonstrations by having the program attempt the task and collecting successful outputs as examples. Its constructor accepts several parameters, all with defaults:

    • metric (callable, default None): A function to judge success on an example (takes e.g. (gold_example, prediction, trace) and returns True/False or a score). If None, any output is considered a success for bootstrapping purposes.
    • metric_threshold (float, default None): A score threshold for the metric – if provided, a prediction must meet or exceed this threshold to count as a successful example. (If metric is boolean-returning, this may not be used.) This parameter allows filtering which outputs become demonstrations.
    • teacher_settings (dict, default {}): Settings to configure the behavior of the teacher model (e.g., a different language model or different decoding parameters). These settings (like temperature) will be applied to the teacher when generating outputs.
    • max_bootstrapped_demos (int, default 4): The maximum number of bootstrapped demos (new examples generated from the model itself) to add per predictor.
    • max_labeled_demos (int, default 16): The maximum number of labeled demos (original trainset examples) to use per predictor. This sets an upper bound on using ground-truth examples in addition to bootstrapped ones.
    • max_rounds (int, default 1): How many bootstrapping rounds to perform. Each round can attempt to gather new demos from the model’s outputs.
    • max_errors (int, default 5 in some implementations, or None): The maximum number of errors to tolerate during bootstrapping (e.g., if the student or teacher throws exceptions). If the number of errors exceeds this, the process will halt or raise. In some versions, if set to None, it may fall back on a global setting.

    All these parameters have default values, meaning the constructor can be called with no arguments (it will bootstrap using default settings). They are not declared as keyword-only in the signature (no leading * in the __init__), but in practice they are almost always passed by keyword for clarity.

  • Method compile(self, student, *, teacher=None, trainset, valset=None): This performs the bootstrapping process:

    • student – the program to optimize (positional). The student should initially be “uncompiled” (no demos attached).
    • teacher – optional keyword-only. If provided, this is a separate program or model to act as the “coach” producing outputs; if None, the student itself (or a copy) is used as the teacher by default. The teacher is typically a copy of the student (or a version with different settings) that generates candidate outputs.
    • trainset – required keyword-only list of examples for training. The teleprompter will run each example through the teacher (or student) to see if it can get a correct output.
    • valset – optional keyword-only list of examples for validation (default None). If provided, it may be used after bootstrapping to evaluate or select prompts (in the basic BootstrapFewShot, it’s not heavily used; it often defaults to using any remaining train examples not successfully bootstrapped as a validation list).

    Process: The compile method will:

    1. Make a fresh copy of the student (ensuring the original remains unchanged) and also prepare a teacher copy.
    2. If max_labeled_demos > 0 and the teacher program isn’t already compiled with demos, it first uses a LabeledFewShot teleprompter to supply up to max_labeled_demos ground-truth examples to the teacher (so the teacher starts with some baseline demos).
    3. It then iterates through the trainset, using the teacher to generate predictions. For each example, if the prediction is “successful” according to the metric (or if no metric provided), it will extract the input/output pair from the execution trace and add it as a new demo example (a bootstrapped demo) for the student’s corresponding predictor.
    4. It stops once it has collected max_bootstrapped_demos successful demos or has exhausted the training data (or completed max_rounds passes). Any training examples not “bootstrapped” successfully may remain as a validation set.
    5. Finally, it calls an internal _train() which assembles the final set of demos for each predictor: it takes the bootstrapped demos collected and, if there’s still room (up to max_labeled_demos total), it may fill in some of the original trainset examples as well. The resulting student (with demos attached) is marked as compiled and returned.

    All arguments after student are keyword-only, enforcing calls like teleprompter.compile(student=prog, trainset=data) for clarity. This is consistent with the base Teleprompter signature. The presence of both teacher_settings in the constructor and an optional teacher in compile means you configure how the teacher behaves up front (e.g., use a different model or temperature via settings), and you can also supply a specific teacher program if desired at compile time.
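
A hedged sketch of typical usage, reusing the toy qa program and trainset from the LabeledFewShot example above; it assumes an LM has been configured globally, since bootstrapping actually runs the program:

```python
import dspy

def exact_match(example, prediction, trace=None):
    # Boolean metric: a bootstrapped trace is kept only if this returns True.
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Bootstrapping runs the program, so configure an LM first, e.g.
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model name is a placeholder
bootstrap = dspy.BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=4,
    max_labeled_demos=16,
    max_rounds=1,
    teacher_settings={},  # e.g. dict(lm=stronger_lm) to run the teacher on a different LM
)
compiled_qa = bootstrap.compile(qa, trainset=trainset)  # trainset must be named
```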

BootstrapFewShotWithRandomSearch

  • Constructor BootstrapFewShotWithRandomSearch.__init__: This class builds on BootstrapFewShot to not only bootstrap demos but also perform a random search over multiple candidate prompt sets. It inherits from Teleprompter (and in newer versions, it extends BootstrapFewShot) and introduces additional parameters:

    • metric (callable, no default in signature): Similar to BootstrapFewShot, this is the evaluation metric. In this class, metric is effectively required – the absence of a default indicates the user should supply one (the random search needs a way to compare programs). (If None were passed, evaluation might simply treat any output as a success, but in practice one provides a metric.)
    • teacher_settings (dict, default None): Same role as in BootstrapFewShot – configuration for the teacher’s LM behavior. If None, an empty dict is used internally.
    • max_bootstrapped_demos (int, default 4), max_labeled_demos (int, default 16), max_rounds (int, default 1): Same meaning as in BootstrapFewShot (limits on demos and bootstrap iterations).
    • num_candidate_programs (int, default 16): The number of candidate programs (prompt configurations) to evaluate in the random search. This class will generate and test up to this many variations of prompts.
    • num_threads (int, default None): If set, this can be used to parallelize evaluation of candidates (e.g., number of threads for the Evaluate calls). If None, it might default to a global setting or single-threaded evaluation.
    • max_errors (int, default None): Maximum errors tolerated (similar to BootstrapFewShot; if None, use global setting). This applies during each candidate evaluation as well.
    • stop_at_score (float, default None): If provided, the search will stop early if it finds a candidate with a metric score greater or equal to this threshold.
    • metric_threshold (float, default None): A threshold applied during the bootstrapping phase for considering a trace successful (similar to BootstrapFewShot’s metric_threshold).

    All these arguments have defaults except metric, and they are typically passed by keyword. In the code, none are forced keyword-only at init, but practically one would use keywords for clarity due to the number of parameters.

  • Method compile(self, student, *, teacher=None, trainset, valset=None, restrict=None, labeled_sample=True): This performs an extended random search on top of bootstrapping:

    • student – the program to optimize (positional).
    • teacher – optional keyword-only teacher program (default None) as in BootstrapFewShot.
    • trainset – required keyword-only training examples.
    • valset – optional keyword-only validation set (defaults to using trainset if not provided, as seen in code where self.valset = valset or trainset).
    • restrict – optional keyword-only (default None). This can be used to restrict which candidate indices/seeds to run. Internally, this optimizer uses different random seeds (including some special values like -3, -2, -1 for baseline variants) to generate candidate prompt sets; the restrict parameter can specify a subset of these seeds to actually evaluate (useful for debugging or partial searches).
    • labeled_sample – optional keyword-only bool (default True). This is passed into the LabeledFewShot step for the seed that uses labeled examples only. If True, it randomly samples labeled demos; if False, it takes the first examples (just as in LabeledFewShot’s compile).

    Process: The compile method goes through a sequence of candidate evaluations (using different seed values to shuffle the trainset and vary the demos):

    1. It considers a set of candidate prompt configurations:

      • seed = -3: a zero-shot baseline (no demos at all).
      • seed = -2: a baseline with labeled examples only (uses LabeledFewShot to attach up to max_labeled_demos demos).
      • seed = -1: an “unshuffled” few-shot bootstrap (runs BootstrapFewShot with the trainset in given order).
      • seed >= 0: a number of random shuffles. For each seed from 0 up to num_candidate_programs-1, it shuffles a copy of the trainset and picks a random number of bootstrapped demos (between 1 and max_bootstrapped_demos) to gather, then runs BootstrapFewShot with those settings.
    2. For each candidate, it uses Evaluate to compute the overall metric score on either the valset or training set for that compiled program. It keeps track of the scores.

    3. It applies adjustments for any assertion-based failures (specific to DSPy programs with internal assertion checks): the score is penalized by subtracting for _suggest_failures and zeroed out entirely if there are any _assert_failures.

    4. It identifies the best-scoring program and can stop early if stop_at_score was specified and achieved.

    5. Finally, it attaches a list of all candidate programs and their scores to the best program (best_program.candidate_programs) for reference, and returns the best program.

    All parameters after student are keyword-only, aligning with the interface of BootstrapFewShot (trainset must be named, etc.). This optimizer’s interface is more complex, but the use of keyword-only helps avoid confusion when calling compile with many optional settings. One idiosyncrasy: the compile method itself uses the internal BootstrapFewShot class for seeds -1 and >=0, thereby inheriting any parameters set in the constructor like metric_threshold or teacher_settings and reusing them for each candidate search.
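
A sketch of how this optimizer is typically invoked (same placeholder metric and data as above; the structure of candidate_programs follows the description in step 5, not verified here):

```python
import dspy

random_search = dspy.BootstrapFewShotWithRandomSearch(
    metric=exact_match,        # required: the search needs scores to rank candidates
    max_bootstrapped_demos=4,
    num_candidate_programs=8,  # 8 shuffled-seed candidates on top of the baseline seeds (-3, -2, -1)
    num_threads=4,
    stop_at_score=None,        # e.g. 95.0 to stop as soon as a candidate reaches that score
)
best = random_search.compile(qa, trainset=trainset)  # valset defaults to the trainset
# best.candidate_programs keeps every evaluated candidate and its score (see step 5 above).
```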

Ensemble

  • Constructor Ensemble.__init__(self, *, reduce_fn=None, size=None, deterministic=False): The Ensemble teleprompter does not deal with datasets or metrics at all – instead, it creates an ensemble from multiple programs. All its parameters are keyword-only (notice the leading *, in the signature):

    • reduce_fn (callable, default None): A function that takes a list of outputs (one from each program in the ensemble) and reduces them to a single output. For example, DSPy provides dspy.majority to pick the most common answer, which is a typical choice for classification tasks. If reduce_fn is None, the ensemble’s forward will return the list of all outputs.
    • size (int, default None): If set, the ensemble will randomly select size programs out of the provided list each time it is called, rather than using all programs. If None, it uses all programs each time.
    • deterministic (bool, default False): If True, the ensemble would aim to produce deterministic behavior (e.g., always pick the same subset for a given input). Currently, this is not implemented (the code asserts that deterministic is False).

    These parameters allow controlling how the ensemble combines multiple models’ outputs. All must be passed by keyword, e.g., Ensemble(reduce_fn=dspy.majority, size=5).

  • Method compile(self, programs): Instead of optimizing prompts, this teleprompter combines programs. The programs argument is a list of DSPy programs to ensemble, passed as a single positional argument. There are no trainset or metric arguments. The method returns a new EnsembledProgram (constructed internally) which, when called, will:

    • If size is specified, randomly sample that many programs from the list; otherwise use all programs.
    • Invoke each selected program’s __call__ (or forward) on the given inputs.
    • Collect their outputs, and then either apply the reduce_fn if provided or return the list of outputs as-is.

    The compile here is straightforward: it doesn’t “learn” or modify the programs, just wraps them. Notably, there is no keyword-only enforcement in this signature, because it only takes one argument (programs). The usage is simply ensemble_teleprompter.compile([prog1, prog2, ...]). This class is an outlier in that it doesn’t use any of the training data or metric infrastructure – it’s purely a structural optimizer.
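
A short sketch (assuming Ensemble and majority are importable from the top-level dspy namespace; otherwise use from dspy.teleprompt import Ensemble):

```python
import dspy

# Three variants of the same task; here simply three copies of an already-compiled program.
programs = [compiled_qa.deepcopy() for _ in range(3)]

ensemble = dspy.Ensemble(reduce_fn=dspy.majority, size=2)  # keyword-only by design
ensembled = ensemble.compile(programs)

# Each call samples 2 of the 3 programs, runs them, and reduces the outputs with dspy.majority.
answer = ensembled(question="What is the capital of France?")
```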

FinetuneTeleprompter (Base Class for Fine-tuning)

  • Constructor FinetuneTeleprompter.__init__(self, train_kwargs=None): This base class is designed for optimizers that fine-tune language model weights. It introduces a single configuration parameter:

    • train_kwargs (dict or dict-of-dicts, default None): Training arguments for fine-tuning. It can be one dictionary applied to all LMs, or a mapping from specific LM objects to their respective parameter dicts. For example, this might include learning rate, number of epochs, etc. If None, it defaults to an empty configuration. Internally, the constructor converts this into a standard form (using convert_to_lm_dict) where each LM maps to its own settings (even if the same settings are used for all).

    This class does not take a metric in its constructor – because often fine-tuning might use the training loss as implicit metric, or the metric can be applied on a validation set externally. It primarily encapsulates how to call the underlying LM’s fine-tune method. FinetuneTeleprompter doesn’t implement a new compile itself – it relies on child classes to implement the strategy. After construction, it holds a train_kwargs mapping that will be used during fine-tune calls.

  • No direct compile method: FinetuneTeleprompter inherits the abstract compile from Teleprompter but does not override it, so it can’t be used on its own. Subclasses (like BootstrapFinetune) will implement the actual compile logic. Essentially, FinetuneTeleprompter serves to store training configurations and provide utility methods (in the DSPy code, e.g., finetune_lms static method in the newer implementation, or convert_to_lm_dict). Think of it as an abstract base similar to Teleprompter, but specifically for fine-tuning optimizers, ensuring they handle train_kwargs uniformly.
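
A sketch of the two accepted shapes of train_kwargs; the key names and model identifiers below are placeholders, since the real keys depend on the fine-tuning backend:

```python
import dspy

# Form 1: one dict of training arguments applied to every LM being fine-tuned
# (key names are placeholders; the real ones depend on the fine-tuning backend).
shared_kwargs = {"n_epochs": 2}

# Form 2: a mapping from specific LM objects to their own training arguments;
# convert_to_lm_dict normalizes both forms into this shape internally.
small_lm = dspy.LM("openai/gpt-4o-mini")   # placeholder model names
other_lm = dspy.LM("openai/gpt-4.1-mini")
per_lm_kwargs = {small_lm: {"n_epochs": 1}, other_lm: {"n_epochs": 3}}

finetuner = dspy.BootstrapFinetune(train_kwargs=shared_kwargs)  # or train_kwargs=per_lm_kwargs
```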

BootstrapFinetune

  • Constructor BootstrapFinetune.__init__: This class combines bootstrapping with actual fine-tuning of an LM. It inherits from FinetuneTeleprompter. Its parameters are as follows:

    • metric (callable, default None): An optional metric function to evaluate model outputs (similar to other teleprompters). If provided, it can be used to judge which outputs are “successful” when bootstrapping data or to guide the selection of fine-tuning data. If None, all outputs might be considered or a default (like always True) is used.
    • multitask (bool, default True): Whether to fine-tune on all tasks/predictors jointly (True) or separately (False). If multitask=True, all data from all predictors might be combined to fine-tune a single model (or one model per unique LM); if False, it will fine-tune separate models for each predictor (the code sets data indices accordingly).
    • train_kwargs (dict or dict-of-LM dicts, default None): Passed to the base FinetuneTeleprompter to configure fine-tuning (learning rate, epochs, etc.). If a plain dict is given, the same settings apply to all language models; a more granular mapping can specify different hyperparameters per LM.
    • adapter (Adapter or dict of LMs to Adapter, default None): An optional specification of an adapter to use for fine-tuning (e.g., for parameter-efficient fine-tuning). If provided, this indicates which fine-tuning method or adapter to use for each LM. Internally converted to a dict mapping each LM to an Adapter (using a similar technique to train_kwargs).
    • exclude_demos (bool, default False): If True, after fine-tuning it will clear out any prompt demonstrations in the predictors (perhaps under the assumption that the model has learned from them and they are no longer needed). If False, it leaves any demos in place. In the code, after fine-tuning, they actually set pred.demos = [] if exclude_demos is True.
    • num_threads (int, default None): Number of threads for parallel fine-tuning jobs. If you have multiple predictors to fine-tune (e.g., multitask=False scenario or multiple LMs in a program), this sets how many can run in parallel. It defaults to None, which means use the global default (or 1 if not set).

    All these parameters have defaults, so you can call BootstrapFinetune() with none, and it will use a multitask approach with whatever global LM is configured. The signature does not enforce keyword-only, but given the number of parameters, using keywords is strongly recommended for clarity (e.g., BootstrapFinetune(metric=my_metric, epochs=2) etc., though epochs would actually go inside train_kwargs in this design).

  • Method compile(self, student, trainset, teacher=None, valset=None, target="t5-large", bsize=12, accumsteps=1, lr=5e-5, epochs=1, bf16=False, int8=False, peft=False, path_prefix=None): This is a two-phase optimizer: it first bootstraps prompt examples, then fine-tunes the model on those examples. Its signature is notably different in that it does not strictly require trainset to be passed as a keyword (there is no * before trainset in the current implementation’s signature, meaning student and trainset could be given positionally). However, to avoid confusion, it’s often called with keywords for clarity. The parameters are:

    • student – the program to optimize (positional).

    • trainset – the list of examples to train on (positional or keyword). These will be used both for bootstrapping prompts and as the fine-tuning dataset.

    • teacher – optional (default None). A teacher program or list of programs. If provided, those will be used to bootstrap examples; if None, it will issue a warning that it’s using an uncompiled student as teacher. Often, one might pass a copy of the student or a differently configured model as the teacher for the bootstrap step.

    • valset – optional validation set (default None). Not extensively used inside the compile method for Bootstrapping (the code primarily uses trainset for bootstrapping and doesn’t explicitly use valset in fine-tuning, though it could be used to evaluate during training or after).

    • Fine-tuning hyperparameters: These are all optional with defaults, and they mirror typical HuggingFace/transformers fine-tuning settings:

      • target (str, default "t5-large"): The model name or identifier to fine-tune. This class may instantiate a fresh model of this type for fine-tuning or use it as an identifier to save the fine-tuned weights.
      • bsize (int, default 12): Batch size for fine-tuning.
      • accumsteps (int, default 1): Gradient accumulation steps.
      • lr (float, default 5e-5): Learning rate for fine-tuning.
      • epochs (int, default 1): Number of fine-tuning epochs.
      • bf16 (bool, default False): Whether to use bfloat16 precision.
      • int8 (bool, default False): Whether to use int8 quantization for fine-tuning (likely requires an adapter that supports it).
      • peft (bool, default False): Whether to use a PEFT (Parameter-Efficient Fine Tuning) method (like LoRA). If True, the fine-tuning will use an adapter method rather than full model tuning.
      • path_prefix (str, default None): An optional prefix path for saving fine-tuned model checkpoints. If provided, the fine-tuned model weights are saved under this path with a generated name.

    The compile process is as follows:

    1. Bootstrap Phase: It uses an internal self.teleprompter, which is a BootstrapFewShot instance configured in __init__ (with max_bootstrapped_demos very high and max_labeled_demos=0 by default in some implementations), to compile the student (or teacher) with bootstrapped demonstrations. Essentially, it generates a set of demonstrations by running the teacher (or student) on the trainset and collecting successful outputs (using the given metric if provided). This yields a compiled program with demos.
    2. It then prepares fine-tuning data: for each predictor in the compiled program, it takes all the demos (input-output pairs) and formats them into prompt-completion training examples appropriate for the language model fine-tuning. The code constructs prompt text and target text from each demo using the predictor’s signature/template, accumulating them in a list.
    3. It shuffles the fine-tuning data and writes it to disk as a .jsonl file (or multiple files if multitask vs per-predictor).
    4. Fine-tuning Phase: It invokes a fine-tuning routine (likely finetune_hf for HuggingFace models) on the prepared data for the specified target model, with the given hyperparameters (batch_size, epochs, lr, etc.). This produces fine-tuned model checkpoint(s).
    5. It loads these fine-tuned weights into the student’s predictors – replacing their lm with the fine-tuned model(s). If multitask=True, typically one model is fine-tuned for all (assuming a shared LM); if False, each predictor might get its own fine-tuned model. The code ensures the structure matches and assigns the new LMs.
    6. If exclude_demos=True, it clears the demos for each predictor (since the model is now supposed to handle the task without needing prompt examples).
    7. The method marks the program as compiled and returns the fine-tuned compiled program.

    Key points: The trainset here is used both to bootstrap examples and to generate the fine-tuning dataset, effectively turning successful model outputs into training data (this is a form of self-training). The presence of both metric-based bootstrapping and actual gradient descent is unique to this optimizer. The interface inconsistency is that trainset is not forced to keyword-only (likely an oversight), whereas most others require naming it. Best practice is to call it as teleprompter.compile(student, trainset=..., teacher=..., epochs=..., lr=..., ...) for clarity. All the fine-tuning hyperparameters have defaults and come after the required arguments; since there is no * in this signature they are not technically keyword-only, but in practice you would always pass them as named arguments (which is natural for these settings).
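
A hedged sketch of the newer-style constructor plus a compile call that names everything after student, per the best practice above (placeholders reused from the earlier examples):

```python
import dspy

finetuner = dspy.BootstrapFinetune(
    metric=exact_match,  # optional: filters which bootstrapped traces become training data
    multitask=True,      # fine-tune on the combined data from all predictors
    exclude_demos=True,  # drop prompt demos once the weights have been updated
)

# The signature would accept trainset positionally, but naming it keeps the call
# consistent with the other optimizers.
finetuned_qa = finetuner.compile(
    qa,
    trainset=trainset,
    teacher=compiled_qa,  # a compiled teacher avoids the "uncompiled student as teacher" warning
)
```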

COPRO (Co-Prompt Optimizer)

  • Constructor COPRO.__init__: COPRO aims to optimize the instructions in a prompt by iterative generation and testing. Its parameters:

    • prompt_model (LM client, default None): The language model used to propose new instructions. If None, the system likely defaults to the same model as the student (or whatever is set in global settings). By providing a separate prompt_model, you could use a larger or more creative model to generate prompt variants while using a different task_model (the student) for execution.
    • metric (callable, default None): The metric to evaluate the student’s performance. If None, COPRO can still run, but it might not have a quantitative way to compare prompts – in practice, a metric should be supplied so it can choose the best prompt.
    • breadth (int, default 10): The number of new prompt candidates to generate at each iteration (each “depth”). Essentially, in each round COPRO will produce this many alternative instructions via the prompt_model.
    • depth (int, default 3): The number of iterations (rounds of prompt generation and evaluation) to perform. A depth of 3 means it will generate new instructions 3 times, each time possibly building on or replacing previous ones.
    • init_temperature (float, default 1.4): The temperature setting for the prompt generation model in the initial generation round (higher temperature means more randomness/creativity). This influences the diversity of prompts generated. In the code, this temperature might be used for prompt_model when sampling instructions.
    • track_stats (bool, default False): Whether to collect statistics about the optimization process. If True, COPRO will record details such as the distribution of scores for prompts at each iteration (min, max, avg, std of top prompts, etc.). These stats would be stored in attributes like results_best, results_latest, etc., on the returned program for analysis.

    All of these parameters have defaults, and in the signature shown above they are not forced to be keyword-only; still, given the number of optional arguments, you would normally call it by keyword, for example COPRO(metric=..., breadth=20), for clarity.

  • Method compile(self, student, *, trainset, eval_kwargs): COPRO’s compile differs from previous ones in that it doesn’t attach demos or fine-tune weights, but instead alters the prompt instructions of the student’s predictors. Parameters:

    • student – the program to optimize (positional). This program likely contains one or more predictors with an instruction (prompt template) that we want to improve.
    • trainset – required keyword-only list of examples. These will be used to evaluate the quality of instructions. Essentially, for each candidate prompt, COPRO will run the student on the trainset and measure performance.
    • eval_kwargs – required keyword-only dict of arguments for evaluation. This is passed to DSPy’s Evaluate to evaluate the student on the trainset. For example, eval_kwargs might specify num_threads for parallel evaluation or display_progress flags. It’s mandatory to provide (the code does not have a default), ensuring the user is explicit about how to evaluate (e.g., eval_kwargs={"display_progress": False} or with specific settings).

    Process: In simplified terms, COPRO will:

    1. Make a deepcopy of the student to work on (so as not to modify the original mid-process).

    2. Evaluate the initial student on the trainset to get a baseline score (not explicitly shown in snippet, but likely done implicitly as part of loop or for stats tracking).

    3. For each iteration (up to depth):

      • Use the prompt_model to generate breadth new candidate instructions for each predictor. The generation likely uses one of two Signature classes defined in the code:

        • BasicGenerateInstruction if it’s the first round (which just takes the original instruction and asks for an improved one).
        • GenerateInstructionGivenAttempts if it’s after the first round (which provides some of the previously tried instructions and their scores to the prompt model, so it can propose a better one).
      • For each predictor in the student program, replace its instruction with each of the candidate instructions one at a time and evaluate the program on the trainset using the metric (via Evaluate with eval_kwargs).

      • Track the performance of each candidate. If track_stats is True, record the stats of these candidates (min, max, etc.).

      • Possibly filter out duplicate or very similar instructions (the code has _drop_duplicates to eliminate repeated candidates that yield the same results).

      • Select the top-performing instruction(s) to carry forward. Likely it keeps the best one as the new base instruction (and possibly uses others for context in subsequent rounds).

    4. Repeat for the specified number of depths. By the end, ideally, the student’s predictors have improved instructions that yield better metric performance on the trainset.

    5. Return the optimized program (with its instruction updated to the best found). If track_stats was True, the returned program might have attributes like results_best and results_latest containing the recorded statistics.

    All parameters after student are keyword-only, so one would call teleprompter.compile(student=prog, trainset=data, eval_kwargs=eval_args). The absence of a teacher parameter here is notable – COPRO doesn’t use a separate teacher model to generate outputs for evaluation; instead, it uses a separate prompt_model to generate prompts (instructions), and the original program (or its LM, possibly configured via teacher_settings if any) to evaluate those prompts. Essentially, COPRO is searching in prompt/instruction space, guided by metric evaluations on the trainset.
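
A usage sketch (the prompt model name is a placeholder; the eval_kwargs keys are the usual dspy.Evaluate options):

```python
import dspy

copro = dspy.COPRO(
    prompt_model=dspy.LM("openai/gpt-4o"),  # placeholder: an LM used only to write instructions
    metric=exact_match,
    breadth=10,            # candidate instructions generated per round
    depth=3,               # rounds of refinement
    init_temperature=1.4,
    track_stats=True,
)

# eval_kwargs is required and is forwarded to dspy.Evaluate.
eval_kwargs = dict(num_threads=8, display_progress=True, display_table=0)
optimized_qa = copro.compile(qa, trainset=trainset, eval_kwargs=eval_kwargs)
```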

MIPROv2

  • Constructor MIPROv2.__init__: MIPRO (Multiprompt Instruction Proposal Optimizer) is one of the most complex teleprompters, combining few-shot bootstrapping, instruction proposal, and hyperparameter search. Its initialization has many parameters, mostly optional, to cover various aspects of the search:

    • metric (callable, required): The evaluation metric to maximize. Unlike many others, MIPROv2 does not default metric to None – you must provide a metric function. This makes sense given the complexity: it needs a quantitative measure to drive the optimization.
    • prompt_model (LM, default None): Similar to COPRO, an optional separate model used to propose instructions or other prompt components. If None, defaults to the globally configured LM (or the student’s LM).
    • task_model (LM, default None): If the student program uses a particular LM, task_model can override or specify it. If None, it uses dspy.settings.lm (the globally configured default LM) as the model to actually run the task. Essentially, task_model is the model that executes the prompts (the “student’s LM”), and prompt_model is the model that generates new prompt candidates; they could be different.
    • teacher_settings (dict, default None): Similar to earlier teleprompters, this can hold settings for any teacher or evaluation model usage. MIPRO does some bootstrapping internally, so this could configure how that’s done. Internally, if None, it stores as an empty {}.
    • max_bootstrapped_demos (int, default 4): The initial number of bootstrapped few-shot examples to gather (per predictor) for use in prompts.
    • max_labeled_demos (int, default 4): The initial number of labeled (ground-truth) examples to include per predictor. (Notice this default is 4, smaller than the 16 used in simpler teleprompters, possibly to limit scope for the automated search).
    • auto (Literal “light”/“medium”/“heavy” or None, default “light”): This is a high-level switch to configure how exhaustive the search should be. If set to “medium” or “heavy”, the teleprompter will automatically set or override other parameters (like number of trials, etc.) to spend more effort. If auto=None, the user must manually specify certain parameters (like num_trials). The allowed values are enforced; any other string would raise an error.
    • num_candidates (int, default None): The number of candidate solutions (e.g., prompt combinations) to consider in the search. If auto is None, this must be provided (along with num_trials) or an error is raised. If auto is set, num_candidates should not be provided (it would be overridden by the auto settings).
    • num_threads (int, default None): Number of threads for parallel operations (like evaluation). If None, falls back to global setting.
    • max_errors (int, default None): Max errors to tolerate; if None, use global setting (similar usage as before).
    • seed (int, default 9): Random seed for reproducibility. Used for shuffling and any stochastic decisions.
    • init_temperature (float, default 0.5): Initial temperature for any prompt generation or sampling (lower than COPRO’s default, implying more conservative generation).
    • verbose (bool, default False): If True, provides more logging info during the process.
    • track_stats (bool, default True): Whether to collect and store statistics of the optimization (like how COPRO does). By default True, so it will track performance of trials, etc.
    • log_dir (str, default None): If provided, the directory path to save logs or intermediate results (like candidate programs, evaluations).
    • metric_threshold (float, default None): Similar to earlier, a threshold for the metric to perhaps prune or consider a trial successful. If set, any candidate with metric below this might be discarded or considered failing.

    The constructor sets a lot of these into internal attributes and does some validation: e.g., ensures if auto is not None, the user hasn’t also set num_candidates or num_trials (to avoid conflict), and if auto is None, then both num_candidates and num_trials must be specified by the user. It also immediately converts teacher_settings to an empty dict if None and assigns default models if prompt_model or task_model are None. All parameters except metric have defaults, but given their number, they are meant to be given by keyword (the signature includes no * here, but practically one would hardly pass 15 args positionally in order). The ordering places metric first (required), then the two models, then other settings.

  • Method compile(self, student, *, trainset, teacher=None, valset=None, num_trials=None, max_bootstrapped_demos=None, max_labeled_demos=None, seed=None, minibatch=True, minibatch_size=35, minibatch_full_eval_steps=5, program_aware_proposer=True, data_aware_proposer=True, view_data_batch_size=10, tip_aware_proposer=True, fewshot_aware_proposer=True, requires_permission_to_run=True, provide_traceback=None): This signature is expansive, but all arguments after student are keyword-only (enforced by the *). Here’s what they mean:

    • student – the program to optimize (positional).

    • trainset – required keyword-only list of examples to train/optimize on.

    • teacher – optional keyword-only (default None). If provided, used during the bootstrap of few-shot examples (similar to BootstrapFewShot’s teacher). If None, the student (or rather its task_model) is used to bootstrap itself.

    • valset – optional keyword-only list of examples for validation (default None). MIPRO uses a validation set to evaluate candidate prompts (distinct from trainset if provided) and for final evaluation of each trial. If not provided, it may split the trainset or use part of it for validation implicitly.

    • num_trials – optional keyword-only (int). The number of search trials to run. If auto is None, this must be set (and should correspond roughly to num_candidates and the effort desired). If auto is “light”/“medium”/“heavy”, num_trials will be determined internally (and providing it will raise an error).

    • max_bootstrapped_demos, max_labeled_demos – optional ints to override the defaults for this compile run. If provided, they will update the internal max_bootstrapped_demos/max_labeled_demos before running. Otherwise, it uses the values from the constructor (which might have been set via auto mode).

    • seed – optional int to override the random seed for this run (if not provided, uses the seed from init). This allows one to repeat the search with different seeds or ensure reproducibility.

    • minibatch (bool, default True): Whether to use minibatch evaluation when scoring prompts. If True, and the validation set is large, MIPRO will evaluate in batches rather than all at once (to speed up or simulate iterative evaluation). If False, it evaluates on the full valset every time.

    • minibatch_size (int, default 35): The number of examples to use in each minibatch evaluation if minibatch is True. It will evaluate candidate programs on chunks of this many examples and possibly use an average or intermediate pruning strategy.

    • minibatch_full_eval_steps (int, default 5): If using minibatch mode, this could indicate how frequently (in terms of trial count or iterations) a full evaluation on the entire valset is done, or how many minibatch steps constitute a “full” eval for logging. (This parameter’s use is a bit advanced; it might define after how many partial batches to do a full evaluation or something similar.)

    • The next several are boolean flags controlling proposers – these determine what aspects of the prompt the algorithm is allowed to propose changes for:

      • program_aware_proposer (default True): If True, the optimizer will propose modifications aware of the program’s structure (likely meaning it can consider changes to instructions in context of entire program).
      • data_aware_proposer (default True): If True, proposals might take into account the data distribution or particularities of examples (perhaps by examining some examples during instruction proposals).
      • view_data_batch_size (int, default 10): Possibly the number of examples the proposers can look at at once when generating suggestions (if data-aware).
      • tip_aware_proposer (default True): If True, the instruction proposer is given a randomly chosen prompting “tip” (a short meta-hint about instruction style, such as being creative or concise) when generating candidates, which helps diversify the proposed instructions.
      • fewshot_aware_proposer (default True): If True, the proposer can adjust few-shot examples or how they’re used (since MIPRO also handles bootstrapped demos).
    • requires_permission_to_run (bool, default True): If True, the compile will prompt the user for confirmation before running a potentially expensive search (especially in heavy mode). If set to False, it will run to completion without interactive confirmation.

    • provide_traceback (bool or None, default None): If True, any errors encountered might include tracebacks in the logs; if False, suppress tracebacks; if None, use a default setting (perhaps false). This is mainly for debugging if something goes wrong during evaluation, which can be helpful when verbose logging.

    Process: MIPROv2’s compile is very comprehensive. Summarizing:

    1. Few-shot Bootstrapping: It likely begins by ensuring the student has some initial demos. There is a call demo_candidates = self._bootstrap_fewshot_examples(program, trainset, seed, teacher) which presumably uses max_bootstrapped_demos and max_labeled_demos to produce a set of demonstration candidates (similar to BootstrapFewShot but perhaps generating multiple sets).

    2. Instruction Proposal: Then it calls _propose_instructions(...) which uses the prompt_model to propose new instructions, possibly taking into account the current program, the data, and the demo candidates. The parameters like view_data_batch_size, program_aware_proposer, etc., influence this step – e.g., it might generate instructions while seeing a batch of view_data_batch_size examples or not.

    3. If zero-shot optimization is indicated (no demos allowed, zeroshot_opt), it may discard demos to focus purely on instructions.

    4. Prompt Parameter Optimization: It then calls _optimize_prompt_parameters(...) – this likely orchestrates the main search over trials (num_trials). In each trial, it might:

      • Choose a set of demos (from demo_candidates, possibly none if zero-shot) and an instruction (from instruction_candidates proposed) to form a candidate program (a specific configuration of prompts).
      • Evaluate that program on the valset using the metric (the code uses an Evaluate instance for the valset with the given metric and threads).
      • Use something like Optuna (since the code imports optuna if available) to intelligently choose the next combination of parameters to try (the “Bayesian” or guided search aspect).
      • Possibly prune low-performing trials early (since the code has integration for pruning via intermediate minibatch evaluation).
      • Repeat until num_trials are done or the search converges.
    5. It likely uses the auto setting to determine num_trials and possibly adjust minibatch usage. For example, “heavy” auto might set a large number of trials and larger validation set size.

    6. If requires_permission_to_run=True, before starting the full search, it will print an estimate of how many LM calls or how long it might take and prompt the user to continue. If the user declines, it aborts and returns the original student unchanged.

    7. Throughout, it tracks the best program found. At the end, it returns the optimized program (with improved instructions and possibly with selected demos attached). It also attaches logs like trial_logs containing the score of each trial and the parameters used, as well as possibly storing in student._compiled = True.

    The key feature of MIPROv2 is that it integrates multiple dimensions: it can optimize the instruction text (like COPRO), the selection of few-shot examples (like BootstrapFewShot), and even other prompt parameters (e.g., it might experiment with presence or absence of demos – that’s why it has both fewshot_aware_proposer and code logic for zero-shot vs few-shot). It effectively generalizes and combines ideas from the simpler teleprompters. Because of this, its interface is the most complex. All those boolean flags allow turning on/off certain aspects of the search:

    • e.g., one could run it with program_aware_proposer=False to ignore program structure differences when proposing instructions, or minibatch=False to always evaluate on full validation set (safer but slower).

    As with other teleprompters, trainset and other main parameters are keyword-only to prevent mix-ups. The compile method is clearly intended to be called with named arguments for anything beyond the basics (e.g., teleprompter.compile(student=prog, trainset=data, valset=dev, num_trials=50, fewshot_aware_proposer=False, requires_permission_to_run=False)). The consistency in using keyword-only here is welcome given how many tuning knobs exist.
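
A usage sketch with the auto preset (placeholders reused from the earlier examples):

```python
import dspy

mipro = dspy.MIPROv2(
    metric=exact_match,  # required; there is no default
    auto="light",        # or "medium"/"heavy"; set to None to choose num_candidates/num_trials yourself
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    num_threads=8,
)

optimized_qa = mipro.compile(
    qa,
    trainset=trainset,
    # valset=devset,                     # recommended if you have a held-out set
    requires_permission_to_run=False,    # skip the interactive cost-confirmation prompt
)
```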

Patterns and Idiosyncrasies

Examining all these optimizers, we can observe several patterns in how parameters are structured, as well as some inconsistencies or outliers:

  • Common Structure – “compile” with trainset: Almost every optimizer uses a compile(student, *, ... trainset ..., ...) method to perform the optimization on a given program and dataset. Requiring trainset as a keyword-only argument is a common design (seen in Teleprompter base, LabeledFewShot, BootstrapFewShot, RandomSearch, COPRO, MIPRO). This pattern enforces clarity that a training set must be provided and avoids accidental swapping of positional arguments. An inconsistency here is BootstrapFinetune, whose compile signature does not enforce keyword-only for trainset (it takes student, trainset positionally). This makes BootstrapFinetune stand out as allowing compile(prog, data) without naming trainset, whereas others would require compile(prog, trainset=data). It’s likely an oversight in that implementation because the conceptual pattern is that trainset should be keyword-only for all.

  • Positional vs Keyword-only in Constructors: The base classes (Teleprompter, FinetuneTeleprompter) and some simple ones have very few parameters and thus no need for keyword-only in __init__. E.g., Teleprompter and FinetuneTeleprompter have none or one parameter and don’t use *. But Ensemble explicitly uses * to force its three parameters (reduce_fn, size, deterministic) to be keyword-only in the constructor. This is a design choice to improve readability: calling Ensemble(size=3, reduce_fn=majority) is self-documenting, versus relying on positional order. Other optimizers like BootstrapFewShot, BootstrapFewShotWithRandomSearch, BootstrapFinetune, COPRO, MIPROv2 did not enforce * in their __init__, despite having many parameters. This means in theory one could call BootstrapFewShot(None, {}, 4, 16, 1) positionally, but that would be very unclear. In practice, users likely call BootstrapFewShot(metric=my_metric, max_rounds=2, ...). The lack of uniform use of keyword-only in constructors is an inconsistency. A pattern is that newer or more user-facing classes (Ensemble, perhaps MIPRO if it was considered user-facing) lean towards keyword-only for clarity, whereas older classes did not enforce it.
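
    The two constructor styles, abridged from the signatures described in this report (the defaults shown for Ensemble are assumptions for illustration):

```python
class Ensemble:
    # Keyword-only: Ensemble(size=3, reduce_fn=majority) is required; positional calls fail.
    def __init__(self, *, reduce_fn=None, size=None, deterministic=False):
        ...

class BootstrapFewShot:
    # No '*': BootstrapFewShot(None, None, {}, 4, 16, 1, 5) is legal but unreadable.
    # (The mutable-dict default mirrors the signature reported above.)
    def __init__(self, metric=None, metric_threshold=None, teacher_settings={},
                 max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1,
                 max_errors=5):
        ...
```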

  • Parameter Naming Conventions:

    • Most classes use trainset and (optionally) valset consistently to refer to data. This is uniform across optimizers.

    • The use of teacher vs teacher_settings is a bit confusing across classes:

      • BootstrapFewShot and RandomSearch have a teacher_settings in the constructor (for LM config) and a teacher argument in compile (for an actual program instance).

      • BootstrapFinetune similarly takes an adapter (similar concept to teacher settings, but specific to fine-tuning) in constructor and a teacher in compile.

      • MIPROv2 uses teacher_settings in constructor (to adjust the teacher LM) and teacher in compile.

      • LabeledFewShot and Ensemble do not involve a teacher at all.

      • COPRO does not have a teacher parameter either; instead it has prompt_model and uses the student’s own execution for evaluation.

      Inconsistency arises in naming: e.g., BootstrapFewShotWithRandomSearch reuses teacher_settings from its parent and has teacher in compile, whereas FinetuneTeleprompter/BootstrapFinetune introduced a separate concept of adapter and train_kwargs for fine-tuning. These serve a similar role (configuring how the “teaching” or training is done) but under different names. Also, in MIPROv2, there is both teacher_settings and a teacher argument, plus separate prompt_model and task_model. This can be conceptually hard to follow:

        • teacher generally means an alternate DSPy program or LM used to generate outputs for bootstrapping.
        • teacher_settings means a dictionary of parameters to apply to whichever model is acting as teacher (like setting its temperature or max tokens).
        • prompt_model is an LM used for generating new prompt text (distinct from the task).
        • adapter in finetuning is an object encapsulating how to fine-tune (distinct from anything in non-finetune classes).

      Ideally, the interface could be cleaner if, for example, every Teleprompter had a teacher argument in compile (for a program or LM) and possibly a unified way to specify how that teacher should behave (maybe always via teacher_settings). Currently it’s partly unified (teacher + teacher_settings) in bootstrap classes, but fine-tune adds adapter, and COPRO/MIPRO add prompt_model separately. This is an area of inconsistency in naming and usage.
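
      Put side by side, the divergence looks roughly like this. These are illustrative constructor calls only, abridged from the signatures described in this report; the metric is a placeholder and the contents of teacher_settings (a temperature override) merely follow the description above.

```python
import dspy

def my_metric(example, prediction, trace=None):   # placeholder metric
    return float(example.answer == prediction.answer)

# Bootstrap family: LM config for the teacher lives in the constructor
# (teacher_settings), while the teacher *program* is passed later to compile.
bootstrap = dspy.BootstrapFewShot(metric=my_metric,
                                  teacher_settings={"temperature": 0.9})

# Fine-tune family: the analogous configuration lives under different names
# (adapter, train_kwargs), omitted here to avoid guessing their exact contents.
finetune = dspy.BootstrapFinetune(metric=my_metric)

# MIPROv2: teacher_settings again, plus separate prompt_model / task_model slots.
mipro = dspy.MIPROv2(metric=my_metric, teacher_settings={"temperature": 0.9})
```
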
    • Metric and Threshold: Every optimizer that evaluates outputs uses a metric parameter name for the evaluation function. This is consistent. Some optimizers (BootstrapFewShot, RandomSearch, MIPRO) also use metric_threshold as an optional cutoff for success. The concept of metric_threshold is not present in others like Finetune or COPRO (COPRO could theoretically use it but doesn’t expose it; Finetune focuses on loss). The inconsistent part is documentation vs implementation: e.g., the official docs for BootstrapFewShot did not list metric_threshold or max_errors, yet the code and random search clearly use them. This indicates either a new feature that wasn’t documented or a parameter considered more internal. As a pattern, many classes allow a None metric to mean “no filtering, just optimize blindly” and some threshold to refine what “success” means.
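
      A hedged sketch of how a scored metric and metric_threshold plausibly interact (the field names and the 0.7 cutoff are made up for illustration; the threshold semantics follow the “cutoff for success” description above, not a copy of the DSPy source):

```python
import dspy

def overlap_score(example, prediction, trace=None):
    gold, pred = set(example.answer.split()), set(prediction.answer.split())
    return len(gold & pred) / max(len(gold), 1)   # float in [0, 1]

tele = dspy.BootstrapFewShot(metric=overlap_score, metric_threshold=0.7)
# With metric_threshold set, a bootstrapped demo presumably counts as a success
# only when overlap_score(...) >= 0.7; with metric_threshold=None, any truthy
# metric value is treated as success.
```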

    • Demo-related parameters: We see repeated parameters controlling number of examples:

      • k in LabeledFewShot.
      • max_bootstrapped_demos and max_labeled_demos in BootstrapFewShot, RandomSearch, and MIPRO.

      These generally default to small numbers (4 and 16, or 4 and 4 in MIPRO). The choice of 4/16 vs 4/4 is inconsistent. Possibly, earlier versions assumed up to 16 labeled demos was fine (for simpler tasks or plentiful data), whereas MIPRO’s authors may have found that 16 made the search space too large or simply wasn’t needed, and so reduced both defaults to 4. It is an inconsistency in default tuning: two classes aimed at similar goals have different defaults for max labeled demos (16 vs 4). Similarly, LabeledFewShot and BootstrapFewShot share the 16 default for labeled demos (LabeledFewShot’s sole parameter k=16 aligns with BootstrapFewShot’s 16), whereas MIPRO diverges.
    • Parallelism parameters: num_threads appears in BootstrapFewShotWithRandomSearch, BootstrapFinetune, MIPRO, but not in plain BootstrapFewShot or LabeledFewShot. The base Evaluate class in DSPy likely uses a global thread count if not specified. The newer/complex optimizers expose num_threads to give the user control over parallel evaluations. This is a pattern of evolving design: earlier optimizers didn’t surface this (assuming either single-thread or using global config), later ones made it explicit. So there’s inconsistency across classes – e.g., one can’t directly set threads in BootstrapFewShot without going through dspy.settings, but one can in RandomSearch via the teleprompter’s param.
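
      For example (with a placeholder metric), the random-search variant exposes the knob directly while the plain class does not:

```python
import dspy

def my_metric(example, prediction, trace=None):   # placeholder metric
    return example.answer == prediction.answer

rs = dspy.BootstrapFewShotWithRandomSearch(metric=my_metric, num_threads=8)
bs = dspy.BootstrapFewShot(metric=my_metric)   # no num_threads parameter here
```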

    • Boolean flags for features: Some advanced optimizers (MIPRO) have many boolean flags to toggle sub-behaviors (program_aware_proposer, etc.), whereas simpler ones bake in one strategy. This reflects differing complexity: simpler optimizers don’t have these flags at all. It’s expected, but it means the interface isn’t uniform – MIPRO stands out with a very large signature and lots of optional toggles, compared to something like BootstrapFewShot which has a concise interface. From a consistency standpoint, MIPRO’s interface might be overwhelming relative to others.

  • Use of * in method signatures: As noted, almost all compile methods use * to separate student (positional) from the rest (keyword-only). This is a clear pattern for compile. The only exceptions:

    • BootstrapFinetune’s compile, which did not put a * before teacher and trainset in the older implementation. (Documentation suggests there might be a version that does, but the code we saw treats teacher as positional after student, which is unusual).
    • Ensemble.compile doesn’t use * simply because it has a single argument.

    This pattern – having the dataset and other settings be keyword-only – is generally followed and is good for clarity. The inconsistency in BootstrapFinetune is likely something to correct for uniformity.
  • Public Method Names (step vs compile): All these optimizers use a method named compile as the entry point to perform optimization, rather than something like step() or optimize(). The user question mentioned “methods such as step or optimize,” but in DSPy’s design it appears compile is the standard name (compiling a program with a teleprompter means optimizing it). None of the classes have a public method literally named step or optimize – they all stick to compile(). Internally, some have helper methods (_bootstrap_one_example, _train, etc.) but those are private. So there is consistency in using compile as the interface method, inherited from Teleprompter. The only slight oddity is Ensemble using compile in a non-learning sense, but still logically “compiling an ensemble program.”

  • Outlier Classes:

    • Ensemble is quite different in purpose (no metric, no trainset). It still fits the Teleprompter interface (taking programs and returning a program), but its parameter set (reduce_fn, deterministic, etc.) doesn’t overlap with others. It’s an idiosyncratic case included in the same module for convenience.

    • FinetuneTeleprompter is a base class that acts as an abstraction layer and is not typically exposed to end users; it does little on its own. The split itself is internally consistent: Teleprompter and FinetuneTeleprompter serve as abstract bases for the two families (prompt-based vs fine-tune-based optimizers). They share the interface but introduce different init params (none vs train_kwargs). A slight inconsistency is that the Teleprompter base has no init params while FinetuneTeleprompter does – but that’s due to fine-tuning needing configuration up front.

    • COPRO and MIPRO introduce parameter names not seen elsewhere (e.g., breadth, depth, auto, all the proposer flags). They were likely developed later to tackle prompt optimization more holistically. They still follow patterns like requiring trainset and using metric, but add their own twist. COPRO, for instance, doesn’t accept teacher or use max_rounds – instead it has depth for iterations of prompt proposals, essentially analogous but specific to its domain. MIPRO aggregates parameters from many others, making it quite an outlier in complexity.

  • Defaults and Range of Values: Many numeric defaults seem somewhat ad-hoc but within a small range:

    • 4 and 16 appear frequently (suggesting maybe at most 4 bootstrapped examples or 16 labeled examples as a reasonable default).
    • Max rounds default to 1 in bootstrap (a single iteration is often enough to get some improvement).
    • RandomSearch defaults to 16 candidate programs. In the code the loop is range(-3, num_candidate_sets), which for 16 yields seeds -3 through 15 inclusive (19 candidates in total), suggesting that the three special negative seeds are not counted toward num_candidate_programs.
    • Finetuning hyperparams default to typical values like 1 epoch, batch 12, lr 5e-5 – those mirror common practice in ML.
    • The auto="light" default in MIPRO suggests they wanted the safer, quicker configuration by default.

    The inconsistencies here are minor: some defaults simply don’t align (e.g., a user expecting MIPRO to default to the same 16 labeled demos as the simpler teleprompters would be surprised to find 4). Another example: LabeledFewShot’s k=16 matches BootstrapFewShot’s max_labeled_demos=16, but where BootstrapFewShot pairs 4 bootstrapped demos with 16 labeled demos, MIPRO defaults to 4 for both – possibly because it relies on iterative improvement rather than many demos.

  • Error handling and user interaction parameters: Some newer classes have parameters related to robustness:

    • max_errors is present in BootstrapFewShot and RandomSearch (to avoid infinite loops or crashes if too many errors occur). Others like Finetune don’t expose max_errors (though Evaluate inside might use a global max error).
    • MIPRO uses requires_permission_to_run to ensure the user is aware of resource cost; no other class does something like that (likely because MIPRO can be very expensive). This is a unique design consideration for an outlier.
    • provide_traceback is similarly only in MIPRO, aimed at debugging – indicating MIPRO expects potentially long runs where silent failures would be frustrating.
    • Ensemble asserts if deterministic=True because it’s not implemented, which is a bit user-unfriendly (they could have just not offered the parameter or documented that it’s a future feature). This is an idiosyncrasy in Ensemble’s interface (exposing a param that only throws an error if set True).

In summary, patterns include the consistent use of a compile method with student + keyword-only datasets/metrics, the presence of metric functions in most, and repeated use of parameters controlling how many examples to use or generate. Idiosyncrasies and inconsistencies include differences in keyword-only enforcement, slight naming mismatches (teacher_settings vs adapter vs separate model params), differences in default values for similar concepts, and the sheer divergence in complexity between simpler teleprompters (LabeledFewShot, BootstrapFewShot) and the complex ones (MIPRO, COPRO).

Each optimizer class was likely developed to extend functionality, which led to some divergence in interface. For example, COPRO and MIPRO added new kinds of parameters (depth, breadth, auto, etc.) that don’t appear in earlier classes, making the overall module less uniform.

Recommendations for Unifying the Interface

To improve consistency and usability across these teleprompter optimizers, we suggest the following changes:

  1. Enforce Keyword-Only for Key Parameters: Ensure that in all optimizers, important parameters like trainset, teacher, and other configuration options are keyword-only. This means adding * where it is missing (e.g., in BootstrapFinetune.compile to require naming trainset and teacher, and in any constructor where positional use could be confusing). A uniform rule could be: any optimizer method that takes a dataset or multiple optional settings should use keyword-only arguments beyond the program argument. This will prevent mistakes and make code more self-documenting.
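
    A sketch of the suggested change for BootstrapFinetune.compile (parameter list abridged and proposed, not the current code; valset is included only to match the base-class signature described earlier):

```python
class BootstrapFinetune:
    # Current behavior, per the code reviewed above: teacher/trainset can be
    # passed positionally after student.
    # Proposed: add '*' so call sites must name them, matching the other
    # teleprompters.
    def compile(self, student, *, trainset, teacher=None, valset=None):
        ...
```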

  2. Standardize Teacher Configuration: Unify the approach to teacher models across classes:

    • Always use a teacher argument in compile for providing an alternate program or LM for generating outputs (as is done in BootstrapFewShot, etc.), and consistently use a teacher_settings (or similarly named) parameter in the constructor to configure that teacher’s behavior. For fine-tuning, instead of introducing a separate adapter parameter, consider treating it analogously (e.g., a teacher_settings could include an adapter or fine-tune specific config). If that’s too abstract, at least rename adapter to something like finetune_adapter and document it as the analog of teacher settings but for fine-tune.
    • If prompt_model and task_model (as in MIPRO) are essentially playing roles of teacher vs student, clarify that or even rename them to teacher_model and student_model for consistency. Alternatively, provide a unified interface where Teleprompter base could accept something like teacher=... in init or compile that could be a model or program. Having multiple parameters (prompt_model, task_model, teacher) is confusing; consolidating where possible would help (e.g., maybe define that teacher can be either a full DSPy Program or a raw LM; if the latter, treat it as the model to generate prompts).
    • Essentially, reduce the terminology: decide on either “teacher” or specific terms, and use them consistently. If the role is to generate new prompts, maybe call it generator_model everywhere instead of prompt_model in one place and implicitly using teacher in another. Consistency in naming would reduce user confusion.
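
    One possible shape for such a unified interface, purely as a proposal sketch (none of this is current DSPy API):

```python
class UnifiedTeleprompter:
    def __init__(self, *, metric=None, teacher_settings=None):
        # teacher_settings: one place for everything that configures the teacher,
        # whether that is an LM temperature, a prompt-generation model, or a
        # fine-tuning adapter.
        self.metric = metric
        self.teacher_settings = teacher_settings or {}

    def compile(self, student, *, trainset, teacher=None, valset=None):
        # teacher: a DSPy program or a raw LM used to generate outputs/prompts.
        raise NotImplementedError
```
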
  3. Unify Metric Handling: Make sure the role of metric and metric_threshold is consistently implemented and documented:

    • If metric_threshold is supported in some optimizers (BootstrapFewShot, RandomSearch, MIPRO), consider supporting it in others that might benefit (or explicitly excluding it). At least document it uniformly. It might be useful in COPRO too (maybe to decide if a prompt is “good enough”). If it’s an advanced feature, ensure all classes that use metrics either accept metric_threshold or none of them do. As it stands, a user might not realize BootstrapFewShot accepts a metric_threshold because it wasn’t in official docs, which is a documentation inconsistency.
    • Similarly, if max_errors is a common safeguard, consider exposing it in all relevant optimizers (for example, COPRO and MIPRO do handle errors but not via a parameter; they rely on global settings or internal logic). It might be good to allow the user to set max_errors in MIPRO too for consistency, or state clearly that it uses the global dspy.settings.max_errors. Unifying this across classes (all teleprompters either take a max_errors or none do and it’s purely global) would avoid confusion.
  4. Align Default Values and Ranges: Review the default values for parameters that serve similar purposes and align them unless there’s a strong reason not to:

    • For example, the default max_labeled_demos in MIPROv2 is 4 whereas in BootstrapFewShot it’s 16. If 16 was found to be too high in practice, perhaps all classes should default to 4 for consistency (or vice versa if 16 is preferred for thoroughness). Choose one philosophy (fewer demos vs more) and apply it uniformly so users have a consistent expectation.
    • Likewise, ensure that if an optimization class is essentially a generalization of another, its defaults should not dramatically conflict. MIPROv2 is like a superset of BootstrapFewShot + COPRO; one would expect that if you use MIPROv2 in a “minimal” way, it might by default behave somewhat like a BootstrapFewShot (just with added capabilities). That could mean defaulting max_labeled_demos=16 as in BootstrapFewShot for a fair comparison, or at least documenting why it’s different.
    • Another default to align: LabeledFewShot’s k=16 vs BootstrapFewShot’s max_labeled_demos=16 (those match), but if any divergence occurs in future, keep them in sync.
    • If possible, use the same default num_threads behavior – e.g., default None meaning use dspy.settings.num_threads. Document that consistently so users know None implies some global or single-thread. Right now, it’s implied but not always explicitly stated in each class docs.
  5. Refine and Simplify Interfaces of Complex Classes: For very complex optimizers like MIPROv2 (and to a lesser extent COPRO), consider grouping some of the less commonly changed hyperparameters into a config object or using **kwargs to pass through to internal methods. As it stands, the compile signature of MIPROv2 is extremely long, which can be intimidating. Some ideas:

    • Group the proposer-related booleans into one structure or prefix them clearly. For example, instead of five separate flags, one could have a single proposers=dict(program_aware=True, data_aware=True, tip_aware=True, fewshot_aware=True) or similar. This way the signature is shorter and it’s clear they belong together. Or provide a simpler toggle that sets a combination of them (e.g., a mode for proposers). A minimal sketch of this grouping appears after this list.
    • The minibatch, minibatch_size, minibatch_full_eval_steps could perhaps be combined or managed by the auto mode. If auto is heavy, maybe always use full eval (minibatch=False). Document or enforce such relationships to reduce what the user must consider. If not grouping, at least document in one place how they interact (some of which the code does via errors).
    • Another approach: provide preset configurations for MIPRO (like how auto does) but maybe even expose them at a higher level rather than lots of individual args. For instance, an auto="heavy" sets many underlying defaults. Perhaps include in docs or interface something like MIPROv2.heavy() as an alternate constructor classmethod to preconfigure, etc. This doesn’t change parameters per se, but helps users not have to tweak each one. This is more of a usability suggestion beyond just parameter format.

    While these suggestions don’t unify across all classes (since simpler ones don’t need it), they do make the outlier interfaces easier to handle, which indirectly unifies the experience. A user switching from BootstrapFewShot to MIPROv2 wouldn’t want to worry about 10 new parameters if not needed; having reasonable defaults and grouping helps.
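
    As a concrete illustration of the grouping idea from the first sub-point above (a proposal sketch, not the current MIPROv2 interface; the flag names mirror the booleans discussed earlier):

```python
from dataclasses import dataclass

@dataclass
class ProposerConfig:
    program_aware: bool = True
    data_aware: bool = True
    tip_aware: bool = True
    fewshot_aware: bool = True

# compile(...) could then take a single argument instead of several flags:
#   teleprompter.compile(student, trainset=data,
#                        proposers=ProposerConfig(fewshot_aware=False))
```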

  6. Consistent Documentation and Naming: Ensure that the documentation (docstrings or user guides) for each optimizer class follows a consistent template:

    • List out positional and keyword-only arguments explicitly, and use the same terminology for similar things (e.g., always call them “bootstrapped demos” vs sometimes “augmented demos” etc., to avoid confusion).

    • If a parameter is effectively doing the same thing across classes, use the same name. For example, if we decide teacher_settings is the term, then perhaps adapter in BootstrapFinetune could be encompassed by teacher_settings as well (it could have keys for adapter vs others) or be renamed to something like finetune_settings. Right now the names teacher_settings, train_kwargs, and adapter all refer to configuration of the “optimization process or model” beyond just metric and data. A unified naming (maybe a generic config dict or breaking them into clearer categories) would help. For instance:

      • teacher_settings could be expanded to handle fine-tuning specifics (not ideal semantic fit), or
      • use train_kwargs for all cases of LM training/hyperparameters (so BootstrapFewShot might not need it, but FinetuneTeleprompter does, and maybe MIPRO could reuse train_kwargs for consistency instead of burying fine-tune params in compile).
    • The goal is that a user reading the docs doesn’t have to guess that “adapter” in one class serves a role analogous to “teacher_settings” in another. If they truly are different in nature, clarify that in docs or choose distinct naming that reflects purpose (e.g., lm_adapter vs teacher_lm_settings might clarify one is for fine-tuning method, one for prompting method).

  7. Unify Process Flow Where Possible: While not directly about parameters, making sure each Teleprompter clearly states its two main phases (if any) in a similar way could help unify understanding. For instance, all compile methods could follow a pattern in documentation: “Preprocess (e.g., prepare student/teacher), Optimize (via bootstrapping or search), Post-process (attach demos or fine-tune weights)”. If the interface and documentation emphasize these stages similarly, users can map parameters to each stage (e.g., max_rounds -> relates to optimization loop, exclude_demos -> relates to post-process). Right now, each class’s documentation is isolated; a unified narrative would make the parameter sets feel more coherent.

By implementing these recommendations, the teleprompter optimizers would have a more consistent interface. For example, a user could expect that every optimizer’s compile is called with student=... , trainset=... , teacher=... , valset=... (where relevant) without worrying about positional quirks, and that if they see a parameter like max_x_demos or num_threads, it means the same general concept across the board. It would reduce the learning curve when moving from one optimizer to another and lower the chance of misuse due to inconsistent conventions.