DSPy Optimizers – Parameter Structure Analysis (by deep research)
2025-07-17
One of my favorite things about deep research from OpenAI is that it was fine-tuned to produce long reports, so I like using it to produce long reports almost more than to do actual deep research. One thing I just discovered is that withholding the internet and instead giving it only a GitHub connector to a repository, then asking it to document that repository, is very effective at focusing it on writing a complete report about your code. On this page I am sharing the results I got from applying this to DSPy's optimizers. Fun fact: because deep research was fine-tuned to write longer output, it is quite a 'different' model from the others out there, and I like it for this application.
Summary Table
Below is an overview of each optimizer class in the `dspy.teleprompt` module, including their constructor (`__init__`) parameters and primary optimization method (usually `compile`), with a breakdown of positional vs. keyword-only arguments:
| Optimizer Class | `__init__` Parameters (positional vs. keyword-only) | Core Method & Parameters (positional vs. keyword-only) |
|---|---|---|
| Teleprompter (base class) | `__init__(self)` – no parameters (just `self`). | `compile(self, student, *, trainset, teacher=None, valset=None)` – student is positional; trainset is required keyword-only; teacher and valset are optional keyword-only. |
| LabeledFewShot | `__init__(self, k=16)` – one parameter k (int) with default 16 (may be given positionally or by name). | `compile(self, student, *, trainset, sample=True)` – student is positional; trainset required keyword-only; sample optional keyword-only (default True). |
| BootstrapFewShot | `__init__(self, metric=None, metric_threshold=None, teacher_settings={}, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, max_errors=5)` – all parameters have defaults (callable metric and optional metric_threshold for a success cutoff; teacher_settings dict for teacher LM config; numeric defaults for demos and rounds; max_errors tolerates errors). These can be passed as keywords (order is not enforced by `*` in the constructor). | `compile(self, student, *, teacher=None, trainset, valset=None)` – student positional; trainset required keyword-only; teacher optional (default None, keyword-only); valset optional keyword-only. |
| BootstrapFewShotWithRandomSearch | `__init__(self, metric, teacher_settings=None, max_bootstrapped_demos=4, max_labeled_demos=16, max_rounds=1, num_candidate_programs=16, num_threads=None, max_errors=None, stop_at_score=None, metric_threshold=None)` – extends BootstrapFewShot with additional parameters for random search. metric (callable) is required (no default); the others are optional (teacher_settings defaults to None; demo and round defaults as in BootstrapFewShot; num_candidate_programs controls the number of candidate prompt sets; optional num_threads for parallelism; max_errors default None uses the global setting; stop_at_score is an optional early-stopping threshold; metric_threshold is an optional filter threshold). | `compile(self, student, *, teacher=None, trainset, valset=None, restrict=None, labeled_sample=True)` – student positional; trainset required keyword-only; teacher optional keyword-only; valset optional keyword-only; restrict optional keyword-only (to restrict which candidate seeds to run); labeled_sample optional keyword-only (default True, whether to sample labeled demos in candidate generation). |
| Ensemble | `__init__(self, *, reduce_fn=None, size=None, deterministic=False)` – all arguments are keyword-only (enforced by `*`). reduce_fn is a function to combine outputs (e.g. majority vote), defaulting to None; size is an optional int to sample a subset of programs; deterministic is a bool (must be False for now, as deterministic mode is not implemented). | `compile(self, programs)` – takes a list of programs as a single positional argument. No trainset or metric is used here; the method returns an ensembled program that calls all (or a sampled subset of) the given programs and reduces their outputs. |
| FinetuneTeleprompter (base for fine-tuning optimizers) | `__init__(self, train_kwargs=None)` – one optional parameter, train_kwargs, which can be a dict of training arguments (or a dict mapping specific LM objects to their training args). Defaults to None (internally converted to a default dict). This base class doesn't implement compile itself (it inherits Teleprompter.compile, which raises NotImplementedError) – it is meant to be subclassed for fine-tuning behavior. | No direct compile method in this base class – subclasses implement the optimization logic. (It inherits the abstract compile signature from Teleprompter but does not override it, so it cannot be used standalone.) |
| BootstrapFinetune | `__init__(self, metric=None, multitask=True, train_kwargs=None, adapter=None, exclude_demos=False, num_threads=None)` – extends FinetuneTeleprompter. All arguments have defaults: metric (evaluation metric, default None), multitask (bool, True to fine-tune on combined data vs. per-predictor), train_kwargs (dict of training hyperparameters, default None), adapter (optional Adapter or mapping for fine-tuning, default None), exclude_demos (bool, default False, whether to clear prompt demos after fine-tuning), num_threads (int, default None to use the global default). These can be given as keywords or positionally (no `*` in the signature). | `compile(self, student, trainset, teacher=None, valset=None, target="t5-large", bsize=12, accumsteps=1, lr=5e-5, epochs=1, bf16=False, int8=False, peft=False, path_prefix=None)` – student and trainset are accepted as positional args (unlike the others, this method does not enforce keyword-only for trainset). teacher is optional (default None, can be passed by name); valset is optional (default None); and a series of fine-tuning hyperparameters are provided as keyword options with defaults (target model name, batch size bsize, gradient accumulation steps accumsteps, learning rate lr, epochs, flags for bf16, int8, and PEFT usage, plus path_prefix for saving checkpoints). In practice these are passed as keywords; the lack of `*` means trainset and teacher could technically be given positionally, which is an inconsistency in the interface. |
| COPRO (Co-Prompt Optimization) | `__init__(self, prompt_model=None, metric=None, breadth=10, depth=3, init_temperature=1.4, track_stats=False)` – all parameters have defaults. prompt_model is an LM used to generate prompt variations (defaults to the globally configured LM if None); metric is the evaluation metric (default None, meaning it will optimize without a specific metric filter unless provided); breadth (int) is how many new prompt candidates to generate per iteration (default 10); depth is how many iterations of prompt refinement to perform (default 3); init_temperature (float) controls prompt-generation randomness (default 1.4); track_stats (bool) controls whether to record optimization statistics (default False). | `compile(self, student, *, trainset, eval_kwargs)` – the student program is positional; trainset is required keyword-only; eval_kwargs is also required keyword-only (a dict of extra arguments for evaluation). There is no teacher parameter in this optimizer – instead it uses prompt_model internally to generate new instructions, and evaluates the student on trainset using the provided metric and eval settings. |
| MIPROv2 (Multi-Iteration Prompt Optimizer) | `__init__(self, metric, prompt_model=None, task_model=None, teacher_settings=None, max_bootstrapped_demos=4, max_labeled_demos=4, auto="light", num_candidates=None, num_threads=None, max_errors=None, seed=9, init_temperature=0.5, verbose=False, track_stats=True, log_dir=None, metric_threshold=None)` – a large number of parameters. Notably, metric is required (no default) – the primary evaluation metric. prompt_model and task_model are optional LM instances (if None, they default to the global settings for prompt generation and for executing the task, respectively). teacher_settings is an optional dict of LM settings for any teacher model usage (default None -> {}). max_bootstrapped_demos and max_labeled_demos default to 4 each (controlling how many few-shot examples of each type to use initially). auto can be "light", "medium", "heavy", or None, controlling an automatic configuration of search effort (default "light"). num_candidates (int, optional) specifies how many candidate prompt variations to generate (if auto is None, this must be set along with num_trials). num_threads is optional (for parallel eval, default None). max_errors is optional (max allowed errors during eval, default None to use the global setting). seed defaults to 9 (random seed for reproducibility). init_temperature (float) defaults to 0.5 for initial prompt variation. verbose (bool) defaults to False for logging. track_stats defaults to True to record detailed stats. log_dir is an optional path for logging. metric_threshold is an optional float to early-discard prompts below this score threshold. | `compile(self, student, *, trainset, teacher=None, valset=None, num_trials=None, max_bootstrapped_demos=None, max_labeled_demos=None, seed=None, minibatch=True, minibatch_size=35, minibatch_full_eval_steps=5, program_aware_proposer=True, data_aware_proposer=True, view_data_batch_size=10, tip_aware_proposer=True, fewshot_aware_proposer=True, requires_permission_to_run=True, provide_traceback=None)` – student is positional; all other parameters are keyword-only. trainset (list of examples) is required; teacher is optional (default None, a teacher program/LM for bootstrapping if needed); valset is optional (if provided, it is used for the evaluation phases). This method exposes many tuning knobs: num_trials (total search iterations, required if auto mode is None), the ability to override max_bootstrapped_demos/max_labeled_demos for this run, a seed (if not given, uses the seed from init), and several boolean flags controlling evaluation and the proposer strategies (minibatch evaluation vs. the full dataset, with minibatch_size and minibatch_full_eval_steps controlling how often a full evaluation is run; whether the prompt proposal is aware of the program structure, the data distribution, etc. via program_aware_proposer, data_aware_proposer, tip_aware_proposer, fewshot_aware_proposer – all True by default). view_data_batch_size (int, default 10) controls how much data a proposal sees at once. requires_permission_to_run (bool, default True) prompts the user before a potentially expensive run. provide_traceback (bool or None) toggles including stack traces in logged errors. All of these are meant to be supplied as keywords when needed (there is a `*` enforcing keyword-only) to fine-tune the search behavior. |
Table Legend: Positional parameters are those that must be supplied in order (or by name), before any `*`. Keyword-only parameters (shown after `*`) can only be supplied by name (and have default values if not marked required). Defaults are shown where applicable. Each class's core method (usually `compile`) is listed with its signature and the nature of its arguments.
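Since the positional/keyword-only split is enforced at call time, a quick way to double-check it against your installed DSPy version is to inspect the signatures directly. A minimal sketch (the exact output depends on the DSPy version):

```python
import inspect
from dspy.teleprompt import BootstrapFewShot, BootstrapFinetune

# Everything after the bare * is keyword-only, so trainset must be passed by name here...
print(inspect.signature(BootstrapFewShot.compile))
# e.g. (self, student, *, teacher=None, trainset, valset=None)

# ...whereas BootstrapFinetune.compile accepts student and trainset positionally (see below).
print(inspect.signature(BootstrapFinetune.compile))
```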
Detailed Method Argument Analysis
Below we provide a class-by-class breakdown of the constructor and primary method parameters, explaining each argument, default values, and usage conventions:
Teleprompter (Base Class)
- Constructor `Teleprompter.__init__`: Takes no arguments besides `self` (no parameters to configure). It's essentially an abstract base, so no initialization parameters are needed.
- Method `compile(self, student, *, trainset, teacher=None, valset=None)`: This is meant to be overridden by subclasses. It accepts a `student` program (the DSPy program to optimize) as a positional argument. The datasets are keyword-only:
  - `trainset` (required, list of `Example`): the training examples on which to optimize.
  - `teacher` (optional, default `None`): an optional teacher program used to guide optimization (if not provided, many optimizers default to using the student itself or an internal strategy).
  - `valset` (optional, default `None`): an optional validation set of examples to evaluate generalization or for early stopping.

  All parameters after `student` are marked with `*` in the signature, making them keyword-only for clarity. The base implementation raises `NotImplementedError` (since Teleprompter itself doesn't define a specific optimization strategy).
- Method `get_params(self)`: (Minor utility) Returns a dictionary of the Teleprompter's internal attributes (simply `self.__dict__`). This is a common interface to retrieve the configuration of any Teleprompter.
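To make the base interface concrete, here is a toy subclass (purely illustrative, not part of DSPy) that follows the documented convention: `student` positional, datasets keyword-only, and `get_params()` reporting the constructor state. The `deepcopy`/`predictors`/`with_instructions` calls are standard DSPy program APIs assumed here, not taken from the report above:

```python
import dspy
from dspy.teleprompt import Teleprompter

class AddInstructionSuffix(Teleprompter):
    """Toy optimizer: appends a fixed suffix to every predictor's instructions."""

    def __init__(self, suffix=" Answer concisely."):
        super().__init__()
        self.suffix = suffix

    def compile(self, student, *, trainset, teacher=None, valset=None):
        # Follow the base convention: never mutate the original student program.
        compiled = student.deepcopy()
        for predictor in compiled.predictors():
            predictor.signature = predictor.signature.with_instructions(
                predictor.signature.instructions + self.suffix
            )
        return compiled

optimizer = AddInstructionSuffix()
print(optimizer.get_params())  # {'suffix': ' Answer concisely.'}
```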
LabeledFewShot
- Constructor `LabeledFewShot.__init__(self, k=16)`: This optimizer's only parameter is `k` – the number of examples from the trainset to label (i.e. use as demonstrations) per predictor. It defaults to 16. This parameter is positional-or-keyword (not forced to keyword-only), so one could call `LabeledFewShot(10)` to use 10 examples, or `LabeledFewShot(k=10)`. The value of `k` sets an upper bound on how many examples will be taken from the training data to insert as prompt demonstrations.
- Method `compile(self, student, *, trainset, sample=True)`: Optimizes the given `student` program by attaching labeled examples to it:
  - `student` – the program to optimize (positional).
  - `trainset` – required keyword-only list of examples to draw demonstrations from.
  - `sample` – keyword-only bool (default `True`): if True, it randomly samples `min(k, len(trainset))` examples for each predictor in the student; if False, it simply takes the first `k` examples (in order) from the trainset.

  The `compile` method returns a new compiled program where each predictor in the student has up to `k` example demos in its prompt. If the `trainset` is empty, it returns the student unchanged. This optimizer does not use any "teacher" or iterative improvement – it's a one-step assignment of labeled data. All arguments after `student` are keyword-only, as indicated by the `*` in the signature.
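A minimal usage sketch based on the signature above; the model name and the toy QA data are placeholders, not taken from the DSPy docs:

```python
import dspy
from dspy.teleprompt import LabeledFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

program = dspy.ChainOfThought("question -> answer")

# k bounds the demos attached per predictor; trainset and sample must be passed by name.
compiled = LabeledFewShot(k=2).compile(program, trainset=trainset, sample=True)
```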
BootstrapFewShot
Constructor
BootstrapFewShot.__init__
: This optimizer automatically “bootstraps” new prompt demonstrations by having the program attempt the task and collecting successful outputs as examples. Its constructor accepts several parameters, all with defaults:metric
(callable, defaultNone
): A function to judge success on an example (takes e.g.(gold_example, prediction, trace)
and returns True/False or a score). IfNone
, any output is considered a success for bootstrapping purposes.metric_threshold
(float, defaultNone
): A score threshold for the metric – if provided, a prediction must meet or exceed this threshold to count as a successful example. (Ifmetric
is boolean-returning, this may not be used.) This parameter allows filtering which outputs become demonstrations.teacher_settings
(dict, default{}
): Settings to configure the behavior of the teacher model (e.g., a different language model or different decoding parameters). These settings (like temperature) will be applied to the teacher when generating outputs.max_bootstrapped_demos
(int, default 4): The maximum number of bootstrapped demos (new examples generated from the model itself) to add per predictor.max_labeled_demos
(int, default 16): The maximum number of labeled demos (original trainset examples) to use per predictor. This sets an upper bound on using ground-truth examples in addition to bootstrapped ones.max_rounds
(int, default 1): How many bootstrapping rounds to perform. Each round can attempt to gather new demos from the model’s outputs.max_errors
(int, default 5 in some implementations, orNone
): The maximum number of errors to tolerate during bootstrapping (e.g., if the student or teacher throws exceptions). If the number of errors exceeds this, the process will halt or raise. In some versions, if set to None, it may fall back on a global setting.
All these parameters have default values, meaning the constructor can be called with no arguments (it will bootstrap using default settings). They are not declared as keyword-only in the signature (no leading
*
in the__init__
), but in practice they are almost always passed by keyword for clarity.Method
compile(self, student, *, teacher=None, trainset, valset=None)
: This performs the bootstrapping process:student
– the program to optimize (positional). The student should initially be “uncompiled” (no demos attached).teacher
– optional keyword-only. If provided, this is a separate program or model to act as the “coach” producing outputs; if None, the student itself (or a copy) is used as the teacher by default. The teacher is typically a copy of the student (or a version with different settings) that generates candidate outputs.trainset
– required keyword-only list of examples for training. The teleprompter will run each example through the teacher (or student) to see if it can get a correct output.valset
– optional keyword-only list of examples for validation (default None). If provided, it may be used after bootstrapping to evaluate or select prompts (in the basicBootstrapFewShot
, it’s not heavily used; it often defaults to using any remaining train examples not successfully bootstrapped as a validation list).
Process: The compile method will:
- Make a fresh copy of the
student
(ensuring the original remains unchanged) and also prepare ateacher
copy. - If
max_labeled_demos > 0
and the teacher program isn’t already compiled with demos, it first uses aLabeledFewShot
teleprompter to supply up tomax_labeled_demos
ground-truth examples to the teacher (so the teacher starts with some baseline demos). - It then iterates through the
trainset
, using the teacher to generate predictions. For each example, if the prediction is “successful” according to themetric
(or if no metric provided), it will extract the input/output pair from the execution trace and add it as a new demo example (a bootstrapped demo) for the student’s corresponding predictor. - It stops once it has collected
max_bootstrapped_demos
successful demos or has exhausted the training data (or completedmax_rounds
passes). Any training examples not “bootstrapped” successfully may remain as a validation set. - Finally, it calls an internal
_train()
which assembles the final set of demos for each predictor: it takes the bootstrapped demos collected and, if there’s still room (up tomax_labeled_demos
total), it may fill in some of the original trainset examples as well. The resulting student (with demos attached) is marked as compiled and returned.
All arguments after
student
are keyword-only, enforcing calls liketeleprompter.compile(student=prog, trainset=data)
for clarity. This is consistent with the base Teleprompter signature. The presence of bothteacher_settings
in the constructor and an optionalteacher
in compile means you configure how the teacher behaves up front (e.g., use a different model or temperature via settings), and you can also supply a specific teacher program if desired at compile time.
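A usage sketch consistent with the signatures described above; the metric, data, and model name are toy placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

def exact_match(gold, pred, trace=None):
    # A bootstrapped trace only becomes a demo if the metric approves it.
    return gold.answer.strip().lower() == pred.answer.strip().lower()

optimizer = BootstrapFewShot(metric=exact_match,
                             max_bootstrapped_demos=2,
                             max_labeled_demos=4,
                             max_rounds=1)

# student is positional; trainset (and teacher/valset, if used) must be keywords.
compiled = optimizer.compile(dspy.ChainOfThought("question -> answer"), trainset=trainset)
```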
BootstrapFewShotWithRandomSearch
Constructor
BootstrapFewShotWithRandomSearch.__init__
: This class builds onBootstrapFewShot
to not only bootstrap demos but also perform a random search over multiple candidate prompt sets. It inherits fromTeleprompter
(and in newer versions, it extendsBootstrapFewShot
) and introduces additional parameters:metric
(callable, no default in signature): Similar to BootstrapFewShot, this is the evaluation metric. In this class,metric
is effectively required – the absence of a default indicates the user should supply one (the random search needs a way to compare programs). (If not provided, it might default to using the truthy evaluation of outputs if the metric function is None, but typically one provides a metric).teacher_settings
(dict, default None): Same role as in BootstrapFewShot – configuration for the teacher’s LM behavior. If None, an empty dict is used internally.max_bootstrapped_demos
(int, default 4),max_labeled_demos
(int, default 16),max_rounds
(int, default 1): Same meaning as in BootstrapFewShot (limits on demos and bootstrap iterations).num_candidate_programs
(int, default 16): The number of candidate programs (prompt configurations) to evaluate in the random search. This class will generate and test up to this many variations of prompts.num_threads
(int, default None): If set, this can be used to parallelize evaluation of candidates (e.g., number of threads for the Evaluate calls). If None, it might default to a global setting or single-threaded evaluation.max_errors
(int, default None): Maximum errors tolerated (similar to BootstrapFewShot; if None, use global setting). This applies during each candidate evaluation as well.stop_at_score
(float, default None): If provided, the search will stop early if it finds a candidate with a metric score greater or equal to this threshold.metric_threshold
(float, default None): A threshold applied during the bootstrapping phase for considering a trace successful (similar to BootstrapFewShot’s metric_threshold).
All these arguments have defaults except
metric
, and they are typically passed by keyword. In the code, none are forced keyword-only at init, but practically one would use keywords for clarity due to the number of parameters.Method
compile(self, student, *, teacher=None, trainset, valset=None, restrict=None, labeled_sample=True)
: This performs an extended random search on top of bootstrapping:student
– the program to optimize (positional).teacher
– optional keyword-only teacher program (default None) as in BootstrapFewShot.trainset
– required keyword-only training examples.valset
– optional keyword-only validation set (defaults to usingtrainset
if not provided, as seen in code whereself.valset = valset or trainset
).restrict
– optional keyword-only (default None). This can be used to restrict which candidate indices/seeds to run. Internally, this optimizer uses different random seeds (including some special values like -3, -2, -1 for baseline variants) to generate candidate prompt sets; therestrict
parameter can specify a subset of these seeds to actually evaluate (useful for debugging or partial searches).labeled_sample
– optional keyword-only bool (default True). This is passed into the LabeledFewShot step for the seed that uses labeled examples only. IfTrue
, it randomly samples labeled demos; ifFalse
, it takes the first examples (just as in LabeledFewShot’s compile).
Process: The compile method goes through a sequence of candidate evaluations (using different
seed
values to shuffle the trainset and vary the demos):It considers a set of candidate prompt configurations:
seed = -3
: a zero-shot baseline (no demos at all).seed = -2
: a baseline with labeled examples only (usesLabeledFewShot
to attach up tomax_labeled_demos
demos).seed = -1
: an “unshuffled” few-shot bootstrap (runs BootstrapFewShot with the trainset in given order).seed >= 0
: a number of random shuffles. For each seed from 0 up tonum_candidate_programs-1
, it shuffles a copy of the trainset and picks a random number of bootstrapped demos (between 1 andmax_bootstrapped_demos
) to gather, then runs BootstrapFewShot with those settings.
For each candidate, it uses
Evaluate
to compute the overall metric score on either thevalset
or training set for that compiled program. It keeps track of the scores.It applies adjustments for any assertion-based failures (specific to DSPy, e.g., if the program has internal assertion checks) – see the section subtracting for
_suggest_failures
and zeroing out if_assert_failures
.It identifies the best-scoring program and can stop early if
stop_at_score
was specified and achieved.Finally, it attaches a list of all candidate programs and their scores to the best program (
best_program.candidate_programs
) for reference, and returns the best program.
All parameters after
student
are keyword-only, aligning with the interface of BootstrapFewShot (trainset must be named, etc.). This optimizer’s interface is more complex, but the use of keyword-only helps avoid confusion when callingcompile
with many optional settings. One idiosyncrasy: thecompile
method itself uses the internalBootstrapFewShot
class for seeds -1 and >=0, thereby inheriting any parameters set in the constructor likemetric_threshold
orteacher_settings
and reusing them for each candidate search.
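A sketch of the random-search variant under the same kind of toy setup (placeholders throughout); the extra knobs map directly onto the parameters described above:

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

def exact_match(gold, pred, trace=None):
    return gold.answer.strip().lower() == pred.answer.strip().lower()

optimizer = BootstrapFewShotWithRandomSearch(
    metric=exact_match,           # required: the search needs a way to rank candidates
    num_candidate_programs=8,     # how many candidate prompt configurations to try
    max_bootstrapped_demos=2,
    num_threads=4,
)

# Without a valset, candidates are scored on the trainset (valset defaults to it).
compiled = optimizer.compile(dspy.ChainOfThought("question -> answer"),
                             trainset=trainset, labeled_sample=True)
```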
Ensemble
- Constructor `Ensemble.__init__(self, *, reduce_fn=None, size=None, deterministic=False)`: The Ensemble teleprompter does not deal with datasets or metrics at all – instead, it creates an ensemble from multiple programs. All its parameters are keyword-only (notice the leading `*` in the signature):
  - `reduce_fn` (callable, default None): A function that takes a list of outputs (one from each program in the ensemble) and reduces them to a single output. For example, DSPy provides `dspy.majority` to pick the most common answer, which is a typical choice for classification tasks. If `reduce_fn` is None, the ensemble's `forward` will return the list of all outputs.
  - `size` (int, default None): If set, the ensemble will randomly select `size` programs out of the provided list each time it is called, rather than using all programs. If None, it uses all programs each time.
  - `deterministic` (bool, default False): If True, the ensemble would aim to produce deterministic behavior (e.g., always pick the same subset for a given input). Currently, this is not implemented (the code asserts that `deterministic is False`).

  These parameters allow controlling how the ensemble combines multiple models' outputs. All must be passed by keyword, e.g., `Ensemble(reduce_fn=dspy.majority, size=5)`.
- Method `compile(self, programs)`: Instead of optimizing prompts, this teleprompter combines programs. The `programs` argument is a list of DSPy programs to ensemble, passed as a single positional argument. There are no trainset or metric arguments. The method returns a new `EnsembledProgram` (constructed internally) which, when called, will:
  - If `size` is specified, randomly sample that many programs from the list; otherwise use all programs.
  - Invoke each selected program's `__call__` (or `forward`) on the given inputs.
  - Collect their outputs, and then either apply the `reduce_fn` if provided or return the list of outputs as-is.

  The `compile` here is straightforward: it doesn't "learn" or modify the programs, just wraps them. Notably, there is no keyword-only enforcement in this signature, because it only takes one argument (`programs`). The usage is simply `ensemble_teleprompter.compile([prog1, prog2, ...])`. This class is an outlier in that it doesn't use any of the training data or metric infrastructure – it's purely a structural optimizer.
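A usage sketch for the ensemble; the three identical programs stand in for differently optimized variants, and `dspy.majority` is the reducer the report mentions. The model name is a placeholder:

```python
import dspy
from dspy.teleprompt import Ensemble

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

# In practice these would be differently compiled/optimized variants of one program.
programs = [dspy.ChainOfThought("question -> answer") for _ in range(3)]

# Constructor arguments are keyword-only; compile() just takes the list of programs.
ensembled = Ensemble(reduce_fn=dspy.majority, size=2).compile(programs)

prediction = ensembled(question="What is 2 + 2?")  # runs 2 sampled programs, majority-votes
```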
FinetuneTeleprompter (Base Class for Fine-tuning)
- Constructor `FinetuneTeleprompter.__init__(self, train_kwargs=None)`: This base class is designed for optimizers that fine-tune language model weights. It introduces a single configuration parameter:
  - `train_kwargs` (dict or dict-of-dicts, default None): Training arguments for fine-tuning. It can be one dictionary applied to all LMs, or a mapping from specific `LM` objects to their respective parameter dicts. For example, this might include the learning rate, number of epochs, etc. If None, it defaults to an empty configuration. Internally, the constructor converts this into a standard form (using `convert_to_lm_dict`) where each LM maps to its own settings (even if the same settings are used for all).

  This class does not take a metric in its constructor – fine-tuning often uses the training loss as an implicit metric, or a metric can be applied on a validation set externally. It primarily encapsulates how to call the underlying LM's fine-tune method. `FinetuneTeleprompter` doesn't implement a new `compile` itself – it relies on child classes to implement the strategy. After construction, it holds a `train_kwargs` mapping that will be used during fine-tune calls.
- No direct `compile` method: `FinetuneTeleprompter` inherits the abstract `compile` from Teleprompter but does not override it, so it can't be used on its own. Subclasses (like `BootstrapFinetune`) implement the actual compile logic. Essentially, `FinetuneTeleprompter` serves to store training configurations and provide utility methods (in the DSPy code, e.g., the `finetune_lms` static method in the newer implementation, or `convert_to_lm_dict`). Think of it as an abstract base similar to Teleprompter, but specifically for fine-tuning optimizers, ensuring they handle `train_kwargs` uniformly.
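A sketch of the two `train_kwargs` forms described above, as they would be passed to a fine-tuning subclass such as `BootstrapFinetune`. The hyperparameter names and model names here are illustrative assumptions, not a documented schema:

```python
import dspy
from dspy.teleprompt import BootstrapFinetune

# Form 1: one dict applied to every LM used by the program (keys are illustrative).
opt_shared = BootstrapFinetune(train_kwargs={"num_train_epochs": 1, "learning_rate": 5e-5})

# Form 2: a per-LM mapping, so each LM object gets its own training arguments.
lm_small = dspy.LM("openai/gpt-4o-mini")   # placeholder model names
lm_large = dspy.LM("openai/gpt-4o")
opt_per_lm = BootstrapFinetune(train_kwargs={
    lm_small: {"num_train_epochs": 2},
    lm_large: {"num_train_epochs": 1},
})
```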
BootstrapFinetune
Constructor
BootstrapFinetune.__init__
: This class combines bootstrapping with actual fine-tuning of an LM. It inherits fromFinetuneTeleprompter
. Its parameters are as follows:metric
(callable, default None): An optional metric function to evaluate model outputs (similar to other teleprompters). If provided, it can be used to judge which outputs are “successful” when bootstrapping data or to guide the selection of fine-tuning data. If None, all outputs might be considered or a default (like always True) is used.multitask
(bool, default True): Whether to fine-tune on all tasks/predictors jointly (True
) or separately (False
). Ifmultitask=True
, all data from all predictors might be combined to fine-tune a single model (or one model per unique LM); if False, it will fine-tune separate models for each predictor (the code sets data indices accordingly).train_kwargs
(dict or dict-of-LM dicts, default None): Passed to the base FinetuneTeleprompter to configure fine-tuning (learning rate, epochs, etc.). If a plain dict is given, the same settings apply to all language models; a more granular mapping can specify different hyperparameters per LM.adapter
(Adapter or dict of LMs to Adapter, default None): An optional specification of an adapter to use for fine-tuning (e.g., for parameter-efficient fine-tuning). If provided, this indicates which fine-tuning method or adapter to use for each LM. Internally converted to a dict mapping each LM to an Adapter (using a similar technique totrain_kwargs
).exclude_demos
(bool, default False): If True, after fine-tuning it will clear out any prompt demonstrations in the predictors (perhaps under the assumption that the model has learned from them and they are no longer needed). If False, it leaves any demos in place. In the code, after fine-tuning, they actually setpred.demos = []
ifexclude_demos
is True.num_threads
(int, default None): Number of threads for parallel fine-tuning jobs. If you have multiple predictors to fine-tune (e.g., multitask=False scenario or multiple LMs in a program), this sets how many can run in parallel. It defaults to None, which means use the global default (or 1 if not set).
All these parameters have defaults, so you can call
BootstrapFinetune()
with none, and it will use a multitask approach with whatever global LM is configured. The signature does not enforce keyword-only, but given the number of parameters, using keywords is strongly recommended for clarity (e.g.,BootstrapFinetune(metric=my_metric, epochs=2)
etc., thoughepochs
would actually go insidetrain_kwargs
in this design).Method
compile(self, student, trainset, teacher=None, valset=None, target="t5-large", bsize=12, accumsteps=1, lr=5e-5, epochs=1, bf16=False, int8=False, peft=False, path_prefix=None)
: This is a two-phase optimizer: it first bootstraps prompt examples, then fine-tunes the model on those examples. Its signature is notably different in that it does not strictly requiretrainset
to be passed as a keyword (there is no*
beforetrainset
in the current implementation’s signature, meaningstudent
andtrainset
could be given positionally). However, to avoid confusion, it’s often called with keywords for clarity. The parameters are:student
– the program to optimize (positional).trainset
– the list of examples to train on (positional or keyword). These will be used both for bootstrapping prompts and as the fine-tuning dataset.teacher
– optional (default None). A teacher program or list of programs. If provided, those will be used to bootstrap examples; if None, it will issue a warning that it’s using an uncompiled student as teacher. Often, one might pass a copy of the student or a differently configured model as the teacher for the bootstrap step.valset
– optional validation set (default None). Not extensively used inside the compile method for Bootstrapping (the code primarily usestrainset
for bootstrapping and doesn’t explicitly usevalset
in fine-tuning, though it could be used to evaluate during training or after).Fine-tuning hyperparameters: These are all optional with defaults, and they mirror typical HuggingFace/transformers fine-tuning settings:
target
(str, default"t5-large"
): The model name or identifier to fine-tune. This class may instantiate a fresh model of this type for fine-tuning or use it as an identifier to save the fine-tuned weights.bsize
(int, default 12): Batch size for fine-tuning.accumsteps
(int, default 1): Gradient accumulation steps.lr
(float, default 5e-5): Learning rate for fine-tuning.epochs
(int, default 1): Number of fine-tuning epochs.bf16
(bool, default False): Whether to use bfloat16 precision.int8
(bool, default False): Whether to use int8 quantization for fine-tuning (likely requires an adapter that supports it).peft
(bool, default False): Whether to use a PEFT (Parameter-Efficient Fine Tuning) method (like LoRA). If True, the fine-tuning will use an adapter method rather than full model tuning.path_prefix
(str, default None): An optional prefix path for saving fine-tuned model checkpoints. If provided, the fine-tuned model weights are saved under this path with a generated name.
The compile process is as follows:
- Bootstrap Phase: It uses an internal
self.teleprompter
, which is aBootstrapFewShot
instance configured in__init__
(withmax_bootstrapped_demos
very high andmax_labeled_demos=0
by default in some implementations), to compile the student (or teacher) with bootstrapped demonstrations. Essentially, it generates a set of demonstrations by running the teacher (or student) on the trainset and collecting successful outputs (using the givenmetric
if provided). This yields a compiled program with demos. - It then prepares fine-tuning data: for each predictor in the compiled program, it takes all the demos (input-output pairs) and formats them into prompt-completion training examples appropriate for the language model fine-tuning. The code constructs prompt text and target text from each demo using the predictor’s signature/template, accumulating them in a list.
- It shuffles the fine-tuning data and writes it to disk as a
.jsonl
file (or multiple files if multitask vs per-predictor). - Fine-tuning Phase: It invokes a fine-tuning routine (likely
finetune_hf
for HuggingFace models) on the prepared data for the specifiedtarget
model, with the given hyperparameters (batch_size
,epochs
,lr
, etc.). This produces fine-tuned model checkpoint(s). - It loads these fine-tuned weights into the student’s predictors – replacing their
lm
with the fine-tuned model(s). Ifmultitask=True
, typically one model is fine-tuned for all (assuming a shared LM); if False, each predictor might get its own fine-tuned model. The code ensures the structure matches and assigns the new LMs. - If
exclude_demos=True
, it clears thedemos
for each predictor (since the model is now supposed to handle the task without needing prompt examples). - The method marks the program as compiled and returns the fine-tuned compiled program.
Key points: The
trainset
here is used both to bootstrap examples and to generate the fine-tuning dataset, effectively turning successful model outputs into training data (this is a form of self-training). The presence of both metric-based bootstrapping and actual gradient descent is unique to this optimizer. The interface inconsistency is thattrainset
is not forced to keyword-only (likely an oversight), whereas most others require naming it. Best practice is to call it asteleprompter.compile(student, trainset=..., teacher=..., epochs=..., lr=..., ...)
for clarity. All the fine-tuning hyperparameters are keyword-only by position (they come after the required args and*
in the function definition), meaning in code you must call them as named arguments (which is natural for these settings).
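A calling-convention sketch for BootstrapFinetune (placeholders throughout). The fine-tuning hyperparameters differ across DSPy versions (the older interface above puts them on `compile`, while newer ones take `train_kwargs` in the constructor), so this sketch only passes the arguments common to both:

```python
import dspy
from dspy.teleprompt import BootstrapFinetune

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

def exact_match(gold, pred, trace=None):
    return gold.answer.strip().lower() == pred.answer.strip().lower()

optimizer = BootstrapFinetune(metric=exact_match, multitask=True, exclude_demos=True)

# trainset is NOT keyword-only here, but naming it keeps the call consistent
# with the other optimizers (and with the recommendation above).
finetuned = optimizer.compile(dspy.ChainOfThought("question -> answer"), trainset=trainset)
```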
COPRO (Co-Prompt Optimizer)
Constructor
COPRO.__init__
: COPRO aims to optimize the instructions in a prompt by iterative generation and testing. Its parameters:prompt_model
(LM client, default None): The language model used to propose new instructions. If None, the system likely defaults to the same model as the student (or whatever is set in global settings). By providing a separateprompt_model
, you could use a larger or more creative model to generate prompt variants while using a differenttask_model
(the student) for execution.metric
(callable, default None): The metric to evaluate the student’s performance. If None, COPRO can still run, but it might not have a quantitative way to compare prompts – in practice, a metric should be supplied so it can choose the best prompt.breadth
(int, default 10): The number of new prompt candidates to generate at each iteration (each “depth”). Essentially, in each round COPRO will produce this many alternative instructions via theprompt_model
.depth
(int, default 3): The number of iterations (rounds of prompt generation and evaluation) to perform. A depth of 3 means it will generate new instructions 3 times, each time possibly building on or replacing previous ones.init_temperature
(float, default 1.4): The temperature setting for the prompt generation model in the initial generation round (higher temperature means more randomness/creativity). This influences the diversity of prompts generated. In the code, this temperature might be used forprompt_model
when sampling instructions.track_stats
(bool, default False): Whether to collect statistics about the optimization process. If True, COPRO will record details such as the distribution of scores for prompts at each iteration (min, max, avg, std of top prompts, etc.). These stats would be stored in attributes likeresults_best
,results_latest
, etc., on the returned program for analysis.
All of these parameters are keyword-only by design (note the
*,
in the__init__
signature in code) – meaning you must call, for example,COPRO(metric=..., breadth=20)
. This enforces clarity given the number of optional arguments.Method
compile(self, student, *, trainset, eval_kwargs)
: COPRO’s compile differs from previous ones in that it doesn’t attach demos or fine-tune weights, but instead alters the prompt instructions of the student’s predictors. Parameters:student
– the program to optimize (positional). This program likely contains one or more predictors with an instruction (prompt template) that we want to improve.trainset
– required keyword-only list of examples. These will be used to evaluate the quality of instructions. Essentially, for each candidate prompt, COPRO will run the student on the trainset and measure performance.eval_kwargs
– required keyword-only dict of arguments for evaluation. This is passed to DSPy’sEvaluate
to evaluate the student on the trainset. For example,eval_kwargs
might specifynum_threads
for parallel evaluation ordisplay_progress
flags. It’s mandatory to provide (the code does not have a default), ensuring the user is explicit about how to evaluate (e.g.,eval_kwargs={"display_progress": False}
or with specific settings).
Process: In simplified terms, COPRO will:
Make a deepcopy of the
student
to work on (so as not to modify the original mid-process).Evaluate the initial student on the trainset to get a baseline score (not explicitly shown in snippet, but likely done implicitly as part of loop or for stats tracking).
For each iteration (up to
depth
):Use the
prompt_model
to generatebreadth
new candidate instructions for each predictor. The generation likely uses one of two Signature classes defined in the code:BasicGenerateInstruction
if it’s the first round (which just takes the original instruction and asks for an improved one).GenerateInstructionGivenAttempts
if it’s after the first round (which provides some of the previously tried instructions and their scores to the prompt model, so it can propose a better one).
For each predictor in the student program, replace its instruction with each of the candidate instructions one at a time and evaluate the program on the trainset using the metric (via
Evaluate
witheval_kwargs
).Track the performance of each candidate. If
track_stats
is True, record the stats of these candidates (min, max, etc.).Possibly filter out duplicate or very similar instructions (the code has
_drop_duplicates
to eliminate repeated candidates that yield the same results).Select the top-performing instruction(s) to carry forward. Likely it keeps the best one as the new base instruction (and possibly uses others for context in subsequent rounds).
Repeat for the specified number of depths. By the end, ideally, the student’s predictors have improved instructions that yield better metric performance on the trainset.
Return the optimized program (with its instruction updated to the best found). If
track_stats
was True, the returned program might have attributes likeresults_best
andresults_latest
containing the recorded statistics.
All parameters after
student
are keyword-only, so one would callteleprompter.compile(student=prog, trainset=data, eval_kwargs=eval_args)
. The absence of ateacher
parameter here is notable – COPRO doesn’t use a separate teacher model to generate outputs for evaluation; instead, it uses a separateprompt_model
to generate prompts (instructions), and the original program (or its LM, possibly configured viateacher_settings
if any) to evaluate those prompts. Essentially, COPRO is searching in prompt/instruction space, guided by metric evaluations on the trainset.
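A usage sketch for COPRO under a toy setup (model names, metric, and data are placeholders); note that `eval_kwargs` is required and is forwarded to DSPy's `Evaluate`, with the keys shown being the ones the report mentions:

```python
import dspy
from dspy.teleprompt import COPRO

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder task model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

def exact_match(gold, pred, trace=None):
    return gold.answer.strip().lower() == pred.answer.strip().lower()

optimizer = COPRO(
    prompt_model=dspy.LM("openai/gpt-4o"),  # placeholder: a stronger model to write instructions
    metric=exact_match,
    breadth=5,   # candidate instructions generated per round
    depth=2,     # refinement rounds
)

compiled = optimizer.compile(dspy.ChainOfThought("question -> answer"),
                             trainset=trainset,
                             eval_kwargs={"num_threads": 4, "display_progress": False})
```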
MIPROv2
Constructor
MIPROv2.__init__
: MIPRO (“Mixed Initiative Prompt Optimization”, perhaps) is one of the most complex teleprompters, combining few-shot bootstrapping, instruction proposal, and hyperparameter search. Its initialization has many parameters, mostly optional, to cover various aspects of the search:metric
(callable, required): The evaluation metric to maximize. Unlike many others, MIPROv2 does not defaultmetric
to None – you must provide a metric function. This makes sense given the complexity: it needs a quantitative measure to drive the optimization.prompt_model
(LM, default None): Similar to COPRO, an optional separate model used to propose instructions or other prompt components. If None, defaults to the globally configured LM (or the student’s LM).task_model
(LM, default None): If the student program uses a particular LM,task_model
can override or specify it. If None, it usesdspy.settings.lm
(the globally configured default LM) as the model to actually run the task. Essentially,task_model
is the model that executes the prompts (the “student’s LM”), andprompt_model
is the model that generates new prompt candidates; they could be different.teacher_settings
(dict, default None): Similar to earlier teleprompters, this can hold settings for any teacher or evaluation model usage. MIPRO does some bootstrapping internally, so this could configure how that’s done. Internally, if None, it stores as an empty{}
.max_bootstrapped_demos
(int, default 4): The initial number of bootstrapped few-shot examples to gather (per predictor) for use in prompts.max_labeled_demos
(int, default 4): The initial number of labeled (ground-truth) examples to include per predictor. (Notice this default is 4, smaller than the 16 used in simpler teleprompters, possibly to limit scope for the automated search).auto
(Literal “light”/“medium”/“heavy” or None, default “light”): This is a high-level switch to configure how exhaustive the search should be. If set to “medium” or “heavy”, the teleprompter will automatically set or override other parameters (like number of trials, etc.) to spend more effort. Ifauto=None
, the user must manually specify certain parameters (likenum_trials
). The allowed values are enforced; any other string would raise an error.num_candidates
(int, default None): The number of candidate solutions (e.g., prompt combinations) to consider in the search. Ifauto
is None, this must be provided (along withnum_trials
) or an error is raised. Ifauto
is set,num_candidates
should not be provided (it would be overridden by the auto settings).num_threads
(int, default None): Number of threads for parallel operations (like evaluation). If None, falls back to global setting.max_errors
(int, default None): Max errors to tolerate; if None, use global setting (similar usage as before).seed
(int, default 9): Random seed for reproducibility. Used for shuffling and any stochastic decisions.init_temperature
(float, default 0.5): Initial temperature for any prompt generation or sampling (lower than COPRO’s default, implying more conservative generation).verbose
(bool, default False): If True, provides more logging info during the process.track_stats
(bool, default True): Whether to collect and store statistics of the optimization (like how COPRO does). By default True, so it will track performance of trials, etc.log_dir
(str, default None): If provided, the directory path to save logs or intermediate results (like candidate programs, evaluations).metric_threshold
(float, default None): Similar to earlier, a threshold for the metric to perhaps prune or consider a trial successful. If set, any candidate with metric below this might be discarded or considered failing.
The constructor sets a lot of these into internal attributes and does some validation: e.g., ensures if
auto
is not None, the user hasn’t also setnum_candidates
ornum_trials
(to avoid conflict), and ifauto
is None, then bothnum_candidates
andnum_trials
must be specified by the user. It also immediately convertsteacher_settings
to an empty dict if None and assigns default models ifprompt_model
ortask_model
are None. All parameters exceptmetric
have defaults, but given their number, they are meant to be given by keyword (the signature includes no*
here, but practically one would hardly pass 15 args positionally in order). The ordering placesmetric
first (required), then the two models, then other settings.Method
compile(self, student, *, trainset, teacher=None, valset=None, num_trials=None, max_bootstrapped_demos=None, max_labeled_demos=None, seed=None, minibatch=True, minibatch_size=35, minibatch_full_eval_steps=5, program_aware_proposer=True, data_aware_proposer=True, view_data_batch_size=10, tip_aware_proposer=True, fewshot_aware_proposer=True, requires_permission_to_run=True, provide_traceback=None)
: This signature is expansive, but all arguments afterstudent
are keyword-only (enforced by the*
). Here’s what they mean:student
– the program to optimize (positional).trainset
– required keyword-only list of examples to train/optimize on.teacher
– optional keyword-only (default None). If provided, used during the bootstrap of few-shot examples (similar to BootstrapFewShot’s teacher). If None, the student (or rather itstask_model
) is used to bootstrap itself.valset
– optional keyword-only list of examples for validation (default None). MIPRO uses a validation set to evaluate candidate prompts (distinct from trainset if provided) and for final evaluation of each trial. If not provided, it may split the trainset or use part of it for validation implicitly.num_trials
– optional keyword-only (int). The number of search trials to run. Ifauto
is None, this must be set (and should correspond roughly tonum_candidates
and the effort desired). Ifauto
is “light”/“medium”/“heavy,
num_trials` will be determined internally (and providing it will raise an error).max_bootstrapped_demos
,max_labeled_demos
– optional ints to override the defaults for this compile run. If provided, they will update the internalmax_bootstrapped_demos
/max_labeled_demos
before running. Otherwise, it uses the values from the constructor (which might have been set via auto mode).seed
– optional int to override the random seed for this run (if not provided, uses the seed from init). This allows one to repeat the search with different seeds or ensure reproducibility.minibatch
(bool, default True): Whether to use minibatch evaluation when scoring prompts. If True, and the validation set is large, MIPRO will evaluate in batches rather than all at once (to speed up or simulate iterative evaluation). If False, it evaluates on the fullvalset
every time.minibatch_size
(int, default 35): The number of examples to use in each minibatch evaluation ifminibatch
is True. It will evaluate candidate programs on chunks of this many examples and possibly use an average or intermediate pruning strategy.minibatch_full_eval_steps
(int, default 5): If using minibatch mode, this could indicate how frequently (in terms of trial count or iterations) a full evaluation on the entirevalset
is done, or how many minibatch steps constitute a “full” eval for logging. (This parameter’s use is a bit advanced; it might define after how many partial batches to do a full evaluation or something similar.)The next several are boolean flags controlling proposers – these determine what aspects of the prompt the algorithm is allowed to propose changes for:
program_aware_proposer
(default True): If True, the optimizer will propose modifications aware of the program’s structure (likely meaning it can consider changes to instructions in context of entire program).data_aware_proposer
(default True): If True, proposals might take into account the data distribution or particularities of examples (perhaps by examining some examples during instruction proposals).view_data_batch_size
(int, default 10): Possibly the number of examples the proposers can look at at once when generating suggestions (if data-aware).tip_aware_proposer
(default True): “Tip” could refer to a part of prompt (like a prefix or a suffix). If True, the proposer can adjust the “tip” (maybe the output field prefix or few-shot separators).fewshot_aware_proposer
(default True): If True, the proposer can adjust few-shot examples or how they’re used (since MIPRO also handles bootstrapped demos).
requires_permission_to_run
(bool, default True): If True, the compile will prompt the user for confirmation before running a potentially expensive search (especially in heavy mode). If set to False, it will run to completion without interactive confirmation.provide_traceback
(bool or None, default None): If True, any errors encountered might include tracebacks in the logs; if False, suppress tracebacks; if None, use a default setting (perhaps false). This is mainly for debugging if something goes wrong during evaluation, which can be helpful whenverbose
logging.
Process: MIPROv2’s compile is very comprehensive. Summarizing:
Few-shot Bootstrapping: It likely begins by ensuring the student has some initial demos. There is a call
demo_candidates = self._bootstrap_fewshot_examples(program, trainset, seed, teacher)
which presumably usesmax_bootstrapped_demos
andmax_labeled_demos
to produce a set of demonstration candidates (similar to BootstrapFewShot but perhaps generating multiple sets).Instruction Proposal: Then it calls
_propose_instructions(...)
which uses theprompt_model
to propose new instructions, possibly taking into account the current program, the data, and the demo candidates. The parameters likeview_data_batch_size
,program_aware_proposer
, etc., influence this step – e.g., it might generate instructions while seeing a batch ofview_data_batch_size
examples or not.If zero-shot optimization is indicated (no demos allowed,
zeroshot_opt
), it may discard demos to focus purely on instructions.Prompt Parameter Optimization: It then calls
_optimize_prompt_parameters(...)
– this likely orchestrates the main search over trials (num_trials
). In each trial, it might:- Choose a set of demos (from
demo_candidates
, possibly none if zero-shot) and an instruction (frominstruction_candidates
proposed) to form a candidate program (a specific configuration of prompts). - Evaluate that program on the
valset
using the metric (the code uses anEvaluate
instance for thevalset
with the given metric and threads). - Use something like Optuna (since the code imports
optuna
if available) to intelligently choose the next combination of parameters to try (the “Bayesian” or guided search aspect). - Possibly prune low-performing trials early (since the code has integration for pruning via intermediate minibatch evaluation).
- Repeat until
num_trials
are done or the search converges.
- Choose a set of demos (from
It likely uses the
auto
setting to determinenum_trials
and possibly adjustminibatch
usage. For example, “heavy” auto might set a large number of trials and larger validation set size.If
requires_permission_to_run=True
, before starting the full search, it will print an estimate of how many LM calls or how long it might take and prompt the user to continue. If the user declines, it aborts and returns the original student unchanged.Throughout, it tracks the best program found. At the end, it returns the optimized program (with improved instructions and possibly with selected demos attached). It also attaches logs like
trial_logs
containing the score of each trial and the parameters used, as well as possibly storing instudent._compiled = True
.
The key feature of MIPROv2 is that it integrates multiple dimensions: it can optimize the instruction text (like COPRO), the selection of few-shot examples (like BootstrapFewShot), and even other prompt parameters (e.g., it might experiment with presence or absence of demos – that’s why it has both
fewshot_aware_proposer
and code logic for zero-shot vs few-shot). It effectively generalizes and combines ideas from the simpler teleprompters. Because of this, its interface is the most complex. All those boolean flags allow turning on/off certain aspects of the search:- e.g., one could run it with
program_aware_proposer=False
to ignore program structure differences when proposing instructions, orminibatch=False
to always evaluate on full validation set (safer but slower).
As with other teleprompters,
trainset
and other main parameters are keyword-only to prevent mix-ups. Thecompile
method is clearly intended to be called with named arguments for anything beyond the basics (e.g.,teleprompter.compile(student=prog, trainset=data, valset=dev, num_trials=50, fewshot_aware_proposer=False, requires_permission_to_run=False)
). The consistency in using keyword-only here is welcome given how many tuning knobs exist.
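A usage sketch for MIPROv2 in its `auto` mode, with placeholders for the model, metric, and data; `requires_permission_to_run=False` skips the interactive cost-confirmation prompt described above:

```python
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder task model

trainset = [dspy.Example(question=q, answer=a).with_inputs("question")
            for q, a in [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]]

def exact_match(gold, pred, trace=None):
    return gold.answer.strip().lower() == pred.answer.strip().lower()

# auto="light" chooses the search budget; set auto=None and pass num_trials/num_candidates
# yourself if you want manual control over the search effort.
optimizer = MIPROv2(metric=exact_match, auto="light", num_threads=4)

compiled = optimizer.compile(
    dspy.ChainOfThought("question -> answer"),
    trainset=trainset,
    requires_permission_to_run=False,
)
```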
Patterns and Idiosyncrasies
Examining all these optimizers, we can observe several patterns in how parameters are structured, as well as some inconsistencies or outliers:
Common Structure – “compile” with trainset: Almost every optimizer uses a
compile(student, *, ... trainset ..., ...)
method to perform the optimization on a given program and dataset. Requiringtrainset
as a keyword-only argument is a common design (seen in Teleprompter base, LabeledFewShot, BootstrapFewShot, RandomSearch, COPRO, MIPRO). This pattern enforces clarity that a training set must be provided and avoids accidental swapping of positional arguments. An inconsistency here is BootstrapFinetune, whosecompile
signature does not enforce keyword-only fortrainset
(it takesstudent, trainset
positionally). This makes BootstrapFinetune stand out as allowingcompile(prog, data)
without namingtrainset
, whereas others would requirecompile(prog, trainset=data)
. It’s likely an oversight in that implementation because the conceptual pattern is thattrainset
should be keyword-only for all.Positional vs Keyword-only in Constructors: The base classes (Teleprompter, FinetuneTeleprompter) and some simple ones have very few parameters and thus no need for keyword-only in
__init__
. E.g., Teleprompter and FinetuneTeleprompter have none or one parameter and don’t use*
. But Ensemble explicitly uses*
to force its three parameters (reduce_fn, size, deterministic
) to be keyword-only in the constructor. This is a design choice to improve readability: callingEnsemble(size=3, reduce_fn=majority)
is self-documenting, versus relying on positional order. Other optimizers like BootstrapFewShot, BootstrapFewShotWithRandomSearch, BootstrapFinetune, COPRO, MIPROv2 did not enforce*
in their__init__
, despite having many parameters. This means in theory one could callBootstrapFewShot(None, {}, 4, 16, 1)
positionally, but that would be very unclear. In practice, users likely callBootstrapFewShot(metric=my_metric, max_rounds=2, ...)
. The lack of uniform use of keyword-only in constructors is an inconsistency. A pattern is that newer or more user-facing classes (Ensemble, perhaps MIPRO if it was considered user-facing) lean towards keyword-only for clarity, whereas older classes did not enforce it.Parameter Naming Conventions:
**Parameter Naming Conventions:** Most classes use `trainset` and (optionally) `valset` consistently to refer to data; this is uniform across optimizers. The use of `teacher` vs. `teacher_settings`, however, is a bit confusing across classes:

- BootstrapFewShot and RandomSearch have a `teacher_settings` in the constructor (for LM config) and a `teacher` argument in `compile` (for an actual program instance).
- BootstrapFinetune similarly takes an `adapter` (a similar concept to teacher settings, but specific to fine-tuning) in the constructor and a `teacher` in `compile`.
- MIPROv2 uses `teacher_settings` in the constructor (to adjust the teacher LM) and `teacher` in `compile`.
- LabeledFewShot and Ensemble do not involve a teacher at all.
- COPRO does not have a `teacher` parameter either; instead it has `prompt_model` and uses the student’s own execution for evaluation.

Inconsistency arises in naming: e.g., BootstrapFewShotWithRandomSearch reuses `teacher_settings` from its parent and has `teacher` in `compile`, whereas FinetuneTeleprompter/BootstrapFinetune introduced a separate concept of `adapter` and `train_kwargs` for fine-tuning. These serve a similar role (configuring how the “teaching” or training is done) but under different names. Also, in MIPROv2 there is both `teacher_settings` and a `teacher` argument, plus separate `prompt_model` and `task_model`. This can be conceptually hard to follow: `teacher` generally means an alternate DSPy program or LM used to generate outputs for bootstrapping; `teacher_settings` means a dictionary of parameters to apply to whichever model is acting as teacher (like setting its temperature or max tokens); `prompt_model` is an LM used for generating new prompt text (distinct from the task); `adapter` in fine-tuning is an object encapsulating how to fine-tune (distinct from anything in the non-finetune classes). Ideally, the interface could be cleaner if, for example, every Teleprompter had a `teacher` argument in `compile` (for a program or LM) and possibly a unified way to specify how that teacher should behave (maybe always via `teacher_settings`). Currently it’s partly unified (teacher + teacher_settings) in the bootstrap classes, but fine-tune adds `adapter`, and COPRO/MIPRO add `prompt_model` separately. This is an area of inconsistency in naming and usage; the mapping below collects who takes what.
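To keep the names straight, here is the mapping described above collected in one place – as plain data, not an API; it simply restates the report’s observations:

```python
# Which "teacher-ish" configuration knob each class exposes, per the discussion above.
# Documentation-as-data only; not dspy API.
teacher_knobs = {
    "BootstrapFewShot":                 {"init": ["teacher_settings"], "compile": ["teacher"]},
    "BootstrapFewShotWithRandomSearch": {"init": ["teacher_settings"], "compile": ["teacher"]},
    "BootstrapFinetune":                {"init": ["adapter", "train_kwargs"], "compile": ["teacher"]},
    "MIPROv2":                          {"init": ["teacher_settings", "prompt_model", "task_model"], "compile": ["teacher"]},
    "COPRO":                            {"init": ["prompt_model"], "compile": []},
    "LabeledFewShot":                   {"init": [], "compile": []},
    "Ensemble":                         {"init": [], "compile": []},
}
```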
**Metric and Threshold:** Every optimizer that evaluates outputs uses a `metric` parameter name for the evaluation function. This is consistent. Some optimizers (BootstrapFewShot, RandomSearch, MIPRO) also use `metric_threshold` as an optional cutoff for success. The concept of `metric_threshold` is not present in others like Finetune or COPRO (COPRO could theoretically use it but doesn’t expose it; Finetune focuses on loss). The inconsistent part is documentation vs. implementation: e.g., the official docs for BootstrapFewShot did not list `metric_threshold` or `max_errors`, yet the code and random search clearly use them. This indicates either a new feature that wasn’t documented or a parameter considered more internal. As a pattern, many classes allow a `None` metric to mean “no filtering, just optimize blindly,” and some threshold to refine what “success” means (see the sketch below).
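A hypothetical metric and “is this demo a success?” check illustrating the convention. DSPy metrics are conventionally callables over an example and a prediction, but the exact filtering each optimizer applies differs, so this is only a sketch:

```python
# Hypothetical metric plus threshold check – a sketch of the convention, not the
# actual filtering code inside any optimizer.

def f1_metric(example, prediction, trace=None) -> float:
    gold = set(example["answer"].lower().split())
    pred = set(prediction["answer"].lower().split())
    if not gold or not pred:
        return 0.0
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def demo_is_successful(example, prediction, metric=f1_metric, metric_threshold=None):
    """No threshold: any truthy score counts. With a threshold: the score must clear it."""
    score = metric(example, prediction)
    return bool(score) if metric_threshold is None else score >= metric_threshold

print(demo_is_successful({"answer": "Paris France"}, {"answer": "paris"}))                        # True
print(demo_is_successful({"answer": "Paris France"}, {"answer": "paris"}, metric_threshold=0.8))  # False
```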
**Demo-related parameters:** We see repeated parameters controlling the number of examples: `k` in LabeledFewShot; `max_bootstrapped_demos` and `max_labeled_demos` in BootstrapFewShot, RandomSearch, MIPRO. These generally default to small numbers (4 and 16, or 4 and 4 in MIPRO). The choice of 4/16 vs. 4/4 is inconsistent. Possibly, earlier versions assumed up to 16 labeled demos is fine (for simpler tasks or lots of data), whereas MIPRO’s authors might have found that using 16 made the search space too large or wasn’t needed, and so reduced both defaults to 4. It’s an inconsistency in default tuning: two classes aimed at similar goals have different defaults for max labeled demos (16 vs. 4). Similarly, LabeledFewShot and BootstrapFewShot share the 16 default for labeled demos (LabeledFewShot’s sole param `k=16` aligns with BootstrapFewShot’s 16), whereas MIPRO diverges.
**Parallelism parameters:** `num_threads` appears in BootstrapFewShotWithRandomSearch, BootstrapFinetune, and MIPRO, but not in plain BootstrapFewShot or LabeledFewShot. The base Evaluate class in DSPy likely uses a global thread count if not specified. The newer/complex optimizers expose `num_threads` to give the user control over parallel evaluations. This is a pattern of evolving design: earlier optimizers didn’t surface this (assuming either single-thread or the global config), later ones made it explicit. So there’s inconsistency across classes – e.g., one can’t directly set threads in BootstrapFewShot without going through dspy.settings, but one can in RandomSearch via the teleprompter’s param.

**Boolean flags for features:** Some advanced optimizers (MIPRO) have many boolean flags to toggle sub-behaviors (`program_aware_proposer`, etc.), whereas simpler ones bake in one strategy. This reflects differing complexity: simpler optimizers don’t have these flags at all. It’s expected, but it means the interface isn’t uniform – MIPRO stands out with a very large signature and lots of optional toggles, compared to something like BootstrapFewShot, which has a concise interface. From a consistency standpoint, MIPRO’s interface might be overwhelming relative to others.
**Use of `*` in method signatures:** As noted, almost all compile methods use `*` to separate `student` (positional) from the rest (keyword-only). This is a clear pattern for compile. The only exceptions:

- BootstrapFinetune’s compile, which did not put a `*` before `teacher` and `trainset` in the older implementation. (Documentation suggests there might be a version that does, but the code we saw treats `teacher` as positional after `student`, which is unusual.)
- Ensemble.compile doesn’t use `*` simply because it has a single argument.

This pattern – having the dataset and other settings be keyword-only – is generally followed and is good for clarity. The inconsistency in BootstrapFinetune is likely something to correct for uniformity.
**Public Method Names (step vs compile):** All these optimizers use a method named `compile` as the entry point to perform optimization, rather than something like `step()` or `optimize()`. The user question mentioned “methods such as step or optimize,” but in DSPy’s design it appears `compile` is the standard name (compiling a program with a teleprompter means optimizing it). None of the classes have a public method literally named `step` or `optimize` – they all stick to `compile()`. Internally, some have helper methods (`_bootstrap_one_example`, `_train`, etc.), but those are private. So there is consistency in using `compile` as the interface method, inherited from Teleprompter. The only slight oddity is Ensemble using compile in a non-learning sense, but it is still logically “compiling an ensemble program.”

**Outlier Classes:**
- Ensemble is quite different in purpose (no metric, no trainset). It still fits the Teleprompter interface (taking programs and returning a program), but its parameter set (`reduce_fn`, `deterministic`, etc.) doesn’t overlap with the others. It’s an idiosyncratic case included in the same module for convenience.
- FinetuneTeleprompter as a base class is a bit of an abstraction layer, not typically exposed to end users; it doesn’t quite act on its own. This is an internal consistency point: Teleprompter vs. FinetuneTeleprompter both serve as abstract bases for two families (prompt-based vs. fine-tune-based optimizers). They share the interface but introduce different init params (none vs. `train_kwargs`). A slight inconsistency is that the Teleprompter base has no init params while FinetuneTeleprompter does – but that’s due to the nature of fine-tuning needing configuration up front.
- COPRO and MIPRO introduce parameter names not seen elsewhere (e.g., `breadth`, `depth`, `auto`, all the proposer flags). They were likely developed later to tackle prompt optimization more holistically. They still follow patterns like requiring a trainset and using a metric, but add their own twist. COPRO, for instance, doesn’t accept `teacher` or use `max_rounds` – instead it has `depth` for iterations of prompt proposals, essentially analogous but specific to its domain. MIPRO aggregates parameters from many others, making it quite an outlier in complexity.
**Defaults and Range of Values:** Many numeric defaults seem somewhat ad hoc but fall within a small range:

- 4 and 16 appear frequently (suggesting maybe at most 4 bootstrapped examples or 16 labeled examples as a reasonable default).
- Max rounds defaults to 1 in bootstrap (a single iteration is often enough to get some improvement).
- RandomSearch defaults to 16 candidate programs (which aligns with maybe trying seeds -3, -2, -1 and 0..12 – indeed, in code they loop over `range(-3, num_candidate_sets)`, which for 16 gives seeds -3..15 inclusive, i.e., 19 seeds; likely they intended a fixed count, or perhaps the special negative seeds are not counted in that number – see the small check below).
- Finetuning hyperparams default to typical values like 1 epoch, batch size 12, learning rate 5e-5 – these mirror common practice in ML.
- The `auto="light"` default in MIPRO suggests they wanted the safer, quicker configuration by default.

The inconsistencies here are minor – just that some defaults might not align (e.g., if one expected MIPRO to default to the same 16 labeled demos as the simpler teleprompters, they’d be surprised it’s 4). Another example: LabeledFewShot vs. BootstrapFewShot default `k=16` vs. `max_labeled_demos=16` (consistent), but bootstrapped demos default to 4 vs. labeled 16 in the simple version, whereas MIPRO uses 4 for both – possibly to balance that it will do iterative improvements.
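A quick check of the seed arithmetic mentioned above, assuming the loop’s upper bound is simply the default of 16:

```python
# Seed-count check for the random-search loop discussed above,
# assuming num_candidate_sets == 16.
num_candidate_sets = 16
seeds = list(range(-3, num_candidate_sets))
print(seeds[:5], "...", seeds[-1])  # [-3, -2, -1, 0, 1] ... 15
print(len(seeds))                   # 19 = 3 special negative seeds + 16 regular candidates
```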
**Error handling and user interaction parameters:** Some newer classes have parameters related to robustness:

- `max_errors` is present in BootstrapFewShot and RandomSearch (to avoid infinite loops or crashes if too many errors occur). Others like Finetune don’t expose `max_errors` (though Evaluate inside might use a global max error).
- MIPRO uses `requires_permission_to_run` to ensure the user is aware of the resource cost; no other class does something like that (likely because MIPRO can be very expensive). This is a unique design consideration for an outlier. `provide_traceback` is similarly only in MIPRO, aimed at debugging – indicating MIPRO expects potentially long runs where silent failures would be frustrating.
- Ensemble asserts if `deterministic=True` because it’s not implemented, which is a bit user-unfriendly (they could have just not offered the parameter, or documented that it’s a future feature). This is an idiosyncrasy in Ensemble’s interface – exposing a param that only throws an error if set True (see the stand-in below).
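The Ensemble quirk boils down to a guard like the following – a stand-in, not the actual source:

```python
# Stand-in illustrating the quirk: the parameter exists, but setting it to True
# only triggers an assertion because the behavior is not implemented.
class EnsembleLike:
    def __init__(self, *, reduce_fn=None, size=None, deterministic=False):
        assert deterministic is False, "TODO: deterministic Ensemble is not implemented yet."
        self.reduce_fn, self.size = reduce_fn, size

EnsembleLike(size=3, reduce_fn=max)      # fine
# EnsembleLike(deterministic=True)       # AssertionError
```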
In summary, the patterns include the consistent use of a `compile` method with `student` plus keyword-only datasets/metrics, the presence of metric functions in most classes, and the repeated use of parameters controlling how many examples to use or generate. Idiosyncrasies and inconsistencies include differences in keyword-only enforcement, slight naming mismatches (`teacher_settings` vs. `adapter` vs. separate model params), differences in default values for similar concepts, and the sheer divergence in complexity between the simpler teleprompters (LabeledFewShot, BootstrapFewShot) and the complex ones (MIPRO, COPRO).
Each optimizer class was likely developed to extend functionality, which led to some divergence in interface. For example, COPRO and MIPRO added new kinds of parameters (depth, breadth, auto, etc.) that don’t appear in earlier classes, making the overall module less uniform.
Recommendations for Unifying the Interface
To improve consistency and usability across these teleprompter optimizers, we suggest the following changes:
**Enforce Keyword-Only for Key Parameters:** Ensure that in all optimizers, important parameters like `trainset`, `teacher`, and other configuration options are keyword-only. This means adding `*,` where missing (e.g., in `BootstrapFinetune.compile` to require naming `trainset` and `teacher`, and in any constructor where positional use could be confusing). A uniform rule could be: any optimizer method that takes a dataset or multiple optional settings should use keyword-only args beyond the program argument. This will prevent mistakes and make code more self-documenting; a possible signature is sketched below.
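One possible shape for that fix, written as a proposal sketch rather than the current dspy code:

```python
# Proposal sketch – not the current dspy source. The `*` forces everything after
# `student` to be passed by name, matching the other teleprompters.
class BootstrapFinetuneProposed:
    def compile(self, student, *, trainset, teacher=None, valset=None):
        """Fine-tune `student`, optionally bootstrapping traces from `teacher`."""
        # ... bootstrapping + fine-tuning would go here ...
        return student

BootstrapFinetuneProposed().compile(object(), trainset=[])   # OK: trainset is named
# BootstrapFinetuneProposed().compile(object(), [])          # TypeError under the proposal
```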
**Standardize Teacher Configuration:** Unify the approach to teacher models across classes:

- Always use a `teacher` argument in `compile` for providing an alternate program or LM for generating outputs (as is done in BootstrapFewShot, etc.), and consistently use a `teacher_settings` (or similarly named) parameter in the constructor to configure that teacher’s behavior. For fine-tuning, instead of introducing a separate `adapter` parameter, consider treating it analogously (e.g., `teacher_settings` could include an adapter or fine-tune-specific config). If that’s too abstract, at least rename `adapter` to something like `finetune_adapter` and document it as the analog of teacher settings but for fine-tuning.
- If `prompt_model` and `task_model` (as in MIPRO) are essentially playing the roles of teacher vs. student, clarify that, or even rename them to `teacher_model` and `student_model` for consistency. Alternatively, provide a unified interface where the Teleprompter base could accept something like `teacher=...` in init or compile that could be a model or a program. Having multiple parameters (`prompt_model`, `task_model`, `teacher`) is confusing; consolidating where possible would help (e.g., maybe define that `teacher` can be either a full DSPy program or a raw LM; if the latter, treat it as the model to generate prompts).
- Essentially, reduce the terminology: decide on either “teacher” or specific terms, and use them consistently. If the role is to generate new prompts, maybe call it `generator_model` everywhere instead of `prompt_model` in one place and implicitly using the teacher in another. Consistency in naming would reduce user confusion.
**Unify Metric Handling:** Make sure the role of `metric` and `metric_threshold` is consistently implemented and documented:

- If `metric_threshold` is supported in some optimizers (BootstrapFewShot, RandomSearch, MIPRO), consider supporting it in the others that might benefit (or explicitly excluding it), and at least document it uniformly. It might be useful in COPRO too (maybe to decide if a prompt is “good enough”). If it’s an advanced feature, ensure all classes that use metrics either accept `metric_threshold` or none of them do. As it stands, a user might not realize BootstrapFewShot accepts a `metric_threshold` because it wasn’t in the official docs, which is a documentation inconsistency.
- Similarly, if `max_errors` is a common safeguard, consider exposing it in all relevant optimizers (for example, COPRO and MIPRO do handle errors, but not via a parameter; they rely on global settings or internal logic). It might be good to allow the user to set `max_errors` in MIPRO too for consistency, or state clearly that it uses the global `dspy.settings.max_errors`. Unifying this across classes (all teleprompters either take a `max_errors` or none do and it’s purely global) would avoid confusion; a small resolution helper is sketched after this list.
**Align Default Values and Ranges:** Review the default values for parameters that serve similar purposes and align them unless there’s a strong reason not to:

- For example, the default `max_labeled_demos` in MIPROv2 is 4, whereas in BootstrapFewShot it’s 16. If 16 was found to be too high in practice, perhaps all classes should default to 4 for consistency (or vice versa if 16 is preferred for thoroughness). Choose one philosophy (fewer demos vs. more) and apply it uniformly so users have a consistent expectation.
- Likewise, ensure that if an optimization class is essentially a generalization of another, its defaults do not dramatically conflict. MIPROv2 is like a superset of BootstrapFewShot + COPRO; one would expect that using MIPROv2 in a “minimal” way would by default behave somewhat like BootstrapFewShot (just with added capabilities). That could mean defaulting to `max_labeled_demos=16` as in BootstrapFewShot for a fair comparison, or at least documenting why it’s different.
- Another default to align: LabeledFewShot’s `k=16` vs. BootstrapFewShot’s `max_labeled_demos=16` (those match), but if any divergence occurs in the future, keep them in sync.
- If possible, use the same default `num_threads` behavior – e.g., default `None` meaning use `dspy.settings.num_threads`. Document that consistently so users know `None` implies some global setting or single-threaded execution. Right now, it’s implied but not always explicitly stated in each class’s docs.
**Refine and Simplify Interfaces of Complex Classes:** For very complex optimizers like MIPROv2 (and to a lesser extent COPRO), consider grouping some of the less commonly changed hyperparameters into a config object or using `**kwargs` to pass through to internal methods. As it stands, the `compile` signature of MIPROv2 is extremely long, which can be intimidating. Some ideas:

- Group the proposer-related booleans into one structure or prefix them clearly. For example, instead of separate flags, one could have a single `proposers=dict(program_aware=True, data_aware=True, tip_aware=True, fewshot_aware=True)` or similar. This way the signature is shorter and it’s clear they belong together. Or provide a simpler toggle that sets a combination of them (e.g., a mode for proposers). A possible shape for such a config object is sketched after this list.
- The `minibatch`, `minibatch_size`, and `minibatch_full_eval_steps` options could perhaps be combined or managed by the `auto` mode. If `auto` is heavy, maybe always use full evaluation (`minibatch=False`). Document or enforce such relationships to reduce what the user must consider. If not grouping, at least document in one place how they interact (some of which the code does via errors).
- Another approach: provide preset configurations for MIPRO (like `auto` does), but maybe even expose them at a higher level rather than as lots of individual args. For instance, `auto="heavy"` sets many underlying defaults. Perhaps include in the docs or interface something like a `MIPROv2.heavy()` alternate constructor classmethod to preconfigure, etc. This doesn’t change the parameters per se, but helps users not have to tweak each one. This is more of a usability suggestion beyond just parameter format.

While these suggestions don’t unify across all classes (since the simpler ones don’t need it), they do make the outlier interfaces easier to handle, which indirectly unifies the experience. A user switching from BootstrapFewShot to MIPROv2 wouldn’t want to worry about 10 new parameters if not needed; having reasonable defaults and grouping helps.
**Consistent Documentation and Naming:** Ensure that the documentation (docstrings or user guides) for each optimizer class follows a consistent template:

- List out positional and keyword-only arguments explicitly, and use the same terminology for similar things (e.g., always call them “bootstrapped demos” rather than sometimes “augmented demos,” etc., to avoid confusion).
- If a parameter is effectively doing the same thing across classes, use the same name. For example, if we decide `teacher_settings` is the term, then perhaps `adapter` in BootstrapFinetune could be encompassed by `teacher_settings` as well (it could have keys for adapter vs. others) or be renamed to something like `finetune_settings`. Right now the names `teacher_settings`, `train_kwargs`, and `adapter` all refer to configuration of the “optimization process or model” beyond just metric and data. A unified naming (maybe a generic `config` dict, or breaking them into clearer categories) would help. For instance:
  - `teacher_settings` could be expanded to handle fine-tuning specifics (not an ideal semantic fit), or
  - use `train_kwargs` for all cases of LM training/hyperparameters (so BootstrapFewShot might not need it, but FinetuneTeleprompter does, and maybe MIPRO could reuse `train_kwargs` for consistency instead of burying fine-tune params in compile).

The goal is that a user reading the docs doesn’t have to guess that “adapter” in one class serves a role analogous to “teacher_settings” in another. If they truly are different in nature, clarify that in the docs or choose distinct naming that reflects purpose (e.g., `lm_adapter` vs. `teacher_lm_settings` might clarify that one is for the fine-tuning method and one for the prompting method). A minimal docstring template following this suggestion is shown below.
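For instance, a shared docstring template could look like this – purely illustrative:

```python
# Purely illustrative docstring template applying the documentation suggestion above.
class SomeOptimizer:
    def compile(self, student, *, trainset, teacher=None, valset=None):
        """Optimize `student` and return a new, compiled program.

        Positional arguments:
            student: the DSPy program to optimize.
        Keyword-only arguments:
            trainset: training examples used for bootstrapping/search (required).
            teacher: optional alternate program or LM used to generate demonstrations.
            valset: optional held-out examples used to select among candidates.
        """
        return student
```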
**Unify Process Flow Where Possible:** While not directly about parameters, making sure each Teleprompter clearly states its two main phases (if any) in a similar way could help unify understanding. For instance, all compile methods could follow a pattern in documentation: “Preprocess (e.g., prepare student/teacher), Optimize (via bootstrapping or search), Post-process (attach demos or fine-tune weights).” If the interface and documentation emphasize these stages similarly, users can map parameters to each stage (e.g., `max_rounds` relates to the optimization loop, `exclude_demos` relates to post-processing). Right now, each class’s documentation is isolated; a unified narrative would make the parameter sets feel more coherent.
By implementing these recommendations, the teleprompter optimizers would have a more consistent interface. For example, a user could expect that every optimizer’s `compile` is called with `student=..., trainset=..., teacher=..., valset=...` (where relevant) without worrying about positional quirks, and that when they see a parameter like `max_x_demos` or `num_threads`, it means the same general concept across the board. It would reduce the learning curve when moving from one optimizer to another and lower the chance of misuse due to inconsistent conventions.