The goal of the primal solver is to find a practical, computable approximation for the theoretical continuation value, $C_t$. The derivation follows a three-step process: formalizing the theory, reformulating it with signatures, and then solving the reformulated problem using a function approximator.
The starting point is the formal definition of the continuation value as a conditional expectation. It is an unknown function of all information available up to time $t$, represented by the filtration $\mathcal{F}_t$:
\[C_t = \mathbb{E}\left[e^{-r\Delta t} V_{t+\Delta t} | \mathcal{F}_t\right]\]The key to making this problem computable is to replace the abstract filtration $\mathcal{F}_t$ with a concrete numerical representation. This is achieved through a direct mathematical equivalence.
With the problem reformulated, the final step is to find an approximation for the unknown function $f$. This is done using Monte Carlo regression.
Given a set of $N$ simulated paths, we can create a dataset for a regression task. For each path $i$:
The solver then employs a function approximator to learn the relationship between these inputs and targets. For example, a Kernel RFF solver uses a two-stage process:
The trained model provides our best estimate for the unknown function $f$, which we denote as $\hat{f}$. Therefore, our final, computable approximation for the continuation value is the output of this learned function:
\[\hat{C}_t = \hat{f}(\text{sig}(\text{Path})_t) \approx C_t\]The solver implements the theory by executing a backward induction algorithm. The process begins at maturity ($T$), where the option value for each path $i$ is initialized as $V_T^{(i)} = \text{Payoff}(S_T^{(i)})$. The algorithm then iterates backward for each time step $t = T-1, T-2, \dots, 0$.
At each step $t$, the following mathematical operations are performed:
The algorithm first identifies the set of path indices, $\mathcal{I}_t$, for which the option is in-the-money:
\[\mathcal{I}_t = \{i \mid \text{Payoff}(S_t^{(i)}) > 0\}\]The regression is performed on the training set ${ (x_i, y_i) }_{i \in \mathcal{I}_t}$, where the inputs and targets are defined as:
The solver finds the optimal function, $\hat{f}_t$, that approximates the continuation value by solving the least-squares problem:
\[\hat{f}_t = \arg\min_{f} \sum_{i \in \mathcal{I}_t} \left( y_i - f(x_i) \right)^2\]This learned function is then used to estimate the continuation value, $\hat{C}_t^{(i)}$, for all paths ($i = 1, \dots, N$):
\[\hat{C}_t^{(i)} = \hat{f}_t(\text{sig}(\text{Path}_i)_t)\]The option’s value for each path at time $t$, $V_t^{(i)}$, is updated according to the optimal exercise rule derived from the Bellman equation:
\[V_t^{(i)} = \text{Payoff}(S_t^{(i)}) \quad \text{if } \text{Payoff}(S_t^{(i)}) > \hat{C}_t^{(i)} \text{ and } i \in \mathcal{I}_t\] \[V_t^{(i)} = e^{-r\Delta t}V_{t+1}^{(i)} \quad \text{otherwise}\]The resulting values, ${V_t^{(i)}}_{i=1}^N$, are then used as the known future values for the next step of the iteration at time $t-1$.
After the loop completes, the final option price is the sample mean of the values at time $t=0$:
\[V_{\text{lower}} = \frac{1}{N} \sum_{i=1}^{N} V_0^{(i)}\]The algorithm’s recursion is not just an arbitrary procedure; it is a direct computational method for solving the optimal stopping problem, first proposed by Longstaff and Schwartz (2001). The equivalence between its output and the theoretical $V_{\text{lower}}$ can be formally shown using a proof by backward induction.
The goal is to prove that the value $V_t$ calculated by the algorithm at each step is equal to the true theoretical value of the option at that time.
At the last step of the recursion, the algorithm calculates:
\[V_{T-1} = \max\left(\text{Payoff}(S_{T-1}), \mathbb{E}\left[e^{-r\Delta t}\text{Payoff}(S_T) \mid \mathcal{F}_{T-1}\right]\right)\]This is precisely the theoretical value of the option at $T-1$, as the only two choices are to exercise now or to wait one period and receive the discounted expected payoff at maturity. The equivalence holds.
Let’s assume the equivalence holds at time $t+1$. This means the value computed by the algorithm, $V_{t+1}$, is the true theoretical value of the option from that point forward:
\[V_{t+1} = \sup_{\tau \in \mathcal{T}_{t+1,T}} \mathbb{E}\left[e^{-r(\tau-(t+1))} \text{Payoff}(S_\tau) \mid \mathcal{F}_{t+1}\right]\]At time $t$, the algorithm computes the value by applying the Bellman equation:
\[V_t = \max\left(\text{Payoff}(S_t), \mathbb{E}\left[e^{-r\Delta t} V_{t+1} \mid \mathcal{F}_t\right]\right)\]By substituting our inductive hypothesis for $V_{t+1}$ and using the Tower Property of expectation, the second term becomes:
\[\mathbb{E}\left[e^{-r\Delta t} V_{t+1} \mid \mathcal{F}_t\right] = \sup_{\tau \in \mathcal{T}_{t+1,T}} \mathbb{E}\left[e^{-r(\tau-t)} \text{Payoff}(S_\tau) \mid \mathcal{F}_t\right]\]This is the value of the “hold” strategy. Therefore, the algorithm’s calculation for $V_t$ is:
\[V_t = \max\left(\text{Payoff}(S_t), \quad \sup_{\tau \in \mathcal{T}_{t+1,T}} \mathbb{E}\left[e^{-r(\tau-t)} \text{Payoff}(S_\tau) \mid \mathcal{F}_t\right]\right)\]This is exactly the definition of the true option value at time $t$, which is the maximum of the “exercise now” and “hold” strategies. The equivalence holds at time $t$.
By the principle of backward induction, the value $V_0$ computed by the algorithm is mathematically identical to the theoretical value from the optimal stopping problem. This proves the equivalence:
\[V_{\text{lower}} = \frac{1}{N} \sum_{i=1}^{N} V_0^{(i)} \approx \mathbb{E}\left[e^{-r\hat{\tau}} \text{Payoff}(S_{\hat{\tau}})\right]\]The goal of the dual solver is to find a tight upper bound on the American option’s price. It achieves this by constructing and optimizing a family of martingales, leveraging the principle of martingale duality for optimal stopping.
The dual approach is based on the theoretical result that the true option price $V_0$ is bounded from above by the expected supremum of the payoff minus any suitable martingale process $M_t$. The tightest bound is found by searching over the entire space of valid martingales, $\mathcal{M}$:
\[V_{\text{upper}} = \inf_{M \in \mathcal{M}} \mathbb{E}\left[\sup_{t \in [0,T]} \left(e^{-rt}\text{Payoff}(S_t) - M_t\right)\right]\]The main theoretical challenge is solving this minimization problem over the infinite-dimensional space of martingales. To make this tractable, martingales are constructed as stochastic integrals of the form $M_t = \int_0^t \alpha_s dW_s$, where $\alpha_s$ is a process adapted to the filtration $\mathcal{F}_s$ known as the martingale integrand.
The theoretical challenge is to find the optimal integrand process, $\alpha_s$, which is an abstract process adapted to the filtration $\mathcal{F}_s$. To make this problem computationally tractable, we reformulate it by specifying a concrete, parameterized form for $\alpha_s$.
The requirement that $\alpha_s$ be adapted to the filtration means its value at time $s$ can only depend on the history of the path up to that moment. As established in the primal problem, the Doob-Dynkin Lemma provides the rigorous justification for expressing this relationship as a function of the path signature. It guarantees the existence of some unknown function, $g$, such that: \(\alpha_s = g\left(\text{sig}(\text{Path})_s\right)\)
The function $g$ is unknown and potentially highly complex. The DeepSignatureDualSolver employs a deep neural network, $NN_{\theta}$, as a powerful, non-linear function approximator to model $g$. The integrand is therefore approximated as the output of the neural network, which takes the path signature as input:
\(\alpha_s \approx NN_{\theta}\left(\text{sig}(\text{Path})_s\right)\)
By substituting this neural network approximation into the stochastic integral, we obtain the final, parameterized form of the martingale used by the solver. Its behavior is now controlled entirely by the learnable weights, $\theta$, of the neural network:
\[M_t(\theta) = \int_0^t NN_{\theta}\left(\text{sig}(\text{Path})_s\right) \, dW_s\]In the discrete-time setting of the algorithm, this integral is approximated as a sum of increments over the time steps of the simulation:
\[M_t(\theta) = \sum_{j=0}^{t-1} NN_{\theta}(\text{sig}(\text{Path})_j) \cdot \Delta W_{j+1}\]where $\Delta W_{j+1}$ is the Brownian increment over the interval $[t_j, t_{j+1}]$.
This reformulation successfully transforms the abstract problem of finding an optimal process $\alpha_s$ into a concrete, finite-dimensional optimization problem: finding the optimal network weights $\theta$.
With the martingale parameterized by the neural network, the problem of finding the tightest upper bound becomes a finite-dimensional optimization problem over the network’s weights, $\theta$:
\[\min_{\theta} \mathbb{E}\left[\max_{t=0,\dots,N} \left(\text{Payoff}_t - M_t(\theta)\right)\right]\]To solve this in practice, the theoretical objective function (the expectation) is connected to a practical loss function through Monte Carlo approximation. The true expectation over all possible paths is approximated by the sample mean over the $N$ simulated paths. This sample mean is the loss function that the solver seeks to minimize:
\[\underbrace{\mathbb{E}\left[\max_{t} (\text{Payoff}_t - M_t(\theta))\right]}_{\text{Theoretical Objective}} \quad \xrightarrow \quad \underbrace{\frac{1}{N} \sum_{i=1}^{N} \max_{t} (\text{Payoff}_t^{(i)} - M_t^{(i)}(\theta))}_{\text{Practical Loss Function}}\]The algorithm then finds the minimum of this loss function using the following steps:
The final, minimized value of the loss function is the computed upper bound price, $V_{\text{upper}}$.
The solver employs a Residual Network (ResNet), enhanced with Squeeze-and-Excitation (SE) blocks, to serve as the function approximator. This architecture is deliberately chosen to handle the complexity of the signature features and to ensure a stable and efficient training process.
A residual block computes its output, $x_{l+1}$, by adding its input, $x_l$, to a non-linear transformation, $F(x_l)$, of the input:
\[x_{l+1} = F(x_l) + x_l\]This mathematical form is highly effective for signatures because the skip connection ($+ x_l$) acts as an “information highway.” It allows the foundational information from the lower-order signature terms (e.g., overall displacement) to be perfectly preserved as it propagates to deeper layers. The network block, $F(x_l)$, is then free to focus only on learning the refinement or perturbation provided by the more complex, higher-order signature terms (e.g., area and curvature). This structure mirrors the natural hierarchy of the signature itself.
An SE block acts as an attention mechanism that allows the model to learn the dynamic, context-dependent importance of each feature in the signature vector. It performs this through a three-step mathematical process:
Excite: This descriptor is then passed through a small gating mechanism, which is a two-layer neural network with a final sigmoid activation, $\sigma$. This gate outputs a vector of importance scores, $s$, where each score is between 0 and 1. The mathematical form is:
\[s = \sigma(W_2 \delta(W_1 X))\]where $W_1$ and $W_2$ are the learnable weights of the two layers and $\delta$ is a ReLU activation.
This process of recalibration is what makes the SE block so powerful for signatures. It allows the network to learn the complex, non-linear interdependencies between the different signature terms. For instance, it can learn that if a Level 2 “area” term is high (indicating a volatile path), then a Level 4 “oscillatory” term is highly predictive. In response, the network will learn to output a high score ($s_c \approx 1$) for that Level 4 term, effectively paying more attention to it, while suppressing less relevant terms.
Even if a shallow, single-block MLP had the same number of parameters as a deep ResNet, the ResNet’s architecture gives it a significant advantage.
This architectural difference also leads to a more efficient learning process. The skip connections in the ResNet create a smoother loss landscape, which allows the gradient-based optimizer to converge faster and more reliably. The SE blocks further enhance this efficiency by acting as a smart filter, directing the model’s focus and gradient updates toward the most informative signature terms. This combination results in a model that not only generalizes better but also learns more quickly than a wide MLP, which often struggles with a more complex and difficult-to-navigate loss surface.