---
title: N-Gram Model of Optimal Policy on Interpretable Abstractions
date: 2025-02-21
math: true
---

## Abstract

In praxis learning[^praxis-learning], choosing an interpretable functional form as a policy approximation model is essential. Equally important is to ensure that it is well-defined over interpretable domains. Such domains are often the result of class-valued abstractions of the observable state space, as visual classification is a task that humans excel at. Motivated by this fact, we provide an interpretable functional form that is valid over multiclass spaces in the form of an $n$-gram model approximation of dynamics under optimal policy.

---

## Background

$n$-gram models were developed as a rudimentary statistical model of language. Assuming an $n^{th}$-order [Markov property](https://en.wikipedia.org/wiki/Markov_property) on the probability of a word $w_{t + 1}$ at discrete time $t + 1$ given a history $\langle w_i \rangle_{i \in [1, \\, t]}$,

$$
\begin{equation}
    P(w_1, \, \ldots, \, w_{t + 1}) =
    P(w_1, \, \ldots, \, w_{t - n}) \prod_{i = t - n}^{t} P(w_{i + 1} \mid w_{i - n + 1}, \, \ldots, \, w_i),
\end{equation}
$$

straightforward [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) shows that each conditional probability is the proportion of occurrences of the context $\langle w_{t-n}, \\, \ldots, w_t \rangle$ that are immediately followed by $w_{t + 1}$ in observations. This can be seen as frequentist inference, making the probability measure intuitive.

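As a concrete sketch of this counting procedure, consider the minimal example below; the toy corpus and function names are assumptions made purely for illustration, not any particular implementation.

```python
from collections import Counter, defaultdict

def ngram_mle(words, n):
    """Estimate P(next word | previous n words) by counting, as in the MLE above."""
    context_counts = Counter()                # occurrences of each length-n context
    successor_counts = defaultdict(Counter)   # occurrences of (context, next word)
    for t in range(len(words) - n):
        context = tuple(words[t : t + n])
        nxt = words[t + n]
        context_counts[context] += 1
        successor_counts[context][nxt] += 1
    # Conditional probability = proportion of context occurrences followed by each word.
    return {
        ctx: {w: c / context_counts[ctx] for w, c in nexts.items()}
        for ctx, nexts in successor_counts.items()
    }

# Toy corpus (assumed for illustration only).
corpus = "the cat sat on the mat the cat ran".split()
model = ngram_mle(corpus, n=1)
print(model[("the",)])  # e.g. {'cat': 0.66..., 'mat': 0.33...}
```
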
When applied to a set of symbols (words) $S$, such a model implies a Markov chain over the product $S^n = S \times \cdots \times S$. It follows that the model's [stochastic matrix](https://en.wikipedia.org/wiki/Stochastic_matrix) of next-symbol probabilities $\Pi$ is an element of $\mathbb{R}^{k \times k^n}$ with $k = |S|$, so the number of learnable parameters grows exponentially with the order of the model for a fixed $S$.

Because their transition probabilities are fixed across the time index, $n$-gram models are stationary[^stationary]. This limitation severely restricts how faithfully they can model natural language, and it is directly addressed by modern language models through mechanisms like [attention](https://en.wikipedia.org/wiki/Attention_(machine_learning)).

### Rules of Thumb

Many heuristics taught in strategic decision-making can be described as conditionals on the result of classification exercises. For example, there is a rule of thumb in Chess which calls for protecting one's own king if it is open.

When implementing this heuristic, a player performs classification via a mapping $\phi : S \to \\{\text{Yes}, \\, \text{No}\\}$ from the set of board states to an answer to the heuristic's condition. Experience suggests that if a player's $\phi$ is sufficiently close to the ground truth, they obtain a performance improvement in expectation.

Naturally, the complexity involved in evaluating a classification $\phi_h(s)$ for some state $s \in S$ should be minimal so that its heuristic $h$ can be implemented without computer assistance. In many cases, the simplicity of these classifications to humans (i.e., how intuitive they are) directly translates to the simplicity of implementing them in other models of computation. Put simply, it is generally easy to program such functions.

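As an illustration, here is a minimal sketch of such a $\phi$ for the 'open king' condition; the board encoding and the pawn-shield criterion are assumptions made only for this example, not a standard definition.

```python
# Board: dict mapping (file, rank) squares (0-7, 0-7) to piece strings like "wK" or "bP".
# This encoding and the pawn-shield criterion are illustrative assumptions only.
def king_is_open(board: dict, color: str = "w") -> str:
    """A toy phi: answer 'Yes' if the king lacks friendly pawns on the three squares in front of it."""
    king_square = next(sq for sq, piece in board.items() if piece == color + "K")
    file, rank = king_square
    forward = 1 if color == "w" else -1
    shield = [(file + df, rank + forward) for df in (-1, 0, 1)]
    shielded = any(board.get(sq) == color + "P" for sq in shield)
    return "No" if shielded else "Yes"

# Example: a white king on g1 with pawns on f2, g2, h2 is not open.
position = {(6, 0): "wK", (5, 1): "wP", (6, 1): "wP", (7, 1): "wP"}
print(king_is_open(position))  # -> "No"
```
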
However, humans can obtain an _unexplainable_ intuitive understanding of a game. In such cases, the classification exercises they carry out for their expert heuristics are mappings onto a set of abstract characteristics (e.g., area 'crowdedness' in Chess). This can be seen loosely as [feature learning](https://en.wikipedia.org/wiki/Feature_learning).

But even in these cases, it is relatively simple to train a model which replicates a human's capacity to perform classification for their own expert-level heuristics by having them label training datasets by hand. Hence, one can generally assume access to efficient classifiers for human-interpretable features.

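A minimal sketch of this labeling-then-fitting workflow, assuming scikit-learn is available and that states have already been reduced to simple numeric features (both assumptions are for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labeled data: each row is a feature vector extracted from a board
# state (e.g., piece counts near a region), and each label is the expert's class.
X = np.array([[2, 5], [0, 1], [4, 6], [1, 2], [3, 4], [0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = "crowded", 0 = "not crowded" (illustrative)

# Fit an efficient classifier that stands in for the human's phi.
phi = LogisticRegression().fit(X, y)
print(phi.predict(np.array([[3, 5]])))  # predicted class for a new state
```
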
### Abstract Strategy

Given a morphism (an abstraction) $\alpha : S \to \tilde{S}$ over a state set $S$, the lack of an injectivity constraint could produce a situation where, for an arbitrary policy $\pi : S \to S$ satisfying $\pi(s) = a$ and $\pi(s^\prime) = b$ with distinct $a, \\, b, \\, s, \\, s^\prime \in S$,

$$
\begin{equation}
    \alpha(s) = \alpha(s^\prime) \;\; \text{and} \;\; \alpha(a) \neq \alpha(b).
\end{equation}
$$

Therefore, attempting to obtain a counterpart $\tilde{\pi} : \tilde{S} \to \tilde{S}$ (an 'abstract strategy') which preserves the information in $\pi$ is often not possible, as the single value $\tilde{\pi}(\alpha(s)) = \tilde{\pi}(\alpha(s^\prime))$ would have to 'remember' both of the distinct images $\alpha(\pi(s)) = \alpha(a)$ and $\alpha(\pi(s^\prime)) = \alpha(b)$.

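A tiny worked example of the obstruction; every set and map here is invented purely for illustration.

```python
# Four concrete states; pi maps s -> a and s_prime -> b.
pi = {"s": "a", "s_prime": "b"}

# An abstraction that collapses s and s_prime but separates their images under pi.
alpha = {"s": "X", "s_prime": "X", "a": "Y", "b": "Z"}

# Any abstract strategy pi_tilde must assign a single image to the class "X",
# yet commuting with alpha would require both "Y" and "Z":
required = {(alpha[s], alpha[pi[s]]) for s in pi}
print(required)  # {('X', 'Y'), ('X', 'Z')} -- two different outputs demanded for one input.
```
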
## Model

Let $\langle \phi^{(\alpha)} : S \to S^{(\alpha)} \rangle_{\alpha \in \Alpha}$ be a collection of abstractions enumerated in $\Alpha$, and $\pi : S \to S$ a policy over $S$. Observing $(2)$, we propose modeling class-conditional transition probability distributions,

$$
\begin{equation}
    P^{(\alpha)}_{t+1}(k) = P[\phi^{(\alpha)}(\pi^{t + 1}(s)) = k \; | \; \phi^{(\alpha)}(\pi^t(s)) = k_t, \, \ldots, \, \phi^{(\alpha)}(\pi^0(s)) = k_0],
\end{equation}
$$

of the elements $k_i \in S^{(\alpha)}$ via an $n$-gram model. This effectively establishes sequences in $\phi^{(\alpha)}(S)$ via repeated application of $\pi$ within $S$ (following the dynamics of $\pi$); in the above equation, $\pi^t(s)$ denotes the $t$-fold composition $\pi(\pi(\cdots \pi(s)))$, with $\pi^0(s) = s$.
This yields a collection of stochastic matrices $\langle \Pi^{(\alpha)} \rangle_{\alpha \in \Alpha}$ with

$$
\Pi^{(\alpha)}_{i, j} = P[\, i \text{ is observed at time } t \; | \; j \text{ is observed immediately before}\,],
$$

where $i \in S^{(\alpha)}$ and $j \in (S^{(\alpha)})^n$. The number of learnable parameters (i.e., the size) of such a model $M = \langle \Pi^{(\alpha)} \rangle_{\alpha \in \Alpha}$ is therefore

$$
\begin{equation}
    |M| = \sum_{\alpha \in \Alpha} |S^{(\alpha)}|^n \, (|S^{(\alpha)}| - 1).
\end{equation}
$$

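As a sketch of what such a model looks like as a data structure, the snippet below builds one empty transition table per abstraction and recovers the parameter count of $(4)$; the container layout and names are assumptions made for illustration.

```python
from itertools import product

def empty_model(class_sets: dict, n: int) -> dict:
    """One table per abstraction: context (length-n tuple of classes) -> distribution over classes."""
    model = {}
    for alpha, classes in class_sets.items():
        contexts = list(product(classes, repeat=n))
        model[alpha] = {ctx: {k: 0.0 for k in classes} for ctx in contexts}
    return model

def size(model: dict) -> int:
    """Free parameters: each context's distribution has |classes| - 1 degrees of freedom."""
    return sum(len(table) * (len(next(iter(table.values()))) - 1) for table in model.values())

# Example: two abstractions with 2 and 3 classes, order n = 2.
M = empty_model({"king_open": ["Yes", "No"], "crowdedness": ["low", "mid", "high"]}, n=2)
print(size(M))  # 2^2 * 1 + 3^2 * 2 = 22
```
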
## Training

The parameter space for a model $M$ of order $n$ is precisely

$$
\begin{equation*}
    \Theta =
    \prod_{\alpha \in \Alpha} \,
    \prod_{j \in (S^{(\alpha)})^n}
    \Delta^{|S^{(\alpha)}| - 1},
\end{equation*}
$$

(where $\Delta^{d}$ denotes the $d$-dimensional probability simplex, and the products are Cartesian). Finding optimal parameters $\theta^* \in \Theta$ follows standard procedure as in any $n$-gram model. Hence, we simply provide the generic closed-form solution written in terms of the objects at hand,

$$
\begin{equation}
    \Pi^{(\alpha)}_{i, j} = \frac{1}{N}
    \sum_{s \in S}
    I^{(\alpha)}_{i,j}(\pi^n(s), \langle \pi^m(s) \rangle_{m \in [0, \, n)}),
\end{equation}
$$

where

$$
\begin{equation*}
    I^{(\alpha)}_{i,j}(a, \langle b_m \rangle_{m \in [0, \, n)}) =
    \begin{cases}
        1 & \text{if } \; \phi^{(\alpha)}(a) = i \; \text{ and } \; \langle \phi^{(\alpha)}(b_m) \rangle_{m \in [0, \, n)} = j, \\
        0 & \text{otherwise},
    \end{cases}
\end{equation*}
$$

and $N$ is the number of length-$(n + 1)$ contiguous subsequences in the dynamics of $\pi$, which can easily be sketched while computing the sum in $(5)$. Note that as written, $(5)$ tallies the joint frequency of a context and its successor class; each column of $\Pi^{(\alpha)}$ is then renormalized to sum to one (equivalently, $N$ is replaced by the number of subsequences whose context maps to $j$) so that the matrix is stochastic.

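A minimal sketch of this counting procedure for a single abstraction; the function and variable names are illustrative, sinks (where $\pi$ is undefined) are simply skipped, and sources are handled in the next subsection.

```python
from collections import Counter, defaultdict

def fit_table(S, pi, phi, n):
    """Estimate one Pi^(alpha): context of n classes -> distribution over the next class."""
    context_counts = Counter()
    joint_counts = defaultdict(Counter)
    for s in S:
        # Roll the dynamics of pi forward n steps from s; stop early at a sink.
        window = [s]
        while len(window) <= n and window[-1] in pi:
            window.append(pi[window[-1]])
        if len(window) <= n:
            continue  # a sink was reached before a full length-(n + 1) window formed
        context = tuple(phi(x) for x in window[:n])
        nxt = phi(window[n])
        context_counts[context] += 1
        joint_counts[context][nxt] += 1
    # Normalize each column so the table is stochastic.
    return {
        ctx: {k: c / context_counts[ctx] for k, c in nexts.items()}
        for ctx, nexts in joint_counts.items()
    }

# Toy example: states 0..4 with pi(s) = s + 1 (4 is a sink), classified by parity.
print(fit_table(range(5), {0: 1, 1: 2, 2: 3, 3: 4}, phi=lambda s: s % 2, n=1))
```
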
### Sources

The nature of the policy operator $\pi$ is such that there exists some $s \in S$ without an $s^\prime$ satisfying $\pi(s^\prime) = s$. Such an $s$ is called a source within the dynamics of $\pi$. This constitutes a problem, as the start $s_0$ of the game for which $S$ is a state space is necessarily a source (and may not be the only one); therefore, an attempt to find an $n$-length sequence of moves leading up to a state fewer than $n$ applications of $\pi$ away from a source may fail.

This matters when computing the parameter $\Pi^{(\alpha)}_{i, j}$ in which $i$ is the class of a state that lies too close to a source to have a valid $n$-gram history. A solution which does not significantly alter the transition distributions of $\Pi^{(\alpha)}$ is to sample the missing elements of such $n$-gram histories from a uniform distribution while computing its entries. If this measure is taken, every state contributes a full-length window, so $N$ can be set to $|S|$ in $(5)$, avoiding the need for sketching proportions.

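A sketch of that padding step, assuming the class set is small and known; the names are illustrative, and it is the class history (not the underlying states) that gets padded.

```python
import random

def pad_history(history, n, classes, rng=random.Random(0)):
    """Left-pad a too-short class history with uniform samples so it reaches length n."""
    missing = n - len(history)
    if missing <= 0:
        return tuple(history[-n:])
    return tuple(rng.choice(classes) for _ in range(missing)) + tuple(history)

# A state one step away from the start only has one observed class in its history.
print(pad_history(["Yes"], n=3, classes=["Yes", "No"]))  # e.g. ('No', 'Yes', 'Yes')
```
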
### Sinks

In many traditional definitions of a policy $\pi$, there may exist elements of $S$ on which $\pi$ is not defined, as they are terminal in the game under representation. These are sinks in the dynamics of $\pi$, and they should never be considered part of a history while computing model parameters.

## Inference

When at a state $s \in S$, a human player can consider the set of next possible states $\tau(s)$ (where the transition function $\tau : S \to \mathcal{P}(S)$ is set-valued). Ideally, a combinatorial optimization would be carried out over all candidates $s^\prime \in \tau(s)$, with the objective of maximizing the probability of the induced class transitions across the abstract state spaces $S^{(\alpha)}$; this is a maximum-likelihood choice of next state.

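A sketch of that selection rule under simplifying assumptions (abstractions treated as independent, order-$n$ class histories already at hand); every name here is illustrative.

```python
def best_next_state(candidates, tables, phis, histories):
    """Pick the candidate next state whose class transitions are most likely under the model.

    tables[alpha]    : context tuple -> {class: probability}   (a fitted Pi^(alpha))
    phis[alpha]      : state -> class                           (the abstraction)
    histories[alpha] : tuple of the n most recent classes       (the current context)
    """
    def likelihood(s_prime):
        score = 1.0
        for alpha, table in tables.items():
            ctx = histories[alpha]
            k = phis[alpha](s_prime)
            # Treat the abstractions as independent and multiply their probabilities.
            score *= table.get(ctx, {}).get(k, 0.0)
        return score

    return max(candidates, key=likelihood)
```
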
While this is possible to an extent due to the simplicity of the abstractions under consideration (which map onto small sets of classes, keeping the maximization tractable), the true value of the model is in the subjective analysis of each $\Pi^{(\alpha)}$. Additionally, quantitative techniques (such as finding the stationary distribution and convergence rate of these matrices) may reveal interpretable patterns in the dynamics of $\pi$, depending on $\langle \phi^{(\alpha)} \rangle$.

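For the unigram-context case ($n = 1$, so each $\Pi^{(\alpha)}$ is square), the stationary distribution can be sketched by power iteration; higher orders would first require lifting the chain onto length-$n$ contexts. The matrix below is an invented toy example.

```python
import numpy as np

def stationary_distribution(P, iters=1000, tol=1e-12):
    """Power-iterate a column-stochastic matrix P (columns sum to 1) to its stationary distribution."""
    v = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        nxt = P @ v
        if np.linalg.norm(nxt - v, 1) < tol:
            break
        v = nxt
    return v

# Toy 2-class chain (columns are conditional distributions, as in Pi^(alpha)).
P = np.array([[0.9, 0.4],
              [0.1, 0.6]])
print(stationary_distribution(P))  # approx [0.8, 0.2]
```
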
## Remarks

Establishing an approximation of optimal policy in the form of a Markov process provides an interpretable functional representation that is able to work with intuitive abstractions. Thus, it is a valid representation of a praxis, and the above methods effectively 'translate' from policies of arbitrary form.

### Explorations

The following are left as potential avenues of analysis relating to the model family.

* Smoothing techniques, and an analysis of their benefit in the context of optimal policy.
* Non-interpretability of $n$-gram model successors; in particular, transformer attention.
* Skip-gram models as an extension of this family.

---

## Credits

Thank you to my good friend Humberto Gutierrez for spending late nights discussing the concept of policy abstraction with me, and helping me organize many ideas about policies over continuous abstractions.

[^praxis-learning]: Automated synthesis of artifacts that improve unassisted human performance from machine representations of perfect or near-perfect game-theoretic strategy.

[^stationary]: A stationary model's probability assignments are invariant with respect to shifts in the time index.