diff --git a/.nojekyll b/.nojekyll
index ed0ed0e..6d5fffd 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-fe4b8cc8
\ No newline at end of file
+10462241
\ No newline at end of file
diff --git a/lectures/L03.html b/lectures/L03.html
index c0bc57a..c32d1ad 100644
--- a/lectures/L03.html
+++ b/lectures/L03.html
@@ -407,45 +407,45 @@
-The above development of KANs is compelling but it is much more illustrative to approach from Non-parametric Regression point of view, and then see that KAN’s are a type of Deep Non-parametric Regressors.
+The above development of KANs is compelling, but it is much more illustrative to approach it from a Non-parametric Regression point of view and then see that KANs are a type of Deep Non-parametric Regressor. Then we realize that the standard terminology of neurons, activations, pre-activations, post-activations, etc., can be dropped completely.
-Consider the following regression problem \[y^{[i]} \equiv f(x^{[i]}) + e^{[i]} \equiv \phi(x^{[i]}) + e^{[i]}, i \in \left\{1,\dots,N\right\}\] with \(D = \{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) representing all the data available to fit (train) the model \(f(x)\). It is customary to write the model as \(f(x)\) instead of \(\phi(x)\). It is done for compatibility with KAN. In the shallow case \(f(x) = \phi(x)\) otherwise \(\phi(x)\) is used to denote the building blocks that \(f(x)\) will be made of.
-For a moment, w.l.o.g, assume \(x\) to be univariate. In a typical regression setup, one constructs features such as \(x^2, x^3,\dots\), like in polynomial regression and treats this as a Linear Regression problem. We can view this standard procedures as expanding the function \(f(x)\) on a set of Polynomials. Seen in a more general sense, we can choose an appropriate Basis Functions to construct the feature space. For example [see Chapter 9 of All of Nonparametric Statistics] \[f(x) \equiv \phi(x) = \sum_{i=1}^{\infty} \beta_i B_i(x)\] where \(B_1(x) \equiv 1, B_i(x) \equiv \sqrt{2}\cos((i-1)\pi x) \text{ for } i \ge 2\). See the figure below.
+Consider the following regression problem \[y^{[i]} \equiv f(x^{[i]}) + e^{[i]} \equiv \phi(x^{[i]}) + e^{[i]}, i \in \left\{1,\dots,N\right\}\] with \(D = \{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) representing all the data available to fit (train) the model \(f(x)\). It is customary to write the model as \(f(x)\) instead of \(\phi(x)\); this is done for compatibility with the KAN notation. In the shallow case \(f(x) = \phi(x)\); otherwise \(\phi(x)\) is used to denote the building blocks, and \(f(x)\) will be a composition of many such building blocks.
+For a moment, w.l.o.g., assume \(x\) to be univariate. In a typical regression setup, one constructs features such as \(x^2, x^3,\dots\), as in polynomial regression, and treats this as a Linear Regression problem. We can view this standard procedure as expanding the function \(f(x)\) on a set of Polynomials. Seen in a more general sense, we can choose an appropriate set of Basis Functions to construct the feature space. For example [see Chapter 9 of All of Nonparametric Statistics], \[f(x) \equiv \phi(x) = \sum_{i=1}^{\infty} \beta_i B_i(x)\] where \(B_1(x) \equiv 1, B_i(x) \equiv \sqrt{2}\cos((i-1)\pi x) \text{ for } i \ge 2\). See the figure below.
In practice, we will truncate the expansion up to some finite number of terms. For illustration, say we choose \(p\) terms. Then, in matrix notation, the regression problem is:
\[
{\bf y} = {\bf X}{\bf \beta} + {\bf \epsilon}
\]
where
\[
\begin{array}{lcl}
{\bf X}_{N \times p} &=&
\begin{pmatrix}
1 & \sqrt{2}\cos(\pi x^{[1]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[1]}) \\
1 & \sqrt{2}\cos(\pi x^{[2]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[2]}) \\
\vdots & & & \vdots \\
1 & \sqrt{2}\cos(\pi x^{[N]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[N]})
\end{pmatrix} \\
{\bf \beta}_{p \times 1} &=& [\beta_1, \beta_2, \dots, \beta_p ]^T \\
{\bf y}_{N \times 1} &=& [y^{[1]}, y^{[2]}, \dots, y^{[N]} ]^T
\end{array}
\]
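To make this concrete, here is a minimal NumPy sketch (not from the lecture) that builds the cosine-basis design matrix \({\bf X}\) above and estimates \({\bf \beta}\) by ordinary least squares. The synthetic target and the helper name `cosine_design_matrix` are illustrative assumptions.

```python
import numpy as np

def cosine_design_matrix(x, p):
    """N x p design matrix: B_1(x) = 1 and B_i(x) = sqrt(2) * cos((i-1) * pi * x) for i >= 2."""
    X = np.ones((len(x), p))
    for i in range(2, p + 1):
        X[:, i - 1] = np.sqrt(2.0) * np.cos((i - 1) * np.pi * x)
    return X

# Illustrative synthetic data on [0, 1]: y = f(x) + noise.
rng = np.random.default_rng(0)
N, p = 200, 15
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

X = cosine_design_matrix(x, p)                     # N x p feature matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of beta
y_hat = X @ beta_hat                               # fitted values: projection onto the basis
```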
Different choices of the Basis Functions lead to different design matrices \({\bf X}\). Some popular choices are Splines, Chebyshev Polynomials, Legendre Polynomials, and Wavelets, among others. Specifically, function approximation with wavelets takes the following form [see Wavelet Regression in Python for a very nice demonstration of wavelets for denoising]: \[\phi(x) = \alpha\zeta(x) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j-1}\beta_{jk}\psi_{jk}(x)\] where \[\alpha=\int_0^1\phi(x)\zeta(x)dx\text{, }\beta_{jk}=\int_0^1\phi(x)\psi_{jk}(x)dx.\] Here \(\alpha\) and \(\beta_{jk}\) are called the scaling coefficients and the detail coefficients, respectively. The basic idea is that the detail coefficients capture the finer details of the function while the scaling, or smoothing, coefficients capture the overall functional form.
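As a hedged illustration of the wavelet expansion above, the sketch below assumes the PyWavelets package (`pywt`): it splits a noisy signal into scaling and detail coefficients, soft-thresholds the details, and reconstructs the estimate of \(\phi\). The wavelet (`db4`), the decomposition level, and the threshold value are arbitrary choices made for illustration.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 1024)
signal = np.sin(2 * np.pi * x) + 0.5 * np.sign(x - 0.5)   # smooth part plus a jump
y = signal + 0.2 * rng.standard_normal(x.size)            # noisy observations

# coeffs[0] holds the scaling (smooth) coefficients; coeffs[1:] hold the
# detail coefficients, ordered from the coarsest to the finest scale.
coeffs = pywt.wavedec(y, 'db4', level=5)

# Shrink only the detail coefficients (a simple soft-thresholding rule).
coeffs[1:] = [pywt.threshold(c, value=0.3, mode='soft') for c in coeffs[1:]]

# Reconstruct the denoised estimate of phi(x).
phi_hat = pywt.waverec(coeffs, 'db4')[: y.size]
```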
-So, by choosing a basis function, we map the observed inputs into the feature space, and project the response onto the space spanned by the basis functions. This is basically a Linear Regression in a different function space (induced by the basis functions).
+So, by choosing a set of basis functions, we map the observed inputs into the feature space they define and project the response onto the space spanned by those basis functions. This is basically Linear Regression in a different function space (the one induced by the basis functions).
-How we do extend the above formulation to multivariate case. Say, we have \(n_{0}\) dimensional inputs. One obvious but non-trivial way is to choose a multivariate basis function. The other way to do that is to construct feature space based on univariate functions which we know to approximate already. For example, take the cartesian product space as \(f(x) \equiv \phi(x) = \Pi_{p=1}^{n_{0}} \phi_p(x_p)\) where \(\phi_p\) can be expanded like before (univariate case). Or we can construct the function \(f(x)\) additively as \[
+How do we extend the above formulation to the multivariate case? Say we have \(n_{0}\)-dimensional inputs. One obvious but non-trivial way is to choose a multivariate basis function. The other way is to construct the feature space based on univariate functions, which we already know how to approximate. For example, take the Cartesian product space as \(f(x) \equiv \phi(x) = \prod_{p=1}^{n_{0}} \phi_p(x_p)\) where each \(\phi_p\) can be expanded as before (univariate case). Or we can construct the function \(f(x)\) additively as \[
\begin{array}{l}
f(x) \equiv \phi(x) = \sum_{p=1}^{n_{0}} \phi_p(x_p)
\end{array}
-\] See the figure below.
+\]
That is, for every dimension \(p\) of the input there is a corresponding \(\phi_p\), and we add them up to get the multivariate function. In fact, the above specification appears in the framework developed by Hastie and Tibshirani in 1986, known as Generalized Additive Models (see paper, wiki), or GAMs as they are popularly referred to.
In the context of KANs, we can see this exactly as a node at layer \(l+1\) which takes inputs from layer \(l\), with the edges representing the transformation of the inputs via \(\phi_p\). Except for notational differences, each neuron in a KAN is a GAM.
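To connect the two views in code, here is a small sketch (an illustrative assumption, not the lecture's implementation) of the additive fit \(f(x) = \sum_{p} \phi_p(x_p)\): each \(\phi_p\) is expanded in the cosine basis from before, and all coefficients are estimated jointly by least squares, which is exactly what a single KAN node with a fixed basis computes.

```python
import numpy as np

def cosine_basis(x, p):
    # Univariate basis without the constant term: sqrt(2) * cos(i * pi * x), i = 1..p.
    return np.column_stack([np.sqrt(2.0) * np.cos(i * np.pi * x) for i in range(1, p + 1)])

def additive_design_matrix(X_raw, p):
    # One global intercept plus one basis block per input dimension,
    # so the fitted model is f(x) = beta_0 + sum_p phi_p(x_p).
    blocks = [np.ones((X_raw.shape[0], 1))]
    blocks += [cosine_basis(X_raw[:, j], p) for j in range(X_raw.shape[1])]
    return np.hstack(blocks)

# Illustrative data with n0 = 3 inputs in [0, 1].
rng = np.random.default_rng(0)
N, n0, p = 500, 3, 10
X_raw = rng.uniform(0.0, 1.0, size=(N, n0))
y = np.sin(2 * np.pi * X_raw[:, 0]) + X_raw[:, 1] ** 2 + 0.1 * rng.standard_normal(N)

Phi = additive_design_matrix(X_raw, p)
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # one block of p coefficients per phi_p
y_hat = Phi @ beta_hat
```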
-So far we are dealing with single output and multiple inputs. But what if there are \(n_{1}\) outputs. This leads us to Vector Generalized Additive Models (VGAMs), proposed by Yee and Wild in 1996 (see paper, wiki). We can represent VGLAM as:
+So far we have been dealing with a single output and multiple inputs. But what if there are \(n_{1}\) outputs? This leads us to Vector Generalized Additive Models (VGAMs), proposed by Yee and Wild in 1996 (see paper, wiki). We can represent a VGAM as:
\[
\begin{array}{l}
y_q \equiv f_{q,.}(x) \equiv \phi_{q,.}(x) = \sum_{p=1}^{n_{0}} \phi_{q,p}(x_p)
@@ -467,12 +467,12 @@
In effect, one KAN layer is actually a VGAM. So, given \(D=\{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) with inputs \(x^{[i]} \in R^{n_{0}}\) and outputs \(y^{[i]} \in R^{n_{1}}\), a KAN layer and a VGAM both learn the \(\phi_{q,p}\), which are specified in terms of the Basis Functions. Note that the choice of the basis functions should correspond to the domain and range of the functions being modeled. GAM and VGAM models are traditionally trained (learnt) using back-fitting techniques. There is no reason why we cannot apply backprop and fit these models using gradient descent.
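Since one KAN layer is a VGAM, we can indeed fit it by gradient descent. Below is a minimal PyTorch sketch assuming a cosine basis for each \(\phi_{q,p}\); the actual PyKAN implementation uses B-splines plus a residual SiLU term, and the class name `CosineKANLayer` and the toy targets are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CosineKANLayer(nn.Module):
    """One KAN layer / VGAM: y_q = sum_p phi_{q,p}(x_p), where each phi_{q,p}
    is a linear combination of n_basis fixed cosine basis functions."""
    def __init__(self, n_in, n_out, n_basis=8):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_basis + 1).float() * torch.pi)
        # One coefficient per (output q, input p, basis function k).
        self.coef = nn.Parameter(0.1 * torch.randn(n_out, n_in, n_basis))
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, x):                                            # x: (batch, n_in)
        B = (2.0 ** 0.5) * torch.cos(x.unsqueeze(-1) * self.freqs)   # (batch, n_in, n_basis)
        # Sum phi_{q,p}(x_p) over the input index p (and the basis index k).
        return torch.einsum("bik,oik->bo", B, self.coef) + self.bias

# Fit a single layer (i.e. a VGAM) by gradient descent instead of back-fitting.
torch.manual_seed(0)
X = torch.rand(512, 3)                                  # n0 = 3 inputs in [0, 1]
Y = torch.stack([X[:, 0].sin() + X[:, 1] ** 2,          # n1 = 2 toy outputs
                 X.prod(dim=1)], dim=1)

layer = CosineKANLayer(n_in=3, n_out=2)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((layer(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()
```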
-What remains to be done is to stack the KAN layers or VGAMs. That gets us to the model we have seen above: \[
+ What remains to be done is to stack the KAN layers or VGAMs. That gets us to the model we have seen before: \[
f(\bf{x})=\sum_{i_{L-1}=1}^{n_{L-1}}\phi_{L-1,i_{L},i_{L-1}}\left(\sum_{i_{L-2}=1}^{n_{L-2}}\cdots\left(\sum_{i_2=1}^{n_2}\phi_{2,i_3,i_2}\left(\sum_{i_1=1}^{n_1}\phi_{1,i_2,i_1}\left(\sum_{i_0=1}^{n_0}\phi_{0,i_1,i_0}(x_{i_0})\right)\right)\right)\cdots\right)
\]
For clarity's sake, \(x_{i_0}\) refers to the \(i_0\)th input dimension (feature), and \(\phi_{l,q,p}(.)\) refers to the function that maps the \(p\)th input of layer \(l\) to the \(q\)th output (before summation). From the above figure, it is clear that the edges of the KAN network are non-parametric functions \(\phi(x)\) which are specified via basis functions such as Splines. Even MLPs have weights on the edges. So the mainstream interpretation that, in KANs, the activations sit on the edges and are learnable is not very convincing: in both MLPs and KANs, the edges are learnable. For example, if we choose \(\phi(x) = \beta x\) and follow each node's sum with a fixed, elementwise activation like \(SiLU\), we get the standard MLP network. The KAN proposed by Liu et al. in 2024 is a Deep Non-parametric Regression framework. Specifically, we note the following:
Not only that, the modularity of the KAN layer allows one to mix and match the KAN layer with other modules, such as a Transformer block or an RNN block. We can replace an MLP with a KAN almost like a drop-in. We have already seen implementations of KANs in GPT models (GTP-KAN, GPT2-KAN). They have also started appearing in CNNs. For more resources, see awesome-KANS. To explore the features of KANs, such as interpretability and solving PDEs, check out this KAN Features notebook or go through the examples from the PyKAN official repo.
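Finally, stacking such layers gives the nested-sum composition shown above. The sketch below assumes the `CosineKANLayer` from the previous code block is in scope, and contrasts it with the MLP special case in which every edge is a plain weight \(\phi(x) = \beta x\) and a fixed elementwise activation follows each node's sum; both are ordinary modules that can be dropped into a larger model.

```python
import torch.nn as nn  # CosineKANLayer is assumed from the sketch above

# A depth-3 KAN: each layer applies its own phi_{l,q,p} to every input and sums.
# (In this sketch the intermediate outputs are not rescaled to [0, 1]; PyKAN
# manages the spline grids/ranges for the analogous step.)
kan = nn.Sequential(
    CosineKANLayer(n_in=3, n_out=16),
    CosineKANLayer(n_in=16, n_out=16),
    CosineKANLayer(n_in=16, n_out=1),
)

# The MLP special case: learnable weights on the edges, fixed SiLU after each sum.
mlp = nn.Sequential(
    nn.Linear(3, 16), nn.SiLU(),
    nn.Linear(16, 16), nn.SiLU(),
    nn.Linear(16, 1),
)
```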
-
@@ -498,21 +498,23 @@ Deep Non-pa
A few limitations of KANs at this time are:
-That said, KANs are extremely interesting in the sense that, they are:
+That said, KANs are extremely interesting in the sense that: