diff --git a/.nojekyll b/.nojekyll
index ed0ed0e..6d5fffd 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-fe4b8cc8
\ No newline at end of file
+10462241
\ No newline at end of file
diff --git a/lectures/L03.html b/lectures/L03.html
index c0bc57a..c32d1ad 100644
--- a/lectures/L03.html
+++ b/lectures/L03.html
@@ -407,45 +407,45 @@
-The above development of KANs is compelling but it is much more illustrative to approach from Non-parametric Regression point of view, and then see that KAN’s are a type of Deep Non-parametric Regressors.
+The above development of KANs is compelling, but it is much more illustrative to approach it from a Non-parametric Regression point of view and then see that KANs are a type of Deep Non-parametric Regressor. Then we realize that the standard terminology of neurons, activations, pre-activations, post-activations, etc., can be dropped completely.
-Consider the following regression problem \[y^{[i]} \equiv f(x^{[i]}) + e^{[i]} \equiv \phi(x^{[i]}) + e^{[i]}, i \in \left\{1,\dots,N\right\}\] with \(D = \{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) representing all the data available to fit (train) the model \(f(x)\). It is customary to write the model as \(f(x)\) instead of \(\phi(x)\). It is done for compatibility with KAN. In the shallow case \(f(x) = \phi(x)\) otherwise \(\phi(x)\) is used to denote the building blocks that \(f(x)\) will be made of.
-For a moment, w.l.o.g, assume \(x\) to be univariate. In a typical regression setup, one constructs features such as \(x^2, x^3,\dots\), like in polynomial regression and treats this as a Linear Regression problem. We can view this standard procedures as expanding the function \(f(x)\) on a set of Polynomials. Seen in a more general sense, we can choose an appropriate Basis Functions to construct the feature space. For example [see Chapter 9 of All of Nonparametric Statistics] \[f(x) \equiv \phi(x) = \sum_{i=1}^{\infty} \beta_i B_i(x)\] where \(B_1(x) \equiv 1, B_i(x) \equiv \sqrt{2}\cos((i-1)\pi x) \text{ for } i \ge 2\). See the figure below.
+Consider the following regression problem \[y^{[i]} \equiv f(x^{[i]}) + e^{[i]} \equiv \phi(x^{[i]}) + e^{[i]}, i \in \left\{1,\dots,N\right\}\] with \(D = \{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) representing all the data available to fit (train) the model \(f(x)\). It is customary to write the model as \(f(x)\) instead of \(\phi(x)\); this is done for compatibility with the KAN notation. In the shallow case \(f(x) = \phi(x)\); otherwise \(\phi(x)\) is used to denote the building blocks, and \(f(x)\) will be a composition of many such building blocks.
+For a moment, w.l.o.g., assume \(x\) to be univariate. In a typical regression setup, one constructs features such as \(x^2, x^3,\dots\), as in polynomial regression, and treats this as a Linear Regression problem. We can view this standard procedure as expanding the function \(f(x)\) on a set of Polynomials. Seen in a more general sense, we can choose an appropriate set of Basis Functions to construct the feature space. For example [see Chapter 9 of All of Nonparametric Statistics], \[f(x) \equiv \phi(x) = \sum_{i=1}^{\infty} \beta_i B_i(x)\] where \(B_1(x) \equiv 1, B_i(x) \equiv \sqrt{2}\cos((i-1)\pi x) \text{ for } i \ge 2\). See the figure below.
In practice, we will truncate the expansion up to some finite number of terms. For illustration, say we choose \(p\) terms. Then, in matrix notation, the regression problem is:
\[
{\bf y} = {\bf X}{\bf \beta} + {\bf \epsilon}
\]
where
\[
\begin{array}{lcl}
{\bf X}_{N \times p} &=&
\begin{pmatrix}
1 & \sqrt{2}\cos(\pi x^{[1]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[1]}) \\
1 & \sqrt{2}\cos(\pi x^{[2]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[2]}) \\
\vdots & & & \vdots \\
1 & \sqrt{2}\cos(\pi x^{[N]}) & \dots & \sqrt{2}\cos((p-1)\pi x^{[N]})
\end{pmatrix} \\
{\bf \beta}_{p \times 1} &=& [\beta_1, \beta_2, \dots, \beta_p ]^T \\
{\bf y}_{N \times 1} &=& [y^{[1]}, y^{[2]}, \dots, y^{[N]} ]^T
\end{array}
\]
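To make this concrete, here is a minimal NumPy sketch (not from the lecture) that builds the cosine-basis design matrix \({\bf X}\) above and estimates \({\bf \beta}\) by ordinary least squares. The synthetic target and the helper name `cosine_design_matrix` are illustrative assumptions.

```python
import numpy as np

def cosine_design_matrix(x, p):
    """N x p design matrix: B_1(x) = 1 and B_i(x) = sqrt(2) * cos((i-1) * pi * x) for i >= 2."""
    X = np.ones((len(x), p))
    for i in range(2, p + 1):
        X[:, i - 1] = np.sqrt(2.0) * np.cos((i - 1) * np.pi * x)
    return X

# Illustrative synthetic data on [0, 1]: y = f(x) + noise.
rng = np.random.default_rng(0)
N, p = 200, 15
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

X = cosine_design_matrix(x, p)                     # N x p feature matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of beta
y_hat = X @ beta_hat                               # fitted values: projection onto the basis
```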
Different choices of the Basis Functions lead to different design matrices \({\bf X}\). Some popular choices are Splines, Chebyshev Polynomials, Legendre Polynomials, and Wavelets, among others. Specifically, function approximation with wavelets takes the following form [see Wavelet Regression in Python for a very nice demonstration of wavelets for denoising]: \[\phi(x) = \alpha\zeta(x) + \sum_{j=0}^{J-1}\sum_{k=0}^{2^j-1}\beta_{jk}\psi_{jk}(x)\] where \[\alpha=\int_0^1\phi(x)\zeta(x)dx\text{, }\beta_{jk}=\int_0^1\phi(x)\psi_{jk}(x)dx.\] Here \(\alpha\) and \(\beta_{jk}\) are called the scaling coefficients and the detail coefficients, respectively. The basic idea is that the detail coefficients capture the finer details of the function while the scaling, or smoothing, coefficients capture the overall functional form.
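As a hedged illustration of the wavelet expansion above, the sketch below assumes the PyWavelets package (`pywt`): it splits a noisy signal into scaling and detail coefficients, soft-thresholds the details, and reconstructs the estimate of \(\phi\). The wavelet (`db4`), the decomposition level, and the threshold value are arbitrary choices made for illustration.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 1024)
signal = np.sin(2 * np.pi * x) + 0.5 * np.sign(x - 0.5)   # smooth part plus a jump
y = signal + 0.2 * rng.standard_normal(x.size)            # noisy observations

# coeffs[0] holds the scaling (smooth) coefficients; coeffs[1:] hold the
# detail coefficients, ordered from the coarsest to the finest scale.
coeffs = pywt.wavedec(y, 'db4', level=5)

# Shrink only the detail coefficients (a simple soft-thresholding rule).
coeffs[1:] = [pywt.threshold(c, value=0.3, mode='soft') for c in coeffs[1:]]

# Reconstruct the denoised estimate of phi(x).
phi_hat = pywt.waverec(coeffs, 'db4')[: y.size]
```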
-So, by choosing a basis function, we map the observed inputs into the feature space, and project the response onto the space spanned by the basis functions. This is basically a Linear Regression in a different function space (induced by the basis functions).
+So, by choosing a set of basis functions, we map the observed inputs into the feature space they define and project the response onto the space spanned by those basis functions. This is basically Linear Regression in a different function space (the one induced by the basis functions).
-How we do extend the above formulation to multivariate case. Say, we have \(n_{0}\) dimensional inputs. One obvious but non-trivial way is to choose a multivariate basis function. The other way to do that is to construct feature space based on univariate functions which we know to approximate already. For example, take the cartesian product space as \(f(x) \equiv \phi(x) = \Pi_{p=1}^{n_{0}} \phi_p(x_p)\) where \(\phi_p\) can be expanded like before (univariate case). Or we can construct the function \(f(x)\) additively as \[
+How do we extend the above formulation to the multivariate case? Say we have \(n_{0}\)-dimensional inputs. One obvious but non-trivial way is to choose a multivariate basis function. The other way is to construct the feature space based on univariate functions, which we already know how to approximate. For example, take the Cartesian product space as \(f(x) \equiv \phi(x) = \prod_{p=1}^{n_{0}} \phi_p(x_p)\) where each \(\phi_p\) can be expanded as before (univariate case). Or we can construct the function \(f(x)\) additively as \[
\begin{array}{l}
f(x) \equiv \phi(x) = \sum_{p=1}^{n_{0}} \phi_p(x_p)
\end{array}
-\] See the figure below.
+\]
That is, for every dimension \(p\) of the input there is a corresponding \(\phi_p\), and we add them up to get the multivariate function. In fact, the above specification appears in the framework developed by Hastie and Tibshirani in 1986, known as Generalized Additive Models (see paper, wiki), or GAMs as they are popularly referred to.
In the context of KANs, we can see this exactly as a node at layer \(l+1\) which takes inputs from layer \(l\), with the edges representing the transformation of the inputs via \(\phi_p\). Except for notational differences, each neuron in a KAN is a GAM.
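To connect the two views in code, here is a small sketch (an illustrative assumption, not the lecture's implementation) of the additive fit \(f(x) = \sum_{p} \phi_p(x_p)\): each \(\phi_p\) is expanded in the cosine basis from before, and all coefficients are estimated jointly by least squares, which is exactly what a single KAN node with a fixed basis computes.

```python
import numpy as np

def cosine_basis(x, p):
    # Univariate basis without the constant term: sqrt(2) * cos(i * pi * x), i = 1..p.
    return np.column_stack([np.sqrt(2.0) * np.cos(i * np.pi * x) for i in range(1, p + 1)])

def additive_design_matrix(X_raw, p):
    # One global intercept plus one basis block per input dimension,
    # so the fitted model is f(x) = beta_0 + sum_p phi_p(x_p).
    blocks = [np.ones((X_raw.shape[0], 1))]
    blocks += [cosine_basis(X_raw[:, j], p) for j in range(X_raw.shape[1])]
    return np.hstack(blocks)

# Illustrative data with n0 = 3 inputs in [0, 1].
rng = np.random.default_rng(0)
N, n0, p = 500, 3, 10
X_raw = rng.uniform(0.0, 1.0, size=(N, n0))
y = np.sin(2 * np.pi * X_raw[:, 0]) + X_raw[:, 1] ** 2 + 0.1 * rng.standard_normal(N)

Phi = additive_design_matrix(X_raw, p)
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # one block of p coefficients per phi_p
y_hat = Phi @ beta_hat
```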
-So far we are dealing with single output and multiple inputs. But what if there are \(n_{1}\) outputs. This leads us to Vector Generalized Additive Models (VGAMs), proposed by Yee and Wild in 1996 (see paper, wiki). We can represent VGLAM as:
+So far we have been dealing with a single output and multiple inputs. But what if there are \(n_{1}\) outputs? This leads us to Vector Generalized Additive Models (VGAMs), proposed by Yee and Wild in 1996 (see paper, wiki). We can represent a VGAM as:
\[
\begin{array}{l}
y_q \equiv f_{q,.}(x) \equiv \phi_{q,.}(x) = \sum_{p=1}^{n_{0}} \phi_{q,p}(x_p)
@@ -467,12 +467,12 @@
In effect, one KAN layer is actually a VGAM. So, given \(D=\{x^{[i]}, y^{[i]}\}_{i=1}^{N}\) with inputs \(x^{[i]} \in R^{n_{0}}\) and outputs \(y^{[i]} \in R^{n_{1}}\), a KAN layer and a VGAM both learn the \(\phi_{q,p}\), which are specified in terms of the Basis Functions. Note that the choice of the basis functions should correspond to the domain and range of the functions being modeled. GAM and VGAM models are traditionally trained (learnt) using back-fitting techniques. There is no reason why we cannot apply backprop and fit these models using gradient descent.
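Since one KAN layer is a VGAM, we can indeed fit it by gradient descent. Below is a minimal PyTorch sketch assuming a cosine basis for each \(\phi_{q,p}\); the actual PyKAN implementation uses B-splines plus a residual SiLU term, and the class name `CosineKANLayer` and the toy targets are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CosineKANLayer(nn.Module):
    """One KAN layer / VGAM: y_q = sum_p phi_{q,p}(x_p), where each phi_{q,p}
    is a linear combination of n_basis fixed cosine basis functions."""
    def __init__(self, n_in, n_out, n_basis=8):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_basis + 1).float() * torch.pi)
        # One coefficient per (output q, input p, basis function k).
        self.coef = nn.Parameter(0.1 * torch.randn(n_out, n_in, n_basis))
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, x):                                            # x: (batch, n_in)
        B = (2.0 ** 0.5) * torch.cos(x.unsqueeze(-1) * self.freqs)   # (batch, n_in, n_basis)
        # Sum phi_{q,p}(x_p) over the input index p (and the basis index k).
        return torch.einsum("bik,oik->bo", B, self.coef) + self.bias

# Fit a single layer (i.e. a VGAM) by gradient descent instead of back-fitting.
torch.manual_seed(0)
X = torch.rand(512, 3)                                  # n0 = 3 inputs in [0, 1]
Y = torch.stack([X[:, 0].sin() + X[:, 1] ** 2,          # n1 = 2 toy outputs
                 X.prod(dim=1)], dim=1)

layer = CosineKANLayer(n_in=3, n_out=2)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    loss = ((layer(X) - Y) ** 2).mean()
    loss.backward()
    opt.step()
```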
-What remains to be done is to stack the KAN layers or VGAMs. That gets us to the model we have seen above: \[
+ What remains to be done is to stack the KAN layers or VGAMs. That gets us to the model we have seen before: \[
f(\bf{x})=\sum_{i_{L-1}=1}^{n_{L-1}}\phi_{L-1,i_{L},i_{L-1}}\left(\sum_{i_{L-2}=1}^{n_{L-2}}\cdots\left(\sum_{i_2=1}^{n_2}\phi_{2,i_3,i_2}\left(\sum_{i_1=1}^{n_1}\phi_{1,i_2,i_1}\left(\sum_{i_0=1}^{n_0}\phi_{0,i_1,i_0}(x_{i_0})\right)\right)\right)\cdots\right)
\]
For clarity's sake, \(x_{i_0}\) refers to the \(i_0\)th input dimension (feature), and \(\phi_{l,q,p}(.)\) refers to the function that maps the \(p\)th input of layer \(l\) to the \(q\)th output (before summation). From the above figure, it is clear that the edges of the KAN network are non-parametric functions \(\phi(x)\) which are specified via basis functions such as Splines. Even MLPs have weights on the edges. So the mainstream interpretation that, in KANs, the activations sit on the edges and are learnable is not very convincing: in both MLPs and KANs, the edges are learnable. For example, if we choose \(\phi(x) = \beta x\) and follow each node's sum with a fixed, elementwise activation like \(SiLU\), we get the standard MLP network. The KAN proposed by Liu et al. in 2024 is a Deep Non-parametric Regression framework. Specifically, we note the following:
Not only that, the modularity of the KAN layer allows one to mix and match the KAN layer with other modules, such as a Transformer block or an RNN block. We can replace an MLP with a KAN almost like a drop-in. We have already seen implementations of KANs in GPT models (GTP-KAN, GPT2-KAN). They have also started appearing in CNNs. For more resources, see awesome-KANS. To explore the features of KANs, such as interpretability and solving PDEs, check out this KAN Features notebook or go through the examples from the PyKAN official repo.
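Finally, stacking such layers gives the nested-sum composition shown above. The sketch below assumes the `CosineKANLayer` from the previous code block is in scope, and contrasts it with the MLP special case in which every edge is a plain weight \(\phi(x) = \beta x\) and a fixed elementwise activation follows each node's sum; both are ordinary modules that can be dropped into a larger model.

```python
import torch.nn as nn  # CosineKANLayer is assumed from the sketch above

# A depth-3 KAN: each layer applies its own phi_{l,q,p} to every input and sums.
# (In this sketch the intermediate outputs are not rescaled to [0, 1]; PyKAN
# manages the spline grids/ranges for the analogous step.)
kan = nn.Sequential(
    CosineKANLayer(n_in=3, n_out=16),
    CosineKANLayer(n_in=16, n_out=16),
    CosineKANLayer(n_in=16, n_out=1),
)

# The MLP special case: learnable weights on the edges, fixed SiLU after each sum.
mlp = nn.Sequential(
    nn.Linear(3, 16), nn.SiLU(),
    nn.Linear(16, 16), nn.SiLU(),
    nn.Linear(16, 1),
)
```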
-
@@ -498,21 +498,23 @@ Deep Non-pa
A few limitations of KANs at this time are:
-That said, KANs are extremely interesting in the sense that, they are:
+That said, KANs are extremely interesting in the sense that: