diff --git a/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md b/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
index e9e0c46..f7aa63d 100644
--- a/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
+++ b/base/science-tech-maths/machine-learning/algorithms/neural-nets/transformers/transformers.md
@@ -11,6 +11,7 @@
- [How to assign queries, keys and values](#how-to-assign-queries-keys-and-values)
- [Cross attention](#cross-attention)
- [Alternative interpretation of attention](#alternative-interpretation-of-attention)
+ - [Sliding Window Attention (SWA)](#sliding-window-attention-swa)
- [Embeddings](#embeddings)
- [Feed-forward network](#feed-forward-network)
- [Encoder block](#encoder-block)
@@ -368,6 +369,14 @@ $X$ is trainable, since it's just the embedding matrix.
In summary, attention seems to work because it mimics nearest neighbors EXCEPT it uses a NON SYMMETRIC similarity measure, and cleverly "passes" similarity information downstream using a final mixing projection.
+#### Sliding Window Attention (SWA)
+
+In vanilla attention, the number of operations is quadratic in the sequence length, and the memory (the KV cache at inference) grows linearly with the number of tokens.
+
+In SWA, each token can attend to at most $W$ tokens from the previous layer. However, because attention layers are stacked, information from tokens outside the window can still reach the current token, similar to the receptive field in CNNs. After $k$ layers, information can propagate forward by up to $k \times W$ tokens.
+
+At the last layer, with a window size of $W = 4096$ and 32 stacked layers, the theoretical attention span is about $32 \times 4096 \approx 131$K tokens.
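+
+A minimal sketch (PyTorch assumed; function names are illustrative) of the banded causal mask a single SWA layer uses. For clarity it still materializes the full score matrix, which real SWA implementations avoid:
+
+```python
+import torch
+
+
+def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
+    """Boolean mask: position i may attend to positions i - window + 1 .. i."""
+    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column vector)
+    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row vector)
+    return (j <= i) & (j > i - window)
+
+
+def sliding_window_attention(q, k, v, window: int):
+    # q, k, v: (seq_len, d). Masked positions get -inf before the softmax.
+    scores = (q @ k.T) / q.shape[-1] ** 0.5
+    mask = sliding_window_mask(q.shape[0], window)
+    scores = scores.masked_fill(~mask, float("-inf"))
+    return torch.softmax(scores, dim=-1) @ v
+
+
+# 8 tokens, window of 4: each token attends to itself and at most 3 previous tokens.
+q = k = v = torch.randn(8, 16)
+out = sliding_window_attention(q, k, v, window=4)
+```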
+
### Embeddings
To convert from tokens to embeddings and vice-versa, the transformer uses a weight matrix $W_{emb} \in \mathbb{R}^{vocab \times d}$. This matrix is learned during training. It is used at three different places in the transformer:
diff --git a/base/science-tech-maths/machine-learning/hardware/gpu-providers.md b/base/science-tech-maths/machine-learning/hardware/gpu-providers.md
index 373b5aa..ba1da82 100644
--- a/base/science-tech-maths/machine-learning/hardware/gpu-providers.md
+++ b/base/science-tech-maths/machine-learning/hardware/gpu-providers.md
@@ -8,3 +8,4 @@
-
-
-
+-
diff --git a/base/science-tech-maths/machine-learning/machine-learning.md b/base/science-tech-maths/machine-learning/machine-learning.md
index 562253a..f2a4e57 100644
--- a/base/science-tech-maths/machine-learning/machine-learning.md
+++ b/base/science-tech-maths/machine-learning/machine-learning.md
@@ -55,6 +55,8 @@ A machine learning algorithm is an algorithm that is able to learn patterns from
- [Embeddings](#embeddings)
- [Output of sigmoid is not a probability](#output-of-sigmoid-is-not-a-probability)
- [No free lunch theorem](#no-free-lunch-theorem)
+ - [NLP](#nlp)
+ - [Grams](#grams)
- [Problems of AI](#problems-of-ai)
- [Jobs in ML](#jobs-in-ml)
- [Advice for ML engineers](#advice-for-ml-engineers)
@@ -633,6 +635,21 @@ For every case where an algorithm works, I could construct a situation where it
Implies that assumptions are where the power of your model comes from. This is why it's important to understand the assumptions and inductive priors of your model.
+## NLP
+
+### Grams
+
+- unigram: one word
+- bigram: two words
+- trigram: three words
+- n-gram: n words
+
+A bigram is a two-word sequence: the sentence "How are you doing" contains the bigrams "How are", "are you", and "you doing".
+
+The items can be phonemes, syllables, letters, words, or base pairs, depending on the application.
+
+How many $N$-grams are in a sentence? If $X$ is the number of words in the sentence, the number of $N$-grams is $X - (N - 1)$.
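+
+A minimal sketch in Python (the helper name `ngrams` is arbitrary):
+
+```python
+def ngrams(sentence: str, n: int) -> list[tuple[str, ...]]:
+    words = sentence.split()
+    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
+
+
+print(ngrams("How are you doing", 2))
+# [('How', 'are'), ('are', 'you'), ('you', 'doing')]
+# Count check: X - (N - 1) = 4 - (2 - 1) = 3 bigrams.
+```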
+
## Problems of AI
- Human-in-the-loop requirements (facebook still employs 15000 people to assist their content moderation algorithm)
diff --git a/base/science-tech-maths/programming/databases/databases.md b/base/science-tech-maths/programming/databases/databases.md
index 55fbe6a..5aead97 100644
--- a/base/science-tech-maths/programming/databases/databases.md
+++ b/base/science-tech-maths/programming/databases/databases.md
@@ -41,3 +41,8 @@ psql postgresql://postgres:some_password@postgres_db:5432/my_db_name
### UUID
Use UUIDv7 as primary key:
+
+### Links
+
+-
+-
diff --git a/base/science-tech-maths/programming/languages/python/python.md b/base/science-tech-maths/programming/languages/python/python.md
index e5a4aae..3e60200 100644
--- a/base/science-tech-maths/programming/languages/python/python.md
+++ b/base/science-tech-maths/programming/languages/python/python.md
@@ -937,6 +937,8 @@ print(example.__name__) # example
print(example.__doc__) # Docstring
```
+How to properly annotate sync/async function decorators:
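+
+A minimal sketch using `ParamSpec` and `typing.overload` (the decorator `timed` is just an illustrative example). The coroutine overload must come first, because an async function also matches the plain-callable overload:
+
+```python
+import functools
+import inspect
+import time
+from collections.abc import Callable, Coroutine
+from typing import Any, ParamSpec, TypeVar, overload
+
+P = ParamSpec("P")
+R = TypeVar("R")
+
+
+@overload
+def timed(func: Callable[P, Coroutine[Any, Any, R]]) -> Callable[P, Coroutine[Any, Any, R]]: ...
+@overload
+def timed(func: Callable[P, R]) -> Callable[P, R]: ...
+
+
+def timed(func: Callable[P, Any]) -> Callable[P, Any]:
+    """Report wall-clock time of sync or async calls, preserving the signature."""
+    if inspect.iscoroutinefunction(func):
+        @functools.wraps(func)
+        async def async_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
+            start = time.perf_counter()
+            try:
+                return await func(*args, **kwargs)
+            finally:
+                print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
+        return async_wrapper
+
+    @functools.wraps(func)
+    def sync_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
+        start = time.perf_counter()
+        try:
+            return func(*args, **kwargs)
+        finally:
+            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
+    return sync_wrapper
+```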
+
## Context managers (with)
Use context managers (`with...`) instead of `try` + `finally`!