Update notes
Jonas1312 committed May 5, 2024
1 parent ef12bfe commit e4c2c5d
Showing 5 changed files with 34 additions and 0 deletions.
@@ -11,6 +11,7 @@
- [How to assign queries, keys and values](#how-to-assign-queries-keys-and-values)
- [Cross attention](#cross-attention)
- [Alternative interpretation of attention](#alternative-interpretation-of-attention)
- [Sliding Window Attention (SWA)](#sliding-window-attention-swa)
- [Embeddings](#embeddings)
- [Feed-forward network](#feed-forward-network)
- [Encoder block](#encoder-block)
@@ -368,6 +369,14 @@ $X$ is trainable, since it's just the embedding matrix.

In summary, attention seems to work because it mimics nearest neighbors EXCEPT it uses a NON SYMMETRIC similarity measure, and cleverly "passes" similarity information downstream using a final mixing projection.

#### Sliding Window Attention (SWA)

The number of operations in vanilla attention is quadratic in the sequence length, and the memory (the KV cache at inference time) grows linearly with the number of tokens.

In SWA, each token can attend to at most W tokens from the previous layer. Information from tokens outside the window can still reach the current token, because attention layers are stacked: each hidden state already summarizes a window of positions from the layer below. This is similar to the receptive field in CNNs. After k layers, information can move forward by up to k*W tokens.

At the last layer, using a window size of W = 4096 and 32 layers (as in Mistral 7B), we get a theoretical attention span of 32 × 4096 ≈ 131K tokens.
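
A minimal sketch of the mask a single SWA layer applies (NumPy, the helper name `sliding_window_mask` is mine, not from any particular implementation), to make the window and the CNN-like receptive field concrete:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: token i can attend to tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=8, window=3).astype(int))
# Each row i shows which positions token i can see within ONE layer (at most W = 3).
# Stacking k such layers lets information from roughly k*W positions back reach token i,
# just like the receptive field grows with depth in a CNN.
```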

### Embeddings

To convert from tokens to embeddings and vice-versa, the transformer uses a weight matrix $W_{emb} \in \mathbb{R}^{vocab \times d}$. This matrix is learned during training. It is used at three different places in the transformer:
@@ -8,3 +8,4 @@
- <https://azure.microsoft.com/en-gb/global-infrastructure/services/>
- <https://azure.microsoft.com/en-gb/pricing/details/virtual-machines/linux/>
- <https://datalab.sspcloud.fr/catalog/ide>
- <https://www.gpudeploy.com/>
17 changes: 17 additions & 0 deletions base/science-tech-maths/machine-learning/machine-learning.md
@@ -55,6 +55,8 @@ A machine learning algorithm is an algorithm that is able to learn patterns from
- [Embeddings](#embeddings)
- [Output of sigmoid is not a probability](#output-of-sigmoid-is-not-a-probability)
- [No free lunch theorem](#no-free-lunch-theorem)
- [NLP](#nlp)
- [Grams](#grams)
- [Problems of AI](#problems-of-ai)
- [Jobs in ML](#jobs-in-ml)
- [Advice for ML engineers](#advice-for-ml-engineers)
@@ -633,6 +635,21 @@ For every case where an algorithm works, I could construct a situation where it

This implies that assumptions are where the power of your model comes from. This is why it's important to understand the assumptions and inductive priors of your model.

## NLP

### Grams

- unigram: one word
- bigram: two words
- trigram: three words
- n-gram: n words

A bigram is a two-word sequence. For "How are you doing", the bigrams are "How are", "are you", "you doing".

The items can be phonemes, syllables, letters, words, or base pairs according to the application.

How many $N$-grams are there in a sentence? If $X$ is the number of words in the sentence, the number of $N$-grams is $X - (N - 1)$.
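
A quick sketch that matches the formula (the helper name `ngrams` is just for illustration):

```python
def ngrams(sentence: str, n: int) -> list[tuple[str, ...]]:
    """Return all word n-grams of a sentence."""
    words = sentence.split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]

print(ngrams("How are you doing", 2))
# [('How', 'are'), ('are', 'you'), ('you', 'doing')]
# 4 words -> 4 - (2 - 1) = 3 bigrams, as the formula predicts.
```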

## Problems of AI

- Human-in-the-loop requirements (Facebook still employs 15,000 people to assist their content moderation algorithm)
5 changes: 5 additions & 0 deletions base/science-tech-maths/programming/databases/databases.md
@@ -41,3 +41,8 @@ psql postgresql://postgres:some_password@postgres_db:5432/my_db_name
### UUID

Use UUIDv7 as primary key: <https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/>

### Links

- <https://postgres.ai/blog/20230722-10-postgres-tips-for-beginners>
- <https://postgres.ai/blog/20220525-common-db-schema-change-mistakes>
@@ -937,6 +937,8 @@ print(example.__name__) # example
print(example.__doc__) # Docstring
```

How to properly annotate sync/async function decorators: <https://github.com/microsoft/pyright/issues/2142>
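
A minimal sketch of one way to do it (my own example using `ParamSpec` and `@overload`, Python ≥ 3.10, not the exact solution from the linked issue):

```python
import functools
import inspect
from collections.abc import Awaitable, Callable
from typing import Any, ParamSpec, TypeVar, overload

P = ParamSpec("P")
R = TypeVar("R")

@overload
def log_calls(func: Callable[P, Awaitable[R]]) -> Callable[P, Awaitable[R]]: ...
@overload
def log_calls(func: Callable[P, R]) -> Callable[P, R]: ...

def log_calls(func: Callable[P, Any]) -> Callable[P, Any]:
    """Log every call; the overloads preserve the sync/async signature for type checkers."""
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
            print(f"calling {func.__name__}")
            return await func(*args, **kwargs)
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return sync_wrapper
```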

## Context managers (with)

Use context managers (`with...`) instead of `try` + `finally`!
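
A quick illustration with file handles (`data.txt` is just a placeholder): both versions close the file even if an exception is raised, but the context manager does it in two lines.

```python
# try/finally version
f = open("data.txt")
try:
    data = f.read()
finally:
    f.close()

# context manager version: close() happens automatically, even on exceptions
with open("data.txt") as f:
    data = f.read()
```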
