Update notes
Jonas1312 committed May 5, 2024
1 parent ef12bfe commit e4c2c5d
Showing 5 changed files with 34 additions and 0 deletions.
@@ -11,6 +11,7 @@
- [How to assign queries, keys and values](#how-to-assign-queries-keys-and-values)
- [Cross attention](#cross-attention)
- [Alternative interpretation of attention](#alternative-interpretation-of-attention)
- [Sliding Window Attention (SWA)](#sliding-window-attention-swa)
- [Embeddings](#embeddings)
- [Feed-forward network](#feed-forward-network)
- [Encoder block](#encoder-block)
@@ -368,6 +369,14 @@ $X$ is trainable, since it's just the embedding matrix.

In summary, attention seems to work because it mimics nearest neighbors EXCEPT it uses a NON SYMMETRIC similarity measure, and cleverly "passes" similarity information downstream using a final mixing projection.

#### Sliding Window Attention (SWA)

The number of operations in vanilla attention is quadratic in the sequence length, and the memory (the KV cache at inference time) grows linearly with the number of tokens.

In SWA, each token can attend to at most W tokens from the previous layer. Information from tokens outside the window can still reach the current token, because attention layers are stacked: each hidden state already summarizes a window of positions from the layer below. This is similar to the receptive field in CNNs. After k layers, information can move forward by up to k*W tokens.

At the last layer, using a window size of W = 4096 and 32 layers (as in Mistral 7B), we get a theoretical attention span of 32 × 4096 ≈ 131K tokens.
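
A minimal sketch of the mask a single SWA layer applies (NumPy, the helper name `sliding_window_mask` is mine, not from any particular implementation), to make the window and the CNN-like receptive field concrete:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: token i can attend to tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=8, window=3).astype(int))
# Each row i shows which positions token i can see within ONE layer (at most W = 3).
# Stacking k such layers lets information from roughly k*W positions back reach token i,
# just like the receptive field grows with depth in a CNN.
```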

### Embeddings

To convert from tokens to embeddings and vice-versa, the transformer uses a weight matrix $W_{emb} \in \mathbb{R}^{vocab \times d}$. This matrix is learned during training. It is used at three different places in the transformer:
@@ -8,3 +8,4 @@
- <https://azure.microsoft.com/en-gb/global-infrastructure/services/>
- <https://azure.microsoft.com/en-gb/pricing/details/virtual-machines/linux/>
- <https://datalab.sspcloud.fr/catalog/ide>
- <https://www.gpudeploy.com/>
17 changes: 17 additions & 0 deletions base/science-tech-maths/machine-learning/machine-learning.md
@@ -55,6 +55,8 @@ A machine learning algorithm is an algorithm that is able to learn patterns from
- [Embeddings](#embeddings)
- [Output of sigmoid is not a probability](#output-of-sigmoid-is-not-a-probability)
- [No free lunch theorem](#no-free-lunch-theorem)
- [NLP](#nlp)
- [Grams](#grams)
- [Problems of AI](#problems-of-ai)
- [Jobs in ML](#jobs-in-ml)
- [Advice for ML engineers](#advice-for-ml-engineers)
@@ -633,6 +635,21 @@ For every case where an algorithm works, I could construct a situation where it

This implies that assumptions are where the power of your model comes from. This is why it's important to understand the assumptions and inductive priors of your model.

## NLP

### Grams

- unigram: one word
- bigram: two words
- trigram: three words
- n-gram: n words

A bigram is a two-word sequence. For "How are you doing", the bigrams are "How are", "are you", "you doing".

The items can be phonemes, syllables, letters, words, or base pairs according to the application.

How many $N$-grams are there in a sentence? If $X$ is the number of words in the sentence, the number of $N$-grams is $X - (N - 1)$.
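
A quick sketch that matches the formula (the helper name `ngrams` is just for illustration):

```python
def ngrams(sentence: str, n: int) -> list[tuple[str, ...]]:
    """Return all word n-grams of a sentence."""
    words = sentence.split()
    return [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]

print(ngrams("How are you doing", 2))
# [('How', 'are'), ('are', 'you'), ('you', 'doing')]
# 4 words -> 4 - (2 - 1) = 3 bigrams, as the formula predicts.
```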

## Problems of AI

- Human-in-the-loop requirements (Facebook still employs 15,000 people to assist their content moderation algorithm)
5 changes: 5 additions & 0 deletions base/science-tech-maths/programming/databases/databases.md
@@ -41,3 +41,8 @@ psql postgresql://postgres:some_password@postgres_db:5432/my_db_name
### UUID

Use UUIDv7 as primary key: <https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/>

### Links

- <https://postgres.ai/blog/20230722-10-postgres-tips-for-beginners>
- <https://postgres.ai/blog/20220525-common-db-schema-change-mistakes>
@@ -937,6 +937,8 @@ print(example.__name__) # example
print(example.__doc__) # Docstring
```

How to properly annotate sync/async function decorators: <https://github.com/microsoft/pyright/issues/2142>
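
A minimal sketch of one way to do it (my own example using `ParamSpec` and `@overload`, Python ≥ 3.10, not the exact solution from the linked issue):

```python
import functools
import inspect
from collections.abc import Awaitable, Callable
from typing import Any, ParamSpec, TypeVar, overload

P = ParamSpec("P")
R = TypeVar("R")

@overload
def log_calls(func: Callable[P, Awaitable[R]]) -> Callable[P, Awaitable[R]]: ...
@overload
def log_calls(func: Callable[P, R]) -> Callable[P, R]: ...

def log_calls(func: Callable[P, Any]) -> Callable[P, Any]:
    """Log every call; the overloads preserve the sync/async signature for type checkers."""
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
            print(f"calling {func.__name__}")
            return await func(*args, **kwargs)
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args: P.args, **kwargs: P.kwargs) -> Any:
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return sync_wrapper
```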

## Context managers (with)

Use context managers (`with...`) instead of `try` + `finally`!
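
A quick illustration with file handles (`data.txt` is just a placeholder): both versions close the file even if an exception is raised, but the context manager does it in two lines.

```python
# try/finally version
f = open("data.txt")
try:
    data = f.read()
finally:
    f.close()

# context manager version: close() happens automatically, even on exceptions
with open("data.txt") as f:
    data = f.read()
```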
