|
535 | 535 | "\n",
|
536 | 536 | "`[$][0-9]+([.][0-9][0-9])?`\n",
|
537 | 537 | "\n",
|
538 |
| - "Where `+` means 1 or more occurrences and `?` means at most 1 occurrence. Usually a template consists of a prefix, a target and a postfix regex. In this template, the prefix regex can be \"price:\", the target regex can be the above regex and the postfix regex can be empty.\n", |
| 538 | + "Where `+` means 1 or more occurrences and `?` means atmost 1 occurrence. Usually a template consists of a prefix, a target and a postfix regex. In this template, the prefix regex can be \"price:\", the target regex can be the above regex and the postfix regex can be empty.\n", |
539 | 539 | "\n",
|
540 | 540 | "A template can match with multiple strings. If this is the case, we need a way to resolve the multiple matches. Instead of having just one template, we can use multiple templates (ordered by priority) and pick the match from the highest-priority template. We can also use other ways to pick. For the dollar example, we can pick the match closer to the numerical half of the highest match. For the text \"Price $90, special offer $70, shipping $5\" we would pick \"$70\" since it is closer to the half of the highest match (\"$90\")."
|
541 | 541 | ]
|
|
706 | 706 | "metadata": {},
|
707 | 707 | "source": [
|
708 | 708 | "### Permutation Decoder\n",
|
709 |
| - "Now let us try to decode messages encrypted by a general monoalphabetic substitution cipher. The letters in the alphabet can be replaced by any permutation of letters. For example if the alpahbet consisted of `{A B C}` then it can be replaced by `{A C B}`, `{B A C}`, `{B C A}`, `{C A B}`, `{C B A}` or even `{A B C}` itself. Suppose we choose the permutation `{C B A}`, then the plain text `\"CAB BA AAC\"` would become `\"ACB BC CCA\"`. We can see that Caesar cipher is also a form of permutation cipher where the permutation is a cyclic permutation. Unlike the Caesar cipher, it is infeasible to try all possible permutations. The number of possible permutations in Latin alphabet is `26!` which is of the order $10^{26}$. We use graph search algorithms to search for a 'good' permutation." |
| 709 | + "Now let us try to decode messages encrypted by a general mono-alphabetic substitution cipher. The letters in the alphabet can be replaced by any permutation of letters. For example, if the alphabet consisted of `{A B C}` then it can be replaced by `{A C B}`, `{B A C}`, `{B C A}`, `{C A B}`, `{C B A}` or even `{A B C}` itself. Suppose we choose the permutation `{C B A}`, then the plain text `\"CAB BA AAC\"` would become `\"ACB BC CCA\"`. We can see that Caesar cipher is also a form of permutation cipher where the permutation is a cyclic permutation. Unlike the Caesar cipher, it is infeasible to try all possible permutations. The number of possible permutations in Latin alphabet is `26!` which is of the order $10^{26}$. We use graph search algorithms to search for a 'good' permutation." |
710 | 710 | ]
|
711 | 711 | },
|
712 | 712 | {
|
|
722 | 722 | "cell_type": "markdown",
|
723 | 723 | "metadata": {},
|
724 | 724 | "source": [
|
725 |
| - "Each state/node in the graph is represented as a letter-to-letter map. If there no mapping for a letter it means the letter is unchanged in the permutation. These maps are stored as dictionaries. Each dictionary is a 'potential' permutation. We use the word 'potential' because every dictionary doesn't necessarily represent a valid permutation since a permutation cannot have repeating elements. For example the dictionary `{'A': 'B', 'C': 'X'}` is invalid because `'A'` is replaced by `'B'`, but so is `'B'` because the dictionary doesn't have a mapping for `'B'`. Two dictionaries can also represent the same permutation e.g. `{'A': 'C', 'C': 'A'}` and `{'A': 'C', 'B': 'B', 'C': 'A'}` represent the same permutation where `'A'` and `'C'` are interchanged and all other letters remain unaltered. To ensure we get a valid permutation a goal state must map all letters in the alphabet. We also prevent repetions in the permutation by allowing only those actions which go to new state/node in which the newly added letter to the dictionary maps to previously unmapped letter. These two rules togeter ensure that the dictionary of a goal state will represent a valid permutation.\n", |
| 725 | + "Each state/node in the graph is represented as a letter-to-letter map. If there is no mapping for a letter, it means the letter is unchanged in the permutation. These maps are stored as dictionaries. Each dictionary is a 'potential' permutation. We use the word 'potential' because every dictionary doesn't necessarily represent a valid permutation since a permutation cannot have repeating elements. For example the dictionary `{'A': 'B', 'C': 'X'}` is invalid because `'A'` is replaced by `'B'`, but so is `'B'` because the dictionary doesn't have a mapping for `'B'`. Two dictionaries can also represent the same permutation e.g. `{'A': 'C', 'C': 'A'}` and `{'A': 'C', 'B': 'B', 'C': 'A'}` represent the same permutation where `'A'` and `'C'` are interchanged and all other letters remain unaltered. To ensure that we get a valid permutation, a goal state must map all letters in the alphabet. We also prevent repetitions in the permutation by allowing only those actions which go to a new state/node in which the newly added letter to the dictionary maps to previously unmapped letter. These two rules together ensure that the dictionary of a goal state will represent a valid permutation.\n", |
726 | 726 | "The score of a state is determined using word scores, unigram scores, and bigram scores. Experiment with different weightages for word, unigram and bigram scores and see how they affect the decoding."
|
727 | 727 | ]
|
728 | 728 | },
|
|
752 | 752 | "cell_type": "markdown",
|
753 | 753 | "metadata": {},
|
754 | 754 | "source": [
|
755 |
| - "As evident from the above example, permutation decoding using best first search is sensitive to initial text. This is because not only the final dictionary, with substitutions for all letters, must have good score but so must the intermediate dictionaries. You could think of it as performing a local search by finding substitutons for each letter one by one. We could get very different results by changing even a single letter because that letter could be a deciding factor for selecting substitution in early stages which snowballs and affects the later stages. To make the search better we can use different definition of score in different stages and optimize on which letter to substitute first." |
| 755 | + "As evident from the above example, permutation decoding using best first search is sensitive to initial text. This is because not only the final dictionary, with substitutions for all letters, must have good score but so must the intermediate dictionaries. You could think of it as performing a local search by finding substitutions for each letter one by one. We could get very different results by changing even a single letter because that letter could be a deciding factor for selecting substitution in early stages which snowballs and affects the later stages. To make the search better we can use different definitions of score in different stages and optimize on which letter to substitute first." |
756 | 756 | ]
|
757 | 757 | }
|
758 | 758 | ],
|
|
0 commit comments