
Add decoder (and disassembler) cache #7

Open
sulir opened this issue Apr 19, 2012 · 5 comments
Labels: feature, optimization (Optimization or refactoring)
Milestone: 1.6
@sulir (Collaborator) commented Apr 19, 2012

The cache should work like this: every time an instruction is decoded, it is placed into a cache of some capacity. Next time, if the supplied bytes are exactly the same, the decoded instruction is returned directly from the cache, without performing the decoding process.

There is one problem: the size of an instruction is unknown before decoding. So the cache should be a combination of a tree and hash maps.

Something similar could also be done for the disassembler.

If insertion and lookup were fast enough, this would significantly improve the emulator's speed.

@ghost assigned sulir Apr 19, 2012
@vbmacher (Collaborator)
This reminds me of a prefix tree (trie) used in text searching. I imagine the decoder could work like this:

  • Each trie node has:

    • a function explore(input)
    • a property annotation = the decoded instruction (if any)
  • On an input, the decoding algorithm matches the longest prefix present in the trie. If it is matched and the node carries an annotation, we have the instruction. If it has no annotation, we are expecting more input (cannot decide yet).

  • If the input is longer than the longest prefix in the trie (we are at its end), we call explore with the unmatched input. This function expands the trie further from the node where it is called. If that is not possible, we throw a runtime exception (unknown instruction).

But I think we don't need to tie ourselves to the cache approach (I mean dynamically expanding the tree). Instead we can build the whole tree in advance from the spec file and use it for searching during decoding. Decoding would then mean just finding the longest prefix of the input and emitting the instruction.
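A minimal Java sketch of the trie idea above, just to make the shape concrete. TrieNode, annotation, explore and DecodedInstruction are hypothetical names (not the actual emuStudio decoder API), and explore here works byte by byte rather than on the whole unmatched input:

```java
import java.util.HashMap;
import java.util.Map;

class DecodedInstruction { /* opcode, operands, ... */ }

class TrieNode {
    final Map<Byte, TrieNode> children = new HashMap<>();
    DecodedInstruction annotation; // set only when a complete instruction ends at this node

    // Walks the trie along the input, remembering the last node that carried
    // an annotation (i.e. the longest matched prefix that is a complete instruction).
    DecodedInstruction decode(byte[] input) {
        TrieNode node = this;
        DecodedInstruction lastMatch = null;
        for (byte b : input) {
            TrieNode next = node.children.get(b);
            if (next == null) {
                // End of known prefixes: try to expand the trie from this node.
                next = node.explore(b);
                if (next == null) {
                    break;
                }
            }
            node = next;
            if (node.annotation != null) {
                lastMatch = node.annotation;
            }
        }
        if (lastMatch == null) {
            throw new RuntimeException("Unknown instruction");
        }
        return lastMatch;
    }

    // Hypothetical hook that grows the trie from the decoder specification for
    // input bytes not yet covered; returns null if no rule matches.
    TrieNode explore(byte b) {
        return null;
    }
}
```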

@vbmacher (Collaborator)

Actually, maybe even better is to use a DFA (deterministic finite automaton) as used in parsing, or, instead of generating Java code, generate a parser grammar for some parser generator (JavaCC or ANTLR) which can do the job.

@vbmacher (Collaborator) commented Jul 7, 2021

Hm, all the approaches I described above seem too complicated, both to implement and in complexity. I think the advantage of caching should be much faster decoding, which means we need to know the input size in advance. I'm coming back to the idea originally suggested by @sulir.

Can we assume an instruction has a size limit? I mean, can we always guarantee that the byte sequence of an instruction is finite? Let's assume yes.

Then we can have several caches keyed by expected instruction size: Map[Integer, Map[Array[Byte], DecodedInstruction]].
By the assumption above, we know all possible instruction sizes. The number of instruction sizes is also finite (simply because the specification file doesn't allow an infinite number of rules). With these assumptions, the cache structure can be prepared in advance. If we want to save memory, we can also limit the cache to only certain instruction sizes.

The algorithm can work as follows:

  1. Read input up to the longest instruction size and check whether it's in the cache.
  2. If not, cut the suffix down to the next-longest instruction size and check whether it's in the cache.
  3. Repeat step 2 until we find the instruction.
  4. If we didn't find the instruction, fall back to the decoding process.

When considering algorithmic complexity, the cache lookup is O(k), where k is the number of instruction sizes (assuming that looking up an Array[Byte] in a hash map is O(1)).

Decoding itself has complexity O(n * h * m), where n is the number of rules, h is the number of masks in a rule and m is the number of patterns in a mask.

So the final worst-case complexity using the cache solution can be up to O(k) + O(n * h * m).
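A rough Java sketch of the size-keyed lookup described above (steps 1-4), assuming the set of instruction sizes is known in advance; the class and method names are made up for illustration. One practical note: a raw byte[] can't be used directly as a HashMap key in Java (arrays hash by identity), so the sketch wraps the bytes in a ByteBuffer, which compares by content:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

class SizeKeyedDecoderCache {
    static class DecodedInstruction { /* opcode, operands, ... */ }

    // instruction size -> (instruction bytes -> decoded instruction)
    private final Map<Integer, Map<ByteBuffer, DecodedInstruction>> cache = new HashMap<>();
    private final int[] sizesDescending; // all instruction sizes, known from the spec file

    SizeKeyedDecoderCache(int[] sizesDescending) {
        this.sizesDescending = sizesDescending;
        for (int size : sizesDescending) {
            cache.put(size, new HashMap<>());
        }
    }

    // Steps 1-3: try the longest size first, then cut the suffix and retry.
    DecodedInstruction lookup(byte[] input) {
        for (int size : sizesDescending) {
            if (size > input.length) {
                continue;
            }
            // ByteBuffer hashes/compares by content, unlike a raw byte[].
            ByteBuffer key = ByteBuffer.wrap(input, 0, size);
            DecodedInstruction hit = cache.get(size).get(key);
            if (hit != null) {
                return hit;
            }
        }
        return null; // step 4: cache miss -> run the full decoding process
    }

    void store(byte[] instructionBytes, DecodedInstruction decoded) {
        // Clone the bytes so the cached key cannot be mutated by the caller.
        ByteBuffer key = ByteBuffer.wrap(instructionBytes.clone());
        cache.computeIfAbsent(instructionBytes.length, s -> new HashMap<>()).put(key, decoded);
    }
}
```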

@vbmacher (Collaborator) commented Jul 7, 2021

I think we're getting into a problem the JVM HotSpot has to deal with. HotSpot interprets bytecode until the hit count of an instruction (or a block) reaches a threshold. When the threshold is reached, the block is compiled to native code and cached. From then on, every time that address is hit, the cached version is used.

Taking inspiration from this, we can improve our decoding based on the memory location of an instruction, and then we don't need to know instruction sizes at all. The only thing we need to take care of is monitoring whether any memory location backing a cached instruction has changed.

The cache can be of the form Map[Integer, DecodedInstruction]. The algorithm can work as follows:

  1. We know the memory location from which we want to read the input, so we check whether that location is present in the cache.
  2. If yes, return the cached instruction.
  3. If not, decode the instruction and insert it into the cache.

The requirement is that we implement a memory listener and invalidate cache entries whose locations have changed. For the map implementation we can use an LRU-style eviction policy (e.g. https://github.com/ben-manes/caffeine).
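A small sketch of the address-keyed cache described above, using Caffeine for the bounded cache (its size-based eviction approximates LRU). The onMemoryWrite hook and decodeAt() are hypothetical placeholders for the emulator's memory-listener and decoder APIs:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class AddressKeyedDecoderCache {
    static class DecodedInstruction { /* opcode, operands, ... */ }

    private final Cache<Integer, DecodedInstruction> cache = Caffeine.newBuilder()
            .maximumSize(10_000) // bounded size; Caffeine evicts the least valuable entries
            .build();

    // Steps 1-3: check the cache by memory location, decode and store on a miss.
    DecodedInstruction decode(int address) {
        DecodedInstruction cached = cache.getIfPresent(address);
        if (cached != null) {
            return cached;
        }
        DecodedInstruction decoded = decodeAt(address); // the existing (slow) decoding process
        cache.put(address, decoded);
        return decoded;
    }

    // Memory-listener requirement: any write invalidates the cached instruction
    // at that location (a real listener would also invalidate neighbouring
    // addresses covered by multi-byte instructions).
    void onMemoryWrite(int address) {
        cache.invalidate(address);
    }

    private DecodedInstruction decodeAt(int address) {
        throw new UnsupportedOperationException("placeholder for the real decoder");
    }
}
```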

@vbmacher (Collaborator) commented Jul 7, 2021

The algorithmic complexity of the above is no worse than the current decoding algorithm in the worst case. However, in the best case it is O(1).

@vbmacher added the optimization (Optimization or refactoring) label and removed the low-priority label Aug 6, 2021
@vbmacher added this to the 1.6 milestone Apr 23, 2022