Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use more efficient regex engine wherever possible #1182

Open
alex-s168 opened this issue Jan 27, 2025 · 17 comments
Open

use more efficient regex engine wherever possible #1182

alex-s168 opened this issue Jan 27, 2025 · 17 comments
Assignees

Comments

@alex-s168
Copy link
Contributor

Reasons:

  1. in the future (when we have a C frontend and my compiler backend is integrated), it would be great to not have to depend on any external library (like the C stdlib) for various reasons.
  2. the builtin regex implementation is slow-ish
  3. binary size

Proposal:
Use tpre to compile the regex to a byte array at compile time, and at runtime you only need 10 lines of code to execute the regex match.
When tpre can not compile a regex pattern (which only happens for very rare regex patterns), it should falls back to the stdlib's regex.h

tpre is at least twice as fast as un-jitted PCRE2

@andrew-johnson-4
Copy link
Owner

Is tpre ANSI C? That would be awesome.

@andrew-johnson-4 andrew-johnson-4 self-assigned this Jan 27, 2025
@andrew-johnson-4
Copy link
Owner

I have a question about tpre? Would it be hard to use this against non null-terminated strings. It is more efficient sometimes to deal with substrings rather than deep-copy the contents etc. So you end up with string slivers that can't be null terminated because they exist inside of another string that is still being used. Would it be possible to adapt a regex engine to work with this?

@alex-s168
Copy link
Contributor Author

I have a question about tpre? Would it be hard to use this against non null-terminated strings. It is more efficient sometimes to deal with substrings rather than deep-copy the contents etc. So you end up with string slivers that can't be null terminated because they exist inside of another string that is still being used. Would it be possible to adapt a regex engine to work with this?

are you tlaking about the strings that you want to match? in that case it's easy to change it

@andrew-johnson-4
Copy link
Owner

Yeah, the strings to match might be encoded as char*, length instead of just char*

@alex-s168
Copy link
Contributor Author

I just added that functionality and made it ANSI C
there is still a tiny bug but I will fix it today

@alex-s168
Copy link
Contributor Author

also you are not reachable on discord

When can / should I start integrating my compiler backend?

@andrew-johnson-4
Copy link
Owner

Oh, sorry about this disconnect. I think I need to enable a notification option or something, I just didn't notice that there was a new message today. I'll look into options so I don't miss messages.

For the compiler backend:

I'll add a priority ticket to quote this TODO

@alex-s168
Copy link
Contributor Author

how does foreach work?

@andrew-johnson-4
Copy link
Owner

I am going to rewrite the C backend in LSTS after finishing cleaning up Core. This new backend should only use stable API interfaces, so that will help keep me honest about what interfaces need to be stabilized. Currently there is information going back and forth between Core and the C Backend. That needs to be kept separate or reentrant.

The goal with this rewrite is

  • make language features total and sound (if you can type it, and if it passes type check, then it will compile)
  • catch all errors during typecheck
  • general tidy, code cleanliness, make sure each function has a clear documented purpose etc.

@andrew-johnson-4
Copy link
Owner

andrew-johnson-4 commented Jan 28, 2025

macro ( ('for-each (item 'in iter) loop) ) (
   (let (uuid iter-term) iter)
   (while (non-zero (uuid iter-term)) (match (uuid iter-term) (
      ()
      ( (LCons( item (uuid iter-rst) )) (
         loop
         (set (uuid iter-term) (uuid iter-rst))
      ))
   )))
);

I need to rewrite for-each so that it doesn't depend on List and instead uses abstract iterators.

Opinions or ideas on how to write for-each better are welcome. I haven't decided yet.

@alex-s168
Copy link
Contributor Author

you could do something like:

type VectorIter<t> = VectorIter {
  vec: Vector<t>,
  idx: U64,
};

let .next(v: VectorIter<t>): Cons<VectorIter<t>, Maybe<t> = (

);

// with const iter, vec is not allowed to be modified while iterating
let .const-iter(v: Vector<t>): VectorIter<t> = (
  // I still don't know how lsts struct initializers work
  VectorIter { v, 0 }
);

but that is a bit stupid because of mutability
I have a better idea bt I need to think about it

@andrew-johnson-4
Copy link
Owner

My main opinion on iterators is that they shouldn't pin to an implementation. No explicit Vector. No explicit List etc. The core point of specialization is that it allows you to write algorithms using only the core parts explicitly and the rest can be inferred.

for x in y { z }

x is a left hand side. y is iterable. z is some expression.

The core design here that is still up in the air is "what does it mean for an object to be iterable."

Does it need to specifically be a vector or list. Why can't the loop accept both etc.

My first thought was head / tail functions, but that means two separate methods.

@andrew-johnson-4
Copy link
Owner

andrew-johnson-4 commented Jan 28, 2025

.next : iter -> Maybe<x>

any objections?

@andrew-johnson-4
Copy link
Owner

This interface is kinda cool because the .next function can be stateless. Random number generator? Sure you can iterate over that.

@alex-s168
Copy link
Contributor Author

I fixed all known bugs in the regex engine and added some tests

@alex-s168
Copy link
Contributor Author

wait do we need named capture groups?
I can add support for them if we need them

@andrew-johnson-4
Copy link
Owner

No, not right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants