Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix [RegExp|String].prototype methods and Regex matcher #6615

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

MadProbe
Copy link
Contributor

@MadProbe MadProbe commented Feb 20, 2021

This PR will fix spec deviations in:

  • RegExp.prototype[@@search]()
  • RegExp.prototype[@@split]()
  • String.prototype.split()
  • Some other methods yet to be added
  • Matcher unicode support
  • Clean out of the ES5 method versions

Also:

  • Unflag the ES6 RegExp methods and properties
  • Tests

@rhuanjl
Copy link
Collaborator

rhuanjl commented Feb 20, 2021

I'm not keen on killing all the fast paths - I suggested you'd need to check them but hopefully we can update conditions so they're only used when they won't cause a problem rather than just deleting them.

@@ -2412,7 +2429,7 @@ namespace Js
{
if (string->GetLength() > (0 > index || (uint64_t)index + (uint64_t)1 > UINT32_MAX ? UINT32_MAX : (uint32_t)(index + 1)) &&
NumberUtilities::IsSurrogateLowerPart(string->GetString()[index]) &&
NumberUtilities::IsSurrogateUpperPart(string->GetString()[index + 1]))
NumberUtilities::IsSurrogateUpperPart(string->GetString()[index + 1]) && isUnicode)
Copy link
Collaborator

@rhuanjl rhuanjl Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ideally should do the isUnicode check first - it's much quicker, than the IsSurrogateLowerPart and IsSuurrogateUpperPart checks; if we do it first can skip the slower bits when it's false.

Comment on lines +45 to +46
Field(codepoint_t*) m_codePointString; // Code points of the string, may be nullptr if not initialized yet by GetCodePoints call
Field(charcount_t) m_codePointsLength; // Length of the codepoints, may be k_InvalidCharCount if not initialized yet by GetCodePoints call
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will add two pointers to every single JS string - which isn't ideal - could we instead subclass JavascriptString with a new type like JavascriptStringWithCodePoints or something?

(Note I think this change is causing the compile failures as there is a static assert based on the size of a different subclass of JavascriptString)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I will need to create a new instance of JavascriptStringWithCodePoints on each SimpleMatch call if pattern shows us that it's unicode one.
Honestly I cannot imagine how i will cache the codepoint strings.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right this is not an ideal solution - ugh - I didn't think this through did I. I think cache'ing is a no go.

Either a) we massively increase the size of all our strings OR b) we come up with a way to convert the type whenever we need to cache - which would not be simple as there could be many existing pointers to the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants