Correct Regular Expressions Behavior Related to Annex B #58320

Open
wants to merge 7 commits into
base: main

graphemecluster
Contributor

@graphemecluster graphemecluster commented Apr 25, 2024

Follow-up of #58295
This fixes issues like the '}' expected cases given at #58275 (comment).

@typescript-bot typescript-bot added the For Uncommitted Bug PR for untriaged, rejected, closed or missing bug label Apr 25, 2024
@typescript-bot
Collaborator

This PR doesn't have any linked issues. Please open an issue that references this PR. From there we can discuss and prioritise.

@@ -3390,7 +3400,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
error(Diagnostics.Unicode_property_value_expressions_are_only_available_when_the_Unicode_u_flag_or_the_Unicode_Sets_v_flag_is_set, start, pos - start);
}
}
- else if (unicodeMode) {
+ else if (!annexB) {
Contributor Author

In Annex B, braces after \p should not actually be parsed at all, but parsing them does provide helpful errors like #58275 (comment)

Contributor Author

Not really outdated

Member

The outdated label is just because the comment is on an old revision of the PR and GitHub can't figure out where the comment goes after.

Contributor Author

Agreed. I could’ve removed the review and re-commented, but ultimately chose not to.

@graphemecluster graphemecluster changed the title from "Correct RegExp Behavior Related to Annex B" to "Correct Regular Expressions Behavior Related to Annex B" on Apr 26, 2024
Comment on lines 80 to 88
/\q\u\i\c\k\_\f\o\x\-\j\u\m\p\s/,

!!! error TS1125: Hexadecimal digit expected.
~~
!!! error TS1510: '\k' must be followed by a capturing group name enclosed in angle brackets.

!!! error TS1125: Hexadecimal digit expected.

!!! error TS1125: Hexadecimal digit expected.
Contributor Author

These are all valid in Annex B. Even /\u{1/ is. In Annex B essentially all weird things are valid, you know. Currently (after #58295) /[\1]/ is valid but /[\8]/ isn’t. I don’t think it’s ideal.
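
For illustration, here is a minimal sketch (not part of the PR) of the runtime behavior being described. It assumes an engine that implements Annex B, such as V8 in Node.js, and it builds the patterns with the `RegExp` constructor so that no regex literal is syntax-checked at compile time. The helper name `compiles` is hypothetical.

```typescript
// Hypothetical helper: reports whether the engine accepts a pattern source.
function compiles(source: string, flags = ""): boolean {
    try {
        new RegExp(source, flags);
        return true;
    }
    catch {
        return false;
    }
}

// Without the u/v flags, the Annex B grammar falls back to treating these literally.
const looseK = compiles("\\k");   // \k with no named groups in the pattern
const looseU = compiles("\\u{1"); // unterminated \u{...}
const looseX = compiles("\\x");   // \x without two hex digits

// With the u flag the strict grammar applies and the same sources are rejected.
const strictK = compiles("\\k", "u");
const strictU = compiles("\\u{1", "u");
const strictX = compiles("\\x", "u");

console.log({ looseK, looseU, looseX, strictK, strictU, strictX });
```

The question in this thread is which of the loose cases the scanner should still report, even though the engine accepts them.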

@jakebailey
Member

This will probably need to be rebased.

@rbuckton
Member

@jakebailey is correct. #58339 also made some changes to this code, so a rebase or merge from main is necessary to resolve conflicts.

@graphemecluster
Contributor Author

I know, I just haven’t got the time to do so. It would be faster if you could do that for me (I will be back soon).

error(Diagnostics.Numbers_out_of_order_in_quantifier, digitsStart, pos - digitsStart);
}
}
else if (!min) {
- if (unicodeMode) {
+ if (!annexB) {
Member

@rbuckton rbuckton May 16, 2024

Though it may be redundant, I think it might be better to still indicate unicodeMode here so that someone editing this code in the future doesn't mistakenly think this only applies to non-Annex B code. It may be better to use unicodeMode || !annexB and remove the if (unicodeMode) { annexB = false; } at the top of scanRegularExpressionWorker.

The same would go for other uses of annexB as well.

Contributor Author

Are you sure you are really fine with a dozen occurrences of unicodeMode || !annexB?

Member

No, but

const anyUnicodeModeOrNonAnnexB = unicodeMode || !annexB;

would work.
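
As a rough sketch of the idea (not the scanner's actual code), the combined flag can be derived once and reused wherever strict checking is needed. The function name `usesStrictRegExpSyntax` and its parameters are hypothetical; only the expression `unicodeMode || !annexB` comes from the discussion above.

```typescript
// Hypothetical helper: strict syntax applies in u/v mode, or whenever the
// looser Annex B interpretation is not in effect.
function usesStrictRegExpSyntax(flags: string, annexB: boolean): boolean {
    const unicodeMode = flags.includes("u") || flags.includes("v");
    return unicodeMode || !annexB;
}

console.log(usesStrictRegExpSyntax("", true));   // Annex B, no unicode flag
console.log(usesStrictRegExpSyntax("u", true));  // unicode mode overrides Annex B
console.log(usesStrictRegExpSyntax("", false));  // Annex B disabled
```

Naming the combined condition keeps `unicodeMode` visible at each use site without repeating the full expression.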

@@ -2801,7 +2811,10 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
scanGroupName(/*isReference*/ true);
scanExpectedChar(CharacterCodes.greaterThan);
}
- else if (unicodeMode) {
+ else {
+ // This is actually allowed in Annex B if there are no named capturing groups in the regex,
Member

Could we keep track of whether we encountered a (?< during reScanSlashToken and add an entry to the RegularExpressionFlags enum? The spec passes NamedCaptureGroups as a production parameter just as it does for UnicodeMode and UnicodeSetsMode, but only ever passes it as ~NamedCaptureGroups in Annex B.
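
A minimal sketch of what such a pre-scan might look like, under simplifying assumptions: the function name `containsNamedCapturingGroup` is hypothetical, it works on a plain pattern string rather than the scanner's state, and it ignores nested character classes from `v` mode. It looks for `(?<` that is not the start of a lookbehind (`(?<=` or `(?<!`).

```typescript
// Hypothetical pre-scan: does the pattern source contain a named capturing group?
function containsNamedCapturingGroup(pattern: string): boolean {
    let inClass = false;
    for (let i = 0; i < pattern.length; i++) {
        const ch = pattern[i];
        if (ch === "\\") {
            i++; // skip the escaped character
            continue;
        }
        if (inClass) {
            if (ch === "]") inClass = false;
            continue;
        }
        if (ch === "[") {
            inClass = true;
            continue;
        }
        if (
            ch === "("
            && pattern[i + 1] === "?"
            && pattern[i + 2] === "<"
            && pattern[i + 3] !== "="
            && pattern[i + 3] !== "!"
        ) {
            return true;
        }
    }
    return false;
}

console.log(containsNamedCapturingGroup("(?<year>\\d{4})"));
console.log(containsNamedCapturingGroup("(?<=a)b"));
```

The result could then feed a flag so that `\k` is only reported when the pattern actually has named groups.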

Contributor Author

Oh, didn’t think of this clever but dirty solution.
I can implement this, but the question remains how much of Annex B we are going to respect. If we allow \k, should we also allow \u and \x (and \8 and \9 inside character classes)? (#58320 (comment))

Member

I think it's better to error on \u, \x, \8, \9 because in those cases you are more likely to have actually meant something different. Writing \k when there are no named capture groups is far less ambiguous.

Contributor Author

If so, I would personally lean towards also linting \1 through \7 inside character classes to prevent this kind of mistake. Or perhaps the opposite, only outside character classes. I don’t know. I understand that you don’t want the syntax checking to be too breaking, but someone must find this useful.

Member

Every decimal escape in a character class I've ever seen has been a bug, so it makes sense to error for that case.
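
To show why these escapes tend to be bugs, here is a small sketch (not from the PR) of what an Annex B engine such as V8 does with them. The patterns are built with the `RegExp` constructor to avoid compile-time checks on regex literals, and the variable names are illustrative only.

```typescript
// Inside a character class, \1 is a legacy octal escape for U+0001, not the digit 1.
const octalInClass = new RegExp("[\\1]");
// \8 is not an octal digit sequence, so it falls back to an identity escape for "8".
const eightInClass = new RegExp("[\\8]");

const matchesControlChar = octalInClass.test("\u0001");
const matchesDigitOne = octalInClass.test("1");
const matchesDigitEight = eightInClass.test("8");

console.log({ matchesControlChar, matchesDigitOne, matchesDigitEight });
```

Neither interpretation is likely to be what someone writing `[\1]` or `[\8]` intended, which supports reporting them.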

PR Backlog automation moved this from Not started to Waiting on author May 16, 2024
@rbuckton
Member

@graphemecluster do you have the time to make changes or respond to comments, or would you like me to take over the final changes to get this merged?

Member

@rbuckton rbuckton left a comment

A few more minor comments to address and this will be ready to merge.

@@ -1556,9 +1556,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
tokenFlags |= TokenFlags.ContainsInvalidEscape;
if (isRegularExpression || shouldEmitInvalidEscapeError) {
const code = parseInt(text.substring(start + 1, pos), 8);
if (isRegularExpression !== "annex-b") {
Member

Since this was the only place we check for "annex-b", we can make isRegularExpression a boolean again.

Contributor Author

I thought the same, but there is an isRegularExpression === true at the bottom

// Annex B treats any unicode mode as the strict syntax.
annexB = false;
}
// Annex B treats any unicode mode as the strict syntax.
Member

The comment explained why we set annexB to false, but doesn't apply to this whole declaration. I would change it to be more explanatory, i.e.:

Suggested change
- // Annex B treats any unicode mode as the strict syntax.
+ // Regular expressions are checked more strictly when either in 'u' or 'v' mode, or
+ // when not using the looser interpretation of the syntax from ECMA-262 Annex B.

@@ -2887,7 +2903,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
return "\\";
}
pos--;
- return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ annexB ? "annex-b" : true);
+ return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ anyUnicodeModeOrNonAnnexB || "annex-b");
Member

Since "annex-b" is no longer checked in scanEscapeSequence, this can just be true:

Suggested change
- return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ anyUnicodeModeOrNonAnnexB || "annex-b");
+ return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ true);

Comment on lines +2473 to +2476
&& charCodeUnchecked(p + 1) === CharacterCodes.question
&& charCodeUnchecked(p + 2) === CharacterCodes.lessThan
&& charCodeUnchecked(p + 3) !== CharacterCodes.equals
&& charCodeUnchecked(p + 3) !== CharacterCodes.exclamation
Member

I missed this in my last review. These need to be range checked:

Suggested change
- && charCodeUnchecked(p + 1) === CharacterCodes.question
- && charCodeUnchecked(p + 2) === CharacterCodes.lessThan
- && charCodeUnchecked(p + 3) !== CharacterCodes.equals
- && charCodeUnchecked(p + 3) !== CharacterCodes.exclamation
+ && charCodeChecked(p + 1) === CharacterCodes.question
+ && charCodeChecked(p + 2) === CharacterCodes.lessThan
+ && charCodeChecked(p + 3) !== CharacterCodes.equals
+ && charCodeChecked(p + 3) !== CharacterCodes.exclamation
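
A simplified sketch of why the range check matters when looking ahead by a fixed offset. The two readers below are hypothetical stand-ins that only mirror the names used in the suggestion; the `EOF` sentinel, the `makeReaders` factory, and the `end` bound are assumptions for illustration, not the scanner's real definitions.

```typescript
const EOF = -1; // hypothetical end-of-input sentinel

// Hypothetical stand-ins: one reader trusts the position, the other respects `end`.
function makeReaders(text: string, end: number) {
    const charCodeUnchecked = (pos: number): number => text.charCodeAt(pos);
    const charCodeChecked = (pos: number): number =>
        pos >= 0 && pos < end ? text.charCodeAt(pos) : EOF;
    return { charCodeUnchecked, charCodeChecked };
}

// Pretend the scan range covers only the first two characters of the text.
const { charCodeUnchecked, charCodeChecked } = makeReaders("(?<name>x)", 2);
const lessThan = "<".charCodeAt(0);

// The unchecked reader sees a character beyond the scan range; the checked one does not.
const uncheckedSeesPastEnd = charCodeUnchecked(2) === lessThan;
const checkedSeesPastEnd = charCodeChecked(2) === lessThan;

console.log({ uncheckedSeesPastEnd, checkedSeesPastEnd });
```

With lookahead offsets like `p + 3`, the checked form avoids acting on characters that lie outside the range being scanned.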

Member

@rbuckton rbuckton left a comment

One more I missed.

isPreviousTermQuantifiable = true;
break;
}
}
- if (max && Number.parseInt(min) > Number.parseInt(max)) {
+ else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || text.charCodeAt(pos) === CharacterCodes.closeBrace)) {
Member

Suggested change
- else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || text.charCodeAt(pos) === CharacterCodes.closeBrace)) {
+ else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || charCodeChecked(pos) === CharacterCodes.closeBrace)) {
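
For context on the closing-brace condition, a small sketch (not from the PR) of how an Annex B engine such as V8 treats out-of-order quantifier bounds. The helper name `tryCompile` is hypothetical, and the patterns go through the `RegExp` constructor so that no regex literal is checked at compile time.

```typescript
// Hypothetical helper: classifies what the engine does with a pattern source.
function tryCompile(source: string, flags = ""): string {
    try {
        new RegExp(source, flags);
        return "ok";
    }
    catch (e) {
        return e instanceof SyntaxError ? "SyntaxError" : "other error";
    }
}

// A complete quantifier with min > max is an error in every mode.
const closed = tryCompile("a{2,1}");
// Without the closing brace, Annex B treats "{2,1" as literal text.
const unclosed = tryCompile("a{2,1");
// In unicode mode the strict grammar rejects the lone "{".
const unclosedUnicode = tryCompile("a{2,1", "u");

console.log({ closed, unclosed, unclosedUnicode });
```

This is why the out-of-order diagnostic only makes sense in Annex B mode when the quantifier is actually terminated by `}`.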

Labels
For Uncommitted Bug PR for untriaged, rejected, closed or missing bug
Projects
PR Backlog
  
Waiting on author
Development

Successfully merging this pull request may close these issues.

None yet

4 participants