Correct Regular Expressions Behavior Related to Annex B #58320

graphemecluster · 2024-04-25T21:43:45Z

Follow-up of #58295
This fixes issues like the '}' expected cases given at #58275 (comment).

typescript-bot · 2024-04-25T21:43:50Z

This PR doesn't have any linked issues. Please open an issue that references this PR. From there we can discuss and prioritise.

graphemecluster · 2024-04-25T21:56:35Z

src/compiler/scanner.ts

@@ -3390,7 +3400,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
                                error(Diagnostics.Unicode_property_value_expressions_are_only_available_when_the_Unicode_u_flag_or_the_Unicode_Sets_v_flag_is_set, start, pos - start);
                            }
                        }
-                        else if (unicodeMode) {
+                        else if (!annexB) {


In Annex B, braces after p should not actually be parsed at all, but parsing them does provide helpful errors like #58275 (comment).

Not really outdated

The outdated label is just because the comment is on an old revision of the PR and GitHub can't figure out where the comment goes after.

Agreed. I could’ve removed the review and re-commented, but ultimately chose not to.

graphemecluster · 2024-04-27T08:45:48Z

tests/baselines/reference/regularExpressionAnnexB.errors.txt

+      /\q\u\i\c\k\_\f\o\x\-\j\u\m\p\s/,
+
+!!! error TS1125: Hexadecimal digit expected.
+               ~~
+!!! error TS1510: '\k' must be followed by a capturing group name enclosed in angle brackets.
+
+!!! error TS1125: Hexadecimal digit expected.
+
+!!! error TS1125: Hexadecimal digit expected.


These are all valid in Annex B. Even /\u{1/ is. In Annex B essentially all weird things are valid, you know. Currently (after #58295) /[\1]/ is valid but /[\8]/ isn’t. I don’t think it’s ideal.
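The validity claims above can be checked against a JavaScript engine rather than the TypeScript scanner. The following is a minimal sketch, assuming a runtime that implements the Annex B regular expression grammar (as browsers and Node.js do). `compiles` is a hypothetical helper and not part of this PR.

```typescript
// Hypothetical helper: reports whether the engine accepts a pattern.
function compiles(source: string, flags = ""): boolean {
    try {
        new RegExp(source, flags);
        return true;
    }
    catch {
        return false;
    }
}

// The pattern from the baseline above, written as a string literal.
const weird = "\\q\\u\\i\\c\\k\\_\\f\\o\\x\\-\\j\\u\\m\\p\\s";

console.log(compiles(weird));       // no flags: the Annex B grammar applies
console.log(compiles(weird, "u"));  // 'u' flag: the strict grammar applies
console.log(compiles("\\u{1"));     // unterminated \u{...} without flags
console.log(compiles("[\\1]"), compiles("[\\8]")); // decimal escapes in a class
```

On such an engine the calls without flags should report `true` and the `u` variant `false`, which is the asymmetry with the scanner's errors that the comment points out.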

jakebailey · 2024-05-15T21:08:57Z

This will probably need to be rebased.

rbuckton · 2024-05-15T21:33:59Z

@jakebailey is correct. #58339 also made some changes to this code, so a rebase or merge from main is necessary to resolve conflicts.

graphemecluster · 2024-05-16T15:23:07Z

I know, I just haven’t got the time to do so. It would be faster if you could do that for me (I will be back soon).

rbuckton · 2024-05-16T21:03:04Z

src/compiler/scanner.ts

                                error(Diagnostics.Numbers_out_of_order_in_quantifier, digitsStart, pos - digitsStart);
                            }
                        }
                        else if (!min) {
-                            if (unicodeMode) {
+                            if (!annexB) {


Though it may be redundant, I think it might be better to still indicate unicodeMode here so that someone editing this code in the future doesn't mistakenly think this only applies to non-Annex B code. It may be better to use unicodeMode || !annexB and remove the if (unicodeMode) { annexB = false; } at the top of scanRegularExpressionWorker.

The same would go for other uses of annexB as well.

Are you sure you are really fine with a dozen occurrences of unicodeMode || !annexB?

No, but

const anyUnicodeModeOrNonAnnexB = unicodeMode || !annexB;

would work.
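To make the intent of the combined flag concrete, here is a small standalone sketch. The `RegExpModes` interface and `isStrict` function are hypothetical names for illustration. In the PR the value is a local constant inside the scanner.

```typescript
// Hypothetical standalone version of the flag derivation discussed above.
interface RegExpModes {
    unicodeMode: boolean; // true when the 'u' or 'v' flag is present
    annexB: boolean;      // true when the looser Annex B grammar is requested
}

function isStrict({ unicodeMode, annexB }: RegExpModes): boolean {
    // Mirrors `unicodeMode || !annexB` from the review comment: strict checks
    // apply in any Unicode mode, and also whenever Annex B is not in effect.
    return unicodeMode || !annexB;
}

// Print the full truth table.
for (const unicodeMode of [false, true]) {
    for (const annexB of [false, true]) {
        console.log({ unicodeMode, annexB, strict: isStrict({ unicodeMode, annexB }) });
    }
}
```

Only the combination of no Unicode flag with Annex B enabled yields the lenient behavior, which is why a single named constant can replace the repeated `unicodeMode || !annexB` expression.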

rbuckton · 2024-05-16T21:10:15Z

src/compiler/scanner.ts

@@ -2801,7 +2811,10 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
                        scanGroupName(/*isReference*/ true);
                        scanExpectedChar(CharacterCodes.greaterThan);
                    }
-                    else if (unicodeMode) {
+                    else {
+                        // This is actually allowed in Annex B if there are no named capturing groups in the regex,


Could we keep track of whether we encountered a (?< during reScanSlashToken and add an entry to the RegularExpressionFlags enum? The spec passes NamedCaptureGroups as a production parameter just as it does for UnicodeMode and UnicodeSetsMode, but only ever passes it as ~NamedCaptureGroups in Annex B.

Oh, didn’t think of this clever but dirty solution.
I can implement this, but the question remains how much of Annex B we are going to respect. If we allow \k, should we also allow \u and \x (and \8 and \9 inside character classes)? (#58320 (comment))

I think it's better to error on \u, \x, \8, \9 because in those cases you are more likely to have actually meant something different. Writing \k when there are no named capture groups is far less ambiguous.

In that case, I would personally lean towards also linting \1 through \7 inside character classes to prevent this kind of mistake. Or perhaps the opposite: only outside character classes. I don’t know. I understand that you don’t want the syntax checking to be too breaking, but someone is bound to find this useful.

Every decimal escape in a character class I've ever seen has been a bug, so it makes sense to error for that case.
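The positions taken in this thread can be summarized as a small decision function. This is only a sketch of the policy as discussed, not the scanner's actual implementation. `EscapeContext` and `shouldReport` are hypothetical names, and the sketch covers non-Unicode patterns only, since Unicode mode rejects all of these escapes anyway.

```typescript
// Hypothetical context for an unusual escape in a non-Unicode regex.
interface EscapeContext {
    inCharacterClass: boolean;
    hasNamedGroups: boolean; // whether a `(?<name>` group appears anywhere in the pattern
}

// `ch` is the single character after the backslash. For `u`, `x`, and `k` this
// sketch assumes the escape is malformed (e.g. `\u` without hex digits, or
// `\k` without a following `<name>`).
function shouldReport(ch: string, ctx: EscapeContext): boolean {
    switch (ch) {
        case "u":
        case "x":
            // Likely a typo for a real escape, so report even though Annex B allows it.
            return true;
        case "k":
            // Unambiguous when the pattern has no named groups.
            return ctx.hasNamedGroups;
        default:
            // Decimal escapes inside a character class are treated as probable bugs.
            return ch >= "1" && ch <= "9" && ctx.inCharacterClass;
    }
}

console.log(shouldReport("k", { inCharacterClass: false, hasNamedGroups: false }));
console.log(shouldReport("8", { inCharacterClass: true, hasNamedGroups: false }));
```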

rbuckton · 2024-05-20T18:58:28Z

@graphemecluster do you have the time to make changes or respond to comments, or would you like me to take over the final changes to get this merged?

rbuckton

A few more minor comments to address and this will be ready to merge.

rbuckton · 2024-05-22T13:21:57Z

src/compiler/scanner.ts

@@ -1556,9 +1556,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
                tokenFlags |= TokenFlags.ContainsInvalidEscape;
                if (isRegularExpression || shouldEmitInvalidEscapeError) {
                    const code = parseInt(text.substring(start + 1, pos), 8);
-                    if (isRegularExpression !== "annex-b") {


Since this was the only place we check for "annex-b", we can make isRegularExpression a boolean again.

I thought the same too, but there is an isRegularExpression === true at the bottom

rbuckton · 2024-05-22T13:26:19Z

src/compiler/scanner.ts

-            // Annex B treats any unicode mode as the strict syntax.
-            annexB = false;
-        }
+        // Annex B treats any unicode mode as the strict syntax.


The comment explained why we set annexB to false, but doesn't apply to this whole declaration. I would change it to be more explanatory, i.e.:

Suggested change

-        // Annex B treats any unicode mode as the strict syntax.
+        // Regular expressions are checked more strictly when either in 'u' or 'v' mode, or
+        // when not using the looser interpretation of the syntax from ECMA-262 Annex B.

rbuckton · 2024-05-22T14:03:15Z

src/compiler/scanner.ts

@@ -2887,7 +2903,7 @@ export function createScanner(languageVersion: ScriptTarget, skipTrivia: boolean
                        return "\\";
                    }
                    pos--;
-                    return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ annexB ? "annex-b" : true);
+                    return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ anyUnicodeModeOrNonAnnexB || "annex-b");


Since "annex-b" is no longer used in scanEscapeSequence, this can just be true:

Suggested change

-                    return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ anyUnicodeModeOrNonAnnexB || "annex-b");
+                    return scanEscapeSequence(/*shouldEmitInvalidEscapeError*/ unicodeMode, /*isRegularExpression*/ true);

rbuckton · 2024-05-22T14:13:25Z

src/compiler/scanner.ts

+                    && charCodeUnchecked(p + 1) === CharacterCodes.question
+                    && charCodeUnchecked(p + 2) === CharacterCodes.lessThan
+                    && charCodeUnchecked(p + 3) !== CharacterCodes.equals
+                    && charCodeUnchecked(p + 3) !== CharacterCodes.exclamation


I missed this in my last review. These need to be range checked:

Suggested change

-                    && charCodeUnchecked(p + 1) === CharacterCodes.question
-                    && charCodeUnchecked(p + 2) === CharacterCodes.lessThan
-                    && charCodeUnchecked(p + 3) !== CharacterCodes.equals
-                    && charCodeUnchecked(p + 3) !== CharacterCodes.exclamation
+                    && charCodeChecked(p + 1) === CharacterCodes.question
+                    && charCodeChecked(p + 2) === CharacterCodes.lessThan
+                    && charCodeChecked(p + 3) !== CharacterCodes.equals
+                    && charCodeChecked(p + 3) !== CharacterCodes.exclamation
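A standalone sketch of why the lookahead should be range checked. Everything here is an assumption for illustration: the helper takes `text` and `end` as parameters instead of closing over scanner state, `EOF` is assumed to be `-1`, and the escape and character class tracking is a simplification of what the scanner does while rescanning the regex body.

```typescript
const EOF = -1; // assumed sentinel for "outside the scanned range"

// Range-checked read: never looks at or beyond `end`, even if `text` is longer.
function charCodeChecked(text: string, pos: number, end: number): number {
    return pos >= 0 && pos < end ? text.charCodeAt(pos) : EOF;
}

// Looks for `(?<` not followed by `=` or `!`, i.e. a named capturing group
// rather than a lookbehind assertion, within text[start, end).
function containsNamedGroup(text: string, start: number, end: number): boolean {
    let inEscape = false;
    let inCharacterClass = false;
    for (let p = start; p < end; p++) {
        const ch = text.charCodeAt(p);
        if (inEscape) { inEscape = false; continue; }
        if (ch === 0x5c /* \ */) { inEscape = true; continue; }
        if (inCharacterClass) {
            if (ch === 0x5d /* ] */) inCharacterClass = false;
            continue;
        }
        if (ch === 0x5b /* [ */) { inCharacterClass = true; continue; }
        if (
            ch === 0x28 /* ( */
            && charCodeChecked(text, p + 1, end) === 0x3f /* ? */
            && charCodeChecked(text, p + 2, end) === 0x3c /* < */
            && charCodeChecked(text, p + 3, end) !== 0x3d /* = */
            && charCodeChecked(text, p + 3, end) !== 0x21 /* ! */
        ) {
            return true;
        }
    }
    return false;
}

console.log(containsNamedGroup("(?<a>x)", 0, 7));
console.log(containsNamedGroup("a(?<n>b)", 0, 3)); // range ends after "a(?"
```

The second call shows the case the checked reads guard against: the scanned range stops before the `<`, so characters that lie past `end` must not influence the result.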

rbuckton

One more I missed.

rbuckton · 2024-05-22T14:15:36Z

src/compiler/scanner.ts

                                    isPreviousTermQuantifiable = true;
                                    break;
                                }
                            }
-                            if (max && Number.parseInt(min) > Number.parseInt(max)) {
+                            else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || text.charCodeAt(pos) === CharacterCodes.closeBrace)) {


Suggested change

-                            else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || text.charCodeAt(pos) === CharacterCodes.closeBrace)) {
+                            else if (max && Number.parseInt(min) > Number.parseInt(max) && (anyUnicodeModeOrNonAnnexB || charCodeChecked(pos) === CharacterCodes.closeBrace)) {
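The closing-brace condition reflects how engines treat braces under Annex B: an out-of-order quantifier is only an error when it is actually terminated, because an unterminated `{2,1` is read as literal text. A minimal sketch for checking this at runtime, assuming an engine with Annex B support; `isValidPattern` is a hypothetical helper.

```typescript
// Hypothetical helper: reports whether the engine accepts a pattern.
function isValidPattern(source: string, flags = ""): boolean {
    try {
        new RegExp(source, flags);
        return true;
    }
    catch {
        return false;
    }
}

console.log(isValidPattern("a{2,1}"));     // terminated quantifier, numbers out of order
console.log(isValidPattern("a{2,1"));      // unterminated: literal text under Annex B
console.log(isValidPattern("a{2,1", "u")); // Unicode mode: braces must form a quantifier
console.log(new RegExp("a{2,1").test("a{2,1"));
```

Only the second pattern should be accepted, and the final line shows it matching the brace text literally.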

Correct RegExp Behavior Related to Annex B

e049438

typescript-bot added the "For Uncommitted Bug" label (PR for untriaged, rejected, closed or missing bug) Apr 25, 2024

graphemecluster commented Apr 25, 2024

Fix Comment

358eb30

graphemecluster changed the title ~~Correct RegExp Behavior Related to Annex B~~ Correct Regular Expressions Behavior Related to Annex B Apr 26, 2024

graphemecluster added 3 commits April 27, 2024 15:21

Correct Quantifiability of {1 in Annex B

8facb0a

\c is an Annex B thing

f5c0b60

Add Tests

cff993f

graphemecluster commented Apr 27, 2024

This was referenced Apr 27, 2024

Improve Recovery of Unterminated Regular Expressions #58289

Open

Design Meeting Notes, 4/26/2024 #58416

Open

jakebailey requested review from rbuckton and jakebailey May 6, 2024 23:02

sandersn added this to Not started in PR Backlog May 7, 2024

Merge from main

603c3cf

rbuckton requested changes May 16, 2024

PR Backlog automation moved this from Not started to Waiting on author May 16, 2024

Apply Suggested Changes

2e62d25

graphemecluster force-pushed the regex-syntax-check-annexB branch from 2868bd6 to 2e62d25 Compare May 22, 2024 03:24

rbuckton requested changes May 22, 2024
