Parsing JavaScript regex with ANTLR

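I have an ANTLR JavaScript grammar (taken from the Internet) that seems to support everything except regular expression literals.

The problem with regular expression literals is that you have two rules, essentially: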

multiplicativeExpression
    : unaryExpression (LT!* ('*' | '/' | '%')^ LT!* unaryExpression)*

regexLiteral
    : '/' RegexLiteralChar* '/'

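The RegexLiteralChar rule uses different lexer rules from normal expressions (for example, a double quote does not terminate it).

This means I need to somehow change some kind of lexer state from within my parser. How can I do that? Is it even possible?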


If you look at the grammar mentioned in Bart Kiers's comment, you will find this remark in it:

The major challenges faced in defining this grammar were:

-1- Ambiguity surrounding the DIV sign in relation to the multiplicative expression and the regular expression literal. This is
solved with some lexer driven magic: a gated semantical predicate
turns the recognition of regular expressions on or off, based on the
value of the RegularExpressionsEnabled property. When regular
expressions are enabled they take precedence over division
expressions. The decision whether regular expressions are enabled is
based on the heuristics that the previous token can be considered as
last token of a left-hand-side operand of a division.

...

The areRegularExpressionsEnabled() function is defined as:

private final boolean areRegularExpressionsEnabled()
{
    if (last == null)
    {
        return true;
    }
    switch (last.getType())
    {
    // identifier
        case Identifier:
    // literals
        case NULL:
        case TRUE:
        case FALSE:
        case THIS:
        case OctalIntegerLiteral:
        case DecimalLiteral:
        case HexIntegerLiteral:
        case StringLiteral:
    // member access ending
        case RBRACK:
    // function call or nested expression ending
        case RPAREN:
            return false;
    // otherwise OK
        default:
            return true;
    }
}

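The function is then used in the RegularExpressionLiteral rule as a gated semantic predicate, so the lexer only tries to match a regex literal when the predicate returns true: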

RegularExpressionLiteral
    : { areRegularExpressionsEnabled() }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;
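
To see the heuristic in isolation, here is a minimal stand-alone sketch in plain Java. It is not ANTLR code and not taken from the grammar: the class `RegexAwareLexerSketch` and its members (`Tok`, `regexAllowedAfter`, `scanRegex`, `tokenize`) are hypothetical names made up for illustration. It assumes the lexer remembers the last non-whitespace token it produced, which is the role `last` plays in the function above. It also simplifies heavily: keywords are lexed as plain identifiers and there are no string literals, so it should be read as a demonstration of the idea rather than a usable JavaScript lexer.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy lexer: decides whether '/' starts a regex literal from the previous token. */
class RegexAwareLexerSketch {
    enum Type { IDENT, NUMBER, RPAREN, RBRACK, PUNCT, DIV, REGEX }

    static final class Tok {
        final Type type;
        final String text;
        Tok(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + ":" + text; }
    }

    // Same idea as areRegularExpressionsEnabled(): a regex may not start right
    // after a token that can end the left-hand operand of a division.
    static boolean regexAllowedAfter(Tok last) {
        if (last == null) {
            return true; // start of input
        }
        switch (last.type) {
            case IDENT:
            case NUMBER:
            case RPAREN:
            case RBRACK:
                return false;
            default:
                return true;
        }
    }

    static boolean isIdentPart(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '$';
    }

    static List<Tok> tokenize(String src) {
        List<Tok> out = new ArrayList<>();
        Tok last = null; // last significant token; whitespace never updates it
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            Type type;
            if (Character.isLetter(c) || c == '_' || c == '$') {
                while (i < src.length() && isIdentPart(src.charAt(i))) i++;
                type = Type.IDENT;
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                type = Type.NUMBER;
            } else if (c == '/' && regexAllowedAfter(last)) {
                // The "gate": only attempted when the previous token allows it.
                i = scanRegex(src, i);
                type = Type.REGEX;
            } else {
                i++;
                type = c == '/' ? Type.DIV
                     : c == ')' ? Type.RPAREN
                     : c == ']' ? Type.RBRACK
                     : Type.PUNCT;
            }
            last = new Tok(type, src.substring(start, i));
            out.add(last);
        }
        return out;
    }

    // Returns the index just past the regex body, the closing '/' and any flags.
    // An unterminated regex simply runs to the end of the input.
    static int scanRegex(String src, int i) {
        i++; // opening '/'
        boolean inClass = false; // inside [...], where '/' does not terminate
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') { i += 2; continue; } // skip the escaped character
            if (c == '[') inClass = true;
            else if (c == ']') inClass = false;
            else if (c == '/' && !inClass) { i++; break; }
            i++;
        }
        i = Math.min(i, src.length());
        while (i < src.length() && Character.isLetter(src.charAt(i))) i++; // flags
        return i;
    }
}
```

With this sketch, `tokenize("a / b")` produces a `DIV` token because the slash follows an identifier, while `tokenize("x = /a\"b/g")` produces a single `REGEX` token for `/a"b/g` because the slash follows `=`. The double quote inside the literal does not terminate anything, which is the behaviour the question asks for. Note that no parser state is involved: the decision is made entirely inside the lexer from the previous token.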