Common JavaScript Syntax Highlighting Specification
pcwalton edited this page 2010-12-22 11:59:28 -08:00

"Editor" refers to the application using the syntax highlighting specification. MUST, MUST NOT, SHOULD, SHOULD NOT, MAY, and OPTIONAL carry their usual RFC 2119 meanings.

File format

A syntax highlighting specification is a single ECMAScript source file with a .js extension. The file consists of three parts: a header, a body, and a footer, in that order. The contents of the header and footer are not specified. The header and footer may be used for module transport formats, for editor-specific initialization and destruction code, or for any other purpose as required by the editor.

Two functions, getRules() and getInfo(), MUST be defined within the body and assigned to an object named exports. The exports object itself MUST NOT be defined within the body (although it may be defined in the header). The getRules() function MUST take no parameters and MUST return a ruleset object. The getInfo() function MUST take no parameters and MUST return a metadata object.

Each property of a ruleset object MUST be an Array of rule objects. For each such property, the name of the state consisting of those rules is defined to be the corresponding property name.

Each rule object MUST contain the following properties with the corresponding property names:

  • A RegExp object named regex. The regular expression represented by the RegExp object MUST NOT contain capturing parentheses and MUST NOT contain the beginning-of-line anchor character "^". This object specifies the match pattern for the rule.

  • A string or Function named token. If the property is a Function, it MUST take one parameter—the text that was matched—and MUST return a string representing the token corresponding to this rule. If the property is a string, it simply represents the token corresponding to this rule.

Each rule object MAY contain one or both of the following properties with the corresponding property names:

  • A string named next. This string, if present, specifies the name of the new state to transition to if this rule matches.

  • A string named mode. This string MUST consist of three substrings delimited by colon (":") characters (e.g. "js:start:", where the third substring is empty). The first substring names the target syntax mode (name), the second substring names the state of the target syntax mode to transition to (state), and the third substring names the terminator string (term).
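
The mode string format above can be taken apart as in the following minimal sketch. The helper name parseModeDirective is hypothetical, and the sketch assumes that only the first two colons act as delimiters, so that a terminator may itself contain ":". The specification does not say either way.

```javascript
// Hypothetical helper: splits a rule's `mode` string into name, state, term.
// Assumption: only the first two colons delimit, so the terminator may
// itself contain ":" (the specification is silent on this point).
function parseModeDirective(mode) {
  var first = mode.indexOf(":");
  var second = first < 0 ? -1 : mode.indexOf(":", first + 1);
  if (second < 0) {
    throw new Error("mode must have the form name:state:term, got " +
                    JSON.stringify(mode));
  }
  return {
    name: mode.slice(0, first),           // target syntax mode
    state: mode.slice(first + 1, second), // state to enter in that mode
    term: mode.slice(second + 1)          // terminator string, may be empty
  };
}

// "js:start:" yields name "js", state "start", and an empty terminator.
console.log(parseModeDirective("js:start:"));
```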

Each rule object MAY also contain one or more editor-specific properties. The format and semantics of these properties are not specified. All editor-specific properties MUST begin with an underscore ("_") character.

Rule objects MUST NOT contain properties with names other than regex, token, next, or mode, unless those names are prefixed with an underscore or are inherited through the prototype chain.
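
The rule object requirements above lend themselves to a mechanical check. The following is a sketch only: checkRule and its helpers are hypothetical names, and the scan for "^" assumes that a caret inside a character class (as in [^a]) is a negation and not an anchor.

```javascript
var ALLOWED_RULE_PROPERTIES = ["regex", "token", "next", "mode"];

// Counts capturing groups by matching "" against "<source>|": the empty
// alternative always matches, and the result has one slot per group.
function countCapturingGroups(regex) {
  return new RegExp(regex.source + "|").exec("").length - 1;
}

// Looks for an unescaped "^" outside of a character class.
function hasBeginningOfLineAnchor(regex) {
  var src = regex.source, inClass = false;
  for (var i = 0; i < src.length; i++) {
    var ch = src.charAt(i);
    if (ch === "\\") { i++; continue; }               // skip escaped character
    if (inClass) { if (ch === "]") inClass = false; continue; }
    if (ch === "[") { inClass = true; continue; }
    if (ch === "^") return true;
  }
  return false;
}

// Hypothetical validator: returns a list of violations, empty if none.
function checkRule(rule) {
  var problems = [];
  if (!(rule.regex instanceof RegExp)) {
    problems.push("regex must be a RegExp");
  } else {
    if (countCapturingGroups(rule.regex) > 0) {
      problems.push("regex must not contain capturing parentheses");
    }
    if (hasBeginningOfLineAnchor(rule.regex)) {
      problems.push("regex must not contain the ^ anchor");
    }
  }
  if (typeof rule.token !== "string" && typeof rule.token !== "function") {
    problems.push("token must be a string or a Function");
  }
  ["next", "mode"].forEach(function (name) {
    if (rule[name] !== undefined && typeof rule[name] !== "string") {
      problems.push(name + " must be a string");
    }
  });
  // Object.keys sees own properties only, so inherited ones are ignored.
  Object.keys(rule).forEach(function (name) {
    if (ALLOWED_RULE_PROPERTIES.indexOf(name) < 0 && name.charAt(0) !== "_") {
      problems.push("unexpected property " + name);
    }
  });
  return problems;
}

console.log(checkRule({ regex: /\/\/.*$/, token: "comment.js", _hint: 1 }));
console.log(checkRule({ regex: /^(a)/, token: 1, color: "red" }));
```

The first call reports no violations. The second reports four: the capturing group, the anchor, the non-string token, and the unprefixed color property.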

A metadata object MUST contain the following properties:

  • A string named name. It specifies the name of the syntax mode. This is the string that is used to refer to this syntax in mode directives.

  • An Array of strings named fileexts. Each string specifies a file extension to which this syntax mode should apply.

  • An Array of strings named mimetypes. Each string specifies a MIME type to which this syntax mode should apply.

Semantics

The syntax model is that of a finite state machine. An editor is free to use any implementation for its syntax engine, but its output MUST be identical in all cases to an editor that followed this procedure.

let state = "start", stack = [ (mode, null) ]
while untokenized input remains do
  repeat
    let (mode, term) = the top element of stack
    if term is non-null and term matches the input then
      pop stack
    else
      break repeat
  end repeat
  
  let ruleset = rules[state]
  for each rule in ruleset do
    if rule.regex matches the input then
      mark the matched span as a token of type rule.token
      advance the input by the length of the matched span
      if rule.next is nonempty then
        state = rule.next
      end if
      OPTIONALLY:
        if rule.mode is nonempty then
          push (rule.mode.name, rule.mode.term) onto stack
          state = rule.mode.state
        end if
      break for
    end if
  end for
end while
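
As an illustration of the procedure, here is a minimal single-mode tokenizer for one line of input. It is a sketch under several assumptions and not a reference implementation: the name tokenizeLine is hypothetical, the OPTIONAL mode switching is left out entirely, a match is anchored at the current input position by means of the sticky ("y") flag, and after every match the search restarts from the first rule of the current state, which is one reading of the procedure.

```javascript
// Hypothetical single-mode tokenizer; rule.mode is ignored (it is OPTIONAL).
function tokenizeLine(rules, line, startState) {
  var state = startState || "start";
  var tokens = [];
  var pos = 0;
  while (pos < line.length) {
    var ruleset = rules[state];
    var matched = false;
    for (var i = 0; i < ruleset.length && !matched; i++) {
      var rule = ruleset[i];
      // Rebuild the pattern as sticky so it can only match at `pos`.
      var flags = rule.regex.flags.replace(/[gy]/g, "") + "y";
      var re = new RegExp(rule.regex.source, flags);
      re.lastIndex = pos;
      var m = re.exec(line);
      if (m === null) continue;
      var text = m[0];
      if (text === "" && !rule.next) {
        throw new Error("empty match without a transition in state " + state);
      }
      if (text !== "") {
        var type = typeof rule.token === "function" ? rule.token(text)
                                                    : rule.token;
        tokens.push({ type: type, value: text });
      }
      pos += text.length;
      if (rule.next) state = rule.next;
      matched = true;
    }
    if (!matched) {
      throw new Error("no rule in state " + state + " matches at column " + pos);
    }
  }
  // The final state is returned so the next line can continue from it.
  return { tokens: tokens, state: state };
}

// Illustrative rules for an invented "demo" language.
var demoRules = {
  start: [
    { regex: /\/\/.*$/, token: "comment.demo" },
    { regex: /"/, token: "string.demo", next: "string" },
    { regex: /[^"\/]+|\//, token: "plain" }
  ],
  string: [
    { regex: /"/, token: "string.demo", next: "start" },
    { regex: /[^"]+/, token: "string.demo" }
  ]
};

console.log(tokenizeLine(demoRules, 'say "hi" // done'));
```

Returning the final state alongside the tokens suits a line-oriented editor, which can feed it in as startState for the following line. The sketch does not guard against rulesets whose empty matches bounce between states forever.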

For any string s, each ruleset MUST contain some rule that matches s. (Rationale: Many syntax engines don't cope well with a large number of very small tokens, so specifying a fallback rule of something like { regex: /./, token: 'plain' } would potentially do more harm than good from a performance standpoint.)

If a rule matches the empty string, it MUST either transition to a new state or transition to a new mode. (Rationale: This prevents trivial infinite loops.)

RegExp objects MUST NOT match across multiple lines. (Rationale: All JavaScript code editors I know of are line-oriented.)
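
The empty-string requirement can be checked ahead of time. The sketch below uses a hypothetical helper, findStuckRules, and assumes that a next value naming the rule's own state does not count as a transition to a new state.

```javascript
// Hypothetical lint: lists rules that can match "" without leaving their state.
function findStuckRules(rules) {
  var stuck = [];
  Object.keys(rules).forEach(function (state) {
    rules[state].forEach(function (rule, index) {
      // Wrap the pattern so the test asks "can it match exactly the empty string?"
      var matchesEmpty = new RegExp("^(?:" + rule.regex.source + ")$").test("");
      var leavesState = (rule.next && rule.next !== state) || rule.mode;
      if (matchesEmpty && !leavesState) {
        stuck.push(state + "[" + index + "]");
      }
    });
  });
  return stuck;
}

console.log(findStuckRules({
  start: [
    { regex: /\s*/, token: "plain" },                // can match "", stays put
    { regex: /a*/, token: "plain", next: "other" }   // can match "", but moves on
  ],
  other: [
    { regex: /x+/, token: "plain" }                  // never matches ""
  ]
}));
// Only start[0] is reported.
```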

Note that if a target mode to switch to and a new state are both specified, the editor is free to choose which to use. (This allows an editor to attempt to switch to the new mode, but to switch to the new state as a fallback in case the new mode can't be loaded.)

The token name MUST consist of a series of token name components separated by period (".") characters. A token name component is defined as a dash ("-") or a letter ("A-Z" or "a-z"), followed immediately by a letter, and then any number of dashes, underscores, letters, or digits. (Rationale: This allows editors to use the token names as CSS class names. The use of period-separated components allows themes great flexibility in choosing specific styles for different languages, but also allows simple themes to be developed with as few as 12 rules.)
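
The token name grammar can be expressed as a regular expression. The sketch below follows the wording of the definition literally, which is an assumption about its intent: the first character is a dash or a letter and the second is a letter, so every component is at least two characters long. The name isValidTokenName is hypothetical.

```javascript
// One component, read literally from the definition above.
var TOKEN_COMPONENT = "[-A-Za-z][A-Za-z][-_A-Za-z0-9]*";

// A full token name: components separated by single periods.
var TOKEN_NAME = new RegExp(
  "^" + TOKEN_COMPONENT + "(?:\\." + TOKEN_COMPONENT + ")*$"
);

// Hypothetical helper.
function isValidTokenName(name) {
  return TOKEN_NAME.test(name);
}

console.log(isValidTokenName("comment.js"));   // true
console.log(isValidTokenName("comment..js"));  // false: empty component
console.log(isValidTokenName("9lives.js"));    // false: starts with a digit
```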

The first token name component SHOULD be one of the following strings: comment, constant, entity, invalid, keyword, markup, meta, plain, storage, string, support, variable. Unless the first token name component is plain, the last token name component SHOULD be the name of the syntax mode. Examples of suggested token names include "comment.js" and "comment.xml". (Rationale: TextMate compatibility.)

It is suggested that the editor apply the syntax mode when one of the following conditions is true:

  • The MIME type of the file being edited is one of the MIME types specified in the mimetypes property.

  • The extension of the file being edited, without the leading ".", is one of the file extensions specified in the fileexts property.
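
The two suggested conditions can be sketched as a single predicate over a metadata object. The function name syntaxModeApplies is hypothetical, and the sketch assumes exact, case-sensitive comparison and takes the extension to be the text after the last "." in the file name. Neither point is settled by this specification.

```javascript
// Hypothetical predicate: should the mode described by `info` be applied?
// `mimeType` may be null or undefined when the editor does not know it.
function syntaxModeApplies(info, fileName, mimeType) {
  if (mimeType && info.mimetypes.indexOf(mimeType) >= 0) {
    return true;
  }
  var dot = fileName.lastIndexOf(".");
  if (dot < 0) {
    return false;                       // no extension to compare
  }
  // Compare without the leading ".", as the fileexts entries omit it.
  return info.fileexts.indexOf(fileName.slice(dot + 1)) >= 0;
}

var jsInfo = { name: "js", fileexts: ["js"], mimetypes: ["text/javascript"] };
console.log(syntaxModeApplies(jsInfo, "app.js", null));               // true
console.log(syntaxModeApplies(jsInfo, "README", "text/javascript"));  // true
console.log(syntaxModeApplies(jsInfo, "notes.txt", "text/plain"));    // false
```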

Error handling mechanisms are undefined, but the editor SHOULD report violations of the specification to the user.

Implementations