Overview [Build Status](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.

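
To illustrate the two guarantees above, here is a minimal sketch using a hypothetical stand-in regex, `toyTokens`. It is far simpler than the real `jsTokens` and only mimics its overall shape. Because the last alternative matches any single character (or the empty string), every match starts where the previous one ended, so joining the matches reproduces the input.

```js
// Hypothetical stand-in for `jsTokens` (NOT the real regex): numbers, names,
// whitespace, and a catch-all that matches the empty string or any one character.
var toyTokens = /(\d+)|([A-Za-z_$][\w$]*)|(\s+)|(^$|[\s\S])/g

var source = "var x = 42 # oops"
var matches = source.match(toyTokens)
// ["var", " ", "x", " ", "=", " ", "42", " ", "#", " ", "oops"]

// No character is skipped, so the matches concatenate back to the input.
var roundTrips = matches.join("") === source

// The empty string produces exactly one (empty) match.
var emptyMatches = "".match(toyTokens)
```

The `#` character is not valid JavaScript here, yet it still ends up in a match of its own rather than being skipped.
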

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

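
One way such a function can work is to look at which capture group of the match participated. The sketch below is an assumption for illustration only: `toyTokens`, `toyMatchToToken` and `toyTokenize` are hypothetical names, the regex knows just a few of the token types, and the real `matchToToken` may be implemented differently.

```js
// Hypothetical, heavily simplified stand-ins (NOT the js-tokens source).
// Each capture group corresponds to one token type.
var toyTokens = /(\d+)|([A-Za-z_$][\w$]*)|(\s+)|(^$|[\s\S])/g

function toyMatchToToken(match) {
  var token = {type: "invalid", value: match[0]}
  if (match[1] !== undefined) token.type = "number"
  else if (match[2] !== undefined) token.type = "name"
  else if (match[3] !== undefined) token.type = "whitespace"
  return token
}

function toyTokenize(string) {
  var tokens = []
  var match
  toyTokens.lastIndex = 0 // the `g` flag makes `exec` stateful
  while ((match = toyTokens.exec(string)) !== null) {
    tokens.push(toyMatchToToken(match))
    // An empty match does not advance `lastIndex`; stop to avoid looping forever.
    if (match[0] === "") break
  }
  return tokens
}

var tokens = toyTokenize("a 1")
// [{type: "name", value: "a"}, {type: "whitespace", value: " "},
//  {type: "number", value: "1"}]
```

The same `exec` loop shape should apply when the real `jsTokens` and `matchToToken` are used in place of the toy versions.
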

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

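
As a small sketch of that check, the hypothetical helper below (`tokenFlavor` is not part of js-tokens) takes a token object of the `{type, value}` shape described above and names its flavor from the leading characters. The returned labels are made up for this example.

```js
// Hypothetical helper, not part of js-tokens. Labels are illustrative.
function tokenFlavor(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line" : "multi-line"
  }
  if (token.type === "string") {
    if (value[0] === "'") return "single-quoted"
    if (value[0] === '"') return "double-quoted"
    return "template"
  }
  return null // other token types have no flavors
}

var flavors = [
  tokenFlavor({type: "comment", value: "// note"}),
  tokenFlavor({type: "comment", value: "/* note */"}),
  tokenFlavor({type: "string", value: "`a${b}`", closed: true}),
]
// ["single-line", "multi-line", "template"]
```
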

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

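
The behaviour described above, and the `closed` property mentioned under Usage, can be sketched with an optional capture group for the closing delimiter. The patterns below are simplified, hypothetical stand-ins (`toyString` and `toyComment` are assumptions for this example, and template strings are left out). The point is only that the token is still matched when the closing group is absent.

```js
// Hypothetical stand-in patterns, not the js-tokens source.
// Group 2 of `toyString` and group 1 of `toyComment` capture the closing
// delimiter, and are `undefined` when the token is unterminated.
var toyString = /(['"])(?:(?!\1|\\).|\\(?:\r\n|[\s\S]))*(\1)?/
var toyComment = /\/\*(?:[^*]|\*(?!\/))*(\*\/)?/

var closedString = toyString.exec('"abc" + x')
var openString = toyString.exec('"abc\nnext line')
var openComment = toyComment.exec("/* never closed\nvar x = 1")

// An unterminated string stops at the end of the line...
var openStringValue = openString[0] // '"abc'
// ...while an unterminated comment runs to the end of the input.
var openCommentValue = openComment[0] // "/* never closed\nvar x = 1"
```

A `closed` flag could then be derived by testing whether the closing group is `undefined`.
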

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.

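
To make the one-level limit concrete, here is a sketch with a hypothetical stand-in pattern, `toyTemplate` (an assumption for illustration, not necessarily what js-tokens uses). Inside `${...}` it accepts either non-brace characters or a single `{...}` group, which is how one level of nesting can be written without recursion.

```js
// Hypothetical stand-in pattern, not the js-tokens source.
// Inside `${ }`, `\{[^}]*\}?` allows one nested pair of braces; there is no
// way to express "any depth" without recursion.
var toyTemplate = /`(?:[^`\\$]|\\[\s\S]|\$(?!\{)|\$\{(?:[^{}]|\{[^}]*\}?)*\}?)*(`)?/

var flat = "`a ${b} c` + d"
var nested = "`a ${ {b: 1}.b } c` + d"

// Both templates are matched as one token, interpolation included.
var flatToken = toyTemplate.exec(flat)[0] // "`a ${b} c`"
var nestedToken = toyTemplate.exec(nested)[0] // "`a ${ {b: 1}.b } c`"
```

With deeper nesting a pattern like this can no longer pair the braces reliably, so the result should be treated as an approximation.
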

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section.)

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` rows:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

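
The collision can be reproduced with a sticky regex, which (like a tokenizer) tries to match at one position and knows nothing about what came before. `toyRegexLiteral` and `matchAt` below are hypothetical, simplified names used only for this sketch.

```js
// Hypothetical, simplified regex-literal pattern (NOT the js-tokens source).
// The `y` (sticky) flag makes it match only at `lastIndex`.
var toyRegexLiteral = /\/(?![*\/])(?:[^\/\\\n]|\\.)+\/[gmiyus]*/y

function matchAt(source, index) {
  toyRegexLiteral.lastIndex = index
  var match = toyRegexLiteral.exec(source)
  return match === null ? null : match[0]
}

var numberLine = "var number = bar / 2/g"
var regexLine = "var regex = / 2/g"

// Starting at the first slash, both lines look identical from here on.
var inNumberLine = matchAt(numberLine, numberLine.indexOf("/")) // "/ 2/g"
var inRegexLine = matchAt(regexLine, regexLine.indexOf("/")) // "/ 2/g"
```
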

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.

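
A sketch of the flag rule: the hypothetical helper `looksLikeFlags` (an assumption for illustration, not a js-tokens export) reports whether a name following the second slash could be read as regex flags under the `gmiyus`, 1 to 6 characters rule stated above.

```js
// Hypothetical helper, not part of js-tokens: 1 to 6 characters, each one of
// g, m, i, y, u or s. Repeats are allowed, which keeps the pattern simple.
var flagsPattern = /^[gmiyus]{1,6}$/

function looksLikeFlags(name) {
  return flagsPattern.test(name)
}

var results = {
  e: looksLikeFlags("e"), // false, so `bar / 2/e` stays division
  g: looksLikeFlags("g"), // true, so `bar / 2/g` is ambiguous
  gi: looksLikeFlags("gi"), // true
  gimx: looksLikeFlags("gimx"), // false, `x` is not a flag
}
```
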

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.

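
The two rules above can be approximated with negative lookaheads placed after the closing slash. The pattern below is a deliberately small, hypothetical sketch (`toyRegexLiteral` and `findRegexLiteral` are assumptions, and the character classes in the lookaheads cover only a few cases). The real regex source remains the reference for the exact conditions.

```js
// Hypothetical, simplified pattern (NOT the js-tokens source).
// After the closing slash, accept either:
//   - flags, unless an operator character follows, or
//   - no flags, unless a name, number or string follows.
var toyRegexLiteral =
  /\/(?![*\/])(?:[^\/\\\n]|\\.)+\/(?:[gmiyus]{1,6}\b(?!\s*[+*&=])|(?!\s*[\w$'"]))/

function findRegexLiteral(line) {
  var match = toyRegexLiteral.exec(line)
  return match === null ? null : match[0]
}

var plain = findRegexLiteral("var regex = / 2/g") // "/ 2/g"
var beforeName = findRegexLiteral("x = a / 2/ b") // null, read as division
var beforeOperator = findRegexLiteral("x = a / 2/g + 1") // null, read as division
```

Note that the first result is accepted whether or not it really is a regex literal, which matches the summary given below the list: if the end of a statement looks like a regex literal, it is treated as one.
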

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex
  literals apart in more cases.
- [Named capture groups] might simplify some things.

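
For readers unfamiliar with these features, here is a short sketch of each one in isolation. The patterns are illustrative assumptions and are not taken from js-tokens; they require a JavaScript engine with ES2018 regex support.

```js
// Unicode property escapes: describe identifier characters by property
// instead of listing code point ranges (requires the `u` flag).
var identifierName = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u
var validName = identifierName.test("ñandú") // true
var invalidName = identifierName.test("→") // false, an arrow is not a letter

// Lookbehind assertions: only treat `/` as starting a regex literal when it
// directly follows `(` (one of many contexts a real tokenizer would need).
var slashAfterParen = /(?<=\()\//
var inCall = slashAfterParen.test("foo(/= 2/g)") // true
var inDivision = slashAfterParen.test("foo /= 2/g") // false

// Named capture groups: refer to groups by name rather than by index.
var numberOrName = /(?<number>\d+)|(?<name>[A-Za-z_$][\w$]*)/
var groups = numberOrName.exec("42").groups
// groups.number is "42", groups.name is undefined
```
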

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html


License
=======

[MIT](LICENSE).