Java – EBNF / parboiled: how to translate a regexp into PEG?
This is a question specific to the parboiled parser framework, and also about BNF/PEG in general.
Suppose I have a fairly simple regular expression:
^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$
Expressed in pseudo-EBNF, this would be:
<line> ::= <ws>? <identifier> <ws>? '=' <ws>? <nonwhitespace> <ws>?
<ws> ::= (' ' | '\t' | {other whitespace characters})+
<identifier> ::= <identifier-head> <identifier-tail>
<identifier-head> ::= <letter> | '_'
<identifier-tail> ::= (<letter> | <digit> | '_')*
<letter> ::= ('A'..'Z') | ('a'..'z')
<digit> ::= '0'..'9'
<nonwhitespace> ::= ___________
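For reference, here is a minimal sketch of the regular expression above exercised with java.util.regex, which gives a baseline to compare any PEG translation against. The class name LineRegex and the helper parse are illustrative names only, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineRegex {
    // The pattern from the question, written as a Java string literal
    // (hence the doubled backslashes).
    static final Pattern LINE =
        Pattern.compile("^\\s*([A-Za-z_][A-Za-z_0-9]*)\\s*=\\s*(\\S+)\\s*$");

    /** Returns {identifier, value}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```

A line such as "  foo_1 = bar42  " yields the pair foo_1 and bar42, while "1abc = x" (identifier starts with a digit) and "a = b c" (whitespace inside the value) are rejected.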
How do I define nonwhitespace (one or more characters that are not whitespace) in EBNF?
For those familiar with the Java parboiled library, how would you implement a rule that defines nonwhitespace?
Solution
You are stuck with the conventions of your lexer generator for specifying character ranges and operations on character ranges.
Many lexer generators accept hexadecimal values (written something like 0x...) to represent characters, so you might write either of:
'0'..'9'
0x30..0x39
for digits.
For nonwhitespace, you need to know which character set you are using. For 7-bit ASCII, nonwhitespace is conceptually all the printable characters:
0x21..0x7E
For ISO 8859-1:
( 0x21..0x7E | 0x80..0xFF )
You can decide for yourself whether the character codes from 0x80 upward count as whitespace (is the non-breaking space a space?). You also get to decide the status of the control characters 0x00..0x1F. Is tab (0x09) a whitespace character? How about CR (0x0D) and LF (0x0A)? How about the ETB control character?
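As a rough illustration, the ranges above can be written as plain Java predicates. This is only a sketch: the class and method names are made up for the example, and the "strict" variant shows just one possible policy choice (treating the non-breaking space 0xA0 as whitespace).

```java
final class NonWhitespaceRanges {
    /** 7-bit ASCII: the printable, non-space characters 0x21..0x7E. */
    static boolean isAsciiNonWhitespace(char c) {
        return c >= 0x21 && c <= 0x7E;
    }

    /** ISO 8859-1: the ASCII range plus 0x80..0xFF, mirroring the range expression above. */
    static boolean isLatin1NonWhitespace(char c) {
        return isAsciiNonWhitespace(c) || (c >= 0x80 && c <= 0xFF);
    }

    /** Hypothetical stricter policy: additionally treat NBSP (0xA0) as whitespace. */
    static boolean isLatin1NonWhitespaceStrict(char c) {
        return isLatin1NonWhitespace(c) && c != 0xA0;
    }
}
```

Under these definitions tab, CR, LF and the other control characters below 0x21 all fall outside the nonwhitespace set, which is one answer to the questions above but not the only defensible one.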
Unicode is harder, because it is a huge set and your list becomes long and messy. That's life. Our DMS Software Reengineering Toolkit is used to build parsers for a wide variety of languages, and has to support lexers for ASCII, ISO 8859-z for many values of z, and Unicode. Rather than writing complicated "additive" regular expression ranges, DMS allows subtraction of regular expressions, so we can write:
<UniCodeLegalCharacters>-<UniCodeWhiteSpace>
This is much easier to understand, and easier to get right on the first attempt.
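The same subtraction idea can be approximated in plain Java by composing predicates, with "A minus B" written as "A and not B". This is an analogy only, not DMS syntax. The definitions of "legal" (any assigned code point) and "whitespace" (Java's isWhitespace plus Unicode space separators, so that the non-breaking space is included) are assumptions chosen for the sketch, and the class and field names are hypothetical.

```java
import java.util.function.IntPredicate;

final class CharSetSubtraction {
    // Assumption: "legal" means any code point Unicode has assigned.
    static final IntPredicate LEGAL = Character::isDefined;

    // Assumption: whitespace is Java's isWhitespace plus Unicode
    // space separators (isSpaceChar), which also covers NBSP.
    static final IntPredicate WHITESPACE =
        ((IntPredicate) Character::isWhitespace).or(Character::isSpaceChar);

    // Subtraction LEGAL - WHITESPACE, expressed as "LEGAL and not WHITESPACE".
    static final IntPredicate NON_WHITESPACE = LEGAL.and(WHITESPACE.negate());

    /** True when s is non-empty and every code point is non-whitespace. */
    static boolean isNonWhitespaceRun(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(NON_WHITESPACE);
    }
}
```

The benefit is the same as in the grammar notation: the two sets are defined once, and the nonwhitespace set follows from them instead of being enumerated by hand.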