Do you know the ancient proverb that says “If you want to really understand something, parse or generate it!”. Just kidding :) I made that up. But doesn’t make less true through.
This post briefly explains BNF and how Lark uses it to build AST(Abstract syntax tree) which is basically the biggest computer science-y part of the job.
BNF Link to heading
BNF is used to define formal specifications of context-free grammars using Production Rules
and Terminal symbols
. In this example from wiki, twelve
is a production rule to terminals "1"
and "2"
twelve = "1", "2" ;
two hundred one = "2", "0", "1" ;
three hundred twelve = "3", twelve ;
twelve thousand two hundred one = twelve, two hundred one ;
Standards like 1800 use BNF heavily to define language syntax (semantics is decribed by natural language whenever needed). So, parsers know what’s legal and what’s not.
Lark Link to heading
Lark is a python library to parse text based on CFG(context-free grammars). I will just copy from Lark docs.
What can it do?
Parse all context-free grammars, and handle any ambiguity gracefully
Build an annotated parse-tree automagically, no construction code required.
Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)
Run on every Python interpreter (it’s pure-python)
Generate a stand-alone parser (for LALR(1) grammars)
The important part is that Lark parses context-free grammars which can be written in EBNF and generates AST parse tree. Then Transformers can be written to shape the generated tree.
To use lark, Lark needs the grammar and start symbol.
from lark import Lark
GRAMMAR = """
body: foobar [optional] | bar
foobar: CNAME
bar: "fff"
optional: "dd"
%import common.WS
%import common.CNAME
%ignore WS
"""
code = """
ggg
"""
parser = Lark(GRAMMAR, start='body')
ast_tree = parser.parse(code)
print(ast_tree.pretty())
print(ast_tree)
In this example. i am using lark CNAME
which is identifier as terminal for the rule foobar
. When printing the AST tree, CNAME
takes ggg
as it is matched by the production rule.
body
foobar ggg
Tree('body', [Tree('foobar', [Token('CNAME', 'ggg')])])
Lark Grammar Link to heading
Lark grammar can be found at docs. Basically, it combines EBNF and regex to provide clean pythonic way to parse text
- rule
- TERMINAL
- “string literal” or /regexp literal/
- (item item ..) - Group items
- [item item ..] - Maybe. Same as (item item ..)?, but when maybe_placeholders=True, generates None if there is no match.
- item? - Zero or one instances of item (”maybe”)
- item* - Zero or more instances of item
- item+ - One or more instances of item
For example, This is more complicated production rule for SVA sequence with optional, grouping and literal strings.
sequence_port_item: ( attribute_instance )* [ "local" [ sequence_lvar_port_direction ] ] sequence_formal_type formal_port_identifier (variable_dimension)* [ "=" sequence_actual_arg ]
That’s it. Happy Parsing!