Do you know the ancient proverb that says “If you want to really understand something, parse or generate it!”. Just kidding :) I made that up. But doesn’t make less true through.

This post briefly explains BNF and how Lark uses it to build AST(Abstract syntax tree) which is basically the biggest computer science-y part of the job.

BNF Link to heading

BNF is used to define formal specifications of context-free grammars using Production Rules and Terminal symbols. In this example from wiki, twelve is a production rule to terminals "1" and "2"

twelve                          = "1", "2" ;
two hundred one                 = "2", "0", "1" ;
three hundred twelve            = "3", twelve ;
twelve thousand two hundred one = twelve, two hundred one ;

Standards like 1800 use BNF heavily to define language syntax (semantics is decribed by natural language whenever needed). So, parsers know what’s legal and what’s not.

Lark Link to heading

Lark is a python library to parse text based on CFG(context-free grammars). I will just copy from Lark docs.

What can it do?

Parse all context-free grammars, and handle any ambiguity gracefully

Build an annotated parse-tree automagically, no construction code required.

Provide first-rate performance in terms of both Big-O complexity and measured run-time (considering that this is Python ;)

Run on every Python interpreter (it’s pure-python)

Generate a stand-alone parser (for LALR(1) grammars)

The important part is that Lark parses context-free grammars which can be written in EBNF and generates AST parse tree. Then Transformers can be written to shape the generated tree.

To use lark, Lark needs the grammar and start symbol.

from lark import Lark

GRAMMAR = """
body: foobar [optional] | bar
foobar: CNAME
bar: "fff"
optional: "dd"

%import common.WS
%import common.CNAME
 %ignore WS
"""

code = """
ggg
"""
parser = Lark(GRAMMAR, start='body')
ast_tree = parser.parse(code)

print(ast_tree.pretty())
print(ast_tree)

In this example. i am using lark CNAME which is identifier as terminal for the rule foobar. When printing the AST tree, CNAME takes ggg as it is matched by the production rule.

body
  foobar	ggg

Tree('body', [Tree('foobar', [Token('CNAME', 'ggg')])])

Lark Grammar Link to heading

Lark grammar can be found at docs. Basically, it combines EBNF and regex to provide clean pythonic way to parse text

  • rule
  • TERMINAL
  • “string literal” or /regexp literal/
  • (item item ..) - Group items
  • [item item ..] - Maybe. Same as (item item ..)?, but when maybe_placeholders=True, generates None if there is no match.
  • item? - Zero or one instances of item (”maybe”)
  • item* - Zero or more instances of item
  • item+ - One or more instances of item

For example, This is more complicated production rule for SVA sequence with optional, grouping and literal strings.

sequence_port_item: ( attribute_instance )* [ "local" [ sequence_lvar_port_direction ] ] sequence_formal_type formal_port_identifier (variable_dimension)* [ "=" sequence_actual_arg ]

That’s it. Happy Parsing!