LEX :
LEX is a tool used to generate a lexical analyzer. This document is a tutorial
for the use of LEX for SIL Compiler development. Technically, LEX translates a
set of regular expression specifications (given as input in input_file.l)
into a C implementation of a corresponding finite state machine
(lex.yy.c). This C program, when compiled, yields an executable lexical
analyzer.
The source SIL
program is fed as the input to the the lexical analyzer which produces a
sequence of tokens as output. (Tokens are explained below). Conceptually, a
lexical analyzer scans a given source SIL program and produces an output of
tokens. A token is a single element of the SIL programming language that is
recognized by the compiler . For instance integer, boolean, begin, end, if,
while etc. are tokens in SIL.
Example:
Integer͇ {return ID_TYPE_INTEGER;}
LEX
C compiler \
a.out
lex.yy.c
a.out
Input Output
lex.yy.c
input_file.l
This example
demonstrates the specification of rule in LEX. This rule in this example
specifies that the lexical analyzer must return the token named ID_TYPE_INTEGER
when the pattern ͆integer͇ is found in the input file. A rule in a LEX program
comprises of a pattern part (specified by a regular expression) and a
corresponding (semantic) action part (a sequence of C statements). In the above
example, ͆integer͇ is the pattern and {return ID_TYPE_INTEGER;} is the
corresponding action. The statements in the action part will be executed when
the pattern is detected in the input.
Declarations
The declarations
section consists of two parts, regular definitions and auxiliary declarations .
LEX allows the use of short-hands and extensions to regular expressions for the
regular definitions. The auxiliary declarations are copied as such by LEX to
the output lex.yy.c file.
Rules in a LEX
program consists of two parts :
i. The pattern
to be matched
ii. The
corresponding action to be executed
LEX generates C
code for the rules specified in the Rules section and places this code into a
single function called yylex(). (To be discussed in detail later). In addition
to this LEX generated code, the programmer may wish to add his own code to the
lex.yy.c file. The auxiliary functions section allows the programmer to achieve
this.
FLEX :
Flex (fast lexical analyzer generator) is a free and open-source software alternative to lex.[2] It is a computer program that generates lexical analyzers (also known as
"scanners" or "lexers").[3][4] It is frequently used as the lex implementation together
with Berkeley Yacc parser generator on BSD-derived operating systems (as both
lex
and yacc
are part of POSIX),[5][6][7] or together with GNU bison (a version of yacc) in *BSD ports[8] and in Linux distributions. Unlike Bison, flex is not part
of the GNU Project and is not released under the GNU General Public License.
Issues
Time complexity :
A Flex lexical analyzer usually has time complexity in
the length of the input. That is, it performs a constant number of operations
for each input symbol. This constant is quite low: GCC generates 12 instructions for the DFA match loop.[citation needed] Note that the
constant is independent of the length of the token, the length of the regular
expression and the size of the DFA.
However, using the REJECT macro in a scanner with the potential
to match extremely long tokens can cause Flex to generate a scanner with
non-linear performance. This feature is optional. In this case, the programmer
has explicitly told Flex to "go back and try again" after it has
already matched some input. This will cause the DFA to backtrack to find other
accept states. The REJECT feature is not enabled by default, and because of its
performance implications its use is discouraged in the Flex manual.[11]
Reentrancy :By
default the scanner generated by Flex is not reentrant. This can cause serious problems for
programs that use the generated scanner from different threads. To overcome
this issue there are options that Flex provides in order to achieve reentrancy.
A detailed description of these options can be found in the Flex manual.
Usage under non-Unix environments:
Normally the generated scanner contains references to unistd.h header
file which is Unix specific. To avoid generating code
that includes unistd.h, %option
nounistd should be used. Another issue is the call to isatty (a Unix library function),
which can be found in the generated code. The %option
never-interactive forces flex to generate code that doesn't use isatty.
Using flex from other languages :
Flex can only generate code for C and C++. To use the scanner code generated by flex
from other languages a language binding tool such as SWIG can be used.
Issues
However, using the REJECT
macro in a scanner with the potential to match extremely long tokens can cause
Flex to generate a scanner with non-linear performance. This feature is
optional. In this case, the programmer has explicitly told Flex to "go back
and try again" after it has already matched some input. This will cause
the DFA to backtrack to find other accept states. The REJECT feature is not
enabled by default, and because of its performance implications its use is
discouraged in the Flex manual.[11]
Reentrancy
By default the scanner
generated by Flex is not reentrant. This can cause serious problems for
programs that use the generated scanner from different threads. To overcome
this issue there are options that Flex provides in order to achieve reentrancy.
A detailed description of these options can be found in the Flex manual.[12]
Usage under
non-Unix environments
Normally the generated
scanner contains references to unistd.h header file which is Unix specific.
To avoid generating code that includes unistd.h, %option
nounistd should be used. Another issue is the call to isatty (a
Unix library function), which can be found in the generated code. The %option
never-interactive forces flex to generate code that doesn't use isatty.[13]
Using flex from
other languages
Flex can only generate code
for C and C++. To use the scanner code generated by flex
from other languages a language binding tool
such as SWIG can
be used.