Parser
Here be dragons
The standard text on writing parsers and compilers is "Compilers: Principles,
Techniques and Tools" by Aho, Sethi and Ullman. The book is so well known
it has a nickname: "The Dragon Book". This is because there is a picture
of a dragon on the cover. Who says programmers aren't creative? If you
ever need to write a parser, never try to work out how to do it yourself.
The area is well researched and excellent techniques are well known. All
you need to do is apply them. The difference between a parser and a compiler
is that a parser processes some text, providing information on its structure
in a useful form. A compiler will contain a parser, but goes on to use
its output to generate machine code. Usually compilers also make use of
a symbol table, to store information about identifiers (like types, scope,
number of parameters etc) and an error handler, to report errors, and possibly
to take corrective action so that the remainder of the program can be parsed.
The phases can be simply described as follows:
1. Lexical analysis. This takes a stream of characters, and outputs
a stream of tokens. A token is a non-divisible element of the language,
like the keywords if or procedure, or the + and ; symbols, in Pascal.
A sequence of characters that makes up a token is called a lexeme, hence
the name. This phase is often responsible for removing whitespace (spaces,
tabs and new lines) and recognising identifiers and keywords. Usually the
lexical analyser recognises at least integers, instead of using grammar
rules like the example given earlier. Regular expressions are usually sufficient
to describe the actions needed, and they are often implemented as finite
state machines (there is a small sketch of this after the list).
2. Syntax analysis. This phase takes the stream of tokens and
checks that they obey the grammar rules. The output from this stage is
a hierarchical structure of tokens that I'm not actually going to describe,
because it isn't used in recursive descent parsers. Instead, the same information
is implicit in the order of function calls on the call stack. There is
some flexibility in the division of labour between lexical and syntactic
analysis. For example, sometimes whitespace is removed by the lexical analyser
and sometimes by the syntax analyser.
3. Semantic analysis. This phase ensures that the program makes
sense. This usually takes the form of type checking; making sure that you
don't multiply numbers and strings, for instance.
4. Code generation. This may result in either machine code,
or an intermediate form like p-code. This enables the same back-end to
be used for compilers for different languages. For instance, a Pascal and
a C++ compiler might both produce p-code, and then use the same code optimisation
and generation phases.
5. Code optimisation. Sometimes transformations can be made
to make the code run more efficiently. Substituting actual values for constants,
so they don't need to be looked up at run-time, and taking advantage of
instructions specific to a particular microprocessor, are simple examples.
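As a taste of what is to come, here is a minimal sketch of how a lexical
analyser might recognise an integer token by stepping through a simple state
machine. The function and its names are my own illustration, assuming the
input is held in a string:

// Collect an integer lexeme starting at Pos in S. The loop is in
// effect a two-state machine: stay in the "in a number" state while
// digits arrive, and stop on the first non-digit. Returns '' if no
// digit is found, and advances Pos past the lexeme otherwise.
function ScanInteger(const S : string;var Pos : Integer) : string;
var
  Start : Integer;
begin
  Start := Pos;
  while (Pos <= Length(S)) and (S[Pos] in ['0'..'9']) do begin
    Inc(Pos);
  end;
  Result := Copy(S,Start,Pos - Start);
end;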
Obviously, parsing an HTML file is going to have slightly different
phases. A browser would replace code generation with producing an image
of the page. Fortunately, we can stop at syntax analysis, at which point
we will have enough information to produce our diagram. Basically, a parser
consists of stages 1 to 3.
There are two easy ways to write a parser. One is to use the tools Lex
and Yacc, which are freely available in several implementations (look for
LexAndYacc.zip on the Delphi Super Page at http://sunsite.icm.edu.pl/delphi/
). Their use is described in the Dragon Book. They are a little too sophisticated
for our purposes here, so we will use the other easy technique, recursive
descent parsing, which is the easiest way to hand-code a parser. Recursive
descent is an example of a type of parsing scheme known as syntax-directed
translation, which uses the grammatical rules to organise the code of the
parser. The first step is always to define the grammar of the language
we wish to parse. We then convert that grammar directly into code. Any
changes or corrections are made the same way; change the grammar first,
then change the code to reflect the new grammar. There is usually also
a set of semantic rules, which specify what else to do, which might include
type checking or updating the symbol table. For a larger project, these
semantic rules should be more formally specified, but I am going to introduce
them later on, when we get to write the parsing code.
To begin with, we need to define a few terms. A grammar defines the
syntax of a language, that is, what that language looks like. It does not
define the meaning, or semantics, of a language. Grammars are often defined
using BNF, or Backus-Naur Form. This is itself an example of a context-free
grammar, which have four components:
1. A set of tokens, known as terminal symbols.
2. A set of nonterminals, each of which represents a sequence of tokens.
3. A set of productions, which are the structuring rules of the grammar.
Each production has a nonterminal on the left side, an arrow, and then
a sequence of tokens and/or nonterminals on the right side.
4. One of the nonterminals is defined as the start symbol.
In addition, BNF usually uses boldface strings to denote terminals,
italics for nonterminals, and normal type for tokens (usually numbers and symbols).
We also usually use the symbol | to separate choices on the right
side of a production. In Extended BNF, or EBNF, we use the symbols * and
+ to denote zero or more, and one or more instances of a terminal or nonterminal,
respectively. (These operations are called Kleene closure and positive
closure, by the way.) If we want to apply either closure to a set of choices,
the choices are enclosed in square brackets []. So, a set of productions
that defines how to write a number might be:
number -> sign digitlist
sign -> + | - | ε
digitlist -> [0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]+
Here the symbol ε denotes the empty string. The rules can be read as:
"A number has the form of a sign followed by a digitlist, where a sign
is the + symbol, or the - symbol, or the empty string (ie no symbol at all),
and a digitlist is a list of at least one character from the digits 0 to 9
inclusive."
We will use rules of this form to define HTML. As we aren't interested
in all possible elements of an HTML page, we will define only those portions
of it that we need, but you will be able to extend the example easily enough
if you wish. Recursive descent parsers have a set of recursive procedures
to process the token stream, one procedure for each nonterminal in the
grammar. For example, our simple grammar would have procedures called Number,
Sign and DigitList, and the lexical analyser would just return single characters
as tokens. When several rules (or productions) have the same nonterminal
on the left-hand side, they are combined into the one procedure.
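To make this concrete, a recursive descent parser for the number grammar
might look like the following sketch. The Lookahead variable and Match
procedure are assumed helpers, along the lines of the ones we will meet
shortly; this is an illustration, not code from the disk.

procedure Number;
begin
  // number -> sign digitlist
  Sign;
  DigitList;
end;

procedure Sign;
begin
  // sign -> + | - | ε; the empty alternative just falls through
  if (Lookahead = '+') or (Lookahead = '-') then begin
    Match(Lookahead);
  end;
end;

procedure DigitList;
begin
  // digitlist -> [0..9]+; at least one digit is required
  if not (Lookahead in ['0'..'9']) then begin
    raise Exception.Create('digit expected');
  end;
  while Lookahead in ['0'..'9'] do begin
    Match(Lookahead);
  end;
end;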
At this point, we could get very boring, with tedious mathematical proofs,
and talk about start and follow sets, canonical derivations and suchlike
wonders. Instead, we'll look very briefly at a couple of key concepts,
and leave the details to Aho et al. One important notion is that of the
lookahead symbol. As the name suggests, this is the next token in the input
stream. It is used in predictive parsing to determine which grammar rule
to use. So the lookahead symbol will be used to determine the next procedure
to call in our recursive descent parser. This is all a bit vague, so let's
consider another simple example grammar, used to define the form of simple
expressions like 1 + 2, or 5 - 3. This example will also highlight an important
point about writing grammars. Here are our rules:
expr -> expr + term
expr -> expr - term
expr -> term
term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
There is a problem with this set of productions. Suppose we begin at
the start symbol, in this case expr (unless otherwise stated, the first
nonterminal is considered to be the start symbol). The first production
is fired, which looks for an expr, then a + token, then a term (ie a digit).
When looking for an expr, the first rule is fired, and so on indefinitely.
This is an example of a left recursive grammar. What we need to do is rewrite
the productions so that the same language is defined, but where there is
no left recursion. There is a formal method to use in these situations,
which is well described in the Dragon Book, but basically we need to ensure
that no rules have the form:
A -> Ax
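In its simplest form, the transformation replaces a left recursive pair
of productions
A -> Ax
A -> y
with the right recursive equivalent
A -> y R
R -> x R | ε
where R is a new nonterminal; the language generated (a y followed by any
number of x's) is unchanged.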
The example grammar can be rewritten as:
expr -> term rest
rest -> + expr | - expr | ε
term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Now we can unambiguously determine which production to apply. To make
this a bit easier to follow, consider the following pseudocode:
procedure expr;
begin
  term;
  rest;
end;

procedure rest;
begin
  if lookahead = '+' then begin
    match('+');
    expr;
  end else if lookahead = '-' then begin
    match('-');
    expr;
  end else begin
    // Do nothing - rest matched the empty string
  end;
end;

procedure term;
begin
  if IsDigit(lookahead) then begin
    match(lookahead);
  end else begin
    raise Exception.Create('digit expected');
  end;
end;

procedure match(t : token);
begin
  if lookahead = t then begin
    lookahead := GetNextToken;
  end else begin
    raise Exception.Create(t + ' expected');
  end;
end;
Now it should be starting to become clear why this method of writing
a parser is so simple. We simply rewrite the productions as code. There
is an extra procedure match, which is used to simplify the other procedures.
It gets the next token only if its argument matches the lookahead symbol.
These procedures could be extended with code to provide the actions required
by the semantic rules; evaluating the expression, for instance. To see
how this works, try following this example through. Suppose we want to
parse the string:
7 - 2
The steps are:
1. The start symbol is expr, so call that procedure. This calls the
term procedure.
2. The term procedure finds the token 7, and returns to expr, which
calls rest.
3. The rest procedure finds a - token, and therefore calls expr again
(note the indirect recursion here).
4. The second instance of expr calls term, which finds the 2 token.
5. The second instance of expr calls rest again, which matches the empty
string.
6. The second instance of expr returns to the first call of rest, which
returns to the first call of expr, which then terminates.
The point of this little digression is that we need to make sure the
grammar we use for parsing HTML files is not left recursive. Right recursion
is fine, and will often lead to tail recursion in the code. That is, the
last statement in a procedure is a recursive call of that same procedure.
This can always be replaced by a while or repeat loop. Julian Bucknall
wrote an excellent article on recursion in issue ?? which discusses how,
and whether, to replace tail recursion.
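As an example, the mutually recursive expr and rest procedures from the
pseudocode above can be collapsed into a single procedure with a loop,
which parses exactly the same language:

procedure expr;
begin
  term;
  // rest -> + expr | - expr | ε, unrolled into a loop
  while (lookahead = '+') or (lookahead = '-') do begin
    match(lookahead);
    term;
  end;
end;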
Tag, you're it
That's probably sufficient background to start writing the HTML parser.
Again, the first step is to write the grammar. We need to decide how much
HTML we need to be able to decipher. Because we want to create a diagram
with links between an HTML page and pages linked to it, along with any
images used on the original page, we can ignore most tags, and only look
for the following:
1. The A tag. The HREF attribute of this tag specifies the URL of a
hypertext link.
2. The AREA tag. The HREF attribute of this tag specifies the URL of
a hypertext link.
3. The BASE tag. The HREF attribute of this tag will modify any other
URL on the page.
4. The FRAME tag. The SRC attribute of this tag specifies the URL of
a hypertext link.
5. The IMG tag. The SRC attribute of this tag specifies the URL of
an embedded object, usually an image or video clip.
6. The LINK tag. The HREF attribute of this tag specifies the URL of
another document that can be used in different ways in the current page.
For example, if the REL attribute is set to STYLESHEET, the linked document
is a style sheet. A style sheet is just a text file with instructions on
how to modify some default characteristics of an HTML page. It is usually
used to give a consistent look to all pages on a web site.
7. The TITLE and /TITLE tags. If it exists, we will use the document
title as the name of the current page on the diagram. It is simple to parse,
as the text between the start and end tags cannot contain any other HTML.
Note that we are ignoring the APPLET, BLOCKQUOTE, DEL, DIV, EMBED, HEAD,
IFRAME, ILAYER, INPUT, INS, LAYER, META, OBJECT, Q, SCRIPT, SPAN, STYLE
tags, all of which can have an URL as one of the attributes. You may like
to add these yourself. In addition, we will want to ignore the content
of comment and META tags, as they can contain all sorts of weird things.
The general form of an HTML tag is:
<TAGNAME ATTRIBUTE1=value1 ATTRIBUTE2=value2>
where TAGNAME can contain letters, numbers, hyphens and exclamation
marks (<!-- --> and <!DOCTYPE> are legal tags). Attribute
names generally contain only letters and hyphens, but because they can sometimes
be event names, we will allow numbers as well. We will allow exclamation
marks just to make the lexical analysis easier. If values contain anything
other than letters, numbers, hyphens or periods, then they must be enclosed
in double quotes. Also, in theory, every well-formed HTML document should
have at least these elements:
<!DOCTYPE>
<HTML>
<HEAD>
<TITLE>The document title goes here</TITLE>
</HEAD>
<BODY>
The actual document information goes here
</BODY>
</HTML>
This is actually somewhat too complicated for our purposes, but it might
be useful information for more sophisticated applications. You should note
too, that in practice, many documents do not contain all these elements.
We shall consider an HTML document to be simply a list of tags and data.
All of this suggests the following grammar:
Document -> [Tag | Data]+
Tag -> < TagName AttributeList > | < / TagName >
Data -> [any character except <]*
TagName -> Identifier
AttributeList -> [AttributeName AttributeRest]*
AttributeName -> Identifier | QuotedValue
AttributeRest -> = Value | ε
Value -> QuotedValue | PlainValue
Identifier -> [A..Z | a..z | 0..9 | - | !]+
QuotedValue -> " [any character]* "
PlainValue -> [A..Z | a..z | 0..9 | - | .]+
Notice that there is no left recursion in any production. Our semantic
rules will involve recording information for certain attributes of the
tags listed above. We will use a similar idea to the symbol table used
in most compilers. For error handling, we will raise Delphi exceptions,
as this will automatically unwind the call stack for us, and allow us
to catch the exceptions in one place. The lexical analysis will be extremely
simple, just returning tokens of one character. I have built in the capability
to easily extend this if you wish. The definition of the parsing classes
is shown below.
interface

uses
  SysUtils, Classes;

type
  TjimToken = class(TObject)
  private
    FTokenType : Integer;
    FAsString  : string;
  public
    property TokenType : Integer read FTokenType write FTokenType;
    property AsString : string read FAsString write FAsString;
  end;

  TjimLexicalAnalyser = class(TObject)
  private
    FText     : string;
    FPosition : Integer;
    procedure SetText(const Value : string);
  public
    constructor Create;
    procedure GetNextToken(NextToken : TjimToken);
    property Text : string read FText write SetText;
  end;

  TjimSymbolType = (stTitle,stBase,stLink,stImage);

  TjimSymbol = class(TCollectionItem)
  private
    FSymbolType  : TjimSymbolType;
    FSymbolValue : string;
  public
    procedure Assign(Source : TPersistent); override;
    property SymbolType : TjimSymbolType read FSymbolType write FSymbolType;
    property SymbolValue : string read FSymbolValue write FSymbolValue;
  end;

  TjimSymbolTable = class(TCollection)
  private
    function GetItem(Index : Integer) : TjimSymbol;
    procedure SetItem(Index : Integer;Value : TjimSymbol);
  public
    function Add : TjimSymbol;
    function AddSymbol(SymType : TjimSymbolType;SymValue : string) : TjimSymbol;
    property Items[Index : Integer] : TjimSymbol read GetItem write SetItem; default;
  end;

  TjimHtmlParser = class(TObject)
  private
    FLookahead   : TjimToken;
    FLexAnalyser : TjimLexicalAnalyser;
    FSymbolTable : TjimSymbolTable;
    FLastTag     : string;
    procedure Match(T : Integer);
    procedure ConsumeWhiteSpace;
    procedure Document;
    procedure Tag;
    procedure Data;
    procedure TagName;
    procedure AttributeList;
    function AttributeName : string;
    function Value : string;
    function Identifier : string;
    function QuotedValue : string;
    function PlainValue : string;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Parse(const DocString : string);
    property SymbolTable : TjimSymbolTable read FSymbolTable;
  end;

  EjimHtmlParserError = class(Exception);
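As a quick orientation before we discuss the pieces, here is a minimal
sketch of how the finished parser is meant to be driven. The variable
DocText holding the raw HTML, and the surrounding code, are my assumptions
rather than code from the disk:

var
  Parser : TjimHtmlParser;
  I      : Integer;
begin
  Parser := TjimHtmlParser.Create;
  try
    // DocText is assumed to hold the raw HTML, eg loaded via TStringList
    Parser.Parse(DocText);
    // Walk the symbol table and report what was found
    for I := 0 to Parser.SymbolTable.Count - 1 do begin
      case Parser.SymbolTable[I].SymbolType of
        stTitle : WriteLn('Title: ' + Parser.SymbolTable[I].SymbolValue);
        stBase  : WriteLn('Base:  ' + Parser.SymbolTable[I].SymbolValue);
        stLink  : WriteLn('Link:  ' + Parser.SymbolTable[I].SymbolValue);
        stImage : WriteLn('Image: ' + Parser.SymbolTable[I].SymbolValue);
      end;
    end;
  finally
    Parser.Free;
  end;
end;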
There are a couple of support classes to represent tokens and symbols.
Should you wish to write a parser for your own purposes, these should function
as templates you can build on. Usually, as here, the tokens have a type
that is the ASCII value of the character, if they are single characters,
otherwise it is some other predefined integer constant, outside the range
0 to 255. So, multiple character tokens would each need their own type.
A common extra token type is a marker for the end of the document. We will
use the constant ttEndOfDoc, defined to be -1. Obviously, the symbol table
would need to store different information as well. Often this would include
the type of the symbol (as in our application) which would then be used
for type checking (which we won't be doing). We will not go through the
code for the lexical analyser as it is extremely simple, and you will easily
be able to follow the code on the disk. Similarly, the symbol table and
symbol objects are descended from TCollection and TCollectionItem, respectively,
and are straightforward implementations as described in the Delphi help
and Xavier Pacheco's article in issue 10 (see how many opportunities to
plug that CD-ROM I'm giving you, Mr Editor, do I get a commission?). Hopefully,
the code for TjimHtmlParser will be easy to follow as well. A couple of
examples should be enough to give you the gist.
procedure TjimHtmlParser.Document;
begin {Document}
  while FLookahead.TokenType <> ttEndOfDoc do begin
    ConsumeWhiteSpace;
    if FLookahead.AsString = '<' then begin
      Tag;
    end else begin
      Data;
    end;
  end;
  Match(ttEndOfDoc);
end; {Document}
procedure TjimHtmlParser.Tag;
begin {Tag}
  Match(Ord('<'));
  ConsumeWhiteSpace;
  if FLookahead.AsString = '/' then begin
    // Finding an end tag
    Match(Ord('/'));
    FLastTag := '/';
    ConsumeWhiteSpace;
    TagName;
  end else begin
    // Finding a start tag, or a tag that doesn't enclose anything
    FLastTag := '';
    ConsumeWhiteSpace;
    TagName;
    ConsumeWhiteSpace;
    AttributeList;
  end;
  Match(Ord('>'));
end; {Tag}
The procedure Document is the first procedure called when parsing a
file, as it corresponds to the start symbol of our grammar. As you can
see, it continues to process the file until the end of document marker
is reached. We will consume whitespace in the parser rather than the lexical
analyser, because in our simple example it is easier to manage. The lookahead
symbol is sufficient for us to be able to determine which procedure to
call next. If we follow the case where it is '<', the Tag procedure
will be called. In this procedure, the '<' symbol is first matched,
so that the next symbol is requested from the lexical analyser. After removing
any whitespace, the lookahead symbol should be either a tag name, or the
start of an end tag. To help us implement the semantic rules, we need to
keep track of the last tag processed, so we will store its value in FLastTag.
The TagName procedure collects the rest of the name, and if we are in a
start tag, the attribute list is processed. The tag must end with a '>'
symbol. The other procedures are built in the same fashion. Some of them
are complicated slightly because they are where our semantic rules are
implemented. For instance, the TagName procedure steps through comment
and META tags ignoring their content. The Data procedure collects the title
of the document if the last tag was the TITLE tag, and having found it,
adds it to the symbol table. The procedure that does most of the work,
though, is AttributeList. It is shown below.
procedure TjimHtmlParser.AttributeList;
var
  FLastAttribute : string;
  FLastValue     : string;
begin {AttributeList}
  while FLookahead.AsString <> '>' do begin
    FLastAttribute := AttributeName;
    ConsumeWhiteSpace;
    if FLookahead.AsString = '=' then begin
      Match(Ord('='));
      ConsumeWhiteSpace;
      FLastValue := Value;
      ConsumeWhiteSpace;
      // Should only get here if FLastAttribute is not an empty string
      if (CompareText(FLastTag,'BASE') = 0) and
         (CompareText('HREF',FLastAttribute) = 0) then begin
        // Special case when found the HREF attribute of a BASE tag
        FSymbolTable.AddSymbol(stBase,FLastValue);
      end else if (CompareText(FLastTag,'IMG') = 0) and
         (CompareText('SRC',FLastAttribute) = 0) then begin
        // Found an image
        FSymbolTable.AddSymbol(stImage,FLastValue);
      end else if ((CompareText(FLastTag,'A') = 0) or
         (CompareText(FLastTag,'AREA') = 0) or
         (CompareText(FLastTag,'LINK') = 0)) and
         (CompareText('HREF',FLastAttribute) = 0) then begin
        // Found an ordinary link
        FSymbolTable.AddSymbol(stLink,FLastValue);
      end else if (CompareText(FLastTag,'FRAME') = 0) and
         (CompareText('SRC',FLastAttribute) = 0) then begin
        // Found an ordinary link
        FSymbolTable.AddSymbol(stLink,FLastValue);
      end;
    end;
  end;
end; {AttributeList}
You can see that the syntactic work is done early. Our grammar states
that an attribute list consists of a list of attributes after the tag name
and before the Ä™>Å‚ symbol. Each attribute has a name, and possibly a value.
In the case that the attribute does have a value, we need to check if it
is one of the attributes we need for our diagram. So for our selected tags,
we check the attribute name, stored in the local variable FLastAttribute,
and if it holds an URL we store it in the symbol table. This is also where
we determine the type of the URL: whether it comes from a BASE tag, an image,
or an ordinary link to another HTML file.
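One helper that appears throughout these procedures is ConsumeWhiteSpace.
Its actual implementation is on the disk; a minimal sketch of the idea,
assuming tokens are single characters as described above, might be:

procedure TjimHtmlParser.ConsumeWhiteSpace;
begin
  // Skip spaces, tabs and line breaks between tokens
  while (FLookahead.TokenType <> ttEndOfDoc) and
        (FLookahead.AsString <> '') and
        (FLookahead.AsString[1] in [' ',#9,#10,#13]) do begin
    FLexAnalyser.GetNextToken(FLookahead);
  end;
end;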
The remainder of the article is printed in the magazine.