Homework 10 - Prolog Project

Due: TBD (11:59 PM Friday, Dec 14, but note that HW11 is due before this)

Let's have the following late policy on this one: 5% off if submitted before noon on Dec 15, 10% off if before 11:59 PM on Dec 16, and 20% off after that (up to 11:59 PM on Dec 21).

WARNING: This is by far the longest assignment of the semester, and not one you want to procrastinate on!

Since this homework is a project, it is weighted significantly more heavily than the other homeworks.

Objectives:

This program serves to put together everything you have learned about Prolog and programming languages. In particular, it has the following objectives:

Learn more about some common data structures used in compilers and software engineering. In particular, this includes abstract syntax trees (ASTs) and control flow graphs (CFGs).
Get familiar with a few common tools and techniques in compilers and software engineering: the applications here are widely used in these areas, and of course that is the topic of this course.
Learn a little about XML: As you might know, Extensible Markup Language (XML) is emerging as a standard for interchanging information in various diverse domains. It is useful for you to learn the basics of XML structure, as you might want to consider using XML for future projects. There isn't much to learn though, as its structure is simple.
Learn how to process XML data using Prolog: Prolog has found a niche in processing XML (and HTML) data, including numerous web applications such as the semantic web.
Learn how to use APIs: You may already know this well through your prior programming experience, but you will learn it now if not. Learning how to read documentation to reuse existing code is of immense value towards your computer science career.
Learn how to understand and attack a medium-sized problem: This is probably the single most important thing to learn during all your first year courses, in preparation for research. The task probably looks daunting at first, but will become easy if you think clearly. The hard part is understanding the problem, not coding it up; indeed, any attempt to start coding first will probably be wasted.

To do these, you will be doing some analyses of programs written in Louden's language. More precisely, you will be inputting AST/CFGs and analyzing them. Please note that the length of this assignment is not a reflection of the program's difficulty - the program is fairly simple to implement if you take a methodical step-by-step approach, though it can be overwhelming if you try to do it without understanding the project first. Of course, all real applications are that way.

For this program, you are allowed to use any predefined predicates that you want; however, I caution you that you can probably write any desired predicate in much less time than needed to find it in the manual. You should also minimize the use of extralogical predicates, as overuse leads to distress.

Problem:

Given a program written in Louden's language, we will be generating an AST/CFG from it, and then analyzing the AST/CFG to do some typical software engineering tasks. The specific analysis tasks that you will do are listed below.

Definitions:

We will use the following definitions throughout this assignment: A variable x is defined at a node N if x is assigned to by N (i.e., N is an assignment and x is on the left-hand side of N). A variable x is used at a node N if N is an assignment and x is in the right-hand side of N, or N is a selection or iteration and x appears in the condition of N.
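
The two definitions above can be sketched in a few lines of Prolog. This is only an illustration under stated assumptions: the statement terms assign/2, if/1 and while/1, and the predicates defines/2, uses/2 and expr_var/2, are hypothetical names invented for this sketch and are NOT the format produced by the parser (which is XML). Expressions are assumed to be ordinary Prolog terms whose leaves are atoms (variables) or numbers.

```prolog
% Hypothetical statement terms (NOT the parser's XML output):
%   assign(Var, Expr), if(Cond), while(Cond)

% defines(+Node, ?Var): Var is on the left-hand side of an assignment.
defines(assign(Var, _), Var).

% uses(+Node, ?Var): Var occurs in the right-hand side of an assignment,
% or in the condition of a selection or iteration.
uses(assign(_, Expr), Var) :- expr_var(Expr, Var).
uses(if(Cond), Var)        :- expr_var(Cond, Var).
uses(while(Cond), Var)     :- expr_var(Cond, Var).

% expr_var(+Expr, ?Var): Var is an atom occurring somewhere in Expr.
expr_var(Expr, Expr) :- atom(Expr).
expr_var(Expr, Var) :-
    compound(Expr),
    arg(_, Expr, Sub),        % enumerate the arguments of Expr
    expr_var(Sub, Var).
```

With these clauses, the node assign(x, 3*x+x) both defines x and uses x, while assign(x, 3*y) defines x but does not use it.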

A control-flow graph (CFG) consists of vertices for each statement in the program and edges for each possible flow of execution in the program.
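
As a rough illustration of this definition, a CFG can be written down as edge facts and explored with a cycle-safe reachability predicate. Everything in this sketch is an assumption made for illustration: the node names n0 to n3, the edge/2 facts, and the predicates reachable/2 and walk/3 are hypothetical and do not reflect how the parser encodes edges.

```prolog
% Hypothetical CFG for a program shaped like:
%   n0: an assignment
%   n1: a loop condition
%   n2: the loop body (a single assignment)
%   n3: the statement after the loop
edge(n0, n1).   % assignment -> loop condition
edge(n1, n2).   % condition holds -> loop body
edge(n2, n1).   % end of body -> back to the condition
edge(n1, n3).   % condition fails -> statement after the loop

% reachable(+From, ?To): To can be reached from From by following
% one or more edges.
reachable(From, To) :- walk(From, [From], To).

% walk(+Node, +Seen, ?To): Seen lists the nodes already expanded, so the
% search terminates even though the graph contains a cycle.
walk(Node, _, Next) :- edge(Node, Next).
walk(Node, Seen, To) :-
    edge(Node, Next),
    \+ memberchk(Next, Seen),
    walk(Next, [Next|Seen], To).
```

Here reachable(n0, n3) succeeds, reachable(n2, n2) succeeds because of the loop's back edge, and reachable(n3, n0) fails. The same answer may be found more than once on backtracking, which is harmless for a yes/no question.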

Since there are several programs we are talking about in the assignment, we will refer to the program (in Louden's language) being analyzed as the target program, and the [Prolog] program you are writing as the analyzer program.

Attacking the Problem:

I've outlined how to attack the problem here. You are required to use the XML AST produced by the parser discussed below, as one of the major points of this program is to learn how to handle such data structures.

0. Tokenize the input program:

As you know, the first stage of compilation is lexical analysis (i.e., tokenization). For the purposes of this project, your brain will be the tokenizer, and you will input a target program as a list of tokens.

1. Parse the Input Program (into an AST):

Naturally, we will use Louden's language for target programs. I've implemented a parser for his language here, so there is very little for you to do in this step. To use it, store the parser in a file named parse.pl (you may need to use a different file extension on your system), and put ":- [parse]." (without the quotes) at the beginning of your analyzer program file. This directive loads parse.pl whenever your analyzer program is loaded. You may then use parseLouden/3, defined there, to write an AST/CFG to an XML file.

You should read the parser documentation in the file first, and then run the parser on some small target program to get familiar with how its output looks. Note that there are examples at the end of the file, but you will need different examples for your test cases. You DO NOT need to understand the parser code - it uses definite clause grammars, which can be thought of as notational conveniences for Prolog rules.

If you get an error message referring to a missing library or package predicate, you need to compile the appropriate packages/libraries on your installation. I am told that the Windows and Mac versions include all libraries by default, but that SWI's Linux versions do not.

2. Generate a CFG

The next step after parsing is CFG generation. This step is also trivial, since my parser also generates CFG edges, resulting in an integrated AST/CFG. You should write a small target program and figure out what its CFG looks like (you will want to draw the graphs on paper). Make sure your program has nested loops and/or selection statements.

3. Input the AST+CFG

This step is trivial, as SWI-Prolog includes predicates for reading/writing XML files - see the SGML/XML package documentation (from the SWI homepage, click on "Manual" and follow the links). I suggest looking at load_xml_file/2. At this point, you will have the XML file contents as one huge Prolog structure.
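
To get a feel for the "huge Prolog structure", here is a minimal sketch that parses a tiny XML fragment held in a string rather than in a file. It assumes a reasonably recent SWI-Prolog (version 7 or later, for open_string/2) and uses load_structure/3 from the same SGML/XML package. The element and attribute names (stmt, var, id, kind, name) are made up for this example and are NOT the names used by the parser; dom_from_string/2 and element_attr/4 are hypothetical helpers.

```prolog
:- use_module(library(sgml)).
:- use_module(library(lists)).

% dom_from_string(+XmlText, -DOM): parse XML text held in memory.
% DOM is a list of element(Name, Attributes, Content) terms, where
% Attributes is a list of Name=Value pairs.
dom_from_string(XmlText, DOM) :-
    setup_call_cleanup(
        open_string(XmlText, In),
        load_structure(stream(In), DOM, [dialect(xml), space(remove)]),
        close(In)).

% element_attr(+DOM, ?Name, ?Attr, ?Value): some element called Name,
% anywhere in DOM, carries the attribute Attr=Value.
element_attr(DOM, Name, Attr, Value) :-
    member(Node, DOM),
    element_attr_(Node, Name, Attr, Value).

element_attr_(element(Name, Attrs, _), Name, Attr, Value) :-
    member(Attr=Value, Attrs).
element_attr_(element(_, _, Children), Name, Attr, Value) :-
    element_attr(Children, Name, Attr, Value).
```

For example, the following query looks up the id attribute of a stmt element in a made-up fragment:

```prolog
?- dom_from_string("<stmt id=\"node17\" kind=\"assign\"><var name=\"x\"/></stmt>", DOM),
   element_attr(DOM, stmt, id, Id).
```

Printing DOM at the toplevel for a file written by the parser is probably the quickest way to see which element and attribute names it really uses.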

4. Perform analysis tasks

Write predicates for each of the analysis tasks listed below. You are required to use the AST+CFG generated by the parser.

5. Test your program

You should test your analyzer program completely. Make sure you supply test cases that are complete enough to convince the grader that your analyzer works. If you haven't tested major cases, we assume that your analyzer doesn't work.

Write a runtest/0 predicate that runs all your test cases. If your program structure isn't compatible with such an approach, you may also CLEARLY indicate how to run your program through all your test cases. Of course, we will try additional test cases.
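
One possible shape for runtest/0 is a table of named test goals plus a driver that runs each one and reports the outcome. This is just a sketch: test/2 and run_one/2 are hypothetical names, and the two goals in the table are placeholders that call built-in predicates; real entries would call your analysis predicates instead.

```prolog
:- use_module(library(lists)).

% test(?Name, ?Goal): table of test cases.
% The goals below are placeholders only.
test(atom_to_int, atom_number('4', 4)).
test(list_index,  nth0(1, [x, ':=', y], ':=')).

% run_one(+Name, +Goal): run one test and print PASS or FAIL.
% A goal that throws an exception is counted as a failure.
run_one(Name, Goal) :-
    (   catch(Goal, _, fail)
    ->  format("PASS ~w~n", [Name])
    ;   format("FAIL ~w~n", [Name])
    ).

% runtest: run every test in the table; always succeeds.
runtest :-
    forall(test(Name, Goal), run_one(Name, Goal)).
```

Keeping the tests in a table means adding a case is a one-line change, and a failing or crashing test does not stop the remaining ones from running.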

Analysis Tasks:

For all of the following, the arguments are as follows:

Prog: a program, represented as a list of tokens (Ex: [x, ':=', '3', '*', x, '+', x, ';', x, ':=', y]).
Var: the name of a variable in the target program (Ex: x).
Node*: the index of a token in Prog (0 is the first node). Depending on which library predicates/options you use, it might be an integer or atom (e.g., 4 or '4'). You can write your code either way, but note that '4' would be printed to screen as 4 (so you want to first make sure which way your choices work). Also note that this is not the same as the node<n> atoms generated by the parser (read the parser documentation).
Nodes: a set of nodes
*Vars: a set of variable names (Ex: [x,y])
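
To make the 0-based token indexing concrete, here is a small sketch using nth0/3 from library(lists) on the example token list above. The predicate names example_prog/1, token_at/3 and occurrences/3 are hypothetical helpers for this illustration, not part of the required interface.

```prolog
:- use_module(library(lists)).

% The example token list from the description of Prog.
example_prog([x, ':=', '3', '*', x, '+', x, ';', x, ':=', y]).

% token_at(+Prog, ?Index, ?Token): Token sits at 0-based Index in Prog.
token_at(Prog, Index, Token) :-
    nth0(Index, Prog, Token).

% occurrences(+Prog, +Token, -Indices): all 0-based positions of Token.
occurrences(Prog, Token, Indices) :-
    findall(I, nth0(I, Prog, Token), Indices).
```

For the example list, token_at(P, 7, T) binds T to ';', and occurrences(P, x, Is) binds Is to [0,4,6,8]. Note that '3' is an atom here, so asking for the integer 3 finds nothing.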

varStats(+Prog,+Var,?NumDef,?NumUsed) succeeds if NumDef/NumUsed are the number of times Var is defined/used in the program. However, we wish to distinguish defs/uses that are inside an iteration from those outside, so NumDef is the structure def(NumInIter,NumOutIter), and similarly for NumUsed and used(NumInIter,NumOutIter).
selClose(+Prog,+NodeIf,?NodeFi) succeeds if NodeIf is an 'if' node and NodeFi is the index of the matching 'fi'.
badLoop(+Prog,+Node) succeeds if Node is an iteration node and the loop body does not define any of the variables used in the loop condition.
uninitVar(+Prog,+Var,-Node,-Impossible) succeeds if Node uses Var and it is possible that Var has not been defined before Node. Impossible is bound to true if it is impossible that Var has been defined before Node, and false otherwise.
Determine a useful task of your own choice, and code it. Make sure you document exactly what the task is (in comments). Your grade is based on 1) how creative and useful your task is, 2) how difficult it is to implement, and 3) correctness. Thus, you may wish to 1) argue for why your task is useful, and 2) design a task that requires both AST and CFG information.

Many of the above tasks are useful and often done by compilers and/or software engineering tools, and there are thus various compiler techniques for doing similar tasks efficiently (e.g., dataflow analysis). Of course, that's not the point here and you don't need to learn those techniques for this project (indeed, you will probably not finish if you spend time on that, though I'd be glad to talk to you later about related research in the software engineering world).

Submission

The programs should be submitted electronically to the grader, and cc'd to me. Your main program should be named <firstname>_<surname>_hw10; other filenames should start with your two initials, and your main program should automatically load them when it is loaded.

Submit a program including runtest/0.

Hints/Clarifications/Corrections

To repeat what the assignment says, you will not be writing any significant code until step 4. If you are, then stop and read the assignment again.
As long as you can process small (<200 tokens) programs in a few seconds, don't worry about efficiency.
Don't forget that Prolog's only data structure is the structure, with lists being a special case. Similarly, 'id=node17' is also just shorthand for the structure '=(id,node17)'. Thus, you can unify against this: e.g., if you unified 'id=X' against this, X would be bound to node17.
This assignment is about static analysis, and you don't need to worry about data values, which are in general dynamic. For example, in "x:=0-x; while x do ...", the while loop is never entered but that relies on data values (so, you don't need to worry about that). Dynamic analysis is a much harder (and undecidable) problem of course. Most compilers eliminate constant expressions using constant folding/propagation, but the assignment does not ask you to do that.
The "test cases" at the bottom of the parser are test cases I used for my parser. They have nothing to do with the test cases that you should come up with for your program.
IMPORTANT: The parser produces a file (e.g., ast4.xml). However, the line in the code that actually prints it out was commented out in my original post, which makes the directions in step 3 mysterious. This line is not commented out in the currently posted version of the parser, so that is correct. Sorry for any confusion this may have caused.
It's OK to use any built-in predicates you want (though you're likely to waste more time finding something than just coding it). However, it's not OK to 'cheat' by using other languages/libraries. Examples of this are using XPath, or the external language interface to an imperative language.
I can't vouch for this, but a student told me that different versions of SWI treat the atom vs. integer issue (see my definition of Node* above) differently. Nobody has ever told me of such a problem in the past though, so it's probably because of different default options in the XML predicates.
If you need to convert between [numeric] atoms and integers, atom_number/2 will do the job. This might be relevant due to the issues I mentioned in defining Node* above. Note that Section 4.21 of the SWI reference manual has various other similar conversion predicates.
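
Two of the hints above (unifying against 'id=node17' and converting with atom_number/2) can be sketched as follows. The attribute list and the helper names node_id/2 and node_index/2 are assumptions made for this illustration; the attributes in the parser's output may differ.

```prolog
% node_id(+Attrs, ?Id): pick the value of the id attribute out of an
% attribute list such as [kind=assign, id=node17] (made-up example).
% Name=Value is just the structure =(Name, Value), so plain unification
% does the work.
node_id(Attrs, Id) :-
    memberchk(id=Id, Attrs).

% node_index(+Value, -Index): accept either an integer or a numeric
% atom and return the integer.
node_index(Value, Value) :-
    integer(Value), !.
node_index(Value, Index) :-
    atom(Value),
    atom_number(Value, Index).
```

With these, node_id([kind=assign, id=node17], X) binds X to node17, and both node_index(4, N) and node_index('4', N) bind N to the integer 4, so the rest of the analyzer does not need to care which form it was given.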