libbash weekly report #9 (mid-term summary)

It’s been four months since I started working on this project. I just passed the mid-term evaluation so I’d like to talk about the current status of libbash.

My plan for the mid-term evaluation is to be able to parse the ebuilds that don’t inherit any eclass. With a few exceptions, I achieved that more than a month ago. If you read my weekly reports, you might have noticed that we could parse about 7000+ ebuilds at some point. That’s when I achieved the goal.

If you want to see what libbash can handle so far, there are several test scripts where you can find the answer.

But why are there some exceptions? The answer is due to the limitation of parser grammar. The parser grammar was supposed to be working at the end of GSoC 2010. However we found there were still many issues in the grammar while implementing the runtime of libbash. The parser grammar is quite important and I can’t work on the runtime if the parser grammar does not work properly.

Then what are you doing these days? I spent most of the time fixing the problems in the parser grammar. In my GSoC application, I said if I had enough time, I would try to disable global backtracking for the parser grammar. But now this seems to be required. I started working on it from last week and it goes pretty well so far.

Some people doubt if I could completely disable backtracking. I’m not a compiler expert and I don’t have the confidence to say if bash is LL(k). As far as I see, I can handle most of the syntax with LL(k) parser. But what I am really trying to do is to remove the global backtracking option of ANTLR grammar. With that option, ANTLR will automatically backtrack so there might be many places where you could have avoided backtracking. It also makes the problems in the parser grammar even more complicated. It’s highly recommended not to use that option in production code. As a result, I’m trying to use left factoring, syntactic predicate and local backtracking option instead of the global one. So far the parser grammar is faster and it’s much easier to fix the problems in it. I’ll give the performance comparison in the next weekly report.

What I have done in the last week:

  • worked on a new parser grammar without the global backtracking option

Plan for this week:

  • finish the work with the new parser grammar and improve unit tests
  • incorporate the new parser grammar
  • fully support here document
  • improve the rules for parameter expansion
  • improve the rules for built-in and keyword test
  • fix the problem with \newline
  1. #1 by Betelgeuse on July 18, 2011 - 9:30 pm

    ANTLR does LL(*) so bash doesn’t have to be LL(k)

  2. #2 by on August 1, 2011 - 1:57 am

    You wrote: “I’m not a compiler expert and I don’t have the confidence to say if bash is LL(k). As far as I see, I can handle most of the syntax with LL(k) parser.”

    A file containing Bash source code is a sequence of characters. This implies that – on a certain level – there surely exists an LL(k) grammar which will accept the source code. A given LL(k) parser will recognize some features of the Bash language and will fail to recognize some (other) features of the Bash language when parsing a Bash file. The unrecognized parts are then resolved (by a Turing machine) later during compilation or during run-time if it is too hard/inconvenient to resolve them at compile-time. You just need to save the unrecognized parts into internal data structures for later processing, so that the compiler or interpreter can do the rest of the recognition process.

    Most programming languages in use today are neither LL(k) nor LR(k). One reason for this is that the parser is unable to recognize whether an identifier points to a function, variable, type definition, etc. It is clearly impossible for LL(k)/LR(k) to decide what the meaning of an identifier is.

    The usual tactic is to parse the source code by an LL(k) or LR(k) grammar, and do the necessary checks and resolutions of all the remaining ambiguities later in the compilation process (or at run-time) by a Turing machine. A Turing machine is simply a Java/C/C++/etc program.

    I hope this was helpful.

    • #3 by qiaomuf on August 1, 2011 - 10:36 am

      Hi, I get clearer at some concepts after reading it. Currently we are doing exactly the same way as you explained. We use an LL(k) grammar to parse bash scripts and resolve some difficult parts at runtime. I think I should read my compiler book again at some time to get better understanding of many things I’m dealing with. Thank you.

  3. #4 by Confidentiality Agreement on January 2, 2012 - 8:10 am

    Thank you, it is an awesome information regarding expressing together with generating people recognize concerning the exercises which usually might be receiving performed.My partner and i think I can find out a good deal far more useful information the following, all the best.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: