Introduction to libbash (GSoC 2011)

In this article, I’d like to introduce the libbash project. I’ll explain what libbash can do with some examples. At the end of this article, a benchmark result is given for egencache (Portage), instruo (Paludis) and instruo reimplemented with our library.

I’ve been planning to write this article for a long time. This project was started last year as a GSoC project proposed by Petteri Räty. Nathan Eloe did great work and succeeded in building an Abstract Syntax Tree (AST) for a given shell script. This year, again as a GSoC student, I’ll work on the runtime part, or to put it simply, on making the library capable of running shell scripts. I’ve been contributing to this project since March 2011 and find it really amazing.

Libbash will enable programs to use Abstract Syntax Trees (ASTs) to parse and interpret shell scripts directly instead of using regular expressions. Most of the bash 3.2 syntax will be supported. This will be a great benefit to programs both outside and inside Gentoo, including Portage/Paludis and repoman.

For instance, take /etc/conf.d/net, which is essentially a shell script. Libbash will tell you which variables and functions it defines and what values the variables will have after the script is interpreted. It also allows you to use compound statements and shell builtins in the script. We plan to support the common bash 3.2 syntax, except for features related to the interactive shell and to executing external processes. Currently it lacks a lot of functionality (of course 🙂), but it is beginning to take shape and can do some real work.

Let me show you how we handle /etc/conf.d/net with the library at hand.

$ ./variable_printer /etc/conf.d/net
auto_eth0=true
auto_eth1=false
auto_myxjtu2=false
auto_ppp0=false
auto_qiaomuf=true
config_eth0=202.xxx.xxx.xxx/24 192.168.14.xxx/24 192.168.4.xxx/24
config_eth1=dhcp
......

The variable_printer is a utility program linked against our library. It prints all the non-local variables defined in /etc/conf.d/net, including arrays. Actually, we can do much more than that: for example, function definitions, variable expansion and command substitution are supported (although their functionality is not complete yet). If you need to analyze bash scripts, this library should be helpful.
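For comparison, here is a rough sketch of how a similar listing could be produced without libbash, by sourcing the file in a bash subshell and comparing variable names before and after. This is not how variable_printer works internally. The helper name list_defined_vars and the sample file contents are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper, not part of libbash: list the variables that sourcing
# a file defines, by comparing variable names before and after sourcing.
list_defined_vars() {
    (   # subshell: nothing defined by the file leaks into the caller
        _before= _after= _name=        # pre-set so they appear in both lists
        _before=$(compgen -v)          # variable names known before sourcing
        source "$1" >/dev/null 2>&1
        _after=$(compgen -v)           # variable names known after sourcing
        for _name in $_after; do
            if ! grep -qxF -- "$_name" <<<"$_before"; then
                # Scalars print in full; for an array only element 0 is shown.
                printf '%s=%s\n' "$_name" "${!_name}"
            fi
        done
    )
}

# Made-up sample standing in for /etc/conf.d/net.
sample=$(mktemp)
printf '%s\n' 'auto_eth0=true' 'config_eth1=dhcp' > "$sample"
list_defined_vars "$sample"
rm -f "$sample"
```

Note that this approach actually executes the file in a real bash, so it needs a bash process for every file it inspects. That is exactly the cost the library is meant to avoid.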

Our goal for this summer is to support Portage metadata generation, so let me give you an example of it.
You may already know what Portage metadata is. It is used to speed up searches and the building of dependency trees. You can find it under $PORTDIR/metadata/cache and regenerate it by executing ‘egencache --update’. We have a utility that generates metadata for a given ebuild:

$ ECLASSDIR=scripts ./metadata_generator scripts/sunpinyin-2.0.3-r1.ebuild
dev-db/sqlite:3 dev-util/pkgconfig foo/bar
dev-db/sqlite:3
0
http://sunpinyin.googlecode.com/files/.tar.gz http://open-gram.googlecode.com/files/dict.utf8.tar.bz2 http://open-gram.googlecode.com/files/lm_sc.t3g.arpa.tar.bz2
http://sunpinyin.googlecode.com
LGPL-2.1 CDDL
SunPinyin is a SLM (Statistical Language Model) based IME
~amd64 ~x86
foo
1
compile install postinst unpack

This ebuild was modified to inherit foo.eclass, which was written for testing purposes. Because some features are still missing, the content is not exactly the same as the one under $PORTDIR, but the format should be the same now. (I removed unnecessary blank lines for better readability.)

You may wonder why we need it when we already have egencache. The problem with egencache and some other Portage utilities is poor performance: forking bash and sourcing eclasses takes a lot of time. With libbash, that overhead can be avoided. Here’s a benchmark of egencache, instruo, instruo reimplemented with libbash, and pmaint:

Environment:
Linux puma 2.6.37-gentoo-r4 #1 SMP Fri May 13 14:44:26 CST 2011 x86_64 Intel(R) Xeon(R) CPU E5405 @ 2.00GHz GenuineIntel GNU/Linux
CFLAGS & CXXFLAGS: -march=core2 -g -O2 -pipe -mtune=generic
CPU Freq governor: performance

$ time egencache --jobs=1 --update --cache-dir=meta_egencache
real        95m49.598s
user        52m8.223s
sys         16m26.867s

$ time INSTRUO_THREADS=1 instruo -D /usr/portage/ -o meta_paludis
real        123m0.811s
user        54m6.507s
sys         39m7.614s

$ time ./instruo -D /usr/portage/ -o meta_libbash 2>error
real        1m24.977s
user        1m18.070s
sys         0m6.555s

$ time pmaint regen /usr/portage 1
real        30m23.433s
user        10m21.820s
sys         6m2.990s

Thanks to ferringb for mentioning pmaint (based on these results, pmaint is the fastest metadata generation tool for now). Thanks to nirbheek and Ford_Prefect for reminding me about the kernel cache. Every command is now run 4 times, and the reported result is the mean running time excluding the first run. egencache and instruo were run single-threaded because our implementation of instruo is single-threaded. Note that /usr/portage/metadata/cache was removed each time before egencache was run.
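The averaging rule described above (four runs, first one dropped) amounts to the small helper below. The timings in the example call are invented numbers in seconds, not values from this benchmark, and mean_without_first is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: mean of all timings except the first one.
# Usage: mean_without_first T1 T2 T3 ...   (values in seconds)
mean_without_first() {
    [ "$#" -ge 2 ] || return 1   # need the discarded run plus at least one more
    shift                        # drop the first run
    printf '%s\n' "$@" |
        LC_ALL=C awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

# Invented sample timings: the first value is discarded, the rest are averaged.
mean_without_first 90.1 85.0 84.0 86.0
```

With these sample values the helper averages 85.0, 84.0 and 86.0 and prints 85.000.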

I thought egencache would be slower since it generates two different metadata formats, but it turns out that writing metadata is not the bottleneck. The kernel cache has little impact on metadata generation, as there’s little time difference between the first and second runs for all three commands.
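The cost that does matter, starting bash processes, is easy to feel on a small scale. The sketch below is only an illustration and not the benchmark code: eval_forked and eval_inprocess are made-up names, and the one-line script stands in for real ebuild evaluation.

```shell
#!/bin/bash
# Evaluate a one-line "script" COUNT times, starting a new bash each time.
# Prints how many evaluations produced the expected value.
eval_forked() {
    local count=$1 i ok=0
    for ((i = 0; i < count; i++)); do
        [ "$(bash -c 'SLOT=0; echo "$SLOT"')" = 0 ] && ok=$((ok + 1))
    done
    echo "$ok"
}

# The same work done inside the current shell, without any new process.
eval_inprocess() {
    local count=$1 i ok=0 SLOT
    for ((i = 0; i < count; i++)); do
        SLOT=0
        [ "$SLOT" = 0 ] && ok=$((ok + 1))
    done
    echo "$ok"
}

# "time" reports to stderr; compare the two "real" values.
time eval_forked 200
time eval_inprocess 200
```

Both functions do the same trivial assignment, so any difference between the two timings comes from creating and initializing the extra bash processes.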

Although our time will grow as we implement more features (we currently ignore the statements that we can’t handle), the result looks good. Our implementation of instruo doesn’t cheat: we just embedded our code that reads variable values from an ebuild into the original instruo implementation.
The main reason for the performance gain is that we don’t have to fork a huge number of bash processes. In addition, the ASTs of eclasses are cached while generating metadata. Without the AST cache, we need 30 minutes to generate the ebuild metadata.
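The idea behind the AST cache can be sketched as plain memoization: parse each eclass once and reuse the result for every ebuild that inherits it. The code below is only a conceptual sketch in bash (version 4 or later, for associative arrays), not libbash’s actual C++ implementation. AST_CACHE, parse_eclass and get_eclass_ast are hypothetical names, and the "AST" is just a placeholder string.

```shell
#!/bin/bash
# Conceptual sketch only; needs bash >= 4 for associative arrays.
declare -A AST_CACHE=()   # eclass path -> parsed result (placeholder string)
PARSE_COUNT=0             # how many times the expensive step really ran

# Stand-in for the expensive step; a real parser would build a tree here.
parse_eclass() {
    PARSE_COUNT=$((PARSE_COUNT + 1))
    PARSED="ast-of:$1"
}

# Look in the cache first and parse only on a miss. Result is left in $PARSED.
get_eclass_ast() {
    local path=$1
    if [[ -n ${AST_CACHE[$path]+set} ]]; then
        PARSED=${AST_CACHE[$path]}      # cache hit: reuse the stored result
    else
        parse_eclass "$path"            # cache miss: parse and remember
        AST_CACHE[$path]=$PARSED
    fi
}

# Three made-up ebuilds inheriting the same eclass trigger a single parse.
for ebuild in one.ebuild two.ebuild three.ebuild; do
    get_eclass_ast eutils.eclass
done
echo "parsed $PARSE_COUNT time(s) for 3 ebuilds"
```

Here the counter ends at 1 even though the eclass is requested three times. Since many ebuilds inherit the same eclasses, this kind of reuse avoids a lot of repeated parsing.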

git repository: http://git.overlays.gentoo.org/gitweb/?p=proj/libbash.git;a=summary.

  1. #1 by Pavel on May 6, 2011 - 9:10 am

    wow, the performance difference is quite something! at the end of your GSOC, do you think portage & friends could already take advantage of libbash?

    • #2 by qiaomuf on May 6, 2011 - 10:12 am

      That’s our goal. I think Portage could make use of libbash to correctly generate metadata at the end of the summer.

  2. #3 by Johan on May 6, 2011 - 4:50 pm

    I saw that you chose boost for some solutions; might I ask why you chose that huge dependency over something smaller? Imo, it will decrease portability and general acceptance.

    • #4 by qiaomuf on May 6, 2011 - 4:56 pm

      Without boost, I guess we can’t achieve our goal by the end of the summer. Some people also asked why we don’t use C instead of C++. The answer is the same: we don’t have enough time.
      Because the boost libraries we use are all templates, we only have a compile-time dependency on it, not a runtime dependency.
      Thanks for your interest.

      • #5 by Johan on May 6, 2011 - 5:06 pm

        Thanks for your quick response.

        I understand your goal and think your decision is correct. With a good API it’s surely feasible to port to simpler structures further down the road.

  3. #6 by on May 12, 2011 - 11:58 pm

    Try running “time egencache --update” with BASH manually compiled without Unicode support, and please report the results if you do so. It should take much less time than BASH with Unicode, but it won’t be as fast as libbash because of the still present performance hits from fork(), from BASH process initialization, etc.

    Also, when reporting performance results, it is good to know what features were used (e.g: Unicode support in BASH, GCC compiler optimization level).

    • #7 by qiaomuf on May 13, 2011 - 8:51 am

      Thank you for the reminder. I’ll add this info.

  4. #8 by ferringb on May 15, 2011 - 5:52 pm

    Your test there needs some tweaks for portage- specifically locking it to just one repo

    Might be worth taking a look at ‘pmaint regen 1’ also, although you’ll need to wipe metadata/cache and /var/cache/edb/dep/

    egencache --jobs=1 …
    real 79m49.864s
    user 47m2.821s
    sys 15m28.378s

    pmaint regen repo-path 1 …

    real 22m2.510s
    user 11m20.012s
    sys 5m41.118s

    Main reason it’s worth looking at it is pmaint specifically avoids forking, and does some other tricks; the gap in user and real basically comes down to inefficiencies in bash’s read function implementation (stupid bugger goes byte by byte).

    I presume that your libbash efforts will be thread safe btw? 😉

    • #9 by qiaomuf on May 15, 2011 - 6:50 pm

      Thank you for the info, I’ll try to add a test for pmaint. The library will be thread safe but it’s not for now. We will fix that soon.

    • #10 by qiaomuf on May 15, 2011 - 9:01 pm

      test for pmaint is added

  5. #11 by ant on May 19, 2011 - 4:30 am

    Wow. This is impressive, especially that speed difference. I was thinking of making a package manager just for fun, now it seems a lot more possible 🙂

  6. #12 by Waldi Syafei on September 10, 2011 - 1:38 am

    Great.. So impressive.. Just wanna try it soon..

  7. #13 by Whatwat on March 12, 2012 - 1:53 am

    Thanks for the info ferringb. Looking forward 4 the next article men.

  1. Home page and updated benchmark for libbash « Mu Qiao's English Blog
  2. libbash runtime weekly report #1 « Mu Qiao's English Blog
