The best kittens, technology, and video games blog in the world.

Wednesday, April 25, 2007

Unit testing for RLisp

hoshi loves rawhide by paul+photos=moody from flickr (CC-NC)

I've been coding RLisp pretty much continuously since about one week before RuPy 2007. Even a temporary switch to different hardware and a different revision control system did not make me stop.

Turning RLisp into something more widely usable seems like a huge task. Every new language is expected to come with a full CPAN out of the box. I don't know if RLisp will ever get there, but I should at least cover the few most important targets, like unit testing and web development frameworks. Of course I'm not going to code them on my own. Like every good programmer I'm way too lazy and impatient for that, and adding nice interfaces to existing Ruby stuff achieves far more in a lot less time and effort.

At first I implemented all tests in plain Ruby with a few helper methods.

class Test_RLisp_stdlib < Test::Unit::TestCase
  def setup
    @rlisp = RLispCompiler.new
    @rlisp.run_file("stdlib.rl")
  end
  def run_rlisp(code)
    @rlisp.run(RLispGrammar.new(code).get_expr)
  end
  def assert_runs(code, expected)
    assert_equal(expected, run_rlisp(code))
  end
  def test_double
    run_rlisp("(let double (fn (x) (* x 2)))")
    assert_runs("(double 5)", 10)
    assert_runs("(double 11.0)", 22.0)
  end
  ...
end


Testing plain functions worked well enough, but testing macros that way turned out to be much more difficult, and I wanted to code something in RLisp anyway.

A functional prototype took no time and looked fairly good:

(ruby-require "test/unit")

(let Test_testing_framework [Class new Test::Unit::TestCase])
(class Test_testing_framework
  (method test_assertions ()
    [self assert_equal 9 (+ 2 7)]
    [self assert_equal 2 (- 6 4)]
    [self assert_in_delta 1.99 2.02 0.05 "Deviation is too high"]
    [self assert_not_equal () nil]
    [self assert_same #f false]
    [self assert_same #t true]
  )
)


In Ruby, messages to self have nice syntax, and they're often used for designing DSLs; a unit testing language is a DSL too. Lisps are also well known for their DSLs, except that they build them with macros.
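The same trick works in plain Ruby without macros. Here's a minimal, hypothetical sketch (MiniTestCase and all its methods are made up for illustration, not part of Test::Unit): running the test body with instance_eval makes self the implicit receiver, so bare assert_equal calls become messages to self.

```ruby
# A tiny DSL where bare method calls are messages to self,
# in the same spirit as Test::Unit's assert_* helpers.
class MiniTestCase
  attr_reader :failures

  def initialize
    @failures = 0
  end

  # Assertions are plain instance methods, so inside a test body
  # "assert_equal 4, 2 + 2" is really "self.assert_equal 4, 2 + 2".
  def assert_equal(expected, actual)
    @failures += 1 unless expected == actual
  end

  def run(&test_body)
    instance_eval(&test_body)  # evaluate the block with self as receiver
    self
  end
end

t = MiniTestCase.new.run do
  assert_equal 4, 2 + 2
  assert_equal 10, 5 * 2
end
```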

That's how the first RLisp DSL began. Syntactic sugar for test suites (classes) and tests (methods) was straightforward.

(defmacro test (name . body)
  (let test_name [(+ "test_" [name to_s]) to_sym])
  `(method ,test_name () ,@body)
)

(defmacro test-suite (name . body)
  (let class_name [(+ "Test_" [name to_s]) to_sym])
  `(do
    (let ,class_name [Class new Test::Unit::TestCase])
    (class ,class_name
      ,@body
    )
  )
)


It was less obvious what to do with assertions. There are too many assert_* methods, and creating one globally visible macro for each didn't sound like a particularly elegant idea. Making the macros active only within a test suite would be nicer, but I wasn't really convinced it was the best way; besides, I'd never done such a thing before, it seemed complicated, and writing nontrivial macros with just defmacro and quasiquotes is very hard.

But I wasn't limited to defmacro and quasiquotes any more - the RLisp pattern matching macro (itself implemented with way too many quasiquotes) was just sitting there waiting for some action.

After some syntactic experiments I came up with this syntax:

(require "rlunit.rl")

(test-suite RLUnit
  (test assertions
    (assert 9 == (+ 2 7))
    (assert 2 == (- 6 4))
    (assert 2 == (- 6 4) msg: "Subtraction should work")
    (assert 96 == (* 8 4 3))
    (assert nil? nil)
    (assert true)
    (assert (> 3 2))
    (assert msg: "assert_block failed" block: (true))
    (assert (not (== 3 7)))
    (assert 1.99 == 2.02 delta: 0.05 msg: "Deviation is too high")
    (assert '() instance_of? Array)
    (assert '(foo bar) instance_of? Array)
    (assert '(foo bar) kind_of? Enumerable)
    (assert () != nil)
    (assert #f same: false)
    (assert #t same: true)
  )
)


I could have converted everything to assert_block, but it seemed more elegant to use specific assertions, mostly to reuse their nice failure messages.

(defmacro assert args
  (match args

    ('nil? a)                  `[self assert_nil ,a]
    ('nil? a 'msg: msg)        `[self assert_nil ,a ,msg]

    (a '=~ b)                  `[self assert_match ,b ,a]
    (a '=~ b 'msg: msg)        `[self assert_match ,b ,a ,msg]

    (a '!~ b)                  `[self assert_no_match ,b ,a]
    (a '!~ b 'msg: msg)        `[self assert_no_match ,b ,a ,msg]

    (a '!= b)                  `[self assert_not_equal ,a ,b]
    (a '!= b 'msg: msg)        `[self assert_not_equal ,a ,b ,msg]

    (a 'same: b)               `[self assert_same ,a ,b]
    (a 'same: b 'msg: msg)     `[self assert_same ,a ,b ,msg]

    (a 'not_same: b)           `[self assert_not_same ,a ,b]
    (a 'not_same: b 'msg: msg) `[self assert_not_same ,a ,b ,msg]

    (a 'kind_of? b)            `[self assert_kind_of ,b ,a]
    (a 'kind_of? b 'msg: msg)  `[self assert_kind_of ,b ,a ,msg]

    (a 'instance_of? b)        `[self assert_instance_of ,b ,a]
    (a 'instance_of? b 'msg: msg) `[self assert_instance_of ,b ,a ,msg]

    (a b)                      `[self assert_equal ,a ,b]

    (a '== b 'delta: c)        `[self assert_in_delta ,a ,b ,c]
    (a '== b 'delta: c
       'msg: msg)              `[self assert_in_delta ,a ,b ,c ,msg]

    (a '== b)                  `[self assert_equal ,a ,b]
    (a '== b 'msg: msg)        `[self assert_equal ,a ,b ,msg]

    ('block: (blk))            `[self assert_block & (fn () ,blk)]
    ('block: (blk) 'msg: msg)  `[self assert_block ,msg & (fn () ,blk)]
    ('msg: msg 'block: (blk))  `[self assert_block ,msg & (fn () ,blk)]

    (a)                        `[self assert ,a]
    (a 'msg: msg)              `[self assert ,a ,msg]

    (raise SyntaxError [(cons 'assert args) inspect_lisp])
  )
)


At first it was just a few lines, but as pattern matching is so simple I just kept adding more assertions, and then even Smalltalk-style pretty optional arguments. I'm not convinced whether raising an exception during macro expansion is a good idea or not. Exceptions in RLisp are much less useful than in Ruby - the RLisp compiler compiles RLisp to Ruby and then runs it. Unfortunately backtraces telling you that an exception came from an (eval) statement aren't very useful. I patched the compiler so that at least the file names are preserved. It would be far better to have line numbers and function names, but Lisp code is usually heavily processed by macros, so most of the information is already lost before the compilation even begins. I still think it should be possible to get something that works reasonably well most of the time.
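The file-name preservation trick is plain Ruby: eval accepts optional file-name and line-number arguments that show up in backtraces instead of "(eval)". A small sketch of the mechanism (the rlisp_boom name is made up, and "stdlib.rl" here is just a label, not a real file):

```ruby
# eval's third and fourth arguments set the file name and starting
# line number that Ruby reports in backtraces for this code.
code = <<RUBY
def rlisp_boom
  raise "compiled RLisp raised"
end
RUBY

eval(code, binding, "stdlib.rl", 1)

begin
  rlisp_boom
rescue RuntimeError => e
  frame = e.backtrace.first
  puts frame  # e.g. "stdlib.rl:2:in `rlisp_boom'" rather than "(eval):2"
end
```

Without the extra arguments the first frame would only say "(eval)", which is exactly the useless backtrace described above.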

The internals have been heavily refactored. For example it's possible to freely mix Ruby and RLisp tests, and RLispCompiler will handle .rlc files and so on by itself. Brad Ediger, in his RLisp plugin for Ruby on Rails, simply made Ruby's require statement handle RLisp files transparently. Doing so would probably require a global instance of RLispCompiler, and I was trying to make it possible to have many independent instances of RLisp running concurrently. Maybe it doesn't make that much sense - after all they will all share a single Ruby environment anyway. Right now running RLisp code takes only a bit more typing than a straight require:

require 'test_parser'
require 'test_rlvm'
require 'test_rlisp'
require 'test_run'
rlisp = RLispCompiler.new
rlisp.run_file 'stdlib.rl'
rlisp.run_file 'test_macros.rl'
rlisp.run_file 'tests/unit_test_test.rl'
rlisp.run_file 'tests/test_rlunit.rl'


$ ./test_all.rb
Loaded suite ./test_all
Started
..............................................................................................
Finished in 13.523565 seconds.

94 tests, 308 assertions, 0 failures, 0 errors


Half of the tests are implemented in Ruby, half in RLisp. Here's a sample:

(defmacro assert-macroexpands (source expected-expansion)
  `(assert
    ',expected-expansion
    ==
    (macroexpand-rec ',source)
  )
)

(test-suite OO_Macros
  (test class
    (assert-macroexpands
      (class Foo code code2 code3)
      (send Foo 'instance_eval & (fn ()
        code code2 code3
      ))
    )
    (assert-macroexpands
      (class Foo)
      (send Foo 'instance_eval & (fn ()))
    )
  )
  (test method
    (assert-macroexpands
      (method foo (args) body)
      (send self 'define_method 'foo & (fn (args) body))
    )
  )
  ; Should they autoquote?
  (test attributes
    (assert-macroexpands
      (attr_reader 'foo)
      (send self 'attr_reader 'foo)
    )
    (assert-macroexpands
      (attr_writer 'foo)
      (send self 'attr_writer 'foo)
    )
    (assert-macroexpands
      (attr_accessor 'foo)
      (send self 'attr_accessor 'foo)
    )
  )
)


There are plenty of other changes. Capitalized names, with or without ::s, are simply treated as Ruby constants. RLisp one-liners are possible with the usual -e syntax. Dotted-pair notation is supported by many operations including pattern matching (of course, as RLisp uses arrays and not linked lists, it wouldn't be possible to support all use cases). #to_s_lisp (used by print) and #inspect_lisp (used by the REPL) are now separate operations, just like in Ruby. It should simply be a lot more fun to hack.
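The #to_s_lisp / #inspect_lisp split mirrors Ruby's own to_s / inspect convention: a human-readable form for output versus a source-like form for the REPL and debugging. In plain Ruby:

```ruby
# to_s is what puts prints; inspect is what p and irb show.
puts "foo".to_s           # foo
puts "foo".inspect        # "foo"
puts [1, :a, nil].inspect # [1, :a, nil]
```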

Tuesday, April 24, 2007

Distributed version control with SVK

charleyBook by Hackintosh from flickr (CC-NC)

I had to shut down the server with my Subversion repository for a few days. As I felt like coding and I hate coding without some sort of revision control system in place, it was time to check out some Subversion alternatives.

I dumped my Subversion repository (svnadmin dump) and scp'ed it to another machine. Pretty much at random I picked SVK first.

It turned out what I did wasn't the right strategy for setting up a local repository - most distributed revision control systems prefer to use network connections to set up mirrors. I had to convert the dump back to a Subversion database (svk admin load does more or less the same thing as svnadmin load), move the imported files somewhere else, and mirror that using the file:// protocol (svk mirror followed by svk sync -a).

That wasn't all, as one is not supposed to work on mirrors directly - the official way is to create a local branch (svk copy), commit there, and sync later. Setup was a bit complicated, and svk sync -a took over 15 minutes, but afterwards it was a pure pleasure.

Pretty much every single command I used in Subversion just worked; I only had to say svk whatever instead of svn whatever. Even svnadmin commands are available as svk admin. That's quite important, as I knew and liked Subversion already, and didn't really feel like switching paradigms more than necessary. svk was also very fast once installed. That's also a nice thing, as I tend to commit every single new unit test separately instead of doing big commits every few hours.

Well, there's not much more to add - SVK just works, and without a lengthy adjustment period.

Monday, April 23, 2007

RuPy 2007

the thicket by Steve took it from flickr (CC-NC-SA)

I've been too busy coding to write anything about it earlier, but a week and a few days ago I attended a Ruby & Python conference in Poznań - RuPy 2007.

I strongly prefer active participation, so I prepared a talk for the conference. My talk was about RLisp. I wasn't really sure if it would be well-received, as the conference was dominated by web developers, and back then RLisp didn't even have a web framework. As it turned out the reception was very enthusiastic, and many people seemed to be fascinated by the idea, even if just for the sake of geek points. What surprised me even more was that the very first person I talked with knew my blog.

Slides I prepared for the talk are available in ODP and PDF formats. The main point of my talk was that RLisp is at the same time a Lisp with access to all the great Ruby libraries, and Ruby with macros. There were few Lispers in the audience, but the Lisp hype spread by Paul Graham and others seemed to have reached everyone. I made a bold claim that RLisp has more libraries than all other Lisps taken together. Ruby definitely does, thanks to Ruby gems, and RLisp is a fully-featured Lisp. The boldness was in my assumption that Ruby and RLisp code can seamlessly live together and naturally blend. I designed RLisp with this goal in mind, but with so little real-life testing maybe it's too early to say such things.

I think that integrating with the Ruby runtime was the best possible choice. An obvious alternative from a library-count point of view would be Perl, but its semantics are so hideously complex that a language living in the same environment as Perl would have to be Perl, or else it would have severely restricted access to its libraries. I'm talking about Perl's semantics, not syntax. Things in Perl don't even know whether they're equal to other things or not. Strings "00" and "0" are considered equal in some contexts, but not in others (they're ==, they are not eq, and the former is true while the latter is false). If any language were to work with Perl, it would need to support String-Number hybrids, scalar/list/void and some even more obscure contexts for all function calls, a badly bolted-on OO system without even usable reflection, and it would even need to copy all the ugliness of Perl references. I tried it once, it wasn't nice.

I'm not trying to bash Perl (at least not too much, it's such an easy target it wouldn't be much fun). It has many redeeming features like CPAN and focus on getting things done efficiently even if it meant breaking with holy traditions of programming language design. The last time I wrote about Perl here an anonymous reader left a truly insightful comment:
Saying "Perl is great at ..." is the same as saying "I don't know much about ...."

And that's the very point of Perl. You grab a package from CPAN, run some Unix commands, convert the data with a few one-liners full of regular expression magic and you're done. There's not enough time for becoming an expert at everything. It's far more efficient to take advantage of someone else's expertise - in this case the CPAN module's author's. Of course every now and then you get deep into something. One of the greatest things about Perl is that the community of Perl programmers developed strong traditions of sharing the results of their work, on CPAN and elsewhere.

Continuing the side note, Python seems rather unfriendly towards functional programming, and static-centric environments like the JVM and .NET would most likely cause a very severe impedance mismatch with a highly dynamic language like Lisp. I'm really impressed by how far the JRuby guys were able to get. Code examples available online don't seem particularly drastic for code that crosses language boundaries, and between two established languages at that, with so little room for modifying either language to work better with the other.

Getting back to my RuPy talk, one of the greatest powers of Lisp is its metaprogramming facilities - having a very convenient representation for code, which can be robustly generated, transformed, and analyzed. Robust code generation is a far more important issue than most people realize. Four common bugs in run-time code generation - XSS, SQL injections, PHP includes, and format string attacks - are responsible for 40% of all security vulnerabilities (counting only vulnerabilities of known types). Unfortunately in Web 2.0 even HTML is code, and highly security-critical code at that. Would you rather have robust and convenient macros for generating safe HTML, or pray every day that you never forget to escape dangerous input? I don't want to sound bleak, but we all know for how long people have had means of avoiding all buffer overflows, SQL injections, PHP includes, format string attacks, and many other classes of vulnerabilities, and didn't use them. Most likely many people will keep escaping HTML by hand even in 2020.
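For a feel of the escaping burden, here's a minimal Ruby sketch using the standard library's CGI.escapeHTML - every interpolation point is a place where a programmer can forget the call:

```ruby
require 'cgi'

# Hand-escaping works right up until the one place you forget it.
user_input = %q{<script>alert("xss")</script>}
html = "<p>Hello, #{CGI.escapeHTML(user_input)}!</p>"
puts html
# <p>Hello, &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;!</p>
```

A macro-based HTML generator can insert that escaping automatically, so the safe form is the only form you can write.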

Returning from another side track, RLisp is a Lisp with Ruby libraries, object system and so on, but it's also Ruby with macros. For some reason, languages with rich syntax like Ruby rarely get powerful macro systems, and even when they do, like Dylan, Nemerle, and Lua, macros are very reluctantly used by most programmers. Only in Lisps is macro use widespread. I think the basic reason is that Lisp code can not only be generated easily, but also transformed and analyzed. All Lisp syntax gets converted to a small core, and the core is all a metaprogrammer needs to deal with. It's certainly possible to imagine all Ruby syntax getting compiled to just literals, message sends, and a few control structures. Unfortunately the real Ruby AST has 105 kinds of nodes, a set which is highly non-orthogonal, and it's difficult to transform the AST in any nontrivial way without accidentally breaking code semantics. If you feel like getting your hands dirty, SexpProcessor converts Ruby to something vaguely Lisp-like and back. I tried it once and SexpProcessor felt much harder than Lisp macros. Just in case someone thinks I'm exaggerating, here's an example of Ruby's magical behaviour. When a block is called with one argument, expects multiple arguments (possibly optional, though |*args| alone doesn't cause magic), and the argument is an Array, it is magically expanded into a list of arguments.
irb> x="foo"; Proc.new{|a,*b| a}.call(x)
=> "foo"
irb> x=[1,2]; Proc.new{|a,*b| a}.call(x)
=> 1
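In current Ruby versions the same magic can be pinned down by contrasting procs (which auto-splat a single Array argument across the parameter list) with lambdas (which don't):

```ruby
# Procs and blocks auto-splat one Array argument; lambdas pass it through.
pick_first = Proc.new { |a, *b| a }
strict     = lambda   { |a, *b| a }

p pick_first.call("foo")  # "foo"  (no magic for non-arrays)
p pick_first.call([1, 2]) # 1      (array silently expanded)
p strict.call([1, 2])     # [1, 2] (lambda keeps the array intact)
```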
I'd love to improve Ruby's elegance and hackability. If I had my way, I'd get rid of most of these irregularities in Ruby 2 (saving one or two keypresses for an explicit * really isn't worth the added complexity), together with protected, most of private (global functions like Kernel#puts need to stay private for technical reasons, but many should be made public, especially commonly used metaprogramming methods like Module#define_method), and class checks on UnboundMethod#bind (checking the internal C representation is more than enough), and I'd add Object#become, Object#class= and other useful methods from evil.rb. I guess I'm not a particularly representative Ruby hacker, so maybe it's better for most that I'm not in charge of Ruby. ;-)

I really meant to write about the conference, but it turned into a post about RLisp. So, on the conference... people seemed to like RLisp. The obvious question I received after the talk was about web frameworks for RLisp. Back then there weren't any, the compiler was barely written a few days before and not even published, but I thought RLisp should be able to use Ruby on Rails without too much porting effort. And indeed just a few days later Brad Ediger created an RLisp plugin for Rails. Ruby has not only great libraries, but also great hackers.

As for RuPy 2007 itself, the talk I liked most was "PyPy, the implementation of Python in Python". They're doing a lot of cool stuff with Python, and I'd really love to see similar things for Ruby. Some other talks I liked were "Python usage in Embedded Environments", "Domain-specific languages in Ruby", and "Hands-on! Wrapping FileMagic as a RubyGem". Between sessions I met many cool hackers, a lot of them involved in local Ruby User Groups (or suchlike). I think I'll follow their advice and check if there's something like that around.

Friday, April 20, 2007

Pattern matching for RLisp

claws by dotpolka from flickr (CC-NC-ND)One of the reasons I had for creating RLisp was a desire for a platform on which I could experiment with Lisp-style macros, and where they would do something useful. All other Lisps are either gross like Common Lisp or lack a full macro system like Scheme. And in either case they don't have the things we take for granted nowadays in Ruby, Python or even Perl. Now that RLisp exists, and is even decently fast, I finally got around to playing with macros. Some observations:

  • It's very difficult. Much more difficult than plain metaprogramming by code generation.
  • Macros are very intolerant - a typo or thinko in macro definition usually results in a completely uninformative error message. And thinkos are quite common.
  • I was never bitten by macro hygiene issues. Obviously I (gensym) an identifier if I need one. It seems that to have hygiene problems the macro would need to use a global variable which was shadowed in a local scope. This doesn't seem likely to happen.
Probably the most complicated macro I've written so far is the pattern matching macro (match ...). It provides pattern matching syntax similar to OCaml's (SML's, Haskell's, and so on):
(defun arithmetics (expr)
 (match expr
   (x 'plus y)  (+ (arithmetics x) (arithmetics y))
   (x 'minus y) (- (arithmetics x) (arithmetics y))
   (x 'times y) (* (arithmetics x) (arithmetics y))
   (x 'div y)   (/ (arithmetics x) (arithmetics y))
   z z
 )
)
(arithmetics '5.0) ; => 5.0
(arithmetics '(1 plus 2)) ; => 3
(arithmetics '((3 times 4) plus 5)) ; => 17
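For comparison, modern Ruby (2.7+) has since gained built-in case/in pattern matching that can express roughly the same interpreter. This is a rough analogue for illustration, not what the RLisp macro generates:

```ruby
# The same tiny arithmetic interpreter using Ruby's case/in pattern
# matching; array patterns destructure and bind x and y directly.
def arithmetics(expr)
  case expr
  in [x, :plus,  y] then arithmetics(x) + arithmetics(y)
  in [x, :minus, y] then arithmetics(x) - arithmetics(y)
  in [x, :times, y] then arithmetics(x) * arithmetics(y)
  in [x, :div,   y] then arithmetics(x) / arithmetics(y)
  else expr  # anything else evaluates to itself
  end
end

p arithmetics(5.0)                        # 5.0
p arithmetics([1, :plus, 2])              # 3
p arithmetics([[3, :times, 4], :plus, 5]) # 17
```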
The macro is built in stages. The first stage simply protects against evaluating the expression more than once.
(defmacro match (val . cases)
 (let tmp (gensym))
 `(do
   (let ,tmp ,val)
   (match-2 ,tmp ,@cases)
 )
)
(match-2 ...) does the same thing as (match ...), except that it knows its first argument is a temporary variable, so it doesn't have to worry about evaluating it multiple times. (match-2 ...) builds a tree (if (tmp matches pattern 1) (code 1) (if (tmp matches pattern 2) (code 2) ...)), possibly with a default value if all patterns fail.
(defmacro match-2 (tmp . cases)
 (if [cases empty?]
   `nil
   (if (== [cases size] 1)
      (hd cases)
      `(if (match-3 ,tmp ,(nth cases 0))
         ,(nth cases 1)
         (match-2 ,tmp ,@(ntl cases 2))
      )
   )
 )
)
(match-3 ...) generates code which matches a single value against a single pattern. There's a cool part - in RLisp (let foo bar) sets foo variable in the nearest enclosing scope (typically a function), so we don't have to do any extra work to get our patterns bind values to identifiers.
(defmacro match-3 (tmp pattern)
 (if [pattern is_a? Array]
   (if [pattern == ()] `[,tmp == ,pattern]
     (if [(hd pattern) == 'quote]
       `[,tmp == ',(nth pattern 1)]
       `(bool-and
          [,tmp is_a? Array]
          [[,tmp size] == ,[pattern size]]
          ,@[[Range new 0 [[pattern size] - 1]] map & (fn (i)
            (let tmpi (gensym))
            `(do
               (let ,tmpi [,tmp get ,i])
               (match-3 ,tmpi ,(nth pattern i))
            )
          )]
       )
     )
   )
   (if [pattern is_a? Symbol]
     `(do (let ,pattern ,tmp) true)
     `[,tmp == ,pattern]
   )
 )
)
(match-3 ...) isn't very pretty. It handles the following patterns:
  • Symbol - bind value to identifier
  • Array (quote symbol) - check if value equals symbol
  • Empty array - check if value is an empty array too
  • Other array - check if value is an array of the right size. If it is, check each of its elements against the corresponding element of the pattern array.
  • Anything else - check if value equals pattern.
It would be nice to handle variable-sized lists with patterns like (foo bar . rest). It shouldn't be that hard to implement. (bool-and ...) is simply (and ...) except that it returns true or false instead of the last value or nil. (and ...) would work here too, but I wanted a simpler version for unit testing (match ...).
(defmacro bool-and args
 (if (empty? args)
   true
   (if [[args size] == 1]
     (hd args)
     `(if ,(hd args)
       (bool-and ,@(tl args))
       false
     )
   )
 )
)
I'm still undecided when to send messages like [args empty?] and when to call functions like (empty? args) and on other such details.

Thursday, April 19, 2007

Judy for Ruby 1.8.6

Pink Curls by creativity+ from flickr (CC-NC-ND)I ported the Judy patch to Ruby 1.8.6. Speedups are comparable to the ones in 1.8.5. On benchmarks from YARV (score is the mean of the 3 middle times from 5 runs):

Benchmark       Speedup
app_answer      1.9%
app_fib         1.3%
app_mandelbrot  4.2%
app_raise       7.1%
app_strconcat   0.3%
app_tak         0.8%
app_tarai       2.7%

The Mandelbrot benchmark is obviously significantly improved, as Judy speeds up instance variable getting/setting a lot, and the benchmark uses complex numbers, which are Ruby objects. I'm not really sure why raise is affected so much, but it was consistent across 5 benchmark runs.

In any case, you can get the patch here.

Varnish vs Squid - assembly still matters

Heart and M&Ms by Bob.Fornal from flickr (CC-NC-SA)Varnish is a server-side HTTP proxy which is taking over the world as we speak. It does so by being 10x-20x faster than Squid on the same hardware (at least that's the number they claim).

It achieves this not by using fancy algorithms, but by being aware of modern computer architectures.

Architecture Notes for Varnish are a great read. There's also a presentation about various other technical aspects of Varnish.

Dieting for geeks

Toefel has his own 'food chain'... by Sibi from flickr (CC-NC-ND)I have no idea if this is going to be useful for anyone or not. I played a bit with the USDA Nutrient Database, trying to score different foods on their "healthiness". There is a very widespread belief that eating has an enormous impact on health. Different sources suggest eating more or less of certain foods, or getting more or less of certain nutrients. Such advice tends to have very little basis in hard evidence. What is basically known is that very low intakes of vitamins and minerals are harmful, and that diet has a moderate impact on cardiovascular health. There's hardly any evidence associating high or very high intakes of particular vitamins, minerals, or other nutrients with major improvements in health, and little evidence that dieting can have more than a minor effect on cancer or diseases other than cardiovascular ones. If you're interested in dieting and health, probably the first thing you should read is the paper "Diet quality and major chronic disease risk in men and women: moving toward improved dietary guidance". The paper explored some alternative indexes of healthy eating, knowing from previous research that the USDA Food Pyramid failed miserably at predicting anything.

After adjusting for other risk factors in the multivariate adjusted analysis, men in the highest quintile of AHEI scores had a 39% lower risk of CVD than did men in the lowest quintile (RR = 0.61; 95% CI: 0.49, 0.75), whereas men in the highest quintile of RFS had a 23% reduction in CVD risk (RR = 0.77; 95% CI: 0.64, 0.93). Neither score predicted cancer risk, after multivariate adjustment.
The associations of both the AHEI score and RFS with risk of major chronic disease, CVD, and cancer in women are shown in Table 5. Overall, the findings were weaker than those for men. The AHEI score predicted a weak but significant reduction in major chronic disease risk in our multivariate models (RR = 0.89; 95% CI: 0.82, 0.96; P = 0.008). AHEI scores in the highest quintile compared with the lowest quintile were associated with a 28% lower risk of CVD in women (RR = 0.72; 95% CI: 0.60, 0.86; P < 0.001). As with men, we observed no significant associations between AHEI score and cancer risk. All models evaluating the RFS in women were nonsignificant after multivariate adjustment.
So the AHEI only moderately correlates with risk of CVD, and with very little else. As indexes which measure the whole diet get such weak results, there's really little hope that measuring individual nutrient intakes or individual food products will predict anything at all. Yet, just for fun, here's a tool for building such indexes. Just don't consider the results meaningful.
# The data can be downloaded from the following URLs:
# http://www.nal.usda.gov/fnic/foodcomp/Data/SR19/asc/FD_GROUP.txt
# http://www.nal.usda.gov/fnic/foodcomp/Data/SR19/asc/FOODS_DES.txt
# http://www.nal.usda.gov/fnic/foodcomp/Data/SR19/dnload/sr19abbr.zip
# * (Unpack ABBREV.txt file from the zip)

def parse_sr19(file_name)
  fh = File.open(file_name)
  fh.each{|line|
    fields = line.chomp.split(/\^/).map{|entry| if entry =~ /\A~(.*)~\Z/ then $1 else entry.to_f end }
    yield(*fields)
  }
end

$food_group = {}
parse_sr19('FD_GROUP.txt') {|food_id, name|
  $food_group[food_id] = name
}

$food_des = {}
parse_sr19('FOOD_DES.txt') {|food_id, food_group_id, long_dsc, *rest|
  $food_des[food_id] = [$food_group[food_group_id], long_dsc]
}

food = {}
parse_sr19('ABBREV.txt') {|*fields|
  ndb_no     = fields[0]

  group, dsc = $food_des[ndb_no]

  short_dsc  = fields[1]
  energy     = fields[3]
  next if energy == 0.0

  # Conversion factor: assume a 2000.0 kcal diet
  cf = 2000.0 / energy

  # Score in RDAs for males, 19-30y
  # * http://www.iom.edu/Object.File/Master/21/372/0.pdf
  ca         = fields[10]*cf/1000.0 # mg -> RDA
  fe         = fields[11]*cf/8.0    # mg -> RDA
  mg         = fields[12]*cf/400.0  # mg -> RDA
  phosphorus = fields[13]*cf/700.0  # mg -> RDA
  k          = fields[14]*cf/4700.0 # mg -> RDA
  na         = fields[15]*cf/1500.0 # mg -> RDA
  zn         = fields[16]*cf/11.0   # mg -> RDA
  cu         = fields[17]*cf/900.0  # mg -> RDA
  mn         = fields[18]*cf/2.3    # mg -> RDA
  se         = fields[19]*cf/55.0   # ug -> RDA
  vit_c      = fields[20]*cf/90.0   # mg -> RDA
  thiamin    = fields[21]*cf/1.2    # mg -> RDA
  riboflavin = fields[22]*cf/1.3    # mg -> RDA
  niacin     = fields[23]*cf/16.0   # mg -> RDA
  panto_acid = fields[24]*cf/5.0    # mg -> RDA
  vit_b6     = fields[25]*cf/1.3    # mg -> RDA
  folate     = fields[26]*cf/400.0  # ug -> RDA
  vit_b12    = fields[30]*cf/2.4    # ug -> RDA
  vit_a_rae  = fields[32]*cf/900.0  # ug rae -> RDA
  vit_e      = fields[34]*cf/15.0   # mg -> RDA
  vit_k      = fields[35]*cf/120.0  # ug -> RDA

  # More than 5x RDA density doesn't give any extra score
  score_vits = [vit_c,thiamin,riboflavin,niacin,panto_acid,vit_b6,folate,vit_b12,vit_a_rae,vit_e,vit_k,
            ca,fe,phosphorus,k,na,zn,cu,mn,se].
            map{|e| if e > 5.0 then 5.0 else e end}.
            inject{|a,b| a+b}

  vitamins = [[vit_c,thiamin,riboflavin,niacin,panto_acid,vit_b6,folate,vit_b12,vit_a_rae,vit_e,vit_k].map{|e| (100*e).round*0.01}].map{|x|
         "#{x[8]}A #{x[1]}B1 #{x[2]}B2 #{x[3]}B3 #{x[4]}B5 #{x[5]}B6 #{x[6]}B9 #{x[7]}B12 #{x[0]}C #{x[9]}E #{x[10]}K"}[0]
  minerals = [[ca,fe,phosphorus,k,na,zn,cu,mn,se].map{|e| (100*e).round*0.01}].map{|x|
         "#{x[0]}Ca #{x[1]}Fe #{x[2]}P #{x[3]}K #{x[4]}Na #{x[5]}Zn #{x[6]}Cu #{x[7]}Mn #{x[8]}Se"}[0]

  protein    = fields[4]*cf  # g
  lipid      = fields[5]*cf  # g
  carb       = fields[7]*cf  # g
  fiber      = fields[8]*cf  # g
  fa_sat     = fields[41]*cf # g
  fa_mono    = fields[42]*cf # g
  fa_poly    = fields[43]*cf # g
  choles     = fields[44]*cf/1000.0 # g

  # And now the scoring
  score = 0.3 * protein + fa_poly - fa_sat + 3 * score_vits - 10.0 * choles

  fa_poly = fa_poly.round
  fa_mono = fa_mono.round
  fa_sat  = fa_sat.round
  choles  = (10.0*choles).round*0.1

  food[group] ||= []
  food[group] << [-score, "* #{dsc}
* Score: #{score}
* Fats: #{fa_poly}/#{fa_mono}/#{fa_sat}/#{choles}
* VM-Score: #{score_vits}
* Vitamins: #{vitamins}
* Minerals: #{minerals}
"]
}

food.to_a.sort.each{|group, foods|
  puts "Group: #{group}"
  puts foods.sort.map{|sc,dsc| dsc}
}
The script parses the USDA data and converts it to a useful format, namely "how many times the RDA would you get if you ate only this food, at 2000 kcal/day". Vitamins and minerals are in RDAs; polyunsaturated, monounsaturated, and saturated fatty acids, protein, and cholesterol are in grams/day, assuming only this particular type of food is eaten. Scores of this kind are pretty useful, as mixing different foods simply corresponds to computing a weighted average. To avoid giving overly high scores to single-nutrient foods, nutrient content above 5x the RDA is ignored. The output with the default scoring function looks something like this:
Group: Ethnic Foods
* Buffalo, free range, top round steak, raw (Shoshone Bannock)
* Score: 259.922257628
* Fats: 2/9/8/0.0
* VM-Score: 45.4825033844174
* Vitamins: 0.0A 2.58B1 5.05B2 8.74B3 3.07B5 12.12B6 0.0B9 9.68B12 0.0C 0.0E 0.0K
* Minerals: 0.06Ca 6.59Fe 5.77P 1.53K 0.58Na 5.51Zn 0.0Cu 0.1Mn 2.57Se
* Caribou, shoulder meat, dried (Alaska Native)
* Score: 253.402785187772
* Fats: 3/9/12/1.2
* VM-Score: 47.6705470921111
* Vitamins: 0.09A 1.92B1 7.38B2 6.83B3 6.13B5 2.83B6 0.15B9 46.43B12 0.0C 0.03E 0.0K
* Minerals: 0.1Ca 10.15Fe 5.8P 1.27K 0.97Na 6.31Zn 0.01Cu 0.35Mn 4.94Se
Group: Fast Foods
* McDONALD'S,  Hamburger
* Score: 64.1448551547567
* Fats: 2/25/23/0.2
* VM-Score: 19.4689199722205
* Vitamins: 0.0A 1.63B1 1.46B2 2.25B3 0.0B5 0.0B6 1.27B9 2.74B12 0.05C 0.0E 0.0K
* Minerals: 0.96Ca 2.62Fe 1.21P 0.34K 2.68Na 1.35Zn 0.0Cu 0.89Mn 0.0Se
Group: Baked Products
* ARCHWAY Home Style Cookies, Coconut Macaroon
* Score: -74.9294687966982
* Fats: 4/7/87/0.0
* VM-Score: 1.66698866197016
* Vitamins: 0.0A 0.07B1 0.2B2 0.06B3 0.0B5 0.0B6 0.05B9 0.0B12 0.0C 0.0E 0.0K
* Minerals: 0.02Ca 0.45Fe 0.0P 0.11K 0.7Na 0.0Zn 0.0Cu 0.0Mn 0.0Se
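The scoring rules described above - capping nutrient content at 5x the RDA, and mixing foods as a weighted average - can be sketched as follows. This is a hypothetical illustration, not the original script; all names are made up:

```ruby
# Hypothetical sketch of the scoring rules described in the text,
# not the original script. Nutrient amounts are multiples of the RDA
# at a 2000 kcal/day intake of that single food.
RDA_CAP = 5.0

# Content above 5x the RDA is ignored, so single-nutrient foods
# don't get overly high scores.
def capped(rda_multiple)
  [rda_multiple, RDA_CAP].min
end

# Mixing foods corresponds to a weighted average of their scores,
# weighted by the fraction of calories each food contributes.
def mixed_score(scores_and_calorie_fractions)
  scores_and_calorie_fractions.map { |score, frac| score * frac }.sum
end

capped(12.12)                               # => 5.0
mixed_score([[259.9, 0.25], [64.1, 0.75]])  # ~113, between the two scores
```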
Two more things. First, a lot of data is missing. Usually the amount of a nutrient for which there's no data is negligible, so the score isn't affected, but that's not always so. Second, there's the huge subject of bioavailability, which the data completely ignores, even though the difference can be quite significant.

Wednesday, April 18, 2007

set_trace_func, smoke and mirrors

we built this city on smoke and mirrors by zen♫ from flickr (CC-NC-SA)

The Ruby interpreter provides event hooks, which can be used to trace execution of a Ruby program. To set up an event hook, call set_trace_func(some_proc); to remove it, call set_trace_func(nil).

Tracers are probably the most useful debugging tool ever invented, far more useful than debuggers based on single-stepping and breakpoints.

Ruby traces 8 events:

  • line - file/line changed
  • class - module/class definition started
  • end - module/class definition ended
  • call - Ruby function was called
  • return - Ruby function returned
  • c-call - C function was called
  • c-return - C function returned
  • raise - exception was raised
Some events are not traced, like setting and getting variables (local, global, instance, or class).
set_trace_func proc { |event, file, line, id, binding, classname|
STDERR.printf "%8s %s:%-2d %10s %8s\n", event, file, line, id, classname
}

@welcome = "Hello"
separator = ", "
$object = "world"
@@excl = "!"
puts(@welcome + separator + $object + @@excl)
$ ./example-1.rb
line ./example-1.rb:7 false
line ./example-1.rb:8 false
line ./example-1.rb:9 false
line ./example-1.rb:10 false
line ./example-1.rb:11 false
c-call ./example-1.rb:11 + String
c-return ./example-1.rb:11 + String
c-call ./example-1.rb:11 + String
c-return ./example-1.rb:11 + String
c-call ./example-1.rb:11 + String
c-return ./example-1.rb:11 + String
c-call ./example-1.rb:11 puts Kernel
c-call ./example-1.rb:11 write IO
c-return ./example-1.rb:11 write IO
c-call ./example-1.rb:11 write IO
c-return ./example-1.rb:11 write IO
c-return ./example-1.rb:11 puts Kernel

In the example String#+ was run 3 times and IO#write 2 times - the second write most likely printing the trailing newline.

Many things in Ruby are magical and are not converted to method calls. One such thing is string interpolation with "... #{ code } ...", which is not converted to String#+ or Array#join calls, but handled specially by the interpreter.

set_trace_func proc { |event, file, line, id, binding, classname|
STDERR.printf "%8s %s:%-2d %10s %8s\n", event, file, line, id, classname
}

@welcome = "Hello"
$object = "world"
puts("#{@welcome}, #{$object}\n")
$ ./example-2.rb
line ./example-2.rb:7 false
line ./example-2.rb:8 false
line ./example-2.rb:9 false
c-call ./example-2.rb:9 puts Kernel
c-call ./example-2.rb:9 write IO
c-return ./example-2.rb:9 write IO
c-return ./example-2.rb:9 puts Kernel
In the same way, creating classes is not converted to Class#new, and adding methods is not converted to Module#define_method. Fortunately, if we need to capture such events, Ruby provides specialized hooks - when a class is created, its parent's inherited method is called, and when a method is added, Ruby calls method_added on its class.

set_trace_func proc { |event, file, line, id, binding, classname|
STDERR.printf "%8s %s:%-2d %10s %8s\n", event, file, line, id, classname
}

class Geolocation
def initialize(lat, long)
@lat = lat
@long = long
end
def to_s
"<#{@lat}, #{@long}>"
end
end
$ ./example-3.rb
line ./example-3.rb:7 false
c-call ./example-3.rb:7 inherited Class
c-return ./example-3.rb:7 inherited Class
class ./example-3.rb:7 false
line ./example-3.rb:8 false
c-call ./example-3.rb:8 method_added Module
c-return ./example-3.rb:8 method_added Module
line ./example-3.rb:12 false
c-call ./example-3.rb:12 method_added Module
c-return ./example-3.rb:12 method_added Module
end ./example-3.rb:7 false
If we want to trace something useful, it's probably time to change the format. If we're interested in tracing execution, file/line information is usually just noise, return/c-return/end events would be far better represented by indentation, and we're really interested in the receiver (self), not its class.
$indent = 0
set_trace_func proc { |event, file, line, id, binding, classname|
if event == "line"
# Ignore
elsif %w[return c-return end].include?(event)
$indent -= 2
else
if event == "class"
STDERR.printf "%*s%s %s\n", $indent, "", event, eval("self", binding)
else
STDERR.printf "%*s%s %s.%s\n", $indent, "", event, eval("self", binding), id
end
$indent += 2 if %w[call c-call class].include?(event)
end
}

class Geolocation
def initialize(lat, long)
@lat = lat
@long = long
end
def to_s
"<#{@lat}, #{@long}>"
end
end

a = Geolocation.new(51.12, 17.03)
puts a
p a
$ ./example-4.rb  >/dev/null
c-call Object.inherited
class Geolocation
c-call Geolocation.method_added
c-call Geolocation.method_added
c-call Geolocation.new
call <, >.initialize
c-call main.puts
call <51.12, 17.03>.to_s
c-call 51.12.to_s
c-call 17.03.to_s
c-call #<IO:0xb7c80fc0>.write
c-call #<IO:0xb7c80fc0>.write
c-call main.p
c-call <51.12, 17.03>.inspect
c-call 17.03.inspect
c-call 17.03.to_s
c-call 51.12.inspect
c-call 51.12.to_s
c-call #<IO:0xb7c80fc0>.write
c-call #<IO:0xb7c80fc0>.write
There's still a minor quirk in the output: the tracer calls Geolocation#to_s as soon as Geolocation#initialize starts, before the object can be meaningfully printed. In this case we simply got garbage (<, >), but some objects raise exceptions if printed before initialization finishes.
require 'complex'

$indent = 0
set_trace_func proc { |event, file, line, id, binding, classname|
if event == "line"
# Ignore
elsif %w[return c-return end].include?(event)
$indent -= 2
else
obj = eval("self", binding)
if event == "class"
STDERR.printf "%*s%s %s\n", $indent, "", event, obj
else
obj = "<#{obj.class}##{obj.object_id}>" if id == :initialize
STDERR.printf "%*s%s %s.%s\n", $indent, "", event, obj, id
end
$indent += 2 if %w[call c-call class].include?(event)
end
}

a = Complex.new(11.0, -5.0)
b = Complex.new(2.0, 13.5)
c = a * b
$ ./example-5.rb
c-call Complex.new
call <Complex#-605829048>.initialize
c-call 11.0.kind_of?
c-call 11.0.kind_of?
c-call -5.0.kind_of?
c-call -5.0.kind_of?
c-call Complex.new
call <Complex#-605832038>.initialize
c-call 2.0.kind_of?
c-call 2.0.kind_of?
c-call 13.5.kind_of?
c-call 13.5.kind_of?
call 11.0-5.0i.*
c-call 2.0+13.5i.kind_of?
c-call 11.0.*
c-call -5.0.*
c-call 22.0.-
c-call 11.0.*
c-call -5.0.*
c-call 148.5.+
call 11.0-5.0i.Complex
c-call 138.5.==
call 89.5.real
call 138.5.imag
c-call 89.5.-
call 89.5.imag
call 138.5.real
c-call 0.+
c-call Complex.new
call <Complex#-605842188>.initialize
c-call 89.5.kind_of?
c-call 89.5.kind_of?
c-call 138.5.kind_of?
c-call 138.5.kind_of?

Many otherwise hard to find bugs become trivial with a custom tracer, grep, and some Ruby one-liners to massage the logs.
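For example, a hypothetical log-massaging snippet (not part of any tool above) that counts how often each method shows up in call/c-call lines of a tracer log in the format produced by the indented tracer:

```ruby
# Hypothetical example: count call/c-call events per method name in a
# tracer log whose lines look like the indented tracer output above.
log = <<LOG
c-call Complex.new
call <Complex#-605829048>.initialize
c-call 11.0.kind_of?
c-call 11.0.kind_of?
c-call 11.0.*
LOG

counts = Hash.new(0)
log.each_line do |line|
  event, call = line.strip.split(" ", 2)
  next unless event == "call" || event == "c-call"
  counts[call.split(".").last] += 1   # the method name follows the last dot
end
counts  # => {"new"=>1, "initialize"=>1, "kind_of?"=>2, "*"=>1}
```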

Smoke and mirrors

set_trace_func is useful not just for debugging, but also for all kinds of magic. Binding.of_caller, which unfortunately doesn't work any more, was implemented with set_trace_func. magic/help uses set_trace_func to provide convenient help access in irb:

irb> help { STDERR.write }
--------------------------------------------------------------- IO#write
ios.write(string) => integer
------------------------------------------------------------------------
Writes the given string to _ios_. The stream must be opened for
writing. If the argument is not a string, it will be converted to a
string using +to_s+. Returns the number of bytes written.

count = $stdout.write( "This is a test\n" )
puts "That was #{count} bytes of data"

_produces:_

This is a test
That was 15 bytes of data

magic/help works by setting a set_trace_func handler, executing the passed block, and then aborting execution with throw after the first interesting event. This means that calling help { rm_rf "/" } should be perfectly safe - the block is aborted before rm_rf gets to do any damage.
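The general pattern - set a hook, run the block, capture the first interesting event, and stop tracing - can be sketched like this. This is a simplified, hypothetical version that merely records the first call and lets the block finish, instead of aborting it with throw as magic/help does:

```ruby
# Simplified sketch of the magic/help idea: record which method the
# block calls first, disabling the hook from inside itself.
# Unlike magic/help, this version lets the block run to completion.
def first_call_info
  result = nil
  set_trace_func proc { |event, file, line, id, binding, klass|
    if result.nil? && (event == "call" || event == "c-call")
      result = [klass, id]
      set_trace_func(nil)   # stop tracing as soon as we have our answer
    end
  }
  yield
  set_trace_func(nil)       # defensive: make sure the hook is gone
  result
end

first_call_info { "abc".upcase }  # => [String, :upcase]
```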

One nasty case I found while developing magic/help was inconsistent traces when functions are called with the wrong number of arguments.

For functions implemented in C, even with an explicit number of arguments specified, Ruby always issues c-call before calling ArgumentError.new. For Ruby functions, on the other hand, ArgumentError.new is called and then a return event for the function is issued - but the matching call never happens.

To illustrate, tracing C function STDERR.write gives:

c-call #<IO:0xb7c98fa8>.write
c-call ArgumentError.new
c-call #<ArgumentError: ArgumentError>.initialize
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.initialize
c-return ArgumentError.new
c-call #<ArgumentError: wrong number of arguments (0 for 1)>.backtrace
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.backtrace
c-call #<ArgumentError: wrong number of arguments (0 for 1)>.set_backtrace
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.set_backtrace
raise #<IO:0xb7c98fa8>.write
c-return #<IO:0xb7c98fa8>.write
On the other hand tracing Ruby function FileUtils.remove_entry gives:
# "call FileUtils.remove_entry" never happens !
c-call ArgumentError.new
c-call #<ArgumentError: ArgumentError>.initialize
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.initialize
c-return ArgumentError.new
c-call #<ArgumentError: wrong number of arguments (0 for 1)>.backtrace
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.backtrace
c-call #<ArgumentError: wrong number of arguments (0 for 1)>.set_backtrace
c-return #<ArgumentError: wrong number of arguments (0 for 1)>.set_backtrace
raise FileUtils.remove_entry
# "return FileUtils.remove_entry" does happen.
return FileUtils.remove_entry
In both cases checking the number of arguments happens before the call. IO#write is defined as a 1-argument function:
rb_define_method(rb_cIO, "write", io_write, 1);
New versions of magic/help handle this case too - if the first event is c-call to ArgumentError.new, magic/help waits for the first return event (if any) instead of instantly aborting execution of the passed block. Just grab the new version and enjoy the magical help.

Monday, April 16, 2007

Compiler for RLisp

some cats love sinks by Kevin Steele from flickr (CC-NC)

RLisp was a wonderful idea; unfortunately its first implementation was a very slow pure interpreter, and it was too difficult to retrofit good performance into it. The first version of RLisp even evaluated macros each time they were encountered, making them in practice "FEXPR"s instead of macros. I was able to improve RLisp performance a bit - the biggest gain was making sure macros expand just once - but other optimizations resulted in only minor improvement, and the code was quickly turning into an unreadable mess.

The solution was a major rewrite. In the new implementation RLisp code is compiled to something that looks a lot like Lua 5.0's virtual machine code, and then to pure Ruby. Some of the Lua virtual machine's ideas, like flat closures, are simply awesome and greatly simplify the implementation. The code is much cleaner, and much faster. In "x times slower than Ruby" benchmarks:

Benchmark               ary     hello   ackermann
Early interpreter       x96.1   x37.8   x1027.0
Optimized interpreter   x35.5   x58.2   x53.1
Compiler                x5.3    x6.7    x7.5

Compiled code

The RLisp compiler has changed a lot since the teaser. Now, to create functions, it uses Object#function_* and Proc.new instead of Function objects. Function objects were more elegant, but they could only be used for free-standing functions, not for methods - inside such a function, self was a closure. The new form makes it possible to use RLisp (fn ...) as methods, instance_eval them, and so on. Unfortunately, as Proc.new{|*args,&blk| ... } is a syntax error in Ruby 1.8, it's impossible to create RLisp functions which take a Ruby-compatible block argument. Fortunately you can still call such functions from RLisp code, and RLisp functions can take Proc objects as normal arguments. A few examples:
(fn (x)
(fn (y) (+ x y))
)
compiles to:
class ::Object; def function_29(globals, x)
  Proc.new do |*args|
      y, = *args
      t0 = x.send(:+, y)
      t0
  end
end; end
class ::Object; def function_28(globals)
  Proc.new do |*args|
      x, = *args
      t1 = function_29(globals, x)
      t1
  end
end; end
t2 = function_28(globals)
t2
Proc.new takes |*args| instead of |x| to work around magical behaviour of Ruby blocks called with one Array argument, which is autoexpanded:
Proc.new{|first_arg, *other_args| first_arg}.call("foo") # => "foo"
Proc.new{|first_arg, *other_args| first_arg}.call("foo", "bar") # => "foo"
Proc.new{|first_arg, *other_args| first_arg}.call([1, 2, 3]) # => 1
Proc.new{|first_arg, *other_args| first_arg}.call([1, 2, 3], "bar") # => [1, 2, 3]
Fibonacci function:
(let fib (fn (x)
(if (< x 2)
  1
  (+ (fib (- x 1)) (fib (- x 2)))
)
))
compiles to:
class ::Object; def function_28(globals)
  Proc.new do |*args|
      x, = *args
      t0 = globals[:<]
      t1 = t0.call(x, 2)
      if t1
          t2 = 1
      else
          t3 = globals[:fib]
          t4 = x.send(:-, 1)
          t5 = t3.call(t4)
          t6 = globals[:fib]
          t7 = x.send(:-, 2)
          t8 = t6.call(t7)
          t9 = t5.send(:+, t8)
          t2 = t9
      end
      t2
  end
end; end
t10 = function_28(globals)
globals[:fib] = t10
t10
If RLisp can determine that a variable is constant or isn't used within a nested function, it doesn't create a Variable object for it, and performs various optimizations based on such information. Of course Variable objects are created when needed, like in the following code, which creates a counter function. A counter function called with an argument increases the internal counter and returns the old value.
(let make-counter (fn (val)
(fn (incr)
   (let old val)
   (set! val (+ val incr))
   old
)
))
class ::Object; def function_29(globals, val)
  Proc.new do |*args|
      incr, = *args
      t0 = val.get
      old = t0
      t1 = val.get
      t2 = t1.send(:+, incr)
      val.set t2
      t3 = old
      t3
  end
end; end
class ::Object; def function_28(globals)
  Proc.new do |*args|
      a_val, = *args
      val = Variable.new
      val.set a_val
      t4 = function_29(globals, val)
      t4
  end
end; end
t5 = function_28(globals)
globals[:"make-counter"] = t5
t5
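The generated code above relies on a small Variable box with get/set. RLisp's actual class isn't shown in the post, but a minimal sketch would look something like this - and it also shows why the counter works: the inner closure and the outer function share the same box.

```ruby
# Minimal sketch of the Variable box the compiled code assumes
# (hypothetical - RLisp's actual implementation isn't shown here).
class Variable
  def get; @value; end
  def set(v); @value = v; end
end

# make-counter written by hand, in the same style as the compiled output:
# the inner Proc captures the shared box, so set! is visible outside.
def make_counter(a_val)
  val = Variable.new
  val.set a_val
  Proc.new do |incr|
    old = val.get
    val.set(val.get + incr)
    old
  end
end

c = make_counter(10)
c.call(3)  # => 10 (old value; internal counter becomes 13)
c.call(5)  # => 13
```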

Access to Ruby

RLisp can access Ruby with the (ruby-eval "Ruby code") function. All unknown uppercase variables are also assumed to be Ruby constants (nested constants like WEBrick::HTTPServer are not supported yet). This makes using most Ruby libraries from RLisp pretty straightforward.
rlisp> (ruby-eval "require 'complex'")
true
rlisp> (let a [Complex new 1.0 2.0])
1.0+2.0i
rlisp> (let b [Complex new -2.5 3.0])
-2.5+3.0i
rlisp> (let c [a * b])
-8.5-2.0i
Like before, RLisp supports syntactic sugar [receiver method args] for method calls, which expands to (send receiver 'method args). Now it's also possible to specify block argument with &.
rlisp> ['(1 2 3) map & (fn (x) (* x 2))]
(2 4 6)
RLisp can be run in two modes. To execute a standalone program, use rlisp.rb program.rl. To run the interactive REPL, use rlisp.rb -i. The REPL uses readline for input and prints out each expression as it is evaluated. Like every decent REPL, it captures exceptions without exiting.
rlisp> (hello "world")
./rlisp.rb:102:in `initialize': No such global variable: hello
rlisp> (defun hello (obj) (print "Hello, " obj "!"))
#<Proc:0xb7c9e9a8@(eval):2>
rlisp> (hello "world")
Hello, world!
nil
It also supports multiline expressions (without a special prompt yet):
rlisp> (defun hello ()
rlisp>   (print "Hello, world!")
rlisp> )
#<Proc:0xb7c241e4@(eval):2>
rlisp> (hello)
Hello, world!
nil
There's also a wrapper script which makes it possible to run RLisp programs with #!, but as it needs to know path to RLisp it's not enabled by default.

Object-orientation and macros

It's possible to write object-oriented code with a bunch of standard macros:
rlisp> (class String (method welcome () (+ "Hello, " self " !")))
#<Proc:0xb7c4e8a4@(eval):2>
rlisp> ["world" welcome]
Hello, world !
(class ...), (method ...) and so on are just macros, and they expand to:
rlisp> (macroexpand '(class String do_stuff))
(send String (quote instance_eval) & (fn () do_stuff))
rlisp> (macroexpand '(method foo (arg) do_stuff))
(send self (quote define_method) (quote foo) & (fn (arg) do_stuff))
You can explore macroexpansions with (macroexpand 'code) and (macroexpand-1 'code), which are just normal functions. You can look at final generated code with -d command line switch:
$ ./rlisp.rb -d -i
rlisp> (+ 1 2 3)
t0 = 2.send(:+, 3)
t1 = 1.send(:+, t0)
t1
6
rlisp> ((fn (x) (+ x 2)) 5)
class ::Object; def function_30(globals)
  Proc.new do |*args|
      x, = *args
      t2 = x.send(:+, 2)
      t2
  end
end; end
t3 = function_30(globals)
t4 = t3.call(5)
t4
7
Other options are -n (don't use RLisp standard library), -r (force recompilation of standard library), and -h (display help message).

The future

You can download RLisp from the RLisp website. It's licenced under a BSD/MIT-like licence, has reasonable performance (CPU-wise 5x slower than Ruby - perfectly reasonable for database-bound or network-bound programs), and a lot of libraries (that is - it can use most Ruby libraries out of the box). The code is rather clean, and reasonably tested (test code to code ratio 0.96:1). No major program has been written in RLisp so far, so a few misdesigns and major bugs are certainly present.

What I'd like to do next is create a new backend for RLisp. For now the Lua-like VM opcodes are compiled to Ruby, but they're so simple that they could be compiled to C, x86 assembly, or LLVM and still stay fully compatible with the Ruby runtime. That would probably make RLisp somewhat faster than Ruby, as Ruby's slowest part is eval. The Ruby backend could probably be made faster too without too much work. It would also be awesome to add at least limited tail-recursion optimization to RLisp. Have fun.

Monday, April 09, 2007

New compiler for RLisp - teaser

giggling by sachama from flickr (CC-NC-ND)

RLisp is slow. On a few random benchmarks it is about 50x slower than Ruby - possibly too slow to be useful. But that's only because the implementation was created without giving any thought whatsoever to performance. There's nothing in RLisp as a language that forces it to be slow - in fact it could easily be much faster than Ruby and still enjoy full access to the Ruby runtime. I tried a few optimizations in the past, without much measurable success. My latest approach is building a new compiler from scratch (keeping the parser, and possibly parts of the standard library). The compiler generates Ruby code, but it limits itself to a small subset of it. The subset is inspired by the Lua 5.0 virtual machine. Here's the compiled (fn (x) (fn (y) (+ x y))) function:

class Function_2 < Function
   def initialize(globals, x)
       @globals = globals
       @x = x
   end
   def call(a_y)
       y = Variable.new
       y.set a_y
       t0 = @globals[:+]
       t1 = @x.get
       t2 = y.get
       t3 = t0.call(t1, t2)
       return t3
   end
end
class Function_1 < Function
   def call(a_x)
       x = Variable.new
       x.set a_x
       t4 = Function_2.new(@globals, x)
       return t4
   end
end
t5 = Function_1.new(@globals)
The basic trick is the division of variables into three classes - global (accessed through @globals), closure (accessed through instance variables of a closure - all being Variable objects), and local (Ruby local variables). A few things could be easily optimized, like creating Variable objects only for variables which are passed to subclosures and modified somewhere, and using fewer temporaries. But what I really hope for is being able to connect the RLisp compiler to a backend like LLVM. Here's one more example - the Fibonacci function:
(let fib (fn (x)
 (if (< x 2)
   1
   (+
     (fib (- x 1))
     (fib (- x 2))
))))
which gets compiled to:

class Function_1 < Function
   def call(a_x)
       x = Variable.new
       x.set a_x
       t0 = @globals[:<]
       t1 = x.get
       t2 = 2
       t3 = t0.call(t1, t2)
       t4 = Variable.new
       if t3
           t5 = 1
           t4.set t5
       else
           t6 = @globals[:+]
           t7 = @globals[:fib]
           t8 = @globals[:-]
           t9 = x.get
           t10 = 1
           t11 = t8.call(t9, t10)
           t12 = t7.call(t11)
           t13 = @globals[:fib]
           t14 = @globals[:-]
           t15 = x.get
           t16 = 2
           t17 = t14.call(t15, t16)
           t18 = t13.call(t17)
           t19 = t6.call(t12, t18)
           t4.set t19
       end
       t20 = t4.get
       return t20
   end
end
t21 = Function_1.new(@globals)
@globals[:fib] = t21
return t21
The hardest major optimization, at least for Ruby code output, would probably be inlining without harming semantics. Unfortunately, not inlining basic functions like + and send guarantees bad performance. Other cool things would be making sure let behaves intuitively as much as possible, and at least partial tail recursion optimization (for calls to self).
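As a sketch of what self-tail-call optimization could look like in the Ruby output (hypothetical - the compiler doesn't do this yet), a self tail call can be rewritten as a loop that rebinds the arguments:

```ruby
# Hypothetical illustration of self-tail-call elimination.
# RLisp source (made up for this example):
#   (let sum-to (fn (n acc) (if (< n 1) acc (sum-to (- n 1) (+ acc n)))))
# Instead of calling itself, the compiled body loops, rebinding n and acc.
sum_to = Proc.new do |*args|
  n, acc = *args
  until n < 1
    n, acc = n - 1, acc + n
  end
  acc
end

sum_to.call(10, 0)       # => 55
sum_to.call(100_000, 0)  # no SystemStackError, unlike naive recursion
```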

Friday, April 06, 2007

Most popular blog posts

Mine! by Gini~ from flickr (CC-NC-ND)

Popularity tends to follow some sort of power law - a few posts are interesting to many people, and most posts are interesting to just a few (or only the author). Blogs often contain lists of the most popular posts in a sidebar. A person visiting such a blog can look at the list, and if there's anything on the blog that they would be interested in, it's most likely there.

Blogger doesn't support such lists, so I had to hack them on my own. Blogger also doesn't support statistics, so I'm getting them through Google Analytics. Did you notice a pattern here? Having a blog on Blogger means spending a lot of time hacking small tools to get things which WordPress bloggers get for free. Probably that's why so many hackers use Blogger instead of WordPress - creating tools is much more fun than using them.

At first I wanted to simply list the most popular posts according to Google Analytics, but the list isn't supposed to faithfully reflect their popularity - it's supposed to guess what would be most useful to readers. I got data from three sources:

  • From Google Analytics - number of unique visits
  • From del.icio.us - number of people who saved post url
  • From Blogger - number of comments

Comments

This wasn't the first metric I used, but the code is relatively simple. First of all, I use wget instead of net/http or open-uri. Ruby HTTP libraries are weak and hard to use for anything non-trivial, while wget is absolutely awesome - powerful and simple.

I also cache everything I download on disk, in meaningfully named files. It's not particularly important in "production", but during development having to wait for things to download is extremely distracting - it destroys productivity almost as badly as waiting for things to compile. So the following function fetches url from the Internet and caches it in file. If file is not specified, it's simply the portion of url following the last slash.
def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 retval = system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
 File.read(file)
end
And the whole script:
require 'rubygems'
require 'hpricot'

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 retval = system "wget", "--no-verbose", url, "-O", file unless File.exists?(file)
 File.read(file)
end

url = "http://t-a-w.blogspot.com/"
file_name = "blog-main"

doc = Hpricot(wget(url, file_name))
(doc/"div#ArchiveList a.post-count-link").each{|archive_link|
 url = archive_link.attributes["href"]
 next unless url =~ %r[\Ahttp://t-a-w\.blogspot\.com/[0-9_]*_archive.html\Z]
 archive_doc = Hpricot(wget(url))
 (archive_doc/".post").each{|post|
     post_link = post.at("h3.post-title a")
     link = post_link.attributes["href"]
     title = post_link.inner_text
     post.at("a.comment-link").inner_text =~ /\A(\d+) comments?\Z/
     comments = $1.to_i

     puts [comments, title, link].join("\t")
 }
}
Its output looks something like this:
6       How to code debuggers   http://t-a-w.blogspot.com/2007/03/how-to-code-debuggers.html
7       Programming in Blub     http://t-a-w.blogspot.com/2006/08/programming-in-blub.html
One more thing - I always use \A and \Z instead of ^ and $ in Ruby regular expressions, unless I explicitly need the latter. ^ and $ mean different things in different contexts and can cause weird problems. The ten most popular posts according to the number-of-comments metric were:
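To illustrate the ^/$ versus \A/\Z point above (a standalone example, not part of the script):

```ruby
# ^ and $ match at every line boundary inside a string;
# \A and \Z anchor only to the beginning and end of the whole string.
s = "first line\nsecond line"

s =~ /^second line$/    # matches - ^/$ anchor at the embedded newline
s =~ /\Asecond line\Z/  # nil - \A/\Z see the whole string
s =~ /\Afirst line/     # matches at the very start
```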

del.icio.us saves

I save all my blog posts on del.icio.us, tagged taw+blog. At first I did it because Blogger didn't have labels, but even now that it has labels, it's just useful to have all the URLs in one place. Thanks to web 2.0 buttons, it's just a single click to add a blog post to del.icio.us (or reddit or digg).

The number of people who saved a URL on del.icio.us might be an even better indicator of post interestingness than the bare view count - these people considered the post interesting enough to bookmark it, instead of just entering, looking at the cat pic, and going away ;-)

del.icio.us is kind enough to show the number of other people who saved a URL. Unfortunately it only does so for logged-in users, so the wget function needs to be slightly modified to make del.icio.us think I'm logged in. It would be too much bother to write login code just for such a simple script - reusing Firefox cookies is way simpler. Firefox uses silly strings in directory names - my cookies are in /home/taw/.mozilla/firefox/g9exa7wa.default/cookies.txt - but Ruby's builtin Dir[] function saves the day.
$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
 STDERR.print "Cannot find cookies\n"
 exit 1
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
 File.read(file)
end
And the entire script:
require 'rubygems'
require 'hpricot'

$cookies = Dir["#{ENV['HOME']}/.mozilla/firefox/*/cookies.txt"][0]
unless $cookies
 STDERR.print "Cannot find cookies\n"
 exit 1
end

class String
 # hpricot/text is supposed to be doing this, but it doesn't work
 def unescape_html
     ent = {"quot" => "\"", "apos" => "'", "lt" => "<", "gt" => ">", "amp" => "&"}
     gsub(/&(quot|apos|lt|gt|amp);/) { ent[$1] }
 end
end

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 system "wget", "--no-verbose", url, "--load-cookies", $cookies, "-O", file unless File.exists?(file)
 File.read(file)
end

def deli_pages(*tags)
 tags_u = "/#{tags.join('+')}" unless tags.empty?
 url = "http://del.icio.us/taw#{tags_u}?setcount=100"
 page_number = 1
 page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
 yield page
 while page =~ /<a rel="prev" accesskey=".*?" href="(.*?)">/
    page_number += 1
    url = "http://del.icio.us#{$1}&setcount=100"
    page = wget(url, "deli_bookmarks_page_#{tags.join('_')}_#{page_number}")
    yield page
 end
end

pages = []
deli_pages(%w[taw blog]){|page|
 doc = Hpricot(page)
 (doc/"li.post").each{|post|
     desc = post.at("h4.desc a")

     title = desc.inner_text.unescape_html
     link = desc.attributes["href"]

     saved_by_others = post.at("a.pop")
     if saved_by_others
         saved_by_others.inner_text =~ /\Asaved by (\d+)/
         popularity = $1.to_i
     else
         popularity = 0
     end

     pages << [popularity, title]
 }
}
pages.sort.each{|popularity, title|
 puts "#{popularity} - #{title}"
}
The top ten according to the number of del.icio.us saves is:
  • 84 - The right to criticize programming languages
  • 78 - How to code debuggers
  • 56 - Atomic Coding with Subversion
  • 38 - Modern x86 assembly
  • 28 - Prototype-based Ruby
  • 24 - Making Ruby faster
  • 20 - taw's blog
  • 19 - Segfaulting own programs for fun and profit
  • 18 - magic/help for Ruby
  • 17 - Yannis's Law: Programmer Productivity Doubles Every 6 Years

Google Analytics

Google Analytics has no API and is proud of it. Fortunately they provide reports in tab-separated format. URLs are not documented anyway - to extract a URL, open the Flash application, view the report, and copy its URL. I had problems with authorization too. It seems using Firefox's cookies.txt is not enough for Google Analytics to let me view reports - they probably use session cookies or something like that. I copy&pasted the cookie using FireBug and it worked. The modified wget function is now:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
 File.read(file)
end
The .google_analytics_cookie file looks like this: Cookie: AnalyticsUserLocale=en-GB; __utm... And the whole source:
$cookie_header = File.read("/home/taw/.google_analytics_cookie").chomp

def wget(url, file=url.match(/([^\/]*)\Z/)[0])
 system "wget", "--no-verbose", url, "--header", $cookie_header, "-O", file unless File.exists?(file)
 File.read(file)
end

class GoogleAnalytics
 def initialize
     @rid        = '1222880'
     @start_date = '20060701'
     @end_date   = '20101231'
 end
 def report_url(vid)
     "https://www.google.com/analytics/home/report?rid=#{@rid}&user=&amp;amp;amp;amp;vid=#{vid}&bd=#{@start_date}&amp;amp;amp;amp;ed=#{@end_date}&ns=10&amp;ss=0&fd=&ft=2&amp;sf=2&sb=1&amp;amp;amp;amp;dow=0&dt=10&amp;dtc=2&dcomp=0&amp;xd=1&x=1"
 end
 def content_by_titles_url
     report_url(1306)
 end
 def referring_source_url
     report_url(1208)
 end
 def content_by_titles
     wget(content_by_titles_url, "content_by_titles")
 end
 def referring_source
     wget(referring_source_url, "referring_source")
 end
 def self.run
     ga = new
     ga.content_by_titles
     ga.referring_source
 end
end

GoogleAnalytics.run
The Google Analytics top ten is:
  • 6631 - The right to criticize programming languages
  • 3776 - Making Ruby faster
  • 3066 - Yannis's Law: Programmer Productivity Doubles Every 6 Years
  • 2631 - Modern x86 assembly
  • 2562 - Atomic Coding with Subversion
  • 2033 - How to code debuggers
  • 1525 - Prototype-based Ruby
  • 1511 - Segfaulting own programs for fun and profit
  • 1211 - Big O analysis considered harmful
  • 930 - My first impressions of Erlang

The final results

On all criteria, "The right to criticize programming languages" got the top spot by far. The lists created by del.icio.us saves and by Google Analytics unique views both seemed reasonable. The number of comments seemed to be a poor predictor of overall popularity. In the end I added the percent of views and the percent of saves to get the final score:
# Uses 3 sources to determine popularity:
# * Number of del.icio.us saves
# * Number of Blogger comments
# * Number of visits according to Google Analytics

unless File.exists?("content_by_titles")
 system "./google_analytics_report"
end

unless File.exists?("blogger_comments_cnt")
 system "./blogger_comments >blogger_comments_cnt"
end

unless File.exists?("../del.icio.us/deli_popularity")
 Dir.chdir("../del.icio.us") {
     system "./deli_popularity.rb >deli_popularity"
 }
end

visits   = File.read("content_by_titles")
comments = File.read("blogger_comments_cnt")
saves    = File.read("../del.icio.us/deli_popularity")

statistics = {}

class Article
 attr_reader :comments, :visits, :saves
 @@comments_total = 0
 @@visits_total   = 0
 @@saves_total    = 0
 def initialize(title)
     @title    = title
     @comments = 0
     @visits   = 0
     @saves    = 0
 end
 def comments=(comments)
     @@comments_total += comments - @comments
     @comments         = comments
 end
 def visits=(visits)
     @@visits_total += visits - @visits
     @visits         = visits
 end
 def saves=(saves)
     @@saves_total += saves - @saves
     @saves         = saves
 end
 def comments_perc
     100.0 * @comments / @@comments_total
 end
 def saves_perc
     100.0 * @saves / @@saves_total
 end
 def visits_perc
     100.0 * @visits / @@visits_total
 end
 def to_s
     sprintf "%s (%.0f%% saves, %.0f%% visits, %.0f%% comments)",
         @title, saves_perc, visits_perc, comments_perc
 end
 include Comparable
 def <=>(other)
     (saves_perc + visits_perc) <=>
     (other.saves_perc + other.visits_perc)
 end
end

comments.each_line{|line|
 comments, title, url = line.chomp.split(/\t/)
 statistics[title] = Article.new(title)
 statistics[title].comments = comments.to_i
}

saves.each_line{|line|
 saves, title = line.chomp.split(/ - /, 2)
 next if title == "taw's blog"
 statistics[title].saves = saves.to_i
}

visits.each_line{|line|
 line.chomp!
 next if line == "" or line =~ /\A#/
 title, unique_views, *stuff = line.split(/\t/)
 if title =~ /\Ataw\'s blog: (.*)/ and statistics[$1]
     statistics[$1].visits = unique_views.to_i
 end
}

stats = statistics.values

puts "By saves:"
stats.sort_by{|post| -post.saves}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By views:"
stats.sort_by{|post| -post.visits}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By comments:"
stats.sort_by{|post| -post.comments}[0,10].each{|post|
 puts "* #{post}"
}
puts ""

puts "By total score:"
stats.sort.reverse[0,15].each{|post|
 puts "* #{post}"
}
puts ""
The final top 15 list is:
  • The right to criticize programming languages (17% saves, 16% visits, 11% comments)
  • How to code debuggers (16% saves, 5% visits, 2% comments)
  • Atomic Coding with Subversion (12% saves, 6% visits, 3% comments)
  • Modern x86 assembly (8% saves, 6% visits, 2% comments)
  • Making Ruby faster (5% saves, 9% visits, 3% comments)
  • Yannis's Law: Programmer Productivity Doubles Every 6 Years (4% saves, 7% visits, 4% comments)
  • Prototype-based Ruby (6% saves, 4% visits, 1% comments)
  • Segfaulting own programs for fun and profit (4% saves, 4% visits, 1% comments)
  • magic/help for Ruby (4% saves, 2% visits, 2% comments)
  • Fight to the death between Ruby and Python (3% saves, 2% visits, 5% comments)
  • RLisp - Lisp naturally embedded in Ruby (3% saves, 2% visits, 1% comments)
  • My first impressions of Erlang (2% saves, 2% visits, 4% comments)
  • Big O analysis considered harmful (1% saves, 3% visits, 2% comments)
  • The programming language of 2010 (2% saves, 2% visits, 7% comments)
  • iPod-last.fm bridge (2% saves, 2% visits, 4% comments)

Wednesday, April 04, 2007

How to add Web 2.0 buttons to Blogger Beta

YA? by spiicytuna from flickr (CC-NC-ND) As an unfortunate side effect of upgrading from the old Blogger to Blogger Beta, web 2.0 buttons stopped working. Here's a quick guide to getting them back. Enter the Dashboard, select "Template" and "Edit HTML", and check the "Expand Widget Templates" checkbox. Now locate the following fragment (or something much like it):

<p class='post-footer-line post-footer-line-3'/>
And replace it with the following code and save:

<p class='post-footer-line post-footer-line-3'>
Add to:
<a expr:href='"http://digg.com/submit?phase=2&amp;amp;url=" + data:post.url + "&amp;amp;title=" + data:post.title'>
<img src="https://photos1.blogger.com/blogger/1785/2897/1600/digg.gif" /> digg</a>
<a expr:href='"http://reddit.com/submit?url=" + data:post.url + "&amp;amp;title=" + data:post.title'>
<img src="https://photos1.blogger.com/blogger/1785/2897/1600/reddit.png" /> reddit</a>
<a expr:href='"http://del.icio.us/post?url=" + data:post.url + "&amp;amp;title=" + data:post.title'>
<img src="https://photos1.blogger.com/blogger/1785/2897/1600/delicious.gif" /> del.icio.us</a>
</p>
Now you should have Digg, Reddit, and del.icio.us buttons below each of your posts.

Technical details

Special hacking is needed because the buttons cannot simply submit the current URL: the reader might be on the blog's main page or one of the archive pages, not necessarily the post's own page. The code is rather straightforward. expr:href is evaluated and becomes href. data:post.url is the post URL, data:post.title is the post title, strings are quoted, and + concatenates. It seems necessary to double-escape special characters like &, which becomes &amp;amp;. You can put the buttons somewhere else within <b:includable id='post' var='post'>...</b:includable>. Support for different web 2.0 services, different images, and so on should be easy. If you have any questions, don't hesitate to ask.
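Adding a button for another service follows the same pattern. Here's a sketch for StumbleUpon; the submit URL is an assumption based on the common bookmarklet pattern of the time, so check the service's own documentation before using it:

```
<a expr:href='"http://www.stumbleupon.com/submit?url=" + data:post.url + "&amp;amp;title=" + data:post.title'>
stumble</a>
```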

Monday, April 02, 2007

Fixing Blogger feeds

Patches:  Computer Technician by VincenzoF from flickr (CC-NC)
Blogger software is not particularly good. I can live with whitespace in code examples being slaughtered, just as I lived without tags before Blogger went Beta, but not providing usable feeds makes me want to close my Blogger account and switch to WordPress.

Blogger feeds are broken. The only sensible way of sorting posts in a feed is by publication date. Blogger sorts them by update date instead. This means:

  • Fixing some typos breaks the feed.
  • Fixing URLs breaks the feed.
  • Even adding labels to posts breaks the feed.
There's simply no reasonable scenario in which Blogger's behaviour is correct. Fortunately it's possible to work around this bug. If it weren't, you'd be reading this post on WordPress.
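To see why sorting by update date breaks the feed, here's a small Ruby sketch; the entries are hypothetical hashes standing in for parsed Atom entries:

```ruby
require 'time'

# Two posts: an old one that was recently relabeled (so its update
# date is newer), and a genuinely new one.
entries = [
  {:title => "Old post, recently relabeled",
   :published => Time.parse("2007-01-10"), :updated => Time.parse("2007-04-01")},
  {:title => "New post",
   :published => Time.parse("2007-03-30"), :updated => Time.parse("2007-03-30")},
]

# Blogger's order (newest update first) puts the relabeled old post on top:
by_updated   = entries.sort_by{|e| e[:updated]}.reverse
# The order readers actually want (newest publication first):
by_published = entries.sort_by{|e| e[:published]}.reverse

puts by_updated.first[:title]    # the relabeled old post
puts by_published.first[:title]  # the genuinely new post
```

Re-sorting by the published date is essentially all the Yahoo Pipes workaround does.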

The hack is not mine, but comes from some other bloggers. Basically it uses Yahoo Pipes to process the Blogger feed and re-sort it correctly.

If you subscribe to this blog, please update your feed to point to http://pipes.yahoo.com/pipes/pipe.run?_id=bvvI9j_h2xGMh7iydbq02Q&_render=rss
or http://tinyurl.com/2adbxg for short.

The feed link in the sidebar is already fixed. <link rel="alternate" ...> links are generated by Blogger JavaScript instead of being included in the template, so there's no easy way of fixing them. Maybe I should switch to WordPress after all ...

Regression toward the mean


It's been a few days since my last rant, so here's a new one. There were times when Google was totally awesome compared to pretty much everything else; next to Google, Yahoo was simply uncool. Nowadays that's less and less true. Sure, Google still does great things every now and then, like GMail, but the competition does awesome things too, Yahoo Pipes and YUI being prime examples. Heck, even Microsoft can build an awesome product these days.

Things created by Google are nowadays not only "not revolutionary", but sometimes simply worse than average. Blogger Beta and Google Video are good examples. I'm not flaming Google in particular; it's just that regression toward the mean doesn't stop working just because it's Web 2.0.

Endless dancing with Stepmania .crs files

Disco Cat ! by Foxicat from flickr (CC-NC-ND) StepMania, instead of just letting me dance, wastes my time after each song on score screens, waiting screens, and song selection menus. Endless mode picks songs at random without any waiting, which would be cool if the default courses provided with StepMania didn't all suck:

  • Random "Standard" songs - many boring 4s and 5s.
  • Random "Heavy" songs - "Max 300" and other 10s way too often.
What I want is a random selection of songs in a specific range of difficulties. It turns out it's possible to do just that with very little hacking. Just enter the Courses subdirectory of the StepMania directory, and create an EndlessSixToEight.crs (or whatever.crs) file with the following content:
#COURSE:Endless 6-8;
#REPEAT:YES;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
#SONG:*:6..8;
The first line specifies the title, the second specifies that it's an endless-mode course, and the remaining lines are songs. 6..8 means a random song with numeric difficulty 6 to 8. To select just a single difficulty, use something like #SONG:*:8;.

It's important to copy&paste the same line all over: if you write #SONG:*:6..8; just once, you'll be dancing the same random song in a loop. 50 songs means about 90 minutes at typical song lengths (~105s), which should be enough. If it's not, just copy&paste another 50 lines.
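Instead of copy&pasting by hand, a few lines of Ruby can generate the course file; the filename and song count here are just the examples from above, so adjust to taste:

```ruby
# Generate a StepMania endless course file with 50 random songs
# of difficulty 6 to 8, instead of pasting the same line 50 times.
course  = "#COURSE:Endless 6-8;\n"
course << "#REPEAT:YES;\n"
course << "#SONG:*:6..8;\n" * 50

File.open("EndlessSixToEight.crs", "w"){|f| f.write course}
```

Drop the resulting file into the Courses subdirectory and you're done.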

Stepmania wiki contains more information about CRS format.