The best kittens, technology, and video games blog in the world.

Tuesday, December 26, 2006

magic/help update

I'm Cute! by Cracker Bunny from flickr (CC-NC-ND) It's been a long time since the revolution in Ruby documentation that magic/help was, and today there's a big new release. magic/help uses many interesting heuristics to look for documentation, so it is often able to find the right documentation where plain ri wouldn't. Unfortunately there seems to be a major mismatch between reality and the Ruby documentation. One good example is SomeClass.new methods. They are all documented, but none of them exists ! In reality there's only Class.new, which calls SomeClass#initialize. So magic/help "correctly" returned documentation for Class.new when asked about SomeClass.new. That was the worst case, but there are many more. In the new release magic/help is much smarter. It is now able to handle many quirks it previously couldn't, and it is tested against the full documentation database - it doesn't make even a single mistake now. magic/help now comes in tarball, zip, and gem formats. So just grab it, install the gem, and add the following to your ~/.irbrc:

require 'rubygems'
require 'magic_help'

Blinkenlights, part 1

Silly Furry Saturday by Buntekuh from flickr (CC-NC-SA) In two days, I'm going to the Chaos Communication Congress. CCC is about two things - computer security, and blinkenlights. I didn't want to be the only person at the whole congress who never made any blinkenlights in their life. Well, I did make a 16-bit ALU, and I even attached some diodes to its input and output ports, but they didn't blink, so it doesn't really count.

Actually back then I wanted to build a CPU out of simple 74LS TTL parts. When I started I literally (by literally I mean literally) couldn't tell a NAND gate from a NOR gate. It was an awesomely fast way of learning about hardware. Then I learned how to run simulations in Verilog, how to solder stuff together, and how simple CPUs work. It took me ages before I could reliably connect an inverter to a diode. It gradually became easier, and finally I had a pretty decent 16-bit ALU.

I didn't go any further, because by that time I had already learned more about hardware than anyone can without jeopardizing their sanity. Also, the register file I wanted now seemed way too difficult to actually build. It was about as complex as the ALU if you only count the number of 74LS chips used, but it would be a cabling nightmare. There was just no way to do it in a reasonable amount of time, and it would be very fragile. A worse problem was that I had absolutely no idea how to connect it to a computer. If I wanted to do anything useful with the CPU, it would need some way of reading and writing data, network access and so on. I wasn't anywhere near hardcore enough to build a memory controller and an Ethernet card on my own, so at least at first memory access and networking would have to go through a real computer. After it worked, I'd give it local memory, but a computer link would still be necessary for internet access, loading code etc. So the project got shelved. If you're interested, the design, Verilog files and photos are all available.

One thing I did in the meantime was getting a ColdHeat soldering iron. It's way better than traditional soldering irons, it's much safer, and it's very cheap, so there's really no excuse for not getting one if you want to play with hardware.

Of course the interesting thing is the connection to the computer. It seems that the parallel port is exactly what I was looking for. A parallel port has ground pins and 8 data pins, all operating at TTL-compatible 5V. It really couldn't have been any easier. Parallel ports have some anti-fry protection, but if it fails, the whole motherboard has to be replaced, so I didn't want to test it on my machine. Instead, I took some old box, burned a Knoppix CD, and used the following code (run with sudo):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/io.h>
#include <errno.h>
#include <string.h>

int main(int argc, char **argv)
{
    if(argc < 3) {
        fprintf(stderr, "Usage: %s port value\n", argv[0]);
        return 1;
    }
    int port = atoi(argv[1]);   /* parallel port base address, e.g. 888 (0x378) for LPT1 */
    int value = atoi(argv[2]);  /* byte to put on the 8 data pins */
    /* request access to a single I/O port; needs root, hence sudo */
    if(ioperm(port, 1, 1)) {
        fprintf(stderr, "Error: cannot access port %d: %s\n", port, strerror(errno));
        return 1;
    }
    outb(value, port);          /* write the byte to the data pins */
    return 0;
}
The hardware part is a bit of an overkill. The only thing you need is to cut a parallel cable and take the 8 data wires and any ground wire, ignoring the rest, but I soldered all 25 wires to a board. It was pretty quick, the ColdHeat soldering iron is really cool compared to my old one. Using a multimeter on the pins, I can already tell that the parallel port acts as expected - the code above indeed changes the pins in the expected way. You can see some photos from today. I could already use it for 8-diode blinkenlights, but that would be pretty lame. It should be pretty easy to create some circuit with more diodes that accepts 8-bit commands. A trivial one would be a bunch of flip-flops, with 6 bits for the address, 1 for the value, and 1 for a strobe. That would be enough for 64 bits, still not enough. So I really need some sort of address register - the computer would either first send the address and then the data, or the address register would automatically increase by one every time, whichever is more convenient. I'll think about it later; so far it's been a major success.
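To make the proposed command format concrete, here's a minimal Ruby sketch of the software side. The bit layout (6 address bits, 1 data bit, 1 strobe bit) is just the idea sketched above, and the ./parport name and the LPT1 base address 888 (0x378) are my own assumptions about how the C program above would be compiled and invoked:

# Minimal sketch of packing the proposed 8-bit command:
# bits 0-5 - flip-flop address (0..63), bit 6 - data value, bit 7 - strobe.
# The layout is just the idea described above, not a final design.
def command_byte(address, value, strobe)
  raise ArgumentError, "address out of range" unless (0..63).include?(address)
  (address & 0x3f) | ((value & 1) << 6) | ((strobe & 1) << 7)
end

def set_bit(address, value)
  # Present the bit with strobe low, pulse strobe high to latch it, drop strobe again.
  # "./parport" is a hypothetical name for the compiled C program above; 888 == 0x378 (LPT1).
  system("./parport", "888", command_byte(address, value, 0).to_s)
  system("./parport", "888", command_byte(address, value, 1).to_s)
  system("./parport", "888", command_byte(address, value, 0).to_s)
end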

Sunday, December 24, 2006

iPod-last.fm bridge update

Yoda, Darth and a shuffle by yinyang from flickr (CC-NC-ND) Good news for all iPod-owning last.fmers (doesn't it sound cool to be an "iPod-owning last.fmer" ...). A new version of the iPod-last.fm bridge just got released and can be downloaded here.

Changes in the new release:

  • Config moved to a separate file
  • Timezone handling fixed - it should now work automatically as long as the timezones on the computer and on the iPod are set correctly
  • Much more readable data format
  • Official client id from last.fm
  • A script to extract song ratings from iPod
  • Some compatibility issues fixed
The one thing that's still missing is a decent README file. I hope the comments in config.rb are enough, but if you have any questions, problems, or praise ;-), don't hesitate to contact me.

Wednesday, December 06, 2006

My "trolling" on Reddit

Chataigne making a face at me by isazappy from flickr (CC-NC-ND) I don't consciously "troll" on reddit (in this post by "reddit" I mean the programming reddit; I rarely use the main or joel reddits), or for that matter anywhere else. The only place where I can remember starting flamebait on purpose was the alt hierarchy on Usenet, and trolling is kinda the point there.

But we all know most flame wars are started unintentionally. Someone makes a completely honest comment, someone else feels strongly about the subject and honestly replies in a way that angers more people, and eventually everything is on fire, and people accuse each other of trolling. Genuine trolls are few and far between.

The way the reddit comment system works, everybody can moderate any comment by clicking either the up-arrow (+1) or the down-arrow (-1). If a comment's score falls to -5 or worse, it is hidden from the default view, and can only be shown by clicking "show comment". Comments start at +1 (reddit assumes everyone up-arrows their own comments), so if 6 more people down-arrow than up-arrow your comment, it will be hidden in the default view.
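Just to spell out the arithmetic (a trivial Ruby sketch; the -5 threshold and the implicit +1 are the only reddit-specific details, as described above):

# Comment visibility rule as I understand it:
# every comment starts at +1 (the author's implicit up-arrow).
def comment_score(up_arrows, down_arrows)
  1 + up_arrows - down_arrows
end

def hidden_by_default?(up_arrows, down_arrows)
  comment_score(up_arrows, down_arrows) <= -5
end

hidden_by_default?(0, 5)  # => false, score is -4
hidden_by_default?(0, 6)  # => true, score is -5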

Most of my comments on reddit are moderated positively, neutrally, or at worst slightly negatively. However, a few of them were down-arrowed intensely enough to get hidden. I'd like to take a look at them now. Maybe they'll reveal something about me, maybe they'll reveal something about the reddit comment system, or the reddit community. I have no idea yet.

First some statistics. I made 46 comments. The best rated of them got +17, the worst rated got -10. Mean rating is +2.5, median is +2, two comments got ratings bad enough to be hidden (-6 and -10), eight others had bad ratings but weren't hidden (-3, -2, five -1s, and 0). So ten comments weren't liked by the community. Here they are.

  • -10 points, "Common Lisp sucks". It is factually correct, but feels emotionally loaded. Still, one would think that a factual correction of an article called "Common Lisp - Myths and Legends", which actually contains mostly propaganda, would be treated better.
  • -6 points, "CLOS sucks". Another 100% factually correct comment. This time it had nothing emotionally loaded inside.
  • -3 points, "links to relevant content". This one baffles me the most. I linked two articles that were the most relevant articles on the subject. Well, the original poster "read that Steve Yegge piece", but half of what Steve Yegge writes is related to the question, so I wasn't really sure if he meant that one. The mention of RLisp was clearly marked with an emoticon, so it shouldn't matter.
  • -2 points, "keep your rants short". This one is emotionally worded, but the point is valid - the original rant should have focused on the important issues. Instead, it criticizes everything, whether it makes sense or not. This kind of ranting is harder to read, less useful for readers, and less powerful at getting the point through, whatever the point is.
  • -1 points, "formal methods suck". This one is directly related to the commented paper, and factually correct - some things (like formal methods) are popular in academia because they make good papers, but they aren't popular anywhere else, because they aren't very practical. Do you think you could publish a paper on unit testing a simple program ? You couldn't, but the formal methods guys could. And what are you going to use in real programs anyway ? Yeah, unit tests.
  • -1 points, "XML isn't that bad". A simple list of reasons why XML is better than S-expressions for markup, and better than "simple custom text-based formats" for configuration. I don't see anything controversial here.
  • -1 points, "Smalltalk is better than Lisp". I point out that Smalltalk has every feature on the list, and without Lisp parentheses. I also point out that the question of the relative power of Lisp macros and Ruby-style metaprogramming can be settled once and for all in RLisp. It's not in the post, but heavy macro use doesn't seem to require Lisp syntax, see Nemerle macros and Dylan macros [PDF] (nobody uses Dylan, but the concept of "skeleton syntax" in the paper seems very interesting).
  • -1 points, "Technical details why Common Lisp sucks". Someone disagreed with my comment, so I replied explaining some details.
  • -1 points, "Common Lisp was designed for Lisp machines". Factual correction, with links to the relevant sources.
  • 0 points, "Java is the most popular language". Someone claimed that Visual Basic is more popular than Java, and that popularity doesn't matter. I disagreed on both points.


So there doesn't seem to be a single troll among the downmoded comments. And it's clear that criticizing Common Lisp is a sure way of getting downmoded, whether the criticism is factual or not, and whether the comment is emotionally loaded or not.

This post seems pretty boring, so here are three extras.

First, a comment on reddit by some smug Lisp weenie. I think the poster was just angry at me, not trolling on purpose. Still, it's absolutely hilarious.

The main problem with Ruby is that Ruby advocates, like taw, are too much into Anime.


The second is a point from one of my downmoded comments:

If you aren't writing significant parts of your application in assembler - it's an absolute proof that performance is not critical for it.


Some people keep repeating that one "language" or another is slow. Assembly coders used to say that about C, now C luddites say that about Java, and smug Lisp weenies together with Java zealots about Ruby. The "performance" argument shows how desperate they are, having run out of more decent arguments. In the real world of course:
  • It is extremely atypical for programs to be CPU-bound by operations within the program. Most programs are programmer-bound, or network-bound, or disk I/O-bound, or CPU-bound within some library code (like SSL or movie decoding or whatever).
  • If the program is actually CPU-bound by its own operations (which is very rare), high-level optimizations like changing the algorithm, caching, etc. usually fix it.
  • In extremely rare case when high-level optimizations are not enough, performance-critical parts can usually be rewritten in lower-level language, like C for high-level languages, or assembly for C. The idea that one language must be used for everything (common among the Common Lisp folks) is retarded. The program is 99% high-level language, 1% low-level language, and performs as well or better than one written in 100% low-level language.
  • There are a few cases when even rewriting small parts in lower-level languages is not enough. This is extremely rare; you're unlikely to write such a program even once in your lifetime. A good example is codecs. But programs that need that much performance are never written in a single language anyway ! They invariably use very heavy assembly. Just look at any audio/video codec, or libssl, or whatever. I've never seen a single program that was too performance-critical for the "99% high-level language, 1% low-level language" formula yet didn't use any assembly.


And the third point, one that got me downmoded most:

CLOS is much less powerful than Smalltalk/Ruby-style object oriented system. In particular, it is impossible to emulate method_missing/doesNotUnderstand in CLOS.


All decent object-oriented languages (like Smalltalk, Ruby, and ... did I mention Smalltalk and Ruby ?) have been designed with object-oriented programming in mind. Bolting it on later invariably results in something uglier, harder to use, and less powerful. A few languages like Python survived the retrofitting relatively well; most of the time, however (Perl, OCaml, I'm not going to say a word about C++ here), the result is disastrous.

CLOS tries to bolt something it calls "object-oriented programming" on Common Lisp, mostly by heavy use of macros. The name "object-oriented programming", which originally (in Smalltalk) meant quite a lot, was abused to the point that it doesn't mean much any more. Some common abuses make it mean "class-oriented programming", "encapsulation", or even particular syntax (obj.meth(args)).

CLOS goes even further than that. Basically, it doesn't even have methods. What it has instead are "generic functions". The main reason for this is trying too hard not to introduce any syntax for method calls (also known as sending messages). So while other languages have function calls f(a, b, c) and method calls a.f(b,c), CLOS has one syntax for both: (f a b c). The big question is - how does it know whether (f a b c) is a function call or a method call ? It's simple - all methods used anywhere must first be predefined. So if any object has a method to_yaml, (to_yaml whatever) will try to call the method to_yaml on whatever, whether it has such a method or not. Actually there's nothing method-like about to_yaml - it's really just a function. This function can be defined separately for each "class". So in one place you write the definition for converting lists to yaml, somewhere else for converting numbers to yaml, and somewhere else again for converting widgets to yaml.

So simple object-oriented programming is possible with CLOS. Unfortunately it's impossible to emulate method_missing or do anything else advanced.

If a generic function like to_yaml is applied to an object of unknown kind, the no-applicable-method handler is called. Unfortunately it is impossible to set up this handler per-object ! So farewell, dynamic proxies which forward every message they receive to some other object.
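For contrast, this is roughly what such a dynamic proxy looks like in Ruby, where method_missing catches every message the object doesn't understand:

# A forwarding proxy in Ruby - method_missing catches every message
# the proxy itself doesn't understand and passes it on to the wrapped object.
class DynamicProxy
  def initialize(obj)
    @obj = obj
  end

  def method_missing(name, *args, &blk)
    @obj.send(name, *args, &blk)
  end
end

p = DynamicProxy.new([1, 2, 3])
p.size     # => 3
p.reverse  # => [3, 2, 1]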

Even worse, no-applicable-method is called only when the "method" was declared as one. So if foo was never defined as a generic function, (foo bar), instead of calling bar's no-applicable-method handler, or even a generic no-applicable-method handler, is going to call the undefined-function handler - and the undefined-function handler doesn't even get to look at the function's arguments.

A very limited implementation of method-missing in CLOS is here:

; Default method-missing
(defmethod method-missing (m (self t) &rest args)
  (format t "~s not found, no replacement method provided!~%" (cons m (cons self args)))
  ; Should throw an exception ...
  nil)

; Override global no-applicable-method handler
(defmethod no-applicable-method (m &rest args)
  (format t "~s not found, calling method-missing!~%" (cons m args))
  (apply #'method-missing (cons m args)))

; Now it's possible to make limited dynamic proxies
(defclass dynamic-proxy () ((obj :accessor obj :initarg :obj)))
(defun make-dynamic-proxy (obj) (make-instance 'dynamic-proxy :obj obj))
(defmethod set-obj ((self dynamic-proxy) new-obj) (setf (slot-value self 'obj) new-obj))

(defmethod method-missing (m (self dynamic-proxy) &rest args)
  (apply m (cons (obj self) args)))

Of course it breaks with multiple dispatch (that's a great reason not to have multiple dispatch), with methods that weren't predefined, and basically with pretty much everything non-trivial.

Anyway, The Right Thing is to separate calling functions from sending messages. So functions would be called like (f a b c), and methods like (send obj 'meth foo bar). As sending messages is extremely common, we could even introduce a different kind of parentheses for it, let's say [obj meth foo bar]. Now just add full metaprogramming support, a decent standard library, clean up the legacy stuff, and you end up with RLisp.

RLisp sucks in many ways, but at least it has a decent object system. Unlike CLOS.

Thursday, November 30, 2006

The new Coca Cola Chicken

Spyk Enjoying A Drink by mary cabbie from flickr (CC-NC-ND) When I started cooking, I had absolutely no idea what I was going to prepare. I just looked around the kitchen for things that seemed to fit. It's hard to explain what exactly was on my mind when I selected the ingredients, but they seemed to be somehow right. The result tasted great !

  • 500g chicken breast, in small pieces (about 2-3cm cubes, like for Chinese cuisine)
  • 480ml (1 bottle) "hot" ketchup
  • 500ml (2 cups) Coca Cola or Pepsi
  • 3 medium tangerines (250g), in pieces
  • 3 tablespoons (50ml) honey
  • 2 tablespoons (30ml) Garam Masala
  • half teaspoon (2g) monosodium glutamate
Mix everything in a pot and cover with aluminium foil. Put the pot in a hot oven for about 60 minutes. Serve with rice.

That's the recipe. Now on my intuitions.

Aluminium foil or some other cover is important. Ketchup is thick, so the sauce is medium-thick before the baking, and if baked without a cover, water would evaporate very quickly and the sauce would become very thick. As sugar does not evaporate with the water, it would also become too sweet. It's definitely unnecessary to add extra thickeners like starch.

To make the dish less sweet, lower the amounts of ketchup and coke to about 350ml (1.5 cups) each. I think it might be a good idea to start with these lower values, and go up only if you want something more. It's too sweet to be eaten without something like rice.

Any fruit can be added, as long as its taste is not too strong, as it shouldn't dominate the dish. Some vegetables like tomatoes and onions would do as well. Adding either honey or jam would do - I think version without fruit would be better with a lot of jam (like 150ml or 2/3 cup), version with fruit would be better with a bit of honey (like 50ml or 3 tablespoons). It's probably better to use a single-fruit jam.

Then of course we need some "generic" spicing. Depending on how you feel, curry, or some other Indian cuisine spice mix (like garam masala) would be right. You can try your own mixes, with spices like ginger and garlic. Adding either chili or significant amount of soy sauce would probably require modifying other parts of the recipe. It should be at least a bit spicy.

The dish is going to be pretty mild. I used "hot" ketchup, because for some reason on the stuff they sell in Europe, the label "extra hot" means "hot", "hot" means "mild", and "mild" means "no taste at all". I've heard that in some countries they actually got it right, in which case you can use normal ketchup.

I didn't add salt (even to rice) or anything salty, as I don't like salty food much. If you feel otherwise, a bit of salt or soy sauce should be compatible with the recipe, but not too much.

The ingredients have fairly strong taste, but they blend and become milder with cooking, so don't worry.

For a more extreme version of Coca Cola Chicken, take a look at this YouTube video:

Tuesday, November 21, 2006

Modern drug design for dummies

Bunny by Charles DH Crosbie from flickr (CC-NC-ND)
Designing new drugs is an important part of medical science that people know very little about, and about which there are many misconceptions. It would take way too long to tell the full story, so here's an abridged version. Details and variations are skipped, but it should give you a good big picture view.

First rule of designing new drugs - don't. It's extremely expensive. Exact figures are difficult to get, but they're in the hundreds of millions of euros. Finding a promising molecule is expensive, and running all the tests imposed by health authorities like the FDA even more so. So most research actually goes into improving existing drugs, mixing them and so on - it's faster, cheaper, and the authorities require a lot less testing.

Even when scientists actually work on new drugs, they tend to work on ones similar to existing drugs - either similar molecules, or different molecules working in similar way.

Why drugs ?



Why is medicine so much into drugs as opposed to other kinds of therapy ? The main reason is price - drugs are extremely cheap, and need no specialized medical personnel to administer, especially orally taken drugs. A year's worth of drugs is typically much cheaper than a single surgical procedure or a week at a hospital. That doesn't mean drugs are cheap to buy. Effective patent protection for drugs lasts just a few years, and when it's over the competition immediately enters the market with identical generic equivalents, causing prices to plummet, so the full costs of research must be recouped very quickly, by heavily overpricing new drugs. Add to that the cost of huge marketing campaigns, without which adoption would be too slow, and you have the answer to high drug prices. That, and the fact that neither the patient nor the doctor (who decide which drugs to use) pays for the drugs - most of the price is typically covered by either a state healthcare system or private insurance. So there is usually little incentive for choosing a cheaper but less effective older drug, a situation rarely found in other parts of the economy.

Drugs are also popular because they can handle so many different health issues, and pretty much any doctor can prescribe them. Most other therapies can handle a very narrow range of conditions and require highly specialized personnel.

The target



In the ancient past (like 50 years ago) people discovered drugs mostly by accident. They knew they worked, but had very little idea how. In modern drug design the "how" question is asked before the search for drugs even starts. Most commonly we want to affect some molecular process associated with the disease - usually by blocking one of its enzymes, with minimal effects on everything else.

For example to treat HIV infection, we want to block reverse transcriptase (the enzyme which copies the viral genome into DNA so it can be integrated into the cell's genome), or the protease (the enzyme used to assemble new viruses). To fight inflammation, we want to block the cyclooxygenase enzymes. Against depression - we block serotonin reuptake by neurons. In most diseases there are at least a few promising molecular targets. How do we even know where to start ? That's what basic medical research is for ! By studying how diseases work, we are later able to target their vulnerable aspects. Unlike drug design itself, this research is more often than not publicly funded.

The selected target must then be verified. If you think HIV needs reverse transcriptase to be infectious, genetically engineer HIV without reverse transcriptase and check whether it really is no longer infectious.

Some diseases have many targets to choose from. Bacterial infections are particularly easy. Because bacteria are complete cells evolutionarily very far from humans, it is fairly simple to find vital enzymes in them that have no counterparts in humans, and block them. That's why antibiotics were so successful (the main problem here is not harming the mitochondria, which were originally symbiotic bacteria and still share many enzymes with their free-living cousins). Viruses are far more difficult, because they simply reuse the host's cellular machinery and have only a few enzymes of their own. Even more difficult are cancer cells, which are genetically almost identical to normal cells.

How to block the target ?



So we selected a target and convinced ourselves that it will work. What now ? Most commonly the target is a protein enzyme, which has one or more active sites - the parts where the reaction occurs. Usually we want to design a drug that will bind to one of them so tightly as to disrupt all normal functionality (competitive inhibition). A common alternative is binding somewhere else in a way that disrupts the enzyme's shape and makes it inactive (noncompetitive inhibition). Enzymes can be disrupted in two ways - either we destroy the enzyme chemically (irreversible inhibition, like acetylsalicylic acid, i.e. aspirin), or simply bind very tightly without causing any chemical changes (reversible inhibition, like ibuprofen). Reversible inhibition is more popular, as it is less likely to produce side effects, and both are usually equally effective.

It is usually trivial to find the sequence of amino acids forming the enzyme by simply looking at the genome. The harder part is finding the 3D structure of the enzyme. This problem (protein folding) is considered too expensive to compute for now, but projects like Folding@home are beginning to change that. Not everybody is optimistic about it, but perhaps in 10-20 years folding will be routinely performed in silico (on a computer). For now experimental methods are typically used, the most popular of which are X-ray crystallography and nuclear magnetic resonance. 3D structures are published in public databases like the famous Protein Data Bank.

Unfortunately proteins take many different forms, and it's difficult to guess which one is the biologically active form. It's most difficult for transmembrane proteins - a very large and important group of proteins that live in cellular membranes. We don't know how to get an accurate 3D structure while they're in the membrane, and taking them out changes their structure completely. This is one of the hottest areas of drug design research.

When we have an accurate 3D structure, we need to find the active sites. There are many methods. If the 3D structure was determined with some known inhibitor drug bound to the protein (a very common case - more often than not we design better drugs for old targets), we simply need to look where the drug sits. We can also guess the active site by finding the amino acids that are most "conserved". Enzymes are almost never unique - humans, mice, rats, and so on typically have similar but not identical enzymes. Changes in the active site are very rare (the enzyme wouldn't normally work any more, causing disease or death), while changes in other sites are pretty common. If the conserved amino acids are all in one place, that's most likely our active site. We can also use geometric methods (active sites tend to look like small "cavities") or a computer simulation.
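As a toy illustration of the "most conserved position" idea, here's a short Ruby sketch that scores each column of an already-aligned set of homologous sequences by how often its most common amino acid occurs. Real tools use proper alignments and substitution matrices; the sequences below are made up:

# Toy conservation score: for each column of an alignment, the fraction
# of sequences sharing the most common residue. Columns near 1.0 are
# candidates for the active site.
def conservation_scores(aligned_sequences)
  length = aligned_sequences.first.size
  (0...length).map do |i|
    column = aligned_sequences.map { |seq| seq[i] }
    column.group_by { |aa| aa }.values.map(&:size).max.to_f / column.size
  end
end

# Example with made-up 8-residue fragments ("human", "mouse", "rat"):
conservation_scores(["MKTAYIAK", "MKTSYIAK", "MKTAYLAK"])
# => [1.0, 1.0, 1.0, 0.666..., 1.0, 0.666..., 1.0, 1.0]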

What does a good drug look like ?



Before going any further, we should consider what exactly we want to develop. A drug should definitely be able to treat some disease, but that's only part of the story. It must be cheap to manufacture, reasonably stable for storage, and fit many other criteria, but the main issues are Absorption, Distribution, Metabolism, Excretion, and Toxicity:

  • Absorption - Drugs that can be administered orally are strongly preferred; other methods like injection (insulin before the introduction of insulin pumps) or inhalation (like the anti-flu drug zanamivir) are used only when oral use would be impossible. In the case of oral drugs, it is extremely important for them to be well absorbed from the digestive tract; other routes are more tolerant. Drugs must also be able to pass through the cellular membrane from blood to cells, and in the case of drugs affecting the central nervous system, to pass the blood-brain barrier. One example is the neurotransmitter serotonin, which cannot pass the blood-brain barrier. Instead either its precursor 5-HTP is taken, or drugs like SSRIs that increase the effects of existing serotonin.
  • Distribution - Drugs are commonly needed only in some parts of the body. They also tend to be distributed unevenly among organs and tissues. It is important for a significant portion of the drug to reach the intended site. If the drug isn't distributed well, it lowers efficacy and increases side effects. This is probably most crucial in the case of cancer, as anti-cancer drugs tend to have severe side effects.
  • Metabolism - the body doesn't let foreign substances move around freely - it uses a wide range of methods to break them down. If a drug is metabolized too easily, efficacy will be low. It would be even worse if the products of such metabolism were harmful. A good example is methanol, which isn't overly harmful itself, but the alcohol dehydrogenase enzyme breaks it down into extremely dangerous formaldehyde and formic acid. Sometimes we actually want the drug to be metabolized, as it is the product that is active, not the original drug (such a drug is called a prodrug). Prodrugs are most commonly used for easier absorption.
  • Excretion - drugs would be very dangerous if they could freely accumulate in the body and keep affecting it long after administration ceased. Most drugs are either excreted in urine (partially metabolized, partially unchanged) or broken down into simple molecules like carbon dioxide and water.
  • Toxicity - drugs do have side effects, and that is not going to change. More drugs than not may cause nausea, dizziness, headaches, and the occasional allergic reaction, and many important drugs are significantly worse than that. If possible, side effects in new drugs should be less severe, but they won't rule out drug approval if they are offset by increased efficacy, a different range of applicability, or at least are significantly different. A good example is the antibiotic vancomycin, which has more severe side effects than most other antibiotics. But many bacteria are resistant to other antibiotics, so vancomycin is very useful in spite of the side effects. It isn't necessarily more effective - as more benign antibiotics were more commonly used, bacteria had a higher chance of developing resistance to them. So paradoxically, vancomycin is more useful because of its more severe side effects. Another example is rofecoxib (Vioxx), a selective COX-2 inhibitor, which causes fewer disturbances in the gastrointestinal tract than traditional non-steroidal anti-inflammatory drugs like naproxen (non-selective COX inhibitors), while increasing cardiovascular risk. Depending on the patient, either of them may be preferable.


The list of all aspects taken into account is very long. The short story - the new drug should be drug-like, that is, similar to successful drugs.

The famous Lipinski's Rule of Five states that typical orally administered drug has:
  • no more than 5 hydrogen bond donors (like OH and NH groups),
  • no more than 10 hydrogen bond acceptors (like N and O atoms in rings),
  • molecular weight under 500,
  • partition coefficient log P (relative solubility in octanol and water; it estimates how hydrophobic the molecule is) under 5.


One drug that is far from this description is insulin, with a molecular weight of 5808. And indeed, it is impossible to administer it orally, and the only practical way to synthesize it is to use genetically modified organisms.
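The rule is mechanical enough to check in a few lines. A minimal Ruby sketch - the descriptor values would normally come from a cheminformatics toolkit, and the example numbers below (other than insulin's molecular weight quoted above) are only illustrative:

# Lipinski's Rule of Five as a simple predicate. The four descriptors
# (H-bond donors, H-bond acceptors, molecular weight, logP) have to be
# computed elsewhere - this only applies the thresholds.
def lipinski_ok?(hbond_donors, hbond_acceptors, molecular_weight, log_p)
  hbond_donors <= 5 &&
    hbond_acceptors <= 10 &&
    molecular_weight < 500 &&
    log_p < 5
end

lipinski_ok?(2, 5, 350.0, 2.1)  # => true, a typical drug-like small molecule
lipinski_ok?(0, 0, 5808, 0)     # => false, insulin's weight alone fails (other descriptors are placeholders)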

Getting a Hit



So we have a 3D structure of a verified target, we know where to bind, and we know the intended result. What next ? There are a few ways, but by far the most popular is docking. Simply take a database of, let's say, 100 million molecules, and run a computer simulation to see how strongly each of them binds to the target. This is pretty easy - atoms and groups of some kinds attract when they are close to each other, while other kinds repel. Just sum all such interactions to get a rough estimate of the binding free energy. This isn't particularly accurate, but it's very fast - and we simply want to go down from 100 million molecules to a small number like a few hundred. First the fastest and crudest methods are used to rule out the obviously bad matches. Then increasingly accurate and increasingly slower methods are applied until we get a reasonable number of hits.
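As a caricature of the crudest scoring step, here's a Ruby sketch that just sums a distance-dependent term over all ligand-protein atom pairs. Real scoring functions add van der Waals, hydrogen-bond, desolvation and other terms; this only shows the "sum of pairwise interactions" shape:

# Extremely crude pairwise scoring sketch. An atom is { coords: [x, y, z], charge: q }.
# Lower (more negative) score means stronger predicted binding in this toy model.
def distance(a, b)
  Math.sqrt(a.zip(b).map { |p, q| (p - q)**2 }.sum)
end

def interaction_score(ligand_atoms, protein_atoms)
  ligand_atoms.sum do |la|
    protein_atoms.sum do |pa|
      r = distance(la[:coords], pa[:coords])
      # toy electrostatics (opposite charges attract -> negative term)
      # plus a steep penalty when atoms overlap
      la[:charge] * pa[:charge] / r + (r < 1.5 ? 100.0 / r : 0.0)
    end
  end
end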

Unfortunately, nobody really believes in docking. All results are verified in vitro. Due to the sheer number of experiments that need to be performed, automated facilities are used. This is so-called high-throughput screening, and a major fully automated laboratory can test as many as 100,000 compounds a day. The molecules that bind best are our "hits".

Hit to Lead



Not all hits will become drugs - a still fairly large number of "hits" must be reduced to a small number (like 5) of leads. Many experiments are performed in silico and in vitro (from simple chemical assays to cell cultures). Is the molecule absorbed well through cellular membranes ? Is it stable ? Is it soluble in water ? Is it non-toxic ? Can we easily synthesize it ? Is it selective enough (does it avoid significantly affecting other enzymes) ? Isn't it metabolized too rapidly ? Finally, is it free of other companies' patents ? Probably none of the hits fits all the criteria, so they're modified until they do reasonably well.

A very diverse set of tests is applied, but basically we want to develop drugs that are "drug-like", or similar to successful drugs (using rules like Lipinski's Rule of Five). But we don't want "drug-like" leads. What we're looking for are "lead-like" leads, or similar to successful leads. Turning a lead into a drug candidate usually makes it bigger, more complex, and more hydrophobic, so we're interested in leads that are smaller, simpler, and less hydrophobic than good drugs.

Lead optimization



By now we have a few promising molecules. It's still not time for human testing. First, we want to optimize the leads. For each lead, a vast number of similar molecules is synthesized and tested, and the most successful ones become drug candidates. The testing is again in silico and in vitro. Usually a modification is the addition of some chemical group or the replacement of one group by another, so drug candidates tend to be bigger and more complex than leads.

It is important to develop cheap and efficient methods of drug synthesis at this point, as previously only milligram quantities were required, and large-scale testing will require kilogram quantities.

Animal testing



After many experiments with computers, test tubes, and cell cultures, we hopefully have a few promising drug candidates. However, no regulatory authority is going to let us proceed directly to human testing. The safety and efficacy of drugs must be tested on animals first. This is a very annoying part, because it's very expensive, and the results are only weakly correlated with results on humans. The most common test animals are mice (about 80%) and rats (about 20%); all others - other rodents, primates, rabbits, dogs, etc. - together make up less than 2%.

Rodents are reasonably cheap, but very different from humans, so sometimes rodents with some human genes are used. Other animals are even more expensive, so they're used mostly when the rodents won't do. For example there is no way to infect mice with HIV, so primates need to be used to test HIV drugs.

In drug development less expensive methods are always preferred to more expensive ones. So whenever possible, human testing is replaced by animal testing, animal testing by cell cultures, cell cultures by simple chemical assays, and assays by computations. By Moore's law, computers get 100x more powerful every 10 years. In vitro testing is becoming more automated and cheaper very rapidly too, and more complex experiments with cell cultures are starting to become automated as well. As they get cheaper, they can handle more complex and more realistic setups, and be more accurate. But there is no way to automate animal testing, to make it cheaper, to significantly increase its throughput, or to make it significantly more accurate (human-mouse hybrids would probably do, but that would be a public relations disaster). So in my opinion animal testing is going to gradually become less and less relevant, and at some not-so-distant point in the future disappear completely.

Related to increasing automation is the fail early doctrine. Early phases of drug development are relatively cheap, while late phases like human testing are very expensive. So if the drug doesn't show much promise, experimentation should be terminated as early as possible. Many drugs that would eventually work are rejected this way, but it's cheaper overall.

In many countries (EU, USA) but not all (Japan) animal experimentation requires a licence or even a government approval of every single experiment.

Human testing - Phase I



You're probably wondering when we start testing whether the drug works on humans. Not yet. We need to apply to the regulatory authorities for permission to start human testing, but it's only going to be safety testing, the so-called Phase I clinical trials.

Safety testing verifies that the drug has no unexpected adverse effects on a small group (like 30 - exact numbers vary a lot depending on the condition, so don't read too much into them) of healthy individuals. Most drugs are expected to have some side effects, but they should all be documented. If an unexpected side effect is found, even a relatively insignificant one, the regulator is likely to require further testing at some earlier stage before proceeding any further.

In addition to safety, pharmacokinetics (what happens with the drug in the body, how is it absorbed, distributed, metabolized, and eliminated) and pharmacodynamics (what desired and undesired effects the drug has in the body) of the drug at different dosages are evaluated.

At this point companies typically apply for patents. In most countries (including the EU) only the "first to file" a patent application can get the patent, and in the few that follow the "first to invent" rule (like the USA), it would take a long and costly lawsuit to recover the patent if someone else filed first. So companies don't want to wait too long. On the other hand, the patent only lasts 20 years (previously 17 years), so filing too early means a shorter monopoly. Because clinical trials and waiting for all the approvals take many years, especially if there were problems, the actual patent monopoly is often just a few years.

Human testing - Phase II



Hopefully everything went well, and we can finally test how well the drug works. This requires another approval from the authorities. Phase II clinical trials measure drug efficacy on a limited number (like 200) of actual patients in highly controlled conditions. This point, very late in drug development, is the first time efficacy is evaluated under realistic conditions, and unfortunately many drugs fail here, and such late failures are very expensive.

The tested drug is supposed to be more effective than all existing drugs, have less severe side effects, be more widely applicable, and so on. The rules are not exactly fair - if a more effective drug is registered first, a less effective one will be rejected. But if a less effective drug is registered first, and a more effective one is found later, the former won't be pulled from the market. The most extreme example is probably acetylsalicylic acid (aspirin), which has so many side effects that it would never pass the drug registration process today, or at best would end up as a prescription drug for a very limited range of conditions. Most authorities are far on the paranoid side - accepting a drug that has to be pulled later is a political disaster, while rejecting or delaying a perfectly fine drug doesn't cost them a dime. Procedures are usually more lenient for the most deadly diseases like cancer and AIDS, and for rarely occurring diseases (so-called "orphan diseases").

Human testing - Phase III



If Phase II went well, the authorities may approve proceeding to Phase III clinical trials - that is, wider randomized testing of the drug, on hundreds or even thousands of patients. At this point we have preliminary evidence that the drug is safe and effective, and the wider trials will provide information on interactions with other drugs or conditions and on less common side effects, and give a final confirmation that the drug is indeed safer and more effective.

After Phase III is completed, the company which developed the drug applies for registration. It would be extremely costly and painful to fail here; fortunately it doesn't happen that often.

Success !



As I said, the first rule of designing new drugs is: don't. So when the new drug gets to the market, the research team doesn't go back to designing another drug - very often even more intense research on the newly developed drug starts, sometimes even before Phase III is over. Extending it to more conditions, improving bioavailability, work on similar molecules, combinations with other drugs - such research can be extremely profitable, as it carries much lower risk, nobody was there before, and as the drug is freshly patented, everything containing it is covered.

It would mean a guaranteed stream of money if not for two issues. First, the patent doesn't last that long, and a few years have probably already passed since filing (which usually happens around the Phase I clinical trials).

The other issue is the competition. Usually all simple modifications of the molecule are covered by the patent, but the target is not covered by patents (the big pharma tried to cover these too, but courts tend to throw them out). For most targets it's not exactly difficult to find alternative drugs.

A great example is the inhibitors of the cGMP-specific phosphodiesterase type 5 enzyme. The first one, sildenafil (Viagra), was patented in 1996 and approved by the FDA on March 27, 1998. Depending on the country, the patents will expire somewhere around 2011-2013. Based on patent law alone, that would be a 15-year monopoly over a huge market. However the FDA soon approved two different erectile dysfunction drugs targeting the same enzyme - vardenafil (Levitra) on August 19, 2003, and tadalafil (Cialis) on November 21, 2003. That's just 5 years and a few months.

So even before the drug is registered, a huge marketing campaign is started to ensure its speedy adoption. This adds even more to the overall cost, but without it the very valuable monopoly time would be lost. After generic drugs or other competition enter, prices can stay high for some more time due to brand recognition and plain inertia, but the profits fall quite fast, so you'd better hurry.

That's about what you need to start designing new drugs. At least if your rich uncle dies leaving you a few hundred million euro. ;-)

The Pharmaceutical Industry


Unless you're interested in political issues like this one, simply ignore this section.

No discussion of drug development would be complete without at least a mention of the pharmaceutical industry.

During the first few years, the drug is heavily "overpriced" compared to its cost (this is intended as a statement of fact, not a moral judgment). Commonly, the price would fall by over 90% if free competition were allowed. A good example was the introduction of generic antiretroviral drugs in India in 2000, which caused prices to fall from $778 a month to $33 a month by 2003 (a 96% decrease), which also raised the share of people living with HIV who receive antiretroviral therapy from 22% to 44% and led to a huge decrease in the number of HIV-related deaths - but the main point here is how "overpriced" the drugs are. This only compares prices against marginal costs, that is manufacturing and basic operational costs, and doesn't include things like research.

We can also compare against total costs, including manufacturing, research, development, marketing, sales, CEO compensation, opportunity cost of capital, and everything else. In 2004 the top ten big pharmaceutical companies had $305 billion in revenue, $64 billion in net income, and just $43 billion in research and development spending. Average net income at 21% of revenue is far above almost any other industry, including big oil. More typical values are around 5%.

If some industry has very high profits, capital would normally flow into it from less profitable industries, with loads of new companies joining and competition lowering prices, until average profits are back in the industry-standard range. Getting exact numbers requires a bit more work than just pulling them from a chart on Wikipedia, but pharmaceuticals are undeniably a very profitable industry, this has been the case for quite some time, and industries can stay very profitable for a long time only due to high barriers to entry. In this case - mostly government-issued drug patents.

Actually the taxpayers fund all medical research - basic research in academia, and drug development by having government healthcare systems pay any company which successfully develops new drugs. After all, virtually nowhere do patients pay directly for new drugs - it's always either public healthcare or private insurance. This is great, because the taxpayers only pay for successful developments, not for the failures.

How much does it cost ? Let's make a simple model - let's say the big pharma loses all patent and other protections, stops doing any research and development at all, their profits are brought down to the industry average - that is, 5% of revenue (and let's check 10% too) - and all the savings go to publicly funded research (which already does most of the basic medical research). If $305 billion was revenue, $64 billion net income, and $43 billion R&D, then non-R&D costs are $198 billion; with a 5% net margin that's about $208 billion of revenue, or $220 billion with a 10% margin. That leaves the taxpayers with an extra $85 billion to $97 billion to fund drug development. So unless the big pharma is 97% to 125% more efficient than academia at drug development, taxpayers would benefit from the switch. Universities aren't particularly well regarded for their ability to bring research results to the market, so their R&D would probably be less efficient, but would the difference really be that large ? The largest part of the expenses is, after all, the clinical testing required by the regulatory authorities.
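The arithmetic behind those numbers, spelled out in a few lines of Ruby (all figures are the 2004 top-ten totals quoted above, and "industry average" profit is taken as a share of revenue):

# Back-of-the-envelope model from the paragraph above, in code.
revenue  = 305.0 # $ billion, top 10 pharma, 2004
income   = 64.0  # $ billion net income
research = 43.0  # $ billion R&D

non_rd_costs = revenue - income - research    # => 198.0

[0.05, 0.10].each do |margin|
  # revenue needed if profit is `margin` of revenue and R&D is dropped entirely
  new_revenue = non_rd_costs / (1 - margin)   # => ~208.4 and 220.0
  savings     = revenue - new_revenue         # => ~96.6 and 85.0
  # how much bigger the publicly funded drug development budget would be
  # than the current private R&D spending
  ratio       = savings / research            # => ~2.25 and ~1.98
  printf("margin %2.0f%%: savings $%.0fbn, %.0f%% more than current R&D\n",
         margin * 100, savings, (ratio - 1) * 100)
end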

This back-of-the-envelope computation is far too unrealistic for any serious use; however, a bad model is still far better than the hand-waving approach commonly used to discuss the big pharma, or pretty much any other subject in politics or economics. But it seems the taxpayers should consider subsidies to partially cover clinical trials of publicly developed, patent-free drugs. This avoids financing most of the failures (if a drug got to clinical trials, it's not a total flop), encourages practices that lead to successful designs, and is probably cheaper than paying for the patented drug later.

Thursday, November 09, 2006

Baka Y2K6

Sunny / 001 by !ºrobodot from flickr (CC-NC-ND)
It had been two years since my last anime convention. The one I visited last week, BAKA (Very Attractive Anime Convention) Y2K6, was an attempt at resurrecting the famous BAKA series, the last of which took place two years ago. Many people, especially those who organized BAKA in the past, objected to using that name for a convention that was pretty sure not to reach the level of the earlier BAKA conventions. They were mostly right, it didn't really live up to the name. Nevertheless, it was a lot of fun.

First, I don't remember anything nearly as disorganized. The draft program and even the opening hour weren't put on the website until a day before the convention. Everyone got a copy of the planned program at the entrance, but it was completely useless - the lists of anime weren't there, the room numbers were wrong, and half of the things either didn't take place or took place at a different time in a different room. The most up-to-date list of events was hung near the entrance, but it wasn't very accurate either.

I'm not complaining for no reason - because of the mess I missed half of the Hentai Night, and a panel on Lolita Complex ! Hentai Night was completely unannounced, and the Lolita Complex panel took place a day before it was supposed to.

As far as attractions are concerned, some really cool anime were shown. I liked Death Note and Code Geass the most, both of which started airing in Japan in October 2006. It's scary how fast fansubbers can be. Other anime I liked were REC, The Third - Aoi Hitomi no Shōjo, Arashi no Yoru Ni, and .hack//Roots.

I'm not sure what to think about Dead Leaves. It had no plot, it looked absolutely horrible, and the humor was really crude (Chinko Drill). And somehow it was still really enjoyable to watch.

There was obviously a DDR room. Unfortunately it was closed for the night, and very crowded during the day, so it wasn't much fun. The best part about it was a new (and not yet released) mix containing songs like School Rumble intro. Really great.

The console room was too crowded, so I didn't even bother. There were two LARPs (whatever). People were playing go everywhere. There were some panels (more on that later). The corridors were taken over by people selling things like yaoi dōjinshi and Hard Gay stickers, or doing things like body painting and free hugs. Mostly because of the name, the convention was simply flooded with people. And that's important, as conventions are mostly social activities.

That's pretty much what the convention was about. I want to write a bit more about two panels I attended - a "Seppuku tutorial" and the "Why people hate Japan ?" panel.

Japanese swords

"Seppuku tutorial" was pretty funny, and pretty scary. The funny part was of course seppuku tutorial itself. The scary part was the following discussion, and some of the participants. To my astonishment, many people actually believe in magical properties of Japanese swords. Stuff like them being made by billions of folds over 45 years, being able to cut through everything like butter, and of course being million times better than any other sword ever made.

This is of course pure crap. The real story is more or less like this. Japan had much less iron than Europe, so it was expensive. It was also of dreadful quality. So they had to spend much more time on each sword, and as the iron itself was expensive, the extra labour didn't make much difference to the overall cost.

The simplest way of making an iron sword, the one used for the Roman gladius and by other ancient peoples, is taking soft wrought iron (which has a low carbon content) and increasing the carbon content at the surface to make it hard enough to hold a sharp edge. It is fast, cheap, and, if the iron has few impurities, good enough for most uses.

The slightly more advanced technique is pattern welding, where the sword is repeatedly carburized and then folded. This increases the carbon content of the sword, making it harder but not brittle. This famous "Japanese" technique was actually widely used by the Romans for their spatha swords, by ancient barbarians, and by pretty much everyone in medieval Europe.

The number of folds was typically 8-10. The process had to be tightly controlled - alloys of iron have multiple stable and metastable phases, like martensite and pearlite. The content of carbon and other alloying elements, and the speed of cooling, determine the hardness and brittleness of the end result. Obviously, the process is bound by the limits of chemistry - no amount of magic can create a sword much better than one made of modern high-quality steel.

Leaving modern steel aside, about two thousand years ago the Indians discovered a much more advanced technique. It was also used in the Middle East, and the famous "Damascene swords" made with it were considered hugely superior to anything Europe could offer. That's right - the "mythical" Japanese technique of pattern welding (also used in Europe) was no match for something known 2000 years ago.

Of course there's no need to use European examples to dispel the myths of Japanese swordmaking technology. Japanese history provides plenty of examples. The first time the Japanese fought a foreign army was during the Mongol invasions of 1274 and 1281. The Japanese armies were beaten throughout, and their swords couldn't even handle Mongol leather armor. Against the chain mail and plate armour commonly used in Europe they would be pretty much useless. Japan was only saved by the ill-preparedness of the invasion - hastily acquired river boats were used instead of high-sea ships and couldn't withstand the typhoon - and by internal problems within Yuan China following the death of Kublai Khan, which made further attempts impossible.

Apparently the early 1300s were the high point of Japanese sword making. It is believed that pattern welding was reinvented in Japan during that era. The civil wars of the Muromachi period (1336-1573) are pretty much the only time when good Japanese swords were used in actual battle. Except for a minor Korean anti-pirate expedition in 1419, we cannot tell anything about the effectiveness of armies using Japanese swords in that period against foreign tactics.

Firearms were introduced to Japan by the Portuguese in 1542. They were increasingly used in the Japanese civil wars, and by the 1575 battle of Nagashino, in which the winning side used European-style tactics and firearms, hardly anything commonly associated with the "samurai" fighting style (swordfighting, horseback archery) was left.

When Japan invaded Korea in 1592-1598, the dominant weapons were already matchlock muskets, arquebuses, cannons, grenades, and mortars, in addition to more traditional bows. There was very little sword fighting or any other close combat. This was even more true in later wars.

In the following Edo period (1603-1867), it is commonly believed that the quality of swords deteriorated. The "new style swords" (新刀) of that time were considered vastly inferior to the "old style swords" (古刀), and the old knowledge was never recovered. This is very plausible considering the limits the shogunate put on military technology.

It was during this peaceful time that the samurai caste really developed. The most famous samurai text, Go Rin No Sho, was written around 1645. The cult of the sword principally dates to that era, when samurai were no longer fighting and the sword making technologies were long forgotten.

A few words are in order on sword shape. Unlike European swords since antiquity, Japanese swords were designed for use against unarmored or lightly-armored opponents. They would be useless against the much more heavily armored soldiers of Europe. Against armour, stabbing is far more effective than cutting, and the katana is primarily a cutting weapon. As iron was expensive in Japan, few people could afford heavy armour, and such weapons could be pretty effective there.

So to sum it up - Japanese swords really sucked, there were better swords pretty much everywhere, and Japanese katana-wielding samurai would be totally crushed by a much smaller European force with European swords and decent armour. Any European force - Roman, barbarian, medieval, heck, even a Greek phalanx - would most likely do. Most armies from Asia would likewise crush a Japanese army (see the Mongol invasions, whose armies consisted mostly of Chinese and Korean soldiers). Stories of Japanese sword-making magic are no more than myths, popularized during the Edo period when there was hardly any fighting, and what little there was involved mostly firearms. Believing such myths is as lame as believing in feng shui.

Why people hate Japan ?

The premise of this panel was - "People in Asia (Chinese, Koreans, Russians and so on) hate the Japanese, because the Japanese committed terrible crimes against them and instead of apologizing, they falsify their history, glorify the war criminals etc.".

There's certainly some point to that. The Yasukuni Shrine (靖国神社) glorifies 12 convicted war criminals (and 2 accused who died before the trial), denies that the Nanking Massacre took place, and portrays Japan as a defender of Asia against the Western threat.

The shrine was visited by many officials, including Japanese prime ministers Miki Takeo (三木 武夫), Fukuda Takeo (福田 赳夫), Ōhira Masayoshi (大平 正芳), Suzuki Zenkō (鈴木 善幸), Nakasone Yasuhiro (中曽根 康弘), Miyazawa Ki'ichi (宮澤 喜一), Hashimoto Ryūtarō (橋本 龍太郎), and Koizumi Jun'ichirō (小泉 純一郎). Can you imagine Angela Merkel visiting an SS museum that denies the Holocaust ever happened and claims the Nazis were actually defending European civilization against the Bolshevik threat ?

This would be clearly absurd. And while they do have a point that Europeans and other Asians committed many atrocities in Asia, and that the trials after WW2 were conducted with total disregard for any rules, it doesn't change the basic facts: war crimes were committed by the Japanese army, Japanese nationalists are totally fucked up people for not admitting it, and the Japanese public should be ashamed of not reacting when their prime ministers associate themselves with such fuckups as those who run Yasukuni Shrine.

It also seems that most people in Japan are unaware of the scale of the war crimes committed by the Japanese Army.

I completely disagree with the premise. Sure, the Germans treat their history much more responsibly than the Japanese. But the Japanese aren't the exception here, the Germans are ! Pretty much every nation glorifies its past war criminals, denies or minimizes their crimes, and definitely refuses to apologize.

Just a few random examples. Take relations between Poland and Ukraine. During WW2, Polish (AK) and Ukrainian (UPA) guerrillas murdered each other's civilians. Hundreds of thousands of innocent people died. But ask any nationalist - they're going to remember only the crimes of the other side, not their own. Fortunately, the denial is mostly over for the general public. Or how about massacres of Jews committed during WW2, like the one in Jedwabne ? A lot of Poles will absolutely reject the very idea that Polish people could have done that. Surely it must have been the Nazi Germans, right ? They also deny responsibility for the crimes of the Communist government of Poland, as it was controlled by "the Russians". In the common mentality, the Poles were always victims, and even the idea that some of them cooperated with the occupiers, let alone did anything wrong on their own, is rejected outright.

Or moving somewhere else: Christopher Columbus is widely glorified in spite of all his crimes. He personally introduced slavery to America and started the genocide of the Indians, which between 1492 and 1508 killed three million people (according to Las Casas). After a report by Francisco de Bobadilla, he was arrested for atrocities committed as "Governor of the Indies" in 1493-1500. So Columbus' guilt shouldn't exactly be news to anyone. It was widely known 500 years ago, so why the heck is he still regarded as a great hero instead of the genocidal madman that he was ?

Or take Iraq. The American invasion is responsible for about 650,000 deaths. What does Bush do ? He completely disregards reality and claims that maybe some 30 thousand people died. That's great. How about Angela Merkel claiming 300 thousand victims of Nazi death camps ? Whatever the number, is anybody preparing a tribunal for Bush's war crimes ? Last time I checked, waging a war of aggression is a crime under international law, and the USA accepted this by supporting the Nuremberg Trials.

And then there's of course the Russian government, which still glorifies the Red Army, which invaded Central Europe together with the Nazis. Lenin's Mausoleum is still open, and Russian dictator Vladimir Putin described the collapse of the Soviet Union as "the greatest geopolitical catastrophe of the 20th century". Nobody could top that one.

Almost every nation has failed to confront much of the evil it is guilty of, whether against other nations or its own people.

In spite of all the past crimes and denial, most countries aren't hated the way Japan is in East Asia. Few people care about things that took place such a long time ago, and even recent events are mostly ignored. Like - most people in the world hate Bush, Rumsfeld, and the rest of the neocon war criminals, but it rarely turns into hatred of all Americans.

I think the real reason for anti-Japanese feelings is different. The governments of the People's Republic of China, South Korea, and other countries in the region try to incite anti-Japanese sentiment to shift public attention away from domestic problems. It's just like with the Muhammad cartoons. Not a single rioter in the Middle East had even read Jyllands-Posten. Muslims themselves do make pictures of Muhammad (usually not cartoons). But it was so convenient for Middle Eastern governments and radical imams to direct people against the Danish cartoonists. And it happened again after the famous remark by Pope Benedict XVI. How many Muslims listened to that lecture ?

Now, it's perfectly understandable that some people feel seriously pissed off when Koizumi visits Yasukuni Shrine, Danish newspapers print Muhammad cartoons, or the pope quotes Byzantine emperors who didn't like Islam much. But don't people have more serious problems ? There are wars all over the world (whether your country is the invaded or the invader). Very often poverty, crime, and corruption are widespread, and democracy and human rights are lacking. Are cartoons and some lame shrine really that important ?

Anyway, I'm pretty sure it's because of the governments of the People's Republic of China, South Korea, and other countries in the region that anti-Japanese sentiment is so widespread, even to the point of outbursts of violence. Sure, Koizumi is a jackass for visiting Yasukuni Shrine, but this is simply irrelevant. Move on.

Friday, November 03, 2006

magic/help for Ruby

First steps by fofurasfelinas from flickr (CC-NC-ND) Help systems are a weakness of almost all programming languages, and Ruby's help really sucks too. For example, let's try to get help on the sync method of an opened File:

$ irb
irb(main):001:0> f = File.open("/dev/null")
=> #<File:/dev/null>
irb(main):002:0> help f.sync
------------------------------------------------ REXML::Functions::false
     REXML::Functions::false( )
------------------------------------------------------------------------
     UNTESTED
=> nil
irb(main):003:0> help 'f.sync'
Bad argument: f.sync
=> nil
irb(main):004:0> help File.sync
NoMethodError: undefined method `sync' for File:Class
        from (irb):4
irb(main):005:0> help 'File.sync'
Nothing known about File.sync
=> nil
irb(main):006:0> help 'File#sync'
Nothing known about File#sync
=> nil
irb(main):007:0> help 'sync'
More than one method matched your request. You can refine
your search by asking for information on one of:

     IO#sync, IO#fsync, IO#sync=, Zlib::GzipFile#sync,
     Zlib::GzipFile#sync=, Zlib::Inflate#sync,
     Zlib::Inflate#sync_point?, Mutex#synchronize,
     MonitorMixin#mon_synchronize, MonitorMixin#synchronize,
     StringIO#sync, StringIO#fsync, StringIO#sync=
=> nil
irb(main):008:0> eat flaming death
(irb):8: warning: parenthesize argument(s) for future version
NameError: undefined local variable or method `death' for main:Object
        from (irb):8
irb(main):009:0> ^D
$ firefox http://www.google.com/
Of course nobody would actually do that. Everyone either visits Google first thing, or talks with objects using reflection. The help system is just way too weak. It's not that the help isn't there - Ruby has plenty of documentation. It's just too hard to find. Compare Ruby:

help "Array#reverse"
---------------------------------------------------------- Array#reverse
     array.reverse -> an_array
------------------------------------------------------------------------
     Returns a new array containing _self_'s elements in reverse order.

        [ "a", "b", "c" ].reverse   #=> ["c", "b", "a"]
        [ 1 ].reverse               #=> [1]
with Python:

>>> help([].reverse)
Help on built-in function reverse:

reverse(...)
    L.reverse() -- reverse *IN PLACE*
So Ruby has more documentation, but it's more difficult to access. At least it was until this morning, because right now Ruby totally dominates ! If you pass a class, a class name, or a class instance, you get documentation on the class:
irb(main):001:0> help "Array"
irb(main):002:0> help Array
irb(main):003:0> help { Array }
irb(main):004:0> help { ["a", "b", "c"] }
----------------------------------------------------------- Class: Array
     Arrays are ordered, integer-indexed collections of any object.
     Array indexing starts at 0, as in C or Java. A negative index is
     assumed to be relative to the end of the array---that is, an index
     of -1 indicates the last element of the array, -2 is the next to
     last element in the array, and so on.

------------------------------------------------------------------------


Includes:
---------
     Enumerable(all?, any?, collect, detect, each_cons, each_slice,
     each_with_index, entries, enum_cons, enum_slice, enum_with_index,
     find, find_all, grep, include?, inject, map, max, member?, min,
     partition, reject, select, sort, sort_by, to_a, to_set, zip)


Class methods:
--------------
     [], new


Instance methods:
-----------------
     &, *, +, -, <<, <=>, ==, [], []=, abbrev, assoc, at, clear,
     collect, collect!, compact, compact!, concat, dclone, delete,
     delete_at, delete_if, each, each_index, empty?, eql?, fetch, fill,
     first, flatten, flatten!, frozen?, hash, include?, index, indexes,
     indices, initialize_copy, insert, inspect, join, last, length, map,
     map!, nitems, pack, pop, pretty_print, pretty_print_cycle, push,
     rassoc, reject, reject!, replace, reverse, reverse!, reverse_each,
     rindex, select, shift, size, slice, slice!, sort, sort!, to_a,
     to_ary, to_s, transpose, uniq, uniq!, unshift, values_at, zip, |
If you call a method inside the block, you get documentation for it. The method won't really be called, because magic/help plugs into Ruby's debugging hooks (set_trace_func) - a rough sketch of the trick follows the example below. So you can safely ask for help on start_global_thermonuclear_warfare.
irb(main):005:0> help { 2 + 2 }
--------------------------------------------------------------- Fixnum#+
     fix + numeric   =>  numeric_result
------------------------------------------------------------------------
     Performs addition: the class of the resulting object depends on the
     class of +numeric+ and on the magnitude of the result.
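The set_trace_func trick deserves a quick illustration. Here is a minimal sketch of the idea - not magic/help's actual code; the helper name which_method and the exception class are made up for illustration. It installs a trace hook, runs the block, and as soon as the first real method call shows up, records which class and method it is and bails out before the method body gets to run:

class MethodInterception < StandardError
    attr_reader :klass, :meth
    def initialize(klass, meth)
        @klass, @meth = klass, meth
        super("#{klass}##{meth}")
    end
end

def which_method(&block)
    set_trace_func(proc do |event, file, line, id, binding, klass|
        # only interested in method calls, and not in the Proc#call that starts the block itself
        next unless event == 'call' || event == 'c-call'
        next if klass == Proc
        set_trace_func(nil)                        # stop tracing before unwinding
        raise MethodInterception.new(klass, id)    # abort so the method body never runs
    end)
    begin
        block.call
        nil
    rescue MethodInterception => e
        [e.klass, e.meth]
    ensure
        set_trace_func(nil)
    end
end

p which_method { [1, 2, 3].reverse }   #=> [Array, :reverse]

The class/method pair is then what gets looked up in the documentation; the real magic/help has to handle a pile of corner cases this sketch ignores (methods coming from modules, aliases, and so on).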
It doesn't matter whether the method is defined in the class itself, in one of its ancestors, or in an included module. You can also pass a Method or UnboundMethod object, or a method name. It all does the right thing.
irb(main):006:0> f = File.open("/dev/null")
=> #<File:/dev/null>
irb(main):007:0> help { f.sync }
irb(main):008:0> help "File#sync"
irb(main):009:0> help f.method(:sync)
irb(main):010:0> help File.instance_method(:sync)
---------------------------------------------------------------- IO#sync
     ios.sync    => true or false
------------------------------------------------------------------------
     Returns the current ``sync mode'' of _ios_. When sync mode is true,
     all output is immediately flushed to the underlying operating
     system and is not buffered by Ruby internally. See also +IO#fsync+.

        f = File.new("testfile")
        f.sync   #=> false
magic/help tries to guess whether you meant a class or an instance method. So help "Dir.[]" gives you documentation for the class method of Dir, while help "Array.[]" gives you documentation for the instance method of Array. Using magic/help requires almost no effort. Simply copy magic_help.rb to some visible place, and add require 'magic_help' to your ~/.irbrc. It works with either Ruby 1.8 or 1.9. I haven't converted it to a gem yet; for now, go to the magic/help website and get a tarball or a zip file. Documentation is minimal, as it was just finished. Unit test coverage is naturally 100%.

Sunday, October 29, 2006

Prototype-based Ruby

Guardians of the horizon by good day from flickr (CC-NC) Object-oriented programming is all about objects - objects and the messages they pass to each other. Programs contain many objects. Way too many to make each of them by hand, so objects need to be mass-produced. There are basically two ways of mass-producing objects:

  • The industrial way - building factory objects that build other objects.
  • The biological way - building prototype objects that can be cloned.
A language with classes that are not objects is not object-oriented. Period. Most object-oriented languages, like Smalltalk and Ruby, use the industrial way. The object factories are also known as "class objects" (or even just "classes", but that's a bit confusing). To create a new object factory you do:
a_factory = Class.new()
# define_method is private in Ruby 1.8, hence the send
a_factory.send(:define_method, :hello) {|arg|
    puts "Hello, #{arg}!"
}
And then:
an_object = a_factory.new()
an_object.hello("world")
Some languages, like Self and The Most Underappreciated Programming Language Ever (TMUPLE, also known as JavaScript), use the biological method instead. In the biological method you create a prototype, then clone it:
a_prototype = Object.new()
# plain Ruby objects have no define_method, so define a singleton method directly
def a_prototype.hello(arg)
    puts "Hello, #{arg}!"
end
Then:
an_object = a_prototype.clone()
an_object.hello("world")
The biological way is less organized, but simpler and more lightweight. There are only objects and messages, nothing more - Array is a prototype for all arrays, and so on. The industrial way is more organized, but much more complex and heavy. There are objects, classes, class objects, superclasses, inheritance, mixins, metaclasses, singleton classes. It's just too complex. This complexity exists for a reason, but sometimes we'd rather do without it and use something simpler.

Prototype-based programming in Ruby

And in Ruby we can ! First, we need to be able to define methods just for individual objects:
def x.hello(arg)
    puts "Hello, #{arg}!"
end

x.hello("world") # => "Hello, world!"
Now we just need to copy existing objects:
y = x.clone()
y.hello("world") # => "Hello, world!"
The objects are independent, so each of them can redefine methods without worrying about everyone else:
z = x.clone()

def x.hello(arg)
    puts "Guten Tag, #{arg}!"
end

def z.hello(arg)
    puts "Goodbye, #{arg}!"
end

x.hello("world") # => "Guten Tag, world!"
y.hello("world") # => "Hello, world!"
z.hello("world") # => "Goodbye, world!"
Converting class objects into prototype objects would probably introduce compatibility issues, so let's go halfway there:
class Class
    def prototype
        # each class lazily gets a single shared prototype instance
        @prototype ||= new
    end
    def clone
        # Class#clone now hands out copies of the prototype instead of copying the class
        prototype.clone
    end
end

def (String.prototype).hello
    puts "Hello, #{self}!"
end

a_string = String.clone
a_string[0..-1] = "world"

a_string.hello #=> "Hello, world!"

Horizontal gene transfer

Of course, transfer of genes from parents to offspring is only half of the story. The other half is gene transfer between unrelated organisms. We could easily use delegation and method_missing (a sketch of that route follows below), but let's do something more fun instead - directly copying genes (methods) between objects.
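For contrast, here is roughly what the boring delegation alternative would look like (a minimal sketch; the class name and details are made up, not from this post): any message the object doesn't understand gets forwarded to a donor object.

class GeneDelegator
    def initialize(donor)
        @donor = donor
    end

    def method_missing(name, *args, &block)
        if @donor.respond_to?(name)
            @donor.send(name, *args, &block)   # note: runs with self == @donor
        else
            super
        end
    end

    def respond_to?(name, include_private = false)
        super || @donor.respond_to?(name, include_private)
    end
end

The catch is that the forwarded method runs with self set to the donor, so it sees the donor's instance variables rather than the delegator's - which is exactly why copying the method itself is more interesting here. So, back to gene copying. First, a prototype person: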
a_person = Object.new
class <<a_person
    attr_accessor :first_name
    attr_accessor :name

    def to_s
        "#{first_name} #{name}"
    end
end

nancy_cartwright = a_person.clone
nancy_cartwright.first_name = "Nancy"
nancy_cartwright.name = "Cartwright"

hayashibara_megumi = a_person.clone
hayashibara_megumi.first_name = "Megumi"
hayashibara_megumi.name = "Hayashibara"
But Megumi is Japanese, so she needs a reversed to_s method:

def hayashibara_megumi.to_s
    "#{name} #{first_name}"
end
Later we find out that another person needs the reversed to_s:
inoue_kikuko = a_person.clone
inoue_kikuko.first_name = "Kikuko"
inoue_kikuko.name = "Inoue"
We want to do something like:
japanese_to_s = hayashibara_megumi.copy_gene(:to_s)
inoue_kikuko.use_gene japanese_to_s
OK, first let's fix a few deficiencies of Ruby 1.8: define_method is private (it should be public), and there is no simple singleton_class method. Both will hopefully be fixed in Ruby 2.
class Object
    def singleton_class
        (class <<self; self; end)
    end
end

class Class
    public :define_method
end
And now:
class Object
    def copy_gene(method_name)
        [method(method_name).unbind, method_name]
    end

    def use_gene(gene, new_method_name = nil)
        singleton_class.define_method(new_method_name||gene[1], gene[0])
    end
end
We can check how the gene splicing worked:
puts nancy_cartwright #=> Nancy Cartwright 
puts hayashibara_megumi #=> Hayashibara Megumi
puts inoue_kikuko #=> in `to_s':TypeError: singleton method called for a different object
If we try it in Ruby 1.9 we get a different error message:
puts inoue_kikuko #=> in `define_method': can't bind singleton method to a different class (TypeError)
What Ruby does makes some sense - if a method is implemented in C (like a lot of standard Ruby methods), calling it on an object of a completely different "kind" can get us a segmentation fault. With C you can never be sure, but it's reasonably safe to assume that we can move methods between objects with the same "internal representation". We need to use Evil Ruby, which lets us access Ruby internals. The UnboundMethod class represents methods not bound to any object. It contains an internal field rklass, and the method can only bind to objects of that class (or its subclasses). First, let's define a method to change this rklass:
class UnboundMethod
    def rklass=(c)
        RubyInternal.critical {
            i = RubyInternal::DMethod.new(internal.data)
            i.rklass = c.object_id * 2
        }
    end
end
Now we could remove the protection completely, but we just want to loosen it. Instead of comparing classes, we compare internal types:
class Object
    def copy_gene(method_name)
        [method(method_name).unbind, method_name, internal_type]
    end

    def use_gene(gene, new_method_name = nil)
        raise TypeError, "can't bind method to an object of different internal type" if internal_type != gene[2]
        gene[0].rklass = self.class
        singleton_class.define_method(new_method_name||gene[1], gene[0])
    end
end
And voilà!
puts nancy_cartwright #=> Nancy Cartwright 
puts hayashibara_megumi #=> Hayashibara Megumi
puts inoue_kikuko #=> Inoue Kikuko
This is merely a toy example, but sometimes prototypes lead to a more elegant design than factories. Think about the possibility in your next project.

Full listing

require 'evil'

class Object
    def singleton_class
        (class <<self; self; end)
    end
end

class UnboundMethod
    def rklass=(c)
        RubyInternal.critical {
            i = RubyInternal::DMethod.new(internal.data)
            i.rklass = c.object_id * 2
        }
    end
end

class Class
    public :define_method
end

class Object
    def copy_gene(method_name)
        [method(method_name).unbind, method_name, internal_type]
    end

    def use_gene(gene, new_method_name = nil)
        raise TypeError, "can't bind method to an object of different internal type" if internal_type != gene[2]
        gene[0].rklass = self.class
        singleton_class.define_method(new_method_name||gene[1], gene[0])
    end
end

a_person = Object.new
class <<a_person
    attr_accessor :first_name
    attr_accessor :name

    def to_s
        "#{first_name} #{name}"
    end
end

nancy_cartwright = a_person.clone
nancy_cartwright.first_name = "Nancy"
nancy_cartwright.name = "Cartwright"

hayashibara_megumi = a_person.clone
hayashibara_megumi.first_name = "Megumi"
hayashibara_megumi.name = "Hayashibara"

def hayashibara_megumi.to_s
    "#{name} #{first_name}"
end

inoue_kikuko = a_person.clone
inoue_kikuko.first_name = "Kikuko"
inoue_kikuko.name = "Inoue"

japanese_to_s = hayashibara_megumi.copy_gene(:to_s)
inoue_kikuko.use_gene japanese_to_s

puts nancy_cartwright
puts hayashibara_megumi
puts inoue_kikuko

Wednesday, October 04, 2006

Why Perl Is a Great Language for Concurrent Programming

Cable Beach sunset by marj k from flickr (CC-BY-NC) If you asked people what Perl is good for, they would talk about stuff like web programming, data processing, system scripting, and duct-taping things together. But concurrency ? Can Perl seriously beat languages designed with concurrency in mind, like Java (well, kinda) and Erlang ? Surprisingly it can, and here's the story.

Wikipedia contains a lot of links to other websites - about one per article. It has no control whatsoever over the servers where they are hosted, and every day many of the links die. Some websites move somewhere else, others are removed completely, and some were incorrect in the first place due to typos etc. Nobody can seriously expect people to check millions of links by hand every day - this is something a bot should do. And that's what tawbot does.

The first thing we need is to extract the list of external links from the Wikipedia database. There are basically two ways. The first is to get an XML dump of Wikipedia and parse the wiki markup to extract links. It's easy to get about half of the links this way, but it's pretty much impossible to get close to all of them. There are at least three formats of external links, and even worse - links can also be assembled from pieces by template hackery. For example, links to IMDb in a film infobox are not specified explicitly; only the IMDb code of the movie is in the article, and a template turns the code into a URL.

The alternative is to use the MediaWiki engine. Reparsing every single page would be really slow. That's not a new problem - every Wikipedia mirror needs to do the same thing. Fortunately it's already solved: Wikipedia dumps include SQL dumps of the tables that are technically derivable from the XML database, but are a bit too expensive to recompute. One of them is the "externallinks" table (yay). Another is the "page" table with page metainformation (id, title, access restrictions etc.).

The first problem is to extract data from the MySQL dumps somehow. I could run a MySQL server, but I hate using database servers for standalone applications. I would need to install the MySQL server, set up user accounts, and create a special database on every computer I wanted to run the script on, and then deal with usernames, passwords, and the rest of the ugly stuff. So I thought I'd use Sqlite 3 instead. It was to be expected that the table creation syntax would be different - it isn't the same in any two RDBMSes (SQL is as much of "a standard" as the Democratic People's Republic of Korea is "a democracy"). That part was easy. But Sqlite didn't even accept MySQL INSERTs !

Inserting multiple rows with a single INSERT statement is a MySQL extension of SQL (a pretty useful one): INSERT INTO `externallinks` VALUES (a, b, c), (d, e, f), (g, h, i); and Sqlite doesn't support it. Now come on, MySQL is extremely popular, so being compatible with some of its extensions would really increase Sqlite's value as duct tape. Anyway, I had to write a pre-parser to split multi-record INSERTs into series of single INSERTs. It took some regexp hackery to get right, and Sqlite started to accept the input. But it was so unbelievably slow that Perl 6 would have been released by the time Sqlite finished importing the dump. I guess it was doing something as braindead as fsync'ing the database after every insert; I couldn't find anything about it in the man pages or the Sqlite FAQ. But then I thought - the code that splits multi-value INSERTs into single-valued ones is already almost parsing the data, so what the heck do I need Sqlite for ? A few more regexps, a lot of time (regexping hundreds of MBs is pretty slow, but it was way faster than Sqlite), and I had the list of links to check extracted. So far I hadn't used any Perl or concurrency - it was all unthreaded Ruby.
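Here is a simplified sketch of what such a pre-parser might look like (not the actual script, and real MySQL dumps have more quoting corner cases than this handles): it splits each multi-value INSERT into one INSERT per value tuple and passes other lines through.

ARGF.each_line do |line|
    if line =~ /\AINSERT INTO (`\w+`) VALUES (.*);\s*\z/m
        table, values = $1, $2
        # match each "(...)" tuple, allowing single-quoted strings with \-escapes inside
        values.scan(/\((?:[^()']|'(?:[^'\\]|\\.)*')*\)/m) do |tuple|
            puts "INSERT INTO #{table} VALUES #{tuple};"
        end
    else
        print line
    end
end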

Now let's get back to Perl. There were 386,705 links to check, and some time ago I wrote a Perl program for verifying links. It is so simple that I'm pasting almost all of it here (headers and the code to preload the cache omitted):

sub link_status {
my ($ua,$link) = @_;
# return the cached status if we already checked this URL
if(defined $CACHE{$link}) {
    return $CACHE{$link};
}
# issue a HEAD request and cache the HTTP status code
my $request = new HTTP::Request('HEAD', $link);
my $response = $ua->request($request);
my $status = $response->code;
$CACHE{$link} = $status;
return $status;
}

my $ua = LWP::UserAgent->new();
while(<>)
{
/^(.*)\t(.*)$/ or die "Bad input format of line $.: `$_'";
my($title,$link)=($1,$2);
my $result = link_status($ua, $link);
print "$result\t$title\t$link\n";
}
Basically it reads article/URL pairs from stdin, runs an HTTP HEAD request on each URL, and prints the URL status on stdout. As the same URLs often appear in multiple articles, and we want to be able to stop the program and restart it later without remembering where it finished, a simple cache system is used.

The program was getting the work done, but it was insanely slow. Most HTTP requests are very fast, so it doesn't hurt much to run them serially. But every now and then there are dead servers that do not return any answer and simply time out. When I came back the next day to check how the script was doing, I found it was spending most of its time waiting for dead servers, and in effect it was really slow. There is no reason to wait like that - it should be possible to have multiple HTTP connections in parallel. Basically, it was in need of some sort of concurrency.

Concurrent programming in most languages is ugly, and people tend to avoid it unless they really have no other choice. Well, I had no other choice. The first thing that came to my mind was Erlang (really). The concurrency part should be simple enough, and it would be highly educational. But does it have something like LWP ? It would really suck to implement cool concurrency and then have to work with some lame, half-complete HTTP library. So I didn't even try. Ruby has only interpreter threads, so concurrent I/O is not going to work, and it would be an understatement to call its HTTP library incomplete. But wait a moment - Perl has real threads. They are pretty weird, sure, but I don't need anything fancy. And I can keep using LWP.

Perl (since 5.6.0) uses a threading model that is quite different from most other languages. Each thread runs as a separate virtual machine, and only data that is explicitly marked as shared can be shared. It also has scope-based locks and a few other unusual ideas. The real reason it was done this way was to retrofit threading onto a fundamentally unthreadable interpreter. That's the real rationale behind Perl OO too. Of course it sucks for some things, but for some problems it happens to work very nicely.

The design I used was very simple - one master thread and a group of worker threads. The master keeps pushing tasks onto @work_todo, workers pop items from it and push results onto @work_done. When the master has nothing else to do, it sets $finished to true and waits for the workers to finish.

Before we spawn threads, we need to declare shared variables:
my @work_todo : shared = ();
my @work_done : shared = ();
my $finished : shared = 0;
Now let's spawn 20 workers (I originally had 5, but it wasn't enough):
my @t;
push @t, threads->new(\&worker) for 1..20;
Workers basically keep taking tasks from the todo list until the master says it's over:
sub worker {
my $ua = LWP::UserAgent->new(); # Initialize LWP object
while (1) {
    my $cmd;
    {
        lock @work_todo;
        $cmd = pop @work_todo; # get a command
    }
    if($cmd) { # did we get any command ?
        my ($link, $title) = @$cmd;
        # do the computations
        my $status = link_status($ua,$link);
        {
            # @done_item needs to be shared because threads are otherwise independent
            my @done_item : shared = ($status, $title, $link);
            lock @work_done;
            push @work_done, \@done_item;
        }
    } elsif($finished) { # No command, is it over already ?
        last; # Get out of the loop
    } else { # It's not over, wait for new commands from the master
        sleep 1;
    }
}
}
The link checker got even simpler, as the cache logic was moved to the master.
sub link_status {
my ($ua,$link) = @_;
my $request = new HTTP::Request('HEAD', $link);
my $response = $ua->request($request);
my $status = $response->code;
return $status;
}
After spawning the threads, the master preloads the cache from disk (boring code). Then it loops. The first action is clearing @work_done: the done items are saved to disk and added to the cache. It is important that only one thread writes to any given file. Under Unices, writes are not atomic, so if one process does print("abc") and another does print("def"), it is possible to get adbefc. Unix is again guilty of being totally broken, inconsistent, and hard to use, for the sake of a marginally simpler implementation.
while(1)
{
{
    lock @work_done;
    for(@work_done) {
        my ($result, $title, $link) = @$_;
        $CACHE{$link} = $result;
        print "$result\t$title\t$link\n";
    }
    @work_done = ();
}
if (@work_todo > 100) {
    sleep 1;
    next;
}
...
# More code
}
And finally the code to put new items on the todo list:
my $cmd = <>;
last unless defined $cmd;
$cmd =~ /^(.*)\t(.*)$/ or die "Bad input format of line $.: `$cmd'";
my($link, $title)=($1, $2);
next if defined $CACHE{$link};
{
    # Explicit sharing of @todo_item again
    my @todo_item : shared = ($link, $title);
    lock @work_todo;
    push @work_todo, \@todo_item;
}
And the code to wait for the threads after we have nothing more to do:
$finished = 1;
$_->join for @t;
Now what's so great about Perl concurrency ?
  • It works
  • Access to all Perl libraries
  • By getting rid of the really stupid "shared everything by default" paradigm, it avoids a lot of errors that plague concurrent programs
The most important are the first two points; the third is just a small bonus. Perl concurrency works a lot better than Ruby's or Python's (at least last time I checked). We get access to all the libraries that Erlang will never ever get its hands on. And while "shared-nothing pure message passing" may be more convenient, at least explicit sharing and locking-by-scope are better than the "shared everything" model. And we get the basic stuff like Thread::Queue (which I should have used instead of managing lists by hand, whatever).

Could such a link checker be written in Erlang as quickly ? I don't think so. In most programs that need, or would benefit from, concurrency, the "concurrency" part is a very small part of the program. It's usually far better to have lame concurrency support and a good rest of the language than an absolutely awesome concurrency part and a lame rest of the language. Every now and then you simply have to write a massively concurrent program (like a phone switch), and a special language can be really great. For everything else, there are "business as usual, with concurrency bolted on" languages like Perl and Java.