Saturday, December 27, 2008

Functional Programming

I've harbored a secret desire to learn Haskell for a few years now. Simon Peyton-Jones is one of the key people behind Haskell. His web site at MSR has tons of papers, a tutorial on concurrent programming in Haskell, and a video lecture of A taste of Haskell. There's also a Simon Peyton-Jones podcast at SE-Radio.

What is Haskell

Haskell is a programming language that is

  • purely functional
  • lazy
  • higher order
  • strongly typed
  • general purpose

Why should I care?

Functional programming will make you think differently about programming

  • Mainstream languages are all about state
  • Functional programming is all about values

Whether or not you drink the Haskell Kool-Aid, you'll be a better programmer in whatever language you regularly use

I should read a Haskell book or two, and, in related functional goodness, I keep reading how great Practical Common Lisp is. I also need to fulfill my quest to finish SICP. I've read the first three chapters twice, doing the examples once in Scheme and again in OCAML. I've read chapter 4 on interpreters. I need to work through the examples in that chapter and take in the final fifth chapter.

Sunday, December 21, 2008

Principle of least surprise my ass

Let's start off by saying I'm not anti-Ruby. I like Ruby. Ruby is cool. Matz is cool. But, a while back I was wondering, What is a Ruby code block? My feeble curiosity has been revealed for the half-assed dilettantery it is by Paul Cantrell. Mr. Cantrell chases this question down, grabs it by the scruff of the neck, and wrings it out like a bulldog with a new toy. He also rocks on the piano, by the way.

So in fact, there are no less than seven -- count 'em, SEVEN -- different closure-like constructs in Ruby:
  1. block (implicitly passed, called with yield)
  2. block (&b => f(&b) => yield)
  3. block (&b =>
  5. proc
  6. lambda
  7. method

This is quite a dizzing array of syntactic options, with subtle semantics differences that are not at all obvious, and riddled with minor special cases. It's like a big bear trap from programmers who expect the language to just work. Why are things this way? Because Ruby is:

  1. designed by implementation, and
  2. defined by implementation.

Again, neither I nor Mr. P.C. are bashing Ruby. He shows how to pull off some tasty functional goodness like transparent lazy lists later in the article. Thanks to railspikes for the link.

Saturday, December 20, 2008

Random Introduction to Bioinformatics

I was asked recently to recommend some introductory reading material about bioinformatics, which made me realize I haven't read enough myself. Here's what I came up with plus a few additions.

If you're into stats, both of these are highly regarded, but miles over my head.


Dr. Larry Ruzzo at UW teaches a Computational Biology course. Some of the links above are from his reading list, particularly the Sean Eddy Primer articles from Nature Biotechnology.

In winter of 2008, some UW CS grad students held a seminar course on data management issues in life sciences. In case that link doesn't stay up forever, here's some of the reading list:

Intro to Biology

Overview on biological data integration

Specific tools and techniques

Also on the subject of data: Dynamic Fusion of Web Data.

Books I wanna read

Finally, here are some books that I haven't read, will probably never get the time to read, but I wish I would read.

Friday, December 12, 2008

Educating Software Developers

scholarSlashdot linked to Bjarne Stroustrup on Educating Software Developers which follows up on an earlier article, The 'Anti-Java' Professor and the Jobless Programmers. The Anti-Java professor is Robert Dewar at NYU, who coauthored a short paper, Computer Science Education: Where Are the Software Engineers of Tomorrow? They contend that computer science curricula have been dumbed down to counter falling enrollment post-dot-com-crash and partially blame Java, which fosters reliance on libraries and garbage collection. But, not all of their critique can be written off as language bigotry. The result?

We are training easily replaceable professionals.

Dewar advocates:

  • mentoring
  • reading code
  • working in groups
  • learning to reuse code

Those sound like solid points to me. One thing the field of medicine really gets right is an emphasis on mentoring. Mentoring is the heart of residency, which depending on specialty can last from 3 to 7 years. By the time a physician graduates from residency, they will have performed hundreds of procedures and seen thousands of patients under the guidance of an attending physician. I've often wished there was more of this in the computing field.

Over the years, I've accumulated a list of topics I wish I'd been exposed to as a CS undergrad.

  • source control
  • command line foo
  • linking and dependency management
  • security
  • software design: interfaces, refactoring, patterns, components, APIs
  • software architecture: design with large components - app servers, databases, message queues, transactions, etc.
  • data modeling: schema design and not just relational DBs. Hierarchical (XML) and graph (OO, RDF) representation
  • models of computation
    • imperative
    • OO
    • functional
    • logical
    • pipes and filters
    • and how they're are related.
  • scientific/engineering computing: MatLab, R
  • probability and statistics, statistical computing, analytics
  • writing

Of course, then my undergraduate degree would have taken 7 years... On second thought, only my Dad would have complained.

What do you wish you'd learned in college? Post a comment!

Tuesday, December 09, 2008

Firefox and bioinformatics

Firefox LogoI wondered who else might be working on bioinformatics related extensions for Firefox besides Firegoose. One interesting project is iHOPerator, which builds on Greasemonkey. And, there's a hint of something to come here.

It seems like there was a flurry of interest around 2005, in the early days of AJAX and mash-ups, which produced biobar along with two now-dead projects - bioFox and NCBI Search Toolbar. Back in those days, John Udell asked, How do you design a remixable Web application? Nifty developments like the REST API in EMBL's STRING 8.0 are starting to provide answers.

Bioinformatics as a Queryable Knowledge Map: the Pygr Project

Pygr is a hypergraph database in Python with applications in bioinformatics written by Christopher Lee, a faculty member at UCLA. There's a 30 minute video of talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's thinking bioinformatics blog.

Thesis: Hypergraphs are a general model for bioinformatics and Python’s core models are already a good model of Bioinformatics Data
  • Sequence: protein and nucleic acid sequences 
  • Mapping / Graphs: alignment, annotation 
  • Attributes: schema, i.e. relations between data 
  • Namespace (import): the ontology of all bioinformatics data 
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.

The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.

Dr. Lee is also writing a textbook on probabilistic inference.

Saturday, December 06, 2008

Dynamic Fusion of Web Data

I happened across a very cool project on web data integration at the University of Leipzig. Their paper Dynamic Fusion of Web Data is worth a look. They're working towards a theory of on-the-fly data integration for mashup applications that they refer to as dynamic data fusion. Data integration in mashups is dynamic in that it occurs as runtime. This provides for a pay-as-you-go model, rather than a large up-front semantic mapping task that limits the scalability of traditional data integration methods like data warehouses.

They describe mashups as workflow-like. Do they mean mashups are programmatic as opposed to declarative? In place of SQL, this group's iFuice system uses a scripting language with "set operations (e.g., union, intersection, and difference) and data transformation (e.g., fuse, aggregate) which can be used to post-process query results". Other key features are instance-level mapping and accommodation of structured and unstructured data.

This definitely gets at what Firegoose is good for - using the web as a channel for structured data - an approach that does for data integration what loose coupling does for software. Firegoose, part of the Gaggle framework, is a toolbar for Firefox that allows data to be exchanged between desktop software and the web. Firegoose can read microformats, call web services, query databases, or even perform nasty dirty screen scraping. Unlike a mashup, data integration in Firegoose and Gaggle requires user participation, although the user never deals with schemas, only instances of the Gaggle data types - mainly lists of identifiers, matrices of numeric data, networks, and tuples. The identifiers serve in a role somewhat analogous to primary keys.

More papers in a similar vein

Tuesday, December 02, 2008

Browsing genomes

I may as well come clean and admit that I'm developing a genome browser. What? Another genome browser? Why? You may well ask these questions. Well, it's a long story. But here is a completely non-exhaustive list of existing genome browsers.

Note: updated in Sept. 2009 to reflect the fact that everyone and their uncle built a genome browser this past couple of years. See Brother, can you spare a genome browser?

Note: updated again in May of 2010 and again in Feb 2011 to add Savant.

Monday, December 01, 2008

UCSC Genome Browser

A while back, I wrote a little hack to to download and parse genome data from NCBI, but was flummoxed by NCBI's format for eukaryotes. A couple of local bioinformatics gurus directed me to UCSC as an alternate data source. UCSC's Genome Browser provides a nice interface to it's underlying data through a Table Browser. The main genome browser has data for eukaryotes, while archaea (and other prokaryotes) are in a separate project. The Table Browser for the archaeal genome browser is a little tricky to find, but it's there.

Friday, November 28, 2008

Better color chooser for Java Swing

I came across a color chooser that kicks the Swing JColorChooser's booty. It looks great and supports transparency. It's available at and on the developer's blog. It will soon be used in my genome browser. Thanks, Jeremy!

Tuesday, November 18, 2008

Java in Firefox extension hosed again

A full-featured browser, an XUL front end, and the wealth of libraries available in Java makes for a powerful and flexible combination. The browser extension capability of Firefox, along with LiveConnect has been used by at least three extensions:

Most of what I've figured out about using Java from an extension came from Simile's David Huynh. Sadly, development of Piggy Bank has now "quiesced".

I don't know about the others, but Firegoose is hosed by the latest Java 6 update 10. Apparently, Java 6.10 introduces some significant changes into LiveConnect and the Java browser plugin. It's certainly good that Java in the browser is getting some attention, but I wish Java in a Firefox extension was a supported and regression tested use case (see whining here). The fact that it's such an arcane, unsupported and brittle hack is holding back what could otherwise be a nice technique.

Interest in Java in Firefox extensions appears to exist according to these posts in the MozillaZine Extension Development forum:

First Problem: The error I get appears to happen when reflectively instantiating a Java array and looks like this:

Error calling method on NPObject!
[plugin exception: java.lang.IllegalArgumentException: No method
found matching name newInstance and arguments

Instantiating the array through reflection was, itself, a work-around for another LiveConnect issue with type conversion between Javascript arrays and Java arrays. It's barfing on line 03 below:

// from
_toJavaUrlArray: function(a) {
 var urlArray = java.lang.reflect.Array.newInstance(, a.length);
 for (var i = 0; i < a.length; i++) {
  var url = a[i];
   (typeof url == "string") ? new : url
 return urlArray;

Update 1: First problem solved easily enough.

var dummyUrl = new"");
var urlArray = java.lang.reflect.Array.newInstance(dummyUrl.getClass(), a.length);

Now, on to more hideousness:

Error calling method on NPObject!
[plugin exception: access denied
(java.lang.RuntimePermission createClassLoader)].

...caused by trying to instantiate a URLClassLoader. The next-generation Java Plug-in, including in update 10, makes changes to the security policy such that calls from Javascript to Java are uniformly treated as untrusted.

Update 2: A Work-around!

A post on the forum has a work-around. You can disable the "next-generation plug-in" through the Java control panel. Under the Advanced tab, open Java Plug-in, deselect Enable the next-generation Java Plug-in, then, restart Firefox. There is a bug filed whose comments seem to suggest that it will be addressed in a future release of the Java Plug-in.

Update 3: According to this thread on, a fix is on the way in Java SE 6 update 12. Thanks, Sun!

More references:

Thursday, November 13, 2008

SQLiteJDBC from Xerial

As noted earlier, I haven't been able to get native mode JDBC drivers for SQLite to work on Java 6 on OS X. Xerial has made what appears to be a fork of the Zentus SQLiteJDBC driver. They even have a discussion group, which apparently Zentus used to have but doesn't anymore.
Maybe ch-werner's javasqlite is another possibility.

Monday, November 10, 2008

Clean code, properties, and patterns

Robert C. Martin has a new book out called Clean Code that looks worthwhile, as if I'm not far enough behind in reading. I also came across his Principles and Patterns article. Completely unrelated to that, The Universal Design Pattern by Steve Yegge describes object system of Javascript as an instance of what he calls the Properties Pattern. Java's static types and even classes in Python and Ruby feel constrained after working with Javascript's bag-of-properties approach to objects. There's also some interesting stuff at a blog called Red and Sensual Java, if you can get past the name.

Friday, November 07, 2008

Clay Shirky on Econ-Talk

Clay Shirky appeared on the excellent podcast Econ-Talk, to which I've become hopelessly addicted. (My geekosity knows few bounds.) He was hawking his book Here Comes Everybody.

Shirky is a great thinker about what makes the social aspects of technology tick. His current theory goes something like this: There is a cognitive surplus brought about the increasing amount of leisure time in modern industrialized societies. For a generation, that surplus has been dissipated through the artifice of television. Now, that surplus is being put to use. Open source software, tagging, Wikipedia, and other social web phenomena are the early results.

The question now is, what happens when someone figures out how to move this phenomena out to the physical world. Someone posted this quote in the comments section to the podcast:

The effect of a newspaper is not only to suggest the same purpose to a great number of persons, but to furnish means for executing in common the designs which they may have singly conceived.

-Alexis de Tocqueville

Friday, October 31, 2008

Notes on SqliteJDBC and Java Web Start

Java has a long standing limitation about loading native code from a jar, stemming from the unfortunate belief that native code was somehow impure. In the practical world, there was occasional need to do this, so people came up with the work-around of copying the native library out of the jar and into the local filesystem where it could then be loaded with System.load().

Another way do do this, within web start is to platform specific resource elements like these to your jnlp:

<resources os="Windows" arch="x86">
  <j2se href="" version="1.5+"/>
  <nativelib href="sqlitejdbc-v052-native-win.jar"/>
<resources os="Mac OS">
  <j2se href="" version="1.5+"/>
  <nativelib href="sqlitejdbc-v052-native-mac.jar"/>
<resources os="Linux">
  <j2se href="" version="1.5+"/>
  <nativelib href="sqlitejdbc-v052-native-lin.jar"/>

This way, we get the native library for the right platform, if all goes according to plan. One wacky detail is that System.loadLibrary("foo") munges the library name in some platform specific and poorly documented way. So, you need a libfoo.jnilib for OS X, for linux, and a foo.dll for windoze. It's kind-of a hassle to set up all these separate jars. And System.load(...) just tries to do what it's told without munging the name, so people still frequently use the trick described above.

SqliteJDBC does just that to load the native Sqlite library. SqliteJDBC also includes "pure Java" drivers, which are considerably slower for some operations, but act as a nice fallback when the proper native libraries aren't handy.

This all works well enough. But, on OS X 10.5, Java 6, it ends up falling back on the non-native drivers. After digging a bit, I found that the call to System.load(..) was failing with this exception:

java.lang.UnsatisfiedLinkError: /private/tmp/libsqlitejdbc-19214.lib: 
 at java.lang.ClassLoader$NativeLibrary.load(Native Method)
 at java.lang.ClassLoader.loadLibrary0(
 at java.lang.ClassLoader.loadLibrary(
 at java.lang.Runtime.load0(
 at java.lang.System.load(

The same thing works find when using Java 5 on the same machine and on Java 6 on Windoze, so I'm guessing the native library needs to be recompiled to work correctly with OS X's 64-bit Java 6.

A little spelunking into SqliteJDBC's code reveals that the author has kindly provided hooks where you can insert the path and name of your own driver using these system properties:

  • org.sqlite.lib.path

The driver will then attempt this:

System.load(new File(libpath, libname).getAbsolutePath());

More Resources:

Monday, October 27, 2008


A previous post, Hacking NCBI Entrez, dealt with how to retrieve sequence information from NCBI's databases. That method seems to work for prokaryotes and for yeast, but fails for most other eukaryotes.

For mammals, efetch for genome XML gives back crap like this (for rat):


Or this for XML for human Y chromosome. Totally useless. I take it I'm supposed to request each of the referenced sequences and parse out the regions for each? What a colossal pain in the ass!

By the way, BioJava has a parser called GenbankXmlFormat, but it's docs say, "Deprecated. Use". What INSDseqFormat is or how that is supposed to replace GenbankXML is totally unclear.

Friday, October 10, 2008


Ed Lazowska's elements of eScience:
  • Sensors 
  • Networking 
  • Visualization 
  • Databases 
  • Data mining 
  • Machine learning
eScience is what happens when scalable mass produced computing infrastructure is applied to scientific problems. It's not about the computation, it's about the data.
...which reminds me of a Peter Karp paper.

Wednesday, September 24, 2008

Resolution of the Universe

I attended a talk by Stuart Kauffman about systems biology, complexity, information theory, and related topics. It inspired the following crackpot theory / misunderstanding of Kauffman's ideas about the nature of the relationship between theory and reality.

Let's take physics.

  • N = Newton
  • E = Einstein
  • Q = Quantum physics

Now, let's imagine an ill-defined concept that I'll call resolution (as in screen resolution). R(N) is a measure of the explanitory power or detail of the theory N (Newtonian physics).

R(N) < R(E) < R(Q)

Right? So is there such a thing as R(reality)? Or is the resolution of reality infinite? It's certainly true that theories are only useful when R(theory) << R(system being described). What use is a theory of Universe sized complexity?

I'm willing to guess is that there is a finite (although large) R(reality). And that there's some sort of Heisenberg/Goedel-like limit to how close any theory can come to accurately describing reality, regardless of how high the R of the theory is. More and more advanced theories asymptotically approach this limit.

Tuesday, September 23, 2008

Eclipse templates and import statements

I was encouraged by this guy to make better use of Eclipse's code templates. I've used templates for a while for boilerplate code like adding a log4j logger to a class, like this:

class Foo {
   private static final Logger log = Logger.getLogger(Foo.class);

Using templates, I just have to type 'l4j' and cntl-space and there I go. One thing that had been plaguing me is that I still had to cnt-space again to add the imports, and thanks to Sun's needless duplication of effort, I had to select the log4j Logger over the lame java.util.Logger.

Little did I know about the ${import} and ${importStatic} variables. The following template gives me the required imports for free.

private static final Logger log = Logger.getLogger(${enclosing_type}.class);

And how about this for JUnit 4 imports:


With enough Eclipse templates, Java might not drive me completely batshit after all. I love it when I file a bug and the feature already exists. Thanks, Eclipse.

Sunday, September 21, 2008

Generics hurt my head some more

As previously mentioned, generics hurt my head. Sun's Java generics tutorial calls Collections.max() one of the subtlest examples of the use of generics.

public static <T extends Object & Comparable<? super T>> T max(Collection<? extends T> coll)

Subtle is one word for it. Anyway, to my walnut-sized brain, generics are confusing; maybe too confusing to be worth it. Was ClassCastException really a major problem for anybody?

Generics can be used in two places. Generic classes, like List<T>, and generic methods, like Collections.max().

As I learned in Programming Languages 505, return types are covariant while method parameters are contravariant (see the wikipedia entry). Generics, because they can be used to parameterize both return types and method parameters, are invariant. Keeping this in mind explains a lot. Thus, List<String> is not a subtype of List<Object>. This leads to a lot of confusion, particularly because Arrays in Java took the opposite choice. String[] is a subtype of Object[]. Aaarg!

Wildcards attempt to mitigate this problem, with extends and super.

List<? extends Flapdoodle>
TreeSet(Comparator<? super E> c)

I pronounce List<?> as "list of WTF". You're really supposed to say it's an unbounded wildcard type. Seems about as annoying as @SuppressWarnings("unchecked").

Effective Java 2nd Ed., Generics

For more confusing information, see the Generics chapter out of Effective Java, 2nd Ed. (Thanks to Joe Bowbeer for the pointer!):

  • Produce-extends, consumer-super
    In other words, if a parameterized type represents a T producer, use <? extends T>; if it represents a T consumer, use <? super T>.
  • Private helper method for wildcard capture
  • Typesafe heterogeneous containers

Source Control and Continuous integration

Version Control:

Continuous integration:

All the cool kids are on github. ...and I still thought SVN was new and hip. Guess I'm behind the times.

Saturday, September 20, 2008

Building Teams

I must be getting old. I voluntarily went to a non-technical talk by Jared Richardson about building teams for software development. Now I'm blogging about it. And using the term "blogging". Evidently, I've finally gone stark raving batshit.

Maslow"s hierarchy of needs


  • Interaction
  • Mentoring
  • Learning
  • Vision
  • Leadership
  • Fun


  • Isolation
  • Immaturity
  • Apathy
  • Thrashing
  • Feuding
  • Boredom


  • Daily Meetings
  • Code Reviews
  • Team-wide Architecture
  • Test Automation
  • The List (of work by priority)
  • Tech Lead

Friday, September 12, 2008

Generics solution

Previously, generics were hurting my head. The generics tutorial has this to say about the problem that is at issue here:

Wildcards are designed to support flexible subtyping [...]. Generic methods allow type parameters to be used to express dependencies among the types of one or more arguments to a method and/or its return type.

The solution here (thanks to Eric Jain) uses a generic method to express the dependency between the type Features in the Container passed into renderMap's getRendererFor(...) function and the type of Renderer returned. The renderer must be able to render that type of feature.

package generics.solution;

// We have a hierarchy of specialized subtypes of feature.
class Feature {

class SpecialFeature extends Feature {

// Features of a given kind are accessed through containers
interface Container<F extends Feature> {
 public Iterable<F> features();

// We want to draw features in various specialized ways.
interface Renderer<F extends Feature> {
 public void render(Iterable<? extends F> features);

class SpecialRenderer implements Renderer<SpecialFeature> {
 public void render(Iterable<? extends SpecialFeature> features) {
  // render special features

// Which renderer we use for a given container of features gets
// configured at runtime. This holds the mapping between renderers
// and containers.
interface RendererMap {
 public <F extends Feature> Renderer<? super F> getRendererFor(Container<F> container);

class RenderingScheduler {
 RendererMap rendererMap;

 public void render(Iterable<Container<? extends Feature>> containers) {
  for (Container<? extends Feature> container : containers) {

 public <F extends Feature> void render(Container<F> container) {
  Renderer<? super F> renderer = rendererMap.getRendererFor(container);

Update: Generics hurt my head some more.

Thursday, September 11, 2008

Generics hurt my head

Here's a little generics problem, in the form of some code:

package generics.question;

// We have a hierarchy of specialized subtypes of feature.
class Feature {

class SpecialFeature extends Feature {

// Features of a given kind are accessed through containers
interface Container<T extends Feature> {
 public Iterable<T> foos();

class SpecialContainer implements Container<SpecialFeature> {
 public Iterable<SpecialFeature> foos() {
  return null;

// We want to draw features in various specialized ways.
interface Renderer<V extends Feature> {
 public void render(Iterable<V> foos);

class SpecialRenderer implements Renderer<SpecialFeature> {
 public void render(Iterable<SpecialFeature> foos) {
  // render Subfoo objects

// Which renderer we use for a given container of features gets
// configured at runtime. This holds the mapping between renderers
// and containers.
interface RendererMap {
 Renderer<? extends Feature> getRendererFor(Container<? extends Feature> container);

class RenderingScheduler {
 RendererMap rendererMap;

 public void render(Iterable<Container<? extends Feature>> containers) {
  for (Container<? extends Feature> container : containers) {
   Renderer<? extends Feature> renderer = rendererMap.getRendererFor(container);

   // The following line produces a compile error:
   //   The method render(Iterable<capture#3-of ? extends Feature>) in
   //   the type Renderer<capture#3-of ? extends Feature> is not
   //   applicable for the arguments (Iterable<capture#4-of ? extends
   //   Feature>)

   // this works, but... ug.
   // How are these type parameters helping me again?
   Renderer<Feature> r = (Renderer<Feature>)renderer;
   Iterable<Feature> foos = (Iterable<Feature>)container.foos();
Update: Solution! ...and generics hurt my head some more.

Sunday, August 24, 2008

Symbolic computation in biology

From Pathway Databases: A Case Study in Computational Symbolic Theories by Peter Karp in Science magazine in 2001:

One lesson for computer scientists provided by pathway DBs (and by other bioinformatics applications) concerns the importance of DB content to solving computational problems. Most computer scientists focus their attention on algorithms, thinking that the best way to solve a hard computational problem is through a better algorithm. However, for problems such as predicting the pathway complement of an organism from its genome, or predicting metabolic products that an organism can produce from a given growth medium, I know of no algorithms that can solve these problems without being coupled with an accurate and well-designed pathway DB.

Thursday, August 14, 2008

Threads are evil

The docs for Sqlite, under the heading "Threads are evil", pointed me to a paper by Edward A. Lee of UC Berkeley called The Problem with Threads.

Concurrency in software is difficult. However, much of this difficulty is a consequence of the abstractions for concurrency that we have chosen to use. The dominant one in use today for general purpose computing is threads. But non-trivial multi-threaded programs are incomprehensible to humans.

Lee argues that threads are wildly nondeterministic by default, while "nondeterminism should be judiciously and carefully introduced where needed, and should be explicit in programs."

Saturday, July 12, 2008


Saturday, July 05, 2008

Who in their right mind would use Roman numerals for anything?

And, who in their right mind would bother to write code to deal with it, since it's already been done about 5 million times? Well, me for one. OK, yes, I am a turbodork. I can no longer deny it. Here's the code:

 * Represent Roman numerals between 1 and 4999 inclusive. 
public class Roman implements Comparable {
 private int value;
 private String roman;

 private final static String[][] digits = {
  {"", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"},
  {"", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"},
  {"", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"},
  {"", "M", "MM", "MMM", "MMMM"}

 private final static Pattern regex = Pattern.compile("M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})");

 public Roman(int n) {
  value = n;
  roman = intToRoman(n);

 public Roman(String string) {
  if (isRoman(string)) {
   value = romanToInt(string);
   roman = string;
   throw new RuntimeException("\""+string+"\" is not a valid roman numeral.");

 public String toString() {
  return roman;

 public int toInt() {
  return value;

  * Test whether the input string is a valid roman numeral (between 1 and 4999).
 public static boolean isRoman(String s) {
  return s!=null && !"".equals(s) && regex.matcher(s).matches();

  * Convert an integer between 1 and 4999 (inclusive) to a Roman numeral.
 public static String intToRoman(int n) {
  if (n>=5000 || n <= 0)
   throw new RuntimeException("Roman equivalents exist only for numbers between 1 and 4999.");
  StringBuilder sb = new StringBuilder(16);
  int place = 0;
  while (n>0) {
   int qoutient = n/10;
   int remainder = n - (qoutient * 10);
   sb.insert(0, digits[place][remainder]);
   n = qoutient;
  return sb.toString();

  * Convert a roman numeral to an integer. This method isn't strict about
  * the rules of what makes a valid roman numeral. In particular, it will
  * happily convert IIIIIIIIII to 10 or IL to 49 even though neither of
  * these are valid roman numerals. To check whether a String is a valid
  * roman numeral use isRoman().
 public static int romanToInt(String roman) {
  int value = 0;
  int prev = 1000;

  for (int i=0; i prev) {
    value -= (prev << 1);
   value += v;
   prev = v;

  return value;

  * @return return the integer value of a single roman character.
  * @throws RuntimeException if the character isn't a roman numeral.
 private static int charToValue(char c) {
  if (c=='I') return 1;
  if (c=='V') return 5;
  if (c=='X') return 10;
  if (c=='L') return 50;
  if (c=='C') return 100;
  if (c=='D') return 500;
  if (c=='M') return 1000;
  throw new RuntimeException("Illegal character in roman numeral: " + c);

 public int compareTo(Roman other) {
  return this.value - other.value;

 public int hashCode() {
  return value;

 public boolean equals(Object obj) {
  if (this == obj)
   return true;
  if (obj == null)
   return false;
  if (getClass() != obj.getClass())
   return false;
  Roman other = (Roman) obj;
  return this.value == other.value;

Wednesday, July 02, 2008

Unix command line foo

List files in a hierarchy without the svn crap.

find .  -name ".svn" -prune -o -print

Tuesday, June 24, 2008

Firefox extension security

Firefox 3 takes a couple of steps to make extensions a little more secure. In general, extensions run with full privileges, so this is important. The extension update mechanism now requires either SSL or digital signatures for both the update.rdf file and the xpi file.
If you want to bypass these restriction, open the URL "about:config" and create a preference called extensions.checkUpdateSecurity whose value is set to false. This can be useful in testing, but is discouraged in practice.

Thursday, June 19, 2008

Rhinos and Tigers

From Steve Yegge, (author of the next big language) a transcript of a talk supposedly about Rhino and javascript, but also about the virtual machines, scripting, and languages in general.
  • Low Level Virtual Machine (LLVM)

Wednesday, June 18, 2008

Firefox extension development

Uhoh. Now that Firefix 3 is out, it looks like I have to update my extension Firegoose to work. The trick here is that Firegoose uses Java.
Briefly, Firegoose is a Firefox extension that integrates several bioinformatics web resources into the Gaggle integration framework. The Gaggle is based on passing messages of a few fundamental data types in the bioinformatics domain, including lists, matrices, and networks. The transport protocol used in Gaggle is (for better or worse) Java RMI, and that, of course, requires Java. Hence, the Firegoose's reliance on being able to crank up a working (and unrestricted) JVM from inside Firefox.
That was done in for Firefox 1.x and 2.x using an arcane and dirty trick from the fine folks at Simile project at MIT called javaFirefoxExtension. And the sad thing for me is that the trick is apparently broken in FireFox 3. [Note: this turns out not to be the case.]
Silent failure is the curse of the Firefox extension developer. Debugging in Firefox is painful, at best, and even more so when using the bridge between Java and javascript (aka LiveConnect).
In order to do anything useful with Java in a Firefox Extension, there are at least 2 nasty bits to overcome. First, you have to load classes from inside an XPI file. This is essentially a variant of the classpath problem. Second, you probably have to give yourself full permissions (by manipulating
As mentioned above, an arcane solution to these problems has been worked out by folks at the SIMILE lab at MIT. Their PiggyBank extension is the prototype for use of Java for heavy lifting inside a Firefox extension. They actually run a full app server inside the browser. Another really cool extension that uses the same technique is xquseme, which embeds the Saxon XQuery processor. You can then perform arbitrary XQueries against any document you can browse to. Using Java in an extension gives you the power to combine the wealth of libraries available in the Java universe with a fully featured browser. So how do we go about doing it?
Loading classes from your XPI file is possible using Firefox's capability to resolve chrome URLs to paths in the filesystem. The mapping between chrome URLs and files is defined in the chrome.manifest file in your XPI. Once we have paths (as file:/ URLs), we create our own Calling the constructor[] urls) requires a trick, because the Java bridge seems not to do a very good job of coercing javascript types to java types. To further muddy the waters, the way js-to-Java type coersion is handled in LiveConnect changed in Firefox 3. You'd expect that passing a javascript Array containing objects to work. But, try that and you'll get an error like this:
InternalError: Unable to convert JavaScript value [...blah blah...] to Java value of type[]
Code gleaned from the SIMILE lab (thanks!) solves this particular pain in the ass:
// from
_toJavaUrlArray: function(a) {
 var urlArray = java.lang.reflect.Array.newInstance(, a.length);
 for (var i = 0; i < a.length; i++) {
  var url = a[i];
   (typeof url == "string") ? new : url
 return urlArray;
Truth and soul
Actually, problems with js-to-Java type conversion aren't limited to constructing classLoaders, but seem to be pervasive in FF3. Apparently, whenever you try to convert a js array to Java, LiveConnect screws it up. For example, passing a js array of strings, I get an array full of "true". Yes, it's true. There was an object there. Thanks a lot for that! (Happens on both Mac OS X and Windows, btw.)

Thursday, June 05, 2008

Virtual Private Hosting vs Self-hosting

According to our IT people, a middle-of-the-road server w/ a 3 years service agreement costs about 5 grand. (or 5000/36 moths = 139/month). The server would be available outside the company, but would be located in a network DMZ. We can have sudo, but can't mount any network filesystems. Giving outsiders privileged access to the server is apparently possible through some complicated process.
Apparently binding a DNS entry for a subdomain to an external server is quick and easy. So, why not get a virtual private server somewhere and avoid jumping through management hoops?
Here are a few providers I'm considering:

Sunday, June 01, 2008

On the Criteria to Be Used in Decomposing Systems into Modules

David L. Parnas's infuential 1972 paper On the Criteria to Be Used in Decomposing Systems into Modules (sited in Paul Shannon's Gaggle paper) described modularization by information-hiding, later developed into high cohesion and low coupling.

The paper points out that it takes deliberate attention to get good software structure and that limiting information is the key to good design. Where there are many modules, more structure is needed. It's useful to create a hierarchical modular structure using "part of" and arrange programs in a hierarchical structure based on "uses". (By programs, he seems to mean parts of a module?) He criticizes the idea of "layers of abstraction" as too vague in absence of any measure of what's more abstract than what.

The concept of information-hiding as a software design principle is widely accepted in academic circles. Many successful designs can be seen as successful applications of abstraction or information hiding. On the other hand, most industrial software developers do not apply the idea and many consider it unrealistic.
The principle is now clear. It is still hard to apply.

He also seems to foreshadow design patterns.

Researchers should be publishing proposed standard structures for classes of programs. Researchers should be publishing proposed standard designs.

This kind of design-centered approach to software engineering appeals to me a lot more than fluffy stuff like usability studies and requirements gathering.

Tuesday, May 27, 2008

General purpose programming on the GPU

The video-card manufacturers are at an interesting crossroads. nVidia is really pushing general purpose computing on graphics hardware.

Cool idea, but it can't really take off unless programs are written to detect the presence of capable hardware and load code specially compiled for that particular chipset. In other words, GPGPU code written for nVidia cards won't run on ATI cards. Bummer. Clearly nVidia understands the advantage to be had from a large body of code compiled for their instruction set. They want to be in the position of owning the GPU equivalent of the x86 instruction set. Nice move on their part.

But wouldn't it be much cooler for the rest of us if things were a little more open? Is a common instruction set for graphics cards at all plausable? Like ARM is for embedded devices? Or maybe a common intermediate layer and JIT for all types of GPU code? That would be cool. And something like that would induce a lot more developers to take the plunge into compiling to the GPU, which has got to be a fairly drastically different piece of code relative to a straightforward implementation.

Tuesday, May 20, 2008

What is a Ruby code block?

You can't believe everything you read on the internet. For example, you'll read that, in Ruby, use of the return keyword is optional. Like this:

def foo(a)
  return a*a


def bar(a)


So far, so good. You'll also read that a code block in Ruby is an anonymous function. And what could be more natural than to return from a function? But try this:

[1,2,3].map {|a| a*a}
>> [1,4,9]

[4,5,6].map {|a| return a*a}
>> LocalJumpError: unexpected return
>> from (irb):14
>> from (irb):14:in `map'
>> from (irb):14
>> from :0

Well then, apparently one or both of our premises is wrong. Anyway, what is this LocalJump thing that is apparently in error?

My guess as to what's really going on is this: The code block is an attempt at creating a function-like artifact that matches the behavior of a code block in a non-functional language. That is, the code in the block doesn't execute in its own environment. Calling a code block doesn't cause a stack frame to be created. The reason the return causes problems is that there's no separate environment to return from.

It bothers me that I can't seem to google up an explanation of exactly what a Ruby code block (or Proc or lambda or method for that matter) is in terms of what does or doesn't happen to stack frames or environments. There's plenty of tutorials on how to use these constructs, but, little information on what they are, how they work, or why there are so many flavors of function-like things in Ruby.

Here's what I got from comp.lang.ruby.

Wednesday, May 14, 2008

Working with microarray data

State-of-the-art DNA microarrays contain between 1 million and 6 million features (different probes) on a single slide. Assuming a 32 bit floats, we need at 4 bytes per feature. If we include start and stop coordinates on the genome for each feature, we're up to 12 bytes per feature.

featuresMBMB w/ coords
1 million424
6.5 million2475

In tiling arrays, probes target regularly spaced segments of the genome so that expression can be measured at every point along the genome. Our group has done some work with arrays containing 60bp probes tiled every 20 base-pairs along the genome of Halobacterium salinarium. The genome of H. salinarum is 2.6 million bases, so we are able to cover most of the genome with a little less than a quarter of a million probes. To cover the entire human genome at similar resolution would take ~307 million probes.

As a side note, AFFY sells a set of 14 arrays with a total of almost 90 million probes that covers the whole genome at 35 bp resolution. (90 million * 35bp = 3150 mpbs. Does their idea of a probe include both forward and reverse strand?)

These arrays are all custom made for an individual genome. I wonder if an array with all 4 million possible 11-mers would be useful? Since there's only 4 million of them, you could expect a fair number of collisions - where a probe is non-unique on the genome. How do you do that calculation?

64-bit Java

As far as I can tell, the capability to handle much bigger heap sizes is the main benefit from 64-bit Java. Primitive types don't change in size. An int is still 32 bits.

Java arrays are indexed by 32-bit integers (actually, signed integers, so 31 bits). Apparently this doesn't change in 64-bit Java. From the Java Language Spec:

Arrays must be indexed by int values; short, byte, or char values may also be used as index values because they are subjected to unary numeric promotion (§5.6.1) and become int values. An attempt to access an array component with a long index value results in a compile-time error.

This is a serious limitation, if you ask me. The main advantage of 64-bits is a simple programming model for large data objects. Apparently, Sun disagrees. Integer array indexes are baked fairly deeply into the language and into existing application code. Think of everywhere you refer to the length or an offset into an array as an int. Imagine how often that's done in the library code. Dunno if they'll ever change it.

Other side-effects of 64-bit

  • Increased object size
    • 64 bit pointers
    • alignment to 8-byte boundaries
  • The minimally required heap size is 51.1% larger
  • More cache misses

source: 64-bit versus 32-bit Virtual Machines for Java

Finally, here's an intriguing tidbit: sun.misc.Unsafe

Monday, April 07, 2008

Microformats and JSON

I maintain a Firefox extension called Firegoose that (among other things) works with data embedded in HTML pages. We do this using microformats, embedded XML, and even, if we have to, nasty screen scraping. I'm not too religious about the format of the data - just give me something I can parse.

So, it occurred to me to wonder why JSON wasn't more popular in this niche. Easily parsable, legal in a browser, scriptable from within the page - why not? Of course people are, as usual, miles ahead of me here:

Strangely, activity around this topic seems to have surfaced briefly and disappeared sometime in mid 2007.

Saturday, April 05, 2008

More visualization stuff

Ben Fry's Visualizing Data Here's an article about Processing and a cool book: Flash Math Creativity: Second Edition

Blog "LivingCode" w/ some Processing related stuff.

IBM's Visual Communication Lab

John Resig's processing.js

Nodebox: a graphics framework for Python on OS X.

Thursday, April 03, 2008


Lisp Alien
I think this is cool. It's definitely bizarre.
There are funny tutorials for Haskell and Lisp.

Wednesday, March 12, 2008

Information Visualization

I went to a cool talk (video should eventually be available) at UW by Jeffery Heer, developer of the prefuse visualization toolkit, at Berkeley's Visualization Lab.

His Flare toolkit, a flash-based follow-on to prefuse, also seems to validate my desire to ditch Java for this sort of thing and move to Adobe's platform (flash/flex/AIR).

Mentioned during the talk were: GapMinder, ManyEyes

A related class at Berkeley on infoviz. I suppose, if I get interested in this stuff, I ought to read those Tufte books.

Monday, March 03, 2008

What Google should do

Since nobody asked me, let me tell you. Google needs another revenue stream. Advertising is a cyclical business, headed straight down just at the moment. And commoditized online advertising probably isn't a big enough business to sustain an operation the size of Google in the manner to which they've become accustomed. 
But when you're as big as Google, what businesses are worth getting into?
How much does VISA take in per year? I dunno, but I'm pretty sure if you want a money fountain to finance an altruistic geek nirvana, that's the kind of revenue stream you want.
Where support by advertising is a poor fit you need a way to keep the lights on. Charity and community spirit work for some efforts, but some things require the luster of cold hard (micro) cash. For the information economy to become a reality, this is a requirement. Many others have tried and failed, but if anyone can make micropayments work, it would be Google.
Medical records
There's a mint to be made data mining the nation's (or world's) medical records. Just collect the data and lease it back to the pharmas, biotechs, and academic researchers. The big roadblocks, which Google might have enough good will to overcome, are (justifiable) privacy concerns and the pathological motives of vested interests in the field.
Ok, I guess the GOOG is way ahead of me on both of these...

Saturday, March 01, 2008

Lecture videos online

You can become a scholar these days just by sitting in front of your computer watching video lectures.

A blog called FreeScienceOnline has a better course catalog than most universities.


A couple discussions about computer science lecture videos on stack exchange.

More Updates

Life-hacker has a post on how to Get a free college education online.

Wednesday, February 27, 2008

What a drag

Java's drag and drop functionality is painful. What I want is a list that the user can reorder by dragging. Here a couple of nice efforts with the caveat that they work only for single selections.

Sun's tutorial is here. There's a chapter in O'Reilly's Swing Hacks about it, but their code seems buggy to me.

Wednesday, February 13, 2008


Classcycle is a tool for analyzing dependencies among Java classes and packages. It can really help when trying to make sure you haven't introduced any unintentional dependencies. It's similar to JDepends, but a little more up-to-date. Its output in a nicely formatted web page (actually an xml + xsl -> html deal). 

Tuesday, February 05, 2008

Programming Languages are where it's at

There's a lot going on in the world of programming languages these days - from functional/OO crossbreeding to new language constructs for concurrency. Here's a few random interesting tidbits:

Oh yeah, and concurrency is where it's at too:

Sunday, February 03, 2008

Scala tutorial

On a cold gray sunday afternoon, I went through the tutorial for the Scala programming language (by a group in Switzerland called LAMP). There was an exercise straight out of SICP about evaluating simple algebraic expressions. Here's what I came up with. It may be far from optimal, but seems to work for at least the one test case.

abstract class Tree
case class Sum(l: Tree, r: Tree) extends Tree
case class Var(n: String) extends Tree
case class Const(v: int) extends Tree

object Expressions {
type Environment = String => int

def eval(t: Tree, env: Environment): int = t match {
  case Sum(l, r) => eval(l, env) + eval(r, env)
  case Var(n) => env(n)
  case Const(v) => v

def derive(t: Tree, v: String): Tree = t match {
  case Sum(l, r) => Sum(derive(l, v), derive(r, v))
  case Var(n) if (v == n) => Const(1)
  case _ => Const(0)

// there's probably a more functional way to do this
def simplify(t: Tree): Tree = t match {
  case Sum(l, r) =>
    val sl: Tree = simplify(l)
    val sr: Tree = simplify(r)
    sl match {
      case Const(lv) => sr match {
        case Const(rv) => Const(lv + rv)
        case _ =>
          if (lv==0) sr
        else Sum(sl,sr)
      case _ => sr match {
        case Const(rv) if (rv==0) => sl
        case _ => Sum(sl,sr)
  case Var(n) => Var(n)
  case Const(v) => Const(v)

def toString(t: Tree): String = t match {
  case Sum(l, r) => "(" + toString(l) + " + " + 
                    toString(r) + ")"
  case Var(n) => n
  case Const(v) => v.toString()

def main(args: Array[String]) {
  val exp: Tree = Sum(
  val env: Environment = { case "x" => 5 case "y" => 7 }
  println("Expression: " + toString(exp))
  println("Evaluation with x=5, y=7: " + eval(exp, env))
  println("Derivative relative to x: " + 
          toString(derive(exp, "x")) + " = " +
          toString(simplify(derive(exp, "x"))))
  println("Derivative relative to y: " +
          toString(derive(exp, "y")) + " = " +  
          toString(simplify(derive(exp, "y"))))


The subclasses of Tree define the nodes of a parse tree for simple expressions like:

((x + x) + (7 + y))

The program demonstrates Scala's pattern matching capabilities, defining methods that perform operations on trees: evaluate, take derivatives, and print. Pattern matching is a feature borrowed from the ML family of languages. You might think of them as case statements on steroids. Sometimes they can concisely replace polymorphism. Here, they play the role of the visitor pattern.

The visitor pattern makes it easy to add new operations on all members of an inheritance hierarchy in a modular way. (Modular meaning the operation is encapsulated). The cost is that adding a new member to the inheritance hierarchy requires modification of all existing operations. The cost and benefit of pattern matching statements is the same.

More Scala Resources:

Friday, February 01, 2008

Things I'd like to be able to do in Eclipse

  • Right-click on an interface in the Package Explorer and say "Implement" and get a new class with stub implementations of all the methods in the interface.
  • Similarly, right-click on a class or interface and say "Extend".
  • Further, you could right-click on a class or interface and say "Generate Proxy" or "Wrap" and get a new class that wraps an instance of the existing artifact and then opens the "Generate Delegate Methods" dialog.

I know there are ways to do all these things that only take a few steps, but they're fundamental tasks that deserve a one-step command. "Implement this interface" is a lot closer to how I think than "Create a class that implements interface Foo".

  • Expand-all packages (but don't expand into java files) in the Package Explorer.

Friday, January 25, 2008

Java Packages and Dependency Management

I used to think Java's packages were a great thing. They solved the problem of naming conflicts and enforce a consistent system of organizing source code on disk for everything written in Java. That view came from working on some C/C++ codebases where boundaries of name-spaces were haphazard or nonexistant and the conceptual organization of the source bore no relation to the location in the file system of the various modules that made up the program. Tracing an execution path through the code was next to impossible and the build process a nightmare.

I had a friend, an experienced C++ developer who said he always felt straight-jacketed by Java packages. He didn't want to be forced to organize his files in the same way he organized his packages. I never understood why anyone would want the freedom to create the type of disaster I had seem before. But there's a part of the picture I was missing. Packages also imply something about dependencies.

Good guidelines on packaging are remarkably hard to come by. But, at minimum, a well designed package structure will minimize dependencies between packages and avoid circular dependencies. I've come to follow other rules of thumb about inter-package dependencies. Everyone can depend on a util package. But a util package should have no dependencies inside the project. Domain model classes should be defined abstractly and be free of outside dependencies.

Inversion of control is when code at a lower level of abstraction depends on code at a higher level of abstraction. The domain model is defined at a high level, with little or no implementation detail. In fact, much of my domain models are just interfaces. Lower level packages do the grunt work with specializations of the domain model artifacts.

Recently, I worked on a project with a Swing UI, which I hadn't done in years. I created a ui package to hold all the Swing related kerfuffle. I started noticing that I had a lot of actions. These little 20 line subclasses of AbstractAction define the event handlers used by controls throughout the UI and seem to multiply like rabbits. Mixed in with the rest of the UI classes, classes that formed the windows, panels, dialogs, and other on-screen widgetry of my app, the actions were hard to find and just cluttered up the place. I wanted, for strictly organization reasons, to move them out of the way.

Of course, the actions depend on the UI widgets they invoke and the UI widgets depend on the actions. So, if I put my actions in a separate package, I'd have a big ugly circular dependency staring me right in the face. I go sick of looking at the action classes, so I did it anyway. Where dependencies and name-spaces are concerned, they're all in one conceptual unit. But on disk, and therefore necessarily in the package hierarchy, they're in two different places.

Since becoming more aware of the role dependencies play in packaging, I've notice more little conflicts of interest like this. I still prefer the enforced file system structure of packages to total chaos, but I now recognize my friend's point.

It would be an interesting extension to the language to be able to explicitly declare that package X was allowed to depend on packages Y, Z, and util. Any other dependency could generate a compiler warning. There are static code analyzers that will do just that. (JDepends for example.) It would be even better if the IDE was aware of such things and could issue a warning as you added a suspicious dependency.