N-grams and N-logs with Ruby

N-grams are useful in spell-checking, for example. I've been working on a project where I need to extract the word-level equivalent of n-grams from phrases. Lacking a better name, I call them n-logs. As one might expect, Ruby makes this easy. Here is a pipe-friendly script:

Modifying this script to create n-grams instead is trivial.
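A minimal sketch of the idea, assuming words are separated by whitespace and defaulting to n = 2. The method names nlogs, ngrams, and print_nlogs are my own placeholders, not part of any library:

```ruby
# Word-level n-grams ("n-logs"): every run of n consecutive words.
def nlogs(phrase, n = 2)
  phrase.split.each_cons(n).map { |words| words.join(' ') }
end

# Character-level n-grams: the same idea applied to characters.
def ngrams(word, n = 2)
  word.chars.each_cons(n).map(&:join)
end

# Pipe-friendly driver: reads one phrase per line from any IO-like
# object and writes one n-log per line.
def print_nlogs(input, output, n = 2)
  input.each_line do |line|
    nlogs(line, n).each { |nlog| output.puts nlog }
  end
end
```

Calling print_nlogs(ARGF, $stdout) at the bottom of a script file would let it sit in the middle of a shell pipeline.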

The Need of a Study of Anatomy (also Swans)

"In our initial sketches for compositions, when memory has to take the place of the living model, we rely to a great extent on our anatomical knowledge for the suggestion of action and form generally. And again it adds materially to our faculties for self-criticism, which, like a sense of humour, is often, nearly always, our salvation."
Solomon J. Solomon, The practice of oil painting and of drawing as associated with it

Knowledge of your tools is necessary, but not sufficient. The choices you make when planning the structure of software depend on your knowledge of the problem domain. A project is limited (sometimes crippled) by your comprehension of the form and motion and constraints of the body before you.

"It looks like it was made with, you know... longing. Made by a person really longed to see a swan"
Kaylee, Firefly

Self-criticism and a sense of humor are ineluctably linked, I find. Those who have not the capacity to criticize their own efforts lack most of the capacity to laugh at their own failings. If you don't think it's funny when you spend two hours failing to find a mindless bug in a simple depth-first traversal function, then you're not me.

Determine local (not inherited or mixed in) methods in Ruby

I've been finding this useful for exploration of other people's code.
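One way to do this, sketched under the assumption that "local" means methods defined directly in the object's class or on the object itself. The helper name local_methods and the demo classes are hypothetical:

```ruby
# Public and protected methods defined directly on the object's class
# or on the object itself, skipping anything inherited or mixed in.
def local_methods(obj)
  (obj.class.instance_methods(false) + obj.singleton_methods(false)).uniq.sort
end

# Hypothetical classes, just to show what gets filtered out.
module Greeter
  def greet; end
end

class Base
  def inherited_one; end
end

class Widget < Base
  include Greeter
  def spin; end
end

p local_methods(Widget.new)
```

Passing false to instance_methods is what drops the superclass and module methods.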

The Beginning of the End for Rubyforge

Jamis Buck is abandoning development of SQLite/Ruby, SQLite3/Ruby, Net::SSH and Capistrano. I do not say this derogatorily; Jamis owes us Capistrano like George R. R. Martin owes us A Dance with Dragons.

In the comments to that post, Dr Nic asked,

... were there ever “core contributors” who could be all added to the rubyforge project’s admin so they can start releasing new versions? Or did you ask all of them and no one said they’d take over the project?

Jamis replied:

"[T]here are no other core contributors. I tried once to create something like that, but no one else seemed to have the “passion” or “vision”. Lots of people submitting patches (many of them quite good!), but no one demonstrating a real, general desire to dig into the internals. That’s kind of why I left it like I did—there really wasn’t any heir-apparent that the keys could be left to.

"That said, if someone steps forward and seems to be getting community support (for any of the projects) behind them, I’ll be happy to give them admin access to the appropriate rubyforge pages."

Rubyforge served a purpose for several years, and served it well. But Rubyforge is a bottleneck in the distribution of code, and this is exacerbated by the Ruby community's reliance not only on RubyGems, but on the idea of the canonical, official version of a project. The increased popularity of distributed version control releases some of the pressure. GitHub has substantially reduced the friction involved in collaboration. Even so, the idea still holds that once a line of work is ready, you release it on Rubyforge, so that it's official.

Good coders, even those not afflicted with a love of novelty, will eventually grow bored with their projects. The distribution model represented by Rubyforge cannot, or at least should not, long survive this human tendency.

Ruby FFI example using #ffi_lib

Ruby FFI is a cross-VM library for calling foreign functions (i.e. C or C++).  It isn't obvious from the introductory blog posts how you specify which library to use, but the answer isn't hard to find in the source.
Examples speak louder than words:
require 'rubygems'
require 'ffi'

class MDB
  extend FFI::Library
  
  # The lib name gets spackled with platform-specific 
  # prefix and suffix. On Mac OS X, e.g., the ffi_lib
  # name turns into 'libmdb.dylib'
  ffi_lib 'mdb'
  
  # Who needs enum, anyway?
  NOFLAGS = 0
  MDB_TABLE = 1
  
  attach_function :mdb_init, [], :void
  attach_function :mdb_exit, [], :void
  
  # In the libmdb headers, you'll find that this function
  # actually returns a pointer to an MDBHandle struct.  
  # FFI::Struct would likely help out here, but just
  # calling the return result a :pointer works for now.
  attach_function :mdb_open, [ :string, :int], :pointer
  attach_function :mdb_close, [ :pointer], :void 
    
  def self.open(path)
    MDB.mdb_init
    db = MDB.mdb_open( path, MDB::NOFLAGS)
    
    yield db
  ensure
    # Clean up even if the block raises.
    MDB.mdb_close(db) if db
    MDB.mdb_exit
  end
  
  attach_function :mdb_dump_catalog, [:pointer, :int], :pointer
    
end

MDB.open('mdb_files/sample.mdb') do |db|
  MDB.mdb_dump_catalog(db, MDB::MDB_TABLE)
end

Set a size limit on a Ruby/LDAP query

The RDocs available online for Ruby/LDAP are not much help here.  If you download the source, though, you'll find an FAQ file with the goods.

  
    conn = LDAP::Conn.new( 'localhost', 389 )
    # Limit the results set to a maximum of 10.
    conn.set_option( LDAP::LDAP_OPT_SIZELIMIT, 10 )
  

Use Array#pack and String#unpack instead of Base64

Array#pack("m*") is your friend. So is String#unpack("m*"). You can use them instead of the Base64 methods encode64 and decode64. Hat tip to Rack's authentication example.
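A short sketch of the round trip. The helper names b64_encode and b64_decode are mine, chosen for illustration. Note that pack takes an array and unpack returns one, and that the encoded string normally ends with a newline, as encode64 output does:

```ruby
# Hypothetical helper names; only pack and unpack are core Ruby.
def b64_encode(str)
  [str].pack("m*")
end

def b64_decode(str)
  str.unpack("m*").first
end

token = b64_encode("user:secret")
puts token
puts b64_decode(token)
```

No require is needed, which is handy in small scripts and middleware.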

ActiveMDB is on GitHub

ActiveMDB development, such as it is, will now take place on GitHub.

Ruby snippet for flexible blurbing of text

Here's a cute little method I use for extracting blurbs from text. Specify a minimum character count as the argument, and the method returns a string containing the minimum text plus whatever it takes to reach the next period, question mark, or exclamation point.

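def blurb(size=75)
  # Escape the hyphen so it is a literal character, not a ':'-to-'_' range
  # (which would include '?' and run past the end of a question).
  reg = Regexp.new ".{#{size}}[,;:\\-_\\w\\s]*[.!?]"
  body.slice(reg)
end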
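To try the idea outside a model, here is a standalone variant that takes the text as an argument instead of reading it from body. The name blurb_of is hypothetical, and the hyphen in the character class is escaped so that it is treated as a literal:

```ruby
# Returns at least `size` characters of text, extended to the next
# period, question mark, or exclamation point. Returns nil when the
# text is shorter than `size` or no sentence end follows.
def blurb_of(text, size = 75)
  text.slice(/.{#{size}}[,;:\-_\w\s]*[.!?]/)
end

sample = "Ruby makes text processing pleasant. It also makes it short! Does it scale?"
puts blurb_of(sample, 10)
puts blurb_of(sample, 40)
```

Because a nil result is possible, callers may want a fallback such as blurb_of(text) || text.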

Developer dependencies in RubyGems 1.2.0 using add_development_dependency

As the release note says, RubyGems 1.2.0 now supports two levels of gem dependencies, runtime and developer. If you don't use add_development_dependency to specify a dependency, RubyGems treats it as a runtime dep.

Google, in its questionable wisdom, tells me that there's not much interest in this feature. Echoe is already using it, which is simultaneously unsurprising and reassuring.

Perhaps I shouldn't find the lack of widespread acclaim strange. I suppose Ruby developers don't hack other people's code, they just use it. GitHub will change this, perhaps already is changing this. Case in point: I cloned a project today just to see how it works, with no intention of actually using it. The project's tests failed because I didn't have three dependencies: one runtime, two developer.

Super. Let's add some RubyGems 1.2.0 typed dependencies (17 Awesome Points go to anyone who can figure out the project based on this snippet).

  spec = Gem::Specification.new do |s|
    ...
    s.add_runtime_dependency 'right_aws'
    s.add_development_dependency 'Shoulda'
    s.add_development_dependency 'mocha'
    ...
  end
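To confirm how RubyGems classified each dependency, the spec can be inspected directly. This is a sketch with a hypothetical gem name, version, and summary. It only builds the spec in memory and installs nothing:

```ruby
require 'rubygems'

# Hypothetical spec, for illustration only.
def example_spec
  Gem::Specification.new do |s|
    s.name    = 'example'
    s.version = '0.0.1'
    s.summary = 'Illustrates typed dependencies'
    s.add_runtime_dependency     'right_aws'
    s.add_development_dependency 'mocha'
  end
end

# Each dependency records whether it is runtime or development.
example_spec.dependencies.each do |dep|
  puts "#{dep.name}: #{dep.type}"
end
```

The spec also exposes runtime_dependencies and development_dependencies if you want the two lists separately.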

Now, build the gem using whatever means have been provided, and install it with its developer dependencies.

... Wait, what? The release announcement doesn't tell me how to trigger the developer dependencies. Google is no help, either. And good luck finding RDocs for RubyGems.

After I started grumbling, but before diving into the rubygems source, I remembered to run gem help install. There I found this beautiful, kissable tip:

 --development                Install any additional development
                              dependencies

Here's the information you came for: To install a gem with developer dependencies, use gem install --development

Developer dependencies should also be useful with a rake setup task, as Assaf Arkin advises having for all public projects. Somebody's going to have to dive into the source code to figure out how to make that work.

So now I have added some niceness to a project I cloned from GitHub. It's not much more work to fork the project, commit my changes, and shoot off a pull request. All for a project I'm not really going to use. Long live GitHub.

How to compile mdbtools on Mac OS X 10.4 and 10.5

Update: These instructions are working for me on Mac OS X 10.5.3

Prerequisites

You’ll need MacPorts, the mdbtools source, and a simple patch. Use MacPorts to install glib2, libtool, and automake:

port install glib2 libtool automake

One commenter reported that he had to upgrade to version 2.5.35 of flex. I had no trouble with the version of flex included with Leopard, viz. 2.5.33.

MDB Tools source

You can get the mdbtools source from CVS, via the instructions at the sourceforge site, and my patch here.

Alternatively, use a git repo I started because CVS makes baby Theanthropos cry:

git clone git://gitorious.org/mdbtools/mainline.git mdbtools

autogen.sh && make && make install

cd into the mdbtools directory and run autogen.sh. Pass any configuration args to autogen.sh, and it will pass them along to configure. /usr/local is the default prefix. The options below set the install location and enable compilation of the mdb-sql tool, but disable gmdb2, the Gnome MDB File Viewer and debugger.

./autogen.sh --prefix=/users/Matthew/local --enable-sql --disable-gmdb2
make
make install

Assuming make and make install succeed, you can test the results like so:

mdb-ver /path/to/thingy.mdb
mdb-tables /path/to/thingy.mdb
mdb-schema /path/to/thingy.mdb

How to read the Rubigen home page

You'll need Firefox and Firebug.

  1. Navigate to http://rubigen.rubyforge.org/.
  2. While shielding your eyes, open Firebug and go to the HTML tab.
  3. Click on the body element.
  4. In the Style pane, set background-color to #000000.
  5. You may also wish to set the background-color of a elements to #000000

Compiling git on fresh Ubuntu 7.04

Note to self: packages I had to install for "make all doc" to work for git 1.5.3.8.

$ history | grep install

   36  sudo apt-get install curl
   41  sudo apt-get install libcurl
   49  sudo apt-get install tk8.4
   53  sudo apt-get install cpio expat
   55  sudo apt-get install zlib
   61  sudo apt-get install build-essential
   66  sudo apt-get install zlib1g-dev 
   72  sudo apt-get install asciidoc
   75  sudo apt-get install xmlto

John Galt, meet Paul Graham

From the Arc tutorial, line 372

arc> (is 'a 'a)
t
(For those who don't get the joke, don't worry, it wasn't that funny. Neither was Ayn Rand.)

Installing Arc on Mac OS X

Installing the Arc language on Mac OS X is trivial, assuming you have the right version of MzScheme.

Install MzScheme 352 with MacPorts

Arc requires MzScheme, version 352. The install instructions warn: “Don’t use the latest version. There is said to be some bug/feature in it that breaks Arc.” The version of mzscheme in MacPorts is 371, so we have to tweak the portfile, and the only historical versions of mzscheme I can find portfiles for are 201 and 360. There seem to have been some major build changes between 352 and 360, so I frankensteined the portfile for version 201. You can find my version here. I’m running 10.4.11 on my dev machine, and I haven’t tested anything else.

Thanks go to bitshaker.com for the instructions on setting up a local port repo.

  1. mkdir -p /Users/Shared/dports/lang/mzscheme
  2. Edit /opt/local/etc/macports/sources.conf to look like this:
    # To enable your local ports repository, uncomment and customize the
    # following line to point at your local dports directory
    # Example: file:///Users/landonf/misc/macports/dports
    #
    # To get macports from the macports rsync server use:
    # rsync://rsync.macports.org/release/ports/
    file:///Users/Shared/dports
    rsync://rsync.macports.org/release/ports/
      
  3. Download this portfile and copy it to /Users/Shared/dports/lang/mzscheme/Portfile
  4. Update the ports index:
    portindex /Users/Shared/dports
  5. When you run port list mzscheme, you should have an entry for mzscheme @352
  6. Install with:
    sudo port install mzscheme @352
  7. Is mzscheme the right version? Does it run?
    $ mzscheme
    Welcome to MzScheme version 352, Copyright (c) 2004-2006 PLT Scheme Inc.
    > (+ 1 2 )
    3
    > (exit)
    $
      

Download Arc and run it with mzscheme

Download http://ycombinator.com/arc/arc0.tar and untar it somewhere useful on your system. ~/src/ perhaps?

$ tar -xvf arc0.tar
$ cd arc0
$ mzscheme -m -f as.scm 
Use (quit) to quit, (tl) to return here after an interrupt.
arc> (+ 1 2)
3
arc> (quit)

http://arclanguage.org/install

"Tables are the [lisp] lists of html"

Paul Graham:

Tables are the lists of html. The W3C doesn't like you to use tables to do more than display tabular data because then it's unclear what a table cell means. But this sort of ambiguity is not always an error.

Zed Shaw:

I may never do another CSS only layout again. I’m starting to wonder how … we got sucked into that crap, especially if the only way to really get a good looking layout with CSS and div tags is with mountains of stylesheets, html, and sometimes some damn javascript.

I’m not kidding about the javascript. I’ve seen people desperately trying to force their square-peg 3 column layout through the CSS round hole resort to javascript tricks to force the columns in the right spots.…

It’s not gauche to do what’s easiest and nobody’s going run you out of Designer Town (population 100) with sharpened pitchforks and blazing torches.

Insincere apologies to Zed for the bowdlerization.