24 August 2009

Excellent Quote: Chomsky on English Spelling

conventional orthography is ... a near optimal system for the lexical representation of English words.
Chomsky, N. & Halle, M. (1968) The Sound Pattern of English

Compare to Mark Twain's "Spelling Reform: A Plan for the Improvement of English Spelling,"

Fainali, xen, aafte sam 20 iers ov orxogrefkl riform, wi wud hev a lojikl, kohirnt speling in ius xrewawt xe Ingliy-spiking werld.

22 August 2009

One of my few Indian Victories: Cheap Tech Books.

One of the most respected algorithm textbooks out there, Cormen et al.'s Introduction to Algorithms, will cost you $62 from Amazon (2nd Ed, paperback). On the other hand, I just bought it brand new here in India for Rs 350. That's about $7 US. Wonderful. I may have to go back for TAOCP 1-3.

12 August 2009

Excellent Quote: How programmers think

I can't tell you how correct this passage is:

Most programmers above a certain level are very good at conceptualising large, complex systems. They can interrogate perceived weaknesses in a program before it is even written. This is how good programmers manage to write programs that are largely free of defects and that generally work in the way that is expected. Small-scale errors aside, a good high-level conceptual understanding of the system, coupled with an ability to mentally simulate the behaviour of any portion of a program provides programmers with most of the tools they need to construct good-quality programs. The programmer is likely to know (at least at some high level) how the entire system is going to work any any given moment.

From Adventures in Programming Languages and Semantics, "If concurrency is easy, you're either absurdly smart or you're kidding yourself." The full article is a good read.

04 August 2009

Review and Analysis of C# Part 1: Partial Definitions are an Anti-pattern

I've had to use C# a lot at work recently. As with everything else, I have very strong opinions about C#. So, I've decided to write an N-part series reviewing, criticizing, and sometimes even praising its design.

So, here's the first installment: Partial Definitions are an Anti-Pattern.

C# introduces the partial keyword. In short, the partial keyword allows a programmer to split the definition of a class across several places. According to MSDN, the rationale for this feature is:
  • When working on large projects, spreading a class over separate files enables multiple programmers to work on it at the same time.

  • When working with automatically generated source, code can be added to the class without having to recreate the source file. Visual Studio uses this approach when it creates Windows Forms, Web service wrapper code, and so on. You can create code that uses these classes without having to modify the file created by Visual Studio.
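
For readers who haven't met the feature, here is a minimal sketch of what it looks like. The Account class, its members, and the idea that the second part comes from a tool are all invented for illustration; in a real project each part would normally sit in its own file.

```csharp
using System;

// Part 1: the hand-written half (imagine it in Account.cs, a hypothetical file name).
public partial class Account
{
    public void Deposit(decimal amount)
    {
        if (amount <= 0) throw new ArgumentOutOfRangeException("amount");
        balance += amount; // 'balance' is declared in the other part
    }
}

// Part 2: the half a tool might emit (imagine it in Account.Generated.cs).
public partial class Account
{
    private decimal balance;

    public decimal Balance
    {
        get { return balance; }
    }
}

public static class PartialDemo
{
    // Uses members from both parts; the compiler merges them into one class.
    public static decimal Run()
    {
        Account account = new Account();
        account.Deposit(25m);
        return account.Balance;
    }
}
```

Notice that nothing in the first part tells you where balance is declared. You have to go looking for the other part.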

The first argument holds no water. Yes, projects sometimes grow large, and often multiple programmers need to work on the same class concurrently. But this problem is better handled by version control software.

To make this argument, let me introduce the notion of a section of code. Let's say that a section is a collection of lines of code such that a significant modification to any one line requires modification to (or at least a critical evaluation of) the other lines within that section. For example, changing the type of a variable requires you to revisit each use of that variable. Likewise, changing an invariant of a class requires you to revisit all lines which assume that invariant. It's a loose definition, but all the programmers out there know what I'm talking about.

Now suppose that two or more programmers are trying to modify some large class. At any given moment, either they work on the same sections of code, or they do not. If they work on different sections (which implies they are working on different lines), then common version control software will be able to automatically combine their efforts. On the other hand, if these two programmers are working on the same section of code, the programmers will need to coordinate their efforts---even if they are able to isolate their changes to disjoint lines of code or separate files. Said another way, the ability to split the class into two separate files does nothing to solve the original problem.

The second argument is no stronger. Suppose a code generator produces a large class definition. Further suppose that, across multiple runs of the code generator, much of the output is common. Why then, I ask, would the code generator repeatedly emit redundant code? Wouldn't it make more sense for the code generator to place the common code into a runtime library, or even a base class from which the situation-specific classes may derive? Being a code generator is no excuse for emitting code which ignores good software engineering practice.
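
Here is a rough sketch of the base-class alternative I have in mind. FormBase, LoginForm, and their members are hypothetical names of my own, not anything Visual Studio actually emits. The common code is written once by hand, and the generator's output shrinks to the situation-specific values.

```csharp
using System;

// Written once by hand and shipped in a shared library (hypothetical class).
public abstract class FormBase
{
    // Common behaviour that would otherwise be re-emitted on every generator run.
    public string Describe()
    {
        return Title + " (" + Width + "x" + Height + ")";
    }

    protected abstract string Title { get; }
    protected abstract int Width { get; }
    protected abstract int Height { get; }
}

// What the generator would emit under this scheme: only the specifics.
public class LoginForm : FormBase
{
    protected override string Title { get { return "Login"; } }
    protected override int Width { get { return 320; } }
    protected override int Height { get { return 200; } }
}
```

Under this assumption, hand-written customisation could go in a further subclass of LoginForm, so regenerating the file never touches it.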

There may be efficiency concerns pushing towards this style of code generation output. It could be that there is a significant difference in run time. I argue that this is a symptom of a problem elsewhere in the software stack. If two different implementations differ only on a cosmetic level, but the compiler exhibits a drastic performance difference, then it is the compiler's responsibility to decrease that performance difference. In a high level language---and especially in the case of a language stack as complicated as the CLR---these kinds of implementation details should be below the programmer's cognitive radar.

I have argued against the claimed benefits of partial classes, and I'll be happy to argue against any others that people suggest. Let me next argue that partial classes not only give no benefit, but that they also do harm.

The primary goal of software engineering is to establish practices that let programmers reason about software in a modular fashion. Classes, for instance, enable us to think about one algorithm at a time, and to ignore all of the other parts of the software. Imagine trying to prove that your implementation of a binary tree is correct if it depends upon the operation of your GUI. Yeah, that's ridiculous, and that's why we write modular programs.

Now suppose that you have written a module so complicated---so multi-faceted---that you believe that some parts of its definition should be expressed separately from the rest. This suggests you believe (1) that your module expresses at least two different aspects of the problem, (2) that a programmer can reason about each independently, and (3) that any attempt to reason about them as a single unit will lead to unnecessary confusion. All of these are reasons to place the second aspect of your module into a separate module, not a separate file. In fact, many design patterns describe interfaces which tackle precisely this sort of fine-grained interaction among modules. The only reason to keep them in one module is programmer laziness. In this case, the partial feature enables the programmer to defer good design; he may get the code working sooner, but may cause headaches for all of the programmers who must maintain the code.
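
As a sketch of what "a separate module, not a separate file" could look like, suppose a class mixes bookkeeping with pricing rules. All names here (Invoice, IPricing, FlatTaxPricing) are made up for the example. Instead of two partial halves of one class, the second aspect becomes its own class behind a small interface.

```csharp
using System;
using System.Collections.Generic;

// Aspect 2, pulled out into its own module behind an interface.
public interface IPricing
{
    decimal Total(IEnumerable<decimal> prices);
}

public class FlatTaxPricing : IPricing
{
    private readonly decimal rate;

    public FlatTaxPricing(decimal rate)
    {
        this.rate = rate;
    }

    // Sums the prices and applies a single flat tax rate.
    public decimal Total(IEnumerable<decimal> prices)
    {
        decimal sum = 0m;
        foreach (decimal price in prices) sum += price;
        return sum * (1m + rate);
    }
}

// Aspect 1: bookkeeping only. It knows nothing about how totals are computed.
public class Invoice
{
    private readonly List<decimal> prices = new List<decimal>();
    private readonly IPricing pricing;

    public Invoice(IPricing pricing)
    {
        this.pricing = pricing;
    }

    public void Add(decimal price)
    {
        prices.Add(price);
    }

    public decimal Total()
    {
        return pricing.Total(prices);
    }
}
```

Each class can now be read, tested, and reasoned about without opening the other, and the only coupling between them is the one method on IPricing.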

Additionally, as a programmer, I find it valuable to see the entire definition of an algorithm (feel free to disagree; let me know why in the comments). I don't necessarily look at the entire definition, but I think it is critical to be able to find any code which contradicts my understanding of the code. In this case, a single definition (or at least a single file) enables me to quickly gather all factors which contribute to the operation of an algorithm. From my experience with C#, partial class definitions make it harder to gather all factors, and make it difficult to be sure I have found all of them.

Virtual methods might cause this problem as well. However, virtual methods are more restricted. First, they are marked virtual. Second, I know immediately where to find refinements of virtual methods: lower in the inheritance tree. Partial definitions have an arbitrary number of components---they can be anywhere provided they are combined at compile time.

Some may counter that these arguments all suggest a feature-rich integrated development environment. Sure. I'm in favor of good tools. But (1) I appreciate languages which don't require IDE support to read, because sometimes you read code on a website, in a book, or in other media, and (2) even the best of the best tools don't yet support this style of code browsing. I've been using Visual Studio 2008 at work, and it can't even list all of the subclasses of a base class for me! Enumerating all parts of a partial definition is equally challenging.

My recommendation: avoid the partial keyword, and instead decompose your software into modules sensibly.