rod mclaughlin

Collections Considered Harmful

Mon, January 22, 2007

No code should contain collections except the classes which encapsulate the collections

Rod McLaughlin, July 2006 (updated)

Why is String a class in Java and the .NET languages? C has no string type. Instead, we used a char array:
#include <stdio.h>
int main(void) {
    char mystring[] = "Hello";   /* six chars: "Hello" plus the terminating '\0' */
    for (int i = 0; mystring[i] != '\0'; i++) putchar(mystring[i]);   /* etc. */
    return 0;
}
Why was String invented? Why not keep using char arrays?

The answer is well known. Arrays are error-prone. Much of the software we use every day is written in C, and when it crashes, array errors are very often to blame. Usually, the program is trying to access the Nth member of an array that has fewer than N members (this includes the accidental use of negative indices, fencepost errors, and errors encouraged by zero-based indexing). So the creators of object-oriented languages decided to encapsulate the most widely used array, the char array, in a class, String.

What is bad about a char array? The fact that it contains chars, or the fact that it is an array? It's not the fact that it contains chars. All arrays are prone to exactly the same errors, the most common of which is attempting to access an element with an out-of-bounds index. Yet Java and the .NET languages still have arrays. You may check the length of an array before indexing it, but nothing forces you to. Given an array A, before accessing the Nth element, you should:

1. Check that A is not null
2. Check that A has at least N elements
3. Check that the Nth element is not null

Therefore, code with arrays is either bug-prone or littered with repetitive checks.
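The three checks can be sketched in Java as follows. This is only an illustration: SafeAccess and nthOrDefault are hypothetical names that do not appear in the original, and N is counted from 1 to match the wording of the list.

```java
// Hypothetical helper: the three checks from the list above, written out once.
final class SafeAccess {
    private SafeAccess() {}

    // Returns the nth element of a (counting from 1), or fallback if any check fails.
    static String nthOrDefault(String[] a, int n, String fallback) {
        if (a == null) {                // 1. the array itself may be null
            return fallback;
        }
        if (n < 1 || a.length < n) {    // 2. the array may have fewer than n elements
            return fallback;
        }
        String element = a[n - 1];
        if (element == null) {          // 3. the nth element may be null
            return fallback;
        }
        return element;
    }
}

class SafeAccessDemo {
    public static void main(String[] args) {
        String[] lines = { "12 Elm St", null };
        System.out.println(SafeAccess.nthOrDefault(lines, 1, "(none)")); // element present
        System.out.println(SafeAccess.nthOrDefault(lines, 2, "(none)")); // element is null
        System.out.println(SafeAccess.nthOrDefault(lines, 3, "(none)")); // too few elements
        System.out.println(SafeAccess.nthOrDefault(null, 1, "(none)"));  // array is null
    }
}
```

Without a helper of this kind, every call site has to repeat the same three tests, which is the clutter described above.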

Java and the .NET languages also have other types of 'Collection', such as ArrayList (an array isn't technically a collection in Java, for historical reasons, but it is in .NET). These collections address some of the issues with arrays, and iterators let you traverse a collection without a numerical index. But neither solves the basic problem of collections: they expose too many of their internal details.
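A small Java sketch of the point that collections and iterators only move the problem. The class and method names are made up for illustration; the exceptions are the ones the standard java.util classes throw.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class CollectionFailuresDemo {
    // Reports what happens when a list is indexed one past its last element.
    static String indexPastEnd(List<String> list) {
        try {
            list.get(list.size());      // one past the last valid index
            return "no exception";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    // Reports what happens when an iterator is advanced after it is exhausted.
    static String iteratePastEnd(List<String> list) {
        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            it.next();
        }
        try {
            it.next();                  // nothing left to return
            return "no exception";
        } catch (NoSuchElementException e) {
            return "NoSuchElementException";
        }
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("12 Elm St");
        System.out.println(indexPastEnd(lines));
        System.out.println(iteratePastEnd(lines));
    }
}
```

In both cases the caller still has to know how many elements the collection holds, so the bookkeeping has not gone away.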

The solution is simple, and we already know the answer. String is a class precisely because the collection it encapsulates, char array, is harmful. The same principle applies to all collections. No code should contain collections except the classes which encapsulate the collections. This is a real example I came across:

String line = Customers[ 3 ].Addresses[ 2 ].AddressLine[ 1 ];

I can see eight possible exceptions this expression could generate. The Customer class contains an array of Addresses, and the Address class contains an array of strings, AddressLine. Customers, Addresses and AddressLine should be classes, not collections. For example, this is how to encapsulate the AddressLine array. US postal addresses have one or two address lines before City, State and Zipcode. AddressLine[ N ] cannot guarantee to return either the first or the second address line, because N could be less than zero or more than one. So AddressLine's accessor methods should instead be AddressLine.FirstLine() and AddressLine.SecondLine(). A class written this way would never cause an index-out-of-bounds error. Its behaviour could also be improved without breaking code which depends on it. For example, you could change the answers to questions like: should SecondLine() return null, return an empty string, or throw an exception if there is no second line? And what should happen if you attempt to add a second line to an AddressLine with no first line? The class would also be easy to extend for countries whose post offices are less bureaucratic.
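Here is a minimal Java sketch of such a class, under some assumptions of mine: the class is called AddressLines, the accessors follow Java naming (firstLine() and secondLine()), a first line is required, and a missing second line is reported as an empty string. None of these choices come from the original code; they are just one set of answers to the questions above.

```java
// Hypothetical encapsulation of the AddressLine array; names and policy are assumptions.
final class AddressLines {
    // The array never leaves this class, so no caller can index it.
    private final String[] lines = new String[2];

    AddressLines(String first, String second) {
        if (first == null || first.trim().isEmpty()) {
            throw new IllegalArgumentException("an address needs a first line");
        }
        lines[0] = first.trim();
        lines[1] = (second == null) ? "" : second.trim();
    }

    String firstLine() {
        return lines[0];
    }

    // Policy choice: an absent second line is "", not null and not an exception.
    String secondLine() {
        return lines[1];
    }

    boolean hasSecondLine() {
        return !lines[1].isEmpty();
    }
}

class AddressLinesDemo {
    public static void main(String[] args) {
        AddressLines home = new AddressLines("12 Elm St", "Apt 4");
        AddressLines office = new AddressLines("500 Main St", null);
        System.out.println(home.firstLine() + " / " + home.secondLine());
        System.out.println(office.firstLine() + " / has second line: " + office.hasSecondLine());
    }
}
```

Callers cannot tell whether the class stores an array, two fields, or something else, so the policy for a missing second line can change later in one place.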

Custom collections (where you extend a Collection class) are probably not what you need, because the Collection class dictates which methods you must implement, whether or not they make sense for your data. A good story is not just a 'collection' of words.

Another real-world example is the ADO.NET class DataSet. I extended it into a class which automatically inherits constraints from databases, and needed to extend the myDs.Add( table ) method. Unfortunately, there isn't one. Instead, DataSet has a public collection, Tables, and you must call myDs.Tables.Add( table ). This method cannot be extended, because Tables is a collection. So is everything else in DataSet. Thus, the misuse of collections violates two principles of OOP: DataSet exposes its implementation details, and it is not extensible. The solution is to put a DataSet inside my own class rather than inheriting from DataSet - in other words, to hide the collections. But why didn't DataSet hide them in the first place?
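The DataSet case is specific to ADO.NET, but the shape of the fix can be sketched in Java. Everything below is a hypothetical analogue: Table, TableSet and ConstrainedTableSet are invented names, not ADO.NET classes, and the constraint loading is reduced to a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database table; only the name matters here.
final class Table {
    private final String name;

    Table(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

// Hides its list of tables: callers can only go through add(), count() and contains().
class TableSet {
    private final List<Table> tables = new ArrayList<>();

    void add(Table table) {
        if (table == null) {
            throw new IllegalArgumentException("table must not be null");
        }
        tables.add(table);
    }

    int count() {
        return tables.size();
    }

    boolean contains(String name) {
        for (Table t : tables) {
            if (t.name().equals(name)) {
                return true;
            }
        }
        return false;
    }
}

// Because add() is an ordinary method rather than a method on an exposed
// collection, a subclass can extend it.
class ConstrainedTableSet extends TableSet {
    private int constraintsLoaded = 0;

    @Override
    void add(Table table) {
        super.add(table);
        constraintsLoaded++;    // placeholder for reading constraints from the database
    }

    int constraintsLoaded() {
        return constraintsLoaded;
    }
}

class TableSetDemo {
    public static void main(String[] args) {
        ConstrainedTableSet set = new ConstrainedTableSet();
        set.add(new Table("customers"));
        set.add(new Table("orders"));
        System.out.println(set.count() + " tables, " + set.constraintsLoaded() + " constraint loads");
    }
}
```

The design choice is that the list is private and every mutation goes through one overridable method, which is what a public Tables collection rules out.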

Every collection in your code should be treated in the same way. Each one should be replaced with a class, and its use with an instance of that class. This conclusion follows from the fundamentals of object-oriented programming. Collections are inherently dangerous. They need to be caged in classes, and never let out.

Update February 2007: another reason collections should be encapsulated in specialized classes is to help serialization across networks. For example, using XFire Java web services is quite complicated when trying to send collections, and virtually impossible using Java 1.5 generics. Having your own specialized class containing a collection is easier.
