Wednesday, April 25, 2012

I've Sinned

"Forgive me father, I've sinned against the first law of software: Thou shalt reuse." I've written a JSON codec. Yes, I know there are already lots of them around (probably too many already) but out of the myriad of JSON converters I could not find one that had the qualities I wanted.

I needed JSON because a few months ago I wrote a blog about what are called Data objects (the C struct). I've come to the firm conclusion that objects and classes are nice, but not between processes. Objects between processes just do not work.

The problem is intrinsic to the object oriented paradigm. Classes provide modularity (hiding of private data) and this is very beneficial inside a single program. However, once objects are exchanged between processes there is an implicit requirement that the classes in both processes are compatible. If they are identical there is no issue, because any hiding is properly maintained. However, any difference between the classes requires an agreement between the processes about the private details. In a distributed system this is very hard to guarantee at all times, since systems must be updated and you can rarely bring down all instances in a cluster at once. Ergo, the modularity of the class is no longer effective: private data can no longer be changed without affecting others, which implies the loss of all the modularity advantages. So we have to get over it and live with public data on the wire.

Dynamically typed languages have a significant advantage in a distributed world since extra fields do no harm and missing fields are easy to detect. While type safety provides significant advantages inside a program, it seems to get in the way when we communicate. Though I like Javascript (I am a Smalltalker by origin), I've come to like Eclipse and the extensive support it can provide thanks to type information. Can we combine the best of those worlds?

Clearly, JSON is the current standard in a distributed world. JSON is mostly simple, it only provides numbers, strings, arrays, and maps. Any semantics of this data are up to the receiver. This is very different from XML or ASN.1, where detailed typing is available. The advantage of this simplicity is that it is easy to get something working. The disadvantage is of course that it is also very easy to break things.

The Java JSON libraries I looked at all had a design with an impedance mismatch between Java and the JSON data. As I said, I like Eclipse's powerful support and want to use Java types; I do not want to use maps or special JSON lists in Java, they are awkward to use. Using foo.get("field") combines the worst of all worlds. What I needed was a converter that can take a Data object (an object with only public fields) and turn it into a JSON stream, and take a JSON stream and turn it into a Data object. Fields in this object must be able to handle other Data objects, primitives, most value objects, objects that can be constructed from their toString() representation, enums, collections, maps, and some special cases like byte[], Pattern, and Date. And all this in a type safe way?

Yes we can!

It turns out that the Java type system can be a tremendous help in generating the JSON (this is really almost trivial) but it is also extremely useful in parsing. The object's fields can provide extensive type information through reflection, and this can be used to convert one of the four basic JSON types (strings, numbers, lists, and maps) to the field type. Since the class information is available there is no need for dynamic class loading, which is evil in an OSGi world. It also works surprisingly well with generics. Though every Java developer knows that generic types are erased, all generic type information about fields, methods, interfaces, and classes is available reflectively. Erasure only applies to instances: from an instance you cannot find out its generic parameters. For example, take a Data object like:
public class Data {
  public List<Person> persons;
}
The persons field provides the full generic information that this is a List with instances of the Person class. When the JSON decoder encounters the persons field with a JSON list, it can find out that the members of that list should be Person instances. For each member of the JSON list it will create an instance and parse the member. Obviously this all works recursively. For example:

public enum Sex { MALE, FEMALE; }
public class Person {
  public String name;
  public Sex    sex;
  public List<Person> children = new ArrayList<Person>();
}

JSONCodec codec = new JSONCodec();
Person user = getUser();
String s = codec.enc().put( user ).toString();

// {"name":"Peter","children":[
//     {"name":"Mischa","sex":"FEMALE"},
//     {"name":"Thomas","sex":"MALE"}
//   ],
//   "sex":"MALE"
// }

Person copy = codec.dec().from(s).get( Person.class );
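The reflection trick that makes this decoding possible can be shown in isolation. Here is a minimal standalone sketch (the class names GenericFieldDemo and Holder are mine, not part of the codec) demonstrating that a field declaration keeps its generic parameters even though instances do not:

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.util.List;

public class GenericFieldDemo {
    // Minimal stand-ins for the Person/Data classes above
    public static class Person { public String name; }
    public static class Holder { public List<Person> persons; }

    public static void main(String[] args) throws Exception {
        Field field = Holder.class.getField("persons");
        // Erasure removes type parameters from instances, not from
        // declarations: the field's generic type still carries Person.
        ParameterizedType listType = (ParameterizedType) field.getGenericType();
        Class<?> member = (Class<?>) listType.getActualTypeArguments()[0];
        System.out.println(member.getSimpleName()); // prints "Person"
    }
}
```

A decoder only needs this lookup once per field; from there it knows what class to instantiate for every element of the JSON list.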

So what did I learn? First, that the primitives and Java numbers absolutely totally suck, what a mess. It feels like someone with a lot of knowledge set out to make fluid use of different number types as complex as possible. The compiler can hardly do any type coercion, and with reflection you had better get the types exactly right; the only way to do that is to have special cases for each primitive and wrapper type. #fail
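A sketch of why the special cases are unavoidable (the coerce helper below is hypothetical, not the codec's actual code): a JSON parser typically hands you a Long or a Double, but Field.set only unwraps and widens, it never narrows, so every primitive and wrapper target needs its own branch.

```java
import java.lang.reflect.Field;

public class NumberCoercion {
    public static class Sample { public int count; }

    // Hypothetical helper: convert a parsed Number to the exact type
    // the target field expects. There is no generic way to do this;
    // each primitive/wrapper pair must be handled explicitly.
    static Object coerce(Number n, Class<?> target) {
        if (target == int.class    || target == Integer.class) return n.intValue();
        if (target == long.class   || target == Long.class)    return n.longValue();
        if (target == double.class || target == Double.class)  return n.doubleValue();
        if (target == float.class  || target == Float.class)   return n.floatValue();
        if (target == short.class  || target == Short.class)   return n.shortValue();
        if (target == byte.class   || target == Byte.class)    return n.byteValue();
        throw new IllegalArgumentException("Cannot coerce to " + target);
    }

    public static void main(String[] args) throws Exception {
        Sample s = new Sample();
        Field count = Sample.class.getField("count");
        // count.set(s, Long.valueOf(7)) would throw IllegalArgumentException:
        // reflection refuses the narrowing long -> int conversion.
        count.set(s, coerce(Long.valueOf(7), count.getType()));
        System.out.println(s.count); // prints 7
    }
}
```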

The other thing I learned was that it makes a lot of sense to stream JSON records. Initially I had the envelope model in mind: if I had multiple persons, I would create a List of Person. However, in this model you need to parse the whole list before you can process the first element. It turns out to work much better to sequence the Person objects. One of the nice things about JSON is that the parser is very simple and does not have to look beyond what it needs. Sequencing records also allows earlier records to help in parsing later records. For example, the first record could contain the number of records and maybe a protocol version. It also works very well for digests, signing, and carrying the class name needed to parse the next record.
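A minimal sketch of the sequencing idea, assuming a newline-delimited framing that is my invention for this example, not the codec's actual wire format. Each record is a complete JSON document on its own, so the receiver can act on it (here, collect it) before the rest of the stream has arrived, and the header record can describe what follows:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class RecordStream {
    // One JSON record per line; the first record is a header. A real
    // codec would decode each line into a Data object, but the point
    // here is only that records are processed one at a time.
    public static List<String> readRecords(String stream) throws Exception {
        BufferedReader in = new BufferedReader(new StringReader(stream));
        List<String> processed = new ArrayList<String>();
        String header = in.readLine();           // e.g. {"version":1,"count":2}
        processed.add("header:" + header);
        String record;
        while ((record = in.readLine()) != null) // handle each record as it arrives
            processed.add("record:" + record);
        return processed;
    }

    public static void main(String[] args) throws Exception {
        String wire = "{\"version\":1,\"count\":2}\n"
                    + "{\"name\":\"Mischa\"}\n"
                    + "{\"name\":\"Thomas\"}\n";
        for (String s : readRecords(wire))
            System.out.println(s);
    }
}
```

With the envelope model the equivalent input would be one large JSON list, and nothing could be handed to the application until the closing bracket had been read.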

So yes, I've sinned against the first law of reuse, because I am confident that somebody will point out that there already exists a library out there that does exactly what I described. Well, it wasn't too much work and I actually really like what I've got; this model turns out to work extremely well.

Peter Kriens

3 comments:

  1. Hi Peter,

    Nice perspective on what Uncle Bob refers to as Data/Object anti-symmetry.

    Do you see any application of semantic versioning here?

    Side note; there's a fifth json type; boolean.

    cheers.

    ReplyDelete
  2. Hello Peter,
    sounds very promising. Any chance to play with your JSON codec?
    Kisho

    ReplyDelete
  3. @earcam: yes, (semantic) versioning is part of this. Data objects will require versioning and backward compatibility.

    @kisho: The JSON Codec is in the next branch of bnd on https://github.com/bndtools/bnd in the aQute.lib project. However, it is still in development.

    ReplyDelete