User:Baxter.brad/Drafts/JSON Document Streaming Proposal

JSON Document Streaming Proposal

First of all, "Document Streaming" may not be the best terminology. It is intended to refer to the ability to include multiple JSON objects/structures in one file as a concatenated stream. Suggestions for a better term are welcome.

Secondly, in a comment below his article in http://dilettantes.code4lib.org/blog/2010/09/a-proposal-to-serialize-marc-in-json/, Ross Singer says:

As far as single vs. multiple objects in the collection, I say this work just like marcxml: if the first character is a “[” it’s a collection, if it’s a “{”, it’s a single record. It’s hard to justify newline delimited JSON until there is some standardized way to advertise it.

I agree that requiring a pull-parser is less than ideal (and it wouldn’t be ‘required’, just ‘recommended’), but given a selection of sub-par alternatives (pull-parser vs. non-standard), I feel compelled to go with the thing we can advertise and consistently document.

Now, if ways to provide newline-delimited JSON were to improve, then I’m all for it. Also, this says nothing about any out of band arrangements you might have.

So finally, this proposal is intended to provide an alternative to a JSON pull-parser that can be advertised and consistently documented.

"If the first character is a '[' it's a collection, if it's a '{', it's a single record."

I propose that this still always be true. It also implies that the entire file be valid JSON as a whole. And I propose some additional indications (see below) that would allow each record in a collection to be parsed out one at a time without a pull-parser.

Since newline-delimited JSON is already on the table as an alternative, and a newline-delimited file is not valid JSON as a whole, I feel comfortable proposing another alternative (again see below) in which the entire file is likewise not valid JSON.

Parsing Records in a Collection

If the first character is a '[' it's a collection.

If the first line is "[\n" and the second line is blank, then it's still a collection, but there is a blank line between the records. This blank line can be used "out of band" to retrieve the text of the record, remove the trailing comma, and parse it out as a JSON object.

These blank lines would not invalidate the file as containing valid JSON as a whole, so it could still be parsed out whole or with a pull-parser.

It would mean that the text of a record must not contain a blank line.

With JSON emitters, this is normally true anyway, so I don't think this restriction would be onerous.
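To illustrate how the blank lines can be used out of band, here is a minimal Python sketch. It is not the each_record() implementation from the Perl module mentioned later; the function name iter_collection_delimited is hypothetical, and it assumes the whole text is already in memory and that no record contains a blank line.

```python
import json

def iter_collection_delimited(text):
    """Yield each record of a blank-line-separated collection, one at a time."""
    if not text.startswith("[\n\n"):
        raise ValueError('expected the text to begin with "[\\n\\n"')
    # Drop the opening "[" line; the remaining chunks are separated by blank lines.
    body = text[len("[\n"):]
    for chunk in body.split("\n\n"):
        chunk = chunk.strip()
        if chunk in ("", "]"):
            # Skip empty chunks and the closing bracket of the collection.
            continue
        if chunk.endswith(","):
            # Remove the comma that separates this member from the next one.
            chunk = chunk[:-1]
        yield json.loads(chunk)
```

Because the blank lines are only whitespace between array members, the same text can still be handed to an ordinary JSON parser as a whole.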

Parsing Records in a Concatenation

By "concatenation", I mean that the file contains multiple valid JSON records, but that the file as a whole is not valid JSON, i.e., it's not a collection.

If the first character is not '[' or '{' then the first line (with a newline tacked on to the front) is defined to be the record separator.

If the first line is blank, then the record separator is a blank line (i.e., "\n\n"), which may be used to retrieve the text of the record and parse it out as a JSON object (there would normally be no trailing comma as in the case above with a collection, but it could be checked for anyway).

If the first line isn't blank, then whatever it is would be the record separator. For example, the first line might be "---\n". In this case, there would be a "\n---\n" separating the text of each JSON record.

But the first line may be anything you want, with the following constraints:

the line doesn't begin with '[' or '{'
the text of a JSON record does not contain the separator (the characters of the first line, including the newline) at the beginning of a line
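The separator rule for a concatenation can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl module's code: the name iter_delimited is hypothetical, the text is assumed to be in memory, and line endings are assumed to be "\n".

```python
import json

def iter_delimited(text):
    """Yield each record of a concatenation whose first line defines the separator."""
    if text[:1] in ("[", "{"):
        raise ValueError("a concatenation must not begin with '[' or '{'")
    first_line, newline, rest = text.partition("\n")
    if not newline:
        raise ValueError("expected a first line ending in a newline")
    # Prepend a newline to the first line: "---" gives "\n---\n", blank gives "\n\n".
    separator = "\n" + first_line + "\n"
    for chunk in rest.split(separator):
        chunk = chunk.strip()
        if not chunk:
            # Nothing follows the final separator.
            continue
        if chunk.endswith(","):
            # A trailing comma is not expected here, but check for it anyway.
            chunk = chunk[:-1]
        yield json.loads(chunk)
```

Stripping each chunk also lets the sketch accept a file whose last record is followed by a plain newline instead of a final separator.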

Newline-delimited JSON

I'm not a fan of this idea, but it has legs, so the each_record() proof of concept supports it by allowing the user to pass the file type after the file name.

Summary of Rules

if a file begins with "{", it is a single JSON record
if it begins with '[', it is a collection
if it begins with "[\n\n", then there is a blank line between records (but remember to remove the trailing comma from each record's text before parsing it)
if it begins with a blank line, it is a concatenation of JSON records with a blank line between them
if it begins with any character but '[' or '{', it is a concatenation of JSON records and the first line of the file is defined to be the record separator (this is the general form of the previous bullet)
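The rules above amount to a short decision procedure on the first few characters of the file. A minimal Python sketch follows; the function name detect_file_type is hypothetical, and the returned strings are the type names used in the File Types section of this page.

```python
def detect_file_type(text):
    """Classify a document by its leading characters, following the summary rules."""
    if text == "":
        raise ValueError("cannot classify an empty document")
    if text.startswith("{"):
        return "object"
    # Test the more specific "[\n\n" prefix before the plain "[" prefix.
    if text.startswith("[\n\n"):
        return "collection-delimited"
    if text.startswith("["):
        return "collection"
    # Any other first character: the first line defines the record separator.
    return "delimited"
```

Note that a newline-delimited file whose lines are JSON objects also begins with "{", so this check alone would report it as "object".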

File Types

For the sake of discussion, the names I'm using for the five file types are:

object
collection
collection-delimited
delimited
ndj (for newline-delimited JSON)

I'm not wedded to these terms, but I think it would be good to decide on explicit terms to describe the file types.

See User:Baxter.brad/Drafts/JSON_File_Type_Examples for brief examples of each type.

Object

A file is of type "object" if it begins with "{". It is expected to contain a single JSON record.

Collection

A file is of type "collection" if it begins with "[". A "collection" file may also be a "collection-delimited" file. Either type is expected to be valid JSON as a whole.

Collection-Delimited

A file is of type "collection-delimited" if it begins with "[\n\n". That is, the only character on the first line is "[" and the second line is blank.

It is expected that there be a blank line between the members of the collection, that is, the separator between members is "\n\n".
It is also expected that there be a blank line after the last member (before the line containing the terminating "]" character).
If an out-of-band parser is reading these members by stopping at each blank line, it will probably have to remove any trailing comma before passing the text to a JSON parser.
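For the writing side, a file with this layout could be produced along the following lines. This is only a Python sketch: the name dump_collection_delimited is hypothetical, and it relies on json.dumps() emitting each member on a single line by default, so no member contains a blank line.

```python
import json

def dump_collection_delimited(records):
    """Serialize records as a collection with a blank line around every member."""
    members = [json.dumps(record) for record in records]
    # "[" alone on the first line, then a blank line; members joined by a comma
    # plus a blank line; a final blank line before the closing "]".
    return "[\n\n" + ",\n\n".join(members) + "\n\n]\n"
```

The result is still valid JSON as a whole, since the added newlines are insignificant whitespace between array members.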

Delimited

A file is of type "delimited" if it does not begin with either a "[" or a "{".

It is expected to contain multiple complete JSON objects concatenated together.
It is expected that the first line of the file will define the separator between these objects.
- If the first line is blank, then the separator is a blank line ("\n\n").
- If the first line isn't blank, the characters on the first line define the separator (with a newline prepended).
- For example, if the first line is "---\n", then the separator is "\n---\n".
It is expected that the separator appear after the last object. (But a smart parser should handle the case where it is missing from the end.)

NDJ

A file is of type "ndj" (newline-delimited JSON) if each line in the file is a JSON object.
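Reading this type needs no separator logic at all, as the following Python sketch suggests. The name iter_ndj is hypothetical, and skipping empty lines is an assumption of the sketch, not part of the proposal.

```python
import json

def iter_ndj(lines):
    """Yield one record per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            # Each remaining line is expected to be one complete JSON object.
            yield json.loads(line)
```

The argument can be any iterable of lines, for example an open file handle or the result of str.splitlines().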

Implementations

As a proof-of-concept for this idea, the following Perl module contains an each_record() routine that uses the above rules to decide the type of the file being read.

http://search.cpan.org/~bbaxter/MARC-Utils-MARC2MARC_in_JSON-0.05/

The synopsis shows this example:

   my $get_next = each_record( "marc.json" );
   while( my $record = $get_next->() ) {
       print get_title( $record );  # you write get_title()
   }

The concept being proved is that the user need not know which file type "marc.json" is. The above code remains the same in each case.

The exception to this is type 'ndj', newline-delimited JSON. The file contents can't tell the program that a file is of this type (an ndj file begins with "{", just like an "object" file), so it is up to the user to say so. The each_record() routine therefore accepts a second parameter giving the file type, e.g.

   my $get_next = each_record( "marc.json", "ndj" );
   while( my $record = $get_next->() ) {
       print get_title( $record );  # you write get_title()
   }