Monday, August 13, 2012

Secondary School: XML Parsed with OCaml

In my previous post, I described how C++ could transfer XML into some predefined types (basic structs).  Changing those types to make the XML reading easier was not an option (in effect that would be cheating) since I assumed that working with pre-existing types is part of the problem.  I will do the same thing with my OCaml version of the problem.  The types will not be laid out in a way that necessarily makes the XML reading any simpler.


Meet the Types


The OCaml types related to reading the school schema (tastyPS.xml is an example input) are as follows:

type person = { name : string }
type teacher = { super : person; subject : string }
type student = person
type clazz =
  { hour : int
  ; teacher : teacher
  ; students : student list }
type school = { classes : clazz list }
type result = { school : school }
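
To make the target concrete, here is the sample school used later in this post written out by hand as a value of these types. This is only a sketch of what a finished read should correspond to; the name expected is mine, and note that the day attribute of a class has no home in the predefined types, so it is simply dropped.

```ocaml
type person = { name : string }
type teacher = { super : person; subject : string }
type student = person
type clazz =
  { hour : int
  ; teacher : teacher
  ; students : student list }
type school = { classes : clazz list }
type result = { school : school }

(* hypothetical name: the sample input, built by hand *)
let expected : result =
  { school =
      { classes =
          [ { hour = 10
            ; teacher = { super = { name = "Mr Gauss" }
                        ; subject = "math" }
            ; students = [ { name = "Jake" }; { name = "Mark" } ] }
          ; { hour = 11
            ; teacher = { super = { name = "Mr Shakespeare" }
                        ; subject = "english" }
            ; students = [ { name = "Christine" }; { name = "Thom" } ] } ] } }
```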

Faking Expat


I did not look for a library which can read XML and call handlers for nodes (and their attributes); there may indeed be such a library.  Instead I faked the behaviour of Expat, as used in the C++ version, by writing one long function which calls the node handlers in a similar way:


(** test program simulates the read of TastyPS.xml *)

let readxml' nd_start_fun nd_end_fun () =
  let res1 = nd_start_fun "school" [] in
  let res1_1 =
    let res2 = nd_start_fun "class"
        ["day","Monday"; "hour","10"] in
    let res2_1 =
      let res3 = nd_start_fun "teacher"
        ["name","Mr Gauss"; "subject","math"]
      in  nd_end_fun res3 [] in
    let res2_2 =
      let res3 = nd_start_fun "student" ["name","Jake"]
      in  nd_end_fun res3 [] in
    let res2_3 =
      let res3 = nd_start_fun "student" ["name","Mark"]
      in  nd_end_fun res3 []
    in  nd_end_fun res2 [res2_1;res2_2;res2_3] in
  let res1_2 =
    let res2 = nd_start_fun "class"
        ["day","Tuesday";"hour","11"] in
    let res2_1 =
      let res3 = nd_start_fun "teacher"
        ["name","Mr Shakespeare";"subject","english"]
      in  nd_end_fun res3 [] in
    let res2_2 =
      let res3 = nd_start_fun "student" ["name","Christine"]
      in  nd_end_fun res3 [] in
    let res2_3 =
      let res3 = nd_start_fun "student" ["name","Thom"]
      in nd_end_fun res3 []
    in  nd_end_fun res2 [res2_1;res2_2;res2_3]
  in nd_end_fun res1 [res1_1;res1_2]



This code is not meant to be elegant.  It is one large expression which makes the same handler calls, in the same order, as an Expat-like library for OCaml might: the start handler for a node, then all the calls for its children, then the end handler with the children's results.
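
If the input were held as a tree instead of being written out by hand, the same call order could come from a short recursive driver. The sketch below assumes a hypothetical xml tree type and a hypothetical drive function (neither is part of the program in this post), and uses tracing handlers only to make the order of calls visible.

```ocaml
(* hypothetical in-memory XML tree: tag, attributes, children *)
type xml = Node of string * (string * string) list * xml list

(* call the start handler, then process the children left to right,
   then hand their results to the end handler *)
let rec drive nd_start_fun nd_end_fun (Node (tag, attr, kids)) =
  let res = nd_start_fun tag attr in
  let rec go = function
    | [] -> []
    | kid :: rest ->
      let r = drive nd_start_fun nd_end_fun kid in
      r :: go rest
  in
  nd_end_fun res (go kids)

(* tracing handlers: record each call, newest first *)
let log = ref []
let trace_start tag _ = log := ("start " ^ tag) :: !log; tag
let trace_end tag _ = log := ("end " ^ tag) :: !log; ()

let () =
  drive trace_start trace_end
    (Node ("class", ["hour","10"],
       [ Node ("teacher", ["name","Mr Gauss"], [])
       ; Node ("student", ["name","Jake"], []) ]))
```

After this runs, List.rev !log lists the calls oldest first: each child is started and ended before the enclosing class is ended.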


Some Helper Code



An exception type is useful for when things go wrong:


exception Reader of string

A partial XML type is used to hold the results of partially converting the XML to the pre-defined types above:


type partial_xml = (** type for recursing partial content *)
   | PXStudent of student
   | PXTeacher of teacher
   | PXClass of clazz
   | PXSchool of school

An attribute finding function is useful for traversing the lists of pairs that are thrown at our handler:


let get_attr attr aname =
  try List.assoc aname attr
  with Not_found ->
    raise (Reader ("attribute "^aname^" not found"))
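
As a quick illustration of both outcomes, here is a small sketch using the attribute list of the first class node. The definitions are repeated so that it runs on its own; the names attrs, hour and missing, and the room attribute, are made up for the example.

```ocaml
exception Reader of string

let get_attr attr aname =
  try List.assoc aname attr
  with Not_found ->
    raise (Reader ("attribute "^aname^" not found"))

(* attributes as the start handler would receive them *)
let attrs = ["day","Monday"; "hour","10"]

(* found: the value comes back as a string and is converted by the caller *)
let hour = int_of_string (get_attr attrs "hour")

(* not found: Reader carries a message naming the missing attribute *)
let missing =
  try get_attr attrs "room"
  with Reader msg -> msg
```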


Node Handlers


The start node handler is where all the real work happens.  It returns a continuation which takes the converted content (a list of partially converted XML) and builds the appropriate type; the end node handler merely applies that continuation:



let start_fun nname attr =
  (* split class content into (teachers, students), keeping input order *)
  let part_class_content content =
    let pcc px (tlist,slist) =
      match px with
        (PXStudent s) -> (tlist,s::slist)
        | (PXTeacher t) -> (t::tlist,slist)
        | _ -> raise
           (Reader "classes only have teachers & students")
    in  List.fold_right pcc content ([],[]) in
  (* school content must consist of classes only *)
  let part_school_content content =
    let psc px =
      match px with
        (PXClass cls) -> cls
        | _ -> raise
           (Reader "school can only have classes in it")
    in  List.map psc content in
  let get_attr = get_attr attr
  in match nname with
    "student" ->
      (fun _ ->
        PXStudent { name = get_attr "name" })
    | "teacher" ->
      (fun _ ->
        PXTeacher { super = { name = get_attr "name" }
                  ; subject = get_attr "subject" })
    | "class" ->
      (fun content ->
         match part_class_content content with
           ([teacher],students) ->
               PXClass { hour =
                         (int_of_string (get_attr "hour"))
                       ; teacher = teacher
                       ; students = students }
           | _ -> raise (Reader
              "each class needs exactly one teacher"))
    | "school" ->
      (fun content ->
        let classes = part_school_content content
        in  PXSchool { classes = classes })
    | _ -> raise (Reader "unrecognized node tag")
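
Stripped of the school-specific details, the pattern is this: the start handler captures the tag and attributes in a closure, and the end handler later feeds that closure the already-converted children. A minimal sketch, using a hypothetical summary type in place of partial_xml (the type and the leaf and node values are mine, for illustration only):

```ocaml
(* hypothetical result type, for illustration only *)
type summary = { tag : string; attrs : int; children : summary list }

(* start handler: nothing is built yet; tag and attributes are
   remembered inside the returned function *)
let start_fun tag attr =
  fun children -> { tag; attrs = List.length attr; children }

(* end handler: supply the converted children and finish the node *)
let end_fun k children = k children

(* a student node has no children; a class node receives its children *)
let leaf = end_fun (start_fun "student" ["name","Jake"]) []
let node = end_fun (start_fun "class" ["day","Monday"; "hour","10"]) [leaf]
```

Because the record is only built when the continuation is applied, a node never needs mutable state to collect its children.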


The end node handler and the wrapper code are much simpler:


let end_fun sfunres content = sfunres content

let readxml () =
  let px = readxml' start_fun end_fun ()
  in  match px with
    (PXSchool school) -> { school = school }
    | _ -> raise (Reader "top tag is not a school")

let _ = readxml () (* run it *)


I could have made an association list to be used with List.assoc (a lookup function from the standard library which matches a key against a list of key/value pairs and returns the associated value) to deliver the proper continuation (e.g. the one built on part_class_content for the class tag). This might have been a nicer parallel to the C++ code.  I really like having a code structure with separate handlers for each node.  I think it makes the code easier to understand, and gives a fresh set of eyes (someone who might want to add a new node type) a nice pattern to follow.

Building and running this code (in schoolxml.ml) is a simple matter as well:


$ ocamlc -g schoolxml.ml -o s.x
$ ./s.x
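
For comparison, the association-list alternative mentioned above might look roughly like the following sketch. It is cut down to the two leaf tags; the handler names and the handlers table are hypothetical, and List.assoc_opt assumes OCaml 4.05 or later.

```ocaml
type person = { name : string }
type teacher = { super : person; subject : string }
type student = person
type partial_xml =
   | PXStudent of student
   | PXTeacher of teacher

exception Reader of string

let get_attr attr aname =
  try List.assoc aname attr
  with Not_found ->
    raise (Reader ("attribute "^aname^" not found"))

(* one handler per tag: attributes -> content -> partial result *)
let student_handler attr _ =
  PXStudent { name = get_attr attr "name" }

let teacher_handler attr _ =
  PXTeacher { super = { name = get_attr attr "name" }
            ; subject = get_attr attr "subject" }

(* hypothetical lookup table from tag name to handler *)
let handlers =
  [ "student", student_handler
  ; "teacher", teacher_handler ]

(* the start handler only picks the continuation for the tag *)
let start_fun nname attr =
  match List.assoc_opt nname handlers with
    Some handler -> handler attr
    | None -> raise (Reader "unrecognized node tag")
```

Adding a node type then means writing one handler and adding one pair to the table, at the cost of giving every handler the same signature.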


Summary


Rewriting this code in OCaml (after writing the C++ version) is informative.  The OCaml code seems to handle more corner cases and raises exceptions properly (I omitted the throw statements in the C++, but left comments for where they should be added).  One very interesting rule is that a class should have exactly one teacher, no fewer and no more.  The C++ code will likely allow the teacher value to be NULL (an error by omission) but will not allow there to be more than one teacher.  The last observation is that not only is the syntax of OCaml much simpler, but the overall parsimony of the comparable code is much more satisfying.  I guess that is why I like functional programming so much.
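
The one-teacher rule is a good example of that parsimony: a single list pattern covers both the missing teacher and the surplus teacher. A minimal sketch, with teachers reduced to plain strings and with the hypothetical helper names the_teacher and describe:

```ocaml
exception Reader of string

(* the pattern [t] matches a list of exactly one element, so the
   empty list and longer lists both fall through to the error *)
let the_teacher teachers =
  match teachers with
    [t] -> t
    | _ -> raise (Reader "each class needs exactly one teacher")

(* turn the outcome into a string for display *)
let describe teachers =
  try "ok: " ^ the_teacher teachers
  with Reader msg -> "error: " ^ msg
```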
