Thursday, August 9, 2012

Back to School: XML Parsed with C++

Reading an XML file into a hierarchy of data structures is a pretty common problem.  If those data structures have been defined already and do not follow the recursive structure of the XML, some validation is needed to ensure the XML input conforms to the data structure declarations.

School Data Example


Consider the following data structures (from Target.h):


#ifndef TARGET_H
#define TARGET_H

#include <vector>

// forward declarations
struct Person;
struct Student;
struct Teacher;
struct Class;
struct School;
struct Result;

struct Person {
 Person() { name[0] = '\0'; }
 char name[256];
};

struct Student : public Person {
 typedef Class BelongsTo;
 Student() {}
};

struct Teacher : public Person {
 typedef Class BelongsTo;
 Teacher() {}
 char subject[256];
};

struct Class {
 typedef School BelongsTo;
 Class() : teacher(0) {}

 char day[64];
 int hour;
 Teacher * teacher;
 std::vector<student> students;
};

struct School {
 typedef Result BelongsTo;
 School() {}

 std::vector<class> classes;
};

struct Result {
 typedef void BelongsTo;
 Result() : school(0) {}

 School * school;
};

#endif // TARGET_H

And an example of XML input (fom TastyPS.xml):


<school>
 <student name="foo" />
 <class day="Monday" hour="10">
  <teacher name="Mr Gauss" subject="math" />
  <student name="Jake" />
  <student name="Mark" />
 </class>
 <class day="Tuesday" hour="11">
  <teacher name="Mr Shakespeare" subject="english" />
  <student name="Christine" />
  <student name="Thom" />
 </class>
</school>

The goal is to have a function which reads the content of a file and returns a Result * which can be used by other parts of the code.  If the structure of the the XML does not match the data structures defined above we want to have the reading code to raise a C++ exception.

Top-level interface


The interface for our main program to use for reading the XML file is very simple (in readschoolxml.h):


#ifndef XML_READ_H
#define XML_READ_H

struct Result;

namespace xml {

Result *
readfile(const char * filename);

} // namespace xml

#endif // XML_READ_H


The minimalism of just having a C-like functional interface is something worth appreciating now. The symbols exposed to the following main program are very limited (no need to over expose the implementation to the user code main.cpp):


#include "Target.h"
#include "readschoolxml.h"

int
main()
{
 Result * __attribute__((unused)) res = 
  xml::readfile("TastyPS.xml");
 return 0;
}



Parsing with Expat


Expat provide a C library interface for reading XML and handling every node with its attributes using callbacks (one for the start of the node and one for the end).  Interfacing Expat with a push-down state::Reader state can be done without referencing any detail of the data structures we are targetting:


#include "readschoolxml.h"
#include "Target.h"
#include "Attribute.h"
#include <expat.h>
#include <string.h>
#include <map>
#include <fstream>

namespace xml {
namespace state {
// implementation stuff linked from elsewhere

#define XML_READER_MAX_LINE 2048

struct Reader; // opaque here

extern Reader * initial();
extern Result * get_initial_obj_and_delete(Reader *);
extern void push(Reader **parent, const char * el,
    const char ** attr);
extern void pop(Reader **child);

} // namespace state


The four functions initial, get_initial_obj_delete, push and pop in the xml::state namespace will implement the XML reading state which the following Expat callbacks will use:


void
startHandler(void *data, const char *el, const char **attr)
{
 state::Reader ** reader = 
  reinterpret_cast<state::reader>(data);
 state::push(reader,el,attr);
}

void 
endHandler(void * data, const char * )
{
 state::Reader ** reader = 
  reinterpret_cast<state::reader>(data);
 state::pop(reader);
}

// data and comments are not interesting for us here
void dataHandler(void *, const char *, int) {}
void commentHandler(void *, const char *) {}



Finally, the readfile() function can be implemented with I/O calls and Expat parsing:

Result *
readfile(const char * filename)
{
 std::ifstream in(filename);

 if ( !in ) {
  // throw file not opened
 }

 XML_Parser p = XML_ParserCreate(NULL);
 if ( !p ) {
  // throw parser creation failed 
  // (odd problem)
 }
 state::Reader * reader = state::initial();
 XML_SetUserData(p, &reader);
 XML_SetElementHandler(p, startHandler, endHandler);
 XML_SetCharacterDataHandler(p, dataHandler);
 XML_SetCommentHandler(p, commentHandler);

 int done,len;
 char buff[XML_READER_MAX_LINE +1];

 while ( in.getline(buff, XML_READER_MAX_LINE) ) {
  len = strlen(buff);
  done = in.eof();
  if ( XML_Parse(p, buff, len, done) 
    == XML_STATUS_ERROR ) {
   // throw syntax error 
   // parsing a line of XML
  }
 }
 in.close();
 XML_ParserFree(p);
 Result * res = get_initial_obj_and_delete(reader);
 return res;
}

} // namespace xml


All of this code for interfacing with the Expat library is fairly generic. It just sets up a stack-based interface with an opaque type called state::Reader.

Dealing with Attributes



A class called Attribute is used to hold the attribute values and provide type conversions if necessary (since XML attributes are basically strings) in Attribute.h:


#ifndef XML_ATTRIBUTE_H
#define XML_ATTRIBUTE_H

#include <string>
#include <map>
#include <cstdlib>

namespace xml {

class Attribute {
public:
 Attribute() : _v(NULL) {}
 Attribute(const char *v) : _v(v) {}
 Attribute(const Attribute & a) : _v(a._v) {}

 Attribute & operator=(const Attribute & a)
 {
  _v = a._v;
  return *this;
 }
 int intVal() const { return atoi(_v); }
 const char * c_str() const { return _v; }
 bool boolVal() const
 {
  static const std::string t = "true";
  return t == _v;
 }
private:
 const char * _v;
};

Another class called AttributeMap is used to hold a map from constant strings to Attribute values:


class AttributeMap : public std::map<const char *, Attribute> {
public:
 Attribute & getOrThrow(const char * k)
 {
  std::map<const char *, Attribute>::iterator 
   itr = find(k);
  if(itr == end()) {
   // exception thrown here
  }
  return (*itr).second;
 }
 Attribute getOrDefault(const char * k, 
  const char * str_default)
 {
  // ...
 }
};

} // namespace xml

#endif // XML_ATRRIBUTE_H

Push-down State Details



In schoolstate.cpp the XML reader state is described in more detail. The add<>() template function describes how objects are added to other objects (partial template specialization is used here to help cover the variety of ways that one object is added to another):


#include "Target.h"
#include "Attribute.h"
#include <string.h>

namespace xml {
namespace state {

struct Reader {
 Reader(Reader * n) : next(n) {}
 virtual ~Reader() {}

 Reader * next;
};

void 
pop(Reader **r)
{
 Reader * d = *r;
 *r = d->next;
 delete d;
}

template< typename P
 , typename C
 >
inline void add(P * parent, C * child)
{}

template<> inline void 
add<Result,School>(Result * r, School *s)
{
 // throw if r->school != 0
 r->school = s;
}

template<> inline void 
add<School,Class>(School *s,Class * c)
{
 s->classes.push_back(c);
}
template<> inline void 
add<Class,Student>(Class * c,Student * s)
{
 c->students.push_back(s);
}
template<> inline void 
add<Class,Teacher>(Class * c,Teacher * t)
{
 // throw if c->teacher != 0
 c->teacher = t;
}



The ReaderImpl<> template creates the new Target.h object and adds it to the containing object when the Reader is destructed:

template< typename T
 >
struct ReaderImpl : public Reader {
 ReaderImpl(Reader * n) : Reader(n) , obj(new T()) {}
 virtual ~ReaderImpl()
 {
  ReaderImpl<typename T::BelongsTo> * parent =
   dynamic_cast<ReaderImpl< T::BelongsTo> * >(next);
  // throw on !parent (cast failed)
  add<typename T::BelongsTo,T>(parent->obj,obj);
 }

 T * obj;
};

template<> // top-level over-ride
struct ReaderImpl<Result> : public Reader {
 ReaderImpl<Result>(Reader * n) 
            : Reader(n) , obj(new Result()) {}
 virtual ~ReaderImpl<result>() {}

 Result *obj;
};

Reader *
initial()
{
 return new ReaderImpl<Result>(0);
}



Some constant strings are useful for understanding the XML content:

namespace vocab {

// attribute values
static const char * monday = "Monday";
static const char * tuesday = "Tuesday";
static const char * wednesday = "Wednesday";
static const char * thursday = "Thursday";
static const char * friday = "Friday";
static const char * saturday = "Saturday";
static const char * sunday = "Sunday";

static const char * math = "math";
static const char * science = "science";
static const char * english = "english";

// attributes:
static const char * day = "day";
static const char * hour = "hour";
static const char * name = "name";
static const char * subject = "subject";

// nodes:
static const char * school="school";
static const char * clazz="class";
static const char * teacher="teacher";
static const char * student="student";

} // namespace vocab



Filling the AttributeMap using the Expat provided array of attribute/values is fairly simple (and validation of the values happens when they are from a finite set):

void
fillAttributeMap(AttributeMap & map, const char ** attr)
{
 using namespace vocab;
 static const char * days[] =
  { monday, tuesday, wednesday, thursday, 
                  friday, saturday, sunday, 0 };
 static const char * subjects[] =
  { math , english , science , 0 };
 static const struct {
  const char * attr_name;
  const char ** restrict_values;
 } attr_vocab[] = 
  { { day, days }
  , { name, 0 }
  , { hour, 0 }
  , { subject, subjects }
  , { 0, 0 }
 };
 for(size_t i = 0; attr[i]; i += 2) {
  Attribute attrib(attr[i+1]);
  int fd = -1;
  for(size_t j = 0; attr_vocab[j].attr_name; ++j) {
   if(strcmp(attr_vocab[j].attr_name,attr[i]) == 0) {
    fd = j;
    break;
   }
  }
  if(fd < 0) {
   // throw unrecognized attribute
  }
  bool val_fd = (attr_vocab[fd].restrict_values == 0);
  for(size_t j = 0; !val_fd 
   && attr_vocab[fd].restrict_values[j]; ++j) {
   if(strcmp(attr_vocab[fd].restrict_values[j],attrib.c_str
()) == 0) {
    val_fd = true;
    break;
   }
  }
  if(!val_fd) {
   // throw invalid value for attribute
  }
  map[attr_vocab[fd].attr_name] = attrib;
 }
}


Pushing the XML reader state is now possible with the factory<> template function (using C++ template function partial specialization in order to get the attribute values into the data structures):

template< typename T > Reader * 
factory(Reader *n, AttributeMap &)
{
 return new ReaderImpl<T>(n);
}

template<> Reader * 
factory<Student>(Reader *n, AttributeMap & map)
{
 ReaderImpl<Student> * res = new ReaderImpl<Student>(n);
 strcpy(res->obj->name,map.getOrThrow(vocab::name).c_str());
 return res;
}

template<> Reader * 
factory<Teacher>(Reader *n, AttributeMap & map)
{
 ReaderImpl<Teacher> * res = new ReaderImpl<Teacher>(n);
 strcpy(res->obj->subject,
  map.getOrThrow(vocab::subject).c_str());
 strcpy(res->obj->name, map.getOrThrow(vocab::name).c_str());
 return res;
}

template<> Reader * 
factory<Class>(Reader *n, AttributeMap & map)
{
 ReaderImpl<Class> * res = new ReaderImpl<Class>(n);
 strcpy(res->obj->day, map.getOrThrow(vocab::day).c_str());
 res->obj->hour = map.getOrThrow(vocab::hour).intVal();
 return res;
}


void
push(Reader **parent, const char * el, const char ** attr)
{
 static const struct {
  const char * el_name;
  Reader * (*factory_fun)(Reader *, AttributeMap &);
 } lookup[] =
  { { vocab::school,  factory<School> }
  , { vocab::clazz,   factory<Class> }
  , { vocab::teacher, factory<Teacher> }
  , { vocab::student, factory<Student> }
  , { 0, 0 } };
  
        AttributeMap map;
        fillAttributeMap(map,attr);
 for(size_t i = 0; lookup[i].el_name; ++i) {
  if(strcmp(lookup[i].el_name,el)==0) {
   *parent = (*(lookup[i].factory_fun))(*parent,map);
   return;
  }
 }
 // throw not found
}

Result *
get_initial_obj_and_delete(Reader *r)
{
 ReaderImpl<Result> * ri =
  dynamic_cast<ReaderImpl<Result> > *>(r);
 // throw on !ri (cast failed)
 Result * res = ri->obj;
 delete r;
 return res;
}

} // namespace state
} // namespace xml


I have left the exception throwing parts of the code as comments, they really need to be there for this software to properly handle errors.

Summary


The first revision of the code that this sample is based on had a huge switch statement in it and maintained lots of state variables in order to get the job done.  These variable kept track of the XML state that had been read.  The code was fragile and hard to maintain (other coders easily broke it).  The sample above does more validation, encapsulates the knowledge of the target data structures (and how they are constructed) in one compilation unit, and the functions involved are much smaller.  The type checking that happens within the C++ template code also helps catch syntax errors for input that does not match what is expected.  I hope to compare this implementation with a comparable one in OCaml at some point in the future.  Static typing, recursive data structures, pattern matching and type inference will likely make for much simpler implementation of this code.

No comments:

Post a Comment


Follow Mark on GitHub