By Ben Webb
Currently the IATI Dashboard is hosted using GitHub Pages (although this will soon change, see below). As some of you will know, this means that the entire site is made up of static files.
The static HTML files etc. are generated by the code in the IATI-Dashboard repository. This code uses a variety of sources, including some data from GitHub, but is mostly based on static JSON files generated by IATI-Stats. This in turn generates the stats JSON files from IATI publishers' source XML files, which are downloaded by IATI-Registry-Refresher.
There are several advantages to using static files. They are easy to move around, and delete, and easy to serve (only a basic webserver is needed). Using them for data storage removes the need to run, use or understand any database software, which hopefully makes it easier for other people to deploy. All computation can take place overnight, when other server load is at a minimum. Generating static files also makes it easy to check that all pages generate successfully.
I chose JSON for my data storage format because it is easier to manipulate quickly using a dynamic language such as Python or PHP than a more "heavyweight" alternative like XML. After XML, it's also one of the most widely used generic data interchange formats.
Storing the stats output in JSON gives us a simple API for no extra effort. Since the datastore uses this data in this form, I can be sure all information on the dashboard is also accessible via a machine readable source by others. To help people find the relevant JSON file, the dashboard now has links from each table column or graph to the JSON.
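To illustrate what "a simple API for no extra effort" means in practice, here is a minimal sketch of how someone might consume one of these JSON files from Python. The directory layout, the file name and the shape of the data (a mapping from publisher to a count) are assumptions made for the example, not the actual IATI-Stats output format.

```python
import json
from pathlib import Path


def load_stats(stats_dir, name):
    """Read one stats file, assumed to live at <stats_dir>/<name>.json."""
    with open(Path(stats_dir) / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)


def top_publishers(counts, n=3):
    """Return the n (publisher, count) pairs with the largest counts."""
    # Assumes `counts` maps a publisher name to a number.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

Because the files are plain JSON, the same few lines work whether the file is read from a local checkout or downloaded from the web server first.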
Of course, using static files does have its downsides. Some of the problems I've encountered:
- Static files are expensive to perform arbitrary queries on. To avoid this problem, IATI-Stats computes all the different views I need on the data - aggregated by publisher and by file, and "inverted" views that tell me how many publishers/files fell into a certain category (e.g. used a certain code on a codelist).
- Disk space can become problematic, since the static files are uncompressed - initially this was not a problem. As the amount of data I was dealing with grew, I dealt with the increase in disk usage in a number of ways. I hashed input files, and used symlinks to avoid duplicating data when the input hadn't changed. I also made it possible to delete old data, by incrementally generating summaries, instead of creating the whole thing each time.
- Big files are difficult for other people to download.
- Small files take up a lot of space - the smallest section of the disk my filesystem will allocate is 16KB. Thus every file will actually use at least this much space. As a result, lots of small files use far more disk space than their contents alone would suggest.
- In the ext3 file system, a directory can not contain more than 40000 files/subdirectories - I hit this limit when using a single directory to store a cache of all the stats information I'd ever calculated for an XML file.
- Due to the way git works, having a large gh-pages branch in a repository makes checking out the code (even if you just want the master branch) annoyingly slow. As a result, I'm planning on moving away from using GitHub Pages for hosting, instead just hosting the static files with a traditional server setup.
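The "inverted" views mentioned in the first item of the list above can be pictured with a small sketch. It assumes, purely for illustration, that the per-publisher stats are a nested mapping of publisher to codelist code to count; the function names are hypothetical and this is not the IATI-Stats code itself.

```python
from collections import defaultdict


def invert(per_publisher):
    """Turn {publisher: {code: count}} into {code: {publisher: count}}."""
    inverted = defaultdict(dict)
    for publisher, codes in per_publisher.items():
        for code, count in codes.items():
            inverted[code][publisher] = count
    return dict(inverted)


def publishers_using(inverted, code):
    """Number of publishers that used the given code at least once."""
    return len(inverted.get(code, {}))
```

Computing this once overnight and writing the result out as its own JSON file means a question like "how many publishers used this code?" becomes a single lookup rather than a scan over every publisher's file.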
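The hashing and symlink approach from the disk space item above could look roughly like the following. This is a sketch under assumptions: the choice of SHA-1, the cache layout, and the `compute` callback (which stands in for whatever produces the stats JSON for one XML file) are all hypothetical, and it needs a platform where symlinks are available.

```python
import hashlib
import os
from pathlib import Path


def file_hash(path):
    """Hex digest of a file's contents, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_stats(xml_path, cache_dir, out_link, compute):
    """Point out_link at cached stats for xml_path, computing only if needed."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (file_hash(xml_path) + ".json")
    if not cached.exists():
        # Input content not seen before: compute and store the stats once.
        cached.write_text(compute(xml_path), encoding="utf-8")
    out_link = Path(out_link)
    if out_link.is_symlink() or out_link.exists():
        out_link.unlink()
    # A symlink costs almost nothing compared with a second copy of the data.
    os.symlink(cached.resolve(), out_link)
    return cached
```

If the same XML file turns up unchanged on consecutive days, each day's output is just another symlink to the one cached result, so disk usage only grows when the input actually changes.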