RiskLib 0.1 Parsing

bwyss edited this page Mar 16, 2011 · 2 revisions

Overview:

The OpenQuake engines need to support a variety of input and output formats. At a later stage in the project it will likely make sense to develop a rigorous, documented formal exchange format. At this point, it's most important for us to:

  • Not duplicate work
  • Get things working end-to-end
  • Support as diverse a group of real-world users as possible, as early as possible

With this in mind, I suggest that, rather than undertaking formal development of the data format specification, we simply treat it as an area of common development. However, this means it's truly COMMON: one set of Python modules that everyone collaborates on. I expect to see a tremendous amount of discussion, either on Skype and IRC, or on a mailing list (if folks would like to take the time to develop a well-reasoned rationale for their approach). From a technical standpoint, let's make sure we're using the appropriate underlying Python classes for each type of input file:

  • If it's a data file (i.e., if we need to support both input and output of this format), use the Python "codecs" module, and implement IncrementalEncoder and IncrementalDecoder.
  • If it's a configuration file, check whether a --flagfile would serve better before reaching for properties/ini config files.
  • When you're writing your parsing library, make sure you can round-trip the data (decode a file, then encode it back to a file, and end up with equivalent files).
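As a rough illustration of the first and last points, here is a minimal sketch of an IncrementalEncoder / IncrementalDecoder pair with a round-trip check. The "site_id,value" line format, the RecordEncoder and RecordDecoder class names, and the round_trip helper are all hypothetical, invented for this example; they are not an agreed OpenQuake format. The sketch assumes Python 3.

```python
import codecs


class RecordEncoder(codecs.IncrementalEncoder):
    """Encode (site_id, value) records as UTF-8 'site_id,value' lines.

    The record layout is a made-up example format, not a project standard.
    """

    def encode(self, input, final=False):
        # `input` is an iterable of (str, float) records.
        lines = ["%s,%r\n" % (site_id, value) for site_id, value in input]
        return "".join(lines).encode("utf-8")


class RecordDecoder(codecs.IncrementalDecoder):
    """Decode byte chunks into records, buffering any incomplete line."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.buffer = b""

    def decode(self, input, final=False):
        data = self.buffer + input
        # Everything before the last newline is complete; keep the rest.
        *complete, rest = data.split(b"\n")
        if final and rest:
            # Accept a last line that has no trailing newline.
            complete.append(rest)
            rest = b""
        self.buffer = rest
        records = []
        for line in complete:
            if not line:
                continue
            site_id, value = line.decode("utf-8").split(",")
            records.append((site_id, float(value)))
        return records

    def reset(self):
        self.buffer = b""


def round_trip(raw, chunk_size=7):
    """Decode `raw` in small chunks, then encode the records again."""
    decoder = RecordDecoder()
    records = []
    for start in range(0, len(raw), chunk_size):
        records.extend(decoder.decode(raw[start:start + chunk_size]))
    records.extend(decoder.decode(b"", final=True))
    return records, RecordEncoder().encode(records, final=True)


sample = b"site-1,0.25\nsite-2,1.5\n"
records, encoded = round_trip(sample)
assert encoded == sample  # the round trip reproduces the input
```

Feeding the decoder deliberately small chunks exercises the buffering path, which is the part most likely to break on large files. Note that byte-for-byte equality only holds when the input is already in the encoder's canonical form; an input value such as "0.10" would come back as "0.1", which is equivalent but not identical.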

Note also that the Python zlib_codec supports on-the-fly decompression, which is a useful optimization for large binary datasets (decompression is almost always faster than the disk I/O).
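A minimal sketch of on-the-fly decompression with zlib_codec, assuming Python 3 and the standard library only. The decompress_in_chunks helper and the sample payload are hypothetical, made up for illustration.

```python
import codecs


def decompress_in_chunks(compressed, chunk_size=16):
    """Yield decompressed pieces without holding the whole input decoded."""
    decoder = codecs.getincrementaldecoder("zlib_codec")()
    for start in range(0, len(compressed), chunk_size):
        piece = decoder.decode(compressed[start:start + chunk_size])
        if piece:
            yield piece
    # Flush whatever the decompressor is still holding.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


payload = b"site-1,0.25\n" * 1000
compressed = codecs.encode(payload, "zlib_codec")
restored = b"".join(decompress_in_chunks(compressed))
assert restored == payload
```

In real use the chunks would come from reading a file in blocks rather than from slicing an in-memory bytes object, and each decompressed piece could be handed straight to a record decoder.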

Some research and references:

REQUIREMENTS:

  • Fast (to be quantified) serialization and deserialization
  • Buffered deserialization
  • Straightforward ETL / simple translations
  • Schema and schema validation (nice-to-have)
  • Human-readable (nice-to-have)

OPTIONS:

