Version 0.8.0 - 1st February 2016
- Removed all Tesla-generation GPU support from QUDA (sm_1x). As a
result, QUDA now requires a Fermi-generation GPU or newer.
- Added support for building QUDA using cmake. This gives a much more
flexible and extensible build system as well as allowing
out-of-source-directory building. For details see:
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake
- Improved strong scaling of the multi-shift solver by overlapping the
shift updates with the communications wait of the subsequent
iteration's dslash.
- Improved performance of multi-shift solver by preventing unnecessary
refinement of shifted solutions once the residual falls below
floating point precision.
- Significantly improved performance of the FloatNOrder accessor
functors by ensuring vectorized memory accesses and removing
unnecessary type conversions. This gives a significant speedup to
all algorithms that use these accessors.
- Significant improvement in compilation time using C++ traits to
prune build options.
- Added support for gauge-field reconstruction to naive staggered
fermions.
- Added hypercubic random number generator with multi-GPU support.
- Added topological charge computation.
- Added final computational routines to allow for complete off-load of
MILC staggered RHMC to QUDA (momActionQuda - compute the momentum
contribution to the action, projectSU3Quda - project the gauge field
back onto the SU(3) manifold).
- In the MILC interface staggered solver, the resident gauge field is
reused until it is invalidated by constructing new links (or
overridden with the `num_iters` back door flag).
- Improved gauge field unitarization robustness and added check for
NaN in the results.
- Some cleanup and kernel-fusion optimization of the gauge-force and
HISQ-force kernels. This also improves compilation time and reduces
library size.
- Added support for an imaginary chemical potential to the staggered
phase application / removal kernel, and fixed bugs in this routine.
- Algorithms that previously used double-precision atomics now use a
cub reduction. This drastically improves performance of such
routines.
- QUDA can now be configured to enable NVTX markup on the TimeProfile
class and MILC interface to give improved visual profiling.
- All gauge field copies now check for NaN when `HOST_DEBUG=yes` to
improve debugging.
- Set tunecache.tsv to be invalid if the git id changes, to ensure a
valid tune cache is used.
- Reduced BLAS tuning overhead, by setting the maximum grid size to be
twice the SM count to avoid an unnecessarily large parameter sweep.
- Added new profile that records total time spent in QUDA.
- Fixed bugs in long-link field generation.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/pulls?q=is%3Apr+is%3Aclosed+milestone%3A%22QUDA+0.8.0%22
https://github.com/lattice/quda/issues?q=is%3Aissue+milestone%3A%22QUDA+0.8.0%22+is%3Aclosed
Version 0.7.2 - 7th October 2015
- Added support for computing the temporal and spatial plaquettes
separately.
- Fixed a memory leak in the MPI communications.
- Fixed issues with the assignment of GPUs to processes when using the
QMP backend on multiple nodes, each with multiple GPUs.
- Fixed a bug in the MR solver which led to incorrect convergence.
- Similar to the NVTX markup support for MPI added in 0.7.1 we now
support NVTX markup for calls to the MILC interface. Enabled by using
"--enable-milc-nvtx" when configuring QUDA.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.2%22+is%3Aclosed
Version 0.7.1 - 11th June 2015
- Added Maxwell-generation GPU support.
- Added automatic support for NVTX markup of MPI calls for visualizing
MPI calls in the visual profiler. Enabled by using
"--enable-mpi-nvtx" when configuring QUDA.
- Modified the clover derivative code to use gauge::FloatNOrder structs,
which in the process adds support for different reconstruct types.
- Added autotuning support to clover derivative and sigma trace
computations.
- Multiple fixes and improvements to the GPU_COMMS feature of QUDA:
fixed a bug when using full-field fermions, improved support on Cray
systems, and added much more robust checking of message-buffer memory
when host debugging is enabled.
- The multi-GPU dslash now correctly reports flops and bandwidth when
autotuning.
- Fixed a bug whereby the 5-d domain wall dslash was executed twice
every time it was called.
- Fixed a bug when using both improved staggered fermions and naive
staggered fermions with auto-tuning enabled.
- Fixed a bug when using fused exterior kernels with auto-tuning
enabled that could lead to incorrect results.
- To aid debugging, QUDA now prints its version, including a git id
tag, when initialized.
- Drastically improved Doxygen markup of the MILC interface.
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.1%22+is%3Aclosed
Version 0.7.0 - 4th February 2015
- Added support for twisted-clover, 4-d preconditioned domain wall and
4-d preconditioned Mobius fermions.
- Reworked auto-tuning framework to drastically reduce the lookup
overhead of querying the tune cache. This has the effect of
improving the strong scaling (greater than 10% improvement in solver
performance seen at scale).
- Support for GPU-aware MPI and GPUDirect RDMA for faster multi-GPU
communication. This option is enabled using the --enable-gpu-comms
option (GPU_COMMS in make.in), and requires a GPU-aware MPI stack
(MVAPICH or OpenMPI).
- Reduction in communication latency for half-precision dslash through
merging the main quark and norm fields into a contiguous buffer for
host to device transfers. This reduces API overhead and increases
sustained PCIe bandwidth.
- Added support for double buffering of the MPI receive buffers in the
multi-GPU dslash to allow for early preposting of MPI_Recv.
- Implemented an initial multi-threaded dslash (parallelizing between
MPI and CUDA API calls) to reduce overall CPU frequency sensitivity.
This implementation is embryonic: it simply provides for early
preposting of MPI_Recv and will be extended to parallelize between
MPI_Test and CUDA event querying.
- Added an alternative multi-GPU dslash where the update of the
boundary regions is deployed in a single kernel after all
communication is complete. This reduces kernel launch overhead and
ensures communication is done with maximum priority.
- Reworked multi-GPU dslash interface: there are now different
policies supported for a variety of execution flows. Supported
policies at the moment are QUDA_DSLASH (legacy multi-gpu that
utilizes face buffers for communication buffers), QUDA_DSLASH2 (the
default - regular multi-GPU dslash with CPU-routed communication),
QUDA_FUSED_DSLASH (use a single kernel to update all boundaries
after all communication has finished), QUDA_GPU_COMMS_DSLASH (all
communication emanates directly from GPU memory locations),
QUDA_PTHREADS_DSLASH (multi-threaded dslash). This should be
considered experimental, and selecting the policy type has yet to be
exposed through the interface.
- New routines for construction of the clover matrix field and
inversion of the clover matrices (with optional computation of the
trace log of the clover field). Presently exposed by using
loadCloverQuda with NULL pointers to host fields to force
construction instead of download of the clover field (see the sketch
at the end of this version's notes).
- Implemented support for exact momentum exponentiation to
complement the pre-existing Taylor expanded variant
(updateGaugeFieldQuda).
- Partial implementation of the clover-field force terms
(clover_deriv_quda.cu and clover_trace_quda.cu).
- All extended gauge field creation routines have been offloaded to
QUDA, minimizing PCIe traffic and CPU time. This has led to a
significant speedup in routines that need this, e.g., the gauge
force.
- Initial support for extended fermion-field creation routines (only
supports staggered fields).
- Fermion field outer product implemented in QUDA. Only exposed for
staggered fermions at present (computeStaggeredOprodQuda).
- EigCG eigenvector deflation algorithm and subsequent initCG
implemented for the preconditioned normal operator. Added a
deflation_test to demonstrate the use of this algorithm.
- Implemented Lanczos eigenvector solver (no unit test yet for
demonstrating this - presently only hooked into the CPS).
- Implemented initial support for communication-avoiding s-step
solvers: CG (QUDA_MPCG_INVERTER) and BiCGstab
(QUDA_MPBICGSTAB_INVERTER). These are only a proof of concept at the
moment and need to be optimized.
- Implemented initial support for overlapping domain-decomposition
preconditioners. Presently only proof of concept and needs further
development.
- Implemented initial support for applying different phases to a gauge
field. Presently only proof of concept and needs further
development. Will be useful for minimizing memory and PCIe traffic
in staggered HMC.
- Implemented support for computation of the gauge field plaquette.
- Implemented initial support for fermion-field contractions.
- Added support for the CGNE solver, to complement the already
existing CGNR.
- Improvements to the stability and robustness of the solvers in mixed
precision. QUDA will default to always using a high-precision
solution accumulator, since this drastically improves convergence,
especially when using half precision.
- Improved the stability and robustness of CG when used in combination
with the Fermilab heavy-quark residual stopping criterion. This has
been validated against the MILC implementation.
- Separated dslash_quda.cu into multiple files to allow for parallel
building to increase compilation speed.
- Added interface support for Luescher's chiral basis for fermion
fields: page 11 of doc/dirac.ps in the DD-HMC code package
http://luscher.web.cern.ch/luscher/DD-HMC. This is selected through
setting QudaInvertParam::gamma_basis = QUDA_CHIRAL_GAMMA_BASIS.
- QUDA will now complain and exit if it detects that a stale tunecache
is being used.
- Removed official support for obsolete compute capabilities 1.1 and
1.2. This makes the minimum supported device compute capability 1.3
(GT200).
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.0+%22+is%3Aclosed.
- Although not strictly related to this release, we have started to
collect common running settings and hints in the QUDA wiki:
https://github.com/lattice/quda/wiki.
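As an illustration of the clover construction entry point described
above, the following minimal sketch asks QUDA to build the clover
field and its inverse on the device instead of downloading them from
the host. It is illustrative only: the enum and member names follow
quda.h, and all remaining solver parameters must still be set for a
real run.

    #include <quda.h>

    void build_clover_on_device(void)
    {
      QudaInvertParam inv_param  = newQudaInvertParam();
      inv_param.dslash_type      = QUDA_CLOVER_WILSON_DSLASH;
      inv_param.clover_cpu_prec  = QUDA_DOUBLE_PRECISION;
      inv_param.clover_cuda_prec = QUDA_DOUBLE_PRECISION;
      /* ... set the remaining clover and solver parameters ... */

      /* Passing NULL for both host pointers forces QUDA to construct
         the clover field (and its inverse) from the resident gauge
         field, instead of downloading them from the host. */
      loadCloverQuda(NULL, NULL, &inv_param);
    }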
Version 0.6.1 - 10th March 2014
- All unit tests now enable/disable CPU-side verification with the "--verify
true/false" flag. The default is true.
- The Google Test API is now used in some of the unit tests
(dslash_test, staggered_dslash_test and blas_test). (Eventually all
unit tests will be built using this.)
- Various bugs have been fixed in fermion_force_test,
hisq_paths_force_test, hisq_unitarize_force_test and
unitarize_link_test
Version 0.6.0 - 23rd January 2014
- Support for reconstruct 9/13 for the long link in HISQ fermions.
This provides up to a 25% speedup over using no reconstruction.
Owing to architecture constraints, reconstruct 9/13 is not supported
on "Tesla" architectures, and is only supported on superseding
architectures (Fermi, Kepler, etc.).
- Implemented the long link calculation for HISQ and asqtad fermions.
This has the net result of speeding up the gauge fattening by
about a factor of 1.6.
- Implemented a gauge field update routine that evolves the gauge
field by a given step size using a momentum field. This is exposed
as the function updateGaugeFieldQuda(...).
- Added support for qdpjit field ordering. When used in conjunction
with the device interface, this allows Chroma (when compiled using
qdpjit) to avoid all CPU <-> GPU transfers.
- Completely rewrote the gauge and clover field copying routines using
a generic template-driven approach. Due to the large number of
possible input / output combinations, the different interfaces must
be opted into at configure time to keep the compilation time under
control (the MILC and QDP interfaces are enabled by default).
- The QUDA interface (loadGaugeQuda, loadCloverQuda, invertQuda and
invertMultiShiftQuda) now supports device-side pointers as well as
host-side pointers. The location of a given pointer is set by the
QudaFieldLocation members of QudaGaugeParam (location) and
QudaInvertParam (input_location, output_location, clover_location).
A sketch appears at the end of this version's notes.
- Added new interface support for QDPJIT ordered fields (dirac, clover
and gauge fields).
- When doing mixed-precision solvers, all low-precision copies of
gauge and clover fields are created from the pre-existing GPU copies
instead of re-copying from the CPU. This lowers the PCIe overhead
by up to 1.75x.
- Significantly improved performance of both degenerate and
non-degenerate twisted-mass CG solver (up to 17% and 32%, respectively).
- ColorSpinorField is now derived from LatticeField, with all
LatticeField derived classes now using common page-locked and device
memory buffers. This has the effect of reducing the overall
page-locked memory footprint.
- The source vector is now rescaled such that its norm is equal to
unity. This prevents underflow from occurring when the source vector
is too small.
- Fixed double precision definition of *= vector operator, which caused a
truncation to single precision for certain solver types.
- Fixed memory over-allocation when using clover fermions in half
precision.
- Fixed a memory leak in the clover fermion code.
- Added a workaround to allow QUDA to compile with GCC 4.7.x.
- Many small fixes and overall code cleanup.
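A minimal sketch of the device-pointer support noted above. Only the
location members named in that entry are shown; lattice dimensions,
precisions and the remaining solver parameters are omitted and must
be set for a real run.

    #include <quda.h>

    /* Solve with fields that already reside in GPU memory.  The
       location members tell QUDA whether each pointer refers to host
       or device memory. */
    void solve_with_device_pointers(void *gauge_d, void *src_d, void *sol_d)
    {
      QudaGaugeParam gauge_param = newQudaGaugeParam();
      QudaInvertParam inv_param  = newQudaInvertParam();

      gauge_param.location      = QUDA_CUDA_FIELD_LOCATION; /* device gauge field    */
      inv_param.input_location  = QUDA_CUDA_FIELD_LOCATION; /* device source pointer */
      inv_param.output_location = QUDA_CUDA_FIELD_LOCATION; /* device solution       */
      /* ... set lattice dimensions, precisions, solver parameters ... */

      loadGaugeQuda(gauge_d, &gauge_param);
      invertQuda(sol_d, src_d, &inv_param);
    }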
Version 0.5.0 - 20 March 2013
- Added full support for CUDA 5.0, including the Tesla K20 and other
GK110 ("Kepler 2") GPUs. QUDA has yet to be fully optimized for
GK110, however.
- Added multi-GPU support for the domain wall action, to be further
optimized in a future release.
- Added official support for the QDP-JIT library, enabled via the
"--enable-qdp-jit" configure option. With the combination of QUDA
and QDP-JIT, Chroma runs almost entirely on the GPU.
- Added a Fortran interface, found in include/quda_fortran.h and
lib/quda_fortran.F90.
- QUDA is now compatible with the Berlin QCD (BQCD) package,
supporting both Wilson and Clover solvers, including support for
multiple GPUs. This currently requires a specific branch of BQCD
(https://github.com/lattice/bqcd-r399-quda).
- Added a new interface function, initCommsGridQuda(), for declaring
the mapping of MPI ranks (or QMP node IDs) to the logical grid used
for communication. This finally completes the MPI interface, which
previously relied on an undocumented function internal to QUDA.
- Added a new interface function, setVerbosityQuda(), to allow for
finer-grained control of status reporting. See the description in
include/quda.h for details.
- Merged wilson_dslash_test and domain_wall_dslash_test together into
a unified dslash_test, and likewise for invert_test. The staggered
tests are still separate for now.
- Moved all internal symbols behind a namespace, "quda", for better
insulation from external applications and libraries.
- Vastly improved the stability and accuracy of the multi-shift CG
solver. The invertMultiShiftQuda() interface function now supports
mixed precision and implements per-shift refinement after the
multi-shift solver completes to ensure accuracy of the final result.
The old invertMultiShiftQudaMixed() interface function has been
removed. In addition, the multi-shift solver now supports setting
the convergence tolerance on a per-pole basis via the tol_offset[]
member of QudaInvertParam.
- Improved the stability and accuracy of mixed-precision CG. As a
result, mixed double/single CG yields a virtually identical iteration
count to pure double CG, and using half precision is now a win.
- Added support for the Fermilab heavy-quark residual as a stopping
condition in BiCGstab, CG, and GCR. To minimize the impact on
performance, the heavy-quark residual is only measured every 10
iterations (for BiCGstab and CG) or only when the solution is computed
(for GCR). This stopping condition has also been incorporated into the
sequential CG refinement stage of the multi-shift solver. The
tolerance for the heavy-quark residual is set via the "tol_hq"
member of QudaInvertParam (and "tol_hq_offset" for the
multi-shift solver). The "residual_type" member selects the
desired stopping condition(s): L2 relative residual, Fermilab
heavy-quark residual, or both. Note that the heavy-quark residual
is not supported on cards with compute capability 1.1, 1.2, or 1.3
(i.e., those predating the "Fermi" architecture) due to hardware
limitations.
- The value of the true residual(s) is now returned in the true_res
and (for multi-shift) true_res_offset members of the QudaInvertParam
struct. When using the heavy-quark residual stopping condition, the
true_res_hq and true_res_hq_offset members are additionally filled
with the heavy-quark residual value(s).
- The BiCGstab solver now supports an initial-guess strategy. This is
presently only supported when employing a one-pass solve and does
not yet work for a two-pass solve (e.g., of the normal equations).
- Enabled double-precision textures by default, since the Fermi
double-precision instability has been fixed in the driver
accompanying the CUDA 5.0 production release.
- Fixed a bug related to the sharing of page-locked (pinned) memory
between CUDA and Infiniband that affected correct operation of both
Chroma and MILC on some systems.
- Renamed the "QUDA_NORMEQ_SOLVE" solve_type to "QUDA_NORMOP_SOLVE",
and likewise for "QUDA_NORMOP_PC_SOLVE". This better reflects their
behavior, since a "NORMOP" solve will always involve the normal operator
(A^dag A) but might not correspond to solving the normal equations
of the original system.
- Fixed a long-standing issue so that solve_type and solution_type are
now interpreted as described in the NEWS entry for QUDA 0.3.0 below.
More specifically,
solution_type specifies *what* linear system is to be solved.
solve_type specifies *how* the linear system is to be solved.
We have the following four cases (plus preconditioned variants):
solution_type   solve_type   Effect
-------------   ----------   ------
MAT             DIRECT       Solve Ax=b
MATDAG_MAT      DIRECT       Solve A^dag y = b, followed by Ax=y
MAT             NORMOP       Solve (A^dag A) x = (A^dag b)
MATDAG_MAT      NORMOP       Solve (A^dag A) x = b
An even/odd preconditioned (PC) solution_type generally requires a PC
solve_type and vice versa. As an exception, the un-preconditioned
MAT solution_type may be used with any solve_type, including
DIRECT_PC and NORMOP_PC.
As also noted in the entry for 0.3.0 below, with the CG inverter,
solve_type should generally be set to 'QUDA_NORMOP_PC_SOLVE',
which will solve the even/odd-preconditioned normal equations via
CGNR. (The full solution will be reconstructed if necessary based
on solution_type.) For BiCGstab (with Wilson or Wilson-clover
fermions), 'QUDA_DIRECT_PC_SOLVE' is generally best. These settings
are demonstrated in the sketch at the end of this version's notes.
- General cleanup and other minor fixes. See
https://github.com/lattice/quda/issues?milestone=7 for a breakdown
of all issues closed in this release.
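The sketch below ties together several of the interface additions
above: initCommsGridQuda(), setVerbosityQuda(), the solve_type /
solution_type distinction, and the heavy-quark residual stopping
condition. It is illustrative only: the process-grid dimensions,
tolerances and device ordinal are placeholders, enum names and
function signatures follow quda.h, and most members of the param
structs are omitted.

    #include <stdio.h>
    #include <quda.h>

    void example_solve(void *gauge, void *source, void *solution)
    {
      /* Declare how MPI ranks map onto a logical 1x1x2x2 grid;
         passing NULL for the mapping function selects the default
         rank ordering. */
      int grid_dims[4] = {1, 1, 2, 2};   /* placeholder process grid */
      initCommsGridQuda(4, grid_dims, NULL, NULL);
      initQuda(0);                       /* device ordinal (placeholder) */

      setVerbosityQuda(QUDA_SUMMARIZE, "QUDA: ", stdout);

      QudaGaugeParam gauge_param = newQudaGaugeParam();
      QudaInvertParam inv_param  = newQudaInvertParam();
      /* ... fill in lattice dimensions, precisions, fermion action ... */

      /* What to solve and how to solve it (see the table above): the
         full unpreconditioned solution, obtained via CGNR on the
         even/odd-preconditioned normal equations. */
      inv_param.solution_type = QUDA_MAT_SOLUTION;
      inv_param.solve_type    = QUDA_NORMOP_PC_SOLVE;
      inv_param.inv_type      = QUDA_CG_INVERTER;

      /* Stop on both the L2 relative residual and the Fermilab
         heavy-quark residual. */
      inv_param.residual_type = QUDA_L2_RELATIVE_RESIDUAL |
                                QUDA_HEAVY_QUARK_RESIDUAL;
      inv_param.tol    = 1e-10;
      inv_param.tol_hq = 1e-6;

      loadGaugeQuda(gauge, &gauge_param);
      invertQuda(solution, source, &inv_param);
      printf("true residual = %e (heavy quark: %e)\n",
             inv_param.true_res, inv_param.true_res_hq);

      endQuda();
    }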
Version 0.4.0 - 4 April 2012
- CUDA 4.0 or later is now required to build the library.
- The "make.inc.example" template has been replaced by a configure script.
See the README file for build instructions and "configure --help" for
a list of configure options.
- Emulation mode is no longer supported.
- Added support for using multiple GPUs in parallel via MPI or QMP.
This is supported by all solvers for the Wilson, clover-improved
Wilson, twisted mass, and improved staggered fermion actions.
Multi-GPU support for domain wall will be forthcoming in a future
release.
- Reworked auto-tuning so that BLAS kernels are tuned at runtime,
Dirac operators are also tuned, and tuned parameters may be cached
to disk between runs. Tuning is enabled via the "tune" member of
QudaInvertParam and is essential for achieving optimal performance
in the solvers. See the README file for details on enabling
caching, which avoids the overhead of tuning for all but the first
run at a given set of parameters (action, precision, lattice volume,
etc.).
- Added NUMA affinity support. Given a sufficiently recent linux
kernel and a system with dual I/O hubs (IOHs), QUDA will attempt to
associate each GPU with the "closest" socket. This feature is
disabled by default under OS X and may be disabled under linux via
the "--disable-numa-affinity" configure flag.
- Improved stability on Fermi-based GeForce cards by disabling double
precision texture reads. These may be re-enabled on Fermi-based
Tesla cards for improved performance, as described in the README
file.
- As of QUDA 0.4.0, support has been dropped for the very first
generation of CUDA-capable devices (implementing "compute
capability" 1.0). These include the Tesla C870, the Quadro FX 5600
and 4600, and the GeForce 8800 GTX.
- Added command-line options for most of the tests. See, e.g.,
"wilson_dslash_test --help"
- Added CPU reference implementations of all BLAS routines, which allows
tests/blas_test to check for correctness.
- Implemented various structural and performance improvements
throughout the library.
- Deprecated the QUDA_VERSION macro (which corresponds to an integer
in octal). Please use QUDA_VERSION_MAJOR, QUDA_VERSION_MINOR, and
QUDA_VERSION_SUBMINOR instead.
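For example, an application can now guard against building with an
incompatible library at compile time (a sketch; the required version
is a placeholder):

    #include <quda.h>

    /* Refuse to build against anything older than QUDA 0.4.0, using
       the per-component macros rather than the deprecated octal
       QUDA_VERSION integer. */
    #if (QUDA_VERSION_MAJOR == 0) && (QUDA_VERSION_MINOR < 4)
    #error "This application requires QUDA 0.4.0 or later"
    #endif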
Version 0.3.2 - 18 January 2011
- Fixed a regression in 0.3.1 that prevented the BiCGstab solver from
working correctly with half precision on Fermi.
Version 0.3.1 - 22 December 2010
- Added support for domain wall fermions. The length of the fifth
dimension and the domain wall height are set via the 'Ls' and 'm5'
members of QudaInvertParam. Note that the convention is to include
the minus sign in m5 (e.g., m5 = -1.8 would be a typical value).
- Added support for twisted mass fermions. The twisted mass parameter
and flavor are set via the 'mu' and 'twist_flavor' members of
QudaInvertParam. Similar to clover fermions, both symmetric and
asymmetric even/odd preconditioning are supported. The symmetric
case is better optimized and generally also exhibits faster
convergence. (A parameter sketch for both new actions appears at the
end of this version's notes.)
- Improved performance in several of the BLAS routines, particularly
on Fermi.
- Improved performance in the CG solver for Wilson-like (and domain
wall) fermions by avoiding unnecessary allocation and deallocation
of temporaries, at the expense of increased memory usage. This will
be improved in a future release.
- Enabled optional building of Dirac operators, set in make.inc, to
keep build time in check.
- Added declaration for MatDagMatQuda() to the quda.h header file and
removed the non-existent functions MatPCQuda() and
MatPCDagMatPCQuda(). The latter two functions have been absorbed
into MatQuda() and MatDagMatQuda(), respectively, since
preconditioning may be selected via the solution_type member of
QudaInvertParam.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
prevented the use of MatPC solution types.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
would cause a crash when QUDA_MASS_NORMALIZATION is used.
- Fixed an allocation bug in the Wilson and Wilson-clover
Dirac operators that might have led to undefined behavior for
non-zero padding.
- Fixed a bug in blas_test that might have led to incorrect autotuning
for the copyCuda() routine.
- Various internal changes: removed temporary cudaColorSpinorField
argument to solver functions; modified blas functions to use C++
complex<double> type instead of cuDoubleComplex type; improved code
hygiene by ensuring that all textures are bound in dslash_quda.cu
and unbound after kernel execution; etc.
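A minimal parameter sketch for the two new actions described above
(domain wall and twisted mass). Only the action-specific members
named in those entries are shown; the numerical values are
placeholders and the remaining solver parameters are omitted.

    #include <quda.h>

    void setup_new_actions(QudaInvertParam *dw, QudaInvertParam *tm)
    {
      /* Domain wall fermions: fifth-dimension extent and domain wall
         height (the convention includes the minus sign in m5). */
      dw->dslash_type = QUDA_DOMAIN_WALL_DSLASH;
      dw->Ls = 16;      /* placeholder length of the fifth dimension */
      dw->m5 = -1.8;    /* typical domain wall height */

      /* Twisted mass fermions: twist parameter; the flavor is chosen
         via the twist_flavor member (a QudaTwistFlavorType value). */
      tm->dslash_type = QUDA_TWISTED_MASS_DSLASH;
      tm->mu = 0.01;    /* placeholder twisted mass parameter */
    }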
Version 0.3.0 - 1 October 2010
- CUDA 3.0 or later is now required to build the library.
- Several changes have been made to the interface that require setting
new parameters in QudaInvertParam and QudaGaugeParam. See below for
details.
- The internals of QUDA have been significantly restructured to facilitate
future extensions. This is an ongoing process and will continue
through the next several releases.
- The inverters might require more device memory than they did before.
This will be corrected in a future release.
- The CG inverter now supports improved staggered fermions (asqtad or
HISQ). Code has also been added for asqtad link fattening, the asqtad
fermion force, and the one-loop improved Symanzik gauge force, but
these are not yet exposed through the interface in a consistent way.
- A multi-shift CG solver for improved staggered fermions has been
added, callable via invertMultiShiftQuda(). This function does not
yet support Wilson or Wilson-clover.
- It is no longer possible to mix different precisions for the
spinors, gauge field, and clover term (where applicable). In other
words, it is required that the 'cuda_prec' member of QudaGaugeParam
match both the 'cuda_prec' and 'clover_cuda_prec' members of
QudaInvertParam, and likewise for the "sloppy" variants. This
change has greatly reduced the time and memory required to build the
library.
- Added 'solve_type' to QudaInvertParam. This determines how the linear
system is solved, in contrast to solution_type which determines what
system is being solved. When using the CG inverter, solve_type should
generally be set to 'QUDA_NORMEQ_PC_SOLVE', which will solve the
even/odd-preconditioned normal equations via CGNR. (The full
solution will be reconstructed if necessary based on solution_type.)
For BiCGstab, 'QUDA_DIRECT_PC_SOLVE' is generally best. These choices
correspond to what was done by default in earlier versions of QUDA.
- Added 'dagger' option to QudaInvertParam. If 'dagger' is set to
QUDA_DAG_YES, then the matrices appearing in the chosen solution_type
will be conjugated when determining the system to be solved by
invertQuda() or invertMultiShiftQuda(). This option must also be set
(typically to QUDA_DAG_NO) before calling dslashQuda(), MatPCQuda(),
MatPCDagMatPCQuda(), or MatQuda() (see the sketch at the end of this
version's notes).
- Eliminated 'dagger' argument to dslashQuda(), MatPCQuda(), and MatQuda()
in favor of the new 'dagger' member of QudaInvertParam described above.
- Removed the unused blockDim and blockDim_sloppy members from
QudaInvertParam.
- Added 'type' parameter to QudaGaugeParam. For Wilson or Wilson-clover,
this should be set to QUDA_WILSON_LINKS.
- The dslashQuda() function now takes an argument of type
QudaParityType to determine the parity (even or odd) of the output
spinor. This was previously specified by an integer.
- Added support for loading all elements of the gauge field matrices,
without SU(3) reconstruction. Set the 'reconstruct' member of
QudaGaugeParam to 'RECONSTRUCT_NO' to select this option, but note
that it should not be combined with half precision unless the
elements of the gauge matrices are bounded by 1. This restriction
will be removed in a future release.
- Renamed dslash_test to wilson_dslash_test, renamed invert_test to
wilson_invert_test, and added staggered variants of these test
programs.
- Improved performance of the half-precision Wilson Dslash.
- Temporarily removed 3D Wilson Dslash.
- Added an 'OS' option to make.inc.example, to simplify compiling for
Mac OS X.
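A small sketch of the new 'dagger' convention described above
(illustrative only; the remaining members of QudaInvertParam are
assumed to have been set already):

    #include <quda.h>

    void apply_matrix(void *out, void *in, QudaInvertParam *inv_param)
    {
      /* The dagger flag is now read from QudaInvertParam rather than
         being passed as a separate argument. */
      inv_param->dagger = QUDA_DAG_NO;
      MatQuda(out, in, inv_param);    /* apply M        */

      inv_param->dagger = QUDA_DAG_YES;
      MatQuda(out, in, inv_param);    /* apply M^dagger */
    }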
Version 0.2.5 - 24 June 2010
- Fixed regression in 0.2.4 that prevented the library from compiling
when GPU_ARCH was set to sm_10, sm_11, or sm_12.
Version 0.2.4 - 22 June 2010
- Added initial support for CUDA 3.x and Fermi (not yet optimized).
- Incorporated look-ahead strategy to increase stability of the BiCGstab
inverter.
- Added definition of QUDA_VERSION to quda.h. This is an integer with
two digits for each of the major, minor, and subminor version
numbers. For example, QUDA_VERSION is 000204 for this release.
Version 0.2.3 - 2 June 2010
- Further improved performance of the blas routines.
- Added 3D Wilson Dslash in anticipation of temporal preconditioning.
Version 0.2.2 - 16 February 2010
- Fixed a bug that prevented reductions (and hence the inverter) from working
correctly in emulation mode.
Version 0.2.1 - 8 February 2010
- Fixed a bug that would sometimes cause the inverter to fail when spinor
padding is enabled.
- Significantly improved performance of the blas routines.
Version 0.2 - 16 December 2009
- Introduced new interface functions newQudaGaugeParam() and
newQudaInvertParam() to allow for enhanced error checking. See
invert_test for an example of their use.
- Added auto-tuning blas to improve performance (see README for details).
- Improved stability of the half precision 8-parameter SU(3)
reconstruction (with thanks to Guochun Shi).
- Cleaned up the invert_test example to remove unnecessary dependencies.
- Fixed bug affecting saveGaugeQuda() that caused su3_test to fail.
- Tuned parameters to improve performance of the half-precision clover
Dslash on sm_13 hardware.
- Formally adopted the MIT/X11 license.
Version 0.1 - 17 November 2009
- Initial public release.