bmatsuo edited this page Jul 13, 2011 · 4 revisions

Note: This benchmark data is old. Since then, the row iteration method has changed and other new features have been added. The measurements for the csvutil package are likely not as accurate anymore. I will work on updating the benchmark program.

Parsing

I ran some initial benchmark tests of csvutil versus gocsv, the dominant Go CSV parsing library. I made three random CSV files of different sizes: a small one, a medium one, and a large one (by file size). The small one was square, the medium one had long rows, and the large one had many rows and fewer columns. I tested the parsing in two settings:

  • Read the whole CSV file into a byte array and have the parsers read/parse data from the byte array. This avoids any I/O overhead during the actual parsing. This only worked for the small and medium CSV files I created. The large one was too big to allocate contiguous space for.

  • Read the CSV data directly from the files containing it. This is closer to a real-world application, since it hardly seems practical to read the whole CSV stream into memory beforehand as in the previous method. However, disk I/O overhead could complicate accurate benchmark timing.

I/O overhead turned out not to be an issue for the most part. The exception is gocsv's processing of the small file when the whole file was first read into memory. Something about gocsv's algorithm works very well when the source data is all in memory with good cache locality (a small amount of data). It is able to zip over the data, while the bufio.Reader underlying csvutil's Reader objects makes unnecessary copies of the bytes. This could probably be optimized in the future, but the use case is fairly limited, so it is a low priority.

The data provided here was generated on my Eee PC netbook with a 2-core Intel Atom N280 processor, 2 GB of RAM, and Ubuntu 10.04 (Lucid) running Linux kernel 2.6.32-32-generic. It shows the relative performance of the row iteration methods of gocsv and csvutil on the three benchmark CSV files I generated. There are three methods for iterating over rows in csvutil; I ran the benchmark functions for two of them (the third should be somewhere close). I tested the ReadRow() method that gocsv util requires (CSVUIter), and the concurrent ReaderRowIteratorAuto method for iterating over a whole data stream (CSVUIterO). These are compared against the only row iteration mechanism implemented in gocsv (GCSVIter).

You can see a steady improvement in the data rate of csvutil over gocsv, except in the case mentioned above. Outside that case, the csvutil library consistently streams at least 1.50 MB/s faster than gocsv.

The ability of each library to read entire files into memory (ReadFile) and to dump CSV data (WriteRow/WriteFile) still needs benchmarks, but my intuition suggests they will show performance similar to what is seen here.

I will make the gotest benchmark available, and I will work on doing the same for the benchmark files.

#
# Generated with the command 
#
#    gotest -x -v -bench=".*" -file csv_test.go
#
# Using csvutil version: 0.2.3
#       gocsv Revision: 9fcac9155dbf
#
gotest 0.04s: gomake testpackage-clean
rm -f _test/csvbench.a
 [+0.05s]
gotest 0.09s: gomake testpackage GOTESTFILES=csv_test.go
8g  -o _gotest_.8 csvbench.go  csv_test.go
rm -f _test/csvbench.a
gopack grc _test/csvbench.a _gotest_.8 
 [+0.11s]
gotest 0.19s: gomake -s importpath
 [+0.06s]
gotest 0.26s: 8g -I _test _testmain.go
 [+0.03s]
gotest 0.28s: 8l -L _test _testmain.8
 [+0.62s]
gotest 0.90s: ./8.out -test.v=true -test.bench=.*
testing: warning: no tests to run
PASS
csvbench.BenchmarkCSVUIterOShortBuff	   20000	     60388 ns/op	35572.80 MB/s
csvbench.BenchmarkCSVUIterShortBuff	   50000	     23850 ns/op	90070.02 MB/s
csvbench.BenchmarkGCSVIterShortBuff	  100000	     10280 ns/op	208965.95 MB/s
csvbench.BenchmarkCSVUIterOMidBuff	       1	23038703000 ns/op	   4.66 MB/s
csvbench.BenchmarkCSVUIterMidBuff	       1	22852601000 ns/op	   4.70 MB/s
csvbench.BenchmarkGCSVIterMidBuff	       1	37386770000 ns/op	   2.87 MB/s
csvbench.BenchmarkCSVUIter0ShortStream	       1	24460132000 ns/op	   4.39 MB/s
csvbench.BenchmarkGCSVIterShortStream	       1	38147595000 ns/op	   2.82 MB/s
csvbench.BenchmarkCSVUIterShortStream	       1	23708103000 ns/op	   4.53 MB/s
csvbench.BenchmarkCSVUIter0MidStream	       1	24466523000 ns/op	   4.39 MB/s
csvbench.BenchmarkGCSVIterMidStream	       1	38281469000 ns/op	   2.81 MB/s
csvbench.BenchmarkCSVUIterMidStream	       1	23677385000 ns/op	   4.54 MB/s
csvbench.BenchmarkCSVUIter0LongStream	       1	24446052000 ns/op	   4.39 MB/s
csvbench.BenchmarkGCSVIterLongStream	       1	38211089000 ns/op	   2.81 MB/s
csvbench.BenchmarkCSVUIterLongStream	       1	24360139000 ns/op	   4.41 MB/s
 [+365.87s]
gotest 366.77s: done