extsort 1.0: external sorting library for Common Lisp

extsort is a Common Lisp library for external sorting of Unicode text files. External sorting is a sorting process where the data to be sorted cannot fit entirely in memory: this is the case, for example, of big files.

Source code is released under a MIT License and hosted at git repository https://gitlab.com/janduja/extsort.git under branch 1.0.

A compressed zip package can be downloaded at http://janduja.com/extsort/extsort-1.0-tar.gz.

Contents:

  1. Installation
  2. Use
  3. Internals
  4. Tests and compatibility

Installation

The library has no dependencies and can be loaded directly or as an ASDF package. First, get the source code or unpack the zip file.

To load directly:

(load "extsort.lisp")
To load as an ASDF system, ensure that the library source is located in a directory looked by ASDF (ie. ~/common-lisp/extsort), then in your code:
(asdf:load-system :extsort)

Use

[Generic] extsort origin-file target-file comparator max-line-proc => target-file
Sorts the lines of origin-file, by comparator function, that cannot fit entirely in memory. The sorted ouput file is written into target-file. The integer max-line-proc expresses the max number of lines to be processed in memory by extsort. Returns the target-file.

[Method] extsort (string,string)
Converts the string to pathname by parse-namestring then recall extsort

[Method] extsort (pathname,pathname)

The following example supposes the existence of a file containing a list of countries with their GDP in this format:

Taiwan|1177052
Austria|432424
South Africa|761926
Philippines|878980
...
This call performs an alphabetical sort of the file, processing a maximum number of 100 lines in memory, and writes the result into a file alph_result in relative folder output:
(extsort "countries_gdp_2017" "output/alph_result" #'string< 100)
This call use the regex library cl-ppcre and performs an ascending GDP sorting, processing a maximum number of 50 lines in memory:
(let* ((get-gdp (lambda (x) (parse-integer (cadr (cl-ppcre:split "\\|" x)))))
       (comparator (lambda (x y) (< (funcall get-gdp x) (funcall get-gdp y)))))
  (extsort "countries_gdp_2017" "output/asc_gdp_result" comparator 50))

Internals

The external sort is performed in two phases: first the input file is scanned sequentially to produce temporary smaller sorted files, then these files are merged togheter via a k-way merge to produce the final result.

Phase 1

The steps in the first phase are:

  1. the input file is read into a buffer (structure buffer-phase1) containing a maximum number of lines as specified by parameter max-line-proc in function extsort
  2. the buffer lines get sorted according to the comparator parameter in extsort
  3. the sorted lines are written into a new file located in the same directory as the future final file
  4. all the previous steps are repeated until the input file read is completed

Phase 2

In the second phase a list of buffers (structure buffer-phase2) is built for each temporary file, the capacity of each being the max-line-proc parameter of extsort divided by the number of temporary files (floor rounded).

A k-way merge is performed peeking (function peek) the top element of each buffer and choosing the first according to the comparator parameter in extsort. The chosen element is then removed from its buffer (function poll) and written to the final file. The buffers automatically get loaded, when empty, to the maximum of their capacity.

The process terminates when no more lines che be read by the buffers.

Tests and compatibility

Tests are collected in file extsort-test.lisp under the test directory and have dependencies on lisp-unit and cl-ppcre.

To run the tests you have to position as current directory in the extsort/test, then launch your Common Lisp implementation and load the file extsort-test.lisp. A Unit Test report is printed on the standard output telling the run tests, the passed and failed assertions and the execution errors.

The tests run successfully with the following Common Lisp implementations and hardware configuration:

[uname]Debian 3.16.36-1+deb8u2 (2016-10-19) i686 GNU/Linux
[/proc/cpuinfo]model name : Intel(R) Pentium(R) M processor 1.70GHz
[/proc/cpuinfo]cache size : 2048 KB
[/proc/cpuinfo]address sizes : 32 bits physical, 32 bits virtual
[cpufreq-info]hardware limits: 600 MHz - 1.60 GHz
[/proc/meminfo]MemTotal: 511492 kB


Last edited on: 13 May 2017
Author: Raffaele Arecchi
Contact the author of this document via email to: raffaele.arecchi {at} gmail {point} com.