:encoding is not just for slurp alone

You certainly know this moment when that small tool you have written, after it worked perfectly well in your development environment, meets the harsh conditions of the outside world. Some bad people call this the ‘real world’, but sometimes it’s too bad to be real. This is another chapter in this titanic struggle, and it’s only worth noting because I’ve found very little information about how to overcome it.

That’s the situation:

  • Write a small tool that reads a text file (check)
  • line by line (check)
  • that extracts data per line (check)
  • and produces some output with that extracted data. (check)

Not that big a deal, is it? I’m using Leiningen, so I ran lein new del2sql and edited project.clj and src/del2sql/core.clj (Don’t use this code: there’s an error in it.):

And here’s the core.clj:

(Actually it’s a bit more complex. I’ve simplified it, because the program logic doesn’t really matter.) So what’s the deal?

main expects a file name as input, which it passes to read-and-transform. Its with-open / doseq construct reads the input file line by line, and the :encoding keyword says that the file is in Windows format. This is worth noting, because :encoding is usually mentioned with functions like slurp, but not with reader. Both functions use “UTF-8” as a default, and in most cases this will run smoothly, but here we are leaving the safe path of default values.
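As a minimal sketch of the with-open / doseq pattern described above (this is not the original core.clj; the function name process-lines and its callback argument f are assumptions made for illustration), passing :encoding to reader could look like this:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper, not the author's read-and-transform:
;; opens `file-name` with the given encoding and calls `f` on every line.
(defn process-lines
  [file-name encoding f]
  ;; with-open closes the reader when the body finishes, even on errors
  (with-open [rdr (io/reader file-name :encoding encoding)]
    ;; doseq walks the lazy line-seq while the reader is still open
    (doseq [line (line-seq rdr)]
      (f line))))
```

Calling (process-lines "input.txt" "windows-1252" println) would print each decoded line; the file name here is only a placeholder.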

“Windows format” is a bit misleading, because when you open this file in a Windows editor like Notepad, the editor says it’s an ANSI file. Behind this “ANSI” (which was never standardized) hides a code page. In this particular case it’s code page 1252, a.k.a. Cp-1252 or Windows-1252, commonly mislabeled as ISO 8859-1 (which it isn’t; it’s a superset of that ISO character set). This format has enough special characters, and the designers of the text file my tool has to read have chosen it for this very reason.

After a line has been read, it’s destructured into a vector whose elements are then used for composing the output string.
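The destructuring step might be sketched as follows. The original transformation isn’t shown here, so everything in this example is hypothetical: the semicolon delimiter, the three fields, the table name people, and the function name line->sql.

```clojure
(require '[clojure.string :as str])

;; Hypothetical transformation: the delimiter, the field layout and the
;; table name are assumptions for illustration only.
(defn line->sql
  [line]
  ;; split the line and bind the first three fields by position
  (let [[id surname city] (str/split line #";")]
    ;; no SQL escaping is done here; a real tool would need it
    (format "INSERT INTO people (id, surname, city) VALUES (%s, '%s', '%s');"
            id surname city)))
```

Special characters simply pass through format unchanged, so this step is unlikely to be where they get lost.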

I’m using Emacs for developing Clojure code like this, and within Emacs this code executed without problems. Unfortunately when running this code with

lein run def2sql

errors emerged, because all special characters were replaced by “?”. The question is: where exactly does this problem occur? Is it the reader? But it already knows that it has to use the “Windows-1252” encoding. Is it the output function println? I wrote the output to a file instead, but the results were the same, so it wasn’t println. Was it the shell and its environment variables? Probably yes. When users try to use special characters, they are often given the advice to use the additional option
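One way to narrow down where the “?” comes from is to look at what the JVM believes its default encoding is, and to check how a character survives a round trip through a given charset. The following is only a diagnostic sketch; the helper name encode-decode is made up for this example. On the JVM, a literal “?” typically appears on the encoding side, when a character cannot be represented in the target charset.

```clojure
(import 'java.nio.charset.Charset)

;; Hypothetical helper: encodes `s` into bytes using `charset-name`
;; and decodes those bytes back with the same charset.
(defn encode-decode
  [^String s ^String charset-name]
  (String. (.getBytes s charset-name) charset-name))

;; What the running JVM uses by default (values depend on the platform
;; and on options such as -Dfile.encoding).
(println "file.encoding:  " (System/getProperty "file.encoding"))
(println "default charset:" (str (Charset/defaultCharset)))

;; U+00FC (u with diaeresis) exists in Windows-1252 but not in US-ASCII.
(println (encode-decode "\u00fc" "windows-1252")) ; character survives
(println (encode-decode "\u00fc" "US-ASCII"))     ; replaced by "?"
```

Running this once from Emacs and once via lein run should show whether the two environments disagree about the default charset.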

:jvm-opts ["-Dfile.encoding=(codepage)"]

in their project.clj. So did I. I thought this would cover any cases of doubt, but it didn’t. When I eventually removed this line from my project.clj, the output was right at last, including all special characters.

So be careful with :jvm-opts in your project.clj. I haven’t figured out yet what exactly the problem was, but I’m keeping this behaviour in mind. Tell me if you know something about this or if you have / had similar problems.

About Manfred Berndtgen

Manfred Berndtgen, maintainer of this site, is a part-time researcher with enough spare time for doing useless things and sharing them with the rest of the world. His main photographic subjects are made of plants or stones, and since he's learning Haskell everything seems functional to him.