You certainly know that moment when a small tool you have written, after working perfectly well in your development environment, meets the harsh conditions of the outside world. Some bad people call this the ‘real world’, but sometimes it’s too bad to be real. This is another chapter in that titanic struggle, and it’s only worth noting because I’ve found very little information about how to overcome it.
That’s the situation:
- Write a small tool that reads a text file (check)
- line by line (check)
- that extracts data per line (check)
- and produces some output with that extracted data. (check)
Not such a biggie, is it? I’m using Leiningen, so I ran lein new del2sql and edited project.clj and src/del2sql/core.clj. (Don’t use this code: there’s an error in it.)
```clojure
(defproject del2sql "0.1.0-SNAPSHOT"
  :description "reads iLife .del file, prints sql insert statements"
  :url "http://mycompany.com/rat"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]]
  :jvm-opts ["-Dfile.encoding=Windows-1252"]
  :aot :all
  :main del2sql.core)
```
And here’s the core.clj:
```clojure
(ns del2sql.core)
(use 'clojure.java.io)

(defn read-and-transform [file]
  (with-open [r (reader file :encoding "Windows-1252")]
    (doseq [line (line-seq r)]
      (when-not (= (first (seq line)) \#) ; skip comment lines
        (let [[slvID slwID _ schlKurz schlKurzRID schlLang]
              (clojure.string/split line #"\|")]
          (println (str "INSERT INTO MY-TABLE (FIELD1, FIELD2, FIELD3) VALUES ('"
                        slvID "', '" slwID "', '"
                        schlKurz ";" schlKurzRID ";" schlLang "');")))))))

(defn -main
  ([] (println "Missing input file name. Aborting."))
  ([filename] (read-and-transform filename)))
```
(Actually it’s a bit more complex. I’ve simplified it, because the program logic doesn’t really matter.) So what’s the deal?
-main expects a file name as input, which it passes to read-and-transform. The with-open / doseq construct there reads the input file line by line, and the :encoding option says that the file is in Windows format. Maybe this is worth noting, because :encoding is usually mentioned with functions like slurp, but not with reader. Both functions use “UTF-8” as a default, and in most cases this will run smoothly, but here we are leaving the safe path of default values.
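To see the :encoding option of reader at work in isolation, here is a minimal sketch. The temp file, its three bytes and the helper name read-first-line are made up for illustration; they are not part of the original tool.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.File)

;; Hypothetical sample file: three raw bytes that spell "äöü" in Windows-1252.
(def sample-file (File/createTempFile "cp1252-sample" ".del"))

(with-open [os (io/output-stream sample-file)]
  (.write os (byte-array (map unchecked-byte [0xE4 0xF6 0xFC]))))

(defn read-first-line
  "Reads the first line of file, decoding its bytes with the named charset."
  [file charset]
  (with-open [r (io/reader file :encoding charset)]
    (first (line-seq r))))

;; Decoded with the charset the file was written in.
(def decoded (read-first-line sample-file "Windows-1252"))

;; Decoded with reader's default; these bytes are not valid UTF-8,
;; so the result no longer matches the original text.
(def misdecoded (read-first-line sample-file "UTF-8"))
```

With the matching charset, decoded holds the three umlauts; with the UTF-8 default, the same bytes do not decode to the original text.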
“Windows format” is a bit misleading, because when you open this file in a Windows editor like Notepad, the editor says it’s an ANSI file. Behind this “ANSI” (which was never standardized) hides a code page, in this particular sample the code page 1252, a.k.a. Cp-1252 or Windows-1252, which is commonly mislabeled as ISO 8859-1 (which it isn’t; it’s a superset of that ISO code). This code page has enough special characters, and the designers of the text file my tool has to read have chosen it for this very reason.
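The difference between the two encodings shows up in the byte range 0x80 to 0x9F, where Windows-1252 places printable characters such as the euro sign. A small sketch (the helper decode-byte is a hypothetical name) that decodes single bytes with both charsets:

```clojure
(defn decode-byte
  "Decodes a single byte value (0-255) using the named charset."
  [b ^String charset]
  (String. ^bytes (byte-array [(unchecked-byte b)]) charset))

;; 0x80 is the euro sign in Windows-1252, but a control character
;; (U+0080) when Java decodes it as ISO-8859-1.
(def euro-cp1252 (decode-byte 0x80 "windows-1252"))
(def euro-latin1 (decode-byte 0x80 "ISO-8859-1"))

;; 0xE4 is "ä" in both, which is why the mislabeling often goes unnoticed.
(def a-umlaut-cp1252 (decode-byte 0xE4 "windows-1252"))
(def a-umlaut-latin1 (decode-byte 0xE4 "ISO-8859-1"))
```

So a file that only contains umlauts reads the same under both labels, and the mismatch only appears once characters from the 0x80 to 0x9F range are involved.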
After a line has been read, it’s split into a vector, which is destructured into named fields that are then used to compose the output string.
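The split-and-destructure step can be tried on its own. The input line below is invented (the real .del layout is not shown here), and line->sql is a hypothetical helper that returns the string instead of printing it:

```clojure
(require '[clojure.string :as string])

;; Hypothetical input line with six pipe-separated fields.
(def sample-line "101|202|ignored|KURZ|RID7|Langer Text")

(defn line->sql
  "Splits a line on | and builds the INSERT statement from the named fields."
  [line]
  (let [[slvID slwID _ schlKurz schlKurzRID schlLang] (string/split line #"\|")]
    (str "INSERT INTO MY-TABLE (FIELD1, FIELD2, FIELD3) VALUES ('"
         slvID "', '" slwID "', '"
         schlKurz ";" schlKurzRID ";" schlLang "');")))

(def statement (line->sql sample-line))
;; statement is:
;; INSERT INTO MY-TABLE (FIELD1, FIELD2, FIELD3) VALUES ('101', '202', 'KURZ;RID7;Langer Text');
```

One thing to keep in mind with this kind of destructuring: if a line has fewer fields than names, the missing names are bound to nil, and str renders nil as an empty string rather than failing.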
I’m using Emacs for developing Clojure code like this, and within Emacs this code executed without problems. Unfortunately when running this code with
lein run def2sql
errors emerged, because all special characters were replaced by “?”. The question is: where exactly does this problem occur? Is it the reader? But it already knows that it has to use the “Windows-1252” encoding. Is it the output function println? I wrote the output into a file, but the results were the same, so it wasn’t println. Was it the shell and its environment variables? Probably yes. When users try to use special characters, they are often given the advice to use the additional option
:jvm-opts ["-Dfile.encoding=(codepage)"]
in their project.clj. So did I. I thought this would cover any cases of doubt, but it didn’t. When I finally removed this line from my project.clj, the output was right, including all special characters.
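One way to gather evidence in a situation like this is to look at what the running JVM actually uses as its defaults, and to make the output encoding explicit instead of relying on them. The following is only a sketch under assumptions, not a diagnosis of the problem above; the helper names charset-report and println-to-file and the temp file are hypothetical.

```clojure
(require '[clojure.java.io :as io])
(import 'java.io.File 'java.nio.charset.Charset)

(defn charset-report
  "Collects the encoding-related settings of the running JVM."
  []
  {:file-encoding   (System/getProperty "file.encoding")
   :default-charset (str (Charset/defaultCharset))})

(defn println-to-file
  "Prints s to file with println, but through a writer whose encoding
  is given explicitly instead of taken from the JVM default."
  [file charset s]
  (with-open [w (io/writer file :encoding charset)]
    (binding [*out* w]
      (println s))))

;; Compare this map between the REPL and a `lein run` session.
(def report (charset-report))

;; Write a string with umlauts as UTF-8 and read it back the same way.
(def out-file (File/createTempFile "del2sql-out" ".sql"))
(println-to-file out-file "UTF-8" "Gr\u00fc\u00dfe")
(def round-trip (slurp out-file :encoding "UTF-8"))
```

Printing report from inside Emacs and from lein run, with and without the :jvm-opts line, should at least show whether the two environments disagree about the default charset.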
So be careful with :jvm-opts in your project.clj. I haven’t figured out yet what exactly the problem was, but I’m keeping this behaviour in mind. Tell me if you know something about this or if you have / had similar problems.