Martin @ Blog

software development and life.

Encoding in Scala interpreter

One of the nice things about Scala is the availability of a command-line interpreter based on the REPL principle (read-evaluate-print loop). Last week, for a particular project, I wanted to generate a string containing a part of the Unicode character table.
Thanks to Scala’s concise syntax, this should not be very difficult:

(0x20AC until 0x20B6).foreach { x => print(x.toChar + " ") }

This example prints the characters from 0x20AC (the euro symbol) up to, but not including, 0x20B6 (an unknown symbol to me 🙂 ), since until excludes its upper bound.

However, the result I got on my system (Mac OS X 10.6.2 using Scala 2.8 nightly) was not really what I expected:

? ? ? ? ? ? ? ? ? ?


Yes, indeed, I only got a list of question marks. Several attempts to work around it (writing the output to a file, printing it using other conversions, etc.) didn’t solve the problem. I suspected it had something to do with the encoding, but I didn’t know enough about the subject to find the actual cause. I ended up posting a question on Stack Overflow:

I am not able to print unicode characters correctly. Of course a-z, A-Z, etc. are printed correctly, but for example € or ƒ is printed as a ?.

print(8364.toChar)
results in ? instead of €. Probably I’m doing something wrong. My terminal supports UTF-8 characters, and even when I pipe the output to a separate file and open it in a text editor, ? is displayed.

I got one answer, from someone who could not reproduce the problem:

Euro’s codepoint is 0x20AC (or in decimal 8364), and that appears to work for me (I’m on Linux, on a nightly of 2.8):

scala> print(0x20AC.toChar)
€

So, either there was an issue on Mac OS X or there was a bug in Scala. After a week or so, I still hadn’t found the cause of my problem (of course, my original problem, printing the string containing Unicode characters, had already been solved in a different way), so I decided to investigate a bit further. As is usually the case, the cause was pretty obvious: Scala uses the system property file.encoding to determine which encoding it should use. I posted the ‘solution’ on Stack Overflow:

The cause of the problem is the default encoding used by Mac OS X. When you start the `scala` interpreter, it will use the default encoding for the platform it runs on. On Mac OS X, this is MacRoman; on Windows it is probably CP1252. You can check this by typing the following command in the scala interpreter:

    scala> System.getProperty("file.encoding");
    res3: java.lang.String = MacRoman

According to the scala help text, it is possible to provide Java properties using the -D option. However, this does not work for me. I ended up setting the environment variable

    JAVA_OPTS="-Dfile.encoding=UTF-8"

After starting scala again, the previous command gives the following result:

    scala> System.getProperty("file.encoding")
    res0: java.lang.String = UTF-8

Now, printing special characters works as expected:

    print(0x20AC.toChar)
    €

So, it is not a bug in Scala, but an issue with default encodings. In my opinion, it would be better if UTF-8 were used by default on all platforms. While searching for whether this is being considered, I came across a discussion on the Scala mailing list about this issue. The first message proposes using UTF-8 by default on Mac OS X when file.encoding reports MacRoman, since UTF-8 is the default charset on Mac OS X (which leaves me wondering why file.encoding is set to MacRoman by default; probably this is inherited from Mac OS before version 10 was released?). I don’t think this proposal will be part of Scala 2.8, since Martin Odersky wrote that it is probably best to keep things as they are in Java (i.e. honor the file.encoding property).
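To illustrate where question marks like these can come from, here is a minimal sketch of what happens when a character is encoded with a charset that cannot represent it. It uses US-ASCII as a stand-in for a mismatched default encoding (that choice is my assumption for the demonstration, not the exact situation on my system), and the names ReplacementDemo and hexBytes are made up for this example.

```scala
import java.nio.charset.Charset

object ReplacementDemo {
  // Encode a string with the given charset and render the bytes as hex.
  def hexBytes(s: String, cs: Charset): String =
    s.getBytes(cs).map(b => "%02X".format(b & 0xff)).mkString(" ")

  def main(args: Array[String]): Unit = {
    val euro = 0x20AC.toChar.toString

    // UTF-8 can represent the euro symbol; it takes three bytes.
    println(hexBytes(euro, Charset.forName("UTF-8")))    // E2 82 AC

    // US-ASCII cannot, so the encoder substitutes '?' (byte 0x3F).
    println(hexBytes(euro, Charset.forName("US-ASCII"))) // 3F
  }
}
```

The substitution happens silently during encoding, so nothing is left of the original character by the time the bytes reach the terminal or a file.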

So the best way to prevent this issue is to set file.encoding to UTF-8 using the JAVA_OPTS environment variable, which is read by default on startup.
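If changing JAVA_OPTS is not an option, a possible alternative is to print through a stream that encodes as UTF-8 explicitly, independent of file.encoding. The following is only a sketch: Utf8Out and utf8Stream are names made up for this example, and it assumes the terminal itself expects UTF-8.

```scala
import java.io.{OutputStream, PrintStream}

object Utf8Out {
  // Wrap an output stream in a PrintStream that always encodes as UTF-8,
  // regardless of the file.encoding system property.
  def utf8Stream(target: OutputStream): PrintStream =
    new PrintStream(target, true, "UTF-8")

  def main(args: Array[String]): Unit = {
    val out = utf8Stream(System.out)
    // Within this block, print and println write to the UTF-8 stream.
    Console.withOut(out) {
      (0x20AC until 0x20B6).foreach { x => print(x.toChar.toString + " ") }
      println()
    }
  }
}
```

This only affects output written inside the Console.withOut block; everything else still follows file.encoding.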

