Default File Encoding Issues

Here's an interesting issue that North Americans using Eclipse probably won't run into: Eclipse uses a different default file encoding for Windows, Linux and Mac OS X.

On Windows the default file encoding is a Windows-specific "standard" called Cp1252 which is actually a modified version of the ISO-8859-1 file encoding standard.

On Linux the default text encoding for files in Eclipse is UTF-8 (Unicode). I would never have noticed this discrepancy if I didn't have variables with accented characters in them. I'm using these variables to create an enumeration class (pre-Java5) to represent a code standard for the most widely used languages in the world: ISO 639-2.

I originally created this source file in Eclipse on Windows and it used the Cp1252 standard. The file was then checked into a CVS repository.

When I checked out the files in Eclipse on Linux (screenshot) I got this Eclipse compiler error:

Syntax error on token "Invalid Character"

I initially thought that Eclipse or Linux was at fault but after discovering the encoding difference (thanks to some prodding from #eclipse-dev) I changed my Windows copy of Eclipse to use UTF-8 for all editors and now all is well.
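Here is a minimal sketch of that mismatch in Java. The identifier is a hypothetical stand-in (not one from my actual source file), and I'm assuming the JVM charset name windows-1252 for what Eclipse calls Cp1252. It writes the text as Cp1252 bytes and then reads them back as UTF-8, which is roughly what happened on the Linux side:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    // Encode text with one charset, then decode the resulting bytes with another,
    // the way a compiler would if it assumed the wrong file encoding.
    public static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // Hypothetical identifier with an accented character ("c" with cedilla),
        // written as a Unicode escape so this example file itself stays pure ASCII.
        String identifier = "fran\u00e7ais";

        String readCorrectly = roundTrip(identifier, cp1252, cp1252);
        String misread = roundTrip(identifier, cp1252, StandardCharsets.UTF_8);

        System.out.println("Same encoding on both sides preserved the text: "
                + readCorrectly.equals(identifier));
        System.out.println("Cp1252 bytes read as UTF-8 contain U+FFFD: "
                + misread.contains("\uFFFD"));
    }
}
```

The accented character is a single byte in Cp1252, and that byte is not a valid UTF-8 sequence on its own, so the decoder substitutes the replacement character U+FFFD. Presumably that is the "Invalid Character" the compiler was complaining about.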

As I mentioned above, the default file encoding for Eclipse on Mac OS X is also different: it uses a one-byte Apple-specific file encoding scheme called MacRoman.

Why isn't Unicode being used by default across the board? Wouldn't that make cross-platform development easier? Using different default file encodings for each platform somewhat nullifies Java's cross-platform nature, no?

True, the differences between these encodings are small, but they are enough to cause internationalization (I18N) problems down the line.

---

But that's not all. Changing Eclipse's file encoding to UTF-8 has some interesting consequences for my project's Ant build scripts. I had to specify that I'm using UTF-8 for the source files in the javac Ant task:

<javac compiler="javac1.4" encoding="UTF-8" debug="yes" srcdir="${src.dir}" destdir="${src-classes-debug.dir}">
   <classpath refid="compile-time.classpath"/>
</javac>

Otherwise the Ant javac task won't properly recognize the special characters and will fail.

Posted at July 13, 2005 at 11:19 AM EST
Last updated July 13, 2005 at 11:19 AM EST
Comments

You ask why Unicode is not the default. Isn't the answer obvious? Wherever possible Eclipse (like SWT) tries to look and feel like a native app. This necessarily includes saving files the way native apps do. On Windows and the Mac most native apps save files in legacy encodings (unfortunate, but mostly dictated by history). Unicode support on Windows is much better than it once was, however, so I suppose they could probably get away with saving the files in UTF-8. No doubt many legacy apps would trip over the byte-order marks, though (don’t recall if those are optional or not).

» Posted by: Rob McDougall at July 13, 2005 01:52 PM

That makes good sense, yes. But an encoding isn't "legacy" if it's still being used by applications to produce new files.

We're never going to get to Unicode if modern applications like Eclipse keep spitting out native file encodings by default. The rationale for continuing to use the older encodings will never disappear and existing applications will have no reason to update to Unicode.

We have to let go at some point. Operating system-specific legacy encodings (like Cp1252 and MacRoman) could be the exception rather than the default for Eclipse.

» Posted by: Ryan at July 14, 2005 07:48 AM

I should add that the different default file encodings make cross-platform Rich Client Platform (RCP) development more difficult, especially in areas of the world that use non-ASCII characters in code (i.e. everywhere but North America).

Everyone that works on a cross-platform RCP project must make sure that they explicitly switch to UTF-8. Pretty easy to do if you know about it, but a person could just as easily have no idea. File encoding doesn't seem to be a well-known issue.

If the Eclipse Foundation wants to encourage developers to make RCP applications and tout RCP/SWT/Eclipse's cross-platform and I18N abilities then issues like this need to be smoothed over. Then it gets increasingly easier to develop cross-platform RCP/SWT applications.

It may seem like a minor issue but solutions to problems like this make developers' lives easier and make the RCP more attractive as a development platform. Today cross-platform development of Java applications may seem too difficult.

When will considerations for the well-being of RCP/SWT development win over Eclipse itself? It will be interesting to see if there is any shift in priority over time.

» Posted by: Ryan at July 14, 2005 08:01 AM

Why can't coders just stick with ASCII characters? As far as the source code goes, it should be entirely sufficient. I don't really see why you would need accented variable names and function calls. Not only is it unnecessary, typing those accented characters on most keyboards is a real pain. As far as hard-coded strings in the code go, well, I think we are pretty much safe to say that you really shouldn't have hard-coded strings to begin with. Even though we all do.

Cross platform coding is hard enough when dealing with carriage returns on Mac, Unix, and Windows. Don't make it any more difficult by throwing accented characters into the mix.

» Posted by: Kibbee at July 14, 2005 08:37 AM

Are you saying that someone from India, China, Japan or the Middle East for example shouldn't be able to code in their native language?

It's bad enough that most of the documentation out there is in English, the tools shouldn't be yet another hurdle for people in these countries. We have the technology to solve that problem.

The fact that English dominates technology (and that most English people don't care) is surely widening the gap between the haves and the have-nots.

Saying "people should just use English" is fairly ignorant, sorry.

» Posted by: Ryan at July 14, 2005 08:44 AM

back in my day we only used the letters a-z and the numbers 1-10 and that's the way i likes it. i don't think that we should use capitals since that makes difficulty for me having to have to program and look for differences in case.

so what we should do is not allow capitalization or numbers greater than ten. remove zero while we are at too. that's double-plus good if i have even seen it.

» Posted by: Jim at July 14, 2005 11:40 AM

If you are programming something, then it had better all be in one language. Imagine trying to work on a project where half of it is in English, and half of it is in Hindi. Basically, with any project, you have to say, this is the language we are using, and stick to it. Otherwise, developers will all have to be multilingual. If you work somewhere where everyone speaks Hindi, then you would probably want to use Hindi. And then you would have to worry about this encoding fiasco. But if you are working on an English project with English coders, then please refrain from using accented characters that aren't on English keyboards, and causing everyone more stress than is necessary.

» Posted by: Kibbee at July 14, 2005 02:11 PM

If you were using Hindi you'd have to use Unicode on all platforms, which Eclipse doesn't do by default (only on Linux, Fedora Core 4).

That's what this blog post is all about.

» Posted by: Ryan at July 14, 2005 02:15 PM

Thanks for the tip. I work on a project where all the development workstations are Win 2K boxes. Being the oddball that I am, I take my Linux laptop and use it for development. All of the source files in the project have a copyright symbol that creates a warning under Linux. Having eighty-million warnings made my build much slower than it should have been. After reading your post, I added encoding="Cp1252" to the javac task of the ANT buildfile and those warnings went away.

David

» Posted by: David Heffelfinger at July 14, 2005 03:13 PM

Jim: "so what we should do is not allow capitalization or numbers greater than ten. remove zero while we are at too. that's double-plus good if i have even seen it."

I couldn't have put it better myself...

» Posted by: James at July 14, 2005 04:09 PM

Jim: that sounds like BASIC ;)

» Posted by: Ryan at July 14, 2005 04:16 PM

This is not about Eclipse using different encodings on different platforms. Eclipse by default uses the JVM default encoding, which by the way is dependent on the platform. You may want to ask: why doesn't the JVM enforce the same default encoding (e.g. UTF-8)? Well, because this would make it incompatible with other programs running on the same machine that use the OS default encoding.

About your problem with cross-platform development: whenever a project is being developed by people working on different platforms/locales, make sure you set the default encoding at the *project* level (not at the workspace level). Just open the project properties, go to the Info page and set the text file encoding to be the one you want. This setting is then shared with the project, which ensures everybody uses the same setting. This is supported since Eclipse 3.0.

» Posted by: Rafael at July 14, 2005 04:55 PM
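To see which default a particular JVM picks up, a small check along these lines should work. It is only a sketch: the class and method names are made up for illustration, and the printed values depend on the platform, locale and JDK version.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultEncodingCheck {
    // Name of the charset this JVM uses when no encoding is given explicitly.
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    // Encoding with an explicit charset gives the same bytes on every platform.
    public static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Both values vary from machine to machine, so no expected output is shown.
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("JVM default charset:    " + defaultCharsetName());

        String accented = "\u00e9"; // "e" with an acute accent, written as an escape

        // getBytes() with no argument uses the platform default, so the length may differ per machine.
        System.out.println("Bytes with default charset: " + accented.getBytes().length);
        System.out.println("Bytes with explicit UTF-8:  " + utf8Bytes(accented).length);
    }
}
```

Naming the charset explicitly in code plays the same role as the project-level setting in Eclipse or the encoding attribute on the javac Ant task: nothing is left to the platform default.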

Actually, I'd say that the problem comes from CVS, not Java or Eclipse.

Windows, Linux and Mac all have mechanisms to deal with their encoding, so for a particular process (and its locale settings), Java will pick up which one is the current encoding. Any program can or should do that, including CVS.

Since Java files are text files, they should be treated as such, which includes proper encoding to and from the encoding in use. Just like it should be for EOL differences.

And no, expressing all of your source code in ASCII is not a viable option, even if you are forced to use English for identifiers.
Comments, for instance, need the national characters.

Diatribe: English speaking people are lucky, you never get your name messed up like we do. And Fortune 100 companies do it, too.

UTF-8 is a good solution, but it will take a long time before programmers stop blindly widening UTF-8 into UTF-16 which happens all the time, especially on web sites, and particularly the US ones (this blog software gets it right, I'm happy to see)

» Posted by: Mr. Møller at July 15, 2005 01:23 PM

So Eclipse passes the buck to the JVM and the JVM passes the buck to the OS. I can see how this is an advantage for legacy support but it's at the expense of forward progress.

I know I'm an idealist. I also know that people will not change unless they are forced to do so. If you make it more of a pain to use legacy things than to move on (like to Unicode) then people will move on. If Eclipse used Unicode by default then people could still switch to legacy file encodings if they wanted to or had to.

Ultimately my idealist question is this: what's the plan for everyone in the world moving to Unicode? It should happen eventually, right? It seems like that was the whole purpose of Unicode.

The problem is, with applications like Eclipse continuing to produce files with legacy file encodings by default I don't see it happening any time soon.

» Posted by: Ryan at July 15, 2005 01:49 PM

The problem is that text files don't store in themselves what the encoding is. There is no header saying, "This is the encoding scheme used". You either have to know the encoding scheme, or guess based on a bunch of different things. Not only that, switching between encoding schemes is a pain if one encoding supports more characters than some other encoding. Even if we were all using UTF-8, we would all get pretty annoyed if someone coding on a project started using characters like "جذس١٥٤٣٩ﻹﻌﺦﭻỮ" and then we had to figure out how to type those characters on a US keyboard. Imagine trying to type Math.π instead of Math.PI. There's also a problem with fonts that don't support every single Unicode character. I'm not sure if there are any fonts that support them all. Ending up with half the code looking like square boxes because some asshat decided to use characters like ж in his variable names would be quite a pain.

Anyway, I do feel for the programmers who aren't English though. All the languages are in English as well as most of the documentation. I can't imagine how hard it is for someone who doesn't know english to program. Actually it would probably be easier for them to learn english than to try and struggle though and program without knowing english. Maybe we should make programming languages that are more language agnostic, so that instead of typing "public class circle", we type "dadlqe assjf circle", or even better "жоГתؤزφ τآаشΩ circle", to make sure that we use those unicode characters. If you don't speak any english then you could just as easily learn "assjf", or "τآаشΩ" as you could "class". They all have no meaning to you.

I think it's good that computers have standardized on a particular language. Imagine if HTML was in English, Java in French, and SQL in Chinese. I would have to know 3 languages to program a web application with a database back end. Good thing I only have to know one. People who know English are at an advantage to those who don't. But if you want to play the violin, you have to learn to read music. As far as comments go, write them in your native language if you feel so inclined. Shouldn't the compiler ignore comments anyway? Just strip out everything between /* and */, or from // to the end of the line.

» Posted by: Kibbee at July 18, 2005 10:30 PM