Unable to support non-UTF-8 characters in XML file generated with ColdFusion

When I run the following code on a server running ColdFusion 2018:

<cfsetting enablecfoutputonly="yes">

<cfxml variable="test">
    <cfoutput>
        <test>
            áéíóú
        </test>
    </cfoutput>
</cfxml>

<cfset testString = ToString(test)>

<cfset testStringISO = Replace(testString, "UTF-8", "iso-8859-1")>

<cffile action="write" file="#AbsoluteFilesPath#test.xml" output="#testStringISO#" charset="iso-8859-1">

Here, AbsoluteFilesPath is an absolute path to a location on the server. The method I’m using to change the encoding of the XML is the one found here. The test.xml file looks like this when I open it on the server in Notepad++:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
    αινσϊ
</test>

The encoding of the file shows as "ISO 8859-7".

Interestingly, opening the file in VSCode on my local machine shows this:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
    �����
</test>

Here, the encoding of the file shows as "UTF-8". Selecting the command "Reopen with encoding ISO 8859-1" within the editor shows the file as it should be:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>
    áéíóú
</test>

I’ve tested this code, replacing "iso-8859-1" with "utf-16", and the results are the same.

Why is the file encoding inconsistent and not what I expected? How can I ensure the file is created with the correct encoding?

Asked on July 16, 2020 in XML.
1 Answer

Let’s clarify something first: the encoding attribute in an XML file is just an indicator for the reader. It does not affect the bytes that are actually written to the file.

So let’s simplify the example code to a single character á:

[dump of á as binary: 2 bytes when decoded with UTF-8, 1 byte with ISO-8859-1]

UTF-8 stores the character as 2 bytes, ISO-8859-1 stores it as 1 byte. This is what we expect.
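To reproduce this check yourself, here is a minimal sketch (variable names are just for illustration): charsetDecode() returns the raw bytes of a string for a given encoding, and len() on a binary object returns the byte count.

<cfset utf8Bytes = charsetDecode("á", "UTF-8")>
<cfset isoBytes  = charsetDecode("á", "ISO-8859-1")>

<cfoutput>
    <!--- á is 2 bytes in UTF-8 (C3 A1) and 1 byte in ISO-8859-1 (E1) --->
    UTF-8: #len(utf8Bytes)# byte(s)<br>
    ISO-8859-1: #len(isoBytes)# byte(s)
</cfoutput>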

Example code

<cfsetting enablecfoutputonly="true">

<cfxml variable="test">
    <cfoutput><r>á</r></cfoutput>
</cfxml>

<cfset xmlForUTF = toString(test)>
<cfset xmlForISO = replace(xmlForUTF, 'encoding="UTF-8"', 'encoding="ISO-8859-1"')>

<cfset fileWrite(expandPath("UTF-8.xml"),      xmlForUTF, "UTF-8")>
<cfset fileWrite(expandPath("ISO-8859-1.xml"), xmlForISO, "ISO-8859-1")>

Resulting files

UTF-8.xml

[byte view of UTF-8.xml: á stored as 2 bytes]

ISO-8859-1.xml

[byte view of ISO-8859-1.xml: á stored as 1 byte]

This is exactly what we expected. Neither cfxml nor cffile/fileWrite is the issue. So why might you not get the same result when running the above code on your machine?

The problem: Page encoding

When ColdFusion parses template files (.cfm) and component files (.cfc), it uses the JVM’s default encoding, which, unless specified otherwise, is the system’s default encoding. This is also why different machines can produce different results with the above code.

If you have a literal such as á in a file, that character is encoded using whatever encoding you told your text editor to use. Let’s assume that’s UTF-8. If you inspect the file, you will see that the character is stored properly. However, when ColdFusion opens the file and parses the literal, it assumes the character is encoded with the system’s default encoding. And unfortunately, you seem to be running a system that doesn’t (or can’t) use UTF-8 as its system-wide codeset (Windows, for example).
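If you want to see which default your ColdFusion instance is actually using, here is a minimal sketch that reads the JVM’s standard file.encoding system property from a CFML page:

<!--- Prints the JVM default encoding that ColdFusion falls back to
      when a template has no BOM and no explicit page encoding. --->
<cfset systemClass = createObject("java", "java.lang.System")>
<cfoutput>file.encoding: #systemClass.getProperty("file.encoding")#</cfoutput>

On a Windows server this typically prints something like Cp1252, which is exactly why the literal á gets mangled.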

Solutions

An (ugly) way to solve it

Put a <cfprocessingdirective pageEncoding="utf-8"> tag at the top of every template that contains non-ASCII literals. It tells the parser which character encoding the file was saved with, so ColdFusion no longer falls back to the system default. The downside: you have to repeat it in every affected .cfm/.cfc file.
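Applied to the question’s template, a sketch (this assumes the file itself was actually saved as UTF-8; the pageEncoding value must match the editor’s encoding):

<!--- Must appear before any non-ASCII literal in this file. --->
<cfprocessingdirective pageEncoding="utf-8">
<cfsetting enablecfoutputonly="yes">

<cfxml variable="test">
    <cfoutput><test>áéíóú</test></cfoutput>
</cfxml>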

A (hacky) way to solve it

Save every file that ColdFusion’s parser touches (all .cfm/.cfc files) as UTF-8 with a BOM. When ColdFusion encounters these bytes at the start of a file, it is forced to use UTF-8, because that’s what the BOM implies.

A global way to solve it

Add -Dfile.encoding=UTF-8 to your ColdFusion JVM arguments. The parameter goes into /cfusion/bin/jvm.config, on the java.args= line.
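For illustration, the edited line would end up looking something like this (the existing arguments vary per installation; only the appended flag matters):

# /cfusion/bin/jvm.config
java.args=<your existing JVM arguments> -Dfile.encoding=UTF-8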

This requires a restart of ColdFusion to be picked up. All your files can then be saved as simple UTF-8 (without BOM) and it will work just fine.
