Unable to support non-UTF-8 characters in XML file generated with ColdFusion
When I run the following code on a server running ColdFusion 2018:
<cfsetting enablecfoutputonly="yes">
<cfxml variable="test">
    <cfoutput>
        <test>
            áéíóú
        </test>
    </cfoutput>
</cfxml>
<cfset testString = ToString(test)>
<cfset testStringISO = Replace(testString, "UTF-8", "iso-8859-1")>
<cffile action="write" file="#AbsoluteFilesPath#test.xml" output="#testStringISO#" charset="iso-8859-1">
Where AbsoluteFilesPath is an absolute path to a location on the server. The method I’m using to change the encoding of the XML is found here. The test.xml file looks like this when I open it on the server in Notepad++:
<?xml version="1.0" encoding="iso-8859-1"?> <test> αινσϊ </test>
The encoding of the file shows as "ISO 8859-7".
Interestingly, when I open the file with VSCode on my local machine, it shows up like this:
<?xml version="1.0" encoding="iso-8859-1"?> <test> ����� </test>
Here, the encoding of the file shows as "UTF-8". Selecting the command "Reopen with encoding ISO 8859-1" within the editor shows the file as it should be:
<?xml version="1.0" encoding="iso-8859-1"?> <test> áéíóú </test>
I’ve tested this code, replacing "iso-8859-1" with "utf-16", and the results are the same.
Why is the file encoding inconsistent and not what I expected? How can I ensure the file is created with the correct encoding?
Let’s clarify something first: the encoding attribute in an XML file is just an indicator for the reader. It does not affect the bytes written to the actual file.
So let’s simplify the example code to a single character, á:
UTF-8 stores á as 2 bytes (C3 A1), ISO-8859-1 stores it as 1 byte (E1). This is what we expect to see in the files.
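If you want to confirm those byte counts without writing any files, a quick sanity check with charsetDecode() works too (assuming the template itself is parsed correctly, which is exactly the point of the "Page encoding" section below):

<!--- charsetDecode() turns a string into the raw bytes of the given encoding;
      len() of a binary object returns its size in bytes. --->
<cfoutput>
    UTF-8: #len(charsetDecode("á", "utf-8"))# byte(s)
    ISO-8859-1: #len(charsetDecode("á", "iso-8859-1"))# byte(s)
</cfoutput>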
Example code
<cfsetting enablecfoutputonly="true">
<cfxml variable="test">
    <cfoutput><r>á</r></cfoutput>
</cfxml>
<cfset xmlForUTF = toString(test)>
<cfset xmlForISO = replace(xmlForUTF, 'encoding="UTF-8"', 'encoding="ISO-8859-1"')>
<cfset fileWrite(expandPath("UTF-8.xml"), xmlForUTF, "UTF-8")>
<cfset fileWrite(expandPath("ISO-8859-1.xml"), xmlForISO, "ISO-8859-1")>
Resulting files
UTF-8.xml: the character á is written as the two bytes C3 A1
ISO-8859-1.xml: the character á is written as the single byte E1
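If you don’t have a hex editor handy on the server, you can dump the raw bytes of both files straight from CFML; this is just a quick check against the two files written above:

<!--- Dump each file as hex: á shows up as C3A1 in UTF-8.xml and as E1 in ISO-8859-1.xml. --->
<cfoutput>
    UTF-8.xml: #binaryEncode(fileReadBinary(expandPath("UTF-8.xml")), "hex")#
    ISO-8859-1.xml: #binaryEncode(fileReadBinary(expandPath("ISO-8859-1.xml")), "hex")#
</cfoutput>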
This is exactly what we expected. Neither cfxml nor cffile/fileWrite is the issue. So why might you not get the same result with the above code on your own machine?
The problem: Page encoding
When ColdFusion parses template files (.cfm) and component files (.cfc), it uses the JVM’s default encoding, which, unless specified otherwise, is the system’s default encoding. This is also why different people can get different results with the above code.
If you have a literal such as á in a file, this character is encoded using whatever encoding you told your text editor to use. Let’s assume that’s UTF-8. If you inspect the file, you will see that the character is properly stored. However, when ColdFusion opens the file and parses the literal, it assumes the character is encoded with the system’s default encoding. And unfortunately, you seem to be running a system that doesn’t, or can’t, use UTF-8 as the system-wide codeset (Windows, for example).
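To see which encoding your ColdFusion server actually falls back to, you can read the JVM’s file.encoding system property from a scratch template. This is a small diagnostic sketch, not part of the fix:

<!--- Prints the JVM default encoding ColdFusion uses when parsing .cfm/.cfc files. --->
<cfset sys = createObject("java", "java.lang.System")>
<cfoutput>JVM default encoding: #sys.getProperty("file.encoding")#</cfoutput>

On a Western-locale Windows server you will typically see something like Cp1252 here instead of UTF-8.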
Solutions
A (hacky) way to solve it
Save every file that ColdFusion touches with its parser (all .cfm/.cfc files) as UTF-8 with BOM. When ColdFusion encounters these bytes at the start of a file, it is forced to use UTF-8, because that’s what the BOM implies.
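To check whether a given template already starts with the UTF-8 BOM (the three bytes EF BB BF), something like the following sketch works; index.cfm is just a placeholder name:

<!--- Read the file's raw bytes and compare the first three against the UTF-8 BOM (EF BB BF). --->
<cfset hexBytes = binaryEncode(fileReadBinary(expandPath("index.cfm")), "hex")>
<cfif left(hexBytes, 6) EQ "EFBBBF">
    <cfoutput>index.cfm starts with a UTF-8 BOM</cfoutput>
<cfelse>
    <cfoutput>index.cfm does not start with a UTF-8 BOM</cfoutput>
</cfif>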
A global way to solve it
Add -Dfile.encoding=UTF-8 to your ColdFusion JVM arguments. The parameter can be added in /cfusion/bin/jvm.config (on the java.args= line).
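For illustration, the java.args= line then looks roughly like this; the existing arguments on your installation will differ, only the -Dfile.encoding=UTF-8 part is added:

# /cfusion/bin/jvm.config (excerpt)
java.args=<your existing arguments> -Dfile.encoding=UTF-8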
ColdFusion has to be restarted for the change to be picked up. All your files can then be saved as plain UTF-8 (without BOM) and everything will work just fine.