I often work in PowerShell, and one day I needed to create a script that would pull the file encoding out of a file.
However, this proved to be difficult since most encodings don’t require a BOM (Byte Order Mark). Here’s some good information that I found on the subject:
Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.
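To make the BOM check concrete, here is a minimal sketch of how it could look in PowerShell. The function name `Get-BomEncodingName` and the encoding labels it returns are my own placeholders, not part of any built-in cmdlet; the byte signatures are the standard Unicode BOMs.

```powershell
# Hypothetical helper: maps the leading bytes of a buffer to an encoding label.
# The labels are illustrative strings, not .NET encoding objects.
function Get-BomEncodingName {
    param([byte[]]$Bytes)

    # Longer signatures come first, because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    $signatures = @(
        @{ Name = 'utf-32LE'; Bom = [byte[]](0xFF, 0xFE, 0x00, 0x00) },
        @{ Name = 'utf-32BE'; Bom = [byte[]](0x00, 0x00, 0xFE, 0xFF) },
        @{ Name = 'utf-8';    Bom = [byte[]](0xEF, 0xBB, 0xBF) },
        @{ Name = 'utf-16LE'; Bom = [byte[]](0xFF, 0xFE) },
        @{ Name = 'utf-16BE'; Bom = [byte[]](0xFE, 0xFF) }
    )

    foreach ($sig in $signatures) {
        $bom = $sig.Bom
        # Skip signatures that are longer than the data itself.
        if ($Bytes.Length -lt $bom.Length) { continue }

        $isMatch = $true
        for ($i = 0; $i -lt $bom.Length; $i++) {
            if ($Bytes[$i] -ne $bom[$i]) { $isMatch = $false; break }
        }
        if ($isMatch) { return $sig.Name }
    }

    return $null   # no recognised BOM
}
```

You could feed it the output of `[System.IO.File]::ReadAllBytes($path)`, where `$path` is whatever file you are inspecting. One caveat: a UTF-16 LE file whose first character is U+0000 starts with the same four bytes as the UTF-32 LE BOM, so this sketch would report it as UTF-32 LE.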
However, the problem remains, how do you automatically detect the correct encoding when there is no BOM? Technically it’s recommended that you don’t place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it’s certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it’s probably safe to assume that if no BOM is present, then UTF-8 will suffice. However, if any of the files happen to use something else, without a BOM, then that won’t work.
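One common heuristic for the no-BOM case, which is not from the source quoted above, is to check whether the bytes decode as strict UTF-8 before assuming anything else. The sketch below assumes a hypothetical function name, `Test-ValidUtf8`, and relies on the .NET `UTF8Encoding` constructor whose second argument makes invalid byte sequences throw.

```powershell
# Hypothetical helper: returns $true when the bytes form valid UTF-8.
function Test-ValidUtf8 {
    param([byte[]]$Bytes)

    # First argument: do not emit a BOM when encoding.
    # Second argument: throw on invalid byte sequences instead of
    # silently substituting the replacement character.
    $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)

    try {
        [void]$strictUtf8.GetString($Bytes)
        return $true
    }
    catch {
        # Decoding failed, so the data is not well-formed UTF-8.
        return $false
    }
}
```

Treat the result as a hint, not proof. Pure ASCII text passes this test but is equally valid in most ANSI code pages, and if you only pass in the first part of a file, a multi-byte character cut off at the end of the buffer will make the test fail.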
I came across some code on a PowerShell sharing site, POSHCode.org, that inspired me to do things a different way. So, I made the amendments there as well. Unfortunately, since I wrote this post, it appears that POSHCode has gone down for the count: