Removing Â from multiple files
-
Hello,
I have multiple files that contain the character Â.
I can only see them in HEX mode, but need to remove them in bulk.
Could anyone assist?
-
I can only see them in HEX mode
Since you can only see them in HEX mode, that means Notepad++ is doing the right thing, and you don’t fully comprehend the meaning of what you are seeing in the HEX. Let me explain:
You don’t have the character Â; that’s just the first byte in a two-byte representation of certain characters. You have a properly-formed UTF-8 document, and when Notepad++ reads it natively as text, it interprets those two adjacent bytes as a single character. Hence, in text mode, Notepad++ shows just the single character that it correctly decoded from those two bytes.

For the UTF-8 encoding, the Unicode codepoints U+0080 through U+00BF are represented by the byte pairs 0xC2 0x80 through 0xC2 0xBF. You’ll notice that the second byte in the two-byte sequence is the same byte value as the Unicode codepoint of the actual character. Thus, if you look at any character from U+0080 through U+00BF in a HEX editor, which shows you the hex of the bytes and then interprets those bytes as ANSI characters in the panel on the right, you will see Â followed by the character you hoped would be there.

For example, the cent character ¢ is U+00A2, which is represented in UTF-8 by the bytes 0xC2 0xA2; when a HEX editor puts the ANSI representations to the right, it shows those two bytes as Â¢
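The two-byte relationship described above is easy to verify; here is a minimal sketch in Python (used purely for illustration):

```python
# The cent sign U+00A2 is encoded in UTF-8 as the two bytes 0xC2 0xA2.
cent = "\u00a2"                       # "¢"
utf8_bytes = cent.encode("utf-8")
print(utf8_bytes)                     # b'\xc2\xa2'

# An ANSI-style view decodes one byte per character, so the same two
# bytes come out as two characters -- the "extra" Â is just byte 0xC2:
print(utf8_bytes.decode("latin-1"))   # Â¢

# And if the 0xC2 byte were deleted, the lone continuation byte would
# no longer be valid UTF-8:
try:
    b"\xa2".decode("utf-8")
except UnicodeDecodeError:
    print("lone 0xA2 is malformed UTF-8")
```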
, thus tricking you into thinking there is an extra character there. There isn’t. There is one character represented by two bytes.

(You actually have the opposite problem of a lot of people: many people have malformed UTF-8 documents which Notepad++ mis-interprets as using an ANSI codepage, so in those files Notepad++ shows the two bytes of that single UTF-8 character as Â¢. Yours is properly formed, so Notepad++ correctly interprets those two bytes together as representing ¢, and properly shows only the ¢ character. So congratulations on having a good file.)

If you were to somehow trick Notepad++ into deleting those Â bytes, then you could maybe delete those bytes, but afterwards Notepad++, and any other application that believes it is reading a UTF-8 file, would see the lone 0x80–0xBF bytes as malformed, broken UTF-8, and it would complain to you about a bad file encoding, improper UTF-8, or similar. If I helped you do this, I would be helping you to break your UTF-8 encoding.

That said, if you are trying to convert a UTF-8 encoded file into a 256-codepoint ANSI character-set encoding, then you can do that on each file inside Notepad++: after loading the UTF-8 file (and seeing it say
UTF-8
in the lower-right of the Notepad++ status bar), you can go to Encoding > Convert to ANSI to have Notepad++ convert from UTF-8 to ANSI. As long as all the non-ASCII Unicode characters in your UTF-8 file also exist in your default ANSI codepage (Windows-1252 is usually the default in US installations of Windows), the file will look the same; once you save it and look at it in any HEX editor, the HEX editor will show only one byte per character, because that’s all that ANSI character sets use. However, any Unicode characters in your original file that aren’t in the active codepage will be converted irreversibly into ?, so be forewarned that it’s an incredibly bad idea to do that conversion without a good understanding of which characters are in your file and which characters are in your default codepage.

But the best advice: just leave the UTF-8 file as it is. It’s the good, international, modern standard for text interchange, whereas the 256-character sets of the various ANSI codepages were a tolerable workaround in the 1980s but are completely insufficient in the 2020s, and no modern tool should be forcing you into codepages instead of allowing UTF-8 encodings.
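Since the question was about handling files in bulk: if, after weighing the caveat above, you still want the ANSI conversion for many files, here is a minimal Python sketch (the `*.txt` pattern is a hypothetical placeholder). It behaves like Notepad++’s Encoding > Convert to ANSI with Windows-1252 as the target codepage, with unmappable characters becoming ?:

```python
# Bulk-convert UTF-8 text files to the Windows-1252 ANSI codepage.
# Characters with no cp1252 equivalent become "?" -- irreversibly,
# which is exactly the caveat described above.
import glob

# Per-string demonstration: cent maps to 0xA2, Greek omega has no
# cp1252 equivalent and is replaced by "?":
print("\u00a2 and \u03a9".encode("cp1252", errors="replace"))  # b'\xa2 and ?'

for path in glob.glob("*.txt"):        # hypothetical file pattern
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="cp1252", errors="replace") as f:
        f.write(text)
```

Note that `errors="replace"` is what silently turns lost characters into ?; omit it if you would rather the script stop with an error on the first unmappable character.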