Tuesday, May 14, 2013

What if, import file (txt/csv) having "BOM-ed UTF-8" encoding?

So what is "UTF-8 BOM" mean ? its byte order mark for UTF-8, some bytes (0xEF,0xBB,0xBF) are added at the start of the file to indicate that the file having unicode characters in it. BOM Characters "9".

As per Unicode documentation, the presence of BOM in file are useless, because it causes problems with non-BOM-aware software's to identify or parse the leading characters having at the start. Same has been quoted at the bottom of the Wikipedia page:

Related errors in PostgreSQL:
ERROR: invalid input syntax for integer: "9" (in psql-client)
SQL state: 22P02 (in PgAdmin-III)

Test case & fix on Windows:
Sample file "state_data.txt" created in NOTEPAD with unicode characters in it:
9,Karnataka,कर्नाटक
10,Kerala,केरळा
Table to import data:
create table states(state_code int, state_name char(30), state_in_hindi text);
Error:
postgres=# copy test from 'c:/Pgfile/state_data.txt' with delimiter ',' CSV;
ERROR:  invalid input syntax for integer: "9"
CONTEXT:  COPY test, line 1, column state_code: "9"
To fix, I have used a tool "bomremover.exe" to remove leading characters from a file as its on windows, if its on linux, then there are many tips & tricks available on net to wipe BOM from a utf-8 format file.

Tool Download link and usage:
http://www.mannaz.at/codebase/utf-byte-order-mark-bom-remover/

Eg:-
C:\Pgfile>bomremover.exe . *
Added '.\state_data.txt' to processing list.
Press enter to process all files in the list. (1 files in total)

Processing file '.\state_data.txt'...
Finished. Press Enter to Exit

After running bomremover.exe on file, re-run COPY command which will succeed to import data.
 state_code | state_name | State_name_in_hindi
------------+------------+---------------------
          9 | Karnataka  | αñòαñ░αÑ<8d>αñ¿αñ╛αñƒαñò
         10 | Kerala     | केरळा
(2 rows)

Some of the editors, avoids default saving text with UTF8-BOM:
- Windows - Notepad++ (In Notepade default BOM enabled)
- Linux - VI
- Mac - TextEdit


--Raghav

1 comment :

Post a Comment