Handling Unicode sequences in PostgreSQL

Question

\u0000 (the NUL character) is the one Unicode code point that is not valid in a PostgreSQL string. I see no way around it other than sanitizing the string.

Since JSON is just a string in a specific format, you can apply the standard string functions to its text form without worrying about the JSON structure. A one-line sanitizer that removes the escape sequence would be:

SELECT (regexp_replace(the_string::text, '\\u0000', '', 'g'))::json;

Instead of removing it, you can also replace it with any character you like, which is useful if the zero code point serves as some kind of delimiter.
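As a minimal sketch of that variant, the query below swaps the escape for a visible delimiter instead of deleting it. The sample document and the '|' character are made up for illustration, and the query assumes standard_conforming_strings is on (the default in current PostgreSQL versions), so backslashes in the literals are taken as written.

```sql
-- Assumption: standard_conforming_strings = on, so '\u0000' inside a
-- quoted literal is six ordinary characters, not an escape.
-- The sample JSON and the '|' delimiter are hypothetical.
SELECT regexp_replace(
         '{"fields": "first\u0000second"}',  -- JSON text containing the escape
         '\\u0000',                          -- regex: a literal backslash followed by u0000
         '|',                                -- replacement character
         'g'                                 -- replace every occurrence, not just the first
       )::json;
```

With this input the string value should come out as first|second. One caveat: the pattern does not understand JSON escaping, so it would also rewrite a string that contains an escaped backslash followed by the plain text u0000.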

Note also the subtle difference between what is stored in the database and how it is presented to the user. You can store the \u0000 escape sequence in a JSON string, but you have to replace it with some other character before processing the value as a json data type.
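The difference between storing and processing can be sketched roughly as follows. This is based on my understanding of recent PostgreSQL versions, where the json type keeps the input text verbatim while jsonb and text-returning operators have to decode the string; the sample value is made up, and standard_conforming_strings is again assumed to be on.

```sql
-- Assumption: standard_conforming_strings = on; the sample value is hypothetical.

-- Accepted: json keeps a copy of the input text, escape sequence included.
SELECT '{"a": "x\u0000y"}'::json;

-- Expected to raise an error, because the string would have to be
-- decoded into a text value (left commented out on purpose):
-- SELECT '{"a": "x\u0000y"}'::jsonb;
-- SELECT '{"a": "x\u0000y"}'::json ->> 'a';

-- Works once the escape has been removed from the text form first.
SELECT regexp_replace('{"a": "x\u0000y"}', '\\u0000', '', 'g')::jsonb ->> 'a';
```

So the sanitizing step belongs between reading the stored text and casting or querying it as JSON.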
