How to fix "invalid byte sequence" error in td-agent2 / td-agent3 (Fluentd)
"invalid byte sequence" error
I have been using Fluentd (td-agent2 and td-agent3) in my production system. It was working fine, but when I looked at the td-agent.log file, I found the following error messages. These are some of the ones I got:
"ERROR: invalid byte sequence for encoding \"UTF8\": 0xc4 0x42\n"
"ERROR: invalid byte sequence for encoding \"UTF8\": 0x89\n"
"ERROR: invalid byte sequence for encoding \"UTF8\": 0xf2 0xe4 0xdc 0xb8\n"
"ERROR: invalid byte sequence for encoding \"UTF8\": 0xaf\n"
"string contains null byte"
The root cause
After searching for the error messages, I learned that Fluentd (td-agent2 and td-agent3) handles strings as ASCII-8BIT, that is, as raw bytes. I use a PostgreSQL database to store all data, and the database was created with UTF-8 encoding.
When Fluentd builds a SQL statement, the data is treated as UTF-8, but my data contains byte sequences that are not valid in the UTF-8 encoding. When I looked into the data, some records contained emoji, so I looked for a way to ignore or replace the byte sequences that UTF-8 does not accept.
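The mismatch described above can be reproduced in plain Ruby. This is only a minimal sketch: the sample string is made up, and only the bytes 0xc4 0x42 are taken from the first error message.

```ruby
# Sample data is hypothetical; 0xC4 0x42 are the bytes from the error message.
raw = "caf\xC4B".b   # .b returns a copy tagged as binary (ASCII-8BIT)

# A binary string accepts any byte, so nothing looks wrong at this point.
binary_ok = raw.valid_encoding?

# Relabel the same bytes as UTF-8 without changing them.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

# 0xC4 starts a two-byte sequence and must be followed by a byte in
# 0x80..0xBF, but 0x42 ("B") is not, so the string is not valid UTF-8.
utf8_ok = as_utf8.valid_encoding?

puts "valid as binary: #{binary_ok}"
puts "valid as UTF-8:  #{utf8_ok}"
```

A UTF-8 database rejects such a string for the same reason that valid_encoding? returns false here.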
How to fix the issue?
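Ruby provides the String#scrub method (available since Ruby 2.1) for this purpose. According to the Ruby documentation: if the string contains an invalid byte sequence, scrub replaces the invalid bytes with the given replacement character; otherwise it returns the string unchanged. If a block is given, the invalid bytes are replaced with the return value of the block.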
"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"
Install fluent-plugin-string-scrub
With a match directive, the scrubbed records are re-emitted under a new tag:
<match **>
  type string_scrub
  tag scrubbed.string
  replace_char ?
</match>
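With a filter directive (Fluentd v0.12 or later), the records are scrubbed in place and keep their tag:
<filter **>
  type string_scrub
  replace_char ?
</filter>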
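Conceptually, a scrub filter like this walks each record and applies String#scrub to every string it finds. The sketch below only illustrates that idea: scrub_record is a hypothetical function, the sample record is made up, and the plugin's real implementation may differ.

```ruby
# Hypothetical illustration; not the plugin's actual code.
def scrub_record(value, replace_char = "?")
  case value
  when String
    # Treat the bytes as UTF-8 and replace whatever is not valid.
    value.dup.force_encoding(Encoding::UTF_8).scrub(replace_char)
  when Hash
    # Scrub values recursively; keys are left as they are.
    value.each_with_object({}) { |(k, v), out| out[k] = scrub_record(v, replace_char) }
  when Array
    value.map { |v| scrub_record(v, replace_char) }
  else
    # Numbers, booleans, nil and so on pass through untouched.
    value
  end
end

# 0x89 and 0xAF are bytes from the error messages above.
record = { "message" => "abc\x89".b, "count" => 3, "tags" => ["a\xAFb".b] }
scrubbed = scrub_record(record)

puts scrubbed["message"]
puts scrubbed["tags"].first
```

The second argument plays the same role as replace_char in the configuration above.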