How to fix "invalid byte sequence" error in td-agent2 / td-agent3 (Fluentd)

"invalid byte sequence" error

I have been running Fluentd (td-agent2 and td-agent3) in my production system. It appeared to be working fine, but when I checked the td-agent.log file, I found error messages like the following. These are only some of the messages I got.

"ERROR: invalid byte sequence for encoding \"UTF8\": 0xc4 0x42\n"

"ERROR: invalid byte sequence for encoding \"UTF8\": 0x89\n"

"ERROR: invalid byte sequence for encoding \"UTF8\": 0xf2 0xe4 0xdc 0xb8\n"

"ERROR: invalid byte sequence for encoding \"UTF8\": 0xaf\n"

"string contains null byte"

The root cause

While googling the error messages, I learned that Fluentd (td-agent2 and td-agent3) handles strings as ASCII-8BIT. I use a PostgreSQL database to store all the data, and the database was created with UTF-8 encoding.

When Fluentd builds a SQL statement, the data is converted to UTF-8, but my data contained some byte sequences that are not valid in the UTF-8 character encoding. When I looked into the data, some records contained emoji, so I looked for a way to ignore or replace the invalid bytes.
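The mismatch can be reproduced in plain Ruby. The following is a minimal sketch, not Fluentd's own code: `valid_utf8?` is a hypothetical helper introduced for illustration, and the sample bytes are taken from the error messages above.

```ruby
# Hypothetical helper (not part of Fluentd): reports whether a run of raw
# bytes would be valid if it were interpreted as UTF-8.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# String#b returns a copy tagged as ASCII-8BIT (binary).
raw = "\xC4\x42".b

# A binary string accepts any byte, so nothing looks wrong at this stage.
puts raw.encoding == Encoding::ASCII_8BIT
puts raw.valid_encoding?

# Interpreted as UTF-8, 0xC4 starts a two-byte sequence, but 0x42 ("B")
# is not a continuation byte, so the sequence is invalid.
puts valid_utf8?(raw)

# A lone continuation byte such as 0x89 is invalid as well.
puts valid_utf8?("\x89".b)

# A correctly encoded emoji is valid UTF-8; only broken or truncated
# bytes fail this check.
puts valid_utf8?("\u{1F600}".b)
```

This suggests why the problem only surfaces at the database: the bytes pass through as binary without complaint, and the check fails once something insists on valid UTF-8.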

How to fix the issue?

To fix the issue, I looked into the Ruby documentation and found the String#scrub method, which was exactly what I was looking for. According to the Ruby documentation, the method does the following.

If the string is invalid byte sequence then replace invalid bytes with given replacement character, else returns self. If block is given, replace invalid bytes with returned value of the block.

"abc\u3042\x81".scrub #=> "abc\u3042\uFFFD"
"abc\u3042\x81".scrub("*") #=> "abc\u3042*"
"abc\u3042\xE3\x80".scrub{|bytes| '<'+bytes.unpack('H*')[0]+'>' } #=> "abc\u3042<e380>"

Install fluent-plugin-string-scrub

To use the String#scrub method in Fluentd, we can use the following plugin.

fluent-plugin-string-scrub

To install the plugin, run the following command.

# td-agent-gem install fluent-plugin-string-scrub

Once it is installed, we can use the plugin in a "match" or "filter" section as follows.

<match **>
  @type string_scrub
  tag scrubbed.string
  replace_char ?
</match>

<filter **>
  @type string_scrub
  replace_char ?
</filter>
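To get a feel for what scrubbing does to a record, here is a rough sketch in plain Ruby. It is not the plugin's actual implementation: `scrub_record` is a hypothetical helper, and the record layout is made up for illustration. It assumes every string in the record is meant to be UTF-8.

```ruby
# Hypothetical helper, not the plugin's actual code: walks a record and
# replaces bytes that are invalid in UTF-8 with a replacement string.
def scrub_record(value, replacement = "?")
  case value
  when String
    # Binary (ASCII-8BIT) strings accept any byte, so scrub alone would
    # change nothing; tag a copy as UTF-8 first, then replace what is invalid.
    value.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
  when Hash
    value.each_with_object({}) do |(key, val), out|
      out[scrub_record(key, replacement)] = scrub_record(val, replacement)
    end
  when Array
    value.map { |item| scrub_record(item, replacement) }
  else
    value # numbers, nil, booleans and symbols pass through untouched
  end
end

# Made-up record containing the kind of bytes seen in the error messages.
record = { "message" => "ok\xC4\x42".b, "tags" => ["a\x89b".b], "count" => 3 }
p scrub_record(record)
```

With the default replacement, the "message" value becomes "ok?B" and the tag becomes "a?b", which mirrors the effect of `replace_char ?` in the configuration above.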
