Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don’t need to worry about every new emoji being added.
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");
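So:
[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]
is a character class that matches all letter (\\p{L}), mark (\\p{M}), number (\\p{N}), punctuation (\\p{P}), separator (\\p{Z}), format (\\p{Cf}), surrogate (\\p{Cs}), and whitespace (\\s, which includes tabs and newlines) characters. \\p{L} is not limited to Latin letters: it also covers other scripts such as Cyrillic, kana, and kanji. Java's regex engine matches by code point, so an emoji above U+FFFF is seen as one code point, typically in a symbol category (\\p{S}), rather than as two surrogates; \\p{Cs} therefore only keeps unpaired surrogates.
The ^ at the start of the character class negates it, so replaceAll removes every character that is not on the whitelist.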
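To see which characters the whitelist keeps, it can help to test single code points against the non-negated class. The following is only a sketch: the class name CategoryCheck and the method isKept are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

public class CategoryCheck {
    // Same whitelist as above, without the leading ^, so a match means "kept".
    private static final Pattern KEPT =
            Pattern.compile("[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: true if this single code point would survive the filter.
    public static boolean isKept(int codePoint) {
        return KEPT.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(isKept('a'));      // letter (L)
        System.out.println(isKept('_'));      // connector punctuation (P)
        System.out.println(isKept('\n'));     // whitespace, matched by \s
        System.out.println(isKept('$'));      // currency symbol (Sc), not on the whitelist
        System.out.println(isKept(0x1F525));  // fire emoji, other symbol (So)
    }
}
```

The first three calls print true and the last two print false. Note that the symbol category is excluded as a whole, so characters such as $ are filtered out together with the emoji.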
Example:
String str = "hello world _# 皆さん、こんにちは! 私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]",""));
// Output:
// "hello world _# 皆さん、こんにちは! 私はジョンと申します。"
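If the filter is used in many places, one option is to compile the pattern once and wrap it in a small helper. This is a minimal sketch under that assumption; EmojiFilter and strip are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class EmojiFilter {
    // Compiled once and reused; String.replaceAll compiles the regex on every call.
    private static final Pattern NOT_WHITELISTED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    // Hypothetical helper: removes every code point that is outside the whitelist.
    public static String strip(String input) {
        return NOT_WHITELISTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \uD83D\uDD25 is the fire emoji, written as an escape to avoid source-encoding issues.
        System.out.println(strip("hello world \uD83D\uDD25"));
        // Symbols are not whitelisted, so '$' and '+' are dropped as well.
        System.out.println(strip("Price: $5 + tax"));
    }
}
```

The second call shows a side effect worth checking against your data: currency and math symbols are removed too. If you need them, adding \\p{Sc} and \\p{Sm} to the whitelist may be enough, though that is worth verifying against the emoji you expect to see.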
If you need more information, check out the Java documentation for java.util.regex.Pattern, which describes the supported Unicode categories.