This is how I managed to do it with java:
FileInputStream input;
String result = null;
try {
input = new FileInputStream(new File("invalid.txt"));
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
InputStreamReader reader = new InputStreamReader(input, decoder);
BufferedReader bufferedReader = new BufferedReader( reader );
StringBuilder sb = new StringBuilder();
String line = bufferedReader.readLine();
while( line != null ) {
sb.append( line );
line = bufferedReader.readLine();
}
bufferedReader.close();
result = sb.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch( IOException e ) {
e.printStackTrace();
}
System.out.println(result);
The invalid file is created with bytes:
0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94
Which is hellö wörld
in UTF-8 with 4 invalid bytes mixed in.
With .REPLACE
you see the standard unicode replacement character being used:
//"h�ellö� wö�rld�"
With .IGNORE
, you see the invalid bytes ignored:
//"hellö wörld"
Without specifying .onMalformedInput
, you get
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
at sun.nio.cs.StreamDecoder.read(Unknown Source)
at java.io.InputStreamReader.read(Unknown Source)
at java.io.BufferedReader.fill(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)