Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead: return ''.join([i if ord(i) < 128 else ' ' for i in text]) This handles characters one by one and would still use one space per character replaced. Your regular expression should just replace consecutive non-ASCII characters with a space: …
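The two approaches described above can be compared side by side. This is only a minimal sketch: the function names and the regular expression pattern are assumptions chosen for illustration (the excerpt cuts off before showing the original pattern), and "non-ASCII" is taken to mean any code point of 128 or above.

```python
import re


def replace_each(text):
    # Conditional expression: every non-ASCII character becomes its own space.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


# Assumed pattern: one or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r'[^\x00-\x7F]+')


def replace_runs(text):
    # Regular expression: a whole run of non-ASCII characters becomes one space.
    return NON_ASCII_RUN.sub(' ', text)


sample = 'ab\u00e9\u00e9cd'  # 'ab' + two non-ASCII characters + 'cd'
print(repr(replace_each(sample)))  # 'ab  cd' (two spaces)
print(repr(replace_runs(sample)))  # 'ab cd'  (one space)
```

The difference only shows up when non-ASCII characters are adjacent; for isolated characters both functions return the same string.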

Unicode, UTF, ASCII, ANSI format differences

Going down your list: “Unicode” isn’t an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to …
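To make the character set versus encoding distinction concrete, here is a small sketch in Python. The sample character and variable names are arbitrary choices for illustration; the point is that one Unicode code point yields different byte sequences depending on the encoding.

```python
text = '\u00e9'  # the single character é, Unicode code point U+00E9

utf8_bytes = text.encode('utf-8')       # two bytes in UTF-8
utf16_bytes = text.encode('utf-16-le')  # two bytes in little-endian UTF-16, no BOM

print(hex(ord(text)))  # 0xe9
print(utf8_bytes)      # b'\xc3\xa9'
print(utf16_bytes)     # b'\xe9\x00'

# Decoding each byte sequence with its own encoding recovers the same character.
print(utf8_bytes.decode('utf-8') == utf16_bytes.decode('utf-16-le'))  # True
```

The code point is the same in both cases; only the byte representation changes, which is why calling a file simply "Unicode" does not say how to read it.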

Reading a plain text file in Java

My favorite way to read a small file is to use a BufferedReader and a StringBuilder. It is very simple and to the point (not particularly efficient, but good enough for most cases): BufferedReader br = new BufferedReader(new FileReader("file.txt")); try { StringBuilder sb = new StringBuilder(); String line = br.readLine(); while (line != null) …