Why does String.hashCode() in Java have many conflicts? [closed]

Question

I just hashed 58 thousand English language words (found here), both all-lowercase and also with the first letter capitalized . Know how many collided? Two: “Siblings” and “Teheran” (an alternate spelling of “Tehran”).

Just like you, I took a subdomain (in my case a likely one) of possible strings and analyzed the hashCode collision rate for it, and found it to be exemplary. Who is to say that your arbitrary subdomain of possible strings is a better choice to optimize for than mine?

The people that wrote this class had to do so knowing that they couldn’t predict (nor therefore optimize) the subdomain in which their users would be using Strings as keys. So they chose a hash function which distributes evenly over the entire domain of strings.

If you’re interested, here’s my code:

Map<Integer, List<String>> collisions = Files.lines(Paths.get(System.getProperty("user.home")+ "/corncob_lowercase.txt"))
    .flatMap(word -> Stream.of(word, word.substring(0, 1).toUpperCase() + word.substring(1)))
    .collect(Collectors.groupingBy(String::hashCode))
    .entrySet()
    .stream()
    .filter(e -> e.getValue().size() > 1)
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

System.out.printf("Number of collisions: %d%n", collisions.size());
collisions.forEach((hash, words) -> System.out.printf("%d: %s%n", hash, words));

Edit

By the way, if you’re curious the same test with your hash function had 13 collisions compared to String.hashCode‘s 1.

Edit

Leave a Comment Cancel reply