Why are most string manipulations in Java based on regexp?

Question

Note that the regex need not be recompiled each time. From the Javadoc:

An invocation of this method of the form str.split(regex, n) yields the same result as the expression

Pattern.compile(regex).split(str, n)

That is, if you are worried about performance, you may precompile the pattern and then reuse it:

Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1); 
String[] tokens2 = p.split(str2); 
...

instead of

String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...

I believe that the main reason for this API design is convenience. Since regular expressions include all “fixed” strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.

My feeling (which I can’t back with any statistical evidence) is that most of the cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO rare are the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense.

It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.

Leave a Comment Cancel reply