Issue Origin
After having a glorious battle that lasted two days and three nights (and amazing ideas and thoughts from the comments) I’ve finally managed to fix this issue!
I’d like to post an answer for anybody running into similar issues where the string.Substring(i, j) function is not an acceptable solution to get the substring of a string because the string is either too large and you can’t afford the copying done by string.Substring(i, j) (it has to make a copy because C# strings are immutable, no way around it) or the string.Substring(i, j) is being called a huge number of times over the same string (like in my nested for loops) giving the garbage collector a hard time, or as in my case both!
Attempts
I’ve tried many suggested things such as the StringBuilder, Streams, unmanaged memory allocation using Intptr and Marshal within the unsafe{} block and even creating an IEnumerable and yield return the characters by reference within the given positions. All of these attempts failed ultimatively because some form of joining of the data had to be done as there was no easy way for me to traverse my tree character by character without jeopardizing performance. If only there was a way to span over multiple memory addresses within an array at once like you would be able to in C++ with some pointer arithmetic.. except there is..
(credits to @Ivan Stoev’s comment)
The Solution
The solution was using System.ReadOnlySpan<T> (couldn’t be System.Span<T> due to strings being immutable) which, among other things, allows us to read sub arrays of memory addresses within an existing array without creating copies.
This piece of the code posted:
string _s1 = s1.Substring(i, j);
if (stree.has(_s1))
{
score += j - i;
longest = j - i;
}
Was changed to the following:
if (stree.has(i, j))
{
score += j - i;
longest = j - i;
}
Where stree.has() now takes two integers (position and length of substring) and does:
ReadOnlySpan<char> substr = s1.AsSpan(i, j);
Notice that the substr variable is literally a reference to a subset of characters of the initial s1 array and not a copy! (The s1 variable had been made accessible from this function)
Note that at the moment of writing this I am using C#7.2 and .NET Framework 4.6.1 meaning that to get the Span feature I had to go to Project > Manage NuGet Packages, tick the “Include prerelease” checkbox and browse for System.Memory and install it.
Re-running the initial test (on strings of length 1 milion characters i.e. 1MB) the speed increased from 2+ minutes (I gave up waiting after 2 minutes) to ~86 miliseconds!!