Fast string search in a very large file

I once noticed that using -E or multiple -e parameters is faster than using -f. This may not carry over to your problem, since you are searching for 50,000 strings in a larger file, but I want to show what can be done and what might be worth testing:

Here is what I noticed in detail:

I have a 1.2 GB file filled with random strings.

>ls -has | grep string
1,2G strings.txt

>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
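
For anyone who wants to reproduce the test, a file of this shape can be generated roughly as follows. This is only a sketch: the file name demo_strings.txt and the line count are placeholders I made up, and it is not necessarily how the original file was created. A file of about 1.2 GB would need tens of millions of such lines.

```shell
# Assumption: 32-character alphanumeric lines, like the sample above.
# demo_strings.txt and the count of 1000 lines are placeholders.
LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 32 | head -n 1000 > demo_strings.txt

# Quick sanity check of the generated file.
wc -l < demo_strings.txt
head -n 3 demo_strings.txt
```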

Now I want to search for strings “ab”, “cd” and “ef” using different grep approaches:

  1. Using grep without flags, search one at a time:
    grep "ab" strings.txt > m1.out  
    2,76s user 0,42s system 96% cpu 3,313 total
    
    grep "cd" strings.txt >> m1.out  
    2,82s user 0,36s system 95% cpu 3,322 total
    
    grep "ef" strings.txt >> m1.out  
    2,78s user 0,36s system 94% cpu 3,360 total

So in total the search takes nearly 10 seconds.

  2. Using grep with the -f flag (together with -F), reading the search strings from search.txt:

     >cat search.txt
      ab
      cd
      ef
    
     >grep -F -f search.txt strings.txt > m2.out  
     31,55s user 0,60s system 99% cpu 32,343 total
    

For some reason this takes about 32 seconds.

  3. Now using multiple search patterns with -E or -e:

     grep -E "ab|cd|ef" strings.txt > m3.out  
     3,80s user 0,36s system 98% cpu 4,220 total
    

    or

     grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null  
     3,86s user 0,38s system 98% cpu 4,323 total
    

The third method, using -E, took only 4.22 seconds to search through the file.
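
If the search strings already live in a file, the -E pattern does not have to be typed by hand. The following is a minimal sketch on tiny made-up data (all demo_* file names are hypothetical) that joins the lines with "|" and checks that the result matches what -F -f finds. It assumes the search strings contain no regex metacharacters; otherwise they would need escaping first. I have not tested how well this scales to 50,000 strings.

```shell
# Hypothetical sample data, only for illustration.
printf '%s\n' ab cd ef > demo_search.txt
printf '%s\n' xxabxx yyyy zcdz qqq eef > demo_sample.txt

# Join the search strings with "|" to form one alternation: ab|cd|ef
pattern=$(paste -sd'|' demo_search.txt)

# Same search, once as a single -E pattern and once via -F -f.
grep -E "$pattern" demo_sample.txt > demo_joined.out
grep -F -f demo_search.txt demo_sample.txt > demo_file.out

# cmp is silent and returns 0 when both outputs are byte-identical.
cmp demo_joined.out demo_file.out && echo "same matches"
```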

Now let's check whether the results are the same:

cat m1.out | sort | uniq > m1.sort  
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
#

The diff produces no output, which means both methods found the same lines.
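
The sort and uniq step matters because the one-at-a-time approach writes a line once per pattern it matches, while a single grep call writes each matching line only once. A small sketch with made-up data (the demo_* file names are hypothetical) shows the effect:

```shell
# "abcd" contains both "ab" and "cd", so the first method emits it twice.
printf '%s\n' abcd xxab cdxx zzzz > demo_lines.txt

grep "ab" demo_lines.txt > demo_m1.out
grep "cd" demo_lines.txt >> demo_m1.out
grep -E "ab|cd" demo_lines.txt > demo_m3.out

# Raw line counts differ because of the duplicate.
wc -l < demo_m1.out
wc -l < demo_m3.out

# sort -u is equivalent to sort | uniq; afterwards the sets are identical.
sort -u demo_m1.out > demo_m1.sort
sort -u demo_m3.out > demo_m3.sort
diff demo_m1.sort demo_m3.sort && echo "identical"
```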

You may want to give it a try; otherwise I would advise you to look at the thread “Fastest possible grep” (see the comment from Cyrus).
