Pedro Moreno and others at Google/YouTube work on this. They use weighted finite-state transducers to recognize sequences of "music phone" units, which play a role similar to phonemes in automatic speech recognition.
Check out this article:
- Eugene Weinstein, Pedro J. Moreno: Music Identification with Weighted Finite-State Transducers, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
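To give a rough feel for the idea of matching sequences of music phone units, here is a toy sketch in Python. It is not the weighted finite-state transducer construction from the paper: it swaps in a much simpler n-gram voting scheme as a stand-in, and it assumes the audio has already been decoded into symbolic phone labels. The song names, the single-letter labels, and the functions `build_index` and `identify` are all made up for illustration.

```python
from collections import Counter, defaultdict


def build_index(songs, n=3):
    """Map every run of n consecutive phone labels to the songs containing it."""
    index = defaultdict(set)
    for title, phones in songs.items():
        for i in range(len(phones) - n + 1):
            index[tuple(phones[i:i + n])].add(title)
    return index


def identify(query, index, n=3):
    """Return the song sharing the most n-grams with the query, or None."""
    votes = Counter()
    for i in range(len(query) - n + 1):
        for title in index.get(tuple(query[i:i + n]), ()):
            votes[title] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical database: each song is a sequence of phone labels.
songs = {
    "song_a": list("abcabdacbd"),
    "song_b": list("ddcbaabbcc"),
}
index = build_index(songs)

# A snippet of song_a with one corrupted unit ("X", e.g. an inserted beep).
snippet = list("abdXacbd")
print(identify(snippet, index))
```

Because the undamaged stretches of the snippet still share n-grams with the right song, the corrupted unit only costs a few votes rather than breaking the match.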
If you changed the speed or pitch throughout the whole song, I’m surprised that these algorithms still recognize it. Perhaps they normalize pitch and speed (for example, using the time between beats) so that they can recognize cover versions as well as the original recording. It’s less surprising that the beeps you added are ignored, since the rest of your audio stream is still similar enough to the original.
(Actually, the finite-state-based algorithm would be great to apply to my iTunes library to tag the files correctly. Services like MusicBrainz rely on more or less exact hash matches between your audio and the database entry, whereas the transducer method seems to tolerate more differences when recognizing files.)
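The difference between exact hash lookup and a tolerant sequence match can be illustrated with a small Python sketch. This is only an analogy under stated assumptions: it does not reproduce how MusicBrainz or the transducer method actually work, the single-letter phone labels are invented, and `exact_key` and `similarity` are hypothetical helpers built on the standard library's `hashlib` and `difflib`.

```python
import hashlib
from difflib import SequenceMatcher


def exact_key(phones):
    """Hash the whole sequence; any changed symbol gives a different key.

    Assumes single-character labels, so joining them is unambiguous.
    """
    return hashlib.sha256("".join(phones).encode()).hexdigest()


def similarity(a, b):
    """Fraction of matching content between two sequences (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


reference = list("abcabdacbd")  # hypothetical database entry
ripped = list("abcabdXcbd")     # same song with one unit altered

same_hash = exact_key(reference) == exact_key(ripped)
score = similarity(reference, ripped)
print(same_hash, score)
```

The exact key no longer matches after a single change, while the similarity score stays high, which is the kind of tolerance that would help when tagging imperfect files.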