Regex to match Egyptian Hieroglyphics [closed]

Question

TL;DR: \p{Egyptian_Hieroglyphs}

JavaScript

Egyptian hieroglyphs belong to an “astral” plane, meaning each character needs more than 16 bits to encode. JavaScript, as of ES5, doesn’t support astral code points in regular expressions, so you have to match them as surrogate pairs. The first code point, as a surrogate pair, is

U+13000 = d80c dc00

the last one is

U+1342E = d80d dc2e

that gives

// First alternative covers U+13000-U+133FF, second covers U+13400-U+1342E
var re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g;

var t = document.getElementById("pyramid").innerHTML;
document.write("<h1>Found</h1>" + t.match(re));

<div id="pyramid">

  some     𓀀	really    𓀁	old    𓐬	stuff    𓐭	    𓐮
  
  </div>

This is what it looks like with Noto Sans Egyptian Hieroglyphs installed:

[screenshot of the rendered output]
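
The two surrogate pairs quoted above don't have to be looked up; they follow from the standard UTF-16 arithmetic. Here is a minimal sketch of that calculation. The helper names toSurrogatePair and hex are made up for this illustration and aren't part of any library:

```javascript
// Split an astral code point (above U+FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair and hex are hypothetical helper names for this sketch.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not an astral code point: " + codePoint.toString(16));
  }
  var offset = codePoint - 0x10000;     // 20-bit offset into the astral range
  var high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
  return [high, low];
}

// Format a pair as lowercase hex, the way it is written in the answer.
function hex(pair) {
  return pair.map(function (n) { return n.toString(16); }).join(" ");
}

console.log(hex(toSurrogatePair(0x13000))); // d80c dc00
console.log(hex(toSurrogatePair(0x1342E))); // d80d dc2e
```

Since the range spans two different high surrogates (d80c and d80d), the regex needs two alternatives: one for everything under d80c and one for the part of d80d up to dc2e.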

Other languages

On platforms that support UCS-4 you can use the Egyptian code points U+13000 to U+1342E directly, but the syntax differs from system to system. For example, in Python (3.3 and up) it is [\U00013000-\U0001342E]:

>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']
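
For comparison, JavaScript engines that implement ES2015 or later can take the same direct approach through the u flag, which makes the regex work on code points rather than 16-bit units. A rough sketch, assuming such an engine (the variable names are only illustrative):

```javascript
// Assumes an ES2015+ engine: \u{...} escapes and the "u" flag are not available in ES5.
var text = "some \u{13000} really \u{13001} old \u{1342C} stuff \u{1342D} \u{1342E}";

// With "u", the character class is a range of code points, so no surrogate pairs are needed.
var hieroglyphRange = /[\u{13000}-\u{1342E}]/gu;

var found = text.match(hieroglyphRange) || [];
console.log(found.length);             // 5
console.log(found[0] === "\u{13000}"); // true
```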

Finally, if your regex engine supports Unicode properties, you can (and should) use these instead of hardcoded ranges. For example, in PHP/PCRE:

$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭  𓐮";

preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);

prints

[0] => Array
    (
        [0] => 𓀀
        [1] => 𓀁
        [2] => 𓐬
        [3] => 𓐭
        [4] => 𓐮
    )
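
The same idea carries over to JavaScript on engines that implement ES2018 Unicode property escapes. The sketch below assumes such an engine. Note that JavaScript wants the script spelled out as Script=Egyptian_Hieroglyphs rather than the bare name PCRE accepts. The variable names are only illustrative:

```javascript
// Assumes an ES2018+ engine; \p{...} needs the "u" flag in JavaScript.
var sample = "some \u{13000} really \u{13001} old \u{1342C} stuff \u{1342D} \u{1342E}";

// Match by script property instead of a hardcoded code point range.
var byScript = /\p{Script=Egyptian_Hieroglyphs}/gu;

var matches = sample.match(byScript) || [];
console.log(matches.length); // 5
```

Unlike a fixed range, the property follows whatever Unicode version the engine ships with, so hieroglyphs added in later versions are picked up without touching the pattern.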
