JavaScript email obfuscation generated in shell

Writing a plain email address on a web page is risky: usually, it comes down to offering the address to all the bots scraping the web and looking for targets to add to spammers’ lists. Preventive measures can be used to keep some, hopefully most, bots from parsing email addresses. A very simple countermeasure consists in using a string such as username [at] mydomain [dot] tld instead of plain username@mydomain.tld, but I tend to believe that many bots are now able to recognize such patterns; moreover, it breaks mailto: HTML links. While these links are far from essential, they arguably make the user experience more comfortable.

“At” sign, often used as a symbol for email address

Here we present a more subtle email obfuscation method. The article contains two parts:

First we will analyze how pandoc uses JavaScript and HTML entities to obfuscate email addresses in the generated HTML file;
Then I propose a simple implementation of this feature in shell language, at least for ASCII-made emails, for those occasions when we cannot call pandoc directly.

1 Hide-and-seek with Pandoc

1.1 A simple test

Pandoc is a Haskell library able to convert text from one markup format to another, as well as the name of a command-line tool using this library. Here we want to produce HTML output; for input, we can use any of the numerous formats that pandoc is able to parse. As an example, let’s convert a simple string from Markdown to HTML:

$ echo "This is not an email" | pandoc --from=markdown --to=html
<p>This is not an email</p>

That simple. Pandoc encloses the string into a <p></p> paragraph tag.
Now let’s try with a web link:

$ echo "<https://myblog.mydomain.tld>" | pandoc -f markdown -t html
<p><a href="https://myblog.mydomain.tld">https://myblog.mydomain.tld</a></p>

We get the <a></a> tag with the associated href attribute, as expected. Now consider an email address (I can omit the -f and -t options: pandoc defaults to converting from Markdown to HTML anyway):

$ echo "<username@mydomain.tld>" | pandoc

A simple conversion would return:

<p><a href="mailto:username@mydomain.tld">username@mydomain.tld</a></p>

But actually more things happen with pandoc, and instead we get:¹

<p><script type="text/javascript">
<!--
h='&#x6d;&#x79;&#100;&#x6f;&#x6d;&#x61;&#x69;&#110;&#46;&#116;&#108;&#100;';a='&#64;';n='&#x75;&#x73;&#x65;&#114;&#110;&#x61;&#x6d;&#x65;';e=n+a+h;
document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+e+'<\/'+'a'+'>');
// -->
</script><noscript>&#x75;&#x73;&#x65;&#114;&#110;&#x61;&#x6d;&#x65;&#32;&#x61;&#116;&#32;&#x6d;&#x79;&#100;&#x6f;&#x6d;&#x61;&#x69;&#110;&#32;&#100;&#x6f;&#116;&#32;&#116;&#108;&#100;</noscript></p>

Let’s see what we have got here.

1.2 HTML entities

Pandoc returned a paragraph block (<p> root tag) made of two inner blocks: <script> and <noscript>. The <script> block contains JavaScript code for browsers able and willing to execute it (“willing”, because users may prefer to deactivate JavaScript); and of course, <noscript> contains HTML for other users (not all browsers can actually execute JavaScript, by the way: for example browsers running in a terminal, such as elinks, cannot do that). This latter block is easier to understand, since there is no code to execute, and we will start with it.

It contains a string of HTML entities, either in decimal (such as &#100;) or hexadecimal (such as &#x6d;) format. Copy-pasting the line into the first HTML decoder returned by your favorite search engine gives:

username at mydomain dot tld
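If you would rather not paste the string into an online decoder, the decoding can also be sketched locally in shell. The decode_entities function below is a hypothetical helper of mine (it is not part of pandoc), and it assumes the input only contains numeric entities for ASCII characters:

```shell
# decode_entities STRING: hypothetical helper, ASCII numeric entities only.
decode_entities() {
  # Put one entity per line by turning each ';' terminator into a newline.
  printf '%s' "$1" | tr ';' '\n' | while IFS= read -r ent; do
    case "$ent" in
      '&#x'*) hex=${ent#'&#x'}; code=$((0x$hex)) ;;  # hexadecimal form
      '&#'*)  code=${ent#'&#'} ;;                    # decimal form
      *)      continue ;;                            # ignore anything else
    esac
    # Print the character through its octal escape, e.g. 109 -> \155 -> m.
    printf "\\$(printf '%03o' "$code")"
  done
  echo
}

# Start of the <noscript> string from the pandoc output above.
decode_entities '&#x75;&#x73;&#x65;&#114;&#32;&#x61;&#116;'
```

Feeding it the complete <noscript> content should give the same result as the online decoder.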

Okay, so basically for browsers not executing JavaScript, pandoc performs two operations:

it expands @ and . symbols into letters;
it converts the whole string to HTML entities.

1.3 Generating “mailto:” link through JavaScript

Now come back to the <script> block. This is some simple code: we define three strings made of HTML entities, h (host), a (at) and n (name). You can verify that h is equal to mydomain.tld, a is @ (not expanded to letters) and n is username. Then we recreate the email address in a fourth variable e, with HTML entities again: e=n+a+h. At last we use the document.write() JavaScript function to directly insert HTML code in place of the script, just before the page is rendered by the browser. The string is split into multiple parts to “hide” the mailto: keyword, which would undoubtedly attract bots. So by assembling the parts, we get something equivalent to:

document.write('<a href="mailto:' + e + '">' + e + '<\/a>')

Where e will of course be expanded to the HTML encoded email value.

As a result, you will see no mailto: keyword in the source HTML; it is only generated at runtime (you can test on this blog: my email does not appear in plain text in the HTML code, and yet you can just click on my name at the top of this article to send me an email).
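This claim is easy to check with grep. The sketch below stores the script block from section 1.1 in a function and counts the lines that contain a given fixed string; snippet and count_hits are names I made up for the illustration:

```shell
# snippet: print the <script> block from section 1.1, verbatim.
snippet() {
  cat <<'EOF'
<script type="text/javascript">
<!--
h='&#x6d;&#x79;&#100;&#x6f;&#x6d;&#x61;&#x69;&#110;&#46;&#116;&#108;&#100;';a='&#64;';n='&#x75;&#x73;&#x65;&#114;&#110;&#x61;&#x6d;&#x65;';e=n+a+h;
document.write('<a h'+'ref'+'="ma'+'ilto'+':'+e+'">'+e+'<\/'+'a'+'>');
// -->
</script>
EOF
}

# count_hits STRING: number of lines of the snippet containing STRING.
# grep -c exits with status 1 when nothing matches, hence the "|| true".
count_hits() {
  snippet | grep -c -F -- "$1" || true
}

count_hits 'mailto:'    # the keyword bots look for
count_hits '@'          # a literal at sign
count_hits 'username'   # the local part in clear text
```

All three counts should be 0: a bot that only greps the raw HTML finds neither the keyword nor the address.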

1.4 Using different text for link and target

It is possible to add an email link to any string in HTML, and pandoc can deal with it as well, with the following syntax:

$ echo "[Write to foobar!](mailto:foobar@mydomain.tld)" | pandoc

The email address is handled in the same way, but the link will be placed on the reference text instead of copying the address itself. For browsers not executing JavaScript, the resulting decoded string will be:

Write to foobar! (foobar at mydomain dot tld)

1.5 “How good is this mechanism at protecting my email address?”

I don’t know! I don’t code bots to scrape the Internet. My guess is that using HTML entities makes it safer than simple @ and . symbols expansion, which might be caught by regular expressions. I don’t think that so many bots embed JavaScript interpreters when looking for emails. I know some can run JavaScript, especially for passing captchas, but there are so many unprotected email addresses out there that I am not sure it is worth losing time on code execution for this task. Without JavaScript, they cannot use the mailto: keyword to detect the email, and they have to both perform HTML entities substitution and apply regular expressions to catch the “at” and “dot” words; again, I can’t tell what percentage of bots are this sophisticated. Anyway, replacing @ and . by pictures probably remains one of the safest available protections if you really want to protect your email; but the solution presented here preserves design and mailto: links. In the end, it is a matter of choices.

Note that even though this mechanism is used by pandoc, I am not assuming that it was introduced by pandoc; actually, I don’t know where it comes from. I have seen similar solutions in other software (e.g. in dokuwiki, which uses only HTML entity conversion but no JavaScript).

2 Now let’s do it again in shell

2.1 “What? What about shell?”

Alright, shell might not be the most intuitive language to use for this task. Furthermore, when I say “shell”, it means “sh”, and not “bash” for instance. So, why on Earth am I using shell to produce the HTML and JavaScript code needed to obfuscate emails? Actually, this is part of my workflow for this blog. I write articles in Markdown and feed pandoc with them, and it does a good job of replacing all email addresses in the article body before inserting it into the templates. Ah, templates. This is where we have an issue: I cannot use Markdown to fill the template except for the body of the article. I can set variables to be expanded inside the HTML template, but they will not be interpreted; so if I have an $author-email$ variable in the template, I will have no obfuscation unless I do it myself prior to assigning its value to the author-email variable. This is what is performed inside pangitive’s Git hooks, in shell. And this is why I am proposing my shell implementation here.

I suppose you know the basics of shell. If you don’t understand parts of the syntax below, you may want to try the man documentation for commands sh, cut, printf (in man section 1) and sed.

2.2 Converting a string to HTML entities

We want a function that, from an address string, returns the HTML snippet (including JavaScript code) used to generate the mailto: link at runtime.

Let’s start with a first function to convert any ASCII string to HTML entities. Basically we want to loop on the string length and, for each character, print the HTML code. Length of a string variable $i can be obtained in shell with ${#i}, and of course the first argument of our function is accessible in $1, so our loop will look like:

i=1;
while [ $i -le ${#1} ] ; do
  …
  i=$((i+1))
done

The $((i+1)) syntax evaluates the arithmetic expression and returns the result, so this line basically increments $i by one. For each pass in the loop, we can retrieve the letter at position $i with command cut: l=`echo -n "$1" | cut -c$i`.

If you are using bash or another recent shell, you should prefer a different syntax for command expansion, such as $(command) instead of `command`.
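To see the loop and the cut call working together before adding any encoding, here is a minimal sketch that prints each character of its argument with its position. The function name spell is mine, and I use printf '%s' rather than echo -n because the behavior of echo -n varies between shells; this is a choice of the sketch, not something the rest of the article relies on:

```shell
# spell STRING: print "position:character" for each character of STRING.
spell() {
  i=1
  while [ "$i" -le "${#1}" ]; do
    l=$(printf '%s' "$1" | cut -c"$i")   # character at position $i
    printf '%s:%s\n' "$i" "$l"
    i=$((i+1))
  done
}

spell 'a@b.c'
# 1:a
# 2:@
# 3:b
# 4:.
# 5:c
```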

Printing the ASCII code of letter in $l can be obtained with printf '%d' \'"$l" (for decimal) or printf '%x' \'"$l" (for hexadecimal). Do we want decimal, or hexa? Pandoc uses both. I could not determine what pattern it uses, though. I did not read its source code. For some inputs it looks like it’s alternating between the two with each character; and for some inputs it looks as if a random pattern had been used (although for a given input, you only have a single output). So I chose to have decimal code for every even character, and hexa for odd ones. To obtain this we only have to get the modulo by 2 of the value of variable $i, which loops on the length of the string to convert. Don’t forget to add the HTML entity prefix (&# or &#x, depending on decimal or hexa encoding) and suffix ;, and we get our function.

convert_to_html(){
  i=1;
  while [ $i -le ${#1} ] ; do
    l=`echo -n "$1" | cut -c$i`
    if [ $((i%2)) -eq 0 ] ; then
      printf '&#%d;' \'"$l"
    else
      printf '&#x%x;' \'"$l"
    fi
    i=$((i+1))
  done
}
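As a quick sanity check of the printf trick and of the even/odd rule, the sketch below encodes one character at a time. entity_for is a hypothetical helper that takes the character and its position as arguments; it only isolates the body of the loop above:

```shell
# entity_for CHAR POSITION: print the HTML entity for one ASCII character.
entity_for() {
  if [ $(( $2 % 2 )) -eq 0 ]; then
    printf '&#%d;' "'$1"     # even position: decimal code
  else
    printf '&#x%x;' "'$1"    # odd position: hexadecimal code
  fi
}

entity_for a 1; entity_for b 2; entity_for . 3; entity_for @ 4; echo
# &#x61;&#98;&#x2e;&#64;
```

The leading quote in the argument is what makes printf print the numeric value of the character that follows it.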

Now we can work on the main function. It will have to:

split the email address into local and domain parts;
encode each part as HTML entities (we already have this part);
print the resulting HTML and JavaScript code.

First part is easy to perform with cut if we assume that there is a single @ character in all the address string: we just have to use it as a field delimiter.

Here we are not respecting RFC 5322 about Internet Message Format, according to which other @ symbols could appear in a quoted string inside the local part (i.e., left side of traditional @) or inside square brackets in the domain part (right of the @) of the address. On the other hand, I have never seen a functional email address with more than one @ inside.

We get:

name=`echo $1 | cut -d@ -f1`
host=`echo $1 | cut -d@ -f2`
n=`convert_to_html "$name"`
a='&#64;'
h=`convert_to_html "$host"`
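If you prefer to avoid spawning cut twice, the same split can probably be done with parameter expansion alone. This is a different technique from the one used in the rest of the article, shown here only as an alternative sketch; split_address is a made-up name, and the single @ assumption still applies:

```shell
# split_address ADDRESS: split on @ using parameter expansion only.
split_address() {
  name=${1%%@*}   # drop the first @ and everything after it
  host=${1#*@}    # drop everything up to and including the first @
  printf 'name=%s host=%s\n' "$name" "$host"
}

split_address 'username@mydomain.tld'
# name=username host=mydomain.tld
```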

And the printing part (be cautious with the simple and double quotes):

echo '<script type="text/javascript">'
echo '<!--'
echo "h='"$h"';a='"$a"';n='"$n"';e=n+a+h;"
echo "document.write('<a h'+'ref'+'=\"ma'+'ilto'+':'+e+'\">'+e+'<\/'+'a'+'>');"
echo '// -->'
echo "</script><noscript>$n$a$h</noscript>"

But, wait: didn’t we forget something? What about the @ and . expansion to letters for the noscript block? We have to perform it before substituting HTML entities for letters. Running a sed command is perfect for this: we can instantly substitute all . by ␣dot␣ strings. The @ can be manually replaced by ␣at␣. So the code becomes:

obfuscate_email() {
  name=`echo $1 | cut -d@ -f1`
  host=`echo $1 | cut -d@ -f2`
  n=`convert_to_html "$name"`
  a='&#64;'
  h=`convert_to_html "$host"`

  n_ns=`echo "$name" | sed 's/\./ dot /g'`
  n_ns=`convert_to_html "$n_ns"`
  a_ns=`convert_to_html ' at '`
  h_ns=`echo "$host" | sed 's/\./ dot /g'`
  h_ns=`convert_to_html "$h_ns"`
  noscript_mail="$n_ns$a_ns$h_ns"

  echo '<script type="text/javascript">'
  echo '<!--'
  echo "h='"$h"';a='"$a"';n='"$n"';e=n+a+h;"
  echo "document.write('<a h'+'ref'+'=\"ma'+'ilto'+':'+e+'\">'+e+'<\/'+'a'+'>');"
  echo '// -->'
  echo "</script><noscript>$noscript_mail</noscript>"
}
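To check the expansion step on its own, it can help to print the noscript text before it gets encoded. The sketch below does only that; plain_fallback is a hypothetical name, and the function skips the entity conversion on purpose so that the result stays readable:

```shell
# plain_fallback ADDRESS: print the human-readable noscript text.
plain_fallback() {
  name=$(printf '%s\n' "$1" | cut -d@ -f1)
  host=$(printf '%s\n' "$1" | cut -d@ -f2)
  # Replace every literal dot with " dot " in both parts.
  name=$(printf '%s\n' "$name" | sed 's/\./ dot /g')
  host=$(printf '%s\n' "$host" | sed 's/\./ dot /g')
  # Join the two parts with " at " in place of the @.
  printf '%s at %s\n' "$name" "$host"
}

plain_fallback 'first.last@mydomain.tld'
# first dot last at mydomain dot tld
```

The backslash before the dot matters: without it, sed would treat the dot as “any character” and replace every letter.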

We’re nearly done, but… It would be nice if we could change the text of the mailto: link, and default to the email address itself if nothing is provided as a second argument. Let’s do this by creating variables we will reuse in the printing part:

if [ "$2" != "" ] ; then
  text="'$2'"
  noscript_mail="`convert_to_html \"$2\"`&#32;&#x28;$n_ns$a_ns$h_ns&#x29;"
else
  text="e"
  noscript_mail="$n_ns$a_ns$h_ns"
fi
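One caveat with text="'$2'": the link text ends up inside a single-quoted JavaScript string, so a text containing an apostrophe would presumably end that string too early. I have not needed it for this blog, but a small escaping step could look like the sketch below; js_escape is a hypothetical helper and is not used in the full script that follows:

```shell
# js_escape STRING: escape backslashes and single quotes for use inside
# a single-quoted JavaScript string literal.
js_escape() {
  printf '%s\n' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/\\\\'/g"
}

js_escape "Write to O'Brien"
# Write to O\'Brien
```

One would then build the variable with text="'`js_escape \"$2\"`'" instead of using $2 directly.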

So now the full script. We add some verifications on the arguments to prevent errors on wrong inputs (first argument of the main function should not be empty, and should contain one and only one @ symbol). I also added an example function invocation at the end of the script:

#!/bin/sh
convert_to_html(){
  if [ "$1" = "" ] ; then
    return
  fi
  i=1;
  while [ $i -le ${#1} ] ; do
    l=`echo -n "$1" | cut -c$i`
    if [ $((i%2)) -eq 0 ] ; then
      printf '&#%d;' \'"$l"
    else
      printf '&#x%x;' \'"$l"
    fi
    i=$((i+1))
  done
}

obfuscate_email() {
  if [ "$1" = "" -o `echo -n "$1" | sed 's/[^@]//g' | wc -c` -ne 1 ] ; then
    echo "Usage: $0 <email_address> [text]"
    exit 1
  fi
  name=`echo $1 | cut -d@ -f1`
  host=`echo $1 | cut -d@ -f2`
  n=`convert_to_html "$name"`
  a='&#64;'
  h=`convert_to_html "$host"`

  n_ns=`echo "$name" | sed 's/\./ dot /g'`
  n_ns=`convert_to_html "$n_ns"`
  a_ns=`convert_to_html ' at '`
  h_ns=`echo "$host" | sed 's/\./ dot /g'`
  h_ns=`convert_to_html "$h_ns"`

  if [ "$2" != "" ] ; then
    text="'$2'"
    noscript_mail="`convert_to_html \"$2\"`&#32;&#x28;$n_ns$a_ns$h_ns&#x29;"
  else
    text="e"
    noscript_mail="$n_ns$a_ns$h_ns"
  fi

  echo '<script type="text/javascript">'
  echo '<!--'
  echo "h='"$h"';a='"$a"';n='"$n"';e=n+a+h;"
  echo "document.write('<a h'+'ref'+'=\"ma'+'ilto'+':'+e+'\">'+"$text"+'<\/'+'a'+'>');"
  echo '// -->'
  echo "</script><noscript>$noscript_mail</noscript>"
}

obfuscate_email foobar@mydomain.tld "Write to foobar!"

This time, we’ve got what we want! An email obfuscating function written in shell, producing nearly the same output as pandoc would (without the <p> tag, but it is easy to add, and with a different pattern for decimal and hexadecimal HTML encoding).
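A note on the argument check: it counts the @ characters by deleting everything else with sed, and some sed implementations may add a trailing newline that changes the count reported by wc. Also, the -o operator of the [ command is marked obsolescent in POSIX. A possibly more portable variant, with the hypothetical names count_at and is_valid_address, could be sketched as:

```shell
# count_at STRING: print the number of @ characters in STRING.
count_at() {
  # tr -cd '@' deletes every character except @; the last tr strips the
  # padding that some wc implementations put before the number.
  printf '%s' "$1" | tr -cd '@' | wc -c | tr -d ' '
}

# is_valid_address STRING: succeed if STRING is non-empty with exactly one @.
is_valid_address() {
  [ -n "$1" ] && [ "$(count_at "$1")" -eq 1 ]
}

count_at 'foobar@mydomain.tld'   # 1
count_at 'a@b@c'                 # 2
is_valid_address 'foobar@mydomain.tld' && echo "looks fine"
```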

2.3 Input encoding

In this article we’ve been working with simple ASCII characters only. Today email addresses theoretically support Unicode characters, but the above function needs to be modified to handle this. I could not find a way to make it work in simple shell, because my printf binary does not appear to handle Unicode correctly. But I have a solution working with both bash and zsh, since they embed their own built-in versions of printf (which turn out to print correct values for Unicode characters). If you’re using one of these shells, you may deal with Unicode email addresses as follows: the line that extracts each character in convert_to_html (line 8 of the full script above) could use a sed command instead of cut (cut fails to get Unicode characters as well).

    l=`echo -n "$1" | sed 's/.\{'$((i-1))'\}\(.\).*/\1/'`

The rest of the conversion is performed correctly. By the way, pandoc does not seem to recognize email addresses containing Unicode characters, and treats them as if they were simple strings.

That’s all. I hope this can help you to sanitize your email addresses!

¹ Edit: Note that at some point email obfuscation in pandoc became opt-in through a command-line option. Pass --email-obfuscation=javascript to recent versions of pandoc to get the obfuscated email.