dinogalactic

Reverse engineering an obfuscated JavaScript rickroll

Or: What encodings are and how browsers work, and the origins of security and how we live in the world without screaming most of the time.

In which I get obsessed with understanding a silly little piece of code I found.

Intro

I found this guy's website where, among other cute things, he included the following bit of JavaScript:

unescape(escape`󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮󠅡󠄠󠅲󠅵󠅮󠄠󠅡󠅲󠅯󠅵󠅮󠅤󠄠󠅡󠅮󠅤󠄠󠅤󠅥󠅳󠅥󠅲󠅴󠄠󠅹󠅯󠅵`.replace(/u.{8}/g,''))

I, of course, copied it and ran it in the browser console. The result was the following string:

"Never gonna run around and desert you"

How does this work? I did some preliminary investigation and found that it's ultimately a bunch of non-printable characters: they get escaped, the escaped value has replace() run on it, and then the percent-encoded ASCII that remains is unescaped into the final string.

Here's the canonical hex-dump of the input:

🐔$ echo "unescape(escape`󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮`.replace(/u.{8}/g,''))" | hexdump -C
00000000  75 6e 65 73 63 61 70 65  28 65 73 63 61 70 65 60  |unescape(escape`|
00000010  f3 a0 85 8e f3 a0 85 a5  f3 a0 85 b6 f3 a0 85 a5  |................|
00000020  f3 a0 85 b2 f3 a0 84 a0  f3 a0 85 a7 f3 a0 85 af  |................|
00000030  f3 a0 85 ae f3 a0 85 ae  f3 a0 85 a1 f3 a0 84 a0  |................|
00000040  f3 a0 85 b2 f3 a0 85 b5  f3 a0 85 ae f3 a0 84 a0  |................|
00000050  f3 a0 85 a1 f3 a0 85 b2  f3 a0 85 af f3 a0 85 b5  |................|
00000060  f3 a0 85 ae f3 a0 85 a4  f3 a0 84 a0 f3 a0 85 a1  |................|
00000070  f3 a0 85 ae f3 a0 85 a4  f3 a0 84 a0 f3 a0 85 a4  |................|
00000080  f3 a0 85 a5 f3 a0 85 b3  f3 a0 85 a5 f3 a0 85 b2  |................|
00000090  f3 a0 85 b4 f3 a0 84 a0  f3 a0 85 b9 f3 a0 85 af  |................|
000000a0  f3 a0 85 b5 60 2e 72 65  70 6c 61 63 65 28 2f 75  |....`.replace(/u|
000000b0  2e 7b 38 7d 2f 67 2c 27  27 29 29 0a              |.{8}/g,'')).|
000000bc

Just how does someone go from an ASCII string like Never gonna run around and desert you (sic) to that obfuscated JavaScript? Let's find out!

The parts

Let's examine the input, the obfuscated JavaScript. A few parts of it catch my attention:

  1. There are non-printing characters between the backticks.
  2. The JS escape function, if that even is a function, is seemingly called without parentheses.
  3. The replace() call effectively removes each literal u character followed by any 8 characters.
  4. The final value is run through JS's unescape().

We can take these as parts, in order. The unescape(escape󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮.replace(/u.{8}/g,'')) can be understood from the inside out, so we'll take the innermost part first, and then we'll take each layer outside of it. So, if this were an onion, the invisible text would be in the middle, and that's where we would start. The unescape call would be the outside of the onion, which we will look at last.

Part 1: The invisible part

A function, in a general sense not specific to programming, takes input and produces an output through some steps. Here we have the function (again, not strictly in the programming sense) that takes the invisible input between the backticks and outputs "Never gonna run around and desert you". The escape, replace, and unescape are really just helpful hints at the steps that are used. Presumably many different inputs could be placed in between the backticks.

If a function takes steps and produces an output, maybe the input could be reproduced by following the steps in reverse order. So I could start with the human-readable string and wind up with the obfuscated string, presumably.

It might help to know more about the invisible input than the fact that it is invisible. Let's make some observations:

As shown above in the hexdump -C output, the individual bytes are f3, a0, 85, and various other values like 8e. What are these values in decimal?

int('0xf3', base=16)

-> 243

int('0xa0', base=16)

-> 160

int('0x85', base=16)

-> 133

int('0x8e', base=16)

-> 142

bin(243)

-> '0b11110011'

So the first byte is 0xf3, which in decimal is 243. What encoding could this be using? One guess is UTF-16, since JavaScript uses UTF-16 for strings.

int('0xf3a0858ef3a085a5', base=16).to_bytes(8, 'big').decode('utf-16')

-> 'ꃳ躅ꃳꖅ'

So when decoding 8 of these raw bytes through Python's UTF-16 codec, we see four valid, visible characters, so the invisible characters on the page cannot have been encoded in UTF-16.

An aside: What is an encoding?

Fundamentally, an encoding is a way of representing one set of things in another set of things by translating each item in the first bucket into an item in the second bucket. The two buckets of things could be anything. In character encodings, one bucket contains little squiggles (e.g. the letter "A" or the character "ꃳꖅ") and their bastard friends (looking at you, line-feed and carriage return), and the other bucket contains numbers.

One way of organizing these buckets is to make long, ordered lists of the items, or maybe even tables. So we'd order some items earlier in the list and others later. The benefit of organizing them at all is that we can then create equivalences between items in the two buckets, e.g. we can say "this squiggle is encoded as the number 65," and the mechanism of this equivalence would be sameness of position, i.e. the item in one list is equivalent to the item in the other list if they are in the same position relative to the start of their respective lists.

One of the longest, and most computing friendly, lists we know of is the set of integer numbers, so we use that to organize the numeric bucket in character encodings, and we also use it to index the list of squiggles. Indexing means creating a way to refer to an item's position in the list (position 0, 1, 2, etc.). One nice property of integers is that they can be sorted - each one is clearly less or more than any other. The squiggle bucket is tougher to deal with than the number bucket because it consists of squiggles (and a select group of goofy non-printable fuck-offs), which each have no obvious quantity. It's not clear which of "A" and "ꃳꖅ" is larger (should occur later in the list), so if someone asked us, it would not be obvious which should be given the number 4087383461 and which should get the number 65. We humans have conventions for the comparison of letters, like the ABCs, which put things in a commonly understood "alphabetical" order.
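In Python terms, that "sameness of position" idea is exactly what ord and chr expose, with Unicode playing the role of the ordered list of squiggles:

ord('A')

-> 65

chr(65)

-> 'A'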

The Unicode Standard is a way of ordering all the squiggles humans use in written communication. In short, it assigns a number to each of these symbols. UTF-8, UTF-16, and UTF-32 are encodings that map each of these Unicode numbers (called "Unicode code points") to bytes. Each of these 3 encoding schemes encodes Unicode code points differently. For an example of different ways a single character can be encoded, check out the different ways of encoding an egg emoji below. Side note: in Python, you can use the hex code point identifier to specify a Unicode character using only ASCII characters, but if your code point is longer than 16 bits (like 0x1f95a), then you've got to use a 32-bit identifier, left-padding with 0s and using an uppercase U for the escape, to let Python know a 32-bit code point identifier follows.

Another plain language explanation of Unicode and other text encodings can be found in Joel on Software's often-referenced piece "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

"\U0001f95a"

-> '🥚'

For more on how Python models strings internally, and how to specify different strings, check out Python's Unicode HOWTO. These docs include lots of Python examples and some good tips and references about Unicode in general.

egg_unicode = "\U0001f95a"
egg_unicode.encode('utf-8')

-> b'\xf0\x9f\xa5\x9a'

egg_unicode.encode('utf-16')

-> b'\xff\xfe>\xd8Z\xdd'

Side note: Python prints the ASCII char in a byte string representation if the byte is in the range of printable ASCII chars (see https://stackoverflow.com/questions/72392298/what-are-these-symbols-in-bytes-type). So, for the above, the egg is really represented in 6 bytes in UTF-16, two of which are the byte order mark.
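To see the egg without the byte order mark that Python's utf-16 codec prepends, we can ask for an explicit byte order (reusing the egg_unicode variable from above):

egg_unicode.encode('utf-16be')

-> b'\xd8>\xddZ'

len(egg_unicode.encode('utf-16'))

-> 6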

Part 1: The invisible part, continued

We last learned that the invisible characters of the obfuscated code cannot be encoded as UTF-16, since the raw bytes of those invisible characters did not decode as UTF-16.

Let's think through this. The characters showing up in our browser are in fact not showing up. Why? Because they are non-printable. What does non-printable mean? Non-printable means a human cannot see them. Why can't a human see them? Why didn't our computer show us something for these characters? It means that the web client (the browser) that got the web page's data from the server decided to tell whatever display software my computer was using NOT to print anything for those characters. Ultimately, my display driver and monitor did not display anything for those characters.

Obviously, if the web client had interpreted the data from the server as UTF-16-encoded data, it would have instructed the display software stack to display something instead of nothing, since we saw that a UTF-16 decoding of those same bytes produced visible characters. So in order to understand what encoding the invisible characters are in, we need to figure out how the browser decides which encoding to use when decoding the web page data.

Spoiler: The invisible stuff is UTF-8 encoded

Let's banish the suspense. Many readers probably already know that the non-printable characters are non-printable because the bytes in the web page document were interpreted as UTF-8 encoded text, and those bytes refer to non-printable Unicode code points, so whatever display software is involved displayed nothing in their place.

int('0xf3a0858ef3a085a5', base=16).to_bytes(8, 'big').decode('utf-8')

-> '󠅎󠅥'

Part 1: The invisible part, continued again

But it's interesting to go through how we know that the browser decodes the page as UTF-8. It's a good guess - UTF-8 is now everywhere, HTML documents these days are usually encoded as UTF-8, and browsers decode them as UTF-8 by default. I think someone told me that as well one time.

But how do we know that a web page is in UTF-8, rather than some other encoding? Is it possible to send documents back to the browser that are not in UTF-8?

This kind of gets at what a marvel a web browser is; no, it doesn't just get at what a marvel a web browser is - it gets at what a marvel being a human, being alive in this world, and somehow functioning - that is, doing things through time - is. Most of us think of a web browser as something that displays images, text, maybe even experiences to us. That's how we get through life. If we constantly held a web browser as a thing that squirts binary numbers as electrical pulses across a bunch of copper wires and then receives electrical pulses and turns them back into binary numbers that ultimately result in a cathode ray tube shooting some electrons at some phosphorescent surface.... (no one uses CRT monitors anymore, but I don't really know how our modern monitors work - maybe some liquid crystals are activated or something)... If we always thought of a browser that way, we'd be under constant mental strain while using one. Some people think of browsers this way, or else we wouldn't have browsers at all, because that's how people have to think to build them. But for most of us, and all of us at least some of the time, it's easier to ignore all that detail, which is both how we live without screaming and also exactly the origin of security - we ignore detail, and thus potentially different, important ways of interpreting things, in order to get on with things.

So, if we look at a web browser through this more detailed, but no more accurate description, we see some software that interprets bytes from the wire. When receiving some stream of bytes from the computer's operating system, the browser must have some way to know what encoding the data is in, so it can turn the bytes into something meaningful to humans by telling the computer's display software what to display.

An aside: How does a browser determine the encoding of the bytes it receives?

It depends. It especially depends on what type of file the browser thinks it is receiving, because some algorithm has to be used to detect a document's encoding, and that algorithm may differ depending on the type of file being received. Let's stick to HTML here for simplicity. In the case of HTML, the algorithm browsers should use to detect the character encoding is pretty well defined. Also, newer HTML documents will likely use UTF-8, since the WHATWG Encoding Standard requires it for new content. That doesn't mean browsers assume everything is UTF-8, though. Older HTML documents use various encodings, and anyone can still create documents in almost any encoding, though some, like UTF-7, are prohibited by modern browsers due to historical security issues. A fun thing is that UTF-16 is still acknowledged to present significant security issues, but UTF-16 is too big to fail.

In short, as best I can tell, a browser probably considers the following sources of encoding information, in priority order from highest to lowest:

  1. The Byte Order Mark (BOM), if present. The current HTML standard treats a BOM as more authoritative than anything else.
  2. The charset parameter of the Content-Type header in an HTTP response. This doesn't require looking at the content of the document, but it may differ from the encoding of the actual document. (Relatedly, WHATWG has created a MIME sniffing algorithm to give a standard way of figuring out the MIME type.)
  3. Metadata present in a meta element within the document.

This short explanation is all based on standards. In reality, what matters is what browsers do. I'm sure there are some strange bugs out there involving older encodings.
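Here's a rough, hypothetical sketch of that priority order in Python (the function name guess_encoding is made up, and this is nothing like the real WHATWG encoding sniffing algorithm - just an illustration of checking the three sources in turn):

import re

def guess_encoding(content_type_header, body_bytes):
    # A very simplified sketch of the priority order above.
    # 1. A byte order mark, if present, wins.
    if body_bytes.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if body_bytes.startswith(b'\xff\xfe'):
        return 'utf-16le'
    if body_bytes.startswith(b'\xfe\xff'):
        return 'utf-16be'
    # 2. Then the charset parameter of the Content-Type header.
    header_match = re.search(r'charset=([\w-]+)', content_type_header or '', re.I)
    if header_match:
        return header_match.group(1).lower()
    # 3. Then a <meta charset="..."> found by scanning the start of the document.
    meta_match = re.search(rb'<meta\s+charset="?([\w-]+)', body_bytes[:1024], re.I)
    if meta_match:
        return meta_match.group(1).decode('ascii').lower()
    # Real browsers fall back to a locale-dependent default; pretend it's UTF-8.
    return 'utf-8'

guess_encoding('text/html', b'<!doctype html><html><head><meta charset="UTF-8">')

-> 'utf-8'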

Part 1: The invisible part, wrapped up

In the case of the page I originally found that bit of obfuscated code on, the charset is specified in the head element of the HTML, inside a meta tag: <meta charset="UTF-8">.

So, the trick must work by using non-printable code points, encoded in UTF-8. When these are entered, by copying and pasting or whatever, as the contents of a JavaScript string, they will be converted to UTF-16, since that's how JavaScript stores strings. JavaScript's interpreter, or some component of the browser that interacts with the interpreter, probably decodes the UTF-8 bytes to Unicode code points, then encodes the same code points as UTF-16 in the interpreter's memory. So let's pretend our invisible byte string is now stored inside the interpreter, decoded from UTF-8 and re-encoded as UTF-16.

Below is a preview of the hex bytes we should see if we are able to print the byte content of the JavaScript string:

[hex(b) for b in '󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮'.encode('utf-16be')][:10]

-> ['0xdb',
    '0x40',
    '0xdd',
    '0x4e',
    '0xdb',
    '0x40',
    '0xdd',
    '0x65',
    '0xdb',
    '0x40']

Note that I used utf-16be as the encoding. UTF-16 cares about byte order (UTF-8 does not). If we use plain utf-16, Python prepends a byte order mark and uses the machine's native byte order, which is little-endian on most machines, including mine. utf-16le, used below, produces different values than utf-16be, of course:

[hex(b) for b in '󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮'.encode('utf-16le')][:10] # little endian (not actually used by JS, on my system at least)

-> ['0x40',
    '0xdb',
    '0x4e',
    '0xdd',
    '0x40',
    '0xdb',
    '0x65',
    '0xdd',
    '0x40',
    '0xdb']

Disregard the above. We only care about big-endian UTF-16, as we will see.

At this point, you may be wondering what kinds of non-printable characters are being used. To find out, we can look up the non-printing Unicode code points. To get those we can do this:

[hex(ord(l)) for l in '󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮']

-> ['0xe014e',
    '0xe0165',
    '0xe0176',
    '0xe0165',
    '0xe0172',
    '0xe0120',
    '0xe0167',
    '0xe016f',
    '0xe016e',
    '0xe016e']

Actually, since strings in Python 3.x are Unicode, passing each character to the unicodedata library works really well for looking up information about these characters:

import unicodedata
[unicodedata.name(l) for l in '󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮']

-> ['VARIATION SELECTOR-95',
    'VARIATION SELECTOR-118',
    'VARIATION SELECTOR-135',
    'VARIATION SELECTOR-118',
    'VARIATION SELECTOR-131',
    'VARIATION SELECTOR-49',
    'VARIATION SELECTOR-120',
    'VARIATION SELECTOR-128',
    'VARIATION SELECTOR-127',
    'VARIATION SELECTOR-127']

Cool! So these are things called variation selectors. Variation selectors are meant to be used following another character, to modify how the character is displayed - for instance, to modify an emoji or choose to display it as text rather than an image. Here's a nice web page showing all three of the special numbers we've seen so far - the UTF-8, UTF-16, and Unicode code point values - in one place, tied together by the name "VARIATION SELECTOR-95".
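It's also worth noticing (my own observation, consistent with what we'll see below) that each selector's code point is exactly 0xe0100 plus the ASCII code of the letter it hides:

hex(0xe014e - 0xe0100)

-> '0x4e'

chr(0x4e)

-> 'N'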

Part 2: The funky escape part

If we keep following the call chain, the first thing done to this UTF-16 byte stream appears to involve a strange construction:

escape``

This is actually pretty simple - it's using a feature of the JavaScript language called "tagged templates." It's worth reading about JavaScript tagged templates on MDN, but for now we can treat it as if the escape function receives an array containing a single string: the string of UTF-16 code units derived from our non-printable UTF-8 bytes.

We can put the tagged template part into the browser console, and we get the following result:

"%uDB40%uDD4E%uDB40%uDD65%uDB40%uDD76%uDB40%uDD65%uDB40%uDD72%uDB40%uDD20%uDB40%uDD67%uDB40%uDD6F%uDB40%uDD6E%uDB40%uDD6E%uDB40%uDD61%uDB40%uDD20%uDB40%uDD72%uDB40%uDD75%uDB40%uDD6E%uDB40%uDD20%uDB40%uDD61%uDB40%uDD72%uDB40%uDD6F%uDB40%uDD75%uDB40%uDD6E%uDB40%uDD64%uDB40%uDD20%uDB40%uDD61%uDB40%uDD6E%uDB40%uDD64%uDB40%uDD20%uDB40%uDD64%uDB40%uDD65%uDB40%uDD73%uDB40%uDD65%uDB40%uDD72%uDB40%uDD74%uDB40%uDD20%uDB40%uDD79%uDB40%uDD6F%uDB40%uDD75"

Look familiar? That is a big-endian UTF-16 representation of the characters from our original UTF-8 byte stream, except that it's percent-encoded one 16-bit code unit at a time. (The escape function is deprecated, according to MDN.)
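We can imitate this in Python to convince ourselves of what escape is doing (a rough sketch of the deprecated function's documented behavior, not the engine's actual implementation; the helper name js_escape is mine):

def js_escape(s):
    # Percent-encode per UTF-16 code unit, roughly the way JS's escape() does:
    # leave A-Z, a-z, 0-9 and @*_+-./ alone, use %XX below 0x100, %uXXXX otherwise.
    safe = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./")
    raw = s.encode('utf-16be')
    out = []
    for i in range(0, len(raw), 2):
        unit = (raw[i] << 8) | raw[i + 1]  # one 16-bit code unit
        if chr(unit) in safe:
            out.append(chr(unit))
        elif unit < 0x100:
            out.append('%%%02X' % unit)
        else:
            out.append('%%u%04X' % unit)
    return ''.join(out)

js_escape('\U000E014E\U000E0165')

-> '%uDB40%uDD4E%uDB40%uDD65'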

Part 3: The replace part

Things are pretty simple from here on.

Looking at the original,

unescape(escape`󠅎󠅥󠅶󠅥󠅲󠄠󠅧󠅯󠅮󠅮󠅡󠄠󠅲󠅵󠅮󠄠󠅡󠅲󠅯󠅵󠅮󠅤󠄠󠅡󠅮󠅤󠄠󠅤󠅥󠅳󠅥󠅲󠅴󠄠󠅹󠅯󠅵`.replace(/u.{8}/g,''))

replace(/u.{8}/g,'') is next. All this does is remove every instance of the letter "u" followed by any 8 characters. Here, each match is a "uDB40%uDD" chunk, so what survives is a "%" followed by the two hex digits of an ASCII character.

This results in the following:

"%4E%65%76%65%72%20%67%6F%6E%6E%61%20%72%75%6E%20%61%72%6F%75%6E%64%20%61%6E%64%20%64%65%73%65%72%74%20%79%6F%75"

And now, we have simple percent-encoded ASCII.
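The same substitution is easy to check in Python, with re.sub standing in for JavaScript's replace, using just the first three escaped characters from above:

import re
re.sub(r'u.{8}', '', '%uDB40%uDD4E%uDB40%uDD65%uDB40%uDD76')

-> '%4E%65%76'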

Part 4: The unescape

And finally, the joke. Unescaping the percent-encoded ASCII results in:

"Never gonna run around and desert you"

Reverse engineering this trick

So, how did someone figure this trick out? I'm unsure exactly, but my guess is that they simply looked for a block of non-printing code points in Unicode, and then they found some way to convert these to ASCII without being super obvious. The path used here was to find a nice block of non-printing characters whose UTF-16 encodings end in valid ASCII byte values. One really important thing is that these non-printing characters shouldn't be flagged by many programs - it's not uncommon for programs to call out hidden text, the way VSCode might alert you to invisible characters hiding in a function name. Variation selectors fill that role nicely.
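Here's my guess at how someone could generate the obfuscated snippet, using the code-point arithmetic we saw earlier (a sketch under that assumption, not the original author's actual tool; the helper name hide is made up):

def hide(text):
    # Map each ASCII character to the variation selector whose code point is
    # 0xe0100 plus the character's ASCII value. These render as nothing, but
    # escape`` turns each one into %uDB40%uDDxx, where xx is the ASCII value,
    # and .replace(/u.{8}/g,'') then strips everything but %xx.
    return ''.join(chr(0xE0100 + ord(c)) for c in text)

hidden = hide("Never gonna run around and desert you")
print(f"unescape(escape`{hidden}`.replace(/u.{{8}}/g,''))")

Pasting the printed line into a browser console should give you the lyric right back.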

A short survey of encoding-based threat vectors