Site icon Joshua Tzucker's Site

Google Sheets – How To Split a String Into Characters

I’m writing up this post because this is now the second time I have run into this issue, and the solution I finally ended up with is not as simple as one might guess.

For splitting text into columns in Google Sheets, or even with a known delimiter, the approach is not that difficult – “How-To Geek” has a good step-by-step guide for those situations.

What this post is about is splitting text in Google Sheets by character, with an unknown input length and text content. For example:

Input Expected Output
| hello | | h | e | l | l | o |

This is harder than it might seem for several reasons:

  1. The SPLIT() function in Google Sheets does not allow splitting by an empty delimiter (e.g. "")
    • This might be surprising to you if you are a JavaScript developer, since const charArr = 'hello'.split(''); is not only permissable in JavaScript, but it does exactly what we want
      • In fact, one solution to our issue could be to simply create a custom function with Google Apps Script, called JS_SPLIT(), and allow it to be called from Sheets
  2. Inserting a delimiter into the input string, between each input character, to workaround the above limitation works, but has its own set of nuances

Solution

First, I’ll provide the solution, and then break down how it works, and why it might seem more complicated than necessary.

Formula:

=SPLIT(REGEXREPLACE(REGEXREPLACE(A2&"","(?s)(.{1})","$1"&CHAR(127)),"'","''"),CHAR(127))

Easier to read:

=SPLIT(
    REGEXREPLACE(
        REGEXREPLACE(A2&"","(?s)(.{1})","$1"&CHAR(127))
    ,"'","''")
,CHAR(127))

Breaking this down section by section, from inside to out:

Emoji Issue

Despite this formula working with line breaks, symbols, special characters, and even quotes, there is one instance I have found where it might still produce unexpected results: emoji input. I’m reluctant to call this a problem or “bad” behavior, because you could also view it as the formula working correctly.

Here is the issue: certain “emoji” are displayed as one character, but actually comprised of multiple emoji joined together, by a “zero-width joiner” (ZWJ), into a ZWJ Sequence. These tend to be emoji that have a common base (such as a face), but are a variation on that base (maybe a different skin color, hair color, etc.), and especially for gendered emoji combinations. A list of some of these can be found on emojipedia.

Without getting too much into the details, here is an example and how it plays out with my formula.

This emoji – 👩‍🦰 – is Woman: Red Hair, and as long as you have an up-to-date operating system, you should see it as a single character on your screen. However, look at how the Google Sheets formula splits up that character when it appears in our input string:

What has happened is that the formula has treated 👩‍🦰 as three characters, not one, which is technically correct since it is a ZWJ Sequence comprised of 👩 (Woman, U+1F469) + ZWJ (U+200D) + 🦰 (Red Hair, U+1F9B0).

The 👋 (waving hand emoji) is split as a single character, since it is not a ZWJ sequence, but rather a regular single glyph / emoji.

Note: This same behavior can be seen in JavaScript – try copying and pasting this into your console for a fun experiment: console.log([...'👩‍🦰'])

ZWJ Sequences also completely break formula that rely on using length tricks, such as this alternative solution for splitting by character.