Google Sheets – How To Split a String Into Characters

  • infoFull Post Details
    info_outlineClick for Full Post Details
    Date Posted:
    Jan. 31, 2021
    Last Updated:
    Jan. 31, 2021
  • classTags
    classClick for Tags

I’m writing up this post because this is now the second time I have run into this issue, and the solution I finally ended up with is not as simple as one might guess.

For splitting text into columns in Google Sheets, or even with a known delimiter, the approach is not that difficult – “How-To Geek” has a good step-by-step guide for those situations.

What this post is about is splitting text in Google Sheets by character, with an unknown input length and text content. For example:

Input Expected Output
| hello | | h | e | l | l | o |

This is harder than it might seem for several reasons:

  1. The SPLIT() function in Google Sheets does not allow splitting by an empty delimiter (e.g. "")
    • This might be surprising to you if you are a JavaScript developer, since const charArr = 'hello'.split(''); is not only permissable in JavaScript, but it does exactly what we want
      • In fact, one solution to our issue could be to simply create a custom function with Google Apps Script, called JS_SPLIT(), and allow it to be called from Sheets
  2. Inserting a delimiter into the input string, between each input character, to workaround the above limitation works, but has its own set of nuances

Solution

First, I’ll provide the solution, and then break down how it works, and why it might seem more complicated than necessary.

Formula:

=SPLIT(REGEXREPLACE(REGEXREPLACE(A2&"","(?s)(.{1})","$1"&CHAR(127)),"'","''"),CHAR(127))

Easier to read:

=SPLIT(
    REGEXREPLACE(
        REGEXREPLACE(A2&"","(?s)(.{1})","$1"&CHAR(127))
    ,"'","''")
,CHAR(127))

Breaking this down section by section, from inside to out:

  • REGEXREPLACE(A2&"","(?s)(.{1})","$1"&CHAR(127))
    • What: This replaces every character in the input string (contained in cell A2) with itself, plus a special delimiter character
    • You’ll notice that I use CHAR(127) as my inserted delimiter, whereas many answers on StackOverflow use a plain text character, like ~. There are two reasons why I use CHAR(127) instead of ~.
      • CHAR(127) is the Delete Character, which is an ASCII control character (invisible in this case) that is highly unlikely to ever show up in arbitrary input text. The Tilde character (~), on the other hand, is much more likely to show up in text input, which would cause issues with a formula that uses it as a delimiter
      • The delete character is one of the very few control characters that Google Sheets does not strip out of input / output
    • Another really important part of this step that a lot of StackOverflow answers miss is the (?s) part. This is a RegEx flag (in the RE2 flavor) that allows . to match any character as well as new line (\n). If we were to leave off this flag, then our formula would fail to work properly with input cells that contained multiple lines of input
  • REGEXREPLACE(..., "'", "''")
    • Here I am escaping any single quotes in our input string, by replacing them with '' (two single quotes).
    • This is important because if we don’t do this, then any single quotes in the input string will turn into empty cells when we split by the delimiter in the next step. ' is a special leading escape character in Sheets.
  • =SPLIT(..., CHAR(127))
    • Finally, I am splitting the string with delimiters inserted, by the delimiter itself (CHAR(127)). The result will be an array, which gets expanded out, starting in the cell you place this formula in

Emoji Issue

Despite this formula working with line breaks, symbols, special characters, and even quotes, there is one instance I have found where it might still produce unexpected results: emoji input. I’m reluctant to call this a problem or “bad” behavior, because you could also view it as the formula working correctly.

Here is the issue: certain “emoji” are displayed as one character, but actually comprised of multiple emoji joined together, by a “zero-width joiner” (ZWJ), into a ZWJ Sequence. These tend to be emoji that have a common base (such as a face), but are a variation on that base (maybe a different skin color, hair color, etc.), and especially for gendered emoji combinations. A list of some of these can be found on emojipedia.

Without getting too much into the details, here is an example and how it plays out with my formula.

This emoji – 👩‍🦰 – is Woman: Red Hair, and as long as you have an up-to-date operating system, you should see it as a single character on your screen. However, look at how the Google Sheets formula splits up that character when it appears in our input string:

ZWJ Emoji Sequence in Google Sheets - Woman with Red Hair

What has happened is that the formula has treated 👩‍🦰 as three characters, not one, which is technically correct since it is a ZWJ Sequence comprised of 👩 (Woman, U+1F469) + ZWJ (U+200D) + 🦰 (Red Hair, U+1F9B0).

The 👋 (waving hand emoji) is split as a single character, since it is not a ZWJ sequence, but rather a regular single glyph / emoji.

Note: This same behavior can be seen in JavaScript – try copying and pasting this into your console for a fun experiment: console.log([...'👩‍🦰'])

ZWJ Sequences also completely break formula that rely on using length tricks, such as this alternative solution for splitting by character.

Leave a Reply

Your email address will not be published. Required fields are marked *