Getting The Word Count of Multiple Markdown Files

Issue

I use Markdown on pretty much a daily basis. I enjoy its portability and wide acceptance, as well as its non-proprietary nature. However, certain literary features tend to be lackluster in the Markdown ecosystem, primarily because Markdown is most commonly used by developers, not writers (not to say that there is no overlap!). One feature that took me a little while to suss out is how to calculate word counts for Markdown files.

CLI / Shell Solutions 👩‍💻

Most of the code snippets below assume you are using a Unix-based OS, or an OS that has common Unix commands available (such as Windows with Git Bash or coreutils installed).

We can build up a custom string of commands to get the results we want. Let’s break it down:

Getting word count (from plain text)
- If we want to accomplish this from the command line, it turns out there is a handy command we can use, aptly named wc (word count). We can use wc -w.
Getting word count (from markdown file)
- We first need to convert the Markdown file into plain text before passing it to wc -w, to avoid inflating our count with Markdown syntax
- We can use one of my favorite markdown processors, pandoc, for converting MD to plain text.
  - pandoc --strip-comments -t plain {markdown_file_path.md}
- Putting the above together with wc for word count:
  - pandoc --strip-comments -t plain {markdown_file_path.md} | wc -w
(BONUS) - What about getting a total word count across multiple files?
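- For a bulk word count, we can use another command to produce the list of file paths and pass that into our previous command. Pandoc concatenates multiple input files into one document, so wc reports a single total
- Example (every file in the current directory, so it assumes the directory holds only Markdown files with no spaces in their names): ls | xargs pandoc --strip-comments -t plain | wc -w
- Example (recursive; -print0 and -0 keep file names containing spaces intact): find . -iname "*.md" -print0 | xargs -0 pandoc --strip-comments -t plain | wc -w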
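The one-liners above report only a grand total. If a per-file breakdown is also useful, something like the following sketch might work. It is an illustration under stated assumptions, not part of the original commands: the function name md_word_counts is made up for this example, and it assumes pandoc is on the PATH and that find supports -iname (as the GNU and BSD versions do).

```shell
# Hypothetical helper: print a word count for each Markdown file under a
# directory, followed by a total.
# Usage: md_word_counts [directory]   (defaults to the current directory)
md_word_counts() {
  find "${1:-.}" -type f -iname '*.md' -exec sh -c '
    total=0
    for file in "$@"; do
      # Convert to plain text first so Markdown syntax is not counted.
      # tr removes the padding that some wc implementations print.
      count=$(pandoc --strip-comments -t plain "$file" | wc -w | tr -d "[:space:]")
      printf "%8d  %s\n" "$count" "$file"
      total=$((total + count))
    done
    printf "%8d  total\n" "$total"
  ' sh {} +
}
```

Calling md_word_counts docs would list each file under docs with its count. Because each file is passed to pandoc as a quoted argument, names containing spaces are handled. One caveat: with a very large number of files, find may run the inner script more than once, which would print more than one total line.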

Solution Summary

So, to summarize some quick options:

Word count for single file:
- pandoc --strip-comments -t plain {markdown_file_path.md} | wc -w
Word count for multiple files:
- find . -iname "*.md" -print0 | xargs -0 pandoc --strip-comments -t plain | wc -w

Accuracy 🎯

What makes something a “word”? With the above commands, anything separated by whitespace is a word, including code, URLs, and other snippets that many might argue should not be included in the word count calculation.
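To make that concrete, here is a small illustration. The sample sentence and its example.com URL are invented for this sketch, and the text is fed to wc directly, without pandoc, just to show how tokens are counted.

```shell
# A single line of "prose" containing a URL and an inline code span.
sample='See https://example.com/a/b and `x = y + 1`'

# wc -w splits on whitespace only: the URL counts as one word and the
# code span counts as five.
printf '%s\n' "$sample" | wc -w   # prints 8
```

Only three of those eight "words" (See, and, plus arguably the URL) are what most writers would call prose.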

The best way to address this, if you care to, is to use a Pandoc filter (either a program, or Lua script). I won’t go into details, but here are two resources that cover how to use this feature:

Torrecillas: “Meaningful Word Count for Markdown”
StackExchange / Superuser
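For a rough idea of what the filter approach can look like, here is a minimal sketch. It is not taken from either resource above. The function name md_prose_wc is hypothetical, and the sketch assumes a pandoc version with Lua filter support (the --lua-filter option). The filter only drops code blocks and inline code; URLs and other non-prose content are left alone.

```shell
# Hypothetical helper: count words with code blocks and inline code removed.
# Usage: md_prose_wc file.md [more.md ...]
md_prose_wc() {
  # Write a throwaway Lua filter to a temporary file.
  filter=$(mktemp "${TMPDIR:-/tmp}/strip-code.XXXXXX") || return 1
  cat > "$filter" <<'EOF'
-- Returning an empty list from a filter function deletes the element.
function CodeBlock() return {} end
function Code() return {} end
EOF
  # Apply the filter while converting to plain text, then count as before.
  pandoc --strip-comments --lua-filter "$filter" -t plain "$@" | wc -w
  rm -f "$filter"
}
```

Running md_prose_wc post.md should then give a count closer to the prose-only figure. A more careful filter could also count words inside the filter itself, which is roughly the direction the linked resources take.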

Editor / GUI Solutions 🖱

Many Markdown editors support native word count reporting:

Nano (use ALT + D)
Typora
Mark Text
Notepad++

Oddly enough, VSCode does not natively offer a word-count feature, but there are extensions you can install (such as this one) which will provide that feature.

⚠ Warning: Many of the above editors use the same word-count rules as wc, which counts code and other non-prose content. See my notes under “Accuracy”.