Automate Prose Style Checking

I've taken a programmer's approach to proofreading by automating some basic style checks that mimic the static code analysis tools available for most programming languages.

While modern editors can carry out spelling and grammar checks, saving us from embarrassing mistakes, they are not great at helping us avoid common stylistic mistakes. That is usually left to the author to learn through practice: writing, reviewing, and getting feedback from editors or colleagues, as well as referring to the multitude of published advice on style.

While reading William Strunk’s The Elements of Style, it occurred to me that it would be possible to automate some of the checks, especially those that warn against using certain words and clichéd phrases.

It turns out, however, that someone has already built exactly that. Proselint does just this, and it also draws on style guides more modern than Strunk's work. It is written in Python.

Proselint

The application is available to install with the following command, assuming Python is already installed. Note that I would recommend installing into a sandboxed environment using virtualenv, though a full treatment of that is outside the scope of this article.

pip install proselint
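
For completeness, a minimal virtualenv setup might look something like this (the environment name is arbitrary):

virtualenv proselint-env
. proselint-env/bin/activate
pip install proselint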

To try it out, save some text in a file called prose-to-check and run:

proselint prose-to-check

The output references the line and the rule that is broken, with enough information to consider a fix. Having to first export the text to a file, however, is not something I want to do each time I run the check.

Scrape and Extract Text

It's easy enough to scrape the HTML from the web using curl, but proselint does not cope well with raw HTML, at least not at the time of this post. I decided to write a simple program to extract just the text from the main section of the page using Beautiful Soup.

Install Beautiful Soup (in the same virtualenv) with:

pip install beautifulsoup4

This script will parse a Ghost page (using my current theme at any rate):

def load(file):
    # Parse the saved page, then pull out just the title and body text.
    with open(file) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    title = soup.find("h1", "post-full-title").getText()
    body = soup.find("section", "post-full-content").getText()
    return title + body

It searches for the title in an h1 tag with the class post-full-title, and the body within a <section> tag with the class post-full-content. The text-only part of each is extracted with getText(), and the two strings are concatenated to give the result.

So given an HTML file of a post using the Casper theme from Ghost, I can extract all the text. Note that code listings also come out of this; I may revisit this to remove those elements from the tree, as proselint is unlikely to make much sense of those lines.

Finishing the Script

I placed the above procedure in a script called ghost-html-to-text.py so that it looks like this:

from bs4 import BeautifulSoup
import sys

# The above load procedure here

args = sys.argv
if len(args) != 2:
    # Expect exactly one argument: the path of the HTML file to convert.
    print("Please specify a single argument")
else:
    filename = args[1]
    article = load(filename)
    print(article)

Now this can be run with:

python ghost-html-to-text.py my-file.html

I would still have to save the file to my computer, and that is a step I don't want either, especially as I would have to repeat it after each edit. Can I combine scraping the file, extracting the text, and running it through proselint in a single command?

This wouldn't take much more time to write fully in Python, but I will keep things simple and combine the three elements (scraping, parsing, linting), each doing one specific job, by tying them together with a pipeline. I think this is more in the spirit of both Linux and Python.

Bringing it Together

Using the Linux pipe and the standard input device, I can combine all the parts as follows, with the URL of the page to check passed to curl on the first line:

curl http://failedtofunction.com/ghost-problems/ \
| python ghost-html-to-text.py /dev/stdin \
| proselint /dev/stdin

This works nicely with Ghost, as it provides a URL allowing a post to be previewed before publishing. I simply copy it from the browser and paste it in above to carry out the check.

Putting this into a shell script makes it easier to run. I've created a file called web-proselint, taking the URL of the post as its first argument, $1:

#!/bin/sh
curl "$1" | python ghost-html-to-text.py /dev/stdin | proselint /dev/stdin

As it stands, the file is not executable without changing its permissions (chmod +x web-proselint would do that), but it can be run with the following command, using a leading . to have the shell source it:

. web-proselint http://failedtofunction.com/ghost-problems/

What Else?

I need to add some additional code to the above, such as removing code blocks from the extracted text; these can confuse proselint. For example, proselint raises a warning about the type of quotation marks used, recommending “curly” rather than "straight". Code listings will generally only contain the latter, so the warning is flagged unnecessarily.
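
As a rough sketch, assuming code listings end up inside <pre> tags (and inline snippets inside <code> tags) in the rendered page, Beautiful Soup's decompose() could drop them from the tree before the text is extracted:

def strip_code(soup):
    # Remove code listings so their contents never reach proselint.
    for tag in soup.find_all(["pre", "code"]):
        tag.decompose()

This would slot into load() between building the soup and calling getText().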

It would also be nice to get this working for different themes, or even for WordPress sites. Some sort of configuration could specify the tag and class of the elements to extract, and some code would be needed to figure out which theme a site is using.
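
As a sketch of what that might look like, the selectors could be pulled out into a mapping keyed by theme (the theme names and selectors below are illustrative assumptions, not tested values):

from bs4 import BeautifulSoup

THEMES = {
    "casper": {"title": ("h1", "post-full-title"),
               "body": ("section", "post-full-content")},
}

def load_themed(file, theme="casper"):
    selectors = THEMES[theme]
    with open(file) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    # Look up the tag/class pair for each element rather than hard-coding it.
    title = soup.find(*selectors["title"]).getText()
    body = soup.find(*selectors["body"]).getText()
    return title + body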

I could also add commands to activate the virtualenv that I created and to deactivate it afterwards. This would make the review process smoother by removing the worry about environment issues, but as with the use of virtualenv itself, a proper treatment is outside the scope of the post (a rough sketch follows below). Packaging it into a standalone executable is another possibility.
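
A sketch of what that could look like, extending web-proselint and assuming the virtualenv lives in a directory called proselint-env next to the script:

#!/bin/sh
# Activate the virtualenv, run the check, then restore the environment.
. proselint-env/bin/activate
curl "$1" | python ghost-html-to-text.py /dev/stdin | proselint /dev/stdin
deactivate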

An alternative approach would be to use Ghost's export facility - I believe this represents the post in JSON format, which may be more convenient to deal with than HTML and might be worth investigating.
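
If it does, reading it could be as simple as the following sketch; note that the nesting under db[0]["data"]["posts"] is an assumption about the export format and would need to be verified against a real export file:

import json

def load_export(file):
    with open(file) as fp:
        data = json.load(fp)
    # Assumed layout; verify against an actual Ghost export before relying on it.
    posts = data["db"][0]["data"]["posts"]
    return [(post["title"], post.get("html", "")) for post in posts]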

Proselint, spelling and grammar checkers, and any other automated means should not replace having a colleague review your post before you publish it - humans still lead the way in spotting the major issues with a piece of writing.