
Introducing the Selector Finder


Ever wanted to delete a CSS ruleset, but didn’t know if it was being used on a site? Maybe you want to know how many empty h3s you’ve got. Or maybe you want to know how often a .title and a .subtitle are used together.

I built a thing that can answer those questions: the Selector Finder.

Give it a sitemap and a CSS selector, and it’ll scour the site looking for any matches.

Talking Tech and Selecting Selectors

I debated about whether to write this in Python or JavaScript. I’ve done quite a bit of web scraping and automation in Python, so it was my first thought. Except, of course, Python would require me to use the BeautifulSoup library. And BeautifulSoup ain’t pretty to write.

So, Node seemed like a more comfortable choice. Right off the bat, I knew the perfect library: Cheerio. It’s a server-side version of jQuery, so blending that with a basic Ajax library (Axios) would make this pretty straightforward.

I did decide that it would be handy to screengrab the elements, and to at least try to account for SPAs, so I did get Puppeteer involved.

By the time I finished, I was using a grand total of six libraries for this thing. So it’s pretty lightweight by NPM standards.

Setup

This isn’t quite an NPM package, and it isn’t quite a pure CLI; it’s still something in-between. You’ll want to download the code from GitHub, crack open the command line, go to that directory, and then run npm install to get started.

Usage and Features

The one thing this requires is an XML sitemap. From the command line, you’ll give it the URL to your sitemap, along with a CSS selector.
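To give a sense of what the sitemap contributes, here's a rough sketch of pulling page URLs out of sitemap XML. This is not the Selector Finder's actual code: `extractSitemapUrls` is a hypothetical helper, the sample XML is made up, and a regex stands in for a proper XML parser.

```javascript
// Hypothetical helper: collects the text of every <loc> element in a sitemap.
// Assumption: a regex is good enough for a sketch; real code should use an XML parser.
function extractSitemapUrls(sitemapXml) {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  return Array.from(sitemapXml.matchAll(locPattern), (match) => match[1]);
}

// Made-up sample sitemap, for illustration only.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about/ </loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```

Each URL that comes back is a page to fetch and run the selector against.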

Finding that Selector

This here will find all the instances of the .sources element on my blog:

node index.js -u "https://blog.frankmtaylor.com/wp-sitemap-posts-post-1.xml" -s ".sources"

I’ll get something like this in the console, to tell me it’s started:

[Screenshot: the "starting" message in the console, which shows the full URL to the sitemap, the limit, and the CSS selector]
It’s just a small summary of the arguments I’ve fed to it.

And when it finishes, I get a summary of the results:

[Screenshot: the ending console message, showing how long the process took, how many pages were scanned, how many pages have a match, and total results. It also shows the name of the file containing the results.]
The summary is a nice thing to drop in a Jira ticket if you have to gauge level-of-effort.

But, where are the results?

Oh, they’re in that pages.json file mentioned in the FileName. They’ll look something like this:

{
  "cssSelector": ".sources",
  "totalPagesSearched": 189,
  "totalMatches": 12,
  "pagesWithSelector": [
    {
      "url": "https://blog.frankmtaylor.com/2013/07/05/css-the-breakdown-part-one-the-selector-and-grammar/",
      "totalMatches": 1,
      "elements": [
        {
          "tag": "div",
          "attributes": {
            "class": "sources"
          }
        }
      ]
    }
  ]
}
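If you're curious how per-page matches could roll up into that shape, here's a minimal sketch. It is not taken from the tool's source: `summarizeResults` is a hypothetical helper, the input data is made up, and it assumes `totalMatches` at the top level counts matched elements across all pages.

```javascript
// Hypothetical helper: rolls per-page matches up into the shape shown above.
// Assumption: each page result looks like { url, elements: [...] }.
function summarizeResults(cssSelector, pageResults) {
  const pagesWithSelector = pageResults
    .filter((page) => page.elements.length > 0)
    .map((page) => ({
      url: page.url,
      totalMatches: page.elements.length,
      elements: page.elements,
    }));

  return {
    cssSelector,
    totalPagesSearched: pageResults.length,
    // Assumption: the grand total is the sum of every page's matches.
    totalMatches: pagesWithSelector.reduce((sum, page) => sum + page.totalMatches, 0),
    pagesWithSelector,
  };
}

// Made-up input: two pages scanned, only one of them has a match.
const summary = summarizeResults('.sources', [
  { url: 'https://example.com/a/', elements: [{ tag: 'div', attributes: { class: 'sources' } }] },
  { url: 'https://example.com/b/', elements: [] },
]);

console.log(JSON.stringify(summary, null, 2));
```

Pages without a match still count toward `totalPagesSearched`, but they stay out of `pagesWithSelector`, which keeps the file focused on the hits.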

Looking at Logs

This utility writes the logs not just to the console, but to a log file, too. If there’s an issue searching on any page, it’ll pop up in the log.txt file.

[Screenshot: a sample of a run where the exact same messages shown in the console appear in the log file. Errors would also display here.]
Errors will end up between the start and finish messages.
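Logging to two places at once is simple enough to sketch. What follows is an assumption about how it could be done, not the tool's real logger: `createLogger` is a hypothetical name, the messages are made up, and the file lands in a temp directory so the sketch doesn't touch a real log.txt.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical logger factory: every message goes to the console
// and is also appended to the given file.
function createLogger(logFilePath) {
  return function log(message) {
    console.log(message);
    fs.appendFileSync(logFilePath, `${message}\n`);
  };
}

// Use a fresh temp directory so this sketch never clobbers an existing log.txt.
const logFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'selector-finder-')), 'log.txt');
const log = createLogger(logFile);

// Made-up messages: an error lands between the start and finish lines.
log('Starting search');
log('Error: could not fetch https://example.com/missing/');
log('Finished search');
```

Appending rather than overwriting is what lets an error sit between the start and finish messages in the order it happened.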

Additional Options

Limit the number of pages

If it’s a massive sitemap, you can set a limit on how many pages it scans. Use the --limit or -l parameter to provide a number. If it’s zero, that means there’s no limit.

node index.js -u "https://site/sitemap.xml" -s ".sources" -l 30
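The "zero means no limit" rule can be sketched in a couple of lines. `applyLimit` is a hypothetical helper and the URLs are placeholders; the tool's own handling may differ.

```javascript
// Hypothetical helper: trims the URL list to the limit, treating 0 as "no limit".
function applyLimit(urls, limit = 0) {
  return limit > 0 ? urls.slice(0, limit) : urls;
}

// Placeholder URLs, for illustration only.
const sitemapUrls = ['https://site/a', 'https://site/b', 'https://site/c'];

console.log(applyLimit(sitemapUrls, 2).length); // 2
console.log(applyLimit(sitemapUrls, 0).length); // 3
```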

Change the output file name (a little)

If you don’t like pages.json as your file name, you can change it to something.pages.json. Use the --outputFileName or -o parameter.

node index.js -u "https://site/sitemap.xml" -s ".sources" -o "sourcesclass"
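Going by the something.pages.json pattern, the command above would presumably produce sourcesclass.pages.json. Here's a small sketch of that naming rule; `buildOutputFileName` is a hypothetical helper, not a function from the tool.

```javascript
// Hypothetical helper: prefixes the default file name with the value passed to -o.
// Assumption: with no prefix, the name stays pages.json.
function buildOutputFileName(prefix) {
  const defaultName = 'pages.json';
  return prefix ? `${prefix}.${defaultName}` : defaultName;
}

console.log(buildOutputFileName('sourcesclass')); // sourcesclass.pages.json
console.log(buildOutputFileName());               // pages.json
```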

Flag it if it’s a single-page app

By default, this uses Cheerio. But Cheerio isn’t going to find stuff that’s rendered with Vue, Angular, etc. on page load. In those cases, we need some rendering. Use --isSpa or -d to indicate that this is a single-page application. It will take longer to get your results!

node index.js -u "https://site/sitemap.xml" -s ".sources" -d

Take a screenshot

Sometimes it’s not enough to have the tag name and some attributes. Maybe you’d like to see the element. If that’s the case, you can take a screenshot of the element on the page with --takeScreenshots or -c.

node index.js -u "https://site/sitemap.xml" -s ".sources" -c

Wrap-up

This is going to end up being a permanent member of my toolkit, because it does things my IDE can’t. Provided I have a static build, my IDE can search for HTML elements and exact matches. But it takes a certain level of RegEx skill not found in the general population to even try fuzzy matches just in a class name.

So the Selector Finder ends up being a really handy way to find out how often some bit of CSS is used, where it’s used, and even get an idea what it looked like before I broke it.

I am considering making it a proper NPM package + CLI utility going forward, but what’s holding me back is the use-case of wanting to scan a local site where you may not have a sitemap, but a directory instead. So stay tuned.