Personal-scale Web scraping for fun and profit
Off we go! (So what are we doing here?)
Ever hit walls on a project, because you need data but there's no API?
I've run into this a lot lately -- and the common theme you'll find here, as I dart around several projects in an otherwise winding log of what I've been "working" on over the last few months, is that all of them involve some variation of the same exercise -- crawling websites to grab publicly-available data, and outputting it in a structured format.
They're arranged loosely by order of complexity, and they mostly (but not exactly) mirror the progression I made from zero experience with Web scraping to building an automated crawler to pull down job search listings for me and output them to files.
Now, fair warning up front, it's long -- it goes into a fair amount of detail at each step, and not including code samples it clocks in at roughly 10,000 words in total. You might want to use the table of contents to jump around, but to give you a tl;dr up front, it'll detail how I:
- Grabbed a dataset on Futurama from a wiki for an eventual side project
- Pulled down the contents of the first reference material I used to learn to code, and converted them to Markdown
- Wiped my reddit history in protest of the whole mod purge thing
- Automated a browser to aid in my job search
This guide is written primarily targeting vanilla JavaScript, but you will see polyfills as I'm not strictly talking about browsers. There'll be a couple of other exceptions that I'll explain as I go, but aside from a bit of discussion of snags you might run across in an example using React, it doesn't assume prior knowledge of anything other than a working grasp of the language in general, and of browsers' DOM APIs in particular. It also shouldn't require a deep expertise in either, as it aims to build on itself in more or less the way my collection of experiments did.
Now that we've gotten that out of the way, let me start by establishing some of the problems I was trying to solve. So there are several nails I've noticed floating around my own workspace once I picked up this metaphorical hammer, but to give this at least a little focus I'm going to give you some background on the three that were all annoying me at once when this all started.
First: one of my backlogged portfolio projects involves running data analysis on scripts from Futurama, mostly just as an excuse to do something fun. Kaggle has a dataset for this, but it's outdated, and meanwhile there's a wiki. Moreover, Kaggle does allow you to post code with your datasets, but since I speak JavaScript better than Python, and wanted to roll this myself anyway, I opted to start in a browser with a trick I'd picked up while learning to run brute-force attacks on Web applications. (I'll get into this a bit later too...)
Second: for reasons that this post isn't about (extreme tl;dr, just to preempt further questions: it was a workplace thing), I recently found myself in need of legal advice. But as I was poring through directories -- which was the initial final straw, the thing I never wanted to do manually again -- many of the law offices I was checking out turned out to have either no website, or just an ancient one. Now, since I was already looking to engage their services myself, I was obviously just going to append to my outreach messages that "just as a professional courtesy, let me tell you about HTTPS..." -- but even in a small practice area, I found enough of them to make me wonder if there's maybe a market there.
Now, how sustainable this realization is as a freelance business is still one of those backlogged research-phase projects, and accordingly I've been avoiding building something like that out at a larger scale until I can get a clearer sense of where the boundaries are around terms of use, anti-abuse measures, etc. for commercial lead-gen. (Especially if I was going to use it to hit up lawyers in particular.)
But that's not the only use case I have for this...
Finally: for a while I've slowly been building out tools here and there to automate fetching and organizing job listings -- my prototype involved querying Hacker News via an API and building out a skeletal imitation of Tinder for categorization purposes, but life things got in the way of building out anything more extensive (and HN's forum threads on this aren't the most consistent, which makes parsing them for relevant metadata more of a task). Using tools to speed up my own job search is still effectively a personal activity, even if finding somewhere I like and getting hired there would, of course, make me money. Plus, in most cases, listings don't have things like direct email contacts (which could raise issues around personally identifiable information), which mitigates some of the concerns about this I'd otherwise want to address up front.
And, the process of running through a set of publicly available listings is roughly the same -- given a paginated list, get top-level entries and links to the job listings themselves, get the link to the next page, fetch that page and parse its HTML, and repeat as long as there is a next-page button. (And given a list that's not paginated, you can automate scrolling using JavaScript for a similar effect.)
But there's an even less complex, lower-stakes environment I could use to test this.
IT'S DANGEROUS TO GO ALONE! (Disclaimers)
Before we continue, though, a few warnings.
In the US, where I'm writing this from, current case law (as of this writing) generally holds that scraping public-facing data is legal, and this is written with that knowledge in mind. (Here's a good explainer of a case in this area that LinkedIn pursued, from when SCOTUS ruled on it a little over a year ago.) That said: I don't know what jurisdiction you're reading this from (or particularly care to), I don't know your use cases, and even if I did, I'm not a lawyer.
There are a variety of topics that intersect with this, ranging from applicable terms of use, intellectual property related to whatever data you're gathering, data privacy laws, and probably more I'm not thinking of -- all of which may impact whatever you might want to do with its contents. None of what I'm telling you here is legal advice, and I would strongly recommend doing your own research, in a manner tailored specifically to your project goals -- especially if those goals involve commercial use.
Good news, everyone! (Getting started)
For the first piece, enter The Infosphere. This Futurama wiki contains the episode transcripts I need to gather data for my other project -- what are the 10 most common words actually used by a bending unit?
Making `fetch` happen (...that's it. that's the reference.)
My initial approach (here, and eventually on a jobs board) was to just write and test my code using a browser. There are other tools that can be used for this, and I'll elaborate on that a bit later (along with some of the limitations attached), but in getting relevant page information to construct my queries, it was easier to just be able to inspect the contents anyway. For that matter, it's also trivial to copy any data structure from the browser console straight into my clipboard, to then manipulate as needed anywhere else. So I pulled up the Transcripts page directly, ran a query on the episode/movie list to get relevant links (and some metadata like titles, episode order, etc), and ran `fetch` requests against them. While a lot of my usage of this in workplaces and otherwise has looked something like:
// var names are just placeholders here
const res = await fetch(endpoint, options)
const data = await res.json()
I've since learned that it's possible to get raw HTML responses using `.text()` instead. And further, on tweaking this for these experiments, that `res` above becomes unreadable once its body has been parsed -- and that this becomes moot anyway, because I can just get the final result out of a promise chain and leave `res` in an anonymous function rather than `await`ing each step.
As an aside, this piece actually came from a fourth project... for fun, I've slowly been going through CTF puzzles hosted at a site called Over the Wire -- I won't get into detail about how, since they ask users not to post writeups, but to be vague about it, some level solutions require brute-forcing your way to a solution (by design), by attempting possible solutions in a loop until you have the right one. Since the Natas series in particular (which is what I'm currently about halfway into) consists of Web application challenges, a natural path to take here is to implement this via looped `fetch`es.
Funny side story: at a previous workplace, a week after getting scolded for not spending my off-hours technical dives on skills that related more directly to my job, I then found myself having to ssh in and `pkill -9` a stuck process in order to stop my development box from hanging -- something that I also learned how to do from these puzzles.
Speedrun skips (Taking shortcuts)
Anyway, skipping over how this would look with JSON just for brevity, I've since ended up writing functions to wrap this in something a little less verbose:
// this isn't async because even if it returns a promise `await` is going to read that anyway
const fetchHTML = (endpoint, options) => fetch(endpoint, options)
.then(res => res.text())
From there, I was able to get a set of full response bodies with scripts for every episode. But then I had to also query those.
This is where `DOMParser` comes in. The MDN article is as usual a great reference for more detail, but in essence you can create a `DOMParser` instance to feed raw HTML (and other) strings into, and query them in the same manner you would the page you're actively viewing.
Mirroring the shorthand above, we can represent this in a function:
const parseHTML = page => new DOMParser().parseFromString(page, 'text/html')
Taking this a step further, you can chain these together further into a single operation:
// abbreviating these args since you just saw them above
const getDoc = (e, o) => fetchHTML(e, o).then(res => parseHTML(res))
Now normally we'd also be doing some kind of error handling, but realistically you would want to wrap these network calls in `try/catch` blocks no matter how you're doing them, so it's not something I'm really explicitly thinking about here.
There's one more pattern you'll be using a lot:
// given q, to represent a query
// and d, to represent *a* document. more on this in a second.
[...d.querySelectorAll(q)]
The reason you want this is that a `NodeList` is not an `Array`, and doesn't have access to most of the methods the latter has available. But, by spreading the list into an array, you can then run any array method you need on the results.
One particularly handy function you'll probably use a lot is to run `.map()` on the results, so that you can get specific properties like `innerText` directly instead of having to run a `for` loop on the bare `NodeList` and `push` into some other array. With `.filter()` you can use additional conditions based on DOM properties in addition to those in your CSS query, or you can make more complex queries using `.find()` to go beyond what you can do with `querySelector` alone.
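To make that concrete, here's a quick sketch (the `li.card` and `h2` selectors are made up for illustration):

```js
// spread the NodeList into a real array, then use array methods on it
const cards = [...document.querySelectorAll('li.card')]

// map: pull just the text of each card's heading
const titles = cards.map(card => card.querySelector('h2')?.innerText)

// filter: keep only cards that actually link somewhere
const linked = cards.filter(card => card.querySelector('a'))

// find: grab the first card whose text mentions a keyword
const match = cards.find(card => card.innerText.toLowerCase().includes('remote'))
```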
We're going to represent this below as a similar function to the above, but with a slight variation. It'll make sense in a second:
const getDOMQueryResults = (selector, doc = document) => [...doc.querySelectorAll(selector)]
So this sets a second argument, `doc`, with `document` as a default value. For the unfamiliar, this means that you can optionally call it with a different document if, say, you've got another tree you've created using a `DOMParser` and want to query against instead.
That said, you'll want to remember what's happening under the hood here -- in the event that you need to access a property like `children`, those are also array-like collections (an `HTMLCollection`, in that case) and not arrays.
Tactical Espionage Action (Detection and mitigations)
Before we go any further, I'm also going to define a couple more quick utility functions here:
const sleep = (timeout) => new Promise(resolve => setTimeout(resolve, timeout))
// alternatively, if you want it to resolve a specific value:
// new Promise(resolve => setTimeout(() => resolve(your_value), timeout))
const getRandomTime = (target = 12, window = 3) => {
// we can guard this more thoroughly, but skipping arguments
// is the point of the default parameters anyway
if (window < 0) window = Math.abs(window)
// generate a random second count
const roll = () => Math.floor(Math.random() * (target + window))
let seconds = roll()
// if random selection isn't within the window specified,
// reassign with a new value until it is
// we don't need to check upper bounds, as that's already a hard limit
// in the randomization above
while (seconds < (target - window)) seconds = roll()
// generate ms separately. This way, we're only dealing in this level
// of precision once, no matter how many re-rolls happen above
const ms = Math.floor(Math.random() * 1000) / 1000
return seconds + ms
}
const getRandomMilliseconds = (target = 12, window = 3) => getRandomTime(target, window) * 1000
With `sleep`, you can block the execution of an async function for the timeout of your choosing. This is important -- you don't want to hammer a website with a flood of requests all at once, because any platform with reasonable load handling is going to do something about that. (Unlike with, say, an application that's insecure for educational purposes, where just asking users "don't DDoS us" on the honor system is at least somewhat feasible.) Accordingly, we want a way to be able to slow our page calls down when we need to.

You can also use it between individual steps of your code in modern JavaScript environments, since the language has had support for top-level `await` since ES2022.
Meanwhile, `getRandomTime` is more or less what it sounds like. You optionally pass in (up to) two numbers: one is an anchoring point you want to weight your randomization around, and the other is a maximum distance. Using the defaults as an example, this would give you a random number of whole seconds between 9 and 14 (plus a random fraction of a second). We could really overengineer all of this (or just move to TypeScript) to check that the arguments passed into it aren't breaking, but the point of the default parameters in the first place is to just avoid having to use them at all. Instead, I've just overengineered it a little, so that our random intervals are happening at the same level of precision as what we can pass into `setTimeout`.
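As a quick usage sketch (the URL here is just a placeholder):

```js
// pause an async flow for a randomized interval before the next request
const delay = getRandomMilliseconds() // somewhere around 9-15 seconds, per the defaults
await sleep(delay)
const doc = await getDoc('https://example.com/page/2') // hypothetical next page
```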
Polymerization (Putting it all together...)
So I'll start diving into how this all fits together with the one I built my prototype with: harvesting episode scripts from Futurama. This is all running in a browser console, at the Infosphere's episode transcript list page.
Some limitations to be aware of:
- We're not covering the Hulu revival, since it's still ongoing and the data will be incomplete anyway
- Moreover, about 20 later episodes contain incomplete transcripts, and won't be included here
- Similarly to how Star Trek handles its canon (I know, I know... we don't talk about that, because wars were declared), we're not counting anything that isn't a TV episode or a movie. Comics, video games (mobile or otherwise), and the like are all treated as out of scope
- And more just a snag we'll have to account for than a limitation in our data, but obviously we're only counting the dialogue from the movies once, so we'll be skipping over the broadcast versions
// this is the selector for table elements representing entries we'll want here.
// it's useful to have these as vars, because you can break out a lot of your query logic
// to recycle that in other places that you might want to use it
const tableCellSelector = '.oCentre'
const episodes = getDOMQueryResults(tableCellSelector)
.map((e, i) => {
// don't include the current revival; its data isn't stable enough to use
if(e.parentNode.innerText.includes('Hulu')) return null
const cells = [...e.children].map(e => e.innerText.trim())
// get the transcript page's title, because that's a constant
// and then skip to the end and get the release order
// release order is distinct for movies and restarts at 1
// (we can skip unused entries using commas, and we'll be using this again later)
const [pageTitle, , , ,releaseOrder] = cells
// (after offsetting for zero-indexing,)
// we can equality check this against the array index
// to get whether this is a broadcast episode or a movie
// since these have different data associated with them
// (also since the movies are duped as "season 5")
const isMovie = releaseOrder != i + 1
// use pageTitle to find the transcript link (instead of doing a whole DOM traversal)
// then strip the preceding text to get the actual title
// NOTE: this can also be expressed just as .href -- see below
const linkQuery = `a[title="${pageTitle}"]`
const transcriptLink = document.querySelector(linkQuery)
?.getAttribute('href')
const title = pageTitle.replace('Transcript:', '')
if (isMovie) {
const [ , dvd, bluRay, productionCode ] = cells
return {
title,
isMovie,
release: { dvd, bluRay },
productionCode,
releaseOrder,
transcriptLink
}
}
const [ , airdate, productionCode, broadcastOrder ] = cells
return {
title,
isMovie,
airdate,
broadcastOrder,
productionCode,
releaseOrder,
transcriptLink
}
}).filter(e => e)
// remember that "given *a* document" bit?
for (let e of episodes) {
// don't dupe repackaged versions of the movies
if (e.broadcastOrder?.startsWith('S06')) {
e.lines = 'see movie entry for original transcript'
continue
}
// get and parse script pages
const doc = await getDoc(e.transcriptLink)
// detect if there's any mention of there being raw text
// that hasn't properly gone through edits
if (checkTextContents(doc.body, 'not meant to be read')) {
e.lines = 'incomplete transcript'
continue
}
// remove text sections containing episode timestamps
// we don't need them for this, they mess w the separator
// that we're actually trying to grab here, and
// they're not consistent across pages anyway
getDOMQueryResults('span.timestamp', doc).forEach(e => e.remove())
const scriptContentsQuery = '.mw-parser-output>*'
// query each script page for the content we want, including running some filtering logic on it.
// what we're ultimately doing here is getting individual blocks of text
// representing lines or direction in the script,
// and then filtering out tables containing metadata.
// there's *definitely* more cleanup I didn't get to,
// because I got to fleshing out the scraping stuff first
e.lines = getDOMQueryResults(scriptContentsQuery, doc)
.map(({ innerText, tagName }) => {
if (tagName == 'TABLE') return null
// we don't care about...
const line = innerText
.replace(/\[.*\]/g, '') // actions, just words
.replace(/\<.*\>/g, '') // stray, broken tags
.replaceAll('\n','') // extraneous whitespace
// if it's *only* the above,
// the whole section should be filtered out
if (!line) return null
const [speaker] = line.split(':')
return { speaker, line: line.replace(`${speaker}: `, '') }
}).filter(e => Boolean(e))
}
So here's what I'm doing with this, in brief:
- I'm querying the tables of episodes and movies, and getting various metadata
  - This includes the `href` for the transcript link on each page.
- Using that link, I'm then making `fetch` requests (using the wrapper functions defined earlier) to get those pages' contents
- I'm then taking that HTML, returned to me as response data (and `await`ed out of a promise chain), and running it into a `DOMParser` -- at which point I can query the contents of those pages, without actually having to navigate through them.
As an additional note: you can read `href` as an object property directly off of `<a>` tags, and I'll be doing so through the rest of this piece. That said -- you can use `getAttribute` to get whatever other attribute might be on an element, and we'll also be doing that later.
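The practical difference is worth a quick note: in the browser, `.href` hands back a fully resolved URL, while `getAttribute('href')` gives you the raw attribute value as written in the markup. A small sketch (the markup here is a simplified stand-in for the wiki's actual link structure):

```js
// given markup like: <a title="Transcript:Space Pilot 3000" href="/Transcript:Space_Pilot_3000">
const link = document.querySelector('a[title^="Transcript:"]')

link.getAttribute('href') // '/Transcript:Space_Pilot_3000' -- exactly what's in the markup
link.href                 // the same path, resolved against the page's base URL
```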
We can also express the latter half as a `.map()` call instead, like this:
// rate limiting doesn't matter here, but for illustration
// I'm going to show you how to deal w that anyway
const results = await Promise.allSettled(episodes
.map(async (e, i) => {
// i is the array index, which we'll use to stagger requests
// the async operations don't block each other, so with this
// we can set them to go off at random times
// that are loosely tied to the array indexes
// to keep this from getting *too* all over the place,
// we can tighten the window getting multiplied by each index
const interval = i * getRandomMilliseconds(9, 1)
await sleep(interval)
try {
// fetch logic
} catch (err) {
console.error(err)
return e
}
// everywhere you're assigning `e.lines`
const lines = [] // your actual value here
return { ...e, lines }
}))
`Promise.allSettled` will wait until all async calls in the array are finished, without necessarily caring about whether they resolve as intended. (Unlike `.all`, which rejects if any Promise in the array does.)
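For reference, every entry that `allSettled` hands back has the same shape, which is why the example above (and others later) can destructure `value` off of each result:

```js
// `requests` here is a stand-in for any array of promises
const settled = await Promise.allSettled(requests)
// each entry looks like one of:
//   { status: 'fulfilled', value: /* whatever the promise resolved to */ }
//   { status: 'rejected', reason: /* the error it rejected with */ }
const successes = settled
  .filter(({ status }) => status === 'fulfilled')
  .map(({ value }) => value)
```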
Go Beyond (the browser)...
(PLUS ULTRA!)
Now, everything we've done up until now has been client-side -- that is, running in the browser. And in many cases, that'll be necessary as some part of this process. Pages could render content dynamically after being loaded, or even just block other user agents entirely.
But that's not always our only option.
Unlike a lot of larger-scale sites, which for various business reasons would actively try to do something about this kind of thing, this one is also serving the full page contents as raw HTML -- so with a few modifications, we can also do this directly within a server-side runtime like Deno. If you're newer to Web development, or skew toward frontend, you might wonder why I'm not suggesting Node for this. In brief, it's that Deno offers a lot of useful on-ramps. It uses standard Web APIs wherever it makes reasonable sense, so you can reach for a lot of the same functionality (`fetch`, `WebSockets`, and `localStorage` are some immediate examples that come to mind) that you're already used to in a browser, and they all mostly work in the same way, aside from some mild differences that come with running them on a server. (More on that later.) Plus, it comes with a broad array of tooling out of the box, so you don't need to make as much use of third-party libraries to handle things like behavior-driven tests, common text formats like CSV and YAML, etc. (And in fairness, Node has been closing this gap somewhat in response, but the point is that we don't want to hunt for third-party tools when we don't have to.)
Kitson Kelly, former Deno core developer, once also opined that "Deno is a browser for code," but (aside from that post having since link-rotted anyway) his piece on this topic focuses more on under-the-hood functionality around permissions, dependency management, etc. This description is more about ergonomics, and is otherwise shaped more by Node/Deno creator Ryan Dahl's description, from his talk on A New Way to JavaScript: a scripting environment akin to Ruby or Python, but for Web technologies.
To run a script in this environment, you'd just run `deno run {options} {script}`, passing in your relevant options and targets. Some examples of options you might pass in include locations for things like config files, or permissions you're granting to your script. (It'll prompt you as needed for those if you don't use the flags.) Some common ones for the latter include `--allow-read`, `--allow-write`, and `--allow-net`, which are all more or less what they sound like. You can also scope those permissions to specific locations (such as `--allow-net=localhost`) or selectively deny access to a given scope in the same fashion (`--deny-net=google.com`). If you want to give permission to everything you can use the `-A` flag, but this isn't typically recommended.
I'm only going to lay out the parts that would actually be modified here, but to give you an idea:
import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts'
// or JSDOM or your DOM parsing lib of choice
// neither Deno nor Node comes with this natively, so we need a package for this
// that said, this is pretty much a straight polyfill
// generally you'd use an import map here for versioning and location
// but it's not especially important here
// we already know where we're going, so we can cheat here a little
const rootURL = 'https://theinfosphere.org'
const transcriptsPageURL = `${rootURL}/Episode_Transcript_Listing`
// first, we're going to grab the page contents for where we'd been pointing our browser
// `episodes` will be querying this var instead of `document`
const transcriptsDoc = await getDoc(transcriptsPageURL)
// inside the .map() call...
const transcriptHref = transcriptsDoc
.querySelector(`a[title="${pageTitle}"]`).href
// getting the individual episode links
const link = new URL(rootURL)
link.pathname = e.transcriptLink
const doc = await getDoc(link)
// we can also write the contents directly to JSON
await Deno.writeTextFile(yourFile, JSON.stringify(yourData))
// or, skip the await and use writeTextFileSync
// alternatively, you can use jsonfile, a package in Deno's registry
// these contain some old helper functions previously in the standard lib
// that have since been removed in favor of native functionality
// you can mostly render the read functions moot, by using import attributes,
// or copying the above pattern with readTextFile/JSON.parse, but writeJson
// is still a bit more ergonomic with how it handles formatting options
The `URL` object -- created by calling the `new URL()` constructor on a string containing... well, a URL -- will give you specific properties of a URL. It comes with various properties like `protocol`, `host`, and `origin` (`https:`, `www.example.website`, and `${protocol}//${host}`, in that order) and more. You can convert it back to a plain URL string using `toString`, or just grab it via the `href` property. Further, you can also reassign properties like `pathname`, and not only will others update as necessary, but you can also just pass the URL object itself directly into a `fetch` request.
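A quick sketch of that in action (the specific paths are just illustrative):

```js
const link = new URL('https://theinfosphere.org/Episode_Transcript_Listing')

link.protocol // 'https:'
link.host     // 'theinfosphere.org'
link.origin   // 'https://theinfosphere.org'

// reassigning one piece updates the rest of the object
link.pathname = '/Transcript:Space_Pilot_3000'
link.href     // 'https://theinfosphere.org/Transcript:Space_Pilot_3000'

// and the object itself can be handed straight to fetch
const res = await fetch(link)
```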
But in this environment, you're going to run into some hard limitations when pages bother mitigating this -- like pretty much anything where the page is grabbing information and dynamically modifying the page contents instead of just navigating to a new page -- so we're not going to focus too heavily on that here.
For now, we're going to talk about what we can do client-side -- we can automate a browser to do this later, but this is about the initial experimentation phase of grabbing data from [a couple of places I'm not going to name], where you're just doing the actions that you would later script. This more accurately reflects the experiment of poking around the page I was aiming to crawl in order to tinker with this, and there are some cases you would explicitly want the browser console anyway -- for instance, `fetch`ing from the same domain can get you around having to think about CORS.
Note that another option to handle client-side operations you might want to repeat is to use a browser extension like Violentmonkey to save userscripts, and run them at matching URLs on page load.
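If you go that route, the userscript itself is just your existing console code with a metadata block on top telling the extension where to run it -- something roughly like this (the name and match pattern are placeholders):

```js
// ==UserScript==
// @name     Transcript scraper helpers
// @match    https://theinfosphere.org/*
// @run-at   document-idle
// ==/UserScript==

// ...then paste in the helpers from above (getDoc, getDOMQueryResults, etc.)
```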
BUT WAIT THERE'S MORE (examples)
Back to my main point... the `DOMParser` piece opens up a lot of possibilities, because it also directly enables you to walk through paginated listings. If the "next" button is a link, it would look something like this:
const getPageListings = (doc) => {
// your page query logic
// you'll be using querySelector/getDOMQueryResults a *lot* here
// there are a few ways to get grouped results:
// traversal through an element's children/parents,
// descendant combinators,
// merging the results of several getDOMQueryResults calls...
// it depends on what you want and how the page is structured
// but I'll give you a really simple implementation:
const selector = 'li' // just for ease of use
// assuming all fields exist on every entry
const titles = getDOMQueryResults(`${selector}>h2`, doc).map(e => e.innerText)
const links = getDOMQueryResults(`${selector}>a`, doc).map(e => e.href)
const { length } = titles
// fill an empty array, then transform each entry to an object
// taking the value of the same index across all other arrays
const results = [...Array(length)].map((e, i) => ({
title: titles[i],
link: links[i]
}))
return results
}
const getDataFromDirectory = async (page = '', listings = []) => {
const seconds = getRandomTime()
const timeout = seconds * 1000
const doc = listings.length ? parseHTML(page) : document
const newListings = getPageListings(doc)
const currentListings = [...listings, ...newListings]
const nextButtonQuery = '' // YOUR SELECTOR HERE
const nextPageURL = doc.querySelector(nextButtonQuery)?.href
console.log({ currentListings })
if (!nextPageURL) return currentListings
await sleep(timeout)
try {
const nextPage = await fetchHTML(nextPageURL)
return await getDataFromDirectory(nextPage, currentListings)
}
catch ({ message }){
console.error(message)
return currentListings
}
}
To briefly run through how this works:

- This starts with an empty array, `listings`, and gets the results listed on the first page.
  - Both parameters are optional; you don't need either of them when running this from the first page of a larger set of search results.
  - Your page query logic will vary, a lot, based on what you're looking for, and where. For now, as a placeholder, this is defined as `getPageListings`.
- From there, it's going to look for the "next" button
  - If there isn't one, the function exits here, returning the current set of data
  - Otherwise:
    - It grabs the URL, and runs a fetch request against it after a defined timeout.
    - It then recursively calls itself, feeding in the next page's HTML and the current data
    - If any of this fails, it logs the error and returns the current data
For extra fun, `getPageListings` itself also has a use case for a `DOMParser`. Say that you have a site that actively wants to obfuscate the use of scrapers, by shuffling around as many of its CSS classes as it can? It'd be a real shame if the entire solution to this problem were a meme. Obviously.
And obviously, I've got you covered. Have a sample implementation:
const selector = 'li' // just for ease of use
const cards = getDOMQueryResults(selector)
const results = cards.map(({ outerHTML }) => {
// WE HEARD YOU LIKE DOM QUERIES
// SO WE PUT A DOM QUERY IN YOUR DOM QUERY
// SO YOU CAN SCRAPE DATA WHILE YOU SCRAPE DATA
const doc = parseHTML(outerHTML)
// your DOM queries here
// such as...
const { innerText: title } = doc.querySelector('h2')
return { title }
})
But truthfully, you don't really need the second one, because element instances come with their own `querySelector`.

So we could just express `results` this way instead:
// since we're just accessing the element object directly
// we can also name it something meaningful, like the tag name
const results = cards.map((li) => {
// (it's still a meme though)
// WE HEARD YOU LIKE DOM QUERIES...
const { innerText: title } = li.querySelector('h2')
return { title }
})
For reference, `innerHTML` is a string representation of a DOM node's children, whereas `outerHTML` includes the node itself. You'd want to use the latter here, since there's no guarantee that there's only one child. (You could just recursively dig through the rest of the tree until you find something with multiple children, but we don't really need that here.)
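To illustrate with a made-up card element:

```js
// given: <li class="card"><h2>Some listing</h2><a href="/details">More</a></li>
const card = document.querySelector('li.card')

card.innerHTML // '<h2>Some listing</h2><a href="/details">More</a>'
card.outerHTML // '<li class="card"><h2>Some listing</h2><a href="/details">More</a></li>'
```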
With this, you can section off individual query results, and then run queries against those so that you're only working with the specific pieces of data you want. Incidentally -- speaking from the various versions of this I pulled together while writing this infodump -- as an added bonus it'll probably make all of your query handling cleaner.
Now you might wonder: what if the page is actively being transformed client-side with new results? (Including, but not limited to, infinite scrolls.) Amusingly, in some implementations that's even easier (also client-side). Let's look at an example using a button:
// you could just as easily stick this inside the function below
// I'm just declaring it outside bc this isn't changing between scopes anyway
// given successive attempts at writing this, I'd probably even just declare everything I want
// to query in such an operation as a top-level object
// that would also give it a consistent *structure* for these fn's to pull from
const buttonQuery = '' // YOUR SELECTOR HERE
// check for a button
// click the button if it exists
// wait for some time, then do this again
// keep going until there's no more button
const clickTheButton = async () => {
const button = document.querySelector(buttonQuery)
// you *should* be able to find a sufficient selector for one button
// if this *isn't* labeled in some way the site has accessibility problems
// and further, some sites will also do it to aid their own use
// of automated testing tools
// but again, failing that, you can often bruteforce that via:
// `[...document.querySelectorAll('button')].find(b => CONDITION)`
// `innerText` is often useful for this
if (!button) return
button.click()
await sleep(timeout) // not repeating myself, but pick a number
return await clickTheButton()
}
await clickTheButton()
Once the server has run out of results to append to the page, you can run the entire DOM query at once. You don't need to use a DOMParser at the top level, you don't need to run individual operations on separate documents every time you make requests... (Not that it isn't possible to stitch the HTML responses together, but exfiltrating the body contents also involves either the same DOMParser step or needless effort spent on manually parsing text.)
But what if there is no button? Well, we've all seen scrolljacking somewhere on a website, and whatever your opinion of it as a general practice, it's a perfect solution here. So let's get into an example of that below.
we did it reddit! (Rewriting history)
(assistant, play Rewrite by Asian Kung-Fu Generation)
Now, what if we didn't want to just read data?
Say that, for example, your favorite social media platform of >10yrs slapped its community in the face -- by suddenly cutting off access to third-party apps and tools, including the ones that keep the quality of the site from plummeting by aiding the site's entirely volunteer mod team, and openly hijacking communities whose mods opposed the changes. Say that you want to wipe your interactions with the platform in an act of spite, but you can't do it through the API, for all of the reasons you would want to in the first place. (Well, it still exists -- but then you'd have to give them the exact money that they're enshittifying the platform for in the first place.) (If you're not familiar with this term, or with Cory Doctorow's work as a whole, I also highly recommend a look.)
Now, technically you agree not to scrape the site as part of the user agreement (and part of the distinction in scraping cases is that if you're not proactively agreeing to the user agreement there isn't as much weight to the enforcement of it), but what are they gonna do, ban your account? (And in practice, individual subreddits will detect and remove comments you're overwriting, but nothing platform-level happened in my testing.)
First, let's set up one more variant of our infinite scroll handling and another utility function:
// scroll to end of page
// keep going until full history is loaded...
let isFinishedLoading = false
let currentScrollPosition = 0
let counter = 0
const maxRuns = 100
// ...or you hit a defined number of maximum pagination calls
// this is here as a performance optimization. If you have *lots* of content,
// you might want to refresh and do multiple runs instead of trying it all
// in a single browser process, which could eventually crash
// (this limit is also a guess; tweak this to your own needs)
while (!isFinishedLoading && counter !== maxRuns) {
window.scrollTo(0, document.body.scrollHeight)
await sleep(3000) // wait 3 seconds; adjust this if you have a slow connection
isFinishedLoading = currentScrollPosition === window.scrollY
currentScrollPosition = window.scrollY
counter++
}
const checkTextContents = (el, text) => el.innerText.toLowerCase().includes(text)
And from there, the implementation more or less looked like this:
// delete all posts or comments loaded in the current tab
// run this from a /posts or /comments section specifically
// it'll let you cover more of them in one run if you have a *lot*,
// but more importantly the initial menu button selector is different
// so this won't work if you're on a general /u/username page
const waitInterval = 0.25 * 1000
const bulkDelete = async (dryRun = false) => {
const menus = getDOMQueryResults('[aria-label="more options"]')
for (let menuButton of menus) {
menuButton.click()
await sleep(waitInterval)
const deleteButton = getDOMQueryResults('button[role="menuitem"]')
.find(e => checkTextContents(e, 'delete'))
deleteButton.click()
const confirmButton = getDOMQueryResults('button')
.find(e => checkTextContents(e, 'delete'))
// this is for the sake of doing a dry run...
// which you're going to want to
const cancelButton = getDOMQueryResults('button')
.find(e => checkTextContents(e, 'cancel'))
await sleep(waitInterval).then(() => {
const targetButton = dryRun ? cancelButton : confirmButton
targetButton.click()
})
}
}
But this was the extent of what I got working on New Reddit. For overwriting comments, I ultimately had to resort to old reddit. There were additional limitations I'll get into later on this, that I could have potentially gotten around through the use of browser automation, but at the time of writing I was more interested in what I could get to run purely client-side, and spending my night on that, than I was in spending it researching the more elegant implementation.
As an additional note, the above implementation might not work anymore by the time you're reading this -- New Reddit has since dealt with, um, people like me by making use of custom elements and Shadow DOM. At the time of this writing it doesn't look to be implemented across the whole site (weirdly I only encountered this when I wasn't logged in), but I wouldn't reasonably expect it to stay like that long-term. At any rate -- interacting with those trees is still doable if my cursory search about this is right, but it's a whole other research spike that also just didn't end up being necessary for any of the things I'm talking about here.
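For what it's worth, if those shadow roots are open (i.e., attached with `mode: 'open'`), poking into them from the console looks roughly like this -- a hedged sketch, since I haven't gone down this path myself, and the tag name below is just a stand-in for one of New Reddit's custom elements:

```js
// closed shadow roots return null here, at which point you're out of luck client-side
const host = document.querySelector('shreddit-comment')
const menuButton = host?.shadowRoot?.querySelector('button[aria-label="more options"]')
menuButton?.click()
```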
And, critically, I can only test this so many times on my own account(s).
Thankfully, the fetch loop is run separately from the edit flow, and the delete logic is pretty similar overall, so as I show you how I ultimately solved for comment scrubbing (or at least my best reconstruction of it), I can lay out how and where you'd change this for bulk deletion.
// overwrite comments. run this from your comments page on old reddit
const overwriteComments = async (dryRun = false) => {
// customize this to create your own 'fuck you' message
// it would still be cool of you to credit my handle/keep this link,
// but I don't ultimately *care* very much, so follow the WTFPL and just...
// Do What the Fuck You Want To
const gistLink = 'https://gist.github.com/chaosharmonic/8cd5dd0a05602ecc0233d5e4b8fbb6b2'
const modPurgeLink = 'https://www.theverge.com/23779477/reddit-protest-blackouts-crushed'
const monetizationLink = 'https://www.theverge.com/2023/4/18/23688463/reddit-developer-api-terms-change-monetization-ai'
const RIPLink = 'https://www.rollingstone.com/culture/culture-news/the-brilliant-life-and-tragic-death-of-aaron-swartz-177191/'
const fediverseLink = 'https://fediverse.party'
const fuckYouSpez = `
This comment has been scrubbed, courtesy of a userscript created by /u/chaosharmonic, a >10yr Redditor making an exodus in the wake of [Reddit's latest fuckening](${modPurgeLink}) (and rolling his own exit path, because even though Shreddit is back up, you'd still ultimately have to pay Reddit for its API usage).
Since this is a brazen cash grab to force users onto the first-party client (ads and all) and [monetize all of our discussions](${monetizationLink}), here's an unfriendly reminder to the Reddit admins that open information access is a cause one of your founders [actually fucking died over](${RIPLink}).
Pissed about the API shutdown, but don't have an easy way to wipe your interaction with the site because of the API shutdown? [Give this a shot!](${gistLink})
Fuck you, /u/spez.
P.S. See you on the [Fediverse](${fediverseLink})
`.trim().replaceAll('  ', '') // remove extraneous whitespace from the template literal's indentation
// if you'd rather run the delete logic on old reddit too, swap in
// the values in the comments, and comment everything above this line
// (note that I haven't tested this on posts)
const scrubComments = async () => {
getDOMQueryResults('a.edit-usertext').forEach(e => e.click())
// 'a[data-event-action="delete"]'
getDOMQueryResults('textarea').forEach(e => e.value = fuckYouSpez)
// this one you'd just comment out
const buttonType = dryRun ? 'cancel' : 'save' // 'no' : 'yes'
const interval = dryRun ? 2500 : 12500
// loop over each submit button, and stagger submissions
// by {interval} seconds each
// don't proceed until all target buttons have been clicked
// offset by 1 to avoid immediately submitting
// the first entry after a new page is fetched
await Promise.allSettled((
getDOMQueryResults(`button.${buttonType}`) // `a.${selection}`
.map(async (e, i) => {
await sleep(interval * (i + 1))
e.click()
})
))
}
await scrubComments()
const nextLink = document.querySelector('a[rel="nofollow next"]')
if (nextLink) {
// get next page
const { href } = nextLink
const nextDoc = await getDoc(href)
// replace existing table contents with next page's table contents
const nextResults = nextDoc.querySelector('#siteTable')
document.querySelector('#siteTable').innerHTML = nextResults.innerHTML
// recursively call the whole function again until you're done
await overwriteComments(dryRun)
}
}
Now again, this is a personal, spiteful example. But speaking as someone who spent >5yrs as a solo help desk: if you work a lot in SaaS Web portals for your job, there's a lot you can automate using these same techniques. You can automate a lot of data entry by targeting form fields, reading their current states, and either altering their values directly, or synthetically generating various DOM events to trigger anything from clicks to submissions. Together with browser automation tools (more on this in a second), you can also use this to collect data from these sorts of UIs, and then write that data to a file, for import into whatever spreadsheet or custom dashboard your boss might want that data presented in. (Again, note that Deno has CSV handlers.)
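To sketch what that form-field piece can look like (all of the selectors and values here are hypothetical):

```js
// fill in a field the way a user would, then nudge the page's framework to notice
const field = document.querySelector('input[name="ticket-status"]')
field.value = 'Resolved'

// many SPA frameworks only react to events, not direct value assignment,
// so fire them synthetically
field.dispatchEvent(new Event('input', { bubbles: true }))
field.dispatchEvent(new Event('change', { bubbles: true }))

// then submit the same way the earlier examples click buttons
document.querySelector('button[type="submit"]')?.click()
```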
A couple of details about the implementation to call out here...
The relevant technical ones first:
- Normally, we'd be calling out the use of `innerHTML` as unsafe. Used carelessly, this can be leveraged by cross-site scripting attacks that insert a `<script>` tag somewhere into the DOM -- at which point it then runs. Doing this off of the bare results of a `fetch` request would usually be a terrible idea. That said, while browsers offer mitigations for this now, that's not really within this scope, and it's not fully implemented across browsers yet. Where I'd maybe have concerns about this normally, this is my own content, so as much as I can reasonably trust that Reddit would have mitigations for this in the first place, I can be even more sure that I'm not running such an attack on myself.
fuckYouSpez
is written in Markdown. (And so is the rest of what you're reading right now!) This is a whole other talk, but tl;dr: Markdown is a lightweight text format designed to convert to and from plain HTML. The idea, more or less, is that you can decorate plain text with punctuation, and get back formatted text in a browser (or other Web-based application). It's used across a lot of different applications in social media, chat, and other spaces that heavily rely on user-generated content. The list includes Discord, Slack, Matrix (also its own talk), GitHub, Trello, Notion, and of as you could guess, Reddit. (And that's not anywhere near an exhaustive list, just a small sampling off the top of my head.) old reddit in particular used Markdown by default, and New Reddit still has a toggle to let you use this instead of WYSIWIG. In fact, one of the primary contributors to the original Markdown spec. Aaron Swartz, was also one of Reddit's cofounders.
Next, a contextual aside -- that will somehow circle back to how this whole post started. Swartz was effectively bullied into suicide, by a zealous prosecutor threatening decades of prison time and outright refusing to consider a plea deal, over the Computer Fraud and Abuse Act -- a broad hacking law written in the 1980s by elderly lawmakers who were afraid of what you could do with computers in an 80s movie. (I'm not joking.) Essentially, under this law, "access[ing] a computer without authorization" or "exceeding authorized access" was classified as a felony. Now, being written by legislators instead of informed users, this ended up being broad enough in practice that it could include anything as minor as violating a site's terms of use. In fact, a different tragedy involving cyberbullying did become a CFAA case specifically because of the ToS violations, and was thrown out over the question of what "unauthorized access" even means.
(And I say "instead of informed users" in part because that was on purpose -- one specific federal office that the Reagan-era GOP held up as an example of government waste, before eventually slashing it in the 90s, was the Office of Technology Assessment, whose job was to research emerging technologies and educate lawmakers on how they might impact society.)
The point where this circles back is that the case I referenced in the disclaimer was also over the CFAA, which SCOTUS defanged over that same problem. The important distinction that precedent set that you should be aware of is that it matters if access to the resources requires you to agree to the terms of use. In practical terms, this means that you do need to pay attention to whether the details you're trying to scrape are gated behind a login.
Finally, just as a brag about what I'm capable of when properly motivated: aside from reconstructing the working code for overwriting comments later (the old reddit version originally being a workaround in my testing that I didn't bother to save the first time because I wanted a working version in New Reddit -- more on that in a bit), I wrote all of the above implementation over an all-nighter that I started drunk, because I started with a brief curiosity about how the implementation might work, and then got obsessed with solving a puzzle until suddenly realizing it was 6am.
Automate the Boring Stuff...
...with Python!
(No, I'm not suddenly pivoting languages on you.)
Is there a book you like, or some other useful reference text, that's freely available on the Web? Especially something posted by the author -- in this case, Al Sweigart -- under a Creative Commons license that freely allows you to share and remix it? (In this case, using its Attribution, Non-Commercial, and Share-Alike provisions. Basically what they sound like: credit him, don't use it in any commercial projects, and do apply the same license to any projects you do use it in.)
While I'm here, I'd be remiss if I didn't plug his work. Aside from generally being excellent, the book I'm using for this exercise -- or rather, the video course -- was the first thing that actually stuck for me in learning to code, and all of the written material is similarly available.
(Of course, you could just as easily do this with something similar, like Eloquent JavaScript.)
Anyway. Time to channel your inner data hoarder. Crawl the site, pull down each chapter, and clean up the HTML before finally converting this to Markdown. Once you've got the content into plain text (or close enough), you can then render the contents in whatever interface you like. (For a cool example of this being done with other technical content, see DashDash, which does this with Linux `man` pages.)
Unlike most of this, you will need to actually extend beyond the platform for this one, and pull in a third-party library that isn't just a polyfill. For simpler examples, we can do this by operating on `querySelectorAll` results using `String.replace()` to swap out individual elements, until you're ultimately left with just the individual HTML you want. And we'll run through some examples of this anyway, just since there's a wealth of conversion tools, and different tools will come with different options for handling automated replacement rules. But we run into some problems with this if we try to hand-roll it at scale:
- First of all, Markdown is a distinct format, and just by using it as a storage medium, we're already outside the realm of vanilla JavaScript (or plain HTML), so we don't need to be purists about this. We're also just trying to store stuff as Markdown, not build an entire implementation.
- The original Markdown spec is actually fairly small, and so not only are there some elements it doesn't handle natively, there are a variety of dialects that do cover these things. Their conventions mostly overlap, but you're probably not going to want to somehow create your own.
- That said, for convenience's sake we're mostly going to refer to two commonly used variants: CommonMark, a specification (and reference implementation) maintained by a community of Markdown enthusiasts who work on various downstream tools; and GitHub Flavored Markdown, a dialect built on top of CommonMark that adds various functionality (such as tables and strikethrough). We'll be using the latter here for the tables in particular.
- HTML itself has evolved over time, and older texts might be built using generic elements like `<div>`s and styled manually using CSS to create UI components that have since gained standard implementations (`<code>`, for instance, or `<dialog>` for a less immediately relevant example) or even just elements with semantic naming (like `<nav>`, `<aside>`, `<section>`, etc).
- HTML is also valid Markdown, so you're also going to have to determine what HTML (if any) you might want to keep.
There are a few different options for this, and since ultimately this process is operating on text, many of them are available both client- and server-side.
One option you can use in both directions is `remark` (see website) -- a robust option, maintained as part of a collective called Unified, whose purpose is to provide tools for creating structured data out of content. It works in conjunction with a companion plugin for HTML called `rehype`, both being part of a broad plugin architecture for the `unified` package itself. Unified is used by a wide array of tools within the JavaScript ecosystem, ranging from Gatsby to Prettier to Node itself. There's also a CLI you can install and run separately, should you feel like it. That said, while the CLI enables you to just define everything in a config file, using it in your code can get a little verbose, since you have to chain calls to `.use(plugin)` for every step (which includes both parsing and stringifying your input/output formats), and every plugin has to be imported separately as an individual package. (Additionally, remark adds extra spacing to list items -- equivalent to wrapping their contents in `<p>` tags -- apparently in a holdover from the original implementation, which had a concept of "loose" and "tight" lists. I had trouble finding a way around this as a default behavior, and it gave me issues in a different place where I had reason to convert and store text content like this.)
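To give a sense of that verbosity, an HTML-to-Markdown conversion through the unified pipeline looks roughly like this -- a sketch, assuming the `npm:` specifiers resolve in your environment (swap in bare imports or a CDN as needed):

```js
import { unified } from 'npm:unified'
import rehypeParse from 'npm:rehype-parse'
import rehypeRemark from 'npm:rehype-remark'
import remarkGfm from 'npm:remark-gfm'
import remarkStringify from 'npm:remark-stringify'

// every step -- parsing, transforming, stringifying -- is its own plugin
const file = await unified()
  .use(rehypeParse)          // read the HTML in
  .use(rehypeRemark)         // convert the HTML syntax tree to a Markdown one
  .use(remarkGfm)            // GitHub Flavored Markdown extensions (tables, etc.)
  .use(remarkStringify)      // write the Markdown back out
  .process('<h1>Hello</h1>') // any HTML string works here

console.log(String(file)) // '# Hello'
```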
There are also various individual tools for converting specifically from one format to the other. Popular options for parsing Markdown and converting to HTML include `marked` and `markdown-it`. For server-side in particular, the Deno team also maintains `deno-gfm`. For the other direction, we'll be using one called `turndown` -- it's actively maintained, converting between formats only takes one line, and while it allows for replacement rules I was able to more or less get what I needed just using its defaults.
The stuff we'll be feeding into it will be sort of redundant, since many of these tools provide their own handlers for supplying replacement rules. But for the reasons listed above, we'll also get into the ways you might want to clean this up in the event that you want to use other tools.
You can also do this one using Deno (see above), which incidentally lines up with a different project of mine that's further on my back burner -- remixing that book myself into something similar using JavaScript. With DevTools and Deno, if that's not obvious by this point. Conveniently, said book also includes a chapter on this exact topic. While I didn't actually get that far before ultimately jumping tracks to JavaScript myself (somewhere around the testing chapter is about where I jumped), and this was developed in an unrelated manner with no connection to the source material, it does mean that I'm some 5% of the way there already...
The goal here is twofold: remove any extraneous elements or structure (like IDs and classes) that might not translate cleanly to a Markdown document, but first find and transform any that might have an equivalent in native HTML now. As stated above: depending on the age of your text of choice, semantic elements like `<aside>`, ones designed for handling code blocks, etc. might not have existed at the time of publication. Some of those things may be manually formatted via CSS, while others might just not be in use at all, depending on the author's formatting preferences. So since we really only care about the raw content, it's good to take any of these that you can pick out and swap in native HTML where it exists. Not only is HTML itself valid Markdown, but you'll want elements like `<code>`, that have Markdown equivalents, to exist in the document body first before ultimately running all of this into a parser.
Additional note: if you're doing this server-side, you also have more options here. In addition to native JavaScript packages, you can also use `Deno.Command` or `node:child_process` to invoke command-line tools such as `pandoc`.
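As a rough sketch of that route (this assumes `pandoc` is installed locally, the script has `--allow-run` along with read/write access, and the filenames are placeholders):

```js
// shell out to pandoc to convert a saved HTML chapter into GitHub Flavored Markdown
const pandoc = new Deno.Command('pandoc', {
  args: ['-f', 'html', '-t', 'gfm', 'chapter.html', '-o', 'chapter.md'],
})

const { code, stderr } = await pandoc.output()
if (code !== 0) console.error(new TextDecoder().decode(stderr))
```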
Starting from the site's homepage, here's a rough idea of what this looks like. (We'll assume for anyone that's using Deno that you've already `fetch`ed the homepage and parsed it as a document, as seen above.)
// Deno, <script type='module'>
// this is originally named TurndownService, but you don't *have* to keep it that verbose
import TDService from 'https://cdn.jsdelivr.net/npm/turndown/+esm'
import { gfm } from 'https://cdn.jsdelivr.net/npm/turndown-plugin-gfm/+esm'
// this is like 20 requests in total against a resource that's free to read on purpose,
// so it's neither the volume nor the place to need to worry about rate limiting
// hence, we can just map directly over this and feed the results into Promise.allSettled
// without concerning ourselves too much about delays
// if Deno:
// const rootURL = 'https://automatetheboringstuff.com'
// const rootDoc = await getDoc(rootURL)
// if console: just swap rootDoc for 'document'
const links = getDOMQueryResults('li a', rootDoc)
.map(e => ({
href: e.getAttribute('href'),
title: e.innerText.replace('\u00a0', ' ') // swap the non-breaking space in the title for a regular one
}))
const bookContents = await Promise.allSettled(links
.map(async (result) => {
// there are some extra sections that I'm leaving out
// here, just to simplify this for now
const isBookContent = ['Chapter', 'Appendix', 'Introduction']
.some(word => result.title.startsWith(word))
if (!isBookContent) return null
try {
const page = await fetchHTML(`${rootURL}${result.href}`)
// console.log(page)
return { ...result, content: page }
} catch {
return result
}
}))
.then(values => values
.map(({ value }) => value)
.filter(e => e)
)
for (let c of bookContents) {
const { content } = c
if (!content) continue
// strip out extraneous IDs, normalize any duplicate classes
// you won't have *much* of the latter, but
// there are a few w trailing digits
// it's *mostly* >100 of the same `calibre-link` IDs
// for *most* things we'll be using `replaceAll()`,
// but since these are regex replaces anyway,
// we can get the same effect w the 'g' flag
const page = content
.replace(/id="calibre_link-\d+"/g, '')
.replace(/class="programs\d+"/g, 'class="programs"')
// `replace()` or `replaceAll()` more as necessary
const doc = parseHTML(page)
// you can manipulate elements using forEach by spreading into an array
getDOMQueryResults('span', doc)
.filter(e => e.outerHTML.includes('pagebreak'))
.forEach(e => e.remove())
// you can also operate on multiple selections at once
getDOMQueryResults('header, footer, center', doc)
.forEach(e => e.remove())
// normalize class names (remove trailing numbers -- these aren't shuffled *much*)
// these elements were styled as code via CSS instead of using native elements
// my *guess* is the site was written before some of these were put into place?
// some of these below are just semantic choices, but the point here is that
// you can replace an element wholesale by assigning its `outerHTML`, stitching
// its `innerHTML` into a structure of your choice
getDOMQueryResults('.programs', doc).forEach(e => {
e.outerHTML = `<pre><code>${e.innerHTML}</code></pre>`
})
getDOMQueryResults('.codestrong, .codeitalic, .literal', doc)
.forEach(e => {
e.outerHTML = `<code>${e.innerHTML}</code>`
})
getDOMQueryResults('.note, .sidebar', doc).forEach(e => {
e.outerHTML = `<aside>${e.innerHTML}</aside>`
})
// if you want to remove a nested child, you can also
// assign its innerHTML directly to that of its parent
// using e.parentNode (if an only child)
// this is still an option if not, it's just kind of verbose...
// it looks something like:
// e.parentNode.innerHTML = e.parentNode.innerHTML
// .replace(e.outerHTML, e.innerHTML)
const outerInnerSwaps = [
getDOMQueryResults('li > p', doc),
getDOMQueryResults('a', doc)
.filter(a => a.getAttribute('href')?.startsWith('#'))
].flat()
outerInnerSwaps.forEach(e => {
e.outerHTML = e.innerHTML
})
// strip out any classes you haven't mined for specific structure
// some of this isn't strictly necessary, but we're being thorough here.
// just since depending on the content and your tool of choice
// you may need more or less manual cleanup than this
getDOMQueryResults('*', doc)
.filter(({ classList }) => classList.length)
.forEach(e => e.classList.value = '')
// convert final document contents to Markdown
// (that last replace is mostly just if you care
// to read the intermediary HTML)
const cleanedHTML = doc.body.innerHTML
.replaceAll('<pre><code>', '<pre><code class="language-python">')
.replaceAll(' class=""', '')
// one-line version:
// c.content = new TDService().turndown(cleanedHTML)
// TurndownService can also be declared on its own,
// and then modified with replacement rules
// to keep details like semantic structure
// or plugins, to extend its functionality
const td = new TDService({ codeBlockStyle: 'fenced' })
td.use(gfm)
td.keep('aside')
c.content = td.turndown(cleanedHTML)
}
If you're using Deno, here's how you'd finally store all of this:
import { ensureDirSync } from "https://deno.land/std/fs/ensure_dir.ts"
ensureDirSync('./chapters')
await Promise.allSettled(bookContents.filter(c => c.content)
.map(async ({ title, content }) =>
await Deno.writeTextFile(`./chapters/${title}.md`, content))
)
Now, the above still isn't perfect as I run it myself: there are a couple of points where various elements representing code formatting are next to each other, the table formatting isn't great (because the tables contain paragraphs), I'm not fetching any of the images here... I could go on. But you're going to have to adapt parts of this to your target anyway, so treat it as a starting point rather than a finished converter.
There are a couple of differences in this implementation depending on where you're running it. First, server-side:
- The version of turndown listed above gave me some kind of issue involving document not being defined. I got around this by defining document from parsing cleanedHTML (Turndown can also take documents and not just bare HTML strings, incidentally), but the easier route was just to import npm:turndown instead.
- I chose deno-dom over JSDOM because it offers a bare DOMParser instead of a separate abstraction for declaring the document, but it's missing some API functionality, where the latter is not only more mature, but focused on offering a thorough implementation -- extending beyond just the DOM itself into adjacent functionality like custom elements and cookie handling. I had to use getAttribute instead of being able to access href directly, for instance.
Now, you might just want to run the whole thing client-side. There, though, we run into some other limitations:
- import statements are only supported in <script type="module"> tags, so while generally I'm using ES modules above, you'd still need another way to load the scripts if you were doing this in a console. It's not hard, necessarily -- use the DOM API to create a script element, assign its src to your favorite CDN, and append it to the head or body (there's a sketch of this right after the list). And, optionally, reassign the name if you feel like using something shorter... but at this point we're already ceasing to take the succinct route just by doing this.
- And even if you do... remember how I mentioned XSS attacks earlier? Yeah, for that exact reason a lot of places where you might want to do this are going to implement a Content Security Policy to limit the sources you can use to request external resources, like packages.
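For what it's worth, that manual route looks something like this -- the path here mirrors the dist build that turndown's docs reference, served from the same CDN as above, so treat the exact URL as an assumption if you're loading a different package:
// create a script element pointed at a CDN copy of the library, then append it
const script = document.createElement('script')
script.src = 'https://cdn.jsdelivr.net/npm/turndown/dist/turndown.js'
// the browser build attaches a global rather than exporting a module
script.onload = () => console.log(typeof window.TurndownService) // 'function'
document.head.append(script)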
But... what if we didn't have to choose?
Even Further Beyond (the browser)
(...or: putting it all together, server-side)
At some point you'll probably want to combine these approaches.
You'll notice I've mentioned a couple times by now that fetch logic will vary when you run it client-side vs server-side. There's no cookie storage, you can't short-circuit your way around CORS because there's no origin to fetch
from, and a big limitation here is that if your target is using any kind of UI library you can't actually execute any JavaScript you receive to render (let alone store) the real page contents.
But what if you could use server-side logic to control your browser, and then make the browser run your fetch scripts? What if you needed a real, rendered page to interact with, but didn't want to set up any of the interacting yourself? What if you wanted to automate organizing and storing your results?
This is where you'd want to reach for a browser automation tool, like puppeteer
or playwright
.
I reference these two in particular largely because, from the browser side, they're effectively first-party tools:
- Puppeteer is maintained by Google, with its underlying protocol originating in Chrome.
- Longer-term, the W3C is aiming to extend the WebDriver protocol to communicate bidirectionally and cover similar use cases. Incidentally, both the DevTools protocol and the WebDriver BiDi proposal communicate over WebSockets.
- The protocol itself is also implemented to varying degrees in other, non-Chromium-based browsers.
- Meanwhile, Playwright is maintained by Microsoft, which ships a distinct version of the DevTools protocol within Edge.
Since we're using Deno: puppeteer
has a port, but it's a few versions behind. Deno supports importing the version for Node, but aside from that approach having its own snags, we can also use astral
, a similar library that was built specifically for Deno. Astral is, at the time of this writing, fairly early in development, and isn't intended as a drop-in replacement for Puppeteer -- it aims to simplify parts of Puppeteer's selector and event APIs -- but it ultimately communicates with the browser in a similar fashion.
I'll broadly be referencing Puppeteer's conventions as I go, but noting where there are meaningful deviations.
One key difference relative to the entire set of Puppeteer docs is that, since we're not using Node, instead of having to wrap everything in an async
function, we can assume there's access to top-level await
throughout.
For a potential example, see an alternative implementation of the comment overwrite.
const overwriteComments = async (dryRun = false) => {
// given the same strings as above...
const commentMenus = getDOMQueryResults('[aria-label="more options"]')
// edit all comments with a 'fuck you /u/spez' message
for (let menuButton of commentMenus) {
menuButton.click()
const editButton = getDOMQueryResults('button[role="menuitem"]')
.find(({innerText: t}) => t.toLowerCase() === 'edit')
editButton.click()
const markdownButton = document.querySelector('button[aria-label="Switch to markdown"]')
// switch to Markdown
// fuckYouSpez is Markdown content, and submitting it this way
// will preserve your formatting without having to bother with
// the rest of the WYSIWYG menu
if (markdownButton) markdownButton.click()
document.querySelector("textarea").value = fuckYouSpez
await sleep(250).then(async () => {
const cancelButton = document.querySelector('button[type="reset"]')
const submitButton = document.querySelector('button[type="submit"]')
if (dryRun) {
cancelButton.click()
await sleep(250).then(() => {
const discardButton = getDOMQueryResults('button[role="menuitem"]')
.find(({ innerText: t }) => t.toLowerCase() === 'discard')
discardButton.click()
})
} else {
submitButton.click()
await sleep(750)
}
// all runs
const closeButton = document.querySelector('button[aria-label="Close"]')
closeButton.click()
})
}
}
This is the version I wanted to keep, but couldn't get working. Why? A couple of reasons:
- conflicts between direct DOM manipulation and React state. Altering the value of the element itself won't actually change anything about what's ultimately sent on submit. This only changes on typing in the text field, and I can't seem to find an effective way to synthetically generate a KeyboardEvent the way I can a click. I could potentially get around this by making a network request directly from the page, but that wouldn't solve the other problem...
- ...client-side routing -- that is, manipulating the page URL and browser history when pulling up each comment to edit, over a single application that runs on one seamless "page." But, as the page's window.location changes, the browser starts a new console session, ending the rest of the operation.
Browser automation libraries like the ones I listed above leverage DevTools to control the actions a browser is taking -- in addition to the set of interactive tools you can use within your session, such as the console or the element inspector, there's also a debugging protocol that outside tools (in this case Deno, via Astral) can use to communicate with it. And since the script doing the controlling communicates with the browser from the outside, we're not bound to a console session, and we can more reliably generate synthetic inputs. Importantly, this also means our operations are no longer bound to a single page (client-side routed or otherwise). We can start a new session with:
import { launch } from "https://deno.land/x/astral/mod.ts"
const browser = await launch()
// if you want/need to see this operation, or to interact with it manually,
// pass in { headless: false }
const targetURL = 'about:blank' // whatever your actual target is
// you can also do this in one line
// by passing targetURL directly into newPage
// but if you need to navigate around
// this is useful to know
const page = await browser.newPage()
await page.goto(targetURL)
// your logic here
await browser.close()
// cleanup
From there, page
comes with handlers like waitForNavigation
and waitForSelector
that can pause the execution of the script while the page loads the relevant contents. The selectors work somewhat differently across implementations: both let you await a selector's existence and assign the result to a variable, which then comes with methods like .type and .click, but Puppeteer also supplies these on the page object itself (taking the selectors as arguments). Additionally, both offer a jQuery-like $ selector. (Astral's examples still default to this, where Playwright specifically notes that it's deprecated.) Astral's does have more limitations, however, as Puppeteer also offers a few other shortcuts -- more on this in a second.
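Side by side, that difference looks roughly like this (the selector is just a placeholder):
// element-handle style: works in both, and is the shape Astral expects
const submitButton = await page.waitForSelector('button[type="submit"]')
await submitButton.click()
// page-level shortcut: Puppeteer also lets you skip the handle entirely
await page.click('button[type="submit"]')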
Regardless of which you opt for, either will let us pause execution until a given selector exists. The first benefit to this is that you can clean out all the sleep
calls that are manually being used to wait for these elements to exist. (And further, we're going to want to limit how much we're doing directly within the browser console, for reasons I'll get into in a second.) But more importantly, this also will let you pause the execution of the rest of the script until there's a DOM initialized on the page.
Going back to the New Reddit example for a second: Puppeteer does offer additional selectors that provide shortcuts for traversing shadow DOMs, but Astral doesn't appear to have implemented this yet. Between that and the limits on how thoroughly I can test this one (and the aim, which sprang out of that, of doing it all client-side anyway), we can just move on from that specific example, save for one additional detail: to get the above working this way, you'd also need to implement a login flow. I won't detail it here, but it's fairly straightforward: find the selectors for username/pw, run .type
handlers on them, click "submit," and if you do have to solve a CAPTCHA or equivalent you can still interact with a full instance of the browser and just solve those manually -- waitForNavigation
will just... wait for navigation and pause the rest of the script's execution while you do that.
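If you do end up needing it, a minimal sketch of that flow looks something like this -- every selector and credential below is a placeholder to swap for whatever your target actually uses:
await page.goto('https://example.com/login')
const userField = await page.waitForSelector('input[name="username"]')
await userField.type('YOUR_USERNAME_HERE')
const passField = await page.waitForSelector('input[type="password"]')
await passField.type('YOUR_PASSWORD_HERE')
const loginButton = await page.waitForSelector('button[type="submit"]')
await loginButton.click()
// if a CAPTCHA shows up, launch with { headless: false } and solve it by hand --
// this line just pauses the script until you land on the next page
await page.waitForNavigation()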
We can use page.evaluate
to run scripts within the browser context, with full access to any relevant APIs -- here, we can make DOM queries, fetch
requests from a browser window in a specific origin, etc. Note that, per the Puppeteer docs, code run through this is serialized and sent over JSON as strings, so you have to supply additional arguments to access any data outside the scope of the callback you're passing in.
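As a quick, contrived illustration (this uses Astral's args option, which you'll also see in the longer example below -- Puppeteer instead takes extra arguments positionally, after the callback):
const selector = 'li a' // this lives in Deno-land...
const hrefs = await page.evaluate((sel) => {
  // ...but this callback runs inside the page, so the value has to be passed in --
  // it can't just close over `selector` the way a normal callback could
  return [...document.querySelectorAll(sel)].map(a => a.href)
}, { args: [selector] })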
One potential way around this in Puppeteer (a poke around Astral's source didn't show this to be implemented yet either) might be to use page.addScriptTag
to load a script into the window and declare some global variables -- which page.evaluate
should then be able to access as resources within the window. Without that, a more blunt approach -- but a perfectly usable one here -- would be to just declare values within the callback itself for anything you don't strictly need to pass in from outside.
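I haven't tested the addScriptTag route myself, but in Puppeteer it would look roughly like this (scrapeConfig and maxPages are made-up names, purely for illustration):
// inject a tiny inline script that declares a global on the page's window
await page.addScriptTag({ content: 'window.scrapeConfig = { maxPages: 5 }' })
// later evaluate calls can then read it off the window like any other global
const maxPages = await page.evaluate(() => window.scrapeConfig.maxPages)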
So, for example, if you were running getDataFromDirectory
as listed above, you could then combine this with the previous server-side parts to run your scraping logic from within the browser (including any fetch
calls from within the same origin), pass that to Deno once the whole operation is done, and finally write the whole thing to a file. We're going to have to move some things around in the process -- page.evaluate
won't run indefinitely, and appears from my testing to cut out after ~a minute, so we'll need to run individual calls against this instead of invoking one singular, large, recursive function call.
Luckily, we can just run this part inside the recursive function call -- calling the page handling logic once each time it runs, away from any of the sleep
calls:
// above details we're not repeating here:
// importing and setting up browser context
// importing jsonfile
// don't do anything until there's a DOM we can operate against
// for a less blunt approach, choose a selector that's actually on your target page
// so that this doesn't continue running if a CAPTCHA page loads first instead
// but again, depending on your target that might not matter
await page.$('body')
const getDataFromDirectory = async (currentListings = [], targetURL = '') => {
// page.evaluate times out after a minute,
// so we need this to contain a single operation
const { nextListings, nextPageURL } = await page.evaluate(async (targetPage) => {
// more details we're not repeating: any utility functions
// but for scoping reasons, you'd want to declare those here
// if there's a URL being passed in, fetch your new document
const doc = targetPage ? await getDoc(targetPage) : document
const listSelector = '' // YOUR SELECTOR HERE
// since we're already running this whole thing in a callback,
// there's no real point in declaring a separate fn now
const nextListings = getDOMQueryResults(listSelector, doc)
.map(e => {
// your parsing logic here
})
const nextButtonQuery = '' // YOUR SELECTOR HERE
const nextPageURL = doc.querySelector(nextButtonQuery)?.href
if (!nextPageURL) return { nextListings }
return { nextListings, nextPageURL }
}, { args: [targetURL] })
// targetURL from the initial function args goes here
// it'll then be passed into the callback above
const entries = [...currentListings, ...nextListings]
if (!nextPageURL) return entries
// you're going to want to declare this one outside the loop
// this way, it doesn't impact the evaluate logic
await sleep(timeout)
try {
return await getDataFromDirectory(entries, nextPageURL)
}
catch ({ message }){
console.error(message)
return entries
}
}
// kick off the crawl from the page that's already loaded,
// then hand the combined results to Deno to store
const listings = await getDataFromDirectory()
await writeJsonSync(file, listings)
And you don't want to apply through these sites, right? You could get stuck in some easy-apply hole, your records of what you even applied to could be scattered across various platforms (whereas by applying directly everything lives in your email), and worst of all you could just get bounced to some ATS that makes you set up a separate login.
So another example of why you might want to persist an operation like this across pages would be if, say, a site containing job listings also gave you redirect links:
// this data structure is purely for example purposes
for (let listing of listings) {
await sleep(getRandomMilliseconds())
const { redirectURL } = listing
const { host: redirectHost } = new URL(redirectURL)
const page = await browser.newPage(redirectURL)
// the callback below runs inside the new page, so the original host
// has to be passed in as an argument rather than closed over
const originalURL = await page.evaluate((redirectHost) => {
const { host: destinationHost, href: destinationURL } = location
return (redirectHost != destinationHost) && destinationURL
}, { args: [redirectHost] })
if (originalURL) listing.originalURL = originalURL
}
If you do have to follow redirects, you might want to limit that to cases where you're actively looking to pull down additional detail on a listing... but if you get lucky, some sites may stick the original URL into a query string -- in which case you can just extract it directly from the details page, without having to make an additional request to follow where the link ultimately leads:
const { search } = new URL(redirectURL)
const originalURL = new URLSearchParams(search).get('url')
So to revisit the URL
object for a minute... Here, we're checking that either the host
has changed (indicating that you've successfully redirected), or grabbing its query string (search
) and constructing a similar helper object, URLSearchParams
. Similar to what we did with pathname
above, we can also use this to construct searches and navigate to them.
URLSearchParams
is structured differently from plain JavaScript objects, and has to be accessed through get()
and set()
methods passing in key names (and values in the latter), but it's useful for converting data back and forth across these formats. The constructor will also take objects, and like URL
we can also convert the whole thing back to a query using toString()
. And you don't even have to do that, because we can also just assign it directly to URL.search
. URL
also has a searchParams
property that you can call directly, but it's also useful to know that you can construct one.
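To make that concrete (the URL here is just a stand-in):
const url = new URL('https://example.com/jobs?query=scraping&page=2')
const params = new URLSearchParams(url.search)
params.get('query')   // 'scraping'
params.set('page', '3')
params.toString()     // 'query=scraping&page=3'
// no need for the string round-trip, though -- assigning the object works too
url.search = params
url.href              // 'https://example.com/jobs?query=scraping&page=3'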
There's actually a largely overlapping set of properties on window.location
, but that's a representation of the current page rather than an API you can call on any arbitrary URL. And there's document.URL
-- not to be confused with any of the above, since it's actually just a string. But it's the same value as location.href
, which makes it easy to just use that instead for clarity's sake. Or, better yet, just pull a property like host
directly.
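Or, to put the distinction into a console:
document.URL === location.href                 // true -- document.URL is just a string copy of the current address
new URL(document.URL).host === location.host   // true -- but URL works on any arbitrary address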
This isn't even my final form (Some thoughts on optimization)
Like I said, the aim here is to do this at a small enough scale that you're not abusing target platforms. If you're doing this right, automating your data gathering may even mean hitting those targets less than you would if you were visiting them manually as a user. (Especially if you're obsessive and, Idk, refresh your apartment search results once an hour when you're doing it yourself.)
Here are some further ways to aid you in doing that:
- If you're running searches, you can restrict results to a given time frame (like just the past day) to minimize calls -- if you do it once every few hours and filter out anything that overlaps, you can keep the operations themselves small, and avoid making extensive calls to a site at any one time
- Other implementations of infinite scrolling involve the use of client-side occlusion, a performance optimization that removes stale results that you're no longer looking at -- like "pages" of them at the top when you're all the way at the bottom looking at "page" 23
  - The logic around this is something of a mix between the above approaches -- you would still run clickTheButton (or whatever auto-scrolling equivalent) and let it run recursively, but you'd want to modify this so that you're running DOM queries at least often enough to capture all of the page data. (You may also want to optimize your queries here to filter out duplicates and avoid doing this too aggressively, especially since the entire nature of this technique is to prevent a page that's taking up an ever-expanding amount of RAM from choking out the rest of your browser.)
  - Shout out to Stephen Mapes, a college senpai, for this one. (I have run across this concept once or twice on my own before, but never by name.)
- You can save top-level items and then filter them out to remove irrelevant results before pulling individual entries
- You can batch calls to separate websites to run together -- say, a business directory listing followed by a search on a general-purpose engine to find that business' website -- so that you're not running all of your queries against the same place and can buffer them all out a bit further.
- This can also be a good way to filter duplicate results from target sources if you're searching more than one place. For instance, given something like a collection of job boards, you're going to see broadly similar listings from one site to the next -- because regardless of any formatting differences, either the sites are scraping from the same listings or some HR person is actively crossposting them -- so you can use the metadata from downstream sources as a way to search primary ones. (There's a small sketch of that kind of dedup right after this list.)
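As a small sketch of that dedup idea (allListings and the field names are invented for illustration -- use whatever metadata your results actually carry):
// keep the first listing seen for each (title, company) pair across all sources
const seen = new Set()
const deduped = allListings.filter(({ title, company }) => {
  const key = `${title} @ ${company}`.toLowerCase()
  if (seen.has(key)) return false
  seen.add(key)
  return true
})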
But it's important to note that none of this is foolproof. While these techniques are going to get you substantially further than, say, curl
would in the face of Web Application Firewalls like Cloudflare, this is still what Doctorow calls "adversarial interoperability." Just because it might be legal to collect data from a platform, that doesn't mean they necessarily want you to, and many of them are going to take extensive steps to limit this. Data is a competitive advantage, and restricting access to content is critical to effectively maintaining walled gardens. Aside from that boss being an ongoing rabbit hole, ultimately it's also an arms race, so we're not going to get into extensive detail about that here. (This is the other reason you're not seeing much site-specific code: the target platforms would inevitably update anyway, so I'd rather keep it general enough that I don't have to worry about the contents breaking.) But what's critical to know is that you shouldn't expect to be able to beat the same target, the same way, indefinitely. You're going to want to research the site you're collecting data from, get a good understanding of what data is available from it and what its limitations are, and ideally have some alternatives in line -- so that if and when your initial method fails, you can scale out and not just up.
Endgame
All of this being said, small-scale data scraping gives you ways to automate out a lot of your life -- you can use this to store anything from long-form content, to interactions with Web applications (would I even be the meme trash I claim to be if I didn't make at least one reference to large language models?) to the results of basically any search -- to levels of depth you can customize to your needs. And I'd highly encourage exercising some creativity with it; at some point, I also (hopefully) have less than 9500 words to say on how you should build the things you want to use. Personally, most of my career is built on this, in that it's fueled a lot of the ways my own knowledge set has expanded over time. And depending on what you're trying to avoid just buying, it could also save you money. Especially because it stacks over time if we extend "build" to include "repair," and "buying" to include "replacing."
But just as importantly, you save intangibles: time and effort. (To a point. If you execute well.) You know what my job "search" process looks like right now? It doesn't look like poring through [TARGET SITE]. It looks like telling a bot to do that, while I go play a video game (Haiku the Robot, as of this writing) until it's done, and then go through all of its results once I've got them in a single text file with enough metadata to reasonably filter through. It's less personally draining, it's not burying me in a mountain of browser tabs, and I'm getting outreach done at more volume because I'm not spending an extensive amount of time finding better-quality matches. And this is with just the crawler -- everything else I'm building around it (like cover letter generation) is still skeletal, and these gains in productivity are purely from doing a ctrl+F through some JSON. Never mind the organization that comes from skipping the easy-apply holes, using direct links, and then having every piece of outreach live somewhere in my email. And while it's not currently set up for this, it also gives me enough detail on each result to filter out duplicates, by checking new results against existing metadata. (Again, not naming names, but the kinds of classified sites where I've found everything from most of my apartments to my favorite bass would be examples of where this might matter.) I can even link back to this writeup (and have been) to show off to those same prospective employers how I built the tools I used to find them.
Speaking from my prior tech support life, when I was working the hardest it was often because I didn't have proper systems set up -- because they were arcane, or brittle, or inconsistent across sites, or even just excessively manual, regardless of whether or not that was within my control. (And eventually it was, specifically because I got familiar enough with the workings of those systems to be able to reasonably pitch things like hardware purchases.) Most of what enabled me to scale my work up in that job was finding and removing the causes of those problems, whether that was unifying hardware configs across locations or writing repair manuals once there was a consistent enough foundation to do so. And it made my life easier, because that's what the effective use of technology is ultimately supposed to do.
It's supposed to enable you to work less hard.