Osquery: a Playground for SQL noobs

(...like me!)

Hello there!

SQL has been a longtime hole in my knowledge set -- data in my prior tech support life mostly used Excel, my first dev job (and the side project I built thereafter) used MongoDB, my next one didn't make me directly interact with a data layer... And while I've been itching for an excuse to pick this up, my current side project ended up opting for Deno KV. So for a while I've been poking around for some practical personal uses to tinker on, and stumbled across a tool called osquery.

osquery takes a variety of different kinds of system information and logs, which may be formatted in any number of ways, and exposes them virtually as a SQLite database. So you'll have different collections ("tables" in SQL parlance) of data like os_version, users, uptime, dns_resolvers, and numerous others. To rattle off a few more, you can query for packages installed across a variety of system- and language-level package managers, browser extensions, network interfaces, processes... and so on. Plus, because it's all just a wrapper over data your system already has, it's all read-only, so we can't break anything in the course of experimenting with this. (And, really, we don't even have to care about setting anything up, aside from osquery itself.)

Roger, Roger (notes about bots)

To be up front, I also taught myself this using chatbots -- using a variety of models both locally and on Chatbot Arena. It's largely been Mistral derivatives of varying finetunes and sizes, but also periodically included Llama 3 and Phi-3. This particular experiment started by asking one if it could generate me some basic starting points specific to this tool; the ones it gave me weren't quite correct, but after a few changes they were enough to at least get me up and running. Overall, though, I had better success asking them about SQL as a whole, ranging from general learning paths to questions about how specific parts of the language work. I then validated these outputs by digging through the osquery shell to find examples I could demonstrate them with, in order to get a handle on real usage.

That said, I didn't use them to generate any of the content here; everything you're currently reading was written by a human.

Is it possible to learn this power?

SQL, or Structured Query Language, is a way to query relational databases. These are databases where you store tables of data -- sort of like a spreadsheet, but with more definition. (And actually, SQLite can import CSVs directly and query them in the same language.) Each table holds columns, which are named properties -- fields that, to grab an example like time, would include things like hour, minute, and so on. The columns also have data types: the ones I just noted are all numbers, but a full system timestamp would be stored as text. The table then contains rows, which represent entries -- while time has a single row, representing the current system time, you'd have many more in something like, say, shell_history.
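
As a quick taste (don't worry, we'll get to actually running queries below), pulling a couple of those columns out of time looks something like this -- though the exact column names available can vary a bit by platform:

-- the current system time, split across a few numeric columns
SELECT hour, minutes, seconds FROM time;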

The table definition is called a schema, and is generally defined at its creation. It can be modified, but since, again, osquery is just an interface over other data sources and not an actual data store itself, we can assume that isn't happening here. We won't even be creating any of the tables ourselves, as osquery handles this for us. (While I'm on the subject, note that I'm mostly going to be talking about read operations throughout this writeup. Sadly, we'll have to save our XKCD references for another discussion.)

An elegant weapon, for a more civilized age (getting started)

osquery is available on every major operating system and across a variety of package managers, so I can't describe every install method, and will assume you've followed the guide in their documentation.

Once it's installed, we can run queries using the osquery shell, which we start by running the command osqueryi in a terminal.

It'll start by telling you to type .help if you need more information, which will then give you a list of other commands like it that will help you navigate the database. For instance, you can enter .tables to get a list of tables, or .schema {table} to get the table's definition. (This functionality, generally, is inherited from sqlite, but the command list itself is different.)

Running .schema on a table will output the CREATE TABLE statement used to define its columns and their data types. For clarity, I've also spaced the output below onto separate lines. And it'll vary by system -- I'm on Linux, and not every table or column is relevant on every OS (as you'll see with some of the hidden fields as you poke around) -- so let's go for something simpler...

For an easy example that won't break how this displays on mobile (I hope), we can get the hostnames stored in a system's hosts file.

A quick .schema etc_hosts, and...

CREATE TABLE etc_hosts(
  `address` TEXT,
  `hostnames` TEXT,
  `pid_with_namespace` INTEGER HIDDEN,
  PRIMARY KEY (
    `address`,
    `hostnames`,
    `pid_with_namespace`
  )
)
WITHOUT ROWID;

I'm not going to cover everything in the above, but will note that the capitalization is a convention for SQL keywords. The shell isn't actually case-sensitive, but does use it in outputs, which I've kept here and will be sticking with for clarity.
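
To make that concrete, these two are the exact same query as far as the shell is concerned -- only the keyword capitalization differs (we'll get to what SELECT actually does in just a second):

-- keyword case is purely a style convention
SELECT address FROM etc_hosts;
select address from etc_hosts;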

To get the full contents, we can use a simple SELECT statement:

-- NOTE: this can also be expressed here as `.all etc_hosts`
SELECT * FROM etc_hosts;

Which, on a machine where this hasn't been modified, will give you:

address hostnames
127.0.0.1 localhost

A couple of asides, before I get too deep:

For another example, let's look at uptime:

CREATE TABLE uptime(
  `days` INTEGER,
  `hours` INTEGER,
  `minutes` INTEGER,
  `seconds` INTEGER,
  `total_seconds` BIGINT
);

We might not want the full table. In that case we don't have to use the wildcard *; instead, we can specify explicit fields by selecting columns, like:

SELECT days, hours FROM uptime;

This gives us:

days hours
4 4

(And to keep these tables from overflowing on smaller devices, I'll be doing this throughout.)

This is where the fun begins (operating on results)

What if we wanted total hours?

That's not in the schema above, but we can do math operations here and return their results as a new column:

SELECT ((days * 24) + hours) AS total_hours FROM uptime;

And we get:

total_hours
100

We can use AS to rename existing fields in our queries, too:

SELECT days AS total_days FROM uptime;
total_days
4

We can limit the output results:

-- get the first 10 commands in your shell history
SELECT uid, command FROM shell_history
  LIMIT 10;
uid command
1000 tldr grep
1000 tldr
1000 osqueryi ".tables"
1000 osqueryi ".all time"
1000 deno
1000 exit
1000 nu
1000 osqueryi

Sort them:

-- get commands from shell history in reverse alphabetical order
-- `DESC` indicates reverse order
SELECT command FROM shell_history
  ORDER BY command DESC
  LIMIT 3;
command
zsh
zsh
zsh

Filter them:

-- get all running instances of osquery shell
SELECT name, pid FROM processes
  WHERE name="osqueryi";
name pid
osqueryi 138464

-- get `curl` calls invoked from an interactive shell

-- % is a wildcard character here -- it can match anything (or nothing) at all
-- ergo:
-- - "{string}%" is equivalent to "starts with"
-- - "%{string}" means "ends with"
-- - and "%{string}%" is "includes"

-- so if, say, you got a URL from somewhere else,
--  and piped it into `curl`
--  it would still show up here

-- (note that this wouldn't show calls made from within scripts)
SELECT command FROM shell_history
  WHERE command LIKE "%curl%";
command
curl cheat.sh/sqlite3
curl cheat.sh/ffmpeg

And add conditional logic:

-- get processes using zsh or nushell
SELECT name, pid FROM processes
  WHERE name="zsh" OR name="nu"
  LIMIT 3;
name pid
nu 137609
zsh 137211
zsh 137416

Not just the men, but the women, and the children, too (grouping)

We can also aggregate results using functions like COUNT() to get total rows, as well as operate on numeric columns with ones like MIN(), MAX(), SUM(), and AVG().
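
I won't use all of these below, so here's a rough sketch of the numeric ones -- this assumes your platform's processes table exposes a resident_size column (on my machine it's memory in bytes):

-- a hedged sketch: smallest, largest, and average resident memory across processes
SELECT MIN(resident_size) AS smallest,
       MAX(resident_size) AS largest,
       AVG(resident_size) AS average
  FROM processes;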

For a naive example, let's get some device data and operate on that:

SELECT model FROM usb_devices ORDER BY model;
model
2.0 root hub
2.0 root hub
2.0 root hub
2.0 root hub
3.0 root hub
3.0 root hub
3.0 root hub
3.0 root hub
Goodix_USB2.0_MISC
HDMI_Expansion_Card
Laptop_Camera
Wireless_Device

We can then roll that all into a single result:

-- get the number of USB devices currently attached
SELECT COUNT(*) FROM usb_devices;
COUNT(*)
12

Without an AS, the function call itself will also be the column label. We can also count unique values:

SELECT COUNT(*) AS total, COUNT(DISTINCT model) AS models FROM usb_devices;
total models
12 6

But we might also want to get totals for a given value. Say, total running processes per application:

SELECT name, COUNT(*) AS total FROM processes
  GROUP BY name ORDER BY total DESC
  LIMIT 5;
name total
codium 40
firefox 23
ferdium 19
chrome 14
zsh 10

Human/Cyborg relations (combining tables)

But what if you wanted to use data from more than one table? What if you'd rather, say, associate processes with usernames instead of uids?

Here's where the "relational" part comes in. You can use a join to get a single, merged result that pulls in data from both tables based on a shared column. For instance, users also has a uid field, enabling us to...

-- get running processes by name, id, and user,
--  substituting username for user id
--  starting 25 results in
SELECT username, name, pid FROM processes
  JOIN users ON processes.uid = users.uid
  LIMIT 2 OFFSET 25;
username name pid
ryuzaki ollama 13523
ryuzaki koboldcpp 13525

JOINs will output a single set of results containing columns from both tables. By default these are INNER JOINs, which only return rows where there are matches on both sides. You can include other results from either table by using an OUTER JOIN -- a LEFT OUTER JOIN will return all rows from the first table regardless of matches, RIGHT will return all rows from the second, and FULL won't filter from either table.
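
To sketch that last bit out (same columns as the earlier join; the only change is the join type), a LEFT OUTER JOIN keeps every processes row and just leaves username empty wherever a uid has no match in users:

-- keep all processes, even ones whose uid has no match in users
SELECT username, name, pid FROM processes
  LEFT OUTER JOIN users ON processes.uid = users.uid
  LIMIT 2;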

You can also take results from multiple queries, across multiple tables, and collect them into one output. Say you wanted to grab all of the packages you've installed. You could make individual queries: SELECT name FROM python_packages, SELECT name FROM deb_packages, SELECT tags FROM docker_images... but you can also combine those into one set of results, with UNION.
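
A rough sketch of what that combination might look like, using just two of those tables (which package tables exist, and what they contain, will vary by platform):

-- a merged list of package names from two different package managers
SELECT name FROM python_packages
  UNION
SELECT name FROM deb_packages;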

Let's go for a simpler example, though. It's a lot of tables, and the default install on Linux is missing the Arch repos anyway.

We can also take a query like that and nest it as a subquery, to then operate on the entire collected output:

-- output 2 results from the inner query
-- parentheses (you know... `()`) will nest a query,
--  and then you can `select` from its results
--  the same way you can select from tables
SELECT * FROM (
  -- get all extensions installed on firefox
  --  or chromium-based browsers
  -- that can actually be a *lot* of things:
  --  brave, opera, *edge*...
  -- note: there's also a safari_extensions
  --  but I'm not on a mac
  SELECT name, browser_type AS source FROM chrome_extensions
    UNION
  SELECT name, 'firefox' AS source FROM firefox_addons
) LIMIT 2;
name source
Plasma Integration chromium
Bitwarden firefox

The UNION operation takes similarly shaped sets of results, from queries across separate tables, and combines them into one. The key is that you're returning the same overall shape: you'll notice above that I'm getting the same columns from each query, but AS you can see, I can rename fields or assign literal values to shape the data however I might need.

You can even use subqueries to do it: for instance, while I could easily populate the name field for Docker images with SELECT tags AS name, I could also do it using SELECT (SELECT tags FROM docker_images) AS name, or insert that value into any other field of my choosing by using AS to assign it to a different column.

Perhaps the archives are incomplete (branching out)

Over six million different forms of communication (other output formats)

You can also get your results in CSV or JSON by adding --csv or --json flags when you run osqueryi. This can be useful for output to other applications or languages.
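
For instance, osqueryi will also take a query directly as an argument, so something along these lines (the query itself is just an example) prints JSON straight to the terminal:

osqueryi --json "SELECT name, pid FROM processes WHERE name = 'zsh' OR name = 'nu' LIMIT 3;"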

So then the etc_hosts example from earlier would show as:

address|hostnames
127.0.0.1|localhost

...while the pid example above in JSON would look like:

[
  {"name":"zsh","pid":"137211"},
  {"name":"zsh","pid":"137416"},
  {"name":"nu","pid":"137609"}
]

When invoking osqueryi directly on a single query, we can then use output redirection to save this to a file: osqueryi --csv '.all uptime' > uptime.csv

SQLite itself can import CSVs to query them directly, and other SQL databases like DuckDB can also do this with JSON. Speaking of which, like Markdown, SQL has a variety of dialects, and not everything above will work in every kind of database.
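
To sketch out that SQLite bit (this is in the sqlite3 shell, not osqueryi, and packages.csv is just a hypothetical comma-separated export):

.mode csv
.import packages.csv packages
-- once imported, it's just another table
SELECT COUNT(*) FROM packages;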

Let the past die. Kill it if you have to. (PRQL)

If you wanted to write complex queries, or work across multiple databases, you might also try PRQL. It's a language that simplifies a variety of query operations, and compiles them to raw SQL in the dialect of your choosing.

PRQL uses pipelines, represented either by the pipe character (|) or line breaks, to transform data at each step of an operation. As an example:

from shell_history
select { uid, command }
filter command != 'chrome'
take 10
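
For reference, the PRQL compiler turns that pipeline into something along these lines (the exact output depends on the PRQL version and the SQL dialect you target):

-- roughly the SQL generated from the pipeline above
SELECT uid, command FROM shell_history
  WHERE command <> 'chrome'
  LIMIT 10;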

But I'm not going to get too deep into this. As a fun bit of trivia: SQL's pronunciation was never explicitly defined when the language was created, and both spelling it out and calling it "sequel" see common usage. With PRQL, though, calling it "prequel" is canonical.

Abilities some consider to be unnatural (some other playgrounds for SQL)

Some other applications you might want to try include:

There's also a plethora of sample datasets available for public consumption, from sources like Kaggle or data.gov. While I've personally soured on it as a platform, there's also a subreddit for that. (Annoyingly, I haven't found a particularly active Lemmy.)

With options that include JSON you could even just make requests from public APIs to collect a dataset on something fun, like maybe data about Star Wars.

...and if you're feeling really enterprising (read: unhinged), maybe you just end up writing 10000 words on Web scraping, as you fall down some rabbit hole just because you wanted a dataset of lines from Futurama and your own curated jobs feed.