Osquery: a Playground for SQL noobs

(...like me!)

Hello there!

SQL has been a longtime hole in my knowledge set -- data in my prior tech support life mostly used Excel, my first dev job (and the side project I built thereafter) used MongoDB, my next one didn't make me directly interact with a data layer... And while I've been itching for an excuse to pick this up, my current side project ended up opting for Deno KV. So for a while I've been poking around for some practical personal uses to tinker on, and stumbled across a tool called osquery.

osquery takes a variety of different kinds of system information and logs, which may be formatted in any number of ways, and exposes them virtually as a SQLite database. So you'll have different collections ("tables" in SQL parlance) of data like os_version, users, uptime, dns_resolvers, and numerous others. To rattle off a few more, you can query for packages installed across a variety of system- and language-level package managers, browser extensions, network interfaces, processes... and so on. Plus, because it's all just a wrapper over data your system already has, it's all read-only, so we can't break anything in the course of experimenting with this. (And, really, we don't even have to care about setting anything up, aside from osquery itself.)

Roger, Roger (notes about bots)

To be up front, I also taught myself this using chatbots -- using a variety of models both locally and on Chatbot Arena. It's largely been Mistral derivatives of varying finetunes and sizes, but also periodically included Llama 3 and Phi-3. This particular experiment started by asking one if it could generate me some basic starting points specific to this tool; the ones it gave me weren't quite correct, but after a few changes they were enough to at least get me up and running. Overall, though, I had better success asking them about SQL as a whole, ranging from general learning paths to questions about how specific parts of the language work. I then validated these outputs by digging through the osquery shell to find examples I could demonstrate them with, in order to get a handle on real usage.

That said, I didn't use them to generate any of the content here; everything you're currently reading was written by a human.

Is it possible to learn this power?

SQL, or Structured Query Language, is a way to query relational databases. These are databases where you store tables of data -- sort of like a spreadsheet, but with more definition. (And actually, SQLite can import CSVs directly and query them in the same language.) Each table holds columns, which are named properties -- fields that, to grab an example like time, would include things like hour, minute, and so on. The columns also have data types: the ones I just noted are all numbers, but a full system timestamp would be stored as text. The table then contains rows, which represent entries -- while time has a single row, representing the current system time, you'd have many more in something like, say, shell_history.
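
As a quick taste (don't worry, we'll get to actually running queries below), pulling a couple of those columns out of time looks something like this -- though the exact column names available can vary a bit by platform:

-- the current system time, split across a few numeric columns
SELECT hour, minutes, seconds FROM time;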

The table definition is called a schema, and is generally defined at its creation. It can be modified, but since, again, osquery is just an interface over other data sources and not an actual data store itself, we can assume that isn't happening here. We won't even be creating any of the tables ourselves, as osquery handles this for us. (While I'm on the subject, note that I'm mostly going to be talking about read operations throughout this writeup. Sadly, we'll have to save our XKCD references for another discussion.)

An elegant weapon, for a more civilized age (getting started)

osquery is available on every major operating system and across a variety of package managers, so I can't describe every install method, and will assume you've followed the guide in their documentation.

Once it's installed, we can run queries using the osquery shell, which we start by running the command osqueryi in a terminal.

It'll start by telling you to type .help if you need more information, which will then give you a list of other commands like it that will help you navigate the database. For instance, you can enter .tables to get a list of tables, or .schema {table} to get the table's definition. (This functionality, generally, is inherited from sqlite, but the command list itself is different.)

Running .schema on a table will output the CREATE TABLE statement used to define its columns and their data types. For clarity, I've also spaced the output below onto separate lines. And it'll vary by system -- I'm on Linux, and not every table or column is relevant on every OS (as you'll see with some of the hidden fields as you poke around) -- so let's go for something simpler...

For an easy example that won't break how this displays on mobile (I hope), we can get the hostnames stored in a system's hosts file.

A quick .schema etc_hosts, and...

CREATE TABLE etc_hosts(
  `address` TEXT,
  `hostnames` TEXT,
  `pid_with_namespace` INTEGER HIDDEN,
  PRIMARY KEY (
    `address`,
    `hostnames`,
    `pid_with_namespace`
  )
)
WITHOUT ROWID;

I'm not going to cover everything in the above, but will note that the capitalization is a convention for SQL keywords. The shell isn't actually case-sensitive, but does use it in outputs, which I've kept here and will be sticking with for clarity.
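
To make that concrete, these two are the exact same query as far as the shell is concerned -- only the keyword capitalization differs (we'll get to what SELECT actually does in just a second):

-- keyword case is purely a style convention
SELECT address FROM etc_hosts;
select address from etc_hosts;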

To get the full contents, we can use a simple SELECT statement:

-- NOTE: this can also be expressed here as `.all etc_hosts`
SELECT * FROM etc_hosts;

Which, on a machine where this hasn't been modified, will give you:

address hostnames
127.0.0.1 localhost

A couple of asides, before I get too deep:

For another example, let's look at uptime:

CREATE TABLE uptime(
  `days` INTEGER,
  `hours` INTEGER,
  `minutes` INTEGER,
  `seconds` INTEGER,
  `total_seconds` BIGINT
);

We might not want the full table. In that case we don't have to use the wildcard *; instead, we can specify explicit fields by selecting columns, like:

SELECT days, hours FROM uptime;

This gives us:

days hours
4 4

(And to keep these tables from overflowing on smaller devices, I'll be doing this throughout.)

This is where the fun begins (operating on results)

What if we wanted total hours?

That's not in the schema above, but we can do math operations here and return their results as a new column:

SELECT ((days * 24) + hours) AS total_hours FROM uptime;

And we get:

total_hours
100

We can use AS to rename existing fields in our queries, too:

SELECT days AS total_days FROM uptime;
total_days
4

We can limit the output results:

-- get the first 10 commands in your shell history
SELECT uid, command FROM shell_history
  LIMIT 10;
uid command
1000 tldr grep
1000 tldr
1000 osqueryi ".tables"
1000 osqueryi ".all time"
1000 deno
1000 exit
1000 nu
1000 osqueryi

Sort them:

-- get commands from shell history in reverse alphabetical order
-- `DESC` indicates reverse order
SELECT command FROM shell_history
  ORDER BY command DESC
  LIMIT 3;
command
zsh
zsh
zsh

Filter them:

-- get all running instances of osquery shell
SELECT name, pid FROM processes
  WHERE name="osqueryi";
name pid
osqueryi 138464

-- get `curl` calls invoked from an interactive shell

-- % is a wildcard character here -- it can match anything (or nothing) at all
-- ergo:
-- - "{string}%" is equivalent to "starts with"
-- - "%{string}" means "ends with"
-- - and "%{string}%" is "includes"

-- so if, say, you got a URL from somewhere else,
--  and piped it into `curl`
--  it would still show up here

-- (note that this wouldn't show calls made from within scripts)
SELECT command FROM shell_history
  WHERE command LIKE "%curl%";
command
curl cheat.sh/sqlite3
curl cheat.sh/ffmpeg

And add conditional logic:

-- get processes using zsh or nushell
SELECT name, pid FROM processes
  WHERE name="zsh" OR name="nu"
  LIMIT 3;
name pid
nu 137609
zsh 137211
zsh 137416

Not just the men, but the women, and the children, too (grouping)

We can also aggregate results using functions like COUNT() to get total rows, as well as operate on numeric columns with ones like MIN(), MAX(), SUM(), and AVG().
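
I won't use all of these below, so here's a rough sketch of the numeric ones -- this assumes your platform's processes table exposes a resident_size column (on my machine it's memory in bytes):

-- a hedged sketch: smallest, largest, and average resident memory across processes
SELECT MIN(resident_size) AS smallest,
       MAX(resident_size) AS largest,
       AVG(resident_size) AS average
  FROM processes;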

For a naive example, let's get some device data and operate on that:

SELECT model FROM usb_devices ORDER BY model;
model
2.0 root hub
2.0 root hub
2.0 root hub
2.0 root hub
3.0 root hub
3.0 root hub
3.0 root hub
3.0 root hub
Goodix_USB2.0_MISC
HDMI_Expansion_Card
Laptop_Camera
Wireless_Device

We can then roll that all into a single result:

-- get the number of USB devices currently attached
SELECT COUNT(*) FROM usb_devices;
COUNT(*)
12

Without an AS, the function call itself will also be the column label. We can also count unique values:

SELECT COUNT(*) AS total, COUNT(DISTINCT model) AS models FROM usb_devices;
total models
12 6

But we might also want to get totals for a given value. Say, total running processes per application:

SELECT name, COUNT(*) AS total FROM processes
  GROUP BY name ORDER BY total DESC
  LIMIT 5;
name total
codium 40
firefox 23
ferdium 19
chrome 14
zsh 10

Human/Cyborg relations (combining tables)

But what if you wanted to use data from more than one table? What if you'd rather, say, associate processes with usernames instead of uids?

Here's where the "relational" part comes in. You can use a join to get a single, merged result that pulls in data from both tables based on a shared column. For instance, users also has a uid field, enabling us to...

-- get running processes by name, id, and user,
--  substituting username for user id
--  starting 25 results in
SELECT username, name, pid FROM processes
  JOIN users ON processes.uid = users.uid
  LIMIT 2 OFFSET 25;
username name pid
ryuzaki ollama 13523
ryuzaki koboldcpp 13525

JOINs will output a single set of results containing columns from both tables. By default these are INNER JOINs, which only return rows where there are matches on both sides. You can include other results from either table by using an OUTER JOIN -- a LEFT OUTER JOIN will return all rows from the first table regardless of matches, RIGHT will return all rows from the second, and FULL won't filter from either table.
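
To sketch that last bit out (same columns as the earlier join; the only change is the join type), a LEFT OUTER JOIN keeps every processes row and just leaves username empty wherever a uid has no match in users:

-- keep all processes, even ones whose uid has no match in users
SELECT username, name, pid FROM processes
  LEFT OUTER JOIN users ON processes.uid = users.uid
  LIMIT 2;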

You can also take results from multiple queries, across multiple tables, and collect them into one output. Say you wanted to grab all of the packages you've installed. You could make individual queries: SELECT name FROM python_packages, SELECT name FROM deb_packages, SELECT tags FROM docker_images... but you can also combine those into one set of results, with UNION.
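
A rough sketch of what that combination might look like, using just two of those tables (which package tables exist, and what they contain, will vary by platform):

-- a merged list of package names from two different package managers
SELECT name FROM python_packages
  UNION
SELECT name FROM deb_packages;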

Let's go for a simpler example, though. It's a lot of tables, and the default install on Linux is missing the Arch repos anyway.

We can also take a query like that and nest it as a subquery, to then operate on the entire collected output:

-- output 2 results from the inner query
-- parentheses (you know... `()`) will nest a query,
--  and then you can `select` from its results
--  the same way you can select from tables
SELECT * FROM (
  -- get all extensions installed on firefox
  --  or chromium-based browsers
  -- that can actually be a *lot* of things:
  --  brave, opera, *edge*...
  -- note: there's also a safari_extensions
  --  but I'm not on a mac
  SELECT name, browser_type AS source FROM chrome_extensions
    UNION
  SELECT name, 'firefox' AS source FROM firefox_addons
) LIMIT 2;
name source
Plasma Integration chromium
Bitwarden firefox

The UNION operation takes similarly shaped sets of results, from queries across separate tables, and combines them into one. The key is that you're returning the same overall shape: you'll notice above that I'm getting the same columns from each query, but AS you can see, I can rename fields or assign literal values to shape the data however I might need.

You can even use subqueries to do it: for instance, while I could easily populate the name field for Docker images with SELECT tags AS name, I could also do it using SELECT (SELECT tags FROM docker_images) AS name, or insert that value into any other field of my choosing by using AS to assign it to a different column.

Perhaps the archives are incomplete (branching out)

Over six million different forms of communication (other output formats)

You can also get your results in CSV or JSON by adding --csv or --json flags when you run osqueryi. This can be useful for output to other applications or languages.
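
For instance, osqueryi will also take a query directly as an argument, so something along these lines (the query itself is just an example) prints JSON straight to the terminal:

osqueryi --json "SELECT name, pid FROM processes WHERE name = 'zsh' OR name = 'nu' LIMIT 3;"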

So then the etc_hosts example from earlier would show as:

address|hostnames
127.0.0.1|localhost

...while the pid example above in JSON would look like:

[
  {"name":"zsh","pid":"137211"},
  {"name":"zsh","pid":"137416"},
  {"name":"nu","pid":"137609"}
]

When invoking osqueryi directly on a single query, we can then use output redirection to save this to a file: osqueryi --csv '.all uptime' > uptime.csv

SQLite itself can import CSVs to query them directly, and other SQL databases like DuckDB can also do this with JSON. Speaking of which, like Markdown, SQL has a variety of dialects, and not everything above will work in every kind of database.
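
To sketch out that SQLite bit (this is in the sqlite3 shell, not osqueryi, and packages.csv is just a hypothetical comma-separated export):

.mode csv
.import packages.csv packages
-- once imported, it's just another table
SELECT COUNT(*) FROM packages;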

Let the past die. Kill it if you have to. (PRQL)

If you wanted to write complex queries, or work across multiple databases, you might also try PRQL. It's a language that simplifies a variety of query operations, and compiles them to raw SQL in the dialect of your choosing.

PRQL uses pipelines, represented either by the pipe character (|) or line breaks, to transform data at each step of an operation. As an example:

from shell_history
select { uid, command }
filter command != 'chrome'
take 10
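
For reference, the PRQL compiler turns that pipeline into something along these lines (the exact output depends on the PRQL version and the SQL dialect you target):

-- roughly the SQL generated from the pipeline above
SELECT uid, command FROM shell_history
  WHERE command <> 'chrome'
  LIMIT 10;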

But I'm not going to get too deep into this. As a fun bit of trivia: SQL's pronunciation was never explicitly defined when the language was created, and both spelling it out and calling it "sequel" see common usage. With PRQL, though, calling it "prequel" is canonical.

Abilities some consider to be unnatural (some other playgrounds for SQL)

Some other applications you might want to try include:

There's also a plethora of sample datasets available for public consumption, from sources like Kaggle or data.gov. While I've personally soured on it as a platform, there's also a subreddit for that. (Annoyingly, I haven't found a particularly active Lemmy.)

With options that include JSON you could even just make requests from public APIs to collect a dataset on something fun, like maybe data about Star Wars.

...and if you're feeling really enterprising (read: unhinged), maybe you just end up writing 10000 words on Web scraping, as you fall down some rabbit hole just because you wanted a dataset of lines from Futurama and your own curated jobs feed.