Osquery: a Playground for SQL Noobs
(...like me!)
Hello there!
SQL has been a longtime hole in my knowledge set -- data in my prior tech support life mostly used Excel, my first dev job (and the side project I built thereafter) used MongoDB, my next one didn't make me directly interact with a data layer... And while I've been itching for an excuse to pick this up, my current side project ended up opting for Deno KV. So for a while I've been poking around for some practical personal uses to tinker on, and stumbled across a tool called `osquery`.
`osquery` takes many different kinds of logging, formatted in many different ways, and exposes them virtually as a SQLite database. So you'll have different collections ("tables" in SQL parlance) of data like `os_version`, `users`, `uptime`, `dns_resolvers`, and numerous others. To rattle off a few more, you can query for packages installed across a variety of system- and language-level package managers, browser extensions, network interfaces, processes... and so on. Plus, because it's all just a wrapper over your existing logging data, it's all read-only, so we can't break anything in the course of experimenting with this. (And, really, we don't even have to care about setting anything up, aside from `osquery` itself.)
Roger, Roger (notes about bots)
To be up front, I also taught myself this using various chatbots, both with hosted options on Chatbot Arena and running on my device using tools like Llamafile. This process largely involved Mistral derivatives of varying finetunes and sizes, but also periodically included Llama 3 and Phi-3. This experiment started by asking one if it could generate me some basic starting points specific to this tool; the ones it gave me weren't quite correct, but after a few changes they were enough to at least get me up and running. Overall, though, I had better success asking them about SQL as a whole, ranging from general learning paths to questions about how specific parts of the language work. I then validated these outputs by digging through the `osquery` shell to find examples I could demonstrate them with, in order to get a handle on real usage.
I've previously had good results doing this to solve CTF puzzles -- asking them generalities about injections became one of my earlier exposures to SQL, and I've talked elsewhere about how this has fueled other projects. (This is also a good way to test these tools for guardrails.) I'll probably talk about this more another time, but for now let's move on.
One last important clarification though -- I didn't use them to generate any of the content here; everything you're currently reading was written by a human.
Is it possible to learn this power?
SQL, or Structured Query Language, is a way to query relational databases. These are databases where you store tables of data -- sort of like a spreadsheet, but with more definition. (And actually, SQLite can import CSVs directly and query them in the same language.) Each table holds columns, which contain property names -- fields that, to grab an example like `time`, would include values like `hour`, `minutes`, and so on. The columns also have data types: the ones I just noted are all numbers, but a full system timestamp would be stored as text. The table then contains rows, which represent entries -- while `time` has a single instance, representing the current system time, you'd have more in something like, say, `shell_history`.
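For a quick taste before we get into setup, here's roughly what querying that `time` table looks like (we'll unpack the syntax properly below):

-- grab a couple of pieces of the current system time
SELECT hour, minutes FROM time;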
The table definition is called a schema, and is generally defined at its creation. It can be modified, but since, again, `osquery` is just an interface over other logging data and not an actual data store itself, we can assume that isn't happening here. We won't even be creating any of the tables ourselves, as `osquery` handles this for us. (While I'm on the subject, note that I'm mostly going to be talking about read operations throughout this writeup. Sadly, we'll have to save our XKCD references for another discussion.)
An elegant weapon, for a more civilized age (getting started)
`osquery` is available on every major operating system and across a variety of package managers, so I can't describe every install method, and will assume you've followed the guide in their documentation.

Once it's installed, we can run queries using the osquery shell, by running the command `osqueryi` in a terminal.
It'll start by telling you to type `.help` if you need more information, which will then give you a list of other commands like it that will help you navigate the database. For instance, you can enter `.tables` to get a list of tables, or `.schema {table}` to get the table's definition. (This functionality, generally, is inherited from `sqlite`, but the command list itself is different.)
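A minimal first session might go something like this (these are dot-commands, not SQL, and `time` is just an arbitrary pick):

.tables
.schema time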
Running `.schema` will output the `CREATE TABLE` statement used to define a table's columns and their data types. For clarity, I've also spaced it out into separate lines. And it'll vary on your system -- I'm on Linux, and not every table or column is relevant on every OS (as you'll see with some of the `hidden` fields as you poke around) -- so let's go for something simpler...

For an easy example, one that won't break how this displays on mobile (I hope), we can get the hostnames stored in a system's `hosts` file.
A quick `.schema etc_hosts`, and...
CREATE TABLE etc_hosts(
`address` TEXT,
`hostnames` TEXT,
`pid_with_namespace` INTEGER HIDDEN,
PRIMARY KEY (
`address`,
`hostnames`,
`pid_with_namespace`
)
)
WITHOUT ROWID;
I'm not going to cover everything in the above, but will note that the capitalization is a convention for SQL keywords. The shell isn't actually case-sensitive, but does use it in outputs, which I've kept here and will be sticking with for clarity.
To get the full contents, we can use a simple `SELECT` statement:
-- NOTE: this can also be expressed here as `.all etc_hosts`
SELECT * FROM etc_hosts;
Which, on a machine where this hasn't been modified, will give you:
address | hostnames |
---|---|
127.0.0.1 | localhost |
A couple of asides, before I get too deep:
- Some of these examples are real -- it doesn't exactly matter if you know about the system uptime as I'm writing drafts of this -- but details like usernames are masked, and the command line tools are mostly just things you might find useful. I'd absolutely encourage you to go on Internet dives about them.
- You also don't even have to type these entire names, as the shell itself comes with tab completion.
- If you're running a single query, you can pass it directly into the command (`osqueryi "SELECT * FROM users;"`) and get its output in your terminal.
For another example, let's look at `uptime`:
CREATE TABLE uptime(
`days` INTEGER,
`hours` INTEGER,
`minutes` INTEGER,
`seconds` INTEGER,
`total_seconds` BIGINT
);
We might not want the full table. In that case we don't have to use the wildcard `*`; instead, we can specify explicit fields by selecting columns, like:
SELECT days, hours FROM uptime;
This is where the fun begins (operating on results)
This gives us:
days | hours |
---|---|
4 | 4 |
(And to keep these tables from overflowing on smaller devices, I'll be doing this throughout.)
What if we wanted total hours?

That's not in the schema above, but we can do the math in our query and return the result as a new column:
SELECT ((days * 24) + hours) AS total_hours FROM uptime;
And we get:
total_hours |
---|
100 |
We can use `AS` to rename existing fields in our queries, too:
SELECT days AS total_days FROM uptime;
total_days |
---|
4 |
We can limit the output results:
-- get the first 10 commands in your shell history
SELECT uid, command FROM shell_history
LIMIT 10;
uid | command |
---|---|
1000 | tldr grep |
1000 | tldr |
1000 | osqueryi ".tables" |
1000 | osqueryi ".all time" |
1000 | deno |
1000 | exit |
1000 | nu |
1000 | osqueryi |
Sort them:
-- get commands from shell history in reverse alphabetical order
-- `DESC` indicates reverse order
SELECT command FROM shell_history
ORDER BY command DESC
LIMIT 3;
command |
---|
zsh |
zsh |
zsh |
Filter them:
-- get all running instances of osquery shell
SELECT name, pid FROM processes
WHERE name = 'osqueryi';
name | pid |
---|---|
osqueryi | 138464 |
-- get `curl` calls invoked from an interactive shell
-- % is a wildcard here -- it can match anything (or nothing) at all
-- ergo:
-- - "{string}%" is equivalent to "starts with"
-- - "%{string}" means "ends with"
-- - and "%{string}%" is "includes"
-- so if, say, you got a URL from somewhere else,
-- and piped it into `curl`
-- it would still show up here
-- (note that this wouldn't show calls made from within scripts)
SELECT command FROM shell_history
WHERE command LIKE '%curl%';
command |
---|
curl cheat.sh/sqlite3 |
curl cheat.sh/ffmpeg |
And add conditional logic:
-- get processes using zsh or nushell
SELECT name, pid FROM processes
WHERE name = 'zsh' OR name = 'nu'
LIMIT 3;
name | pid |
---|---|
nu | 137609 |
zsh | 137211 |
zsh | 137416 |
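You can also combine conditions with `AND` and group them with parentheses. A quick sketch (this just excludes anything running as root):

-- get shell processes, skipping any owned by root
SELECT name, pid FROM processes
WHERE (name = 'zsh' OR name = 'nu') AND uid != 0
LIMIT 3;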
Not just the men, but the women, and the children, too (grouping)
We can also aggregate results using functions like `COUNT()` to get total rows, as well as operate on numeric columns with ones like `MIN()`, `MAX()`, `SUM()`, and `AVG()`.
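To sketch how the numeric ones read, here they are pointed at the `resident_size` (memory, in bytes) column of `processes` -- treat this as illustrative, since available columns vary by platform:

-- smallest, largest, and average process memory footprints
SELECT
  MIN(resident_size) AS smallest,
  MAX(resident_size) AS largest,
  AVG(resident_size) AS average
FROM processes;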
For a naive example, let's get some device data and operate on that:
SELECT model FROM usb_devices ORDER BY model;
model |
---|
2.0 root hub |
2.0 root hub |
2.0 root hub |
2.0 root hub |
3.0 root hub |
3.0 root hub |
3.0 root hub |
3.0 root hub |
Goodix_USB2.0_MISC |
HDMI_Expansion_Card |
Laptop_Camera |
Wireless_Device |
We can then roll that all into a single result:
-- get the number of USB devices currently attached
SELECT COUNT(*) FROM usb_devices;
COUNT(*) |
---|
12 |
Without an `AS`, the function call itself will also be the column label. We can also get counts for each unique value:
SELECT COUNT(*) AS total, COUNT(DISTINCT model) AS models FROM usb_devices;
total | models |
---|---|
12 | 6 |
But we might also want to get totals for a given value. Say, total running processes per application:
SELECT name, COUNT(*) AS total FROM processes
GROUP BY name ORDER BY total DESC
LIMIT 5;
name | total |
---|---|
codium | 40 |
firefox | 23 |
ferdium | 19 |
chrome | 14 |
zsh | 10 |
Human/Cyborg relations (combining tables)
But what if you wanted to use data from more than one table? What if you'd rather, say, associate processes with usernames instead of `uid`s?

Here's where the "relational" part comes in. You can use a `JOIN` to get a single, merged result that pulls in data based on shared details like columns. For instance, `users` also has a `uid` field, enabling us to...
-- get running processes by name, id, and user,
-- substituting username for user id
-- starting 25 results in
SELECT username, name, pid FROM processes
JOIN users ON processes.uid = users.uid
LIMIT 2 OFFSET 25;
username | name | pid |
---|---|---|
ryuzaki | ollama | 13523 |
ryuzaki | koboldcpp | 13525 |
`JOIN`s will output a single set of results containing columns from both tables. By default these are `INNER JOIN`s, which only return rows where there are matches on both sides. You can include other results from either table by using an `OUTER JOIN` -- a `LEFT OUTER JOIN` will return all rows from the first table regardless of matches, `RIGHT` will return all rows from the second, and `FULL` won't filter from either table. (Fair warning: `RIGHT` and `FULL` joins need a reasonably recent SQLite -- they landed in version 3.39.)
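As a sketch, if you wanted every process row to survive the join even where its `uid` has no match in `users` (the username simply comes back empty for those rows):

-- keep all processes, whether or not a user matches
SELECT username, name, pid FROM processes
LEFT OUTER JOIN users ON processes.uid = users.uid
LIMIT 3;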
You can also take results from multiple queries, across multiple tables, and collect them into one output. Say you wanted to grab all of the packages you've installed. You could make individual queries -- `SELECT name FROM python_packages`, `SELECT name FROM deb_packages`, `SELECT tags FROM docker_images` -- but you can also combine those into one set of results, with `UNION`.
Let's go for a simpler example, though. It's a lot of tables, and the default install on Linux is missing the Arch repos anyway.
We can also nest one query inside another as a subquery, and then operate on the entire collected output:
-- output 2 results from the inner query
-- parentheses (you know... `()`) will nest a query,
-- and then you can `select` from its results
-- the same way you can select from tables
SELECT * FROM (
-- get all extensions installed on firefox
-- or chromium-based browsers
-- that can actually be a *lot* of things:
-- brave, opera, *edge*...
-- note: there's also a `safari_extensions` table
-- but I'm not on a mac
SELECT name, browser_type AS source FROM chrome_extensions
UNION
SELECT name, 'firefox' AS source FROM firefox_addons
) LIMIT 2;
name | source |
---|---|
Plasma Integration | chromium |
Bitwarden | firefox |
The `UNION` operation takes similar sets of results, from queries across separate tables, and combines them into one. The key is that you're returning similar overall shapes: you'll notice above that I'm getting the same properties from each query, but `AS` you can see, I can rename other fields or assign manual values to shape the data however I might need.
You can even use subqueries to do it: for instance, while I could easily populate the `name` field for Docker images with `SELECT tags AS name`, I could also do it using `SELECT (SELECT tags FROM docker_images) AS name`, or insert that value into any other field of my choosing by using `AS` to assign it to a different column.
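Putting that together, folding Docker images into the earlier query might look something like this (a sketch -- I'm renaming `tags` so the shapes line up):

-- browser extensions and docker images, in one result set
SELECT name, browser_type AS source FROM chrome_extensions
UNION
SELECT name, 'firefox' AS source FROM firefox_addons
UNION
SELECT tags AS name, 'docker' AS source FROM docker_images;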
Perhaps the archives are incomplete (branching out)
Over six million different forms of communication (other output formats)
You can also get your results in CSV or JSON by adding the `--csv` or `--json` flags when you run `osqueryi`. This can be useful for output to other applications or languages.
So then the `etc_hosts` example from earlier would show as:
address|hostnames
127.0.0.1|localhost
...while the `pid` example above in JSON would look like:
[
{"name":"zsh","pid":"137211"},
{"name":"zsh","pid":"137416"},
{"name":"nu","pid":"137609"}
]
When invoking `osqueryi` directly on a single query, we can then use output redirection to save this to a file: `osqueryi --csv '.all uptime' > uptime.csv`
SQLite itself can import CSVs to query them directly, and other SQL databases like DuckDB can also do this with JSON. Speaking of which, like Markdown, SQL has a variety of dialects, and not everything above will work in every kind of database.
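As a sketch of that round trip, pulling the exported file back into a plain `sqlite3` shell would look something like this (dot-commands again; note that `osqueryi`'s CSV output is pipe-separated, hence the explicit separator):

.separator "|"
.import uptime.csv uptime
SELECT * FROM uptime;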
Let the past die. Kill it if you have to. (PRQL)
If you wanted to write complex queries, or work across multiple databases, you might also try PRQL. It's a language that simplifies a variety of query operations, and compiles them to raw SQL in the dialect of your choosing.
PRQL uses pipelines, represented either by the pipe character (`|`) or line breaks, to transform data at each step of an operation. As an example:
from shell_history
select { uid, command }
filter command != 'chrome'
take 10
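Compiled down to, say, the SQLite dialect, that pipeline comes out as roughly this SQL:

SELECT uid, command FROM shell_history
WHERE command <> 'chrome'
LIMIT 10;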
But I'm not going to get too deep into this. As a fun bit of trivia: SQL's pronunciation was never explicitly defined when the language was created, with spelling it out and saying "sequel" both seeing common usage. With PRQL, though, calling it "prequel" is canonical.
Abilities some consider to be unnatural (some other playgrounds for SQL)
Some other applications you might want to try include:
- Pretzel, a playground for PRQL and DuckDB that you can use to query local data. It runs in the browser using WebAssembly, and can be installed locally to your system as a Progressive Web App. (Fair warning: to my knowledge, it has no dark mode)
- Using Steampipe to query various web services. It has a plugin-based architecture enabling you to make SQL queries against API calls from a variety of different kinds of publicly-accessible platforms, ranging from cloud infrastructure services to some social media platforms to open data and standards like finance quotes and RSS feeds. It also offers a similar shell that you can call with `steampipe query`.
There's also a plethora of sample datasets available for public consumption, from sources like Kaggle or data.gov. While I've personally soured on it as a platform, there's also a subreddit for that. (Annoyingly, I haven't found a particularly active Lemmy.)
With options that include JSON you could even just make requests from public APIs to collect a dataset on something fun, like maybe data about Star Wars.
...and if you're feeling really ~~unhinged~~ enterprising, maybe you just end up writing 10,000 words on Web scraping, as you fall down some rabbit hole just because you wanted a dataset of lines from Futurama and your own curated jobs feed.