Query Execution Activity
In this activity, we’ll be conducting experiments on an SQLite database. In particular, our research questions are:
- By how much does the query response time decrease in the presence of an index for a one-to-many inner join?
- By how much does disk space increase in the presence of such an index?
Tool Setup
Download the latest command-line tools from the SQLite
website. Specifically, you’ll need the sqlite3
executable (e.g. sqlite3.exe for Windows, etc.). You won’t need the
library.
Open up the command line and move your sqlite3 executable to a working directory for this activity, so that it sits in the same directory as the rest of your files.
Let’s create a file-backed database called blog.db. To do that, run the executable with the database filename as an argument. For example, in Windows PowerShell:
> .\sqlite3.exe blog.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite>
Workload Setup
We’ll be using the Blog example from the lecture. A blog has Posts, where each post has potentially many Comments. Each Comment belongs to a post. Both tables have dates, authors, content, and a number of likes. Thus, in sqlite, our schema will look like:
CREATE TABLE posts(
id INTEGER PRIMARY KEY,
author VARCHAR(20),
likes INTEGER,
content VARCHAR(200),
posted TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE comments(
id INTEGER PRIMARY KEY,
author VARCHAR(20),
likes INTEGER,
content VARCHAR(200),
commented TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
post_id INTEGER
);
Run the above code. If you ever need to see the database tables, write
.schema
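For example, right after running the statements above, .schema should simply echo the two CREATE TABLE statements back at you:
sqlite> .schema
CREATE TABLE posts(
...
CREATE TABLE comments(
...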
Now, let’s create some data. We can use SQLite itself to generate random strings for each table and column. This data won’t be realistic, but at least it’ll be unique. We’ll make use of sqlite’s randomblob(n) to generate random bytes, and hex() to convert them into human-friendly hexadecimal strings. And “likes” will be a random non-negative number; the modulus in each statement caps its range (just under a million for posts, and just under a thousand for comments below). Like this:
INSERT INTO posts(author, likes, content)
WITH RECURSIVE
  -- x = author (40 hex chars), y = likes (0-999999), z = content (400 hex chars)
  randomdata(x, y, z) AS (
    SELECT lower(hex(randomblob(20))), abs(random()) % 1000000, lower(hex(randomblob(200)))
    UNION ALL
    SELECT lower(hex(randomblob(20))), abs(random()) % 1000000, lower(hex(randomblob(200)))
    FROM randomdata
    LIMIT 10000  -- number of rows to generate
  )
SELECT * FROM randomdata;
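If you’re curious what those building blocks produce, you can try them on their own (a quick sanity check; your random values will differ):
sqlite> SELECT lower(hex(randomblob(4))), abs(random()) % 1000;
This prints 8 random hexadecimal characters and a random number between 0 and 999.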
Don’t get caught up in the magic of these statements. Feel free to copy-and-paste. The LIMIT clause controls how many rows we generate, and we’ll be tweaking that later. Let’s start at 10,000 posts. To verify that this worked, you can count the table:
sqlite> SELECT COUNT(*) from posts;
10000
Likewise, we want to create comments that link to posts. The magic statement below cross-joins 20 rows of random data with every post id, giving each post exactly 20 comments (200,000 comments in total):
INSERT INTO comments(author, content, likes, post_id)
WITH RECURSIVE
  -- x = author, y = content, z = likes (0-999)
  randomdata(x, y, z) AS (
    SELECT lower(hex(randomblob(20))),
           lower(hex(randomblob(200))),
           abs(random()) % 1000
    UNION ALL
    SELECT lower(hex(randomblob(20))),
           lower(hex(randomblob(200))),
           abs(random()) % 1000
    FROM randomdata
    LIMIT 20  -- 20 random rows, one per comment on each post
  )
SELECT *
FROM randomdata,
     (SELECT id FROM posts ORDER BY RANDOM()) AS post_ids;
The above query might take a while. You may notice that our database file is now rather large. To verify row counts, let’s check:
sqlite> SELECT count(*) from comments;
200000
Note that our data is a tad unrealistic: it has EXACTLY 20 comments on every post. Real blogs have a much wider distribution. If we were writing a lab report, we would log that as a limitation.
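You can see that unrealistic distribution for yourself by grouping comments by post; every post id should come back with a count of exactly 20:
sqlite> SELECT post_id, COUNT(*) FROM comments GROUP BY post_id LIMIT 5;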
Experiments
Ok, now let’s set up a timer for our SQL commands. In the sqlite client, you can turn on the timer like this:
sqlite> .timer on
When you run a query, it’ll show the output like this:
sqlite> SELECT comments.* FROM comments LIMIT 50;
-- showing tons of data --
Run Time: real 0.311 user 0.000000 sys 0.062500
In the above query, we decided to display tons of data to the screen. That’s a huge I/O slowdown compared to the actual lookup time. In the timer results we had:
- real time, which represents the wall-clock time. This is as if we put our own stopwatch on the query. It includes the I/O of displaying all of the data on the screen.
- user time, which is the CPU time that sqlite itself spent, in user space. Databases do a lot of I/O, so this time might not be as useful for us today.
- sys time, which is the CPU time the operating system spent on the query’s behalf (like servicing I/O system calls or flushing caches). System time can be ignored for CPU-bound problems, but for I/O-bound ones, you have to consider it.
Note that user + sys almost never adds up to real. In this situation, most of the remaining wall-clock time went to the terminal rendering our data.
So for this experiment, let’s focus on real time. But let’s also try to make user time the bulk of our real time by minimizing I/O, and keep an eye out for high sys times. We’ll use COUNT(*) to count the rows so we don’t spend our time printing results to the screen.
Also, let’s write a query that selects only a subset of our data. Let’s have it check our random content rather than our dates (which are basically all the same). We’ll look for a small substring using the LIKE operator. This operator does wildcard pattern matching (a simpler cousin of regular expressions), and a pattern with a leading % forces a substring search through each value, which can be computationally intensive.
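As a quick illustration of those wildcards, % matches any run of characters (including none), so the first expression below is true and the second is false:
sqlite> SELECT 'abcdef' LIKE '%abcd%', 'abcdef' LIKE '%xyz%';
1|0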
Let’s try this:
SELECT COUNT(*)
FROM comments INNER JOIN posts ON comments.post_id=posts.id
WHERE
posts.content LIKE '%abcd%';
(Fun game: before you run this query, what do you think the answer will be? What if we did “abc” instead? What about “abcde”? Does that change our runtime?)
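If you want to check the posts side of your guess on its own, here’s a rough back-of-the-envelope, ignoring overlaps: each content value is 400 hex characters, and a specific 4-character substring has a 1 in 16^4 = 65,536 chance of starting at each of the ~397 possible positions, so roughly 0.6% of our 10,000 posts, around 60, should match:
sqlite> SELECT COUNT(*) FROM posts WHERE content LIKE '%abcd%';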
The above query took some time to complete:
Run Time: real 53.302 user 1.531250 sys 7.250000
How did sqlite choose to run this? Add an EXPLAIN QUERY PLAN to the beginning of the above query. Here’s my output:
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INTEGER PRIMARY KEY (rowid=?)
Just to make things easier, let’s turn on the query-plan explainer on the command line for every statement:
.eqp on
Notice in the plan above that sqlite searched posts using “rowid”, a system-defined identifier that automatically has an index; an INTEGER PRIMARY KEY column like our id is simply an alias for it. So even though we never said “CREATE INDEX”, we are actually using an index.
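You can see that aliasing for yourself; id and rowid should come back as identical pairs:
sqlite> SELECT id, rowid FROM posts LIMIT 3;
1|1
2|2
3|3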
Now, let’s try a narrower query, with WHERE clauses on BOTH tables.
SELECT COUNT(*)
FROM comments INNER JOIN posts ON comments.post_id=posts.id
WHERE
posts.content LIKE '%abcd%'
AND comments.content LIKE '%abcd%';
My results were:
sqlite> SELECT COUNT(*)
...> FROM comments INNER JOIN posts ON comments.post_id=posts.id
...> WHERE
...> posts.content LIKE '%abcd%'
...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INTEGER PRIMARY KEY (rowid=?)
0
Run Time: real 0.886 user 0.000000 sys 0.062500
Much faster! Interesting. The plan is the same, but the extra filter on comments is applied while scanning comments, so sqlite only has to search posts for the few comments that match, greatly reducing its reads.
Quick Diversion. How does the number of results affect the query time? Guess out loud with a friend before you try. Try various combinations of these:
posts.content LIKE '%ab%'
posts.content LIKE '%abc%'
posts.content LIKE '%abcdef%'
comments.content LIKE '%ab%'
comments.content LIKE '%abc%'
comments.content LIKE '%abcdef%'
Which combination is the slowest? By a significant margin? How could you determine this from the query plan?
But! These are all really fast because we are using an index. PRIMARY KEY implies that there’s a unique index on posts.id. What if we joined on a non-indexed integer field? The likes field has no index. Let’s say we wanted to find comments that have the same number of likes as some post (which is a weird query given the domain of blogging). The query looks like this; we’ve simply swapped our ON clause for an equality on likes.
SELECT COUNT(*)
FROM comments, posts
WHERE comments.likes = posts.likes
AND posts.content LIKE '%abcd%'
AND comments.content LIKE '%abcd%';
Quick diversion. Would the inequality run faster or slower than equality? Make a guess out loud before you try out comments.likes > posts.likes.
Here are my results:
sqlite> SELECT COUNT(*)
...> FROM comments, posts
...> WHERE
...> comments.likes = posts.likes
...> AND posts.content LIKE '%abcd%'
...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING AUTOMATIC COVERING INDEX (likes=?)
0
Run Time: real 10.208 user 0.687500 sys 1.359375
Note that our query plan still shows SQLite using an index. (Smart!) SQLite did some estimation and decided that building a temporary index on the fly would be faster than searching without one. So, let’s turn off that feature, just for the sake of our learning:
PRAGMA automatic_index = false;
Now re-run the query. I didn’t even let mine finish (Ctrl+C kills the query); it was taking ridiculously long. Why? SQLite had to do nested loops, checking every row of one table against every row of the other! The dead giveaway was this:
QUERY PLAN
|--SCAN TABLE comments
`--SCAN TABLE posts
For each row it scanned in comments, it had to scan the entire posts table looking for matching likes. So how do we speed up that inner search? Let’s use an index. Which table should we index? Either would help, but comments is the much bigger table, so being able to look things up in it quickly would be ideal.
CREATE INDEX idx_comment_likes ON comments(likes);
To check that our index is there, we can use the .indexes command. Here’s what creating the index looked like:
sqlite> CREATE INDEX idx_comment_likes ON comments(likes);
Run Time: real 11.819 user 0.359375 sys 1.140625
sqlite> .indexes comments
idx_comment_likes
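This is also a good moment to revisit our second research question, about disk space. From inside sqlite you can compute the database file’s size as the page count times the page size (or just check the size of blog.db in your file manager). Compare the numbers before and after a CREATE INDEX to see what the index costs:
sqlite> PRAGMA page_count;
sqlite> PRAGMA page_size;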
Note how long it took for us to build that index. That’s a long time. But will it speed up our earlier query? Let’s try it out:
sqlite> SELECT COUNT(*)
...> FROM comments, posts
...> WHERE
...> comments.likes = posts.likes
...> AND posts.content LIKE '%abcd%'
...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE posts
`--SEARCH TABLE comments USING INDEX idx_comment_likes (likes=?)
0
Run Time: real 0.697 user 0.031250 sys 0.125000
That’s fast.
What if we indexed both tables? Will SQLite know which index to choose?
sqlite> CREATE INDEX idx_posts_likes ON posts(likes);
Run Time: real 0.625 user 0.031250 sys 0.046875
Creating this index was nice and quick. Now back to our query:
sqlite> SELECT COUNT(*)
...> FROM comments, posts
...> WHERE
...> comments.likes = posts.likes
...> AND posts.content LIKE '%abcd%'
...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INDEX idx_posts_likes (likes=?)
0
Run Time: real 9.869 user 0.796875 sys 0.859375
Interesting! In this case, the presence of an extra index actually confused the query planner. SQLite (at least for me) chose to scan the larger table and use the index on the smaller one, which is over 10x slower than the previous plan.
So the lesson is: don’t just index everything. Index what will be used so as not to confuse the query planner.
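If you want to return to the faster plan, you don’t need to rebuild anything; just drop the extra index and re-run the query:
sqlite> DROP INDEX idx_posts_likes;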
Inequality Diversion Again. Now that we have an explicit index, can we replace the equals sign with a greater-than and get the same performance? (The answer has to do with b-trees, which we’ll get to later.)
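For reference, here’s the variant to experiment with; it’s the same query with the equals sign swapped for a greater-than. Check the plan before trusting the timing: a b-tree index can still be used for a range scan, but vastly more row pairs satisfy an inequality than an equality:
SELECT COUNT(*)
FROM comments, posts
WHERE comments.likes > posts.likes
AND posts.content LIKE '%abcd%'
AND comments.content LIKE '%abcd%';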