Query Execution Activity

In this activity, we’ll be conducting experiments on an SQLite database. In particular, our research questions are:

• How does SQLite decide how to execute a given query?
• How do indexes affect query execution time?

Tool Setup

Download the latest command-line tools from the SQLite website. Specifically, you’ll need the sqlite3 executable (e.g. sqlite3.exe on Windows). You won’t need the library.

Open up the command line and move your sqlite3 executable to a working directory for this activity. Your sqlite3 executable will reside in the same directory as your code.

Let’s create a file-backed database called blog.db. To do that, run the executable with the database name as an argument, e.g. if you are in Windows PowerShell:

> .\sqlite3.exe blog.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite>
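(When you’re done with a session, .quit exits the shell; your data persists in blog.db.)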
    

Workload Setup

We’ll be using the Blog example from the lecture. A blog has Posts, where each post potentially has many Comments. Each Comment belongs to a post. Both tables have dates, authors, content, and a number of likes. Thus, in SQLite, our schema will look like:

CREATE TABLE posts(
   id INTEGER PRIMARY KEY,
   author VARCHAR(20),
   likes INTEGER,
   content VARCHAR(200),
   posted TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE comments(
   id INTEGER PRIMARY KEY,
   author VARCHAR(20),
   likes INTEGER,
   content VARCHAR(200),
   commented TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   post_id INTEGER
);


Run the above code. If you ever need to see the database tables, type .schema
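.schema simply echoes back the CREATE statements stored in the database, so after the step above you should see both tables (output abridged here):

sqlite> .schema
CREATE TABLE posts(
   id INTEGER PRIMARY KEY,
   ...
);
CREATE TABLE comments(
   id INTEGER PRIMARY KEY,
   ...
);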

Now, let’s create some data. We can use SQLite itself to generate random strings for each of the tables and columns. This data won’t be realistic, but at least it’ll be unique. We’ll make use of SQLite’s randomblob(n) to generate random bytes and convert them to hexadecimal to be human-friendly. And “likes” will be a random non-negative number: below a million for posts, and below a thousand for comments. Like this:

INSERT INTO posts(author, likes, content)
WITH RECURSIVE
randomdata(x,y,z) AS (
  SELECT lower(hex(randomblob(20))), abs(random()) % 1000000, lower(hex(randomblob(200)))
  UNION ALL
  SELECT lower(hex(randomblob(20))), abs(random()) % 1000000, lower(hex(randomblob(200)))
  FROM randomdata
  LIMIT 10000
) SELECT * FROM randomdata;


Don’t get caught up in the magic of these statements; feel free to copy and paste. The LIMIT clause controls how many rows we generate, and we’ll be tweaking that later. Let’s start with 10,000 posts. To verify that this worked, you can count the table:

sqlite> SELECT COUNT(*) from posts;
10000

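If you’re curious what these random-data building blocks produce on their own, you can run them directly in the shell (your values will differ on every call):

sqlite> SELECT lower(hex(randomblob(20))), abs(random()) % 1000000;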

Likewise, we want to create comments that link to the posts. Here’s the magic statement; it cross-joins 20 rows of random data with every post id, so each post ends up with 20 comments (200,000 in total):

INSERT INTO comments(author, content, likes, post_id)
WITH RECURSIVE
randomdata(x,y,z) AS (
  SELECT lower(hex(randomblob(20))) AS author,
         lower(hex(randomblob(200))) AS content,
         abs(random()) % 1000
  UNION ALL
  SELECT lower(hex(randomblob(20))) AS author,
         lower(hex(randomblob(200))) AS content,
         abs(random()) % 1000
  FROM randomdata
  LIMIT 20
)
SELECT *
FROM randomdata,
     (SELECT id FROM posts ORDER BY RANDOM()) AS post_ids;


The above query might take a while, and you may notice that our database file is now rather large. To verify the row count, let’s check:

sqlite> SELECT count(*) from comments;
200000

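By the way, if you want to know exactly how large the database file has become, you can ask SQLite for its page count and page size; their product is the file size in bytes:

sqlite> PRAGMA page_count;
sqlite> PRAGMA page_size;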

Note that our data is a tad unrealistic: it has EXACTLY 20 comments on every post. Real blogs have a much wider distribution. If we were writing a lab report, we would log that as a limitation.
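You can verify that distribution yourself with a GROUP BY; every post_id should come back with the same count of 20:

SELECT post_id, COUNT(*) FROM comments GROUP BY post_id LIMIT 5;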

Experiments

Ok, now let’s set up a timer for our SQL commands. In the sqlite client, you can turn on the timer like this:

sqlite> .timer on
    

When you run a query, it’ll show the output like this:

sqlite> SELECT comments.* FROM comments LIMIT 50;
-- showing tons of data --
Run Time: real 0.311 user 0.000000 sys 0.062500


In the above query, we decided to display tons of data to the screen. That’s a huge I/O slowdown compared to the actual lookup time. In the timer results we had:

• real: the elapsed wall-clock time, in seconds
• user: CPU time spent in user mode, i.e. actually executing the query
• sys: CPU time spent in kernel mode, e.g. in system calls for I/O

Note that user+sys almost never equals real. In this situation, most of the remaining time went into the command prompt displaying our data.

So for this experiment, let’s focus on real time. But let’s also try to make user time the bulk of our real time by minimizing our I/O, and watch for high sys times. We’ll use a COUNT(*) to count the rows so we don’t spend time writing lots of output to the screen.

Also, let’s do a query that only selects a subset of our data. We’ll skip filtering on our timestamps (which will all be basically the same) and instead look for a small substring of our random data using the LIKE operator. This operator performs string pattern matching and can be computationally intensive.

Let’s try this:

SELECT COUNT(*)
FROM comments INNER JOIN posts ON comments.post_id=posts.id
WHERE posts.content LIKE '%abcd%';


(Fun game: before you run this query, what do you think the answer will be? What if we did “abc” instead? What about “abcde”? Does that change our runtime?)
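(Once you’ve made your guesses, here’s a back-of-the-envelope check using only numbers we already know. Each content value is 400 lowercase hex characters, so a specific 4-character pattern like abcd can start at 397 positions, each matching with probability 1/16⁴; that’s roughly a 0.6% chance per post, or about 60 of our 10,000 posts. With 20 comments per post, the COUNT(*) should land somewhere near 1,200. Shortening to abc multiplies the odds by 16; lengthening to abcde divides them by 16. Since the scan itself dominates the cost, the pattern length swings the count far more than the runtime.)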

The above query took some time to complete:

Run Time: real 53.302 user 1.531250 sys 7.250000
    

How did SQLite choose to run this? Add EXPLAIN QUERY PLAN to the beginning of the above query. Here’s my output:

QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INTEGER PRIMARY KEY (rowid=?)


Notice here that it’s using “rowid”, which is a system-defined identifier that automatically has an index. So in this case, even though we never said “CREATE INDEX”, we are actually using an index.

Just to make things easier, let’s turn on the query plan explainer on the command line for every statement:

.eqp on

Now, let’s try a narrower query with WHERE clauses on BOTH tables.

SELECT COUNT(*)
FROM comments INNER JOIN posts ON comments.post_id=posts.id
WHERE posts.content LIKE '%abcd%'
  AND comments.content LIKE '%abcd%';


My results were:

sqlite> SELECT COUNT(*)
   ...> FROM comments INNER JOIN posts ON comments.post_id=posts.id
   ...> WHERE
   ...> posts.content LIKE '%abcd%'
   ...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INTEGER PRIMARY KEY (rowid=?)
0
Run Time: real 0.886 user 0.000000 sys 0.062500


Much faster! Interesting. The query plan is the same, but the extra filter on comments.content can be checked during the scan itself, so very few rows survive to the join step and SQLite does far fewer reads.

Quick Diversion. How does the number of results affect the query time? Guess out loud with a friend before you try. Try various combinations of these:

• '%abc%', '%abcd%', or '%abcde%' on posts.content
• '%abc%', '%abcd%', or '%abcde%' on comments.content
• the same patterns on both tables at once

Which combination is the slowest? By a significant margin? How could you determine this from the query plan?

But! These are all really fast because we are using an index. PRIMARY KEY implies that there’s a unique index on posts.id. What if we queried using a non-indexed integer field? The likes field has no index. Let’s say we wanted to find comments that have the same number of likes as some post (which is a weird query given the domain of blogging). The query looks like this; we’ve simply replaced our ON clause with a WHERE comparison on likes:


SELECT COUNT(*)
FROM comments, posts
WHERE comments.likes = posts.likes
      AND posts.content LIKE '%abcd%'
      AND comments.content LIKE '%abcd%';
      

Quick diversion. Would an inequality run faster or slower than equality? Make a guess out loud before you try comments.likes > posts.likes.

Here’s my results:

sqlite> SELECT COUNT(*)
   ...> FROM comments, posts
   ...> WHERE
   ...> comments.likes = posts.likes
   ...> AND posts.content LIKE '%abcd%'
   ...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING AUTOMATIC COVERING INDEX (likes=?)
0
Run Time: real 10.208 user 0.687500 sys 1.359375


Note that our query plan shows SQLite is still using an index. (Smart!) SQLite did some estimation and decided that creating a temporary index on the fly was faster than searching without one. So, let’s turn off that feature, just for the sake of our learning:

PRAGMA automatic_index = false;
    

Then I reran the query, and I didn’t even let it finish. (Ctrl+C kills the query.) It was taking ridiculously long. Why? We had to do a nested loop and check every combination of rows across the two tables! The dead giveaway was this:

QUERY PLAN
|--SCAN TABLE comments
`--SCAN TABLE posts


For every row it scanned in one table, it had to search the entire other table. Instead, what if we sped up that search with an index? Which table should we index? Either would help, but comments is a much bigger table, so being able to look things up in it quickly is ideal.

CREATE INDEX idx_comment_likes ON comments(likes);
    

To check that our index is there, we can use the .indexes command. Here’s what creating the index looked like:

        
sqlite> CREATE INDEX idx_comment_likes ON comments(likes);
Run Time: real 11.819 user 0.359375 sys 1.140625
sqlite> .indexes comments
idx_comment_likes
    
    

Note how long it took to build that index. But will it speed up our earlier query? Let’s try it out:


sqlite> SELECT COUNT(*)
   ...> FROM comments, posts
   ...> WHERE
   ...> comments.likes = posts.likes
   ...> AND posts.content LIKE '%abcd%'
   ...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE posts
`--SEARCH TABLE comments USING INDEX idx_comment_likes (likes=?)
0
Run Time: real 0.697 user 0.031250 sys 0.125000
    

That’s fast.
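An aside for later: the exercises will ask you to drop indexes and try alternatives. The inverse of CREATE INDEX is simply:

DROP INDEX idx_comment_likes;

(Don’t run that yet, though; we still want this index around.)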

What if we indexed both tables? Will SQLite know which index to choose?


sqlite> CREATE INDEX idx_posts_likes ON posts(likes);
Run Time: real 0.625 user 0.031250 sys 0.046875


Creating this index was nice and quick. Now back to our query:


sqlite> SELECT COUNT(*)
   ...> FROM comments, posts
   ...> WHERE
   ...> comments.likes = posts.likes
   ...> AND posts.content LIKE '%abcd%'
   ...> AND comments.content LIKE '%abcd%';
QUERY PLAN
|--SCAN TABLE comments
`--SEARCH TABLE posts USING INDEX idx_posts_likes (likes=?)
0
Run Time: real 9.869 user 0.796875 sys 0.859375
    

Interesting!! In this case the presence of an extra index actually confused our query planner. SQLite (at least for me) chose to scan the larger table and search the smaller one with the new index.

So the lesson is: don’t just index everything. Index what will be used so as not to confuse the query planner.
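If you hit this in practice, there are remedies beyond just dropping the misleading index. Running ANALYZE gives the planner real statistics about your tables, which improves its cost estimates. You can also pin a table to a specific index with SQLite’s INDEXED BY clause. Here’s a sketch, using our index names from above, that should push the planner back toward the faster plan:

SELECT COUNT(*)
FROM posts, comments INDEXED BY idx_comment_likes  -- pin comments to our index
WHERE comments.likes = posts.likes
  AND posts.content LIKE '%abcd%'
  AND comments.content LIKE '%abcd%';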

Inequality Diversion Again. Now that we have an explicit index, can we replace the equals sign with a greater-than and get the same performance? (The answer has to do with b-trees, which we’ll get to later.)

Exercises

Once you get a handle on the above experiments, do the following experiments.

Disk Space Usage

Add Tagging

  1. Posts should be able to have a "tag". A tag can have multiple posts, and a post can have multiple tags. You'll need two tables for this; call them "tags" and "tagged". (A minimal schema sketch follows this step.)
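    Here's one minimal schema sketch; the name, tag_id, and post_id columns come from the INSERT statements below, while the id columns and the VARCHAR size are assumptions:
    
    CREATE TABLE tags(
       id INTEGER PRIMARY KEY,
       name VARCHAR(40)   -- size is an assumption; SQLite doesn't enforce it
    );
    
    CREATE TABLE tagged(
       id INTEGER PRIMARY KEY,
       tag_id INTEGER,    -- references tags.id
       post_id INTEGER    -- references posts.id
    );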
  2. Generate 1000 random tags. Use this code:
    
    INSERT INTO tags(name)
     WITH RECURSIVE
     randomdata(x) AS (
       SELECT lower(hex(randomblob(20)))
       UNION ALL
       SELECT lower(hex(randomblob(20)))
       FROM randomdata
       LIMIT 1000
     ) SELECT * FROM randomdata;
              
  3. Generate 10,000 connections between tags and posts. Use this code, running it twice:
    
    INSERT INTO tagged(tag_id, post_id)
      SELECT tag_ids.id, post_ids.id
      FROM (SELECT id FROM tags ORDER BY RANDOM() LIMIT 100) AS tag_ids ,
           (SELECT id FROM posts ORDER BY RANDOM() LIMIT 100) AS post_ids ;
              
            
  4. Write the following queries and record their timings; these will be your benchmark. You may add a LIMIT modifier to your queries to make the output easier to navigate, although your timing differences may be less noticeable. (A starter sketch for the first query appears after this list.)
    • List all tags with their post count (i.e. use GROUP BY)
    • List all tags with their post count, counting only posts with at least 10000 likes and keeping tags with at least 10 such posts, sorted by post count descending
    • List all posts that have a specific tag
    • List all posts with at least 10000 likes that have a specific tag
    • List all posts, ordered by the number of tags they have in descending order
    • List all posts that have 2 or more tags that happen to have "abcd" in their name
    • List all posts that have 20 or more tags and 10 or more comments, where the post author has "abc" as a substring and the comment content has "abc" as a substring
  5. Add an index for the tag_id, and a separate index for post_id on the joining table.
  6. Rerun your timings, and record the speed difference. Record the EXPLAINs as well. Which index gets used more often?
  7. Drop the above indexes and instead add a single multi-column index. Does it get used? Is there a performance difference? Is there a space difference?
  8. Drop the above indexes and instead index tag_id as above, plus a separate PARTIAL and ORDERED index for post_id on the joining table, for only posts that have at least 10,000 likes. Is there a noticeable difference in speed, index choice, and/or space? Does the query plan ever use this index?
  9. Add one more reasonable query to the benchmark, and rerun.
  10. Are there any queries that have seen little to no improvement in all of these query arrangements? Is there another index you can make that helps? (there's at least one that I know of!)
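As promised in step 4, here's a minimal starter sketch for the first benchmark query, assuming the tags/tagged schema sketched in step 1 (the LEFT JOIN keeps tags that have no posts):

SELECT tags.name, COUNT(tagged.post_id) AS post_count  -- COUNT skips NULLs, so empty tags show 0
FROM tags LEFT JOIN tagged ON tagged.tag_id = tags.id
GROUP BY tags.id;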
Meneely's solutions for the above queries