Introduction to SQL Queries: A Comprehensive Guide

Posted on Dec 11, 2024 @ 04:29 PM under Databases SQL Data Analytics

SQL (Structured Query Language) is the standard language used to interact with relational databases. Whether you're retrieving data, modifying records, or analyzing large datasets, SQL provides a powerful and efficient way to manage and manipulate information stored in relational databases.

In this tutorial, we will explore the core concepts of SQL, ranging from simple data retrieval to complex queries involving joins, aggregations, and subqueries. You’ll learn how to:

  • Write basic queries to select, insert, update, and delete data
  • Use filtering, grouping, and sorting to retrieve the exact data you need
  • Combine multiple tables with JOINs to build more powerful queries
  • Implement advanced SQL techniques such as subqueries, CTEs, window functions, and aggregate functions

By the end of this guide, you'll have a solid understanding of SQL and be equipped to write your own queries for a variety of real-world applications.

Let's start by installing these two libraries. We will use sqlalchemy only to be able to write raw SQL make updates to our tables. The reason for this, is because this tutorial is written in a Jupyter Notebook and we want the ability to also modify our database.

-- NB: Using Jupyter Notebook to modify the original source of the data is usually not a good idea. For this tutorial, the important thing to focus on is the actual SQL Queries. The database server used was MariaDB.

-- Why use Jupyter Notebook? Jupyter Notebook is an effective interactive environment for learning and exploration.

-- I have included comments for further clarification.

-- In this GitHub repository, there will also be a Python script that can be used to populate the database.

# Install libraries
!pip install pymysql
!pip install sqlalchemy
Requirement already satisfied: pymysql in /home/kevin/anaconda3/lib/python3.11/site-packages (1.1.1)
Requirement already satisfied: sqlalchemy in /home/kevin/anaconda3/lib/python3.11/site-packages (2.0.30)
Requirement already satisfied: typing-extensions>=4.6.0 in /home/kevin/anaconda3/lib/python3.11/site-packages (from sqlalchemy) (4.11.0)
Requirement already satisfied: greenlet!=0.4.17 in /home/kevin/anaconda3/lib/python3.11/site-packages (from sqlalchemy) (3.0.1)
# Import libraries
import pymysql # Connecting to Mariadb using mysql client
import pandas as pd
from sqlalchemy import create_engine, text # Will be used to create sql engine. We also use text to specify SQL string
# Set the maximum width for columns to 200 characters
pd.set_option('display.max_colwidth', 200)
# The purpose of this function is to hide the index when displaying dataframes for SQL results.
def hide_index(df):
    empty_index=[''] * len(df)
    df.index=empty_index
    return df

0. Schema => Design of Tables

-- Table Users => Stores information about the users
CREATE TABLE Users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(100) NOT NULL,
    email VARCHAR(100) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Table Quizzes => Stores information about quizzes created by users
CREATE TABLE Quizzes (
    quiz_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    title VARCHAR(255) NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE
);

-- Table Questions => Stores questions that are associated with a quiz
CREATE TABLE Questions (
    question_id INT AUTO_INCREMENT PRIMARY KEY,
    quiz_id INT,
    question_text TEXT NOT NULL,
    correct_answer TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (quiz_id) REFERENCES Quizzes(quiz_id) ON DELETE CASCADE
);

-- Table UserQuizQuestions => Tracks each user's answers and score for the questions in a quiz
CREATE TABLE UserQuizQuestions (
    user_quiz_question_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    quiz_id INT,
    question_id INT,
    user_answer TEXT NOT NULL,
    score INT DEFAULT 0,
    answered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE,
    FOREIGN KEY (quiz_id) REFERENCES Quizzes(quiz_id) ON DELETE CASCADE,
    FOREIGN KEY (question_id) REFERENCES Questions(question_id) ON DELETE CASCADE,
    UNIQUE (user_id, quiz_id, question_id) -- Ensures a user can only answer each question once in a quiz.
                                           -- We could have also used a composite primary key for this. 
);

-- idx_user_quiz this will speed up queries that need to find all answers a specific user gave for a quiz
CREATE INDEX idx_user_quiz ON UserQuizQuestions(user_id, quiz_id);
CREATE INDEX idx_quiz_question ON UserQuizQuestions(quiz_id, question_id);
# Create the SQLAlchemy engine. 
# NB: REPLACE VARIABLES BELOW
host='localhost'
user = 'root'
passwd = "pp08763547*T"
db = 'my_quiz_db'

engine = create_engine(f"mysql+pymysql://{user}:{passwd}@{host}/{db}")

1. Simple => Get a list of all records in a Table

-- These are basic queries to list all the data in a particular table.

  • All Users
df = pd.read_sql_query("SELECT * FROM Users", engine)
hide_index(df) # Call hide_index to hide the index from the result
user_id username email password_hash created_at
1 randyfowler allentimothy@example.org 86d51258b34327253b445d785b254555c7ca4e6c27c119461ac1e2b2a31e54c6 2022-01-16
2 kaitlinmckenzie camposbrooke@example.org 717d0b1b5f193c3737d53f0551370d644275c88753bbec4d3f74bc398c13281c 2024-09-20
3 vwhite chart@example.net 169e68a3c396a77d34df51305592d901920baebca869154e68db3d5b6ad1448d 2024-06-21
4 justin07 dennis80@example.com 3d2654dc6a42f69c940688a88ad6ec328a4b2773f10a35cb1501703fe064689a 2021-01-22
5 davidtodd my_new_email@example.com 43b34ed80cbecb26c2e5130de7dd65f068a6382ed124165744188206c3b3541e 2022-04-08
6 karenspears nelsonjennifer@example.org 8ccd0b4107471301a1d0d0c000246dba05ef1f472e3aa261b2593cb107c08e67 2024-08-30
7 ivargas rogersjaclyn@example.net f9bc792e9482b94db26096a7723961fba2d8d3ad5cab40348f806e9044c2b53e 2021-12-31
8 yrios laura90@example.net 3ffd6be1bf815e5cd18677a8314ea940565a67cf4ac90f4e284870c9679ede17 2022-03-30
9 muellerpatricia craig45@example.org ef69b5b047da7461ae48d93faa0a111544939d14a364d75b68c9878e7750def1 2022-01-14
  • All Quizzes
df = pd.read_sql_query("SELECT * FROM Users", engine)
hide_index(df)
user_id username email password_hash created_at
1 randyfowler allentimothy@example.org 86d51258b34327253b445d785b254555c7ca4e6c27c119461ac1e2b2a31e54c6 2022-01-16
2 kaitlinmckenzie camposbrooke@example.org 717d0b1b5f193c3737d53f0551370d644275c88753bbec4d3f74bc398c13281c 2024-09-20
3 vwhite chart@example.net 169e68a3c396a77d34df51305592d901920baebca869154e68db3d5b6ad1448d 2024-06-21
4 justin07 dennis80@example.com 3d2654dc6a42f69c940688a88ad6ec328a4b2773f10a35cb1501703fe064689a 2021-01-22
5 davidtodd my_new_email@example.com 43b34ed80cbecb26c2e5130de7dd65f068a6382ed124165744188206c3b3541e 2022-04-08
6 karenspears nelsonjennifer@example.org 8ccd0b4107471301a1d0d0c000246dba05ef1f472e3aa261b2593cb107c08e67 2024-08-30
7 ivargas rogersjaclyn@example.net f9bc792e9482b94db26096a7723961fba2d8d3ad5cab40348f806e9044c2b53e 2021-12-31
8 yrios laura90@example.net 3ffd6be1bf815e5cd18677a8314ea940565a67cf4ac90f4e284870c9679ede17 2022-03-30
9 muellerpatricia craig45@example.org ef69b5b047da7461ae48d93faa0a111544939d14a364d75b68c9878e7750def1 2022-01-14
  • All Questions
df = pd.read_sql_query("SELECT * FROM Questions", engine)
hide_index(df)
question_id quiz_id question_text correct_answer created_at
1 5 Call win two knowledge machine physical experience use anything. list 2021-09-26
3 4 One bring can within daughter choose win huge glass avoid position over possible. shake 2024-03-17
5 1 Life authority school me plant really simply focus then first else carry. sort 2021-03-16
6 5 Best budget thus finally shake money produce physical. upon 2020-10-04
7 5 Just research degree certain rise challenge resource technology nothing. far 2021-03-11
9 4 Hotel road live remain magazine four. off 2024-11-25
10 5 Memory much class age fast choose happen fear resource. financial 2021-05-12
11 4 Now establish across explain level maybe born. rest 2023-10-19
12 4 Free television benefit seek reach despite north. road 2023-02-22
13 1 Receive guess close think thus certainly difficult seem production evidence. pay 2020-01-31
14 1 Indeed win rate give ten capital reason bill over mind person. know 2023-05-26
16 3 Fish author few window growth consumer they party. can 2023-02-13
17 4 Cause agency make he amount although she money history piece score. like 2022-10-15
18 3 Improve keep seat season parent indeed. budget 2020-12-21
19 4 Writer agent few get plant case upon wait country. film 2021-04-03
20 3 Him player fact which both Mrs civil about course develop. take 2020-11-09
  • All UserQuizQuestions
df = pd.read_sql_query("SELECT * FROM Questions", engine)
hide_index(df)
question_id quiz_id question_text correct_answer created_at
1 5 Call win two knowledge machine physical experience use anything. list 2021-09-26
3 4 One bring can within daughter choose win huge glass avoid position over possible. shake 2024-03-17
5 1 Life authority school me plant really simply focus then first else carry. sort 2021-03-16
6 5 Best budget thus finally shake money produce physical. upon 2020-10-04
7 5 Just research degree certain rise challenge resource technology nothing. far 2021-03-11
9 4 Hotel road live remain magazine four. off 2024-11-25
10 5 Memory much class age fast choose happen fear resource. financial 2021-05-12
11 4 Now establish across explain level maybe born. rest 2023-10-19
12 4 Free television benefit seek reach despite north. road 2023-02-22
13 1 Receive guess close think thus certainly difficult seem production evidence. pay 2020-01-31
14 1 Indeed win rate give ten capital reason bill over mind person. know 2023-05-26
16 3 Fish author few window growth consumer they party. can 2023-02-13
17 4 Cause agency make he amount although she money history piece score. like 2022-10-15
18 3 Improve keep seat season parent indeed. budget 2020-12-21
19 4 Writer agent few get plant case upon wait country. film 2021-04-03
20 3 Him player fact which both Mrs civil about course develop. take 2020-11-09
# I did not want to continue writing `pd.read_sql_query` every time, so I created a simple function to run and return results.
def runSQL(sql_string, engine):
    df = pd.read_sql_query(sql_string, engine)
    return df

2. Select Columns => Get a list of all users with there username and email

-- This query will select all records from the Users table, showing username and email

df = pd.read_sql_query("SELECT username, email FROM Users", engine)
hide_index(df)
username email
randyfowler allentimothy@example.org
kaitlinmckenzie camposbrooke@example.org
vwhite chart@example.net
justin07 dennis80@example.com
davidtodd my_new_email@example.com
karenspears nelsonjennifer@example.org
ivargas rogersjaclyn@example.net
yrios laura90@example.net
muellerpatricia craig45@example.org

3. Filtering Data => Get all quizzes created by user with user_id=1

-- This query filters quizzes by creator's user_id, returning the title and description of quizzes create by user 1.

df = runSQL("SELECT title, description FROM Quizzes WHERE user_id=1",engine)
hide_index(df)
title description
Perform east realize. Door next answer available pull collection. Yet industry energy. Art no likely exist need soldier stage.

4. Aggregate Function => Find the number of questions in each Quiz

-- The COUNT(*) function counts the number of questions for each quiz, grouping by quiz_id.

query="""
SELECT quiz_id, COUNT(*) AS question_count
FROM Questions
GROUP BY quiz_id;
"""
df = runSQL(query, engine)
hide_index(df)
quiz_id question_count
1 3
3 3
4 6
5 4

5. JOIN => Retrieve username and title of quizzes for each user

-- This is an inner join between Users and Quizzes tables, that will match/link user_id in both tables to get the username and title for each quiz.

query="""
SELECT u.username, q.title
FROM Users u
JOIN Quizzes q ON u.user_id = q.user_id;
"""
df = runSQL(query, engine)
hide_index(df)
username title
yrios So herself light special.
davidtodd Not during.
karenspears Area many third.
randyfowler Perform east realize.

6. Multiple JOINS => Get a list of all users' answers for quiz quiz_id=2 along with the question text and score

-- This query joins UserQuizQuestions, Users, and Questions to retrieve the user's answers (user_answer), the question (question_text), and the score for quiz quiz_id = 2

query="""
SELECT u.username, q.question_text, uq.user_answer, uq.score
FROM UserQuizQuestions uq
JOIN Users u ON uq.user_id = u.user_id
JOIN Questions q ON uq.question_id = q.question_id
WHERE uq.quiz_id = 2;
"""
df = runSQL(query, engine)
hide_index(df)
# This will result in an empty set as shown below
username question_text user_answer score

7. Subquery => Get all users who have attempted a quiz with quiz_id=1

-- The subquery in the IN clause finds all users who have attempted the quiz with quiz_id = 1 by checking if user_id exists in the UserQuizQuestions table.

query="""
SELECT username
FROM Users
WHERE user_id IN (SELECT user_id FROM UserQuizQuestions WHERE quiz_id = 1);
"""
df = runSQL(query, engine)
hide_index(df)
username
ivargas
yrios
muellerpatricia

8. Common Table Expression (CTE) => Find the user who scored the highest in quiz quiz_id=3

-- CTEs are temporary named results that we can reference in later the main sql query.

-- The CTE UserScores calculates the total score for each user in quiz quiz_id = 3. Then, we JOIN this CTE with Users to get the username of the top scorer and order by total_score to select the highest one.

-- NB: Two students could have gotten the same high score. One way to deal with this is to rank them as shown in 9. below.

query="""
WITH UserScores AS (
  SELECT user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  WHERE quiz_id = 3
  GROUP BY user_id
)
SELECT u.username, us.total_score
FROM UserScores us
JOIN Users u ON us.user_id = u.user_id
ORDER BY us.total_score DESC
LIMIT 1;
"""
df = runSQL(query, engine)
hide_index(df)
username total_score
ivargas 5.0

9. Window Function => Rank users by their total score in quiz where quiz_id=3

-- The CTE calculates the total score for each user in quiz quiz_id = 3. The RANK() window function is used to rank users based on their scores in descending order. There is also a DENSE_RANK() that will not skip ranks after ties as was done when we used the RANK() function.

query="""
WITH UserScores AS (
  SELECT user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  WHERE quiz_id = 3
  GROUP BY user_id
)
SELECT u.username, us.total_score,
       RANK() OVER (ORDER BY us.total_score DESC) AS rank
FROM UserScores us
JOIN Users u ON us.user_id = u.user_id;
"""
df = runSQL(query, engine)
hide_index(df)
username total_score rank
ivargas 5.0 1
kaitlinmckenzie 3.0 2
karenspears 3.0 2
muellerpatricia 1.0 4

10. Group Cancatenation => Get a list of all questions for each quiz, concatenated into a single string, for quiz_id=2

-- The GROUP_CONCAT function is used to concatenate all question texts for the quiz quiz_id = 2 into a single string, separated by |.

query="""
SELECT quiz_id, GROUP_CONCAT(question_text SEPARATOR '| ') AS questions
FROM Questions
WHERE quiz_id = 2
GROUP BY quiz_id;
"""
df = runSQL(query, engine)
hide_index(df)
quiz_id questions

11. Self Join => Find pairs of users who have taken the same quiz (both users must have attempted the quiz)

-- A self join is performed on the UserQuizQuestions table to find pairs of users who have attempted the same quiz (quiz_id). We ensure that the users are different by checking u1.user_id <> u2.user_id. The result will have repetitions. It will give all permuations of the pairs irrespective of the order.

query="""
SELECT u1.username AS user_1, u2.username AS user_2, uq.quiz_id
FROM UserQuizQuestions uq
JOIN Users u1 ON uq.user_id = u1.user_id
JOIN UserQuizQuestions uq2 ON uq.quiz_id = uq2.quiz_id
JOIN Users u2 ON uq2.user_id = u2.user_id
WHERE u1.user_id <> u2.user_id;
"""
df = runSQL(query, engine)
hide_index(df)
user_1 user_2 quiz_id
randyfowler karenspears 4
randyfowler ivargas 4
randyfowler muellerpatricia 4
kaitlinmckenzie ivargas 3
kaitlinmckenzie karenspears 3
kaitlinmckenzie muellerpatricia 3
karenspears kaitlinmckenzie 3
karenspears ivargas 3
karenspears muellerpatricia 3
karenspears randyfowler 4
karenspears ivargas 4
karenspears muellerpatricia 4
ivargas yrios 1
ivargas muellerpatricia 1
ivargas kaitlinmckenzie 3
ivargas karenspears 3
ivargas muellerpatricia 3
ivargas karenspears 4
ivargas randyfowler 4
ivargas muellerpatricia 4
yrios muellerpatricia 1
yrios ivargas 1
muellerpatricia yrios 1
muellerpatricia ivargas 1
muellerpatricia kaitlinmckenzie 3
muellerpatricia ivargas 3
muellerpatricia karenspears 3
muellerpatricia karenspears 4
muellerpatricia randyfowler 4
muellerpatricia ivargas 4

12. Nested CTEs => Compute the average total score per user for each quiz

-- The first CTE QuizScores calculates the total score for each user in each quiz. The second CTE AverageScores computes the average score per quiz by averaging the total scores.

-- NB: We are averaging the sum of scores for each user with a quiz, not just the individual scores. This approach accounts for how well each user did on the quiz (as opposed to averaging the individual scores for all questions directly).

query="""
WITH QuizScores AS (
  SELECT quiz_id, user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id, user_id
),
AverageScores AS (
  SELECT quiz_id, AVG(total_score) AS avg_score
  FROM QuizScores
  GROUP BY quiz_id
)
SELECT q.quiz_id, avg_score
FROM AverageScores q;
"""
df = runSQL(query, engine)
hide_index(df)
quiz_id avg_score
1 1.6667
3 3.0000
4 2.2500
5 1.0000
  • The first CTE will give this result
query="""
  SELECT quiz_id, user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id, user_id;
"""
df = runSQL(query, engine)
hide_index(df)
quiz_id user_id total_score
1 7 1.0
1 8 1.0
1 9 3.0
3 2 3.0
3 6 3.0
3 7 5.0
3 9 1.0
4 1 0.0
4 6 5.0
4 7 1.0
4 9 3.0
5 6 1.0
  • If we averaged the individual scores (without accomodating each student) for all questions. We would get different results as shown below.
query="""
 SELECT quiz_id, AVG(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id
"""
df = runSQL(query, engine)
hide_index(df)
quiz_id total_score
1 1.6667
3 3.0000
4 2.2500
5 1.0000

13. Update => Update the email address of user with user_id=5 to a new email

-- This simple UPDATE query modifies the email field for the user with user_id = 5.

-- NB: We cannot use our function here, as we are not selecting. We are updating. We also have to explicitly commit changes for changes to persist. Under normal circumstances, you wouldn't be updating a database like this from a jupyter notebook. This is to demonstrate SQL, so the focus is on the actual SQL statement.

query="UPDATE Users SET email = 'my_new_email@example.com' WHERE user_id = 5;"
# Execute the update query
with engine.connect() as connection:
    connection.execute(text(query))
    connection.commit()  # Explicit commit to persist changes

14. Delete => Delete User from the database

-- Deleting user 10 will delete all entries associated with the user. This is possible because we had ensured that we cascaded on delete. If that were not the case, we would have to delete the children table first (UserQuizQuestions, Quizzes) and then delete the parent table Users

query="DELETE from Users WHERE user_id = 10;"
# Execute the update query
with engine.connect() as connection:
    connection.execute(text(query))
    connection.commit()  # Again we explicitly commit to persist changes

15. Subquery with EXISTS => Get Quizzes that do not have any questions associated with them.

-- The EXISTS subquery checks if there are any questions associated with each quiz. If no questions exist for a quiz, it is included in the result.

query="""
 SELECT q.quiz_id, q.title
 FROM Quizzes q
 WHERE NOT EXISTS (
 SELECT 1 FROM Questions WHERE quiz_id = q.quiz_id
);
"""
df = runSQL(query, engine)
hide_index(df)
# The set below will be empty
quiz_id title

16. CROSS JOIN => Generate all possible combinations of users and quizzes.

-- The CROSS JOIN generates the Cartesian product of Users and Quizzes, giving all possible combinations of users and quizzes.

query="""
 SELECT u.username, q.title
 FROM Users u
 CROSS JOIN Quizzes q;
"""
df = runSQL(query, engine)
hide_index(df)
username title
randyfowler So herself light special.
randyfowler Not during.
randyfowler Area many third.
randyfowler Perform east realize.
kaitlinmckenzie So herself light special.
kaitlinmckenzie Not during.
kaitlinmckenzie Area many third.
kaitlinmckenzie Perform east realize.
vwhite So herself light special.
vwhite Not during.
vwhite Area many third.
vwhite Perform east realize.
justin07 So herself light special.
justin07 Not during.
justin07 Area many third.
justin07 Perform east realize.
davidtodd So herself light special.
davidtodd Not during.
davidtodd Area many third.
davidtodd Perform east realize.
karenspears So herself light special.
karenspears Not during.
karenspears Area many third.
karenspears Perform east realize.
ivargas So herself light special.
ivargas Not during.
ivargas Area many third.
ivargas Perform east realize.
yrios So herself light special.
yrios Not during.
yrios Area many third.
yrios Perform east realize.
muellerpatricia So herself light special.
muellerpatricia Not during.
muellerpatricia Area many third.
muellerpatricia Perform east realize.

Conclusion

In conclusion, SQL is a powerful language for managing and manipulating data in relational databases. By mastering key concepts such as data retrieval, filtering, joins, and advanced techniques like subqueries and window functions, you can efficiently work with databases and extract valuable insights. With the tools and techniques covered in this tutorial, you are now equipped to write complex SQL queries and handle real-world data management tasks with confidence.