Introduction to SQL Queries: A Comprehensive Guide

Posted on Dec 11, 2024 @ 04:29 PM under Databases SQL Data Analytics

SQL (Structured Query Language) is the standard language used to interact with relational databases. Whether you're retrieving data, modifying records, or analyzing large datasets, SQL provides a powerful and efficient way to manage and manipulate information stored in relational databases.

In this tutorial, we will explore the core concepts of SQL, ranging from simple data retrieval to complex queries involving joins, aggregations, and subqueries. You’ll learn how to:

Write basic queries to select, insert, update, and delete data
Use filtering, grouping, and sorting to retrieve the exact data you need
Combine multiple tables with JOINs to build more powerful queries
Implement advanced SQL techniques such as subqueries, CTEs, window functions, and aggregate functions

By the end of this guide, you'll have a solid understanding of SQL and be equipped to write your own queries for a variety of real-world applications.

Let's start by installing these two libraries. We will use sqlalchemy only to be able to write raw SQL make updates to our tables. The reason for this, is because this tutorial is written in a Jupyter Notebook and we want the ability to also modify our database.

-- NB: Using Jupyter Notebook to modify the original source of the data is usually not a good idea. For this tutorial, the important thing to focus on is the actual SQL Queries. The database server used was MariaDB.

-- Why use Jupyter Notebook? Jupyter Notebook is an effective interactive environment for learning and exploration.

-- I have included comments for further clarification.

-- In this GitHub repository, there will also be a Python script that can be used to populate the database.

# Install libraries
!pip install pymysql
!pip install sqlalchemy

Requirement already satisfied: pymysql in /home/kevin/anaconda3/lib/python3.11/site-packages (1.1.1)
Requirement already satisfied: sqlalchemy in /home/kevin/anaconda3/lib/python3.11/site-packages (2.0.30)
Requirement already satisfied: typing-extensions>=4.6.0 in /home/kevin/anaconda3/lib/python3.11/site-packages (from sqlalchemy) (4.11.0)
Requirement already satisfied: greenlet!=0.4.17 in /home/kevin/anaconda3/lib/python3.11/site-packages (from sqlalchemy) (3.0.1)

# Import libraries
import pymysql # Connecting to Mariadb using mysql client
import pandas as pd
from sqlalchemy import create_engine, text # Will be used to create sql engine. We also use text to specify SQL string
# Set the maximum width for columns to 200 characters
pd.set_option('display.max_colwidth', 200)

# The purpose of this function is to hide the index when displaying dataframes for SQL results.
def hide_index(df):
    empty_index=[''] * len(df)
    df.index=empty_index
    return df

0. Schema => Design of Tables

-- Table Users => Stores information about the users
CREATE TABLE Users (
    user_id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(100) NOT NULL,
    email VARCHAR(100) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Table Quizzes => Stores information about quizzes created by users
CREATE TABLE Quizzes (
    quiz_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    title VARCHAR(255) NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE
);

-- Table Questions => Stores questions that are associated with a quiz
CREATE TABLE Questions (
    question_id INT AUTO_INCREMENT PRIMARY KEY,
    quiz_id INT,
    question_text TEXT NOT NULL,
    correct_answer TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (quiz_id) REFERENCES Quizzes(quiz_id) ON DELETE CASCADE
);

-- Table UserQuizQuestions => Tracks each user's answers and score for the questions in a quiz
CREATE TABLE UserQuizQuestions (
    user_quiz_question_id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT,
    quiz_id INT,
    question_id INT,
    user_answer TEXT NOT NULL,
    score INT DEFAULT 0,
    answered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES Users(user_id) ON DELETE CASCADE,
    FOREIGN KEY (quiz_id) REFERENCES Quizzes(quiz_id) ON DELETE CASCADE,
    FOREIGN KEY (question_id) REFERENCES Questions(question_id) ON DELETE CASCADE,
    UNIQUE (user_id, quiz_id, question_id) -- Ensures a user can only answer each question once in a quiz.
                                           -- We could have also used a composite primary key for this. 
);

-- idx_user_quiz this will speed up queries that need to find all answers a specific user gave for a quiz
CREATE INDEX idx_user_quiz ON UserQuizQuestions(user_id, quiz_id);
CREATE INDEX idx_quiz_question ON UserQuizQuestions(quiz_id, question_id);

# Create the SQLAlchemy engine. 
# NB: REPLACE VARIABLES BELOW
host='localhost'
user = 'root'
passwd = "pp08763547*T"
db = 'my_quiz_db'

engine = create_engine(f"mysql+pymysql://{user}:{passwd}@{host}/{db}")

1. Simple => Get a list of all records in a Table

-- These are basic queries to list all the data in a particular table.

All Users

df = pd.read_sql_query("SELECT * FROM Users", engine)
hide_index(df) # Call hide_index to hide the index from the result

user_id	username	email	password_hash	created_at
1	randyfowler	allentimothy@example.org	86d51258b34327253b445d785b254555c7ca4e6c27c119461ac1e2b2a31e54c6	2022-01-16
2	kaitlinmckenzie	camposbrooke@example.org	717d0b1b5f193c3737d53f0551370d644275c88753bbec4d3f74bc398c13281c	2024-09-20
3	vwhite	chart@example.net	169e68a3c396a77d34df51305592d901920baebca869154e68db3d5b6ad1448d	2024-06-21
4	justin07	dennis80@example.com	3d2654dc6a42f69c940688a88ad6ec328a4b2773f10a35cb1501703fe064689a	2021-01-22
5	davidtodd	my_new_email@example.com	43b34ed80cbecb26c2e5130de7dd65f068a6382ed124165744188206c3b3541e	2022-04-08
6	karenspears	nelsonjennifer@example.org	8ccd0b4107471301a1d0d0c000246dba05ef1f472e3aa261b2593cb107c08e67	2024-08-30
7	ivargas	rogersjaclyn@example.net	f9bc792e9482b94db26096a7723961fba2d8d3ad5cab40348f806e9044c2b53e	2021-12-31
8	yrios	laura90@example.net	3ffd6be1bf815e5cd18677a8314ea940565a67cf4ac90f4e284870c9679ede17	2022-03-30
9	muellerpatricia	craig45@example.org	ef69b5b047da7461ae48d93faa0a111544939d14a364d75b68c9878e7750def1	2022-01-14

All Quizzes

df = pd.read_sql_query("SELECT * FROM Users", engine)
hide_index(df)

user_id	username	email	password_hash	created_at
1	randyfowler	allentimothy@example.org	86d51258b34327253b445d785b254555c7ca4e6c27c119461ac1e2b2a31e54c6	2022-01-16
2	kaitlinmckenzie	camposbrooke@example.org	717d0b1b5f193c3737d53f0551370d644275c88753bbec4d3f74bc398c13281c	2024-09-20
3	vwhite	chart@example.net	169e68a3c396a77d34df51305592d901920baebca869154e68db3d5b6ad1448d	2024-06-21
4	justin07	dennis80@example.com	3d2654dc6a42f69c940688a88ad6ec328a4b2773f10a35cb1501703fe064689a	2021-01-22
5	davidtodd	my_new_email@example.com	43b34ed80cbecb26c2e5130de7dd65f068a6382ed124165744188206c3b3541e	2022-04-08
6	karenspears	nelsonjennifer@example.org	8ccd0b4107471301a1d0d0c000246dba05ef1f472e3aa261b2593cb107c08e67	2024-08-30
7	ivargas	rogersjaclyn@example.net	f9bc792e9482b94db26096a7723961fba2d8d3ad5cab40348f806e9044c2b53e	2021-12-31
8	yrios	laura90@example.net	3ffd6be1bf815e5cd18677a8314ea940565a67cf4ac90f4e284870c9679ede17	2022-03-30
9	muellerpatricia	craig45@example.org	ef69b5b047da7461ae48d93faa0a111544939d14a364d75b68c9878e7750def1	2022-01-14

All Questions

df = pd.read_sql_query("SELECT * FROM Questions", engine)
hide_index(df)

question_id	quiz_id	question_text	correct_answer	created_at
1	5	Call win two knowledge machine physical experience use anything.	list	2021-09-26
3	4	One bring can within daughter choose win huge glass avoid position over possible.	shake	2024-03-17
5	1	Life authority school me plant really simply focus then first else carry.	sort	2021-03-16
6	5	Best budget thus finally shake money produce physical.	upon	2020-10-04
7	5	Just research degree certain rise challenge resource technology nothing.	far	2021-03-11
9	4	Hotel road live remain magazine four.	off	2024-11-25
10	5	Memory much class age fast choose happen fear resource.	financial	2021-05-12
11	4	Now establish across explain level maybe born.	rest	2023-10-19
12	4	Free television benefit seek reach despite north.	road	2023-02-22
13	1	Receive guess close think thus certainly difficult seem production evidence.	pay	2020-01-31
14	1	Indeed win rate give ten capital reason bill over mind person.	know	2023-05-26
16	3	Fish author few window growth consumer they party.	can	2023-02-13
17	4	Cause agency make he amount although she money history piece score.	like	2022-10-15
18	3	Improve keep seat season parent indeed.	budget	2020-12-21
19	4	Writer agent few get plant case upon wait country.	film	2021-04-03
20	3	Him player fact which both Mrs civil about course develop.	take	2020-11-09

All UserQuizQuestions

df = pd.read_sql_query("SELECT * FROM Questions", engine)
hide_index(df)

question_id	quiz_id	question_text	correct_answer	created_at
1	5	Call win two knowledge machine physical experience use anything.	list	2021-09-26
3	4	One bring can within daughter choose win huge glass avoid position over possible.	shake	2024-03-17
5	1	Life authority school me plant really simply focus then first else carry.	sort	2021-03-16
6	5	Best budget thus finally shake money produce physical.	upon	2020-10-04
7	5	Just research degree certain rise challenge resource technology nothing.	far	2021-03-11
9	4	Hotel road live remain magazine four.	off	2024-11-25
10	5	Memory much class age fast choose happen fear resource.	financial	2021-05-12
11	4	Now establish across explain level maybe born.	rest	2023-10-19
12	4	Free television benefit seek reach despite north.	road	2023-02-22
13	1	Receive guess close think thus certainly difficult seem production evidence.	pay	2020-01-31
14	1	Indeed win rate give ten capital reason bill over mind person.	know	2023-05-26
16	3	Fish author few window growth consumer they party.	can	2023-02-13
17	4	Cause agency make he amount although she money history piece score.	like	2022-10-15
18	3	Improve keep seat season parent indeed.	budget	2020-12-21
19	4	Writer agent few get plant case upon wait country.	film	2021-04-03
20	3	Him player fact which both Mrs civil about course develop.	take	2020-11-09

# I did not want to continue writing `pd.read_sql_query` every time, so I created a simple function to run and return results.
def runSQL(sql_string, engine):
    df = pd.read_sql_query(sql_string, engine)
    return df

2. Select Columns => Get a list of all users with there username and email

-- This query will select all records from the Users table, showing username and email

df = pd.read_sql_query("SELECT username, email FROM Users", engine)
hide_index(df)

	username	email
	randyfowler	allentimothy@example.org
	kaitlinmckenzie	camposbrooke@example.org
	vwhite	chart@example.net
	justin07	dennis80@example.com
	davidtodd	my_new_email@example.com
	karenspears	nelsonjennifer@example.org
	ivargas	rogersjaclyn@example.net
	yrios	laura90@example.net
	muellerpatricia	craig45@example.org

3. Filtering Data => Get all quizzes created by user with user_id=1

-- This query filters quizzes by creator's user_id, returning the title and description of quizzes create by user 1.

df = runSQL("SELECT title, description FROM Quizzes WHERE user_id=1",engine)
hide_index(df)

	title	description
	Perform east realize.	Door next answer available pull collection. Yet industry energy. Art no likely exist need soldier stage.

4. Aggregate Function => Find the number of questions in each Quiz

-- The COUNT(*) function counts the number of questions for each quiz, grouping by quiz_id.

query="""
SELECT quiz_id, COUNT(*) AS question_count
FROM Questions
GROUP BY quiz_id;
"""
df = runSQL(query, engine)
hide_index(df)

	quiz_id	question_count
	1	3
	3	3
	4	6
	5	4

5. JOIN => Retrieve username and title of quizzes for each user

-- This is an inner join between Users and Quizzes tables, that will match/link user_id in both tables to get the username and title for each quiz.

query="""
SELECT u.username, q.title
FROM Users u
JOIN Quizzes q ON u.user_id = q.user_id;
"""
df = runSQL(query, engine)
hide_index(df)

	username	title
	yrios	So herself light special.
	davidtodd	Not during.
	karenspears	Area many third.
	randyfowler	Perform east realize.

6. Multiple JOINS => Get a list of all users' answers for quiz quiz_id=2 along with the question text and score

-- This query joins UserQuizQuestions, Users, and Questions to retrieve the user's answers (user_answer), the question (question_text), and the score for quiz quiz_id = 2

query="""
SELECT u.username, q.question_text, uq.user_answer, uq.score
FROM UserQuizQuestions uq
JOIN Users u ON uq.user_id = u.user_id
JOIN Questions q ON uq.question_id = q.question_id
WHERE uq.quiz_id = 2;
"""
df = runSQL(query, engine)
hide_index(df)
# This will result in an empty set as shown below

	username	question_text	user_answer	score

7. Subquery => Get all users who have attempted a quiz with quiz_id=1

-- The subquery in the IN clause finds all users who have attempted the quiz with quiz_id = 1 by checking if user_id exists in the UserQuizQuestions table.

query="""
SELECT username
FROM Users
WHERE user_id IN (SELECT user_id FROM UserQuizQuestions WHERE quiz_id = 1);
"""
df = runSQL(query, engine)
hide_index(df)

	username
	ivargas
	yrios
	muellerpatricia

8. Common Table Expression (CTE) => Find the user who scored the highest in quiz quiz_id=3

-- CTEs are temporary named results that we can reference in later the main sql query.

-- The CTE UserScores calculates the total score for each user in quiz quiz_id = 3. Then, we JOIN this CTE with Users to get the username of the top scorer and order by total_score to select the highest one.

-- NB: Two students could have gotten the same high score. One way to deal with this is to rank them as shown in 9. below.

query="""
WITH UserScores AS (
  SELECT user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  WHERE quiz_id = 3
  GROUP BY user_id
)
SELECT u.username, us.total_score
FROM UserScores us
JOIN Users u ON us.user_id = u.user_id
ORDER BY us.total_score DESC
LIMIT 1;
"""
df = runSQL(query, engine)
hide_index(df)

	username	total_score
	ivargas	5.0

9. Window Function => Rank users by their total score in quiz where quiz_id=3

-- The CTE calculates the total score for each user in quiz quiz_id = 3. The RANK() window function is used to rank users based on their scores in descending order. There is also a DENSE_RANK() that will not skip ranks after ties as was done when we used the RANK() function.

query="""
WITH UserScores AS (
  SELECT user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  WHERE quiz_id = 3
  GROUP BY user_id
)
SELECT u.username, us.total_score,
       RANK() OVER (ORDER BY us.total_score DESC) AS rank
FROM UserScores us
JOIN Users u ON us.user_id = u.user_id;
"""
df = runSQL(query, engine)
hide_index(df)

username	total_score	rank
ivargas	5.0	1
kaitlinmckenzie	3.0	2
karenspears	3.0	2
muellerpatricia	1.0	4

10. Group Cancatenation => Get a list of all questions for each quiz, concatenated into a single string, for quiz_id=2

-- The GROUP_CONCAT function is used to concatenate all question texts for the quiz quiz_id = 2 into a single string, separated by |.

query="""
SELECT quiz_id, GROUP_CONCAT(question_text SEPARATOR '| ') AS questions
FROM Questions
WHERE quiz_id = 2
GROUP BY quiz_id;
"""
df = runSQL(query, engine)
hide_index(df)

	quiz_id	questions

11. Self Join => Find pairs of users who have taken the same quiz (both users must have attempted the quiz)

-- A self join is performed on the UserQuizQuestions table to find pairs of users who have attempted the same quiz (quiz_id). We ensure that the users are different by checking u1.user_id <> u2.user_id. The result will have repetitions. It will give all permuations of the pairs irrespective of the order.

query="""
SELECT u1.username AS user_1, u2.username AS user_2, uq.quiz_id
FROM UserQuizQuestions uq
JOIN Users u1 ON uq.user_id = u1.user_id
JOIN UserQuizQuestions uq2 ON uq.quiz_id = uq2.quiz_id
JOIN Users u2 ON uq2.user_id = u2.user_id
WHERE u1.user_id <> u2.user_id;
"""
df = runSQL(query, engine)
hide_index(df)

user_1	user_2	quiz_id
randyfowler	karenspears	4
randyfowler	ivargas	4
randyfowler	muellerpatricia	4
kaitlinmckenzie	ivargas	3
kaitlinmckenzie	karenspears	3
kaitlinmckenzie	muellerpatricia	3
karenspears	kaitlinmckenzie	3
karenspears	ivargas	3
karenspears	muellerpatricia	3
karenspears	randyfowler	4
karenspears	ivargas	4
karenspears	muellerpatricia	4
ivargas	yrios	1
ivargas	muellerpatricia	1
ivargas	kaitlinmckenzie	3
ivargas	karenspears	3
ivargas	muellerpatricia	3
ivargas	karenspears	4
ivargas	randyfowler	4
ivargas	muellerpatricia	4
yrios	muellerpatricia	1
yrios	ivargas	1
muellerpatricia	yrios	1
muellerpatricia	ivargas	1
muellerpatricia	kaitlinmckenzie	3
muellerpatricia	ivargas	3
muellerpatricia	karenspears	3
muellerpatricia	karenspears	4
muellerpatricia	randyfowler	4
muellerpatricia	ivargas	4

12. Nested CTEs => Compute the average total score per user for each quiz

-- The first CTE QuizScores calculates the total score for each user in each quiz. The second CTE AverageScores computes the average score per quiz by averaging the total scores.

-- NB: We are averaging the sum of scores for each user with a quiz, not just the individual scores. This approach accounts for how well each user did on the quiz (as opposed to averaging the individual scores for all questions directly).

query="""
WITH QuizScores AS (
  SELECT quiz_id, user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id, user_id
),
AverageScores AS (
  SELECT quiz_id, AVG(total_score) AS avg_score
  FROM QuizScores
  GROUP BY quiz_id
)
SELECT q.quiz_id, avg_score
FROM AverageScores q;
"""
df = runSQL(query, engine)
hide_index(df)

	quiz_id	avg_score
	1	1.6667
	3	3.0000
	4	2.2500
	5	1.0000

The first CTE will give this result

query="""
  SELECT quiz_id, user_id, SUM(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id, user_id;
"""
df = runSQL(query, engine)
hide_index(df)

quiz_id	user_id	total_score
1	7	1.0
1	8	1.0
1	9	3.0
3	2	3.0
3	6	3.0
3	7	5.0
3	9	1.0
4	1	0.0
4	6	5.0
4	7	1.0
4	9	3.0
5	6	1.0

If we averaged the individual scores (without accomodating each student) for all questions. We would get different results as shown below.

query="""
 SELECT quiz_id, AVG(score) AS total_score
  FROM UserQuizQuestions
  GROUP BY quiz_id
"""
df = runSQL(query, engine)
hide_index(df)

	quiz_id	total_score
	1	1.6667
	3	3.0000
	4	2.2500
	5	1.0000

13. Update => Update the email address of user with user_id=5 to a new email

-- This simple UPDATE query modifies the email field for the user with user_id = 5.

-- NB: We cannot use our function here, as we are not selecting. We are updating. We also have to explicitly commit changes for changes to persist. Under normal circumstances, you wouldn't be updating a database like this from a jupyter notebook. This is to demonstrate SQL, so the focus is on the actual SQL statement.

query="UPDATE Users SET email = 'my_new_email@example.com' WHERE user_id = 5;"
# Execute the update query
with engine.connect() as connection:
    connection.execute(text(query))
    connection.commit()  # Explicit commit to persist changes

14. Delete => Delete User from the database

-- Deleting user 10 will delete all entries associated with the user. This is possible because we had ensured that we cascaded on delete. If that were not the case, we would have to delete the children table first (UserQuizQuestions, Quizzes) and then delete the parent table Users

query="DELETE from Users WHERE user_id = 10;"
# Execute the update query
with engine.connect() as connection:
    connection.execute(text(query))
    connection.commit()  # Again we explicitly commit to persist changes

15. Subquery with EXISTS => Get Quizzes that do not have any questions associated with them.

-- The EXISTS subquery checks if there are any questions associated with each quiz. If no questions exist for a quiz, it is included in the result.

query="""
 SELECT q.quiz_id, q.title
 FROM Quizzes q
 WHERE NOT EXISTS (
 SELECT 1 FROM Questions WHERE quiz_id = q.quiz_id
);
"""
df = runSQL(query, engine)
hide_index(df)
# The set below will be empty

	quiz_id	title

16. CROSS JOIN => Generate all possible combinations of users and quizzes.

-- The CROSS JOIN generates the Cartesian product of Users and Quizzes, giving all possible combinations of users and quizzes.

query="""
 SELECT u.username, q.title
 FROM Users u
 CROSS JOIN Quizzes q;
"""
df = runSQL(query, engine)
hide_index(df)

	username	title
	randyfowler	So herself light special.
	randyfowler	Not during.
	randyfowler	Area many third.
	randyfowler	Perform east realize.
	kaitlinmckenzie	So herself light special.
	kaitlinmckenzie	Not during.
	kaitlinmckenzie	Area many third.
	kaitlinmckenzie	Perform east realize.
	vwhite	So herself light special.
	vwhite	Not during.
	vwhite	Area many third.
	vwhite	Perform east realize.
	justin07	So herself light special.
	justin07	Not during.
	justin07	Area many third.
	justin07	Perform east realize.
	davidtodd	So herself light special.
	davidtodd	Not during.
	davidtodd	Area many third.
	davidtodd	Perform east realize.
	karenspears	So herself light special.
	karenspears	Not during.
	karenspears	Area many third.
	karenspears	Perform east realize.
	ivargas	So herself light special.
	ivargas	Not during.
	ivargas	Area many third.
	ivargas	Perform east realize.
	yrios	So herself light special.
	yrios	Not during.
	yrios	Area many third.
	yrios	Perform east realize.
	muellerpatricia	So herself light special.
	muellerpatricia	Not during.
	muellerpatricia	Area many third.
	muellerpatricia	Perform east realize.

Conclusion

In conclusion, SQL is a powerful language for managing and manipulating data in relational databases. By mastering key concepts such as data retrieval, filtering, joins, and advanced techniques like subqueries and window functions, you can efficiently work with databases and extract valuable insights. With the tools and techniques covered in this tutorial, you are now equipped to write complex SQL queries and handle real-world data management tasks with confidence.

quiz_id	user_id	total_score
1	7	1.0
1	8	1.0
1	9	3.0
3	2	3.0
3	6	3.0
3	7	5.0
3	9	1.0
4	1	0.0
4	6	5.0
4	7	1.0
4	9	3.0
5	6	1.0

quiz_id	user_id	total_score
1	7	1.0
1	8	1.0
1	9	3.0
3	2	3.0
3	6	3.0
3	7	5.0
3	9	1.0
4	1	0.0
4	6	5.0
4	7	1.0
4	9	3.0
5	6	1.0