Introduction
In today's world, where data is generated every millisecond, extracting valuable insights from it is a crucial skill. Data analysis is the process of gathering, cleaning, transforming, and modelling data to unearth useful information and support decision-making that satisfies business requirements.
Overview of SQL and Pandas
SQL (Structured Query Language) is a domain-specific language widely recognized for managing and manipulating relational databases. It standardizes the way data is retrieved and operated on across many database systems, e.g. MySQL, PostgreSQL, SQLite, and others. SQL is valued for its performance on huge datasets and for handling both complex data manipulation and basic tasks with equal simplicity.
Pandas, on the other hand, is a Python library designed specifically for data manipulation and analysis. It provides high-level data structures and functions that let you work with structured data quickly, easily, and productively. Pandas stands out among data analysts and data scientists because of its flexibility, powerful functionality, and ease of integration with other Python libraries like NumPy, Matplotlib and Scikit-learn.
Similarities and differences between SQL and Pandas
Let us now look at some of the operations we can perform in both SQL and Pandas.
The goal of this analysis is to demonstrate how to perform basic data exploration, filtering, aggregation, grouping, and joining operations on a dataset containing information about customers and their purchases.
Data Exploration:
- Goal: Understand the structure and contents of the dataset.
- SQL
SELECT *
FROM shopping_behavior
LIMIT 10;
- SELECT *: This selects all columns from the specified table.
- LIMIT 10: Limits the result to the first 10 rows returned by the query.
Pandas
import pandas as pd

# Read the dataset into a DataFrame
df = pd.read_csv('shopping_behavior.csv')

# Display the first few rows of the DataFrame
print(df.head(10))
- pd.read_csv('shopping_behavior.csv'): Reads the dataset into a DataFrame using Pandas.
- df.head(10): Returns the first 10 rows of the DataFrame.
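Beyond the first rows, two quick follow-ups round out the structural picture; a minimal sketch using standard Pandas calls on the df loaded above:

# Number of rows and columns as a (rows, columns) tuple
print(df.shape)

# Column names, dtypes, and non-null counts in a single view
df.info()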
- Goal: Check for Missing Values
- SQL
SELECT
    SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END) AS null_age_count,
    SUM(CASE WHEN Gender IS NULL THEN 1 ELSE 0 END) AS null_gender_count,
    SUM(CASE WHEN `Purchase Amount (USD)` IS NULL THEN 1 ELSE 0 END) AS null_purchase_amount_count
FROM shopping_behavior;
- SUM(CASE …): Counts the occurrences of NULL values in specific columns.
Pandas
# Check for missing values in the dataset
missing_values_count = df.isnull().sum()
print("Missing Values:\n", missing_values_count)
- isnull(): Returns a DataFrame of the same shape as df where each value is either True or False, indicating missing values.
- .sum(): Sums up the number of True values (missing values) for each column.
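If these counts reveal gaps, two common follow-ups are dropping the affected rows or filling them in; a minimal, dataset-dependent sketch (filling with the column mean is an illustrative choice, not something prescribed by the dataset):

# Drop every row that contains at least one missing value (returns a new DataFrame)
df_clean = df.dropna()

# Or fill missing purchase amounts with the column mean instead
df_filled = df.fillna({'Purchase Amount (USD)': df['Purchase Amount (USD)'].mean()})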
- Goal: Summary Statistics
- SQL
SELECT
    AVG(Age) AS avg_age,
    AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,
    COUNT(DISTINCT Category) AS num_categories,
    COUNT(DISTINCT `Payment Method`) AS num_payment_methods,
    COUNT(DISTINCT Location) AS num_locations
FROM shopping_behavior;
- AVG: Calculates the average value.
- COUNT(DISTINCT …): Counts the number of distinct values in specified columns.
Pandas
# Display summary statistics
summary_statistics = df.describe()
print("Summary Statistics:\n", summary_statistics)
- .describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the distribution of the DataFrame’s columns.
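Note that describe() covers numeric columns by default, so it does not reproduce the SQL query's distinct counts for the categorical columns. A minimal sketch of the closer equivalent, using the column names from the snippets above:

# Averages for the numeric columns, mirroring AVG(...) in the SQL query
print(df['Age'].mean())
print(df['Purchase Amount (USD)'].mean())

# Distinct-value counts for the categorical columns, mirroring COUNT(DISTINCT ...)
print(df[['Category', 'Payment Method', 'Location']].nunique())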
Data Filtering
- Goal: Filter by Age and Gender
- SQL:
SELECT *
FROM shopping_behavior
WHERE Age > 30
AND Gender = 'Male';
- WHERE: Filters the rows based on specified conditions.
Pandas
# Filter data based on age and gender
filtered_data = df[(df['Age'] > 30) & (df['Gender'] == 'Male')]
print(filtered_data)
- df['Age'] > 30: Creates a boolean mask based on the condition Age > 30.
- &: Combines multiple conditions.
- df[…]: Filters the DataFrame based on the boolean mask.
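Pandas also offers query(), whose string syntax reads closer to SQL's WHERE clause; a minimal sketch of the same filter:

# Equivalent filter using DataFrame.query(); the expression reads like a WHERE clause
filtered_data = df.query("Age > 30 and Gender == 'Male'")
print(filtered_data)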
- Goal: Filter by Category
- SQL:
SELECT *
FROM shopping_behavior
WHERE Category = 'Electronics';
- WHERE: Filters the rows where the category is ‘Electronics’.
Pandas:
# Filter data based on category
electronics_data = df[df['Category'] == 'Electronics']
print(electronics_data)
- df['Category'] == 'Electronics': Creates a boolean mask that is True where Category equals 'Electronics'.
- df[…]: Filters the DataFrame based on the boolean mask.
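To match several categories at once, isin() is the Pandas counterpart of SQL's WHERE Category IN (...); in this sketch, 'Clothing' is a hypothetical second category used only for illustration:

# Filter for multiple categories; 'Clothing' is a hypothetical example value,
# analogous to SQL's WHERE Category IN ('Electronics', 'Clothing')
multi_category_data = df[df['Category'].isin(['Electronics', 'Clothing'])]
print(multi_category_data)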
Grouping and Aggregation
- Goal: Grouping by Gender and Calculating Average Purchase Amount
- SQL
SELECT Gender, AVG(`Purchase Amount (USD)`) AS avg_purchase_amount
FROM shopping_behavior
GROUP BY Gender;
- AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
- GROUP BY Gender: Groups the result set by gender.
Pandas
avg_purchase_by_gender = df.groupby('Gender')['Purchase Amount (USD)'].mean()
print(avg_purchase_by_gender)
- groupby('Gender')['Purchase Amount (USD)']: Groups the DataFrame by gender and selects the 'Purchase Amount (USD)' column.
- .mean(): Calculates the mean of the selected column for each group.
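Grouped results often read better ordered; a minimal follow-up that sorts the means highest-first, the analogue of SQL's ORDER BY avg_purchase_amount DESC:

# Sort the per-gender averages in descending order, like SQL's ORDER BY ... DESC
print(avg_purchase_by_gender.sort_values(ascending=False))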
- Goal: Grouping by Location and Counting Number of Customers
- SQL
SELECT Location, COUNT(DISTINCT CustomerID) AS num_customers
FROM shopping_behavior
GROUP BY Location;
- COUNT(DISTINCT CustomerID): Counts the number of distinct customers.
- GROUP BY Location: Groups the result set by location.
Pandas
customer_count_by_location = df.groupby('Location')['CustomerID'].nunique()
print(customer_count_by_location)
- groupby('Location')['CustomerID']: Groups the DataFrame by location and selects the 'CustomerID' column.
- .nunique(): Counts the number of unique values in the selected column for each group.
- Goal: Aggregating Multiple Statistics
- SQL
SELECT Gender,
    AVG(Age) AS avg_age,
    AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,
    COUNT(*) AS num_transactions
FROM shopping_behavior
GROUP BY Gender;
- AVG(Age): Calculates the average age.
- AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
- COUNT(*): Counts the total number of transactions.
- GROUP BY Gender: Groups the result set by gender.
Pandas
gender_statistics = df.groupby('Gender').agg({'Age': 'mean',
    'Purchase Amount (USD)': 'mean', 'CustomerID': 'count'})
gender_statistics.rename(columns={'Age': 'avg_age',
    'Purchase Amount (USD)': 'avg_purchase_amount',
    'CustomerID': 'num_transactions'}, inplace=True)
print(gender_statistics)
- groupby('Gender').agg(…): Groups the DataFrame by gender and specifies multiple aggregation functions for different columns.
- .rename(columns={…}): Renames the columns for clarity in the output.
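A more compact alternative, available since pandas 0.25, is named aggregation, where each keyword argument sets an output column name directly and the separate rename step disappears:

# Named aggregation: each keyword becomes an output column, so no rename() is needed
gender_statistics = df.groupby('Gender').agg(
    avg_age=('Age', 'mean'),
    avg_purchase_amount=('Purchase Amount (USD)', 'mean'),
    num_transactions=('CustomerID', 'count'),
)
print(gender_statistics)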
Joining tables
- Goal: Combine data from multiple tables.
- SQL
SELECT t1.CustomerID, t1.Age, t2.Category
FROM table1 t1
JOIN table2 t2 ON t1.CustomerID = t2.CustomerID;
- JOIN: Combines rows from two or more tables based on a related column between them.
Pandas
# Assuming you have another DataFrame df2 for the second table
merged_data = pd.merge(df, df2, on='CustomerID', how='inner')
print(merged_data[['CustomerID', 'Age', 'Category']])
- pd.merge(): Merges two DataFrames based on one or more common columns.
- how='inner': Specifies the type of join (inner in this case) to use.
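The how parameter also accepts 'left', 'right', and 'outer'. A minimal sketch of a left join, which keeps every row of df even when df2 (the same hypothetical second DataFrame as above) has no match:

# Left join: keep all rows from df; unmatched rows get NaN in df2's columns
left_merged = pd.merge(df, df2, on='CustomerID', how='left')
print(left_merged.head(10))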
You can utilize the attached CSV file to execute all the SQL and Pandas code snippets provided in this article, enabling you to observe first-hand how each function processes data.
Now, let's have a look at some of the differences between SQL and Pandas.
| Feature | SQL | Pandas |
| --- | --- | --- |
| Query Language | Domain-specific language for managing and querying databases | Python library for data manipulation and analysis |
| Declarative vs Imperative | Declarative: specify what to retrieve, not how | Imperative: specify step-by-step data manipulation |
| Execution Location | Database server | Local machine (in-memory processing) |
| Data Structures | Tables, rows, columns | DataFrames, Series |
| Syntax | SQL commands and functions | Python functions and methods |
| Performance | Optimized for querying large datasets efficiently | May struggle with performance for extremely large datasets |
| Scalability | Horizontal and vertical scaling for database servers | Limited scalability due to in-memory processing |
| Flexibility | Designed for relational databases | Handles various data formats and structures |
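To make the declarative-versus-imperative row concrete, here is a minimal sketch answering the same question both ways, with the SQL shown as a comment for comparison (column names as in the snippets above):

# Declarative SQL states what to retrieve:
#   SELECT Gender, AVG(`Purchase Amount (USD)`) AS avg_purchase_amount
#   FROM shopping_behavior
#   WHERE Age > 30
#   GROUP BY Gender;
#
# Imperative Pandas spells out each step: load, filter, group, aggregate.
import pandas as pd

df = pd.read_csv('shopping_behavior.csv')  # load
adults = df[df['Age'] > 30]  # filter
result = adults.groupby('Gender')['Purchase Amount (USD)'].mean()  # group + aggregate
print(result)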
Conclusion:
In this article, we've delved into SQL and Pandas as powerful tools for extracting valuable insights from data, covering data exploration, filtering, aggregation, grouping, and joining. Despite their differences, both SQL and Pandas are indispensable in a data analyst's toolkit. Whether you manipulate data directly in a database or perform in-memory processing on your local machine, these tools empower you to gain meaningful insights and make informed decisions. As we continue to navigate the ever-expanding landscape of data, mastering them will become increasingly valuable.
Stay tuned for blog 2 of this series, where we will explore data visualization with the help of the Matplotlib library and Power BI!
At August Infotech, specializing in white label website development, we can assist you in achieving and exceeding your data analysis project goals by providing extraordinary vision and insights with our team of offshore developers. Feel free to reach out to us at https://www.augustinfotech.com/contact/.