Introduction
In today's world, where data is generated every millisecond, extracting valuable insights from it is a crucial skill. Data analysis is the process of gathering, cleaning, transforming, and modelling data to unearth useful information and support decision-making that satisfies business requirements.
Overview of SQL and Pandas
SQL (Structured Query Language) is a domain-specific language widely recognized for managing and manipulating relational databases. It standardizes the way data is retrieved and operated on across many database systems, e.g. MySQL, PostgreSQL, SQLite, and others. SQL is valued for its performance on huge datasets and for handling both complex data manipulation and basic tasks with equal simplicity.
Pandas, on the other hand, is a Python library designed specifically for data manipulation and analysis. It provides high-level data structures and functions that let you work with structured data quickly, easily, and productively. Pandas stands out among data analysts and data scientists because of its flexibility, powerful functionality, and ease of integration with other Python libraries like NumPy, Matplotlib and Scikit-learn.
Similarities and differences between SQL and Pandas
Let us now look at some of the operations we can perform in both SQL and Pandas.
The goal of this analysis is to demonstrate how to perform basic data exploration, filtering, aggregation, grouping, and joining operations on a dataset containing information about customers and their purchases.
Data Exploration:
- Goal: Understand the structure and contents of the dataset.
- SQL
SELECT *
FROM shopping_behavior
LIMIT 10;
- SELECT *: This selects all columns from the specified table.
- LIMIT 10: Limits the result to the first 10 rows returned by the query.
Pandas
import pandas as pd

# Read the dataset into a DataFrame
df = pd.read_csv('shopping_behavior.csv')

# Display the first few rows of the DataFrame
print(df.head(10))
- pd.read_csv('shopping_behavior.csv'): Reads the dataset into a DataFrame using Pandas.
- df.head(10): Returns the first 10 rows of the DataFrame.
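Beyond the first rows, two quick follow-ups round out the structural picture; a minimal sketch using standard Pandas calls on the df loaded above:

# Number of rows and columns as a (rows, columns) tuple
print(df.shape)

# Column names, dtypes, and non-null counts in a single view
df.info()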
- Goal: Check for Missing Values
- SQL
SELECT
    SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END) AS null_age_count,
    SUM(CASE WHEN Gender IS NULL THEN 1 ELSE 0 END) AS null_gender_count,
    SUM(CASE WHEN `Purchase Amount (USD)` IS NULL THEN 1 ELSE 0 END) AS null_purchase_amount_count
FROM shopping_behavior;
- SUM(CASE …): Counts the occurrences of NULL values in specific columns.
Pandas
# Check for missing values in the dataset
missing_values_count = df.isnull().sum()
print("Missing Values:\n", missing_values_count)
- isnull(): Returns a DataFrame of the same shape as df where each value is either True or False, indicating missing values.
- .sum(): Sums up the number of True values (missing values) for each column.
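If these counts reveal gaps, two common follow-ups are dropping the affected rows or filling them in; a minimal, dataset-dependent sketch (filling with the column mean is an illustrative choice, not something prescribed by the dataset):

# Drop every row that contains at least one missing value (returns a new DataFrame)
df_clean = df.dropna()

# Or fill missing purchase amounts with the column mean instead
df_filled = df.fillna({'Purchase Amount (USD)': df['Purchase Amount (USD)'].mean()})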
- Goal: Summary Statistics
- SQL
SELECT
    AVG(Age) AS avg_age,
    AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,
    COUNT(DISTINCT Category) AS num_categories,
    COUNT(DISTINCT `Payment Method`) AS num_payment_methods,
    COUNT(DISTINCT Location) AS num_locations
FROM shopping_behavior;
- AVG: Calculates the average value.
- COUNT(DISTINCT …): Counts the number of distinct values in specified columns.
Pandas
# Display summary statistics
summary_statistics = df.describe()
print("Summary Statistics:\n", summary_statistics)
- .describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the distribution of the DataFrame’s columns.
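Note that describe() covers numeric columns by default, so it does not reproduce the SQL query's distinct counts for the categorical columns. A minimal sketch of the closer equivalent, using the column names from the snippets above:

# Averages for the numeric columns, mirroring AVG(...) in the SQL query
print(df['Age'].mean())
print(df['Purchase Amount (USD)'].mean())

# Distinct-value counts for the categorical columns, mirroring COUNT(DISTINCT ...)
print(df[['Category', 'Payment Method', 'Location']].nunique())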
Data Filtering
- Goal: Filter by Age and Gender
- SQL:
SELECT *
FROM shopping_behavior
WHERE Age > 30
AND Gender = 'Male';
- WHERE: Filters the rows based on specified conditions.
Pandas
# Filter data based on age and gender
filtered_data = df[(df['Age'] > 30) & (df['Gender'] == 'Male')]
print(filtered_data)
- df['Age'] > 30: Creates a boolean mask based on the condition Age > 30.
- &: Combines multiple conditions.
- df[…]: Filters the DataFrame based on the boolean mask.
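Pandas also offers query(), whose string syntax reads closer to SQL's WHERE clause; a minimal sketch of the same filter:

# Equivalent filter using DataFrame.query(); the expression reads like a WHERE clause
filtered_data = df.query("Age > 30 and Gender == 'Male'")
print(filtered_data)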
- Goal: Filter by Category
- SQL:
SELECT *
FROM shopping_behavior
WHERE Category = 'Electronics';
- WHERE: Filters the rows where the category is ‘Electronics’.
Pandas:
# Filter data based on category
electronics_data = df[df['Category'] == 'Electronics']
print(electronics_data)
- df['Category'] == 'Electronics': Creates a boolean mask that is True where Category equals 'Electronics'.
- df[…]: Filters the DataFrame based on the boolean mask.
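To match several categories at once, isin() is the Pandas counterpart of SQL's WHERE Category IN (...); in this sketch, 'Clothing' is a hypothetical second category used only for illustration:

# Filter for multiple categories; 'Clothing' is a hypothetical example value,
# analogous to SQL's WHERE Category IN ('Electronics', 'Clothing')
multi_category_data = df[df['Category'].isin(['Electronics', 'Clothing'])]
print(multi_category_data)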
Grouping and Aggregation
- Goal: Grouping by Gender and Calculating Average Purchase Amount
- SQL
SELECT Gender, AVG(`Purchase Amount (USD)`) AS avg_purchase_amount
FROM shopping_behavior
GROUP BY Gender;
- AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
- GROUP BY Gender: Groups the result set by gender.
Pandas
avg_purchase_by_gender = df.groupby('Gender')['Purchase Amount (USD)'].mean()
print(avg_purchase_by_gender)
- groupby('Gender')['Purchase Amount (USD)']: Groups the DataFrame by gender and selects the 'Purchase Amount (USD)' column.
- .mean(): Calculates the mean of the selected column for each group.
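Grouped results often read better ordered; a minimal follow-up that sorts the means highest-first, the analogue of SQL's ORDER BY avg_purchase_amount DESC:

# Sort the per-gender averages in descending order, like SQL's ORDER BY ... DESC
print(avg_purchase_by_gender.sort_values(ascending=False))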
- Goal: Grouping by Location and Counting Number of Customers
- SQL
SELECT Location, COUNT(DISTINCT CustomerID) AS num_customers
FROM shopping_behavior
GROUP BY Location;
- COUNT(DISTINCT CustomerID): Counts the number of distinct customers.
- GROUP BY Location: Groups the result set by location.
Pandas
customer_count_by_location = df.groupby('Location')['CustomerID'].nunique()
print(customer_count_by_location)
- groupby('Location')['CustomerID']: Groups the DataFrame by location and selects the 'CustomerID' column.
- .nunique(): Counts the number of unique values in the selected column for each group.
- Goal: Aggregating Multiple Statistics
- SQL
SELECT Gender,
    AVG(Age) AS avg_age,
    AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,
    COUNT(*) AS num_transactions
FROM shopping_behavior
GROUP BY Gender;
- AVG(Age): Calculates the average age.
- AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
- COUNT(*): Counts the total number of transactions.
- GROUP BY Gender: Groups the result set by gender.
Pandas
gender_statistics = df.groupby('Gender').agg({'Age': 'mean',
    'Purchase Amount (USD)': 'mean', 'CustomerID': 'count'})
gender_statistics.rename(columns={'Age': 'avg_age',
    'Purchase Amount (USD)': 'avg_purchase_amount',
    'CustomerID': 'num_transactions'}, inplace=True)
print(gender_statistics)
- groupby('Gender').agg(…): Groups the DataFrame by gender and specifies multiple aggregation functions for different columns.
- .rename(columns={…}): Renames the columns for clarity in the output.
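A more compact alternative, available since pandas 0.25, is named aggregation, where each keyword argument sets an output column name directly and the separate rename step disappears:

# Named aggregation: each keyword becomes an output column, so no rename() is needed
gender_statistics = df.groupby('Gender').agg(
    avg_age=('Age', 'mean'),
    avg_purchase_amount=('Purchase Amount (USD)', 'mean'),
    num_transactions=('CustomerID', 'count'),
)
print(gender_statistics)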
Joining tables
- Goal: Combine data from multiple tables.
- SQL
SELECT t1.CustomerID, t1.Age, t2.Category
FROM table1 t1
JOIN table2 t2 ON t1.CustomerID = t2.CustomerID;
- JOIN: Combines rows from two or more tables based on a related column between them.
Pandas
# Assuming you have another DataFrame df2 for the second table
merged_data = pd.merge(df, df2, on='CustomerID', how='inner')
print(merged_data[['CustomerID', 'Age', 'Category']])
- pd.merge(): Merges two DataFrames based on one or more common columns.
- how='inner': Specifies the type of join (inner in this case) to use.
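The how parameter also accepts 'left', 'right', and 'outer'. A minimal sketch of a left join, which keeps every row of df even when df2 (the same hypothetical second DataFrame as above) has no match:

# Left join: keep all rows from df; unmatched rows get NaN in df2's columns
left_merged = pd.merge(df, df2, on='CustomerID', how='left')
print(left_merged.head(10))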
You can utilize the attached CSV file to execute all the SQL and Pandas code snippets provided in this article, enabling you to observe first-hand how each function processes data.
Now, let's have a look at some of the differences between SQL and Pandas.
| Feature | SQL | Pandas |
| --- | --- | --- |
| Query Language | Domain-specific language for managing and querying databases | Python library for data manipulation and analysis |
| Declarative vs Imperative | Declarative: specify what to retrieve, not how | Imperative: specify step-by-step data manipulation |
| Execution Location | Database server | Local machine (in-memory processing) |
| Data Structures | Tables, rows, columns | DataFrames, Series |
| Syntax | SQL commands and functions | Python functions and methods |
| Performance | Optimized for querying large datasets efficiently | May struggle with performance for extremely large datasets |
| Scalability | Horizontal and vertical scaling for database servers | Limited scalability due to in-memory processing |
| Flexibility | Designed for relational databases | Handles various data formats and structures |
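To make the declarative-versus-imperative row concrete, here is a minimal sketch answering the same question both ways, with the SQL shown as a comment for comparison (column names as in the snippets above):

# Declarative SQL states what to retrieve:
#   SELECT Gender, AVG(`Purchase Amount (USD)`) AS avg_purchase_amount
#   FROM shopping_behavior
#   WHERE Age > 30
#   GROUP BY Gender;
#
# Imperative Pandas spells out each step: load, filter, group, aggregate.
import pandas as pd

df = pd.read_csv('shopping_behavior.csv')  # load
adults = df[df['Age'] > 30]  # filter
result = adults.groupby('Gender')['Purchase Amount (USD)'].mean()  # group + aggregate
print(result)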
Conclusion:
In this article, we've delved into SQL and Pandas as powerful tools for extracting valuable insights from data, covering data exploration, filtering, aggregation, grouping, and joining. Despite their differences, both SQL and Pandas are indispensable in a data analyst's toolkit. Whether you manipulate data directly in a database or perform in-memory processing on your local machine, these tools empower you to gain meaningful insights and make informed decisions. As we continue to navigate the ever-expanding landscape of data, mastering them will become increasingly valuable.
Stay tuned for blog 2 of this series, where we will explore data visualization with the help of the Matplotlib library and Power BI!
At August Infotech, specializing in white label website development, we can assist you in achieving and exceeding your data analysis project goals by providing extraordinary vision and insights with our team of offshore developers. Feel free to reach out to us at https://www.augustinfotech.com/contact/.