Introduction
In today’s world, where a lot of data is generated every millisecond, extracting valuable insights from it becomes a very crucial skill. Data analysis is the process of gathering, cleaning, transforming and modelling data to uproot useful information and support decision-making which satisfies business requirements.
Overview of SQL and Pandas
SQL (Structured Query Language) is considered as a domain-specific language with enhanced expressive power that widely recognized for its management and manipulation of relational databases. It introduces a way to standardize the method of performing the operations so as to retrieve data from many databases e.g. MySQL, PostgreSQL, Sqlite and a list of others. With the performance of huge datasets being SQL’s trademark and the possibility of complex data manipulation and basic tasks being its simplicity, SQL is generally quite recognizable.
Pandas, though, is a Python module designed especially for data manipulation and analysis. It provides a set of higher-level data structures and functions which are designed in such a way that one can work with structured data quickly, easily, and in a more productive way. Pandas stands out among people who analyse data or do science because of its flexibility, powerful functionality, and ease of integration with other Python libraries like NumPy, Matplotlib and Scikit-learn.
— Similarities and differences between SQL and Pandas —
Let us now understand some of the function which can we can perform in both SQL and Pandas.
The goal of this analysis is to demonstrate how to perform basic data exploration, filtering, aggregation, grouping, and joining operations on a dataset containing information about customers and their purchases.
Data Exploration:
- Goal: Understand the structure and contents of the dataset.
- SQL
1 SELECT *2 FROM shopping_behavior3 LIMIT 10; - SELECT *: This selects all columns from the specified table.
- LIMIT 10: Limits the result to the first 10 rows returned by the query.
Pandas
1 import pandas as pd23 # Read the dataset into a DataFrame4 df = pd.read_csv('shopping_behavior.csv')56 # Display the first few rows of the DataFrame7 print(df.head(10))- pd.read_csv(‘shopping_behavior.csv’): Reads the dataset into a data frame using Pandas.
- df.head(10): Returns the first 10 rows of the DataFrame.
- Goal: Check for Missing Values:
- SQL
1 SELECT2 SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END) AS null_age_count,3 SUM(CASE WHEN Gender IS NULL THEN 1 ELSE 0 END) AS null_gender_count,4 SUM(CASE WHEN `Purchase Amount (USD)` IS NULL THEN 1 ELSE 0 END) AS null_purchase_amount_count5 FROM shopping_behavior;- ‘SUM(CASE …)’: Counts the occurrences of NULL values in specific columns.
Pandas
1 # Check for missing values in the dataset2 missing_values_count = df.isnull().sum()3 print("Missing Values:\n", missing_values_count)- isnull(): Returns a DataFrame of the same shape as df where each value is either True or False, indicating missing values.
- .sum(): Sums up the number of True values (missing values) for each column.
- Goal: Summary Statistics
- SQL
1 SELECT2 AVG(Age) AS avg_age,3 AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,4 COUNT(DISTINCT Category) AS num_categories,5 COUNT(DISTINCT Payment Method) AS num_payment_methods,6 COUNT(DISTINCT Location) AS num_locations7 FROM shopping_behavior;- AVG: Calculates the average value.
- COUNT(DISTINCT …): Counts the number of distinct values in specified columns.
Pandas
1 # Display summary statistics2 summary_statistics = df.describe()3 print("Summary Statistics:\n", summary_statistics)- .describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the distribution of the DataFrame’s columns.
Data Filtering
- Goal: Filter by Age and Gender
- SQL:
1 SELECT *2 FROM shopping_behavior3 WHERE Age > 304 AND Gender = 'Male';- WHERE: Filters the rows based on specified conditions.
Pandas
1 # Filter data based on age and gender2 filtered_data = df[(df['Age'] > 30) & (df['Gender'] == 'Male')]3 print(filtered_data)- df[‘Age’] > 30: Creates a boolean mask based on the condition ‘Age > 30’.
- &: Combines multiple conditions.
- df[…]: Filters the DataFrame based on the boolean mask.
- Goal:Filter by Category
- SQL:
1 SELECT *2 FROM shopping_behavior3 WHERE Category = 'Electronics';- WHERE: Filters the rows where the category is ‘Electronics’.
Pandas:
1 # Filter data based on category2 electronics_data = df[df['Category'] == 'Electronics']3 print(electronics_data)- df[‘Category’] == ‘Electronics’: Creates a boolean mask based on the condition ‘Category == Electronics’.
- df[…]: Filters the DataFrame based on the boolean mask.
Grouping and Aggregation
- Goal: Grouping by Gender and Calculating Average Purchase Amount
- SQL
1 SELECT Gender, AVG(`Purchase Amount (USD)`) AS avg_purchase_amount2 FROM shopping_behavior3 GROUP BY Gender;- AVG(“Purchase Amount (USD)“): Calculates the average purchase amount.
- GROUP BY Gender: Groups the result set by gender.
Pandas
1 avg_purchase_by_gender = df.groupby('Gender')['Purchase Amount (USD)'].mean()2 print(avg_purchase_by_gender)- groupby(‘Gender’)[‘Purchase Amount (USD)’]: Groups the DataFrame by gender and selects the ‘Purchase Amount (USD)’ column.
- .mean(): Calculates the mean of the selected column for each group.
- Goal:Grouping by Location and Counting Number of Customers
- SQL
1 SELECT Location, COUNT(DISTINCT CustomerID) AS num_customers2 FROM shopping_behavior3 GROUP BY Location;- COUNT(DISTINCT CustomerID): Counts the number of distinct customers.
- GROUP BY Location: Groups the result set by location.
Pandas
1 customer_count_by_location = df.groupby('Location')['CustomerID'].nunique()2 print(customer_count_by_location)- groupby(‘Location’)[‘CustomerID’]: Groups the DataFrame by location and selects the ‘CustomerID’ column.
- .nunique(): Counts the number of unique values in the selected column for each group.
- Goal:Aggregating Multiple Statistics
- SQL
1 SELECT Gender,2 AVG(Age) AS avg_age,3 AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,4 COUNT(*) AS num_transactions5 FROM shopping_behavior6 GROUP BY Gender;- AVG(Age): Calculates the average age.
- AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
- COUNT(*): Counts the total number of transactions.
- GROUP BY Gender: Groups the result set by gender.
Pandas
1 gender_statistics = df.groupby('Gender').agg({'Age': 'mean',2 'Purchase Amount (USD)': 'mean', 'CustomerID': 'count'})3 gender_statistics.rename(columns={'Age': 'avg_age', 'Purchase4 Amount (USD)': 'avg_purchase_amount', 'CustomerID':5 'num_transactions'}, inplace=True)6 print(gender_statistics)- groupby(‘Gender’).agg(…): Groups the DataFrame by gender and specifies multiple aggregation functions for different columns.
- .rename(columns={…}): Renames the columns for clarity in the output.
Joining tables
- Goal:Combine data from multiple tables.
- SQL
1 SELECT t1.CustomerID, t1.Age, t2.Category2 FROM table1 t13 JOIN table2 t2 ON t1.CustomerID = t2.CustomerID;- JOIN: Combines rows from two or more tables based on a related column between them.
Pandas
1 # Assuming you have another DataFrame df2 for the second table2 merged_data = pd.merge(df, df2, on='CustomerID', how='inner')3 print(merged_data[['CustomerID', 'Age', 'Category']])- pd.merge(): Merges two DataFrames based on one or more common columns.
- how=’inner’: Specifies the type of join (inner in this case) to use.
You can utilize the attached CSV file to execute all the SQL and Pandas code snippets provided in this article, enabling you to observe first-hand how each function processes data.
Now, let’s have a look at some of the differences between SQL and Python
| Feature | SQL | Pandas |
| Query Language | Specific language for managing and querying databases | Python library for data manipulation and analysis |
| Declarative vs Imperative | Declarative – specify what to retrieve, not how | Imperative – specify step-by-step data manipulation |
| Execution Location | Database server | Local machine (in-memory processing) |
| Data Structures | Tables, rows, columns | DataFrames, Series |
| Syntax | SQL commands and functions | Python functions and methods |
| Performance | Optimized for querying large datasets efficiently | May struggle with performance for extremely large datasets |
| Scalability | Horizontal and vertical scaling for database servers | Limited scalability due to in-memory processing |
| Flexibility | Designed for relational databases | Handles various data formats and structures |
Conclusion:
In this article, we’ve delved into the powerful tools of SQL and Pandas for extracting valuable insights from data. From data exploration to filtering, aggregation, and grouping. Despite the differences, both SQL and Panda are indispensable tools in a data analyst’s toolkit. Whether you manipulate data directly in a database or perform in-memory processing on your local machine, these tools empower you to gain meaningful insights and make informed decisions. As we continue to navigate the ever-expanding landscape of data, mastering these tools will becomes increasingly valuable.
Be tuned for the blog 2 of this series where we will explore data visualization with the help of matplotlib library and Power BI!
At August Infotech, specializing in white label website development, we can assist you in achieving and exceeding your data analysis project goals by providing extraordinary vision and insights with our team of offshore developers. Feel free to reach out to us at https://www.augustinfotech.com/contact/.