Exploring data insights with SQL and pandas

Quick summary

Data analysis is the process of gathering, cleaning, transforming, and modeling data to extract useful information and support decision-making that satisfies business requirements. This article walks through how common data analysis tasks can be performed with both SQL and Pandas.

Introduction

In today’s world, where a lot of data is generated every millisecond, extracting valuable insights from it becomes a crucial skill. Data analysis is the process of gathering, cleaning, transforming, and modeling data to extract useful information and support decision-making that satisfies business requirements.

Overview of SQL and Pandas

SQL (Structured Query Language) is a domain-specific language widely recognized for managing and manipulating relational databases. It standardizes the way operations such as retrieving data are performed across many database systems, e.g. MySQL, PostgreSQL, SQLite, and others. SQL is known for its performance on huge datasets and for keeping both complex data manipulation and basic tasks simple.

Pandas, on the other hand, is a Python library designed specifically for data manipulation and analysis. It provides high-level data structures and functions that let you work with structured data quickly, easily, and productively. Pandas stands out among data analysts and data scientists because of its flexibility, powerful functionality, and ease of integration with other Python libraries such as NumPy, Matplotlib, and Scikit-learn.

Similarities and differences between SQL and Pandas

Let us now look at some of the operations that we can perform in both SQL and Pandas.

The goal of this analysis is to demonstrate how to perform basic data exploration, filtering, aggregation, grouping, and joining operations on a dataset containing information about customers and their purchases.

Data Exploration

  • Goal: Understand the structure and contents of the dataset (sketch below).
  • SQL
  • SELECT *: This selects all columns from the specified table.
  • LIMIT 10: Limits the result to the first 10 rows returned by the query.

Pandas

  • pd.read_csv('shopping_behavior.csv'): Reads the dataset into a DataFrame using Pandas.
  • df.head(10): Returns the first 10 rows of the DataFrame.
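
Putting the two side by side, here is a minimal sketch that reads the attached CSV with Pandas and, purely for illustration, copies it into an in-memory SQLite database so the SQL version can be run as well (the table name shopping is our own choice, not something defined by the dataset):

```python
import sqlite3

import pandas as pd

# Pandas: read the CSV into a DataFrame and look at the first 10 rows.
df = pd.read_csv("shopping_behavior.csv")
print(df.head(10))

# SQL: load the same data into an in-memory SQLite database and run the
# equivalent query against it.
conn = sqlite3.connect(":memory:")
df.to_sql("shopping", conn, index=False)
print(pd.read_sql_query("SELECT * FROM shopping LIMIT 10;", conn))
```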

  • Goal: Check for Missing Values (sketch below)
  • SQL
  • SUM(CASE …): Counts the occurrences of NULL values in specific columns.

Pandas

  • isnull(): Returns a DataFrame of the same shape as df where each value is either True or False, indicating missing values.
  • .sum(): Sums up the number of True values (missing values) for each column.
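
A minimal sketch of the missing-value check; the SQL counterpart is shown as a comment and assumes the same shopping table as above, with Age and Gender used only as example columns:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: isnull() marks each missing cell as True, and sum() totals them per column.
print(df.isnull().sum())

# Equivalent SQL idea (one expression per column of interest):
# SELECT
#     SUM(CASE WHEN Age IS NULL THEN 1 ELSE 0 END)    AS missing_age,
#     SUM(CASE WHEN Gender IS NULL THEN 1 ELSE 0 END) AS missing_gender
# FROM shopping;
```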

  • Goal: Summary Statistics (sketch below)
  • SQL
  • AVG: Calculates the average value.
  • COUNT(DISTINCT …): Counts the number of distinct values in specified columns.

Pandas

  • .describe(): Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the distribution of the DataFrame’s columns.
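
A sketch of the summary statistics, reusing the column names mentioned in this article and the assumed shopping table:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: count, mean, std, min, quartiles and max for every numeric column.
print(df.describe())

# A roughly equivalent SQL query for a few of the same figures:
# SELECT
#     AVG(Age)                     AS avg_age,
#     AVG("Purchase Amount (USD)") AS avg_purchase_amount,
#     COUNT(DISTINCT Location)     AS distinct_locations
# FROM shopping;
```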

Data Filtering

  • Goal: Filter by Age and Gender (sketch below)
  • SQL:
  • WHERE: Filters the rows based on specified conditions.

Pandas

  • df['Age'] > 30: Creates a boolean mask based on the condition 'Age > 30'.
  • &: Combines multiple conditions.
  • df[…]: Filters the DataFrame based on the boolean mask.
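
A sketch of the combined age-and-gender filter; 'Male' is only an example value assumed for the Gender column:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: each comparison builds a boolean mask; & keeps rows where both hold.
over_30_male = df[(df["Age"] > 30) & (df["Gender"] == "Male")]
print(over_30_male.head())

# Equivalent SQL:
# SELECT * FROM shopping WHERE Age > 30 AND Gender = 'Male';
```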

  • Goal: Filter by Category (sketch below)
  • SQL:
  • WHERE: Filters the rows where the category is ‘Electronics’.

Pandas:

  • df['Category'] == 'Electronics': Creates a boolean mask based on the condition 'Category == Electronics'.
  • df[…]: Filters the DataFrame based on the boolean mask.
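
A sketch of the category filter, under the same assumptions:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: keep only the rows whose Category is Electronics.
electronics = df[df["Category"] == "Electronics"]
print(electronics.head())

# Equivalent SQL:
# SELECT * FROM shopping WHERE Category = 'Electronics';
```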

Grouping and Aggregation

  • Goal: Grouping by Gender and Calculating Average Purchase Amount (sketch below)
  • SQL
  • AVG("Purchase Amount (USD)"): Calculates the average purchase amount.
  • GROUP BY Gender: Groups the result set by gender.

Pandas

  • groupby('Gender')['Purchase Amount (USD)']: Groups the DataFrame by gender and selects the 'Purchase Amount (USD)' column.
  • .mean(): Calculates the mean of the selected column for each group.
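
A sketch of the group-by-gender average:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: group rows by gender, then average the purchase amount within each group.
avg_purchase_by_gender = df.groupby("Gender")["Purchase Amount (USD)"].mean()
print(avg_purchase_by_gender)

# Equivalent SQL:
# SELECT Gender, AVG("Purchase Amount (USD)") AS avg_purchase_amount
# FROM shopping
# GROUP BY Gender;
```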

  • Goal: Grouping by Location and Counting Number of Customers (sketch below)
  • SQL
  • COUNT(DISTINCT CustomerID): Counts the number of distinct customers.
  • GROUP BY Location: Groups the result set by location.

Pandas

  • groupby('Location')['CustomerID']: Groups the DataFrame by location and selects the 'CustomerID' column.
  • .nunique(): Counts the number of unique values in the selected column for each group.
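
A sketch of counting distinct customers per location:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: group rows by location and count distinct customer IDs in each group.
customers_per_location = df.groupby("Location")["CustomerID"].nunique()
print(customers_per_location)

# Equivalent SQL:
# SELECT Location, COUNT(DISTINCT CustomerID) AS num_customers
# FROM shopping
# GROUP BY Location;
```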

  • Goal: Aggregating Multiple Statistics (sketch below)
  • SQL
  • AVG(Age): Calculates the average age.
  • AVG(`Purchase Amount (USD)`): Calculates the average purchase amount.
  • COUNT(*): Counts the total number of transactions.
  • GROUP BY Gender: Groups the result set by gender.

Pandas

  • groupby('Gender').agg(…): Groups the DataFrame by gender and specifies multiple aggregation functions for different columns.
  • .rename(columns={…}): Renames the columns for clarity in the output.
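
A sketch of computing several aggregates in one pass; the output column names (avg_age, avg_purchase_amount, num_transactions) are our own choice:

```python
import pandas as pd

df = pd.read_csv("shopping_behavior.csv")

# Pandas: aggregate several columns at once, then rename for readability.
# Counting CustomerID rows stands in for SQL's COUNT(*) here.
summary = (
    df.groupby("Gender")
      .agg({"Age": "mean", "Purchase Amount (USD)": "mean", "CustomerID": "count"})
      .rename(columns={"Age": "avg_age",
                       "Purchase Amount (USD)": "avg_purchase_amount",
                       "CustomerID": "num_transactions"})
)
print(summary)

# Equivalent SQL:
# SELECT Gender,
#        AVG(Age)                     AS avg_age,
#        AVG(`Purchase Amount (USD)`) AS avg_purchase_amount,
#        COUNT(*)                     AS num_transactions
# FROM shopping
# GROUP BY Gender;
```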

Joining tables

  • Goal: Combine data from multiple tables (sketch below).
  • SQL
  • JOIN: Combines rows from two or more tables based on a related column between them.

Pandas

  • pd.merge(): Merges two DataFrames based on one or more common columns.
  • how='inner': Specifies the type of join (inner in this case) to use.
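
A sketch of an inner join. The attached CSV is a single table, so the customers DataFrame below is a small hypothetical lookup table keyed by CustomerID, invented purely to illustrate the mechanics:

```python
import pandas as pd

purchases = pd.read_csv("shopping_behavior.csv")

# Hypothetical second table keyed by CustomerID (for illustration only).
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
})

# Pandas: an inner merge keeps only CustomerID values present in both DataFrames.
joined = pd.merge(purchases, customers, on="CustomerID", how="inner")
print(joined.head())

# Equivalent SQL, assuming two tables named shopping and customers:
# SELECT s.*, c.Email
# FROM shopping AS s
# JOIN customers AS c ON s.CustomerID = c.CustomerID;
```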

You can utilize the attached CSV file to execute all the SQL and Pandas code snippets provided in this article, enabling you to observe first-hand how each function processes data.

shopping_behavior.csv

Now, let’s have a look at some of the differences between SQL and Pandas:

| Feature | SQL | Pandas |
| --- | --- | --- |
| Query Language | Specific language for managing and querying databases | Python library for data manipulation and analysis |
| Declarative vs Imperative | Declarative: specify what to retrieve, not how | Imperative: specify step-by-step data manipulation |
| Execution Location | Database server | Local machine (in-memory processing) |
| Data Structures | Tables, rows, columns | DataFrames, Series |
| Syntax | SQL commands and functions | Python functions and methods |
| Performance | Optimized for querying large datasets efficiently | May struggle with performance for extremely large datasets |
| Scalability | Horizontal and vertical scaling for database servers | Limited scalability due to in-memory processing |
| Flexibility | Designed for relational databases | Handles various data formats and structures |

Conclusion

In this article, we’ve delved into SQL and Pandas as powerful tools for extracting valuable insights from data, covering everything from data exploration to filtering, aggregation, grouping, and joining. Despite their differences, both SQL and Pandas are indispensable tools in a data analyst’s toolkit. Whether you manipulate data directly in a database or perform in-memory processing on your local machine, these tools empower you to gain meaningful insights and make informed decisions. As we continue to navigate the ever-expanding landscape of data, mastering these tools will become increasingly valuable.

Stay tuned for blog 2 of this series, where we will explore data visualization with the help of the Matplotlib library and Power BI!

At August Infotech, specializing in white label website development, we can assist you in achieving and exceeding your data analysis project goals by providing extraordinary vision and insights with our team of offshore developers. Feel free to reach out to us at https://www.augustinfotech.com/contact/.

Author: Vinita Raghani | Date: March 21, 2024
