
"Data Cleanup Mastery: The Ultimate Transformation Guide"

Writer: Naseema Halidh

In simple terms, pgAdmin is a tool that makes it easier to clean up and manage data stored in PostgreSQL. It helps with tasks like filling in missing values, fixing inconsistencies in data types, and removing duplicate records. These features save a lot of time and effort, so you can feel confident that your data is accurate and organized. Basically, it's a powerful tool that handles tricky data issues without hassle.


Cleaning Datasets in pgAdmin: A Comprehensive Guide

Data cleaning is an essential step in the data analysis process. It involves identifying and correcting errors, inconsistencies, and inaccuracies in your dataset to ensure that the data is reliable and suitable for analysis. If you’re working with PostgreSQL databases, pgAdmin is a powerful tool that can help you manage and clean your datasets. This blog will guide you through the process of cleaning datasets using pgAdmin.


What is pgAdmin?

pgAdmin is a popular open-source tool for working with Postgres databases. It offers a GUI that's easy for both beginners and experienced data analysts to use, and it lets you clean and transform your data before analyzing it.

Below, I will walk you through the steps needed to clean your data using pgAdmin.

By using pgAdmin to clean data, we can have confidence that our data is accurate and organized. We are using pgAdmin as our tool for this cleaning process because it is an easy-to-use, open-source tool for working with Postgres databases. The cleaning process covers:

  • Handling missing values

  • Removing duplicates

  • Fixing inconsistent data

  • Converting data types


This guide uses the Superstore dataset, but it is not the only option: the same techniques apply to any table, so it's worth choosing a dataset that best aligns with your own needs and objectives.


Step 1: Connecting to Your Database in pgAdmin with the Superstore Dataset

Open pgAdmin: Launch the pgAdmin application on your computer.

Connect to Your Database:

  • In the left-hand pane, right-click on your PostgreSQL server and select Connect Server.

    If your server isn't listed, you can add it by clicking Add New Server, then enter the connection details such as host, username, and password.

Access the Superstore Database:

  • Once connected, expand your server and locate the database that contains the Superstore dataset.

  • Click on the database, then expand the schema and select the specific table you want to clean (for example, Orders or Products).

  • You can now view both the schema and the data directly in pgAdmin.


Step 2: Handling Missing Values

Missing values are one of the most common issues in datasets. pgAdmin provides several SQL techniques to handle missing data.
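
Before deciding how to handle them, it helps to see how much is actually missing. Here is a rough sketch (the postalcode column is just an example; substitute the columns in your own table):

-- Count NULLs per column in a single pass
SELECT
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE sales IS NULL) AS missing_sales,
    COUNT(*) FILTER (WHERE postalcode IS NULL) AS missing_postalcode
FROM superstore;

The counts tell you whether deleting rows or filling in defaults is the more sensible option.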


Option 1: Removing Rows with Missing Values

If a row contains missing values in critical columns, you may choose to delete that row. To do this:

  1. Open the Query Tool in pgAdmin by right-clicking on your database and selecting Query Tool.

  2. Use a SQL query to find rows with NULL values. For example, to find rows where the sales column is NULL:

SELECT * FROM superstore WHERE sales IS NULL;

To delete rows with missing values:

DELETE FROM superstore WHERE sales IS NULL;

Option 2: Replacing NULL Values with Defaults

In some cases, it’s better to replace missing values with default values (such as zero or the mean). For example, to replace NULL values in the sales column with 0:

  1. Use an UPDATE query:

UPDATE superstore SET sales = 0 WHERE sales IS NULL;

You can also replace NULL values with other computed values, such as the average or median of the column.
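
For example, filling the gaps with the column average instead of a fixed default is one possibility (a minimal sketch; AVG() ignores NULLs, so the average is computed from the rows that do have a value):

UPDATE superstore
SET sales = (SELECT AVG(sales) FROM superstore)
WHERE sales IS NULL;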


Option 3: Using COALESCE to Handle NULLs in Queries

If you need to clean up missing values while querying the data without actually modifying the database, you can use the COALESCE() function to substitute NULL values with a default value in the result set:


SELECT orderid, COALESCE(sales, 0) AS sales FROM superstore;

This query will replace any NULL sales value with 0 in the output, without changing the underlying data.


Step 3: Removing Duplicates

Duplicate records can skew your analysis and lead to inaccurate insights. You can remove duplicates using SQL queries in pgAdmin.

Option 1: Identifying Duplicates

First, find duplicate records. For example, to find duplicate records in the superstore table based on the orderid column:

SELECT orderid, COUNT(*) FROM superstore GROUP BY orderid HAVING COUNT(*) > 1;

This query will return all orderid values that appear more than once in the table.
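
To inspect the full rows behind those counts before deleting anything, one possible follow-up query is:

SELECT *
FROM superstore
WHERE orderid IN (
    SELECT orderid
    FROM superstore
    GROUP BY orderid
    HAVING COUNT(*) > 1
)
ORDER BY orderid;

Seeing the complete rows makes it easier to confirm whether the repeats are true duplicates or legitimate records (for example, one order with several line items).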


Option 2: Deleting Duplicates

Once duplicates are identified, you can delete them. A common technique uses a CTE with the ROW_NUMBER() window function, together with PostgreSQL's system column ctid, to mark which physical rows to remove.

  1. Assign a row number to each record within its orderid group, then delete every row beyond the first:

WITH duplicates AS (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY orderid ORDER BY orderid) AS rowid
    FROM superstore
)
DELETE FROM superstore
WHERE ctid IN (SELECT ctid FROM duplicates WHERE rowid > 1);

This query will keep only the first occurrence of each orderid and delete the subsequent duplicates.
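
To confirm the cleanup worked, re-run the duplicate check from Option 1; it should now return no rows:

SELECT orderid, COUNT(*) FROM superstore GROUP BY orderid HAVING COUNT(*) > 1;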


Step 4: Fixing Inconsistent Data

Inconsistent data can occur when different formats or values are used for the same attribute. This can cause issues when performing analysis or aggregations.


Option 1: Standardizing Text Data

Sometimes, text data may have inconsistent formatting, such as different capitalizations for the same value (e.g., USA vs usa). To standardize the text data:

  1. Use the UPPER() or LOWER() function to convert all text values to the same case:

UPDATE superstore SET country = UPPER(country);

This will convert all country names to uppercase.
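
Before running an update like this, it can be useful to preview the distinct values in the column so you know exactly which variants exist:

SELECT DISTINCT country FROM superstore ORDER BY country;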

You can also use the TRIM() function to remove leading or trailing spaces from text fields:

UPDATE superstore SET productname = TRIM(productname);

Option 2: Converting Categorical Data to Consistent Categories

Sometimes, categorical values may have inconsistencies (e.g., different spellings for the same category). To standardize categories, you can update them with a SQL query:

UPDATE superstore SET category = 'Office Supplies' WHERE category = 'Office supply';
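
If several variant spellings map to the same category, you can handle them in one statement. The variants below are hypothetical; substitute the ones you actually find in your data:

UPDATE superstore
SET category = 'Office Supplies'
WHERE category IN ('Office supply', 'office supplies', 'Office Supplie');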

Option 3: Data Type Conversion

If you notice that some fields have incorrect data types (e.g., numbers stored as text), you can use ALTER COLUMN to convert data types:

ALTER TABLE superstore ALTER COLUMN sales TYPE numeric USING sales::numeric;

This will convert the sales column from a text type to a numeric type.
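
The conversion will fail if any value cannot be cast to a number, so it can be worth spotting offending rows first. A rough check, assuming sales is currently stored as text:

-- Rows whose sales value is not a plain (optionally signed) decimal number
SELECT sales FROM superstore WHERE sales !~ '^-?[0-9]+(\.[0-9]+)?$';

Any rows returned here would need to be corrected or set to NULL before running the ALTER TABLE statement.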


Step 5: Converting Data Types

Sometimes, data may be stored in incorrect formats, and you may need to convert it into a more appropriate data type. For example, if the orderdate column is stored as text but you want it to be a proper date type, you can use the following query:

ALTER TABLE superstore ALTER COLUMN orderdate TYPE DATE USING orderdate::DATE;

This will convert the orderdate column into a DATE data type, assuming it is currently stored as a text column.
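
The plain ::DATE cast works when the strings are in a format PostgreSQL recognizes, such as YYYY-MM-DD. If your dates use another layout, you can spell the format out with TO_DATE; the MM/DD/YYYY pattern below is only an assumption about how your data might look:

ALTER TABLE superstore ALTER COLUMN orderdate TYPE DATE USING TO_DATE(orderdate, 'MM/DD/YYYY');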


Step 6: Normalizing Data

If your dataset is denormalized and contains redundant data, normalizing it into multiple related tables can make it more efficient. For example, you might want to break down the superstore table into separate tables for orders, customers, and products.

  1. Create the new tables:

CREATE TABLE orders (
    orderid SERIAL PRIMARY KEY,
    customerid INT,
    orderdate DATE
);

CREATE TABLE customers (
    customerid SERIAL PRIMARY KEY,
    customername VARCHAR(255)
);

  2. Insert data into the normalized tables:

INSERT INTO orders (orderid, customerid, orderdate) SELECT DISTINCT orderid, customerid, orderdate FROM superstore;
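
DISTINCT avoids primary key violations when an order spans several line items in superstore. The query above only populates orders; the customers table needs a similar insert. As a minimal sketch (assuming superstore has customerid and customername columns, with DISTINCT ON keeping one name per customer in case of inconsistent spellings):

INSERT INTO customers (customerid, customername)
SELECT DISTINCT ON (customerid) customerid, customername
FROM superstore
ORDER BY customerid;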

  3. Once you have verified the normalized tables contain everything you need, delete the redundant data from the original table:

DELETE FROM superstore;

Conclusion

With your new expertise, you can clean and manage data in pgAdmin with confidence. Doing so maintains data integrity, which helps ensure that your analyses are of higher quality. You can also use this newfound ability for other tasks, such as writing queries against a database, without needing a separate specialized SQL tool.
