How to check Duplicates based on Multiple Fields – SQL or SAS

You have big data sets and that contains duplicate but you need to check duplicates based on combination of two or three columns. How can you check that using SQL or SAS? So, you can use the following combination to check duplicates based on multiple fields or columns in SQL or SAS:

Query: There is a dataset called “Country_Polulation” with following fields: 1. Name 2. Age 3.CityName 4.DOB 5.Year

You need to check duplicate records – like if Name and DOB of two records is same that means that is a duplicate. Follow the following methods to check duplicates based on multiple fields:

Let’s say we are checking duplicates for year – 2018

Using Having Clause:

Having clause will check the count of name and DOB if exists more than once, that name will be exported from inner query to export query. If you need to learn basics of SAS, follow these articles: SQL Basics and Learn Basic Functions of SQL

Select A.name, A.DOB

from Country_Polulation as A

where name in

(

Select B.name, B.DOB

from Country_Polulation as B

where year=2018

Order by name

Group by Name, DOB

Having (count(name)>1 and count(DOB)>1)

)

order by A.Name

 

2. Check Duplicates in SAS:

You will have to use Proc step for this task if you want to retrieve all the duplicate records. Basically do a self join and check all the fields with each other for equality to check duplicates.

Proc SQL;

Create table Dupe_Records as

Select A.Name, A.DOB, B.*

from Country_Polulation as A

JOIN

( Select B.Name, B.DOB, Count(*) as Quantity

from Country_Polulation  as B

where year=2018

group by B.name, B.Dob

having count(*)>1

)

order by Name, DOB

This Dupe_Records table will contain all the duplicate records ordered by name and DOB in ascending order. The reason why we couldn’t mention Name and DOB condition in Having clause in SAS is – SAS deals differently with Dates. and DOB is a Datetime18. format here when I tried the same.

3. Check Duplicates in Excel:

If the dataset is of small size, you can follow the mentioned steps:

  • Using Home Tab:

Home Tab -> Click on Conditional Formatting -> Click on Highlight Cell Rules -> Duplicate Rows -> Check the colors to highlight the duplicate text

  • Using Data Tab:

Select the data range or the columns wherein you need to check duplicates -> Data Tab -> Click on Remove Duplicates in Data Tools

This will give you the option of selecting columns to consider dupes. Select the desired columns and the dupes will be removed.

Hope this post will help you 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *