How to check Duplicates based on Multiple Fields – SQL or SAS

You have a large dataset that contains duplicates, and you need to check for them based on a combination of two or three columns. How can you do that in SQL or SAS? The methods below check duplicates based on multiple fields or columns.

Query: There is a dataset called “Country_Polulation” with the following fields: 1. Name 2. Age 3. CityName 4. DOB 5. Year

You need to find duplicate records: if two records have the same Name and DOB, they are duplicates. Use the following methods to check duplicates based on multiple fields.

Let’s say we are checking duplicates for the year 2018.

1. Check Duplicates in SQL using Having Clause:

The Having clause checks how many times each Name and DOB combination occurs; combinations that occur more than once are passed from the inner query to the outer query. If you need to learn the basics of SQL, follow these articles: SQL Basics and Learn Basic Functions of SQL

Select A.*
from Country_Polulation as A
inner join
(
Select B.Name, B.DOB
from Country_Polulation as B
where B.Year=2018
Group by B.Name, B.DOB
Having count(*)>1
) as D
on A.Name=D.Name and A.DOB=D.DOB
where A.Year=2018
order by A.Name, A.DOB
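If you want to try the Group by / Having check without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration, and SQLite stands in for whichever SQL engine you actually use, so treat it as a way to see the idea rather than as the exact setup above.

```python
import sqlite3

# Made-up sample rows: (Name, Age, CityName, DOB, Year)
rows = [
    ("Asha", 30, "Pune", "1988-01-05", 2018),
    ("Asha", 30, "Mumbai", "1988-01-05", 2018),  # same Name + DOB: duplicate
    ("Asha", 25, "Delhi", "1993-07-19", 2018),   # same Name, different DOB
    ("Ravi", 41, "Delhi", "1977-03-12", 2018),
    ("Ravi", 40, "Delhi", "1977-03-12", 2017),   # other year, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country_Polulation "
    "(Name TEXT, Age INTEGER, CityName TEXT, DOB TEXT, Year INTEGER)"
)
conn.executemany("INSERT INTO Country_Polulation VALUES (?, ?, ?, ?, ?)", rows)

# One row per Name + DOB combination that occurs more than once in 2018
dupe_keys = conn.execute(
    """
    SELECT Name, DOB, COUNT(*) AS Quantity
    FROM Country_Polulation
    WHERE Year = 2018
    GROUP BY Name, DOB
    HAVING COUNT(*) > 1
    ORDER BY Name, DOB
    """
).fetchall()

print(dupe_keys)  # [('Asha', '1988-01-05', 2)]
```

Grouping by both columns and counting the rows in each group is what makes this a check on the combination, not on Name or DOB alone.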

 

2. Check Duplicates in SAS:

Use a Proc SQL step for this task if you want to retrieve all the duplicate records. Basically, do a self join and compare the key fields (Name and DOB) for equality to pick up the duplicates.

Proc SQL;
Create table Dupe_Records as
Select A.*, B.Quantity
from Country_Polulation as A
inner join
( Select Name, DOB, Count(*) as Quantity
from Country_Polulation
where Year=2018
group by Name, DOB
having count(*)>1
) as B
on A.Name=B.Name and A.DOB=B.DOB
where A.Year=2018
order by A.Name, A.DOB;
Quit;

This Dupe_Records table will contain all the duplicate records, ordered by Name and DOB in ascending order. Note that Count(*) is used in the Having clause instead of separate counts on Name and DOB. When I tried counting the two fields separately, it did not work as expected in SAS; DOB was in Datetime18. format here, and SAS handles dates differently.
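The self join described above can also be tried outside SAS. The sketch below runs the same kind of query through Python's built-in sqlite3 module; the sample rows are invented, and SQLite is only a stand-in for Proc SQL, so small syntax differences are to be expected.

```python
import sqlite3

# Made-up sample rows: (Name, Age, CityName, DOB, Year)
rows = [
    ("Asha", 30, "Pune", "1988-01-05", 2018),
    ("Asha", 30, "Mumbai", "1988-01-05", 2018),  # duplicate of the row above
    ("Ravi", 41, "Delhi", "1977-03-12", 2018),   # unique within 2018
    ("Ravi", 40, "Delhi", "1977-03-12", 2017),   # other year, filtered out
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country_Polulation "
    "(Name TEXT, Age INTEGER, CityName TEXT, DOB TEXT, Year INTEGER)"
)
conn.executemany("INSERT INTO Country_Polulation VALUES (?, ?, ?, ?, ?)", rows)

# Join every 2018 record to the Name + DOB combinations that repeat in 2018
dupe_records = conn.execute(
    """
    SELECT A.Name, A.DOB, A.CityName, B.Quantity
    FROM Country_Polulation AS A
    INNER JOIN (
        SELECT Name, DOB, COUNT(*) AS Quantity
        FROM Country_Polulation
        WHERE Year = 2018
        GROUP BY Name, DOB
        HAVING COUNT(*) > 1
    ) AS B
      ON A.Name = B.Name AND A.DOB = B.DOB
    WHERE A.Year = 2018
    ORDER BY A.Name, A.DOB, A.CityName
    """
).fetchall()

for record in dupe_records:
    print(record)
```

Unlike a plain Group by, the join returns every full record that belongs to a repeated combination, so you can see which rows actually differ in the other fields.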

3. Check Duplicates in Excel:

If the dataset is of small size, you can follow the mentioned steps:

  • Using Home Tab:

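Home Tab -> Click on Conditional Formatting -> Click on Highlight Cells Rules -> Duplicate Values -> Choose the color to highlight the duplicate values. Note that this rule compares individual cell values, so to check a combination of columns you first need to join them in a helper column.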

  • Using Data Tab:

Select the data range or the columns in which you need to check duplicates -> Data Tab -> Click on Remove Duplicates in Data Tools

This gives you the option of selecting the columns to consider when identifying dupes. Select the desired columns and the dupes will be removed.
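To make the logic behind the two Excel options concrete, here is a small Python sketch. It is not how Excel is implemented; it assumes that highlighting means flagging every row whose key combination repeats, and that Remove Duplicates keeps the first row of each combination. The records and the key_columns choice are made up.

```python
from collections import Counter

# Made-up records; key_columns plays the role of the columns ticked in Excel
records = [
    {"Name": "Asha", "DOB": "1988-01-05", "CityName": "Pune"},
    {"Name": "Asha", "DOB": "1988-01-05", "CityName": "Mumbai"},
    {"Name": "Ravi", "DOB": "1977-03-12", "CityName": "Delhi"},
]
key_columns = ("Name", "DOB")


def key_of(record):
    """Build the combined key for one record from the selected columns."""
    return tuple(record[col] for col in key_columns)


# "Highlight" step: flag every record whose key occurs more than once
counts = Counter(key_of(r) for r in records)
flagged = [r for r in records if counts[key_of(r)] > 1]

# "Remove Duplicates" step: keep only the first record for each key
seen = set()
deduped = []
for r in records:
    k = key_of(r)
    if k not in seen:
        seen.add(k)
        deduped.append(r)

print([r["CityName"] for r in flagged])  # ['Pune', 'Mumbai']
print([r["CityName"] for r in deduped])  # ['Pune', 'Delhi']
```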

Hope this post will help you 🙂

SAS Basics : Learn Creating SAS Dataset

Hi all, I know I am a bit late, actually very late, in posting this blog post. Today we will be learning SAS, and I am covering some of the basics.

To learn it, there are a few things we will cover first: what SAS is and why it is used.

SAS is a statistical and business analytics tool that helps you derive insights from data and present the output to clients. Yeah, too much in one statement…?? 😉 No problem, we’ll understand all of these things here.

Firstly, let’s check what SAS is made of and what its primary components are:

Primary components of SAS:

1. SAS Library – This is just a folder in which you keep all your SAS datasets or the files you would like SAS to read.

2. SAS Editor Section – Here you write your SAS code. SAS coding is quite SQL-like. If you want to learn SQL as well, check my article: Learn Basics of SQL

3. SAS Log Section – This component gives you all the information about the changes your code has made and about your imported SAS datasets: for example, the count of observations (rows) and variables (columns), and the data type and length of the fields. It is very important for catching data type issues while merging data or transferring it from SAS to Teradata.

4. SAS Debug Section – This section lets the SAS developer debug an issue. Various options are available that let you investigate the issue by filtering and sorting records.

How to Create SAS Library :

– This is the very first step that you will do in SAS Enterprise Guide or Base SAS. There is a default SAS library called “WORK”, which keeps all the temporary data. So, there are two types of libraries:

1. WORK
2. User Defined Library

Work – Keeps the temporary data of that particular SAS session only. Once the session ends, all the data is erased. So, better save it in a user-defined library before losing it…LOL

User Defined Library – You can create a library using the following syntax (the library name can be up to 8 characters long):

Libname [Your Library Name] "Location where you want to save your files or SAS Dataset";

example: Libname AA "C:/User/John/Documents/New Datasets/";

This will save every SAS dataset that you write to the AA library (for example, a dataset referenced as AA.MyData) in the New Datasets folder.

Now Learn the three major components of SAS Code:

Steps of SAS:

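1. Data Step – reads, creates, and modifies SAS datasets
2. Proc Step – runs a built-in procedure (such as Proc SQL or Proc Sort) on a dataset
3. Run Statement – not a step of its own; it ends a Data or Proc step and tells SAS to execute it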

So, that is all for today, folks, on the basics of SAS :p

If you want to learn more about it, let me know in comments. Thanks 🙂