How to check Duplicates based on Multiple Fields – SQL or SAS

You have big data sets that contain duplicates, and you need to check for duplicates based on a combination of two or three columns. How can you do that using SQL or SAS? You can use the following approaches to check duplicates based on multiple fields or columns in SQL or SAS:

Query: There is a dataset called “Country_Polulation” with the following fields: 1. Name 2. Age 3. CityName 4. DOB 5. Year

You need to find duplicate records – if the Name and DOB of two records are the same, that record is a duplicate. Follow these methods to check duplicates based on multiple fields:

Let’s say we are checking duplicates for the year 2018.

1. Using Having Clause (SQL):

The Having clause checks whether a Name and DOB combination occurs more than once; if it does, that combination is passed from the inner query to the outer query. If you need to learn the basics of SQL, follow these articles: SQL Basics and Learn Basic Functions of SQL

Select A.Name, A.DOB

from Country_Polulation as A

where (A.Name, A.DOB) in

(

Select B.Name, B.DOB

from Country_Polulation as B

where B.Year = 2018

Group by B.Name, B.DOB

Having count(*) > 1

)

order by A.Name;
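To sanity-check the logic, here is a small self-contained sketch using SQLite from Python. The sample rows are made up for illustration, and the subquery is written as a join (equivalent to the `(Name, DOB) IN (...)` form, and portable to databases that do not support row-value comparisons):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Country_Polulation (Name TEXT, Age INT, CityName TEXT, DOB TEXT, Year INT)"
)
rows = [
    ("Alice", 30, "Delhi", "1990-01-01", 2018),
    ("Alice", 31, "Mumbai", "1990-01-01", 2018),  # duplicate Name + DOB
    ("Bob", 25, "Pune", "1995-05-05", 2018),
]
cur.executemany("INSERT INTO Country_Polulation VALUES (?,?,?,?,?)", rows)

# Join form of the HAVING-based duplicate check.
dupes = cur.execute("""
    SELECT A.Name, A.DOB
    FROM Country_Polulation AS A
    JOIN (SELECT Name, DOB
          FROM Country_Polulation
          WHERE Year = 2018
          GROUP BY Name, DOB
          HAVING COUNT(*) > 1) AS D
      ON A.Name = D.Name AND A.DOB = D.DOB
    ORDER BY A.Name
""").fetchall()
print(dupes)  # both "Alice" rows are flagged
```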

 

2. Check Duplicates in SAS:

You will have to use a Proc SQL step for this task if you want to retrieve all the duplicate records. Basically, do a self join and compare the key fields for equality to identify duplicates.

Proc SQL;

Create table Dupe_Records as

Select A.*

from Country_Polulation as A

inner join

( Select B.Name, B.DOB, count(*) as Quantity

from Country_Polulation as B

where B.Year = 2018

group by B.Name, B.DOB

having count(*) > 1

) as C

on A.Name = C.Name and A.DOB = C.DOB

order by A.Name, A.DOB;

Quit;

This Dupe_Records table will contain all the duplicate records, ordered by Name and DOB in ascending order. The reason we couldn’t put the Name and DOB conditions directly in a Having clause in SAS is that SAS handles dates differently – DOB here carries a Datetime18. format when I tried the same.

3. Check Duplicates in Excel:

If the dataset is of small size, you can follow the mentioned steps:

  • Using Home Tab:

Home Tab -> Click on Conditional Formatting -> Click on Highlight Cells Rules -> Duplicate Values -> Choose the colors to highlight the duplicate text

  • Using Data Tab:

Select the data range or the columns wherein you need to check duplicates -> Data Tab -> Click on Remove Duplicates in Data Tools

This will give you the option of selecting which columns to consider for dupes. Select the desired columns and the dupes will be removed.
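Outside Excel, the same “remove duplicates on selected columns” idea can be sketched in a few lines of Python (the records below are made-up sample data, and the helper name is my own):

```python
def remove_duplicates(records, key_fields):
    """Keep the first record for each unique combination of key_fields."""
    seen = set()
    result = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            result.append(rec)
    return result

records = [
    {"Name": "Alice", "DOB": "1990-01-01", "CityName": "Delhi"},
    {"Name": "Alice", "DOB": "1990-01-01", "CityName": "Mumbai"},
    {"Name": "Bob", "DOB": "1995-05-05", "CityName": "Pune"},
]
deduped = remove_duplicates(records, ["Name", "DOB"])
print(len(deduped))  # 2 – the second "Alice" row is dropped
```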

Hope this post will help you 🙂

SAS Basics : Learn Creating SAS Dataset

Hi all, I know I am a bit late, actually very late, in posting this blog post. So, we will be learning SAS today, and I am covering some basics of SAS.

So, in order to learn it, there are multiple things we will cover first: what SAS is and why it is used.

SAS is a statistical and business analytics package that helps you derive insights from data and visualize the output for clients. Yeah, too much in one statement…?? 😉 No problem, we’ll understand all of these things here.

Firstly, let’s check what makes SAS and what are the primary components of it:

Primary components of SAS:

1. SAS Library – This is just a SAS folder where you will keep all your SAS datasets or the files you would like your system to read

2. SAS Editor Section – Here you will be writing your SAS code. SAS coding is quite SQL-like. If you want to learn SQL as well, check my article: Learn Basics of SQL

3. SAS Log Section – This SAS component gives you all the information about the changes your code has made, along with details of your imported SAS dataset: for example, the count of observations and variables, and the data type and length of each field. This component is very important for checking data-type issues while merging data or transferring it from SAS to Teradata.

4. SAS Debug Section – This section enables the SAS developer to do any sort of debugging. There are various options available that let you investigate an issue by filtering and sorting records.

How to Create SAS Library :

– This is the very first step you will take in SAS Enterprise Guide or Base SAS. There is a default SAS library called the “WORK” library, which keeps all temporary data. So, there are two types of libraries:

1. WORK
2. User Defined Library

Work – Keeps all temporary data for that particular SAS session only. Once you log off, all the data is erased. So, better save it in some user-defined library before losing it…LOL

User Defined Library – You can create a library using the following syntax:

Libname [Your Library Name] “Location where you want to save your files or SAS datasets”;

example: Libname AA “C:\Users\John\Documents\New Datasets”;

This will save all SAS datasets created under the AA libref in the New Datasets folder.

Now let’s learn the three major components of SAS code:

Steps of SAS:

1. Data Step
2. Proc Step
3. Run Statement (which executes the preceding step)

So, that’s all for today, folks, on the basics of SAS :p

If you want to learn more about it, let me know in comments. Thanks 🙂

Data Mining – Learn Data Mining, OLAP, OLTP and Clustering

Hi friends, let’s discuss the important concept of Data Mining and the four common tasks of data mining: data clustering, data classification, regression, and association rule learning.

This is an important topic to learn and adopt as a career option these days. Lots of people are trying their luck in this field by mastering data analysis skills. It is a growing field, and vacancies for professional data analysts, business analysts, and data scientists are expected to keep rising.

Hope you guys have checked my previous post on Malicious Programs/Malware for answering the questions related to that section.

Future Scope of Data Analyst:

Apart from the future perspective, data mining is an important topic for the various government exam vacancies for computer science professionals. So, I have tried to collect every important part of the data mining topic in this blog.

If you guys want any other topic covered, please let me know by adding a comment. Now, let’s first understand what data mining and data analysis are.


What is Data Mining and the use of Data Mining?

Data mining is the process of extracting patterns from data. It is an important tool used by modern businesses to derive information from data. Data mining is currently used in marketing, profiling, fraud detection, scientific discovery, etc.

Tasks of Data Mining:

  1. Data Clustering: It is the task of discovering groups and structures in the data that are similar in some way. Data clustering is performed without using known structures in data.
  2. Data Classification: Data classification is the task of generalizing known structures to apply to new data. Common classification algorithms are:
     2.1. Decision tree learning
     2.2. Nearest neighbor
     2.3. Naïve Bayesian classification
     2.4. Neural networks
     2.5. Support Vector Machines

3. Regression: With regression we attempt to find a function which models the data with the least error. There are different strategies for building regression models.

4. Association Rule Learning: This technique is used to search for relationships between variables. I would like to share a big example of association rule learning:

With the help of association rule learning, Amazon displays items frequently bought together as recommendations. This helps customers and increases sales.
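As an illustrative sketch of the “frequently bought together” idea, counting how often item pairs co-occur in shopping baskets looks like this (the baskets are made up, and a real system would use a proper algorithm such as Apriori rather than this brute-force count):

```python
from itertools import combinations
from collections import Counter

# Made-up shopping baskets; each set is one transaction.
baskets = [
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"laptop", "mouse", "charger"},
    {"phone", "charger"},
    {"laptop", "bag"},
]

# Count how often each unordered pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequently co-bought pair is a candidate recommendation.
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('laptop', 'mouse') 3
```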

Approaches to Data Mining Problems:

  1. Discovery of sequential patterns
  2. Discovery of patterns in time series
  3. Discovery of classification rules
  4. Neural Networks
  5. Genetic Algorithms
  6. Clustering and Segmentation

Goals of Data Mining and Knowledge Discovery:

  1. Prediction: Data mining can show how certain attributes within the data will behave in the future.
  2. Identification: Data mining can be used to identify the existence of an item.
  3. Classification: Data mining can partition the data so that different classes or categories can be identified.
  4. Optimization: Data mining can be used to optimize the use of limited resources such as time, space, money, or materials to maximize output.

What is OLTP (Online Transaction Processing)?

In order to understand OLTP, it is very important to be aware of transactions and transaction systems. So, what is a transaction? What are the properties of a transaction system? Let’s analyze the theory of transactions, and then we will cover OLTP.

Transaction and Transaction System:

A transaction is nothing but an interaction between different users or different systems or between a user and a system.

Transaction systems: Every organization needs some on-line application system to handle its day-to-day activities. Some examples of transaction systems are: salary processing, library management, banking, airlines, etc.


Transaction Properties:

Every transaction follows the ACID properties. Learn them like this. This is an important section, and government exams pick multiple questions from it.


Atomicity: This means a transaction should either completely succeed or completely fail.

Consistency: A transaction must preserve the database’s integrity; it must transform the database from one consistent state to another.

Isolation: This simply means the transactions of one user should not interfere with the transactions of another user of the database.

Durability: Once a transaction is complete, meaning committed, its changes should be permanently written to the database and available to all transactions that follow it.

I hope the ACID properties are clear to you guys. Please let me know if you need more information on this with examples.
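Atomicity is easy to see with a small SQLite sketch from Python (the table and amounts are made up): a transfer that fails mid-way is rolled back as a unit, so the database never shows a half-done transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'bob'")
        raise RuntimeError("simulated crash before commit")
except RuntimeError:
    pass

# Both updates were rolled back together; no half-done transfer is visible.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```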

Ever wondered how multiple transactions of different users can be processed simultaneously?? If yes, check the magic below:

Concurrency: Concurrency allows two different independent processes to run simultaneously, creating parallelism. This is what exploits the fast processing speed of computers.

What is Deadlock?

Deadlock is a situation where one transaction is waiting for another transaction to release a resource it needs, and vice versa. Each transaction ends up waiting forever for the other to release its resource, and the system halts.

How to handle Deadlock?

The simple rule for handling deadlock is: if a deadlock occurs, one of the participating transactions must be rolled back to allow the other to proceed. This way, both transactions can eventually complete. There are different schemes for deciding which transaction should be rolled back.

This decision depends on multiple factors given as following:

  1. The run time of the transaction.
  2. The data already updated by the transaction.
  3. The data remaining to be updated by the transaction.
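Rolling back a victim resolves a deadlock after it happens; a common way to avoid deadlocks in the first place (an aside, not one of the schemes above) is to make every transaction acquire its resources in one fixed global order. A minimal Python threading sketch, with made-up lock and transaction names:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
results = []

def transaction(name):
    # Every thread acquires the locks in the SAME order (a before b),
    # so no thread can hold one lock while waiting on the other:
    # the circular wait that defines deadlock cannot form.
    with lock_a:
        with lock_b:
            results.append(name)

t1 = threading.Thread(target=transaction, args=("txn1",))
t2 = threading.Thread(target=transaction, args=("txn2",))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # ['txn1', 'txn2'] – both transactions complete
```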

I have tried to cover this section completely, friends. Learn these data science concepts and you will be able to solve every question related to the data mining section.

In order to master this section, please check my next post on the previous-year questions from the Data Mining section.

Not sure about computer networking concepts? Need to score good marks in the computer networks section? If yes, do read my next post on Computer Networks and Network Topologies. Till then, C yaa friends 🙂

Big Data Analytics : Why you should learn this

Hi All,

There is a big hype these days about big data analytics. Let’s take a look at the scope, salary trends, and tools, and analyze why you should learn big data analytics and data science.

Big data is not simply something that is 15 gigabytes or 30 petabytes. I would say that whenever a data set exceeds an individual’s or a firm’s storage capacity, or their ability to analyze the data, that data becomes big data.

Big Data is something that you can’t deal with using traditional methods. Big data is a large amount of data that you can use to generate knowledge and to create visualizations that can help a business go from the bottom to the top. It helps in finding out pitfalls and market trends.

In our daily life we deal with so many data-generating machines around us without realizing it: for example, an ATM machine generates a large amount of data, and satellites create enormous amounts of data. As a result, the demand for people who know how to deal with such large amounts of data is increasing. Big data analytics has a vivid scope, and job demand will keep increasing. So, what are you waiting for? Keep checking this blog for more information, since I’ll be writing about the following topics soon:

As far as salaries are concerned, big data salaries are good.

  • The characteristics of Big Data

  • A strategy for Big Data

  • What Big Data systems are, and their use cases