How to check Duplicates based on Multiple Fields – SQL or SAS

You have a big data set that contains duplicates, but you need to check for duplicates based on a combination of two or three columns. How can you check that using SQL or SAS? You can use the following approaches to check duplicates based on multiple fields or columns in SQL or SAS:

Query: There is a dataset called “Country_Polulation” with the following fields: 1. Name 2. Age 3. CityName 4. DOB 5. Year

You need to check for duplicate records – for example, if the Name and DOB of two records are the same, that record is a duplicate. Follow the following methods to check duplicates based on multiple fields:

Let’s say we are checking duplicates for year – 2018

Using Having Clause:

The Having clause will check the count of each (Name, DOB) combination; if a combination exists more than once, it is passed from the inner query to the outer query. If you need to learn the basics of SQL, follow these articles: SQL Basics and Learn Basic Functions of SQL

Select A.Name, A.DOB

from Country_Polulation as A

inner join

(

Select B.Name, B.DOB

from Country_Polulation as B

where B.Year = 2018

Group by B.Name, B.DOB

Having count(*) > 1

) as Dup

on A.Name = Dup.Name and A.DOB = Dup.DOB

order by A.Name
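The same Group By / Having idea can be sanity-checked from Python with SQLite (a sketch: the table and column names follow the example above, and the sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country_Polulation (Name TEXT, Age INT, CityName TEXT, DOB TEXT, Year INT)"
)
rows = [
    ("John", 30, "Delhi",  "1988-01-01", 2018),
    ("John", 31, "Mumbai", "1988-01-01", 2018),  # same Name + DOB -> duplicate
    ("Nick", 25, "Pune",   "1993-05-23", 2018),
]
conn.executemany("INSERT INTO Country_Polulation VALUES (?, ?, ?, ?, ?)", rows)

# (Name, DOB) combinations that appear more than once in 2018
dupes = conn.execute(
    """
    SELECT Name, DOB, COUNT(*) AS Quantity
    FROM Country_Polulation
    WHERE Year = 2018
    GROUP BY Name, DOB
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dupes)  # [('John', '1988-01-01', 2)]
```

Joining this result back to the table, as in the query above, then pulls out the full duplicate rows.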

 

2. Check Duplicates in SAS:

You will have to use a Proc SQL step for this task if you want to retrieve all the duplicate records. Basically, do a self join and compare the key fields for equality to find the duplicates.

Proc SQL;

Create table Dupe_Records as

Select A.*

from Country_Polulation as A

inner join

( Select B.Name, B.DOB, Count(*) as Quantity

from Country_Polulation as B

where B.Year = 2018

group by B.Name, B.DOB

having count(*) > 1

) as Dup

on A.Name = Dup.Name and A.DOB = Dup.DOB

order by A.Name, A.DOB;

Quit;

This Dupe_Records table will contain all the duplicate records, ordered by Name and DOB in ascending order. The reason we couldn’t put the Name and DOB condition directly in the Having clause in SAS is that SAS deals differently with dates – DOB carried a DATETIME18. format here when I tried the same.

3. Check Duplicates in Excel:

If the dataset is of small size, you can follow the mentioned steps:

  • Using Home Tab:

Home Tab -> Click on Conditional Formatting -> Click on Highlight Cells Rules -> Duplicate Values -> Choose the colors to highlight the duplicate text

  • Using Data Tab:

Select the data range or the columns wherein you need to check duplicates -> Data Tab -> Click on Remove Duplicates in Data Tools

This will give you the option of selecting which columns to consider for dupes. Select the desired columns and the dupes will be removed.

Hope this post will help you 🙂

SAS Basics : Learn Creating SAS Dataset

Hi all, I know I am a bit late, actually very late, in posting this blog post. So, we will be learning SAS today, and I am covering some basics of SAS.

So, in order to learn it, there are multiple things that we will cover first: what SAS is and why it is used.

SAS is a statistical and business analytics package that helps you derive insights from data and visualize the output for clients. Yeah, too much in one statement…?? 😉 No problem, we’ll understand all of these things here.

Firstly, let’s check what makes SAS and what are the primary components of it:

Primary components of SAS:

1. SAS Library – This is just a SAS folder where you will keep all your SAS datasets or the files you would like your system to read

2. SAS Editor Section – Here you will write your SAS code. SAS coding is quite SQL-like. If you want to learn SQL as well, check my article: Learn Basics of SQL

3. SAS Log Section – This SAS component gives you all the information about the changes your code has made, plus information about your imported SAS dataset – for example, the count of observations and variables, and the data type and length of each field. This component is very important for checking data type issues while merging data or transferring it from SAS to Teradata.

4. SAS Debug Section – This section enables the SAS developer to do any sort of debugging of an issue. There are various options available that let you check the issue by filtering and sorting records.

How to Create SAS Library :

– This is the very first step that you will do in SAS Enterprise Guide or Base SAS. There is a default SAS library called the “WORK” library. It keeps all the temporary data. So, there are two types of libraries:

1. WORK
2. User Defined Library

Work – Keeps all temporary data of that particular SAS session only. Once you log off, all the data is erased. So, better save it in some user defined library before losing it…LOL

User Defined Library – You can create a library using the following syntax:

Libname [Your Library Name] “Location where you want to save your files or SAS datasets”;

example: Libname AA “C:\Users\John\Documents\New Datasets”;

This will save all the SAS datasets in the New Datasets folder.

Now Learn the three major components of SAS Code:

Steps of SAS:

1. Data Step
2. Proc Step
3. Run Step

So, that’s all for today, folks, in the basics of SAS :p

If you want to learn more about it, let me know in comments. Thanks 🙂

Intelenet Global Services Analyst Interview Experience

Hola friends, I know this post is very late. Sorry for this. Check out my blog for real interview experiences of different companies. So, today I have brought to you the Intelenet Global Services Analyst Interview Experience.

Intelenet Global Services Data Analyst

Data Analyst position in Intelenet Global Services:

The position was Data Analyst, which is counted as a sub-portion of the sexiest job of the 21st century. A data analyst’s work involves collecting, cleaning, transforming and analyzing data.

Job Description:

The job description for this post at Intelenet Global Services included Python, R or SAS, and knowledge of statistics….yeah, the core things of data science. It also asks you to have knowledge of some BI tool like Tableau. That would be desirable for you 😉

How Intelenet Global Services Recruits:

So, the recruitment process at Intelenet Global can start with a consultancy, an employee referral or their company database, if your resume is that smart 😉

My friend got a call from a consultancy, and the consultant explained the job description at Intelenet Global Services. Intelenet is a business process outsourcing company offering services in multiple sectors, and this profile was Data Analyst. The position was for the UK shift and no cab facility is provided with it… Huh.. :/ When I heard this, it was a bit upsetting for me and my friend.

So, my friend agreed on the terms specified by the consultant and the job consultant agreed upon a date for interview.

The Interview Day:

My friend went to the Gurgaon Intelenet Global Services office. He found the office entrance a bit sad.. but the internal bay system and the people there were good. The first round was with a BI specialist, and she was very friendly in nature. It was more of an HR kind of interview with some basic questions like:

- Tell me about yourself

- Your current role

- The complex project you have done so far and your responsibilities in it

- What kind of tools you use

- Your reason for a job change

- Why Intelenet Global Services

- Any BI tool experience (by the way, they use Tableau in most cases for reporting purposes – specific to the Gurgaon office). So if you know any BI tool, that will be an advantage for you 😀.

Then she will explain what tools they use and what the role is all about.

If the HR finds your candidature relevant, she may call you for a coding round…

The Second Round of Interview

The second round of the interview was a coding round, and the questions were related to advanced Python concepts. It wasn’t a pen-and-paper based test.

- So, if you are someone who only knows how to read a simple csv file as a dataframe, then this profile might not be for you.

- If you have only ever worked with a single data type inside a dataframe column, then this place might not be for you..

Guys, you need to be very good with Python concepts if you want to clear their coding round.

Also, some basic knowledge of SQL is required to crack the test. Don’t know anything about SQL?.. Aah.. don’t worry.. check my posts on SQL Basics and Advanced SQL functions to master data management. Any feedback is welcome 🤗

Here are some hints for this coding round:

1. Try to master how to read multiple files at the same time and append them to each other – like reading all the excel or csv files from a directory and then appending them into one file.

2. Extracting elements from a nested list. Writing a generic function to extract nested list items, checking prime/even-odd, and splitting the data into two new lists based on the function.

3. Handling a dataframe whose columns hold different data types separated by different delimiters… My friend was totally blown away by this question 😉

4. The fourth and last question was based on your knowledge of dictionaries and lists. Writing a function that compares the key values of a dictionary against a list and then deletes the matching key values from the list.
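As a sketch of the second hint – the function names and the even/odd criterion here are my own illustration, not the actual test question:

```python
def flatten(nested):
    """Recursively pull every item out of an arbitrarily nested list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))  # descend into the sublist
        else:
            flat.append(item)
    return flat

def split_even_odd(nested):
    """Flatten the nested list, then split the numbers into two new lists."""
    flat = flatten(nested)
    evens = [n for n in flat if n % 2 == 0]
    odds = [n for n in flat if n % 2 != 0]
    return evens, odds

evens, odds = split_even_odd([1, [2, [3, 4]], 5])
print(evens, odds)  # [2, 4] [1, 3, 5]
```

The same pattern (generic recursive extractor plus a predicate-based split) also covers the prime/composite variant.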

So, guys, master your Python basic and advanced skills before appearing for Intelenet Global Services.

The sad part: the consultant had told him that this position was for statistical analysis, clustering, classification etc… However, there wasn’t any question related to these topics. But this doesn’t mean that these topics will not be asked in further interview rounds, since my friend was asked to leave for the day… Aah.. I hate this line.

All the best guys if you are going to appear for the data analyst profile. I hope this post will help you 🙂 Let me know if you have some other questions here… I am here to help you..

Thank you

Your Mystery Solver 😉

 

 

Date Time Functions in SQL and the Use of Date Time Functions

Date Time functions in SQL

There are various date time functions available in SQL which allow us to create triggers based on dates. Date time functions let us check whether an input string is a date or not. Also, we can check what the day or week number is for a specific date using date time functions.


Let’s check various date time function available with us:

  1. IsDate

The isDate function is used to check whether the input string is a date or not.

Command: Select IsDate('2012-05-30') – it will return 1

If the input string is not a date, the IsDate date time function will return zero.

The IsDate function works well on the date, time and datetime datatypes but not with datetime2 (a higher-precision timestamp type).

2. Day()

This date time function returns the “day number” of the specified input date string.

Command: Select Day('2012-02-28 02:35:35') – Returns 28

3. Month()

This sql date time function returns the “month number” of the specified input date string.

Command: Select Month('2012-02-28 02:35:35') – Returns 2

4. Year()

This sql date time function returns the “year number” of the specified input date string.

Command: Select Year('2012-02-28 02:35:35') – Returns 2012

5. DateName()

The DateName() date time function returns a string value, while the other date time functions return integer values.

There are various attributes that can be used with the DateName() function. Let me share some examples of it here:

  1. DateName(Day, 'Date value'|'DateTime Value') – This returns the day value (as a string)

Command: Select DateName(Day, '2013-05-23') – the output will be 23

2. DateName(WeekDay, 'Date value'|'DateTime Value') – This returns the week day name

Command: Select DateName(WeekDay, '2013-05-23') – the output will be Thursday (2013-05-23 fell on a Thursday).

3. DateName(Month, 'Date value'|'DateTime Value') – It returns the month name as a string

Command: Select DateName(Month, '2013-05-23') – the output will be May.

6. Datepart()

The DatePart() function works in a similar way to the DateName() function for finding, say, the weekday; the only difference between DatePart() and DateName() is that DatePart() returns an integer.

e.g. Select DatePart(WeekDay, '2013-05-25') will return 7, since 2013-05-25 was a Saturday (with the default DATEFIRST setting, where Sunday = 1).

7. DateAdd()

This sql function will add day, month or year according to the specified value.

Command: Select DateAdd(Day|Month|Year, integer value to add, Date Value)

e.g. Select DateAdd(Day, 10, '2013-05-20') – This will output '2013-05-30'


8. DateDiff()

This sql function returns the difference between two dates in days, months or years, according to the specified datepart.

Command: Select DateDiff(Day|Month|Year, start date, end date)

e.g. Select DateDiff(Month, '2013-01-20', '2013-05-20') – This will output 4
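These T-SQL date functions have close analogues in Python’s datetime module, which is a quick way to sanity-check the expected outputs (the sample dates are the ones used above):

```python
from datetime import datetime, timedelta

d = datetime.strptime("2013-05-23 02:35:35", "%Y-%m-%d %H:%M:%S")
day_part = d.day                 # DAY()                  -> 23
month_part = d.month             # MONTH()                -> 5
year_part = d.year               # YEAR()                 -> 2013
weekday_name = d.strftime("%A")  # DATENAME(WEEKDAY, ...) -> 'Thursday'
month_name = d.strftime("%B")    # DATENAME(MONTH, ...)   -> 'May'

# DATEADD(DAY, 10, '2013-05-20') -> '2013-05-30'
added = datetime(2013, 5, 20) + timedelta(days=10)
print(added.date())  # 2013-05-30
```

DATEDIFF has no single built-in analogue, but subtracting two datetimes gives a timedelta you can inspect the same way.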

Cast and Convert Functions in SQL

Cast and convert functions allow you to convert one data type into another.

Syntax of Cast function:

Select cast(Column_name as Data type)

e.g. Select cast(date as nvarchar)

Syntax of Convert function:

Convert function lets you choose a style parameter when converting to string values. The style feature is not available in the cast function. By styling I mean a user defined format in which the output is required.

e.g. Select convert(data type, column_name, style)

There are specific style codes, like 103, which means dd/mm/yyyy. Thus the following example:

Select convert(nvarchar, DateOfBirth, 103)
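Python’s strftime gives the same kind of styling; a sketch of style 103 (dd/mm/yyyy), using a made-up date of birth:

```python
from datetime import datetime

date_of_birth = datetime(1990, 7, 4)  # hypothetical DateOfBirth value
# Equivalent of CONVERT(nvarchar, DateOfBirth, 103)
style_103 = date_of_birth.strftime("%d/%m/%Y")
print(style_103)  # 04/07/1990
```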

Stored Procedures and String Functions in SQL

Hola friends, let’s understand what are stored procedures and the benefits of using it. I hope you have learnt some basics of SQL before this. Learn basics of SQL in my previous post.

What are Stored Procedures?

Stored procedures are a named set of SQL instructions that we need to run again and again. In SQL we can save these frequent instructions as a procedure and call them by just their name.

 


How to create Stored Procedures?

To create stored procedures, simply use the Create Proc/Procedure Procedure_name command.

We can also pass parameter in a stored procedure. The parameters can be of two type:

1. Input (Used to take input)

2. Output (used to return output) – The Output keyword must be specified for the parameter in the declaration.


Running the Stored Procedures:

The command used to execute a stored procedure is: EXEC|Execute Procedure_name

e.g. Exec HumanResources.uspFindEmployee '123' (Here 123 is the value for the @BusinessEntityID input parameter.)

Stored Procedure with Output Parameter

An example of output parameter is:

Create Proc spGetEmployeeCount

@LastName nvarchar(50),

@EmployeeCount int Output

As

Begin

Select @EmployeeCount = Count(*) from Emp where LastName = @LastName

End

Executing Stored Procedure with Output Parameter:

To execute a stored procedure with an output parameter, it is very important to declare the output variable first. To declare the output variable and run the procedure, use the following pattern:

Declare @OutputVariableName Datatype

Exec StoredProcedureName @InputParameter = value, @OutputParameter = @OutputVariableName Output

Print @OutputVariableName

e.g.

Declare @TotalCount Int

Exec spGetEmployeeCount @LastName = 'Male', @EmployeeCount = @TotalCount Output

Print @TotalCount

Benefits of Stored procedures:

  1. The execution plan can be reused

When a normal statement is executed, the server works out a path – which columns to select, how to filter through the where clause, how to order the result. This is called an execution plan. Stored procedures let this plan be cached and reused, so we save time.

e.g. Select name, gender from Emp

where id IN (12, 14, 15)

order by name

A normal ad-hoc statement executed with different parameters may get a new execution plan each time. However, a stored procedure reuses the same execution plan even with different parameter values.

2. Reduced Network Traffic

Since stored procedures allow code re-usability, the application sends only the short procedure call over the network instead of the full SQL text. Thus network traffic is reduced.

3. Easy Maintainability using Stored Procedures

Maintenance becomes easy. Changing a stored procedure in one place is easier than finding similar statements at various places and then modifying the code.

4. Code Re-usability

The instructions that will be required again and again need not be typed again. We just have to create a stored procedure with those statements, and this helps us reuse the code with less time and space.

5. Better Security

The database can be huge, and we don’t want to give every user in the network access to everything. So, build procedures on the specific tables a user needs and grant access to the procedures rather than the tables. This way we will be able to provide better security.

6. Avoid SQL Injection Attack

Stored procedures also help in avoiding SQL injection attacks. To know more about SQL injection attacks, go to the link.

Learn basics of SQL here

String Functions in SQL

  1. Left function:

This string function in sql returns the specified number of characters from the left of a string.

command: Select Left(string, integer count)

e.g. Select Left('abcd', 3) – Result will be 'abc'

2. Right Function

This string function in sql returns the specified number of characters from the right of a string/column value.

command: Select Right(string/column, integer count)

e.g. Select Right('abcd', 3) – Result will be 'bcd'

3. CharIndex

This sql function returns the index of the first occurrence of a character (or substring) in a string. Note that SQL string positions start at 1.

command: Select CharIndex('a', 'abcd') – answer would be 1

4. Len

This function returns the total length of a string or a column value of string type.

command e.g.: Select Len('abcd') – answer would be 4

Note: This sql string function will not count the blank values at the end of a string.

5. Substring

This sql function is used to select a substring value from a string. In SQL Server the function is Substring(); some other databases (e.g. Oracle, MySQL) call it Substr().

Command: Select Substring('string value'|column name, position to start, number of characters to fetch)

e.g. Select Substring('abcd', 1, 3) – answer would be 'abc'

In some databases (e.g. MySQL, Oracle) we can also choose negative indexing, where -1 indicates counting from the right; SQL Server’s Substring() does not support this.

6. Replicate

Replicate function repeats a specific string the specified number of times.

command: Select Replicate('string', number of times to replicate)

e.g. Select LastName + Replicate('*', 5)

This command will append the * five times to LastName. Let’s say the LastName is “John”; then the output is: “John*****”

7. Space

This sql function will insert the specified number of spaces between column values.

command: Select LastName + Space(5) + FirstName – This command will introduce a 5-character space between the LastName and FirstName column values.

8. PatIndex (or Pattern Index)

PatIndex works the same way as CharIndex by telling the first occurrence. However, PatIndex allows you to use wildcards; you can’t use wildcards with CharIndex.

e.g. Select PatIndex('%aaab%', 'abcaaababc') – The answer would be 4

If it does not find any matching pattern, this sql string function returns zero.

9. Replace

This string function replaces one substring with another.

Command: Select Replace(string, 'value to replace', 'replacement value')

e.g. Select Replace('abcd.com', '.com', '.net') – The result will be 'abcd.net'

10. LTRIM

This string function is used to trim leading blanks from the beginning of a string.

Command e.g.: Select LTRIM(LastName)

11. RTRIM:

This sql function trims the trailing blanks.

e.g. LastName: “abcd  ”

Select RTRIM(LastName) – the result will be “abcd”

12. ASCII:

This SQL function returns the ASCII code of a character value.

Command: Select ASCII(character value)

e.g. Select ASCII('A') – The answer would be 65

13. CHAR

This SQL function returns the character value for an integer ASCII code.

Command: Select CHAR(integer value)

e.g. Select CHAR(65) – The answer would be A

14. STUFF

This sql function deletes the given number of characters starting at the specified position and inserts the replacement string there. It is a kind of masking.

Command: Select STUFF(column, starting position, length, string to be used as replacement)

e.g. Select STUFF(LastName, 1, 3, '**')

If the last name is “JohnMarcel”, the output of the stuff function will be “**nMarcel”
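Python one-liners are a handy way to verify what each of these string functions should return (keeping in mind that SQL positions are 1-based while Python’s are 0-based):

```python
s = "abcd"
left3 = s[:3]                 # LEFT('abcd', 3)        -> 'abc'
right3 = s[-3:]               # RIGHT('abcd', 3)       -> 'bcd'
char_index = s.find("a") + 1  # CHARINDEX('a', 'abcd') -> 1 (1-based)
length = len("abcd")          # LEN('abcd')            -> 4
replaced = "abcd.com".replace(".com", ".net")  # REPLACE -> 'abcd.net'
ltrimmed = "  abcd".lstrip()  # LTRIM
rtrimmed = "abcd  ".rstrip()  # RTRIM
ascii_a = ord("A")            # ASCII('A')             -> 65
char_65 = chr(65)             # CHAR(65)               -> 'A'
# STUFF('JohnMarcel', 1, 3, '**'): delete 3 chars from position 1, insert '**'
stuffed = "**" + "JohnMarcel"[3:]
print(stuffed)  # **nMarcel
```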

SQL Basics and Queries : SQL Tutorials

What is a Database?

A database is an organized collection of related information. In daily life we deal with lots of data, and with internet technology more and more data is being produced these days. We have multiple database management systems available to manage, store and update this enormous data in a convenient way, e.g. Oracle, Sybase, Microsoft SQL Server etc.

DBMS and SQL


DBMS (Database Management System) is a collection of software tools used to manage, update and retrieve the data in the database. SQL (Structured Query Language) is used to interact with the DBMS.


All queries here have been executed on Microsoft SQL Server Management Studio version 17.0. SSMS is a client tool, not the server; it is used as a tool to connect to the database server.

Settings: Local Host

Connect: Database Engine

Use SQL authentication username and password

SQL Databases:

In Microsoft SQL Server Management Studio you will find two types of databases:

  1. System Database
  2. User created Database

-System database can’t be deleted

SQL Command Types:
  1. DDL (Data Definition Language) – Used to define/create database objects
  2. DML (Data Manipulation Language) – Used to insert, update and delete values in the database objects created by DDL commands.
  3. TCL (Transaction Control Language) – Used to control transactions through Commit and Rollback commands

SQL DDL Commands – data definition language (Create, Alter and Drop commands)

1. Creating a database:

Database can be created either using GUI or through SQL query in SSMS.

Create statement is used for this purpose: Create [Database Object] [Database Object name]

Ex. Create Database db1 (this statement will create a database with name db1)

Whenever we create a database, two types of files are created with it: 1. .MDF file (contains actual data) 2. .LDF file (contains log file)

2. Modify a Database:

Alter statement is used to alter a sql database object.

Alter Command:  Alter [Database Object] [Database Object name] Modify Col1 = Value

E.g. Alter Database db1 Modify Name = db2 (this will change the name of the database)

Renaming through a system stored procedure: sp_renamedb 'Old database name', 'New database name'

e.g. Exec sp_renamedb 'db1', 'db2'

3. Dropping a Database:

Drop statement is used to delete a database completely from the system(.mdf and .ldf files are also deleted with it)

Drop command: Drop [Database Object] [Database Object name]

e.g. Drop Database db1 (this will delete database db1)

Note – If the database is currently in use by another user, dropping it will generate an error. Make sure no one else is connected to it.

To resolve this, put the database into single-user mode first:

Alter Database db1 Set SINGLE_USER With Rollback Immediate

(Rollback Immediate rolls back any open transactions so the database can be dropped immediately)

SQL DML Queries : Insert, Update, delete

1. Create a Table:

Command: Create Table [table name] ([column name] [data type of column] [constraint])

e.g. Create table t1(ID int NOT NULL Primary Key, Gender nvarchar(20) NOT NULL)

This command will create a table named t1 with 2 columns, ID and Gender, of int and nvarchar datatypes respectively. nvarchar is a UNICODE data type and stores 2 bytes per character, while varchar stores 1 byte per character.

In order to store the table in a particular database use the following command:

Use [database name]

Go

Create table command….

Primary Key – Can’t be null and must be unique. It uniquely identifies each row in the table

Foreign key – It can contain null values, and it references a primary key present in some other table (basically the column in which it looks for a value). A foreign key is used to establish a relationship between two tables. It is used to enforce database integrity.

Create a Foreign key relation –

Alter table [table name] add constraint [constraint name] foreign key ([foreign key column name]) references [primary key table name] ([primary key column])

e.g. Alter table tb1 add constraint tb1_genderid foreign key (GenderID) references tb(ID)

Note – The constraint name should make sense, like tableName_columnName

2. Select all values of a table:

Command: Select * from [table name]

To list all the tables of a database in SQL Server, use:

Select * from INFORMATION_SCHEMA.TABLES

(Oracle’s DUAL, by contrast, is just a one-row dummy table used for selecting expressions, not a list of tables.)

3. Insert values in a table :

Insert command is used to insert values in a table: Insert into [table name] (col 1, col 2, …) Values(col 1 value, col 2 value,…)

e.g. Insert into a1 (id, name, gender) values (11, 'ss', 'male')

4. Adding a Default value in a column:

We can assign default values to a column rather than assigning Null values:

Alter table [table_name] add constraint constraint_name Default [default value] For [column name]

e.g. Alter table tb1 add constraint tb1_gender default 2 for gender

This command will assign default value 2 to column gender if value not explicitly defined.

5. Adding a New column into table:

Command: Alter table [table name] add [column name] [column data type] [NULL|NOT NULL] constraint [constraint name] default [default value]

Alter table tb1 add Address nvarchar(50) Not Null constraint tb1_address default 'xyz'

This command will add one column, Address, to the table tb1 that doesn’t accept null values. Also, the default value 'xyz' will be assigned to it.

6. Dropping a Constraint:

Command: Alter table [table name] Drop Constraint [constraint name]

e.g. Alter table tb1 drop constraint tb1_gender

This will drop the constraint.

7. Delete a Table record:

To delete a table record, we use delete command:

Delete from [table name] where column1 = 'column value'

Note: Where clause is used to put some condition on search selection

However, you can’t delete a record if rows in another table reference it through a foreign key. The cascading referential integrity constraints imposed on the foreign key decide what happens in that case.

8. Cascade Referential Integrity Constraint:

We can choose what should happen when a primary key record referenced by a foreign key is deleted. Four options are there:

  1. No Action: This will simply generate an error if a record from the primary key table that has related values in the foreign key table is deleted.
  2. Cascade: This option will delete all the foreign key records that are related to the deleted primary key record.
  3. Set NULL: This option will set the dependent foreign key values to Null.
  4. Set Default: This option will set the dependent foreign key values to the default value provided for the column.
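A quick way to see the Cascade option in action is SQLite from Python (a sketch with made-up tables; note that SQLite only enforces foreign keys when the pragma is switched on per connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE gender (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE tb1 (
        name TEXT,
        gender_id INTEGER REFERENCES gender(id) ON DELETE CASCADE
    );
    INSERT INTO gender VALUES (1, 'male'), (2, 'female');
    INSERT INTO tb1 VALUES ('John', 1), ('Ann', 2);
""")
# Deleting the parent row cascades to the dependent rows in tb1
conn.execute("DELETE FROM gender WHERE id = 1")
remaining = [r[0] for r in conn.execute("SELECT name FROM tb1")]
print(remaining)  # ['Ann']
```

Swapping ON DELETE CASCADE for ON DELETE SET NULL or SET DEFAULT demonstrates the other two options the same way.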

9. Adding a Check Constraint:

This constraint is used to enforce value checks on a column, e.g. the value in the Age column must be > 4.

Command: Alter table [table name] add constraint [constraint name] check (boolean expression)

e.g. Alter table tb1 add constraint tb1_age_check check(AGE>0 AND AGE<30)

This command will only let you add age between 0 and 30 in the Age column.

Note: The check constraint returns a Boolean value based on which the value is entered in the table. It also let you insert Null values because for Null values, check constraint returns “Unknown”.

10. Identity Column:

It is a column property in SSMS.

Identity column is a column to which values are automatically assigned.

Identity Seed: A value with which the identity column value starts

Identity Increment: The value with which identity column value is incremented.

Command: Create table stu(id int identity(1,1) Primary key)

This command will create a stu table having id as an identity column. The id column here will start from 1 and be incremented by 1.
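SQLite’s closest analogue is an INTEGER PRIMARY KEY AUTOINCREMENT column; it always seeds at 1 and increments by 1, since identity(seed, increment) itself is SQL Server syntax. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stu (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
# id is never supplied; the database assigns 1, 2, 3, ...
conn.execute("INSERT INTO stu (name) VALUES ('John')")
conn.execute("INSERT INTO stu (name) VALUES ('Nick')")
ids = [r[0] for r in conn.execute("SELECT id FROM stu ORDER BY id")]
print(ids)  # [1, 2]
```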

 

11. Setting up an External/Explicit Value for an Identity Column:

To set up external value in Identity column, add the following command before inserting values in table:

Command: Set IDENTITY_INSERT [table name] ON

Insert into table_name (column list) values (1, '23', …etc)

12. Setting Off External/Explicit Values for an Identity Column:

Command: Set IDENTITY_INSERT [table name] OFF

Insert into table_name (column list) values (1, '23', …etc)

Note: To reset the identity column value, use the DBCC command, e.g. DBCC CHECKIDENT ('table_name', RESEED, 0).

13. Unique Key Constraint:

Unique key constraint is used to enforce unique values in a column. There is a slight difference between a primary key and a unique key.

Primary key values = Unique+Not Null

Unique constraint value = Unique + values can be null

Command: Alter table table_name add constraint constraint_name unique(column name)

or

Create table Stu(Name varchar(20) Unique)

14. Applying a Trigger:

First, let’s try to understand what a trigger is. A trigger is a SQL instruction (or set of instructions) that will cause an action once a specific condition occurs. For example: inserting a row into table 2 when a row is inserted into table 1.

Command:

Create Trigger [trigger_name] on [table_name] for Insert/Update/Delete/Condition

as

begin

[instructions]

end
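A runnable sketch of exactly that example, using SQLite from Python (the table and trigger names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INT);
    CREATE TABLE table2 (id INT);
    -- fire after every insert on table1 and copy the new row into table2
    CREATE TRIGGER trg_copy AFTER INSERT ON table1
    BEGIN
        INSERT INTO table2 VALUES (NEW.id);
    END;
""")
conn.execute("INSERT INTO table1 VALUES (7)")
copied = [r[0] for r in conn.execute("SELECT id FROM table2")]
print(copied)  # [7]
```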

15. Selecting values from a table:

Select is a command used to retrieve records from a table.

  1. To fetch all records from a table:

Command: Select * from [table_name]

e.g. Select * from emp

2. Select specific columns from a table:

Command: Select [col_name_1], [col_name_2]… from [table_name]

e.g. Select name, age, id from Employee

3. Fetch all distinct records from a table:

Command: Select distinct [column_name] from [table_name]

e.g. Select distinct name from Employee

This command will fetch the distinct values of the Name column from the table Employee.

4. Fetch record matching a specific condition:

Where is used to apply a specific condition in the SQL command.

Command: Select * from table_name where column_name = condition value

e.g. Select name, id from Employee where name = 'John'

This command will fetch all the records from the table with the name column value as John.

5. Fetch record not matching a specific condition (column value):

Command: Select * from table_name where col_name <> Column value

“<>” signifies not equal to here. We can also use “!=” to compare values.

6. OR operator in SQL:

OR operator is used to specify two or more conditions together.

Command: Select * from table_name where col1=value OR col2=value

e.g. Select name, age, salary from Employee where name = 'John' OR name = 'Nick'

This sql command will fetch all the table records where the name is either John or Nick

7. AND Operator in SQL:

AND operator is used to specify two or more conditions that must all be true.

Command: Select * from table_name where col1=value AND col2=value

e.g. Select name, age, salary from Employee where name = 'John' AND age = 30

This sql command will fetch all the table records where the name is John and the age is 30

8. IN Operator in SQL:

IN operator is used to retrieve records where condition matches more than 1 value. (And you don’t want to use OR multiple times in a sql command)

Command: Select * from table_name where col_name IN(value1, value2, value3…)

e.g. Select * from Employee where age IN (21, 25, 30)

This command will fetch all the table records where age is either 21, 25 or 30.
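Putting the Where, OR, AND and IN operators together in one runnable sketch (SQLite from Python, with a made-up Employee table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INT, name TEXT, age INT, salary INT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "John", 30, 100), (2, "Nick", 25, 90), (3, "Ann", 21, 80)],
)
# OR: either condition may match
or_names = [r[0] for r in conn.execute(
    "SELECT name FROM Employee WHERE name = 'John' OR name = 'Nick' ORDER BY name")]
# AND: both conditions must match
and_names = [r[0] for r in conn.execute(
    "SELECT name FROM Employee WHERE name = 'John' AND age = 30")]
# IN: shorthand for several ORs on the same column
in_names = [r[0] for r in conn.execute(
    "SELECT name FROM Employee WHERE age IN (21, 25) ORDER BY name")]
print(or_names, and_names, in_names)
```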

SQL Wildcards

SQL supports various kinds of wildcard characters to facilitate data retrieval in multiple ways: “%” matches any string of zero or more characters, “_” matches any single character, “[]” matches any single character within the set, and “[^]” matches any single character not in the set.

Data Mining – Learn Data Mining, OLAP, OLTP and Clustering

Hi friends, let’s discuss the important concept of data mining and its four common tasks: data clustering, data classification, regression and association rule learning.

This is an important topic to learn and adopt as a career option these days. Lots of people are trying their luck in this field by mastering data analysis skills. It is a growing field, and demand for professional data analysts, business analysts and data scientists keeps growing.

Hope you guys have checked my previous post on Malicious Programs/Malwares for answering the questions related to this section.

Future Scope of Data Analyst:

Apart from the career perspective, data mining is an important topic for the various government exam vacancies for computer science professionals. So, I have tried to collect every important part of the data mining topic in this blog.

If you guys want any other topic to be covered, please let me know by adding a comment. Now let’s first understand what Data mining and data analysis is.


What is Data Mining and the use of Data Mining?

Data mining is the process of extracting patterns from data. It is an important tool used by modern businesses to derive information from data. Data mining is currently used in marketing, profiling, fraud detection, scientific discovery and more.

Tasks of Data Mining:

  1. Data Clustering: This is the task of discovering groups and structures in the data that are similar in some way. Data clustering is performed without using known structures in the data.
  2. Data Classification: Data classification is the task of generalizing known structures to apply to new data. Common classification algorithms are:

     2.1. Decision tree learning
     2.2. Nearest neighbor
     2.3. Naïve Bayesian classification
     2.4. Neural networks
     2.5. Support Vector Machines
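As a flavor of the simplest of these algorithms, here is a minimal nearest-neighbor classifier in plain Python; the training points and labels are made up, and the new point simply takes the label of its closest training point.

```python
# 1-nearest-neighbor: classify a point by the label of the closest
# training example (squared Euclidean distance is enough for comparison).
def nearest_neighbor(train, point):
    """train: list of ((x, y), label) pairs; point: (x, y)."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(train, key=lambda item: dist2(item[0], point))[1]

train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(nearest_neighbor(train, (2, 1)))  # closest to the "A" cluster
print(nearest_neighbor(train, (8, 9)))  # closest to the "B" cluster
```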

3. Regression: With regression we attempt to find a function that models the data with the least error. There are different strategies for building regression models.
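For instance, fitting a straight line y = ax + b by least squares has a closed form; a sketch on made-up points:

```python
# Least-squares fit of y = a*x + b using the closed-form formulas:
# a = cov(x, y) / var(x),  b = mean(y) - a * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]        # these points lie exactly on y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)              # 2.0 1.0
```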

4. Association Rule Learning: This learning is used to search for relationships between variables. A well-known example of association rule learning:

With the help of association rule learning, Amazon displays items that are frequently bought together as recommendations. This helps customers and increases Amazon's sales.
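The "frequently bought together" idea rests on two simple metrics, support and confidence. A sketch with made-up shopping baskets:

```python
# support(X) = fraction of baskets containing all items in X
# confidence(X -> Y) = support(X and Y) / support(X)
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(items):
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    return support(x | y) / support(x)

print(support({"bread"}))                 # 3 of 4 baskets -> 0.75
print(confidence({"bread"}, {"butter"}))  # 2 of 3 bread buyers also buy butter
```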

Approaches to Data Mining Problems:

  1. Discovery of sequential patterns
  2. Discovery of patterns in time series
  3. Discovery of classification rules
  4. Neural Networks
  5. Genetic Algorithms
  6. Clustering and Segmentation

Goals of Data Mining and Knowledge Discovery:

  1. Prediction: Data mining can show how certain attributes within the data will behave in the future.
  2. Identification: Data mining can be used to identify the existence of an item.
  3. Classification: Data mining can partition the data so that different classes or categories can be identified.
  4. Optimization: Data mining can be used to optimize the use of limited resources such as time, space, money or materials to maximize the output.

What is OLTP (Online Transaction Processing)?

In order to understand OLTP, it is very important to be aware about Transaction and transaction system. So, what is a transaction? What are the properties of transaction system? Let’s analyze the theory of transactions and then we will cover OLTP.

Transaction and Transaction System:

A transaction is nothing but an interaction between different users or different systems or between a user and a system.

Transaction systems: Every organization needs some online application system to handle its day-to-day activities. Some examples of transaction systems are: salary processing, library management, banking, and airline reservation.


Transaction Properties:

Every transaction follows the ACID properties. This is an important section, and government exams pick multiple questions from it.


Atomicity: This means a transaction should either completely succeed or completely fail.

Consistency: A transaction must preserve database consistency. A transaction must transform the database from one consistent state to another.

Isolation: This simply means transaction of one user should not interfere with the transactions of some other user in the database.

Durability: Once a transaction is complete (i.e. committed), its changes should be permanently written to the database and visible to all transactions that follow it.
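The atomicity property can be demonstrated with Python's sqlite3 module; the accounts table, names and balances below are made up for illustration.

```python
import sqlite3

# Atomicity: if any statement in the transaction fails, rolling back
# leaves the database exactly as it was before the transaction began.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
con.commit()

try:
    con.execute("UPDATE accounts SET balance = balance - 30 "
                "WHERE name = 'alice'")
    # This violates the PRIMARY KEY constraint and raises an error...
    con.execute("INSERT INTO accounts VALUES ('alice', 0)")
    con.commit()
except sqlite3.IntegrityError:
    con.rollback()  # ...so the whole transaction is undone

# Alice's balance is unchanged: the partial update did not survive.
print(con.execute("SELECT balance FROM accounts "
                  "WHERE name = 'alice'").fetchone())  # (100,)
```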

I hope the ACID properties are clear to you guys. Please let me know if you need more information on this with examples.

Ever wondered how multiple transaction of different users can be processed simultaneously?? If yes check the below magic:

Concurrency: Concurrency allows two different independent processes to run simultaneously, creating parallelism. This is what exploits the fast processing speed of computers.

What is Deadlock?

Deadlock is a situation where one transaction is waiting for another transaction to release a resource it needs, and vice versa. Each transaction waits forever for the other to release its resource, and the system halts.

How to prevent Deadlock?

The simple rule to resolve a deadlock is: if a deadlock occurs, one of the participating transactions must be rolled back so the other can proceed. There are different kinds of schemes available to decide which transaction should be rolled back.

This decision depends on multiple factors given as following:

  1. The run time of the transaction.
  2. Data already updated by the transaction
  3. Data remaining to be updated by the transaction
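The simplest way to prevent (rather than resolve) deadlock is to make every transaction acquire its resources in the same global order, so a circular wait can never form. A sketch in Python, using threads and made-up locks as stand-ins for transactions and resources:

```python
import threading

# Two resources protected by locks. If every "transaction" takes lock_a
# before lock_b, neither can hold one lock while waiting for the other,
# so no deadlock is possible.
lock_a = threading.Lock()
lock_b = threading.Lock()

def transaction(name, results):
    with lock_a:          # consistent global ordering: a first...
        with lock_b:      # ...then b
            results.append(name)

results = []
threads = [threading.Thread(target=transaction, args=(i, results))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both transactions completed, in some order
```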

I have tried to cover this section completely friends. Learn these concepts about data science and you will be able to solve each and every question that is related to the data mining section.

In order to master this section, please check my next post of the Previous Year questions of Data Mining section.

Not sure about Computer Networking concepts? Need to score good marks in Computer network section? If yes, do read my next post on Computer Network and Network Topologies. Till then, C yaa friends 🙂

PWC Interview Questions and Answers for Hadoop

Hola friends 🙂 I am back with another interview experience, this time for PWC. Check out this blog for all PWC interview questions and answers for Hadoop. PWC is a Big 4 accounting firm and a very good company to work with. I met one of my friends this Sunday and he shared his PWC interview experience with me, including all the PWC interview questions and answers for Hadoop programmers.

 


 

PWC Interview Questions and Answers – Test Pattern

The PWC test paper for Hadoop contains multiple sections: MapReduce programming, Hadoop architecture, Unix scripting, SQL queries, Oozie and Sqoop. So, if you are a Hadoop programmer preparing for a Big 4 MNC, read the following questions and get yourself ready.

Let’s analyze the PWC Interview Questions and Answers for Hadoop section wise:

  1. Hadoop Architecture:

You must be very clear about the core concepts of Hadoop architecture to answer the questions in this section. The questions are based on the split count required while saving a file in HDFS, given the provided maximum and minimum split sizes.

This section also involves questions testing your understanding of the NodeManager and NameNode, such as what happens if the NameNode fails. You may also be asked to describe the whole process that runs when a client submits a request to the Hadoop system.
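The split-count arithmetic behind such questions is simple; a sketch, assuming a 128 MB block size (a common HDFS default) and made-up file sizes:

```python
import math

# A file stored in HDFS is cut into ceil(file_size / block_size) splits.
def num_splits(file_size_mb, block_size_mb=128):
    return math.ceil(file_size_mb / block_size_mb)

print(num_splits(300))  # 3 splits: 128 MB + 128 MB + 44 MB
print(num_splits(128))  # 1 split: fits exactly in one block
```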

2. Map-Reduce Programming Paradigm

This section contains questions like how many map and reduce tasks a given SQL query will take. Hence, check out such questions and increase your knowledge. There will be a programming question too, wherein you will be asked to write the code in any of the following languages: Python, Java or Ruby.

Another question asks how Hadoop allocates map tasks and reduce tasks.
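To build intuition for such questions, the classic word-count example below sketches the map, shuffle and reduce phases in plain Python; real Hadoop distributes these phases across nodes.

```python
from collections import defaultdict

# Map: each input line emits (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle: group emitted pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: sum the values of each group.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data", "big hadoop"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 2, 'data': 1, 'hadoop': 1}
```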

3. SQL Query Questions

This section contains questions mainly on joins (inner join, outer join, left join, etc.): writing queries for the desired join and predicting the number of rows returned in the output. Also, there will be questions where you are asked to write a query to find the nth highest or lowest salary of an employee from a given table.
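For the nth highest salary question, one common answer uses ORDER BY with LIMIT and OFFSET; syntax for other databases may differ. A sketch against a made-up table, run through Python's sqlite3 module:

```python
import sqlite3

# Made-up Employee table for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
con.executemany("INSERT INTO Employee VALUES (?, ?)",
                [("John", 50000), ("Nick", 70000), ("Mary", 60000)])

n = 2  # second highest salary
row = con.execute(
    "SELECT DISTINCT salary FROM Employee "
    "ORDER BY salary DESC LIMIT 1 OFFSET ?",
    (n - 1,),
).fetchone()
print(row)  # (60000,)
```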

4. Hive

This section of the PWC interview questions and answers set contains questions like how to send a task to the background, how to bring that task back to the foreground, and later how to kill that particular task.

5. Unix Commands

This section asks questions like how to print the nth line of a CSV file, and commands like printing the nth line where condition = “some value”.
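The classic Unix answers are one-liners like sed -n '5p' file.csv or awk 'NR==5' file.csv; for comparison, a Python sketch using a made-up in-memory CSV:

```python
import csv
import io

# Return the nth row (1-indexed, header counts as row 1) of a CSV stream.
def nth_row(f, n):
    for i, row in enumerate(csv.reader(f), start=1):
        if i == n:
            return row

# Made-up CSV content standing in for a file on disk.
data = io.StringIO("id,city\n1,Delhi\n2,Mumbai\n3,Pune\n")
print(nth_row(data, 3))  # ['2', 'Mumbai']
```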

6. Sqoop and Oozie

This section covers questions on the basic understanding of the commands needed to transfer data between an RDBMS and a Hadoop system, and on inter-cluster Hadoop data transfer.

The Oozie section contains two questions: one on Oozie workflow configuration and a second on the fork and join parameters.

Hope this article helps future aspirants 🙂 Also, check my next article on Dunnhumby interview questions.

Please share your comments below.

Dunnhumby Python Analyst Placement Paper

Hi Friends, today I am sharing with you the Python test pattern of Dunnhumby. Go through the details to learn about the Dunnhumby Python Analyst placement paper pattern, scoring and further interview process. I will also share the details you need to focus on to get selected as a Dunnhumby Analyst. One of my friends appeared for the Analyst profile at Dunnhumby's Gurgaon location. The interview process started early at 9:00 AM.

Why Dunnhumby??

Dunnhumby is a customer science company and values candidates who understand the value the company adds for its customers. Getting a job at Dunnhumby is a stroke of good luck. Many candidates appeared for a test on either R or Python; my friend appeared for the Python test. So, here is the structure of the Python test:

Dunnhumby Placement Paper Pattern

The Dunnhumby Python test contains a total of 15 questions: 4 objective questions and 11 coding questions where you are supposed to write the answers. The questions carried different weightage (2 to 5 marks) and were typically based on the following things:

Dunnhumby Placement Paper Pattern (Analyst) – Python

  1. Understanding of Python lists
  2. Understanding of operators and float values
  3. For loop applied on lists
  4. Reading the CSV files (few tables were provided and you need to write code to store these files as data frames)
  5. Write the code to change the data type of a column to another type
  6. Write the code to shuffle the list values in a significant way
  7. Sort the table values based on a specific column (ascending or descending order)
  8. Select a particular column based on column joins from two tables
  9. Select column values and calculate min, max, average, count, sum
  10. Re-sampling the data of tables
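A few of the listed tasks (reading a CSV, changing a column's type, sorting, and a simple aggregate) can be sketched with only the standard library; the CSV content below is made up for illustration.

```python
import csv
import io

# Task 4: read CSV rows (here from a made-up in-memory file).
raw = io.StringIO("name,age\nJohn,30\nMary,21\nNick,25\n")
rows = list(csv.DictReader(raw))

# Task 5: change the age column's data type from str to int.
for r in rows:
    r["age"] = int(r["age"])

# Task 7: sort the rows by the age column in descending order.
rows.sort(key=lambda r: r["age"], reverse=True)

# Task 9: a simple aggregate over a column.
print([r["name"] for r in rows])    # ['John', 'Nick', 'Mary']
print(max(r["age"] for r in rows))  # 30
```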

Dunnhumby Python Objective Test Paper Scoring

In the Python technical objective questions you have to choose the output of the applied functions and operators. For this, one should have a proper understanding of lists, loops and object referencing. Each objective question contains multiple sub-questions with different score weightage.

The second round will be a case study round with subsequent questions based on it. First of all, do review the case studies already available on the official Dunnhumby website. This will help you a lot in forming the logic and give you an idea of the work Dunnhumby does for its clients. As a result, you will be able to answer the questions asked during this case study interview. Also, wear a smile on your face since nobody likes sad faces :p

That’s all my friends. Best of luck for your interview! 🙂

Please share your interview stories in the comment section. You can also share some other company's interview experience with me at my contact id.

Thanks

 

Big Data Analytics: Why you should learn this

Hi All,

There is a big hype these days about big data analytics. Let's analyze its scope, salary trends, the tools to learn, and why you should learn big data analytics and data science.

Big data is not defined by a fixed size like 15 gigabytes or 30 petabytes. I would say that whenever a data set exceeds an individual's or a firm's storage capacity, or their ability to analyze the data, that data becomes big data.

Big Data is something that you can't deal with using traditional methods. Big data is a large amount of data that you can use to generate knowledge, and to create visualizations that can help a business go from bottom to boom. It helps in finding out the pitfalls and the market trends.

In our daily life we deal with so many data-generating machines around us without realizing it: for example, an ATM machine that generates a large amount of data, or the satellites that create enormous amounts of data. This is why the demand for people who know how to deal with such large amounts of data is increasing. Big data analytics has a vivid scope, and the job demand for it will keep growing. So, what are you waiting for? Keep checking this blog for more information, since I'll be writing about the following topics soon:

  1. The characteristics of Big Data
  2. A strategy for Big Data
  3. What Big Data systems are, and use cases

As far as salaries are concerned, big data salaries are good.