Manage Office 365 Users Passwords using PowerShell
Posted by Thang Le Toan on 10 December 2017 11:25 PM

In this article, we will review how to use PowerShell commands to manage user passwords in an Office 365 environment.

Compared with a "standard" on-premises Active Directory domain, password management for Office 365 users is a little restricted.


For example, password policy parameters such as Enforce password history, Minimum password length, and Password must meet complexity requirements cannot be configured by the Office 365 administrator.

At the current time, the options related to Office 365 user password management are resetting a user's password and setting the maximum number of days before password expiration (the default is 90 days).

So, what options are available for Office 365 user password management?
This article reviews a couple of them. Some can be managed by using the Office 365 web interface, and some tasks only by using PowerShell.

Step 1: Download and install required components

Before we can start a remote PowerShell session to Office 365, we need to download the required cmdlets. An additional prerequisite is to install the Office 365 sign-in assistant.

You can find the required software component using the following links:

1. Microsoft Online Services Sign-In Assistant
You can download the Microsoft Online Services Sign-In Assistant by using the following link:
Microsoft Online Services Sign-In Assistant for IT Professionals RTW

Choose the download option


Choose the right version for your OS (most modern operating systems are 64-bit).


2. Office 365 PowerShell module

You can download the Windows Azure Active Directory Module for Windows PowerShell by using the following link:
Windows Azure Active Directory Module for Windows PowerShell

Click on the link named:  Azure Active Directory Module for Windows PowerShell (64-bit version)


After the installation of the PowerShell cmdlets, we will find a new icon on the desktop named: Microsoft Online Services Module for Windows PowerShell.
(An additional option is to use: Start menu > All Programs > Microsoft Online Services > Microsoft Online Services Module for Windows PowerShell.)




The Microsoft Online Services Module for Windows PowerShell shortcut includes a command that imports the Office 365 cmdlets into the PowerShell console.


Technically, we don't have to use this shortcut. We can manually import the Office 365 cmdlets into the PowerShell console by using the command: Import-Module MSOnline



Step 2: First time configurations

I must admit that the first-time configuration is a bit of a nag, but after the required settings are created, the next time will be easier. The PowerShell remote connection requires the following configuration settings:

1 – Run as Administrator

To be able to change the PowerShell execution policy, we need to run the PowerShell console by using the option: Run as administrator.


Right-click the Microsoft Online Services Module for Windows PowerShell icon and choose the option: Run as administrator.

2 – Setting the PowerShell Execution Policy

PowerShell security policy (“Execution policy”) can be defined by using one of the following options (modes): Restricted, AllSigned, RemoteSigned, and Unrestricted.
(The default mode is: Restricted).
To change the Execution policy open the Microsoft Online Services Module for Windows PowerShell and type the command:

Set-ExecutionPolicy Unrestricted


To execute a PowerShell command, press the ENTER key.

Step 3: Connect to Office 365 by using Remote PowerShell

Open the Microsoft Online Services Module for Windows PowerShell and type (or copy and paste) the connection command.

A pop-up window will appear. Type your credentials using the UPN (User Principal Name) format.

Note – the user account should have global administrator rights in the Office 365 environment.
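The connection command itself was omitted from this copy of the article. A minimal sketch using the MSOnline module imported above:

```powershell
# Prompt for the Office 365 global administrator credentials
$cred = Get-Credential

# Open the remote session to Office 365 (MSOnline module)
Connect-MsolService -Credential $cred
```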

1. Set Password never expired

Set Password never expired for Office 365 user

PowerShell command Syntax

PowerShell command Example

Disable the Password never expired option for an Office 365 user

PowerShell command Syntax

PowerShell command Example

Set Password never expired for ALL Office 365 users (Bulk Mode)

PowerShell command Syntax

Re-enable password expiration (the default) for ALL Office 365 users (Bulk Mode)

PowerShell command Syntax
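The syntax and examples for this section were omitted from this copy. A hedged sketch of the standard MSOnline cmdlets for these four tasks (the UPN user@contoso.com is a placeholder):

```powershell
# Set "password never expires" for a single user
Set-MsolUser -UserPrincipalName "user@contoso.com" -PasswordNeverExpires $true

# Disable the option again for that user
Set-MsolUser -UserPrincipalName "user@contoso.com" -PasswordNeverExpires $false

# Bulk mode: apply to ALL Office 365 users
Get-MsolUser -All | Set-MsolUser -PasswordNeverExpires $true

# Re-enable password expiration (the default) for ALL users
Get-MsolUser -All | Set-MsolUser -PasswordNeverExpires $false
```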

2. Set Password

Set a Predefined Password for Office 365 user

PowerShell command Syntax

PowerShell command Example

Set a Predefined Password for all Office 365 users (Bulk mode)

PowerShell command Syntax

PowerShell command Example
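The commands for this section were stripped from this copy. A sketch using Set-MsolUserPassword (UPN and password values are placeholders):

```powershell
# Set a predefined password for a single user
Set-MsolUserPassword -UserPrincipalName "user@contoso.com" `
    -NewPassword 'Pa$$w0rd2017' -ForceChangePassword $false

# Bulk mode: set the same predefined password for ALL users
Get-MsolUser -All | ForEach-Object {
    Set-MsolUserPassword -UserPrincipalName $_.UserPrincipalName `
        -NewPassword 'Pa$$w0rd2017' -ForceChangePassword $false
}
```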

Set a Predefined Password for Office 365 users imported from a CSV File

Step 1: Export Office 365 users account
PowerShell command Syntax

Step 2: Set a Predefined Password

Example: Step 1: Export Office 365 users account


PowerShell command Example
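The two steps above can be sketched as follows (the CSV path is a placeholder, and the CSV is assumed to contain a UserPrincipalName column):

```powershell
# Step 1: export the Office 365 user accounts (UPN column) to a CSV file
Get-MsolUser -All | Select-Object UserPrincipalName |
    Export-Csv -Path C:\Temp\Users.csv -NoTypeInformation

# Step 2: import the CSV and set a predefined password for each listed user
Import-Csv -Path C:\Temp\Users.csv | ForEach-Object {
    Set-MsolUserPassword -UserPrincipalName $_.UserPrincipalName `
        -NewPassword 'Pa$$w0rd2017' -ForceChangePassword $false
}
```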

Create a new Office 365 user and set a unique temporary password by importing the information from a CSV file


You can download a sample CSV file – Password.csv

PowerShell command Example
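The example command was omitted from this copy. A sketch assuming Password.csv has the columns UserPrincipalName, DisplayName, and Password:

```powershell
# Create each user from the CSV with their own temporary password
Import-Csv -Path C:\Temp\Password.csv | ForEach-Object {
    New-MsolUser -UserPrincipalName $_.UserPrincipalName `
        -DisplayName $_.DisplayName `
        -Password $_.Password `
        -ForceChangePassword $true   # must be changed at first sign-in
}
```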

Provisioning Office 365 user and export information from Active Directory

If you need to export on-premises Active Directory user accounts based on a specific parameter, you can use the PowerShell cmdlet Get-ADUser (you will need to use a PowerShell console on a domain controller, or import the Active Directory cmdlets into the existing PowerShell console).
For example:

Example 1 – display or export all of the Active Directory users located in a specific OU.


In our particular scenario, the domain name is – and the specific OU is – Test

Display information about all of the Active Directory users located in a specific OU

PowerShell command Example

Export to a CSV file information about all of the Active Directory users located in a specific OU, selecting specific data fields.

Example 2 – display and export information about Active Directory users from a specific department.

The PowerShell command syntax is:

An example to a scenario in which we want to export information only about Active Directory users that belong to the marketing department could be
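The concrete commands were stripped from this copy. A sketch of both examples with the ActiveDirectory module (the OU distinguished name, domain, and department value are placeholders):

```powershell
# Run on a domain controller, or import the module into the current console
Import-Module ActiveDirectory

# Example 1: export all users located in a specific OU
Get-ADUser -Filter * -SearchBase "OU=Test,DC=contoso,DC=com" |
    Select-Object Name, UserPrincipalName |
    Export-Csv -Path C:\Temp\OU-Users.csv -NoTypeInformation

# Example 2: display users that belong to the Marketing department
Get-ADUser -Filter 'Department -eq "Marketing"' -Properties Department |
    Select-Object Name, UserPrincipalName, Department
```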

Set a Temporary Password for a specific user

PowerShell command Syntax

PowerShell command Example

Set a Temporary Password for all Office 365 users (Bulk Mode)

PowerShell command Syntax

PowerShell command Example
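A sketch of both temporary-password tasks, again with placeholder UPN and password values:

```powershell
# Temporary password for a specific user: force a change at next sign-in
Set-MsolUserPassword -UserPrincipalName "user@contoso.com" `
    -NewPassword 'TempP@ss2017' -ForceChangePassword $true

# Bulk mode: temporary password for ALL users
Get-MsolUser -All | ForEach-Object {
    Set-MsolUserPassword -UserPrincipalName $_.UserPrincipalName `
        -NewPassword 'TempP@ss2017' -ForceChangePassword $true
}
```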

3. Office 365 Password Policy

Set Office 365 Password Policy

PowerShell command Syntax

PowerShell command Example
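A sketch of the policy command (the domain name and the 180/14-day values are placeholders):

```powershell
# Passwords expire after 180 days; users are notified 14 days in advance
Set-MsolPasswordPolicy -DomainName "contoso.com" `
    -ValidityPeriod 180 -NotificationDays 14
```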

4. Display Password settings

Display Password settings for all Office 365 users

PowerShell command Syntax

Display information about Office 365 Password Policy

PowerShell command Syntax

PowerShell command Example
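A sketch of both display tasks (the domain name is a placeholder):

```powershell
# Per-user setting: is "password never expires" enabled?
Get-MsolUser -All | Select-Object UserPrincipalName, PasswordNeverExpires

# Domain-level policy: validity period and notification window
Get-MsolPasswordPolicy -DomainName "contoso.com"
```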

5. Download Manage password PowerShell menu script


You can read more detailed information about the PowerShell commands that are used in the script in the article: Manage Office 365 Users Passwords using PowerShell | Office 365


Full Solution – Skilltest on R for Data Science
Posted by Thang Le Toan on 26 November 2017 01:32 PM


R is the most commonly used tool in the analytics industry today, though Python is catching up quickly. Many companies that were heavily reliant on SAS have now started using R in their day-to-day analysis. Since R is easy to learn, your proficiency in R can be a massive advantage to your candidature.

This test wasn't designed for freshers, but for people with some knowledge of R. If you've taken it, you might be either disappointed or happy with your performance and keen to know the solutions. As expected, we've compiled the list of questions and answers so that you can learn and improve.

The best way to learn is to solve these questions at your end. You'll learn multiple ways to perform a task in R; in other words, you'll add more weapons to your R armory.

If you don’t understand anything, drop your question in comments!


Overall Results

Below is the distribution of the scores. This will help you evaluate your performance.


Some of the interesting statistics from this competition:

Mean – 20.16

Median – 20

Mode – 0

Range – 49

Standard Deviation – 14.09

95% Confidence Interval – [-7.45,47.77]

Heartiest congratulations to participants who scored 32 and above: they are in the top 25 percent. People scoring more than 40 are in the top 10 percent, and a score of 47 and above puts you in the top 1 percent.

Due to the wide range, the confidence interval doesn't seem very practical mathematically. It looks like many participants didn't take the complete test and left in between.

Since the majority of the questions were fairly easy, if you scored less than 20 you are in an alarming situation: you need to spend more time practicing R.


Helpful Resources on R


Skill Test Questions & Answers

1). Two vectors X and Y are defined as follows – X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z, defined as Z <- X*Y?

A – 3,4,0

B – 3,4,4

C – error

D – 3,4,8

Solution: B

Vector recycling takes place when two vectors of unequal length are multiplied: the shorter vector is reused from its beginning, so the result is 3*1, 2*2, 4*1 = 3, 4, 4 (with a warning, since 3 is not a multiple of 2).
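A quick demonstration of recycling, using the vectors from the question:

```r
# Recycling: the shorter vector is reused from its start
X <- c(3, 2, 4)
Y <- c(1, 2)
X * Y  # 3*1, 2*2, 4*1 -> 3 4 4, with a warning about unequal lengths
```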


2). If you want to know all the values in c (1, 3, 5, 7, 10) that are not in c (1, 5, 10, 12, 14). Which code in R can be used to do this?

A – setdiff(c(1,3,5,7,10),c(1,5,10,12,14))

B – diff(c(1,3,5,7,10),c(1,5,10,12,14))

C – unique(c(1,3,5,7,10),c(1,5,10,12,14))

D – None of the Above.

Solution: A

The setdiff(x, y) function returns the values of x that are not present in y.


3). What is the output of f(2) ?

b <- 4
f <- function(a){
  b <- 3
  b^3 + g(a)
}

g <- function(a){
  a*b
}

A – 33

B – 35

C – 37

D – 31

Solution: B

Inside g, b is not defined locally, so R looks it up in the global environment and finds b <- 4. Inside f, b <- 3 is defined locally, so the local value is used for b^3. For a function, locally available information takes precedence over global information, which gives 3^3 + 2*4 = 35.


4) The data shown below is from a csv file. Which of the following commands can read this csv file as a dataframe into R?

Male 25.5 0
Female 35.6 1
Female 12.03 0
Female 11.30 0
Male 65.46 1


A – read.csv("Table1.csv")

B – read.csv("Table1.csv",header=FALSE)

C – read.table("Table1.csv")

D – read.csv2("Table1.csv",header=FALSE)

Solution: B

Since the table has no headers, it is imperative to specify it in the read.csv command.


5). The missing values in the data shown from a csv file have been represented by ‘?’. Which of the below code will read this csv file correctly into R?

A 10 Sam
B ? Peter
C 30 Harry
D 40 ?
E 50 Mark


A – read.csv("Table2.csv")

B – read.csv("Table2.csv",header=FALSE,"?")

C – read.csv2("Table2.csv",header=FALSE,sep=",",na.strings="?")

D – read.table("Table2.csv")

Solution: C

Since missing values come in many forms, not just the standard NA, it is essential to define which character represents them. na.strings tells read.csv to treat every question mark (?) as a missing value.


6). The table shown below from a csv file has row names as well as column names. This table will be used in the following questions:

Which of the following code can read this csv file properly into R?

  Column 1 Column 2 Column 3
Row 1 15.5 14.12 69.5
Row 2 18.6 56.23 52.4
Row 3 21.4 47.02 63.21
Row 4 36.1 56.63 36.12


A – read.delim("Train3.csv",header=T,sep=",",row.names=1)

B – read.csv2("Train3.csv",header=TRUE,row.names=TRUE)

C – read.table("Train3.csv",header=TRUE,sep=",")

D – read.csv("Train3.csv",row.names=TRUE,header=TRUE,sep=",")

Solution: A

Since the first column has row names, it is important to specify it using row.names while loading data. row.names = 1 says that row names are available in the first column of the table.


7). Which of the following code will fail to read the first two rows of the csv file?

  Column 1 Column 2 Column 3
Row 1 15.5 14.12 69.5
Row 2 18.6 56.23 52.4
Row 3 21.4 47.02 63.21
Row 4 36.1 56.63 36.12


A – read.csv("Table3.csv",header=TRUE,row.names=1,sep=",",nrows=2)

B – read.csv("Table3.csv",row.names=1,nrows=2)

C – read.delim2("Table3.csv",header=T,row.names=1,sep=",",nrows=2)

D – read.table("Table3.csv",header=TRUE,row.names=1,sep=",",skip.last=2)

Solution- D

Except for D, all of the options will successfully read the first two lines of this table. The nrows parameter determines how many rows of a data set should be read; skip.last is not a valid read.table argument.


8). Which of the following code will read only the second and the third column into R?

  Column 1 Column 2 Column 3
Row 1 15.5 14.12 69.5
Row 2 18.6 56.23 52.4
Row 3 21.4 47.02 63.21
Row 4 36.1 56.63 36.12


A – read.table("Table3.csv",header=T,row.names=1,sep=",",colClasses=c("NULL",NA,NA))

B – read.csv("Table3.csv",header=TRUE,row.names=1,sep=",",colClasses=c("NULL","NA","NA"))

C – read.csv("Table3.csv",row.names=1,colClasses=c("Null",na,na))

D – read.csv("Table3.csv",row.names=T,colClasses=TRUE)

Solution: A

You can skip reading a column by setting it to "NULL" in the colClasses parameter while reading data.


9). Below is a data frame which has already been read into R and stored in a variable named dataframe1.

Which of the below code will produce a summary (minimum, quartiles, median, mean, and frequency counts where applicable) of the entire data set in a single line of code?

  V1 V2 V3
1 Male 12.5 46
2 Female 56 135
3 Male 45 698
4 Female 63 12
5 Male 12.36 230
6 Male 25.23 456
7 Female 12 457

Dataframe 1

A – summary(dataframe1)

B – stats(dataframe1)

C – summarize(dataframe1)

D – summarise(dataframe1)

Solution: A

summary() produces, in a single call, the minimum, quartiles, median and mean for numeric columns, and frequency counts for factor columns.



10) dataframe2 has been read into R properly with missing values labelled as NA. This dataframe2 will be used for the following questions:

Which of the following code will return the total number of missing values in the dataframe?

A 10 Sam
B NA Peter
C 30 Harry
D 40 NA
E 50 Mark


A – table(dataframe2==NA)

B – table(is.na(dataframe2))

C – table(hasNA(dataframe2))

D – which(is.na(dataframe2))

Solution: B


11). Which of the following code will not return the number of missing values in each column?

A 10 Sam
B NA Peter
C 30 Harry
D 40 NA
E 50 Mark


A – colSums(is.na(dataframe2))

B – apply(is.na(dataframe2),2,sum)

C – sapply(dataframe2,function(x) sum(is.na(x)))

D – table(is.na(dataframe2))

Solution: D

The rest of the options traverse every column to calculate and return the number of missing values per variable; option D counts TRUE/FALSE values over the whole data frame at once, not per column.
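A small demonstration on a toy data frame (the column names are illustrative):

```r
# Toy data frame with one NA per column
df <- data.frame(a = c(10, NA, 30), b = c("Sam", "Peter", NA))

colSums(is.na(df))                      # per-column NA counts
sapply(df, function(x) sum(is.na(x)))   # same per-column counts
table(is.na(df))                        # overall FALSE/TRUE counts only
```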


12). The data shown below has been loaded into R in a variable named dataframe3. The first row of data represent column names. The powerful data manipulation package ‘dplyr’ has been loaded. This data set will be used in following questions:

Which of the following code can select only the rows for which Gender is Male?

Gender Marital Status Age Dependents
Male Married 50 2
Female Married 45 5
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 2
Female Unmarried 18 0


A – subset(dataframe3, Gender="Male")

B – subset(dataframe3, Gender=="Male")

C – filter(dataframe3,Gender=="Male")

D – Both B and C

Solution: D

The filter function comes from the dplyr package; subset is the base R function. Both do the same job.
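A side-by-side sketch of the two equivalent calls (toy data, illustrative column names):

```r
library(dplyr)

df <- data.frame(Gender = c("Male", "Female", "Male"), Age = c(50, 45, 21))

subset(df, Gender == "Male")   # base R
filter(df, Gender == "Male")   # dplyr equivalent, same rows returned
```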


13). Which of the following code can select the data with married females only?

Gender Marital Status Age Dependents
Male Married 50 2
Female Married 45 5
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 2
Female Unmarried 18 0

dataframe 3

A – subset(dataframe3,Gender=="Female" & `Marital Status`=="Married")

B – filter(dataframe3, Gender=="Female" , `Marital Status`=="Married")

C – Only A

D – Both A and B

Solution: D


14). Which of the following code can select all the rows from Age and Dependents?

Gender Marital Status Age Dependents
Male Married 50 2
Female Married 45 5
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 2
Female Unmarried 18 0


A – subset(dataframe3, select=c("Age","Dependents"))

B – select(dataframe3, Age,Dependents)

C – dataframe3[,c("Age","Dependents")]

D – All of the above

Solution: D

If you got this wrong, refer to the basics of sub-setting a data frame.


15). Which of the following codes will convert the class of the Dependents variable to a factor class?

Gender Marital Status Age Dependents
Male Married 50 2
Female Married 45 5
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 2
Female Unmarried 18 0

Dataframe 3

A – dataframe3$Dependents=as.factor(dataframe3$Dependents)

B – dataframe3[,'Dependents']=as.factor(dataframe3[,'Dependents'])

C – transform(dataframe3,Dependents=as.factor(Dependents))

D – All of the Above

Solution: D

as.factor() is used to coerce class type to factor.


16). Which of the following code can calculate the mean age of Female?

Gender Marital Status Age Dependents
Male Married 50 2
Female Married 45 5
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 2
Female Unmarried 18 0


A – dataframe3%>%filter(Gender=="Female")%>%summarise(mean(Age))

B – mean(dataframe3$Age[which(dataframe3$Gender=="Female")])

C – mean(dataframe3$Age,dataframe3$Female)

D – Both A and B

Solution: D

Option A describes the method using dplyr package. Option B uses the base functions to accomplish this task.


17). The data shown below has been read into R and stored in a dataframe named dataframe4. It is given that Has_Dependents column is read as a factor variable. We wish to convert this variable to numeric class. Which code will help us achieve this?

Gender Marital Status Age Has_Dependents
Male Married 50 0
Female Married 45 1
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 1
Female Unmarried 18 0


A – dataframe4$Has_Dependents=as.numeric(dataframe4$Has_Dependents)

B – dataframe4[,"Has_Dependents"]=as.numeric(as.character(dataframe4$Has_Dependents))

C – transform(dataframe4,Has_Dependents=as.numeric(Has_Dependents))

D – All of the above

Solution: B

as.numeric() applied directly to a factor returns the underlying integer level codes, not the labels. Converting to character first, as in option B, preserves the actual values 0 and 1.


18). There are two dataframes stored in two respective variables named Dataframe1 and Dataframe2.

Feature1 Feature2 Feature3
A 1000 25.5
B 2000 35.5
C 3000 45.5
D 4000 55.5
Feature1 Feature2 Feature3
E 5000 65.5
F 6000 75.5
G 7000 85.5
H 8000 95.5

Which of the following codes will produce the output as shown below?

Feature1 Feature2 Feature3
A 1000 25.5
B 2000 35.5
C 3000 45.5
D 4000 55.5
E 5000 65.5
F 6000 75.5
G 7000 85.5
H 8000 95.5

A – merge(dataframe1,dataframe2,all=TRUE)

B – merge(dataframe1,dataframe2)

C – merge(dataframe1,dataframe2,by=intersect(names(x),names(y)))

D – None of the above

Solution: A

The parameter all=TRUE performs a full outer join: it merges both data sets, and where no match is found for a particular observation, NA is returned.
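A minimal demonstration of the full outer join behaviour (small stand-ins for the two data frames):

```r
d1 <- data.frame(Feature1 = c("A", "B"), Feature2 = c(1000, 2000))
d2 <- data.frame(Feature1 = c("E", "F"), Feature2 = c(5000, 6000))

# No keys overlap, so all=TRUE keeps all four rows (stacked result)
merge(d1, d2, all = TRUE)
```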


19). Which of the following codes will create a new column named Size(MB) from the existing Size(KB) column? The dataframe is stored in a variable named dataframe5. Given 1MB = 1024KB

Package Name Creator Size(kB)
Swirl Sean Kross 2568
Ggplot Hadley Wickham 5463
Dplyr Hadley Wickham 8961
Lattice Deepayan Sarkar 3785


A – dataframe5$`Size(MB)`=dataframe5$`Size(KB)`/1024

B – dataframe5$`Size(KB)`=dataframe5$`Size(KB)`/1024

C – dataframe5%>%mutate(`Size(MB)`=`Size(KB)`/1024)

D – Both A and C

Solution: D


20). Following question will use the dataframe shown below:

Gender Marital Status Age Has Dependents
Male Married 50 0
Female Married 45 1
Female Unmarried 25 0
Male Unmarried 21 0
Male Unmarried 26 1
Female Married 30 1
Female Unmarried 18 0


Certain Algorithms like XGBOOST work only with numerical data. In that case, categorical variables present in dataset are converted to DUMMY variables which represent the presence or absence of a level of a categorical variable in the dataset. From Dataframe6, after creating the dummy variable for Gender, the dataset looks like below.

Gender_Male Gender_Female Marital Status Age Has Dependents
1 0 Married 50 0
0 1 Married 45 1
0 1 Unmarried 25 0
1 0 Unmarried 21 0
1 0 Unmarried 26 1
0 1 Married 30 1
0 1 Unmarried 18 0

Which of the following commands would have helped us to achieve this?

A – dummies::dummy.data.frame(dataframe6,names=c("Gender"))

B – dataframe6[,"Gender"] <- split(dataframe6$Gender, ifelse(dataframe6$Gender == "Male",0,1))

C – contrasts(dataframe6$Gender) <- contr.treatment(2)

D – None of the above

Solution: A

For option A, install and load the dummies package. With its simple syntax, one-hot encoding in R has never been easier.


21). We wish to calculate the correlation between column 2 and column 3. Which of the below codes will achieve the purpose?

  Column1 Column2 Column3 Column4 Column5 Column6
Name1 Male 12 24 54 0 Alpha
Name2 Female 16 32 51 1 Beta
Name3 Male 52 104 32 0 Gamma
Name4 Female 36 72 84 1 Delta
Name5 Female 45 90 32 0 Phi
Name6 Male 12 24 12 0 Zeta
Name7 Female 32 64 64 1 Sigma
Name8 Male 42 84 54 0 Mu
Name9 Male 56 112 31 1 Eta

Dataframe 7

A – cor(dataframe7$column2,dataframe7$column3)

B – (cov(dataframe7$column2,dataframe7$column3))/(sd(dataframe7$column4)*sd(dataframe7$column3))

C – (cov(dataframe7$column2,dataframe7$column3))/(var(dataframe7$column4)*var(dataframe7$column3))

D – All of the above

Solution: A

cor is the base function used to calculate correlation between two numerical variables.


22). Column 3 has 2 missing values represented as NA in the dataframe below stored in the variable named dataframe8. We wish to impute the missing values using the mean of the column 3. Which code will help us do that?

  Column1 Column2 Column3 Column4 Column5 Column6
Name1 Male 12 24 54 0 Alpha
Name2 Female 16 32 51 1 Beta
Name3 Male 52 104 32 0 Gamma
Name4 Female 36 72 84 1 Delta
Name5 Female 45 NA 32 0 Phi
Name6 Male 12 24 12 0 Zeta
Name7 Female 32 NA 64 1 Sigma
Name8 Male 42 84 54 0 Mu
Name9 Male 56 112 31 1 Eta

Dataframe 8

A – dataframe8$Column3[which(dataframe8$Column3==NA)]=mean(dataframe8$Column3)

B – dataframe8$Column3[which(is.na(dataframe8$Column3))]=mean(dataframe8$Column3)

C – dataframe8$Column3[which(is.na(dataframe8$Column3))]=mean(dataframe8$Column3,na.rm=TRUE)

D – dataframe8$Column3[which(is.na(dataframe8$Column3))]=mean(dataframe8$Column3,na.rm=FALSE)

Solution: C

na.rm=TRUE tells mean() to ignore the missing values, so the imputed value is the mean of all available observations; without it, the mean itself would be NA.
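The imputation pattern in miniature:

```r
x <- c(12, 16, NA, 36, NA)

# Replace every NA with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
x   # the NA slots now hold mean(12, 16, 36)
```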


23). Column7 contains some names with the salutations. In such cases, it is always advisable to extract salutations in a new column since they can provide more information to our predictive model.  Your work is to choose the code that cannot extract the salutations out of names in Column7 and store the salutations in Column8.

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24 54 0 Alpha Mr.Sam
Name2 Female 16 32 51 1 Beta Ms.Lilly
Name3 Male 52 104 32 0 Gamma Mr.Mark
Name4 Female 36 72 84 1 Delta Ms.Shae
Name5 Female 45 NA 32 0 Phi Ms.Ria
Name6 Male 12 24 12 0 Zeta Mr.Patrick
Name7 Female 32 NA 64 1 Sigma Ms.Rose
Name8 Male 42 84 54 0 Mu Mr.Peter
Name9 Male 56 112 31 1 Eta Mr.Roose

Dataframe 9

A – dataframe9$Column8<-sapply(strsplit(as.character(dataframe9$Column7),split = "[.]"),function(x){x[1]})

B – dataframe9$Column8<-sapply(strsplit(as.character(dataframe9$Column7),split = "."),function(x){x[1]})

C – dataframe9$Column8<-sapply(strsplit(as.character(dataframe9$Column7),split = ".",fixed=TRUE),function(x){x[1]})

D – dataframe9$Column8<-unlist(strsplit(as.character(dataframe9$Column7),split = ".",fixed=TRUE))[seq(1,18,2)]

Solution: B

strsplit is used to split a text variable based on a splitting criterion. In option B, split = "." is treated as a regular expression in which the dot matches any character, so the split returns only empty strings; options A and C treat the dot literally.
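The difference between the literal and regular-expression dot, on two sample names:

```r
names <- c("Mr.Sam", "Ms.Lilly")

# Literal dot (character class or fixed=TRUE): salutation extracted
sapply(strsplit(names, split = "[.]"), function(x) x[1])
sapply(strsplit(names, split = ".", fixed = TRUE), function(x) x[1])

# Regex dot matches every character, so each split yields empty strings
sapply(strsplit(names, split = "."), function(x) x[1])
```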


24). Column 3 in the data frame shown below is supposed to contain dates in ddmmyyyy format but as you can see, there is some problem with its format. Which of the following code can convert the values present in Column 3 into date format?

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose

Dataframe 10

A – as.Date(as.character(dataframe10$Column3),format="%d%m%Y")

B – as.Date(dataframe10$Column3,format="%d%m%Y")

C – as.Date(as.character(dataframe10$Column3),format="%d%m%y")

D – as.Date(as.character(dataframe10$column3),format="%d%B%Y")

Solution: A

as.Date() expects character input, so the numeric Column3 must first pass through as.character(); the format %d%m%Y matches the ddmmyyyy layout with a four-digit year.


25). Some algorithms work very well with normalized data. Your task is to convert the Column2 in the dataframe shown below into a normalised one. Which of the following code would not achieve that? The normalised column should be stored in a column named column8.

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose

dataframe 11

A – dataframe11$Column8<-(dataframe11$Column2-mean(dataframe11$column2))/sd(dataframe11$Column2)

B – dataframe11$Column8<-scale(dataframe11$Column2)

C – All of the above

Solution: C

Option A is simply the mathematical formula for standardization, i.e. (x − μ)/σ, which is also what scale() computes by default.


26). dataframe12 is the output of a certain task. We wish to save this dataframe into a csv file named “result.csv”. Which of the following commands would help us accomplish this task?

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose

dataframe 12

A – write.csv("result.csv", dataframe12)

B – write.csv(dataframe12,"result.csv", row.names = FALSE)

C – write.csv(file="result.csv",x=dataframe12,row.names = FALSE)

D – Both B and C.

Solution: D

write.csv takes x as its first argument and file as its second, so both the positional form in option B and the named form in option C work.


27) y=seq(1,1000,by=0.5)

What is the length of vector y ?

A – 2000

B – 1000

C – 1999

D – 1998

Solution: C

The sequence runs from 1 to 1000 in steps of 0.5, so its length is (1000 − 1)/0.5 + 1 = 1999.


28). The dataset has been stored in a variable named dataframe13. We wish to see the location of all those persons who have “Ms” in their names stored in Column7. Which of the following code will not help us achieve that?

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose


A – grep(pattern="Ms",x=dataframe13$Column7)

B – grep(pattern="ms",x=dataframe13$Column7,ignore.case=TRUE)

C – grep(pattern="Ms",x=dataframe13$Column7,fixed=T)

D – grep(pattern="ms",x=dataframe13$Column7,ignore.case=TRUE,fixed=T)

Solution- D

In option D, fixed=TRUE causes the case-insensitivity setting to be ignored (with a warning), so the lowercase pattern "ms" is matched literally and never matches "Ms".


29). The data below has been stored in a variable named dataframe14. We wish to find and replace all the instances of Male in Column1 with Man. Which of the following code will not  help us do that?

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose

dataframe 14

A – sub("Male","Man",dataframe14$Column1)

B – gsub("Male","Man",dataframe14$Column1)

C – dataframe14$Column1[which(dataframe14$Column1=="Male")]="Man"

D – None of the above.

Solution: D

Try running these codes at your end. Every option will do this task gracefully.


30) Which of the following command will display the classes of each column for the following dataframe ?

  Column1 Column2 Column3 Column4 Column5 Column6 Column7
Name1 Male 12 24081997 54 0 Alpha Mr.Sam
Name2 Female 16 30062001 51 1 Beta Ms.Lilly
Name3 Male 52 10041998 32 0 Gamma Mr.Mark
Name4 Female 36 17021947 84 1 Delta Ms.Shae
Name5 Female 45 15031965 32 0 Phi Ms.Ria
Name6 Male 12 24111989 12 0 Zeta Mr.Patrick
Name7 Female 32 26052015 64 1 Sigma Ms.Rose
Name8 Male 42 18041999 54 0 Mu Mr.Peter
Name9 Male 56 11021994 31 1 Eta Mr.Roose

A – lapply(dataframe,class)

B – sapply(dataframe,class)

C – Both A and B

D – None of the above

Solution: C

The only difference in the answer of lapply and sapply is that lapply will return a list and sapply will return a vector/matrix.
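The two return shapes, demonstrated on a toy data frame:

```r
df <- data.frame(g = c("M", "F"), age = c(25, 30))

lapply(df, class)  # named list of classes, one element per column
sapply(df, class)  # same classes simplified to a named character vector
```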


31). The questions below deal with the tidyr package, which forms an important part of the data cleaning task in R.

Which of the following command will combine Male and Female column into a single column named Sex and create another variable named Count as the count of male or female per Name.

Name Male Female
A 1 6
B 5 9

Initial dataframe

Name Sex Count
A Male 1
B Male 5
A Female 6
B Female 9

Final dataframe

A – collect(dataframe,Male:Female,Sex,Count)

B – gather(dataframe,Sex,Count,-Name)

C – gather(dataframe,Sex,Count)

D – collect(dataframe,Male:Female,Sex,Count,-Name)

Solution: B

gather(dataframe, Sex, Count, -Name) gathers every column except Name into key-value pairs: Sex receives the old column names (Male, Female) and Count their values.


32). The dataframe below contains one category of messy data where multiple columns are stacked into one column which is highly undesirable.

Sex_Class Count
Male_1 1
Male_2 2
Female_1 3
Female_2 4

Which of the following code will convert the above dataframe to the dataframe below ? The dataframe is stored in a variable named dataframe.

Sex Class Count
Male 1 1
Male 2 2
Female 1 3
Female 2 4

A – separate(dataframe,Sex_Class,c("Sex","Class"))

B – split(dataframe,Sex_Class,c(“Sex”,”Class”))

C – disjoint(dataframe,Sex_Class,c(“Sex”,”Class”))

D – None of the above

Solution: A


33). The dataset below suffers from a problem where variables “Term” and “Grade” are stored in separate columns which can be displayed more effectively. We wish to convert the structure of these variables into each separate variable named Mid and Final.

Name Class Term Grade
Alaska 1 Mid A
Alaska 1 Final B
Banta 2 Mid A
Banta 2 Final A

Which of the following code will convert the above dataset into the one showed below? The dataframe is stored in a variable named dataframe.

Name Class Mid Final
Alaska 1 A B
Banta 2 A A

A – melt(dataframe, Term, Mid,Final,Grade)

B – transform(dataframe,unique(Term),Grade)

C – spread(dataframe,Term,Grade)

D – None of the above

Solution: C
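A runnable sketch of spread() on this dataset, assuming tidyr is installed:

```r
library(tidyr)

dataframe <- data.frame(Name  = c("Alaska", "Alaska", "Banta", "Banta"),
                        Class = c(1, 1, 2, 2),
                        Term  = c("Mid", "Final", "Mid", "Final"),
                        Grade = c("A", "B", "A", "A"))

# Each unique value of Term becomes its own column, filled with Grade
wide <- spread(dataframe, Term, Grade)
wide
```

spread() is the inverse of gather(): the key column's values become column names.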


34). The ________ function takes an arbitrary number of arguments and concatenates them one by one into character strings.

A – copy()

B – paste()

C – bind()

D – None of the above.

Solution: B
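A small base-R illustration of paste() concatenating arguments into character strings:

```r
# paste() concatenates its arguments, separated by sep (a single space by default)
greeting <- paste("Hello", "world", sep = ", ")    # "Hello, world"

# It is vectorised, which makes it handy for building label vectors
labels <- paste("sample", 1:3, sep = "_")          # "sample_1" "sample_2" "sample_3"
```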


35). Point out the correct statement:

A – Character strings are entered using either matching double (“) or single (‘) quote.

B – Character vectors may be concatenated into a vector by the c() function.

C – Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets.

D – All of the above

Solution: D

36) What will be the output of the following code?

> x <- 1:3
> y <- 10:12
> rbind(x, y)

1 –
     [,1] [,2] [,3]
x       1    2    3
y      10   11   12

2 –
     [,1] [,2] [,3]
x       1    2    3
y      10   11

3 –
     [,1] [,2] [,3]
x       1    2    3
y       4    5    6

4 –  All of the above

Solution: A
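The snippet can be run directly to confirm the row-binding behaviour:

```r
x <- 1:3
y <- 10:12

# rbind() stacks the vectors as rows; the vector names become row names
m <- rbind(x, y)
m
```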


37). Which of the following methods makes a vector of repeated values?

A – rep()

B – data()

C – view()

D – None of the above

Solution: A
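rep() in its common forms, for reference:

```r
a <- rep(1, times = 5)        # 1 1 1 1 1
b <- rep(c(1, 2), times = 3)  # 1 2 1 2 1 2  (whole vector repeated)
c <- rep(c(1, 2), each = 2)   # 1 1 2 2      (each element repeated)
```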


38). Which of the following finds the position of a quantile in a dataset?

A – quantile()

B – barplot()

C – barchart()

D – None of the Above

Solution: A


39) Which of the following functions cross-tabulates tables using formulas?

A – table()

B – stem()

C – xtabs()

D – All of the above

Solution: C


40) What is the output of the following function?

> f <- function(num = 1) {
+         hello <- "Hello, world!\n"
+         for(i in seq_len(num)) {
+                 cat(hello)
+         }
+         chars <- nchar(hello) * num
+         invisible(chars)
+ }
> f()

A – Hello, world!


B – Hello, world!\n


C – Hello, world!


D – Hello, world!\n


Solution: A


41- Which value is missing from the output of the quantile function on a numeric vector, compared to the output of the summary function on the same vector?

A – Median

B – Mean

C – Maximum

D – Minimum

Solution: B
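Comparing the two outputs side by side makes the missing value obvious:

```r
vec <- c(2, 4, 6, 8, 100)

summary(vec)    # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
quantile(vec)   # 0%, 25%, 50%, 75%, 100% -- no Mean
```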



42- Which of the following commands will plot a blue boxplot of a numeric vector named vec?

A – boxplot(vec,col=”blue”)

B – boxplot(vec,color=”blue”)

C – boxplot(vec,color=”BLUE”)

D – None of the above

Solution: A


43- Which of the following commands will create a histogram with 100 buckets of data?

A – hist(vec,buckets=100)

B – hist(vec,into=100)

C – hist(vec,breaks=100)

D – None of the above

Solution: C
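A quick sketch with simulated data; note that breaks is a suggestion, and R may adjust it to "pretty" cut points:

```r
set.seed(1)
vec <- rnorm(1000)

# breaks = 100 suggests ~100 cells; hist() returns the bin structure invisibly
h <- hist(vec, breaks = 100, col = "grey", main = "vec with ~100 buckets")
length(h$counts)
```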


44- What does the “main” parameter in the barplot command do?

A – x axis label

B – Title of the graph

C – I can’t tell

D – y axis label

Solution: B


45- The below dataframe is stored in a variable named sam:

A  B
12 East
15 West
13 East
15 East
14 West

We wish to create a boxplot of A for each level of B in a single line of code, i.e. a total of two boxplots (one for East and one for West). Which of the following commands will achieve this purpose?

A – boxplot(A~B,data=sam)

B – boxplot(A,B,data=sam)

C – boxplot(A|B,data=sam)

D – None of the above

Solution: A
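The formula interface in action on the same data:

```r
sam <- data.frame(A = c(12, 15, 13, 15, 14),
                  B = c("East", "West", "East", "East", "West"))

# The formula A ~ B draws one box of A-values per level of B
boxplot(A ~ B, data = sam, col = "lightblue")
```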


46- Which of the following commands will split the plotting window into 3 x 4 windows, where the plots enter the window row-wise?

A – par(split=c(3,4))

B – par(mfcol=c(3,4))

C – par(mfrow=c(3,4))

D – par(col=c(3,4))

Solution – C
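A sketch of the row-wise layout; the panel contents here are arbitrary, only the grid matters:

```r
# mfrow fills the 3 x 4 grid row-wise (mfcol would fill it column-wise)
old <- par(mfrow = c(3, 4))
for (i in 1:12) plot(1:10, (1:10) * i, type = "l", main = paste("panel", i))
par(old)  # restore the previous layout
```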


47- A dataframe named frame contains two numerical columns named A and B. Which of the following commands will draw a scatter plot between the two columns of the dataframe?

A – with(frame,plot(A,B))

B – plot(frame$A,frame$B)

C – ggplot(data = frame, aes(A,B))+geom_point()

D – All of the above

Solution: D


48- The dataframe below is stored in a variable named frame.

A  B  C
15 42 East
11 31 West
56 54 East
45 63 East
12 26 West

Which of the following commands will draw a scatter plot between A and B, differentiated by the colour of C, like the one below?

(Figure: scatter plot of A against B, points coloured by the value of C)

A – plot(frame$A,frame$B,col=frame$C)

B – with(frame,plot(A,B,col=C)

C – Both A and B

D- None of the above.

Solution: A
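A runnable sketch of colouring a base-R scatter plot by a factor:

```r
frame <- data.frame(A = c(15, 11, 56, 45, 12),
                    B = c(42, 31, 54, 63, 26),
                    C = c("East", "West", "East", "East", "West"))

# Passing a factor to col assigns one palette colour per level of C
plot(frame$A, frame$B, col = factor(frame$C), pch = 19)
```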


49- Which of the following does not apply to R’s base plotting system?

A – You can easily go back once the plot has started (e.g. to change margins)

B – It is convenient and mirrors how we think of building plots and analysing data

C – starts with plot(or similar) function

D – Use annotation functions to add/modify (text, lines etc)

Solution: A


The following questions revolve around the ggplot2 package, which is the most widely used plotting package in the R community and provides great customisation and flexibility over plotting.

50- Which of the following function is used to create plots in ggplot2 ?

A – qplot

B – gplot

C – plot

D – xyplot

Solution: A


51- What is true regarding the relation between the number of plots drawn by facet_wrap and facet_grid?

A – facet_wrap > facet_grid

B – facet_wrap < facet_grid

C – facet_wrap <= facet_grid

D – None of the above

Solution: C


52- Which function in ggplot2 allows the coordinates to be flipped? (i.e. x becomes y and vice-versa)

A – coordinate_flip

B – coord_flip

C – coordinate_rotate

D – coord_rotate

Solution: B


53- The below dataset is stored in a variable called frame.

A     B
alpha 100
beta  120
gamma 80
delta 110

Which of the following commands will create a bar plot for the above dataset, with the values in column B being the heights of the bars?

A – ggplot(frame,aes(A,B))+geom_bar(stat=”identity”)

B – ggplot(frame,aes(A,B))+geom_bar(stat=”bin”)

C – ggplot(frame,aes(A,B))+geom_bar()

D – None of the above

Solution: A
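A sketch of why stat = "identity" matters, assuming ggplot2 is installed:

```r
library(ggplot2)

frame <- data.frame(A = c("alpha", "beta", "gamma", "delta"),
                    B = c(100, 120, 80, 110))

# stat = "identity" uses B directly as the bar height; the default
# counting stat would instead tally rows per value of A
p <- ggplot(frame, aes(A, B)) + geom_bar(stat = "identity")
```

Printing `p` renders the chart; building the object alone does not draw anything.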


54- The following dataframe is stored in a variable named frame and is a subset of a very popular dataset named mtcars.


We wish to create a stacked bar chart for the cyl variable, with the stacking criterion being the vs variable. Which of the following commands will help us do this?

A – qplot(factor(cyl),data=frame,geom=”bar”,fill=factor(vs))

B – ggplot(mtcars,aes(factor(cyl),fill=factor(vs)))+geom_bar()

C – All of the above

D – None of the above

Solution: C


55 – The question is the same as above, except that you have to create a dodged bar chart instead of a stacked one. Which of the following commands will help us do that?

A – qplot(factor(cyl),data=frame,geom=”bar”,fill=factor(vs),position=”dodge”)

B – ggplot(mtcars,aes(factor(cyl),fill=factor(vs)))+geom_bar(position=”dodge”)

C – All of the above

D – None of the above

Solution: B


End Notes

I hope you had fun participating in the assessment challenge and reading this article. We tried to answer all your queries, but if we still haven’t cleared all your doubts, feel free to post your questions in the comments below.

Read more »

Free Must Read Books on Statistics & Mathematics for Data Science
Posted by Thang Le Toan on 26 November 2017 01:29 PM


The selection process of data scientists at Google gives higher priority to candidates with strong background in statistics and mathematics. Not just Google, other top companies (Amazon, Airbnb, Uber etc) in the world also prefer candidates with strong fundamentals rather than mere know-how in data science.

If you too aspire to work for such top companies in the future, it is essential for you to develop a mathematical understanding of data science. Data science is simply the evolved version of statistics and mathematics, combined with programming and business logic. I’ve met many data scientists who struggle to explain predictive models statistically.

More than just deriving accuracy, understanding & interpreting every metric and calculation behind that accuracy is important. Remember, every single ‘variable’ has a story to tell. So, if nothing else, try to become a great story explorer!

In this article, I’ve compiled a list of must-read books on statistics and mathematics. I understand, mathematics has no limits. Hence, I’ve listed only those books which will help you connect with data science better.

Note: Books which have been made freely accessible by their rights holders are linked for download in this article. If not, a link to the Amazon bookstore is provided.




21. Introduction to Statistical Learning

This is a highly recommended book for practicing data scientists. The focus of this book is on connecting statistical concepts with machine learning. Hence, you’ll learn about all popular supervised and unsupervised machine learning algorithms. R users will get an advantage, since the practical aspects of the algorithms have been demonstrated using R. In addition to theory, this book also lays emphasis on using ML algorithms in real-life settings.

Available: Free Download



22. Elements of Statistical Learning

This book is an advanced sequel to the previous book. It is written by Trevor Hastie and Rob Tibshirani, Professors at Stanford University. Their first book, ‘Introduction to Statistical Learning’, uncovers the basics of statistics and machine learning. This book will introduce you to higher-level algorithms such as Neural Networks, Bagging & Boosting, Kernel methods etc. The algorithms have been implemented in R.

Available: Free Download



23. Think Stats

The author of this book is Allen B. Downey. It is about performing statistical analysis practically in Python. Hence, make sure you’ve got some basic knowledge of Python before buying this book. It focuses entirely on understanding the real-life influence of statistics using popular case studies. Since stats and math are closely connected, it also has dedicated chapters on topics like Bayesian estimation.

Available: Buy from Amazon



24. From Algorithms to Z-Scores

Did you know about the crucial role of statistics in programming? The author of this book is Norm Matloff, Professor, University of California. This book explains the use of probabilistic concepts and statistical measures in R. Again, a good practice source for R users. It teaches the art of dealing with probabilistic models and choosing the best one for final evaluation. It is a highly recommended book (especially for R users).

Available: Free Download



25. Introduction to Bayesian Statistics

This is a highly recommended book for freshers in data science. The author of this book is William M Bolstad. It’s a must read for people who find mathematics boring. Having been written in a conversational style (rare to find math this way), this book is a great introductory resource on statistics. It begins with scientific methods of data gathering and ends with dedicated chapters on Bayesian statistics.

Available: Free Download



26. Discovering Statistics using R

This book is written by Andy Field, Jeremy Miles and Zoe Field. I would highly recommend this book to newbies in data science. To start with statistics, this book has great content which goes into its topics in depth. Moreover, the statistical concepts are explained in conjunction with R, which makes it even more useful. It offers a step by step understanding, with a parallel support of interesting practice examples.

Available: Buy on Amazon




27. Introduction to Linear Algebra

This is one of the most recommended books on Linear Algebra. The author of this book is Gilbert Strang, Professor, MIT. Gilbert’s unique way of delivering knowledge gives you the intuition and excitement to move forward after every chapter. This book will help you build a strong mathematical foundation for machine learning. It covers all the necessary chapters such as vectors, linear equations, determinants, eigenvalues, matrix factorization etc. in great depth.

Available: Buy on Amazon



28. Matrix Computation

Matrices and data frames are essential components of machine learning. The authors of this book are Gene H. Golub and Charles F. Van Loan. This book provides a nice head start for students on the concepts of matrix computations. The authors cover most of the important topics such as Gaussian elimination, matrix factorization, the Lanczos method, error analysis etc. Every chapter is supported by intuitive practice problems. The pseudo-codes are available in MATLAB.

Available: Free Download



29. A Probabilistic Theory of Pattern Recognition

This is a complete resource for learning the application of mathematics. It is a must read book for intermediate and advanced practitioners in machine learning. This book is written by Luc Devroye, Laszlo Gyorfi and Gabor Lugosi. It covers a wide range of topics varying from Bayes error and linear discrimination to epsilon entropy & neural networks. It provides a convincing explanation of complex theorems, with section-wise practice problems.

Available: Free Download



30. Introduction to the Math of Neural Networks

If you have an innate interest in learning about neural networks, this should be your place to start. The author of this book is Jeff Heaton. The author has beautifully simplified the difficult concepts of neural networks. This book introduces you to the basics of the underlying maths in neural networks. It assumes the reader has prior knowledge of algebra, calculus and programming. It demonstrates various mathematical tools which can be applied to neural networks.

Available: Buy on Amazon


31. Advanced Engineering Mathematics

This is probably the most comprehensive book available on mathematics for machine learning users. The author of this book is Erwin Kreyszig. As a matter of fact, this book is highly recommended to college students as well. If you haven’t been good at maths till now, follow this book religiously and you should surely see significant improvements in your understanding of math. Along with derivations & practice examples, this book has dedicated sections on calculus, algebra, probability etc. Definitely a must read book for all levels of practitioners in data science.

Available: Free Download


Cookbook on Probability and Statistics

This cookbook is a must-have on your digital bookshelf. It isn’t exactly a textbook, but a quick digital guide to mathematical equations. The author of this book is Matthias Vallentin. After you finish with the essentials of mathematics, this book will help you quickly connect various theorems and algorithms with their formulae. Since it’s difficult to derive equations instantly, this book helps you navigate quickly to your desired problem and solve it.

Available: Free Download


Additional Resources

Bored of reading too much? Here is a list of highly recommended video tutorials / resources on mathematics and statistics. They are FREE to access.

  1. Complete Course on Linear Algebra by MIT
  2. Complete Course on Multivariable Calculus by MIT
  3. Statistical Learning by Stanford University
  4. Mathematics at Khan Academy
  5. Full Cheatsheet on Probability


End Notes

The books listed in this article were selected on the basis of their reviews and the depth of topics covered. This is not an exhaustive list of books. But I found it’s almost too easy to get confused while deciding ‘where to begin?’. In such situations, it is advisable to start with this list.

In this article, I’ve listed some of the most helpful books on statistics and machine learning. It has been found that people tend to neglect these topics in pursuit of quick success. But that’s not the right way. Hence, if you aim for long-term success in data science, make sure you learn to create stories out of maths and statistics.

Read more »

18 New Must Read Books for Data Scientists on R and Python
Posted by Thang Le Toan on 26 November 2017 01:28 PM


“It’s called reading. It’s how people install new software into their brain”

Personally, I haven’t learnt as much from videos & online tutorials as I’ve learnt from books. Until this very moment, my tiny wooden shelf has enough books to keep me busy this winter.

Understanding machine learning & data science is easy. There are numerous open courses which you can take up right now and get started. But acquiring in-depth knowledge of a subject requires extra effort. For example: you might quickly understand how a random forest works, but understanding the logic behind its working would require extra effort.

The confidence to question the logic comes from reading books. Some people easily accept the status quo. On the other hand, some curious ones challenge it & say, “Why can’t it be done the other way?” That’s where such people discover new ways of executing a task. Almost every data scientist I’ve come across in person, on AMAs, or in published interviews has emphasized the inevitable role of books in their lives.

Here is a list of books on doing machine learning / data science in R and Python which I’ve come across in the last one year. Since reading is a good habit, with this post, I want to pass this habit on to you. For each book, I’ve written a summary to help you judge its relevance. Happy reading!

Disclosure: The Amazon links in this article are affiliate links. If you buy a book through such a link, we get paid by Amazon. This is one of the ways for us to cover our costs while we continue to create these articles. Further, the list reflects our recommendation based on the content of each book and is in no way influenced by the commission.

18 New Must Read Books for Data Scientists on R and Python


R for Data Science

Hands-on Programming with R

This book is written by Garrett Grolemund. It is best suited for people new to R. Learning to write functions & loops empowers you to do much more in R than just juggling with packages. People think R packages can let them avoid writing functions & loops, but that isn’t a sustainable approach. This book introduces you to the details of the R programming environment using interesting projects like weighted dice, playing cards, a slot machine etc. The book’s language is simple to understand and the examples can be reproduced easily.

Available: Buy Now


R for Everyone: Advanced Analytics and Graphics

This book is written by Jared P. Lander. It’s a decent book covering all aspects of data science such as data visualization, data manipulation and predictive modeling, though not in great depth. Understandably so, as it covers a wide breadth of topics and misses out on the details of each. Precisely, it emphasizes the usage criteria of each algorithm, with one example showing its implementation in R. This book should be bought by people who are more inclined towards understanding the practical side of algorithms.

Available: Buy Now


R Cookbook

This book is written by Paul Teetor. It comprises several tips and recipes to help people overcome daily struggles in data pre-processing and manipulation. Many a time, we are stuck in a situation where we know very well what needs to be done, but how it needs to be done becomes a mammoth challenge. This book solves that problem. It doesn’t have theoretical explanations of concepts, but focuses on how to use them in R. It covers a wide range of topics such as probability, statistics, time series analysis, data pre-processing etc.

Available: Buy Now


R Graphics Cookbook

This book is written by Winston Chang. Data visualization enables a person to express & analyze their findings using shapes & colors, not just tables. Having a solid understanding of charts, when to use which chart, and how to customize a chart and make it look good is a key skill of a data scientist. This book doesn’t bore you with theoretical knowledge, but focuses on building charts in R using sample data sets. It focuses on the ggplot2 package for all visualization activities.

Available: Buy Now


Applied Predictive Modeling

This book is written by Max Kuhn and Kjell Johnson. Max Kuhn is also the creator of the caret package. It’s one of the best books, comprising a blend of theoretical and practical knowledge. It discusses several crucial machine learning topics such as over-fitting, feature selection, linear & non-linear models, tree methods etc. Needless to say, it demonstrates all these algorithms using the caret package, one of the most powerful ML packages in the CRAN library.

Available: Buy Now


Introduction to Statistical Learning

This book is written by a team of authors including Trevor Hastie and Robert Tibshirani. It is one of the most detailed books on statistical modeling, and it’s available for free. It comprises in-depth explanations of topics such as linear regression, logistic regression, trees, SVM, unsupervised learning etc. Since it’s an introduction, the explanations are quite accessible and any newbie can easily follow them. Thus, I recommend this book to all people who are new to machine learning in R. In addition, the several practice exercises in this book are a cherry on top.

Available: Buy Now


Elements of Statistical Learning

This book is written by Trevor Hastie, Robert Tibshirani and Jerome Friedman. This is the next part of ‘Introduction to Statistical Learning’. It comprises more advanced topics, therefore I would suggest you not jump to it directly. This book is best suited for people familiar with the basics of machine learning. It talks about shrinkage methods, different linear methods for regression, classification, kernel smoothing, model selection etc. It’s a must read book for people who want to understand ML in depth.

Available: Buy Now


Machine Learning with R

This book is written by Brett Lantz. I am impressed by the simplicity of this author’s way of explaining concepts. It’s a book on machine learning which is easy to understand and provides a lot of knowledge about the practical aspects too. Algorithms such as Bagging, Boosting, SVM, Neural Networks, Clustering etc. are discussed by solving respective case studies. These case studies will help you understand the real-world usage of these algorithms. In addition, knowledge of ML parameters is also discussed.

Available: Buy Now


Mastering Machine Learning with R

This book is written by Cory Lesmeister. It is best suited for everyone who wants to master R for machine learning purposes. It comprises (almost) all algorithms and their execution in R. Alongside, this book will introduce you to several R packages used for ML, including the recently launched h2o package. It’s a book which features the latest advancements in the ML forte, hence I’d suggest it be read by every R user. However, you can’t expect to learn advanced ML techniques like stacking from this book.

Available: Buy Now


Machine Learning for Hackers

This book is written by Drew Conway and John Myles White. It’s a relatively shorter book than the others, but aptly brings out the sheer importance of every topic discussed. After reading this book, I realized that the authors’ mindset is not to go deep into a topic, while still making sure to cover the important details. For enhanced understanding, the authors also demonstrate several use cases and, while solving them, explain the underlying methods too. It’s a good read for everyone who’d like to learn something new about ML.

Available: Buy Now


Practical Data Science with R 

This book is written by Nina Zumel & John Mount. As the name suggests, this book focuses on using data science methods in the real world. It’s different in itself: none of the books listed above talks about the real-world challenges in model building and model deployment, but this one does. The authors never lose focus on establishing a connection between the theoretical world of ML and its impact on real-world activities. It’s a must read for freshers who are yet to enter the analytics industry.

Available: Buy Now


Python for Data Science

Mastering Python for Data Science

This book is written by Samir Madhavan. It starts with an introduction to data structures in NumPy & Pandas and provides a useful description of importing data from various sources into these structures. You will learn to perform linear algebra in Python and make analyses using inferential statistics. Later, the book takes on advanced concepts like building a recommendation engine, high-end visualization using Python, ensemble modeling etc.

Available: Buy Now


Python for Data Analysis

Want to get started with data analysis in Python? Get your hands on this data analysis guide by Wes McKinney, the main author of the Pandas library. There isn’t any online course as comprehensive as this book. It covers all aspects of data analysis, from manipulating, processing and cleaning to visualizing and crunching data in Python. If you are new to data science in Python, it’s a must read for you. It’s power-packed with case studies from various domains.

Available: Buy Now


Introduction to Machine Learning with Python

This book is written by Andreas Muller and Sarah Guido. It’s meant to help beginners get started with machine learning. It teaches you to build ML models in Python’s scikit-learn from scratch. It assumes no prior knowledge, hence it’s best suited for people with no prior Python or ML experience. In addition, it also covers advanced methods for model evaluation and parameter tuning, methods for working with text data, text-specific processing techniques etc.

Available: Buy Now


Python Machine Learning

This book is written by Sebastian Raschka. It’s one of the most comprehensive books I’ve found on ML in Python. The author explains every crucial detail we need to know about machine learning. He takes a stepwise approach to explaining the concepts, supported by various examples. This book covers topics such as neural networks, clustering, regression, classification, ensembles etc. It’s a must read book for everyone keen to master ML in Python.

Available: Buy Now


Building Machine Learning Systems with Python

This book is written by Willi Richert and Luis Pedro Coelho. In this book the authors have chosen a path of starting with the basics, explaining concepts through projects and ending on a high note. Therefore, I’d suggest this book to newbie Python machine learning enthusiasts. It covers topics like image processing, recommendation engines, sentiment analysis etc. It’s an easy to understand and fast to implement textbook.

Available: Buy Now


Advanced Machine Learning with Python

This book is written by John Hearty. It’s a definite read for every machine learning enthusiast. It lets you rise above the basics of ML techniques and dive into unsupervised methods, deep belief networks, auto-encoders, feature engineering techniques, ensembles etc. It’s definitely a book you would want to read to improve your rank in machine learning competitions. The author lays equal emphasis on the theoretical as well as the practical aspects of machine learning.

Available: Buy Now


Programming Collective Intelligence

This book is written by Toby Segaran. With an interesting title, this book is meant to introduce you to several ML algorithms such as SVM, trees, clustering, optimization etc. using interesting examples and use cases. This book is best suited for people new to ML in Python. Python, known for its incredible ML libraries & support, should make it easy for you to learn these concepts faster. Also, the chapters include exercises for practice to help you develop a better understanding.

Available: Buy Now


End Notes

The motive of this article is to introduce you to a huge reservoir of knowledge which you may not have noticed yet. These books will not only provide you boundless knowledge but also enrich you with various perspectives on using ML algorithms. You might feel puzzled at seeing so many books explaining similar concepts. What differentiates these books is the case studies & examples discussed.

Trust me, sometimes theoretical explanations become quite difficult to decipher compared to understanding practical cases. That’s how I feel. Learning from these authors’ knowledge is the fastest way to learn from so many people.

Read more »

What is Business Analytics and which tools are used for analysis?
Posted by Thang Le Toan on 26 November 2017 01:26 PM

Business Analytics has become a catch-all term for anything to do with data.

So if you are new to this field and don’t understand what people refer to as “Business Analytics”, don’t worry!

Even after spending more than 6 years in this industry, there are times when it is difficult for me to understand the work a person has done just by reading his CV.

Here is how an excerpt from a typical JD might look:

  • Analyze, prepare reports and present to Leadership team on a defined frequency


  • Lead multiple analytical projects and business planning to assist Leadership team deliver business performance.

On one hand, this creates confusion in the mind of the person applying for a particular role. On the other hand, it leaves the selectors with the difficult task of understanding and judging what a person has done in the past.

Now, if I got this as the description for one of the jobs I had applied to, I would be scared! Scared, not because I don’t know the subject, but because these lines could mean anything. The work could range from preparing basic reports at a junior level to performing multi-variate deep dives on various subjects.

So, what do you do when you are in such a situation?

Well, the first thing you should do is understand the Business Analytics spectrum. Once you have understood it, ask which part of the spectrum the role applies to, and then decide whether it suits your skills or not.

Following is a good representation of this spectrum:

analytics spectrum

Let me explain each of these areas in a bit more detail.


Reporting – Answer to What happened?

The domain of Analytics starts with answering a simple question – What happened? This activity is typically known as reporting. These are typically the MIS reports which people want to receive first thing in the morning. It is a snapshot of what has happened. Following is an example of how a typical report might look:


Tools used in reporting:

The majority of elementary reporting across the globe happens in MS Excel. More evolved organizations might pull the data from databases using tools like SQL, MS Access or Oracle. But typically, the dissemination of reports happens through Excel.

Skills required for reporting:

  • MS excel
  • Business understanding
  • Ability to perform monotonous task with diligence


Detective Analysis – Answer to Why did it happen?

Detective Analysis starts where reporting ends. You start looking for the reasons behind unexpected changes. Typical problems you work on are “Why did Sales drop in the last 2 months?” or “Why did the latest campaign under-perform or over-perform?”. In order to answer these questions, you look at past trends or at distribution changes to find the reasons for the changes. However, all of this is backward looking.

Some of these insights, which you find through this backward-looking analysis, can be used for business planning, but the purpose of the analysis is typically to find out what has worked and what has not.

Tools used in detective analysis:

Typically used tools are MS Excel, MS Access, Minitab and R (basic regression). You tend to use advanced Excel and pivot tables while dealing with these problems, and creating time series graphs typically helps a lot.

Skills required for detective analysis:

  • Structured thinking
  • MS Access, Excel, basic regression
  • Business understanding


Dashboards – Answer to What’s happening now?

A dashboard is an organized and well-presented summary of key business metrics. Dashboards are usually interactive, so that the user can find the exact information he is looking for. A dashboard, in its ideal state, should provide real-time information about performance. Following is an example of how a dashboard might look:


The whole science of creating data models, dashboards and reports based on this data is also known as “Business Intelligence”.

Tools used for creating dashboards:

For limited data sizes, dashboards can be made using advanced Excel. But typically, organizations use more advanced tools for the creation and dissemination of dashboards. Business Objects, Qlikview and Hyperion are some such software packages.

Skills required for creating dashboards:

  • Strong structured thinking: The person will need to create the entire architecture and data model
  • Business Understanding: If you don’t understand what you want to represent, God help you!


Predictive Modeling – Answer to What is likely to happen?

This is where you take all your historical trends and information and apply them to predict the future. You try to predict customer behaviour based on past information. Please note that there is a fine difference between forecasting and predictive modeling: forecasting is typically done at an aggregate level, whereas predictive modeling is typically done at a customer / instance level.
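
That difference in granularity can be illustrated with a toy Python sketch (all customers and numbers are hypothetical): forecasting produces one aggregate number, while predictive modeling produces one prediction per customer.

```python
# Hypothetical data: per-customer monthly spend over three months
spend_history = {
    "cust_a": [100, 110, 120],
    "cust_b": [80, 70, 60],
    "cust_c": [50, 50, 50],
}

# Forecasting: one number at the aggregate level, e.g. next month's
# total spend, naively estimated from the average monthly total.
monthly_totals = [sum(month) for month in zip(*spend_history.values())]
forecast_total = sum(monthly_totals) / len(monthly_totals)

# Predictive modeling: a prediction per customer / instance, e.g. a
# naive linear trend continued one month forward for each customer.
def predict_next(history):
    trend = history[-1] - history[-2]   # last observed change
    return history[-1] + trend

per_customer = {c: predict_next(h) for c, h in spend_history.items()}
print(forecast_total, per_customer)
```

A real model would of course use proper statistical techniques, but the shape of the output is the point: one total versus one score per customer.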

Tools used for Predictive modeling:

SAS has the highest market share among tools used for predictive modeling, followed by SPSS, R and Matlab.

Skills required for Predictive modeling:

  • Strong structured thinking
  • Business Understanding
  • Problem Solving


Big data – Answer to What can happen, given the behaviour of the community?

Imagine applying predictive modeling with a microscope in hand. What if you could store, analyze and make sense of every piece of information about the customer? What kind of social media community is he attached to? What kind of searches is he performing? Big data problems arise when data has grown on all three Vs (Volume, Velocity and Variety). You need data scientists to mine this data.

Tools used in Big data:

This is a very dynamic domain right now. A tool which was the market leader 6 months back may no longer be the best, so it is difficult to pin down specific tools. These tools typically work on top of Hadoop to store the data.

Skills required for harnessing big data:

  • Strong structured thinking
  • Advanced Data Architecture knowledge
  • Ability to work with unstructured data

So, now that you understand the analytics spectrum, if you come across a role which is not clear to you, please spend the necessary time understanding which domain it refers to and whether it fits with what you want to achieve.

If you have come across this confusion in understanding “Business Analytics“, this article should have helped you. In case any confusion remains, do let me know.

Read more »

18 Free Exploratory Data Analysis Tools For People who don’t code so well
Posted by Thang Le Toan on 26 November 2017 01:25 PM


Some of these tools are even better than programming (R, Python, SAS) tools.

All of us are born with special talents. It’s just a matter of time until we discover them and start believing in ourselves. We all have limitations, but should we stop there? No.

When I started coding in R, I struggled. Sometimes a lot more than one can ever imagine! Because I had never coded even <Hello World> in my entire life. My situation was like that of a guy who didn’t know how to swim but was thrown into the deep ocean, who somehow saved himself from drowning but ended up gulping a lot of salty water.

Now when I look back, I laugh at myself. Do you know why? Because I could have chosen one of several non-coding tools available for data analysis and could have avoided the suffering.

Data exploration is an inevitable part of predictive modeling. You can’t make predictions unless you know what happened in the past. The most important skill to master data exploration is ‘curiosity’, which is free of cost yet isn’t owned by everyone.

I have written this article to help you get acquainted with the various free tools available for exploratory data analysis. Nowadays, ample tools are available in the market which are free and quite interesting to work with. These tools don’t require you to code explicitly; simple drag-and-drop clicks do the job.


List of Non-Programming Tools

1. Excel / Spreadsheets

If you are transitioning into data science or have already survived in it for years, you would know that even after countless years, Excel remains an indispensable part of the analytics industry. Even today, most of the problems faced in analytics projects are solved using this software. With larger-than-ever community support, tutorials and free resources, learning this tool has become quite easy.

It supports all the important features like summarizing data, visualizing data, data wrangling etc., which are powerful enough to inspect data from all possible angles. No matter how many tools you know, Excel must feature in your armory. Though Microsoft Excel is paid, you can still try various other spreadsheet tools like OpenOffice and Google Docs, which are certainly worth a try!
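
The quick summarizing described above (what a spreadsheet's descriptive-statistics functions give you) can also be reproduced with Python's standard library; the column of values below is invented for illustration:

```python
import statistics

# Hypothetical numeric column, as you might paste it from a spreadsheet
values = [23, 19, 31, 27, 19, 44, 25, 30, 22, 28]

summary = {
    "count": len(values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": round(statistics.stdev(values), 2),  # sample std deviation
}
print(summary)
```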

Free Download: Click Here


2. Trifacta

Trifacta’s Wrangler tool is challenging the traditional methods of data cleaning and manipulation. While Excel has limitations on data size, this tool has no such boundaries and you can securely work on big data sets. This tool has incredible features such as chart recommendations, inbuilt algorithms and analysis insights, using which you can generate reports in no time. It’s an intelligent tool focused on solving business problems faster, thereby allowing us to be more productive in data-related exercises.

The availability of such open source tools makes us feel more confident and supported: there are good people around the world who are working extremely hard to make our lives better.

Free Download: Click Here


3. RapidMiner

This tool emerged as a leader in the 2016 Gartner Magic Quadrant for Advanced Analytics. Yes, it’s more than a data cleaning tool: it extends its expertise to building machine learning models, and it comprises all the ML algorithms we use frequently. Not just a GUI, it also extends support to people using Python and R for model building.

It continues to fascinate people around the world with its remarkable capabilities. Above all, it claims to provide an analytics experience at lightning-fast speed. Their product line has several products built for big data, visualizations and model deployment, some of which (enterprise) include a subscription fee. In short, we can say it’s a complete tool for any business which needs to perform all tasks from data loading to model deployment.

Free Download: Click Here


4. Rattle GUI

If you tried using R but couldn’t get a knack of what’s going on, Rattle should be your first choice. This GUI is built on R and is launched by running install.packages("rattle") followed by library(rattle) and then rattle() in R. Therefore, to use Rattle you must install R. It’s also more than just a data mining tool: Rattle supports various ML algorithms such as Tree, SVM, Boosting, Neural Net, Survival, Linear models etc.

It’s being widely used these days. According to CRAN, rattle is installed around 10,000 times every month. It provides enough options to explore, transform and model data in just a few clicks. However, it has fewer options than SPSS for statistical analysis; then again, SPSS is a paid tool.

Free Download: Click Here


5. Qlikview

Qlikview is one of the most popular tools in the business intelligence industry around the world. Deriving business insights and presenting them in an awesome manner is what this tool does. With its state-of-the-art visualization capabilities, you’d be amazed by the amount of control you get while working on data. It has an inbuilt recommendation engine to update you from time to time about the best visualization methods while working on data sets.

However, it is not statistical software. Qlikview is incredible at exploring data, trends and insights, but it can’t prove anything statistically. For that, you might want to look at other software.

Free Download: Click Here


6. Weka

An advantage of using Weka is that it is easy to learn. Being a machine learning tool, its interface is intuitive enough for you to get the job done quickly. It provides options for data pre-processing, classification, regression, clustering, association rules and visualization. Most of the steps you think of while building models can be achieved using Weka. It’s written in Java.

Primarily, it was designed for research purposes at the University of Waikato, but later it got accepted by more and more people around the world. However, over time I haven’t seen as enthusiastic a Weka community as those of R and Python. The tutorial listed below should help you further.

Free Tutorial: Click Here


7. KNIME

Similar to RapidMiner, KNIME offers an open source analytics platform for analyzing data, which can later be deployed and scaled using other supportive KNIME products. This tool has an abundance of features for data blending, visualization and advanced machine learning algorithms. Yes, using this tool you can build models too. Though there hasn’t been enough talk about this tool, considering its state-of-the-art design, I think it will soon catch the limelight it deserves.

Moreover, quick training lessons are available on their website to get you started with this tool right now.

Free Download: Click Here


8. Orange

As cool as it sounds, this tool is designed to produce interactive data visualizations and perform data mining tasks. There are enough YouTube tutorials to learn this tool. It has an extensive library of data mining tasks which includes all classification, regression and clustering methods. Along with that, the versatile visualizations formed during data analysis allow us to understand the data more closely.

To build any model, you’ll be required to create a flowchart. This is interesting, as it helps us further understand the exact procedure of data mining tasks.

Free Download: Click Here


9. Tableau Public

Tableau is data visualization software. We can say tableau and qlikview are the most powerful sharks in the business intelligence ocean, and the comparison of superiority is never ending. It’s fast visualization software which lets you explore data, every observation, using various possible charts. Its intelligent algorithms figure out on their own the type of data, the best method available and so on.

If you want to understand data in real time, tableau can get the job done. In a way, tableau imparts a colorful life to data and lets us share our work with others.

Free Download: Click Here


10. Data Wrapper

It’s lightning-fast visualization software. Next time someone in your team gets assigned BI work and has no clue what to do, this software is a considerable option. Its visualization bucket comprises line charts, bar charts, column charts, pie charts, stacked bar charts and maps. So, it’s basic software and can’t be compared with giants like tableau and qlikview. This tool is browser-based and doesn’t require any software installation.


11. Data Science Studio (DSS)

It is a powerful tool designed to connect technology, business and data. It is available in two segments: coding and non-coding. It’s a complete package for any organization which aims to develop, build, deploy and scale models on a network. DSS is also powerful enough to create smart data applications to solve real-world problems. It comprises features which facilitate team integration on projects. Among all the features, the most interesting part is that you can reproduce your work in DSS, as every action in the system is versioned through an integrated Git repository.

Free Download: Click Here


12. OpenRefine

It started as Google Refine, but it looks like Google dropped the project for reasons unclear. However, the tool is still available, renamed OpenRefine. Among the generous list of open source tools, OpenRefine specializes in messy data: cleaning, transforming and shaping it for predictive modeling purposes. As an interesting fact, during model building, 80% of an analyst’s time is spent on data cleaning. Not so pleasant, but it’s a fact. Using OpenRefine, analysts can not only save their time, but put it to use for productive work.
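
The kind of cleanup OpenRefine automates (trimming whitespace, normalising case, collapsing near-duplicate entries) can be sketched in Python; the messy values below are invented for illustration:

```python
# Hypothetical messy column of company names
raw = ["  Acme Corp", "acme corp ", "ACME CORP", "Widget Ltd", "widget ltd"]

def clean(value):
    """Trim and collapse whitespace, then normalise case."""
    return " ".join(value.split()).title()

cleaned = [clean(v) for v in raw]
deduped = sorted(set(cleaned))
print(deduped)   # the distinct, standardised values
```

OpenRefine does this interactively with clustering heuristics rather than a fixed rule, but the principle of a repeatable transform per column is the same.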

Free Download: Click Here


13. Talend

Decision making these days is largely driven by data. Managers and professionals no longer make gut-based decisions; they require a tool which can help them quickly. Talend can help them explore data and support their decision making. Precisely, it’s a data collaboration tool capable of cleaning, transforming and visualizing data.

Moreover, it also offers an interesting automation feature where you can save and redo your previous task on a new data set. This feature is unique and hasn’t been found in many tools. It also makes auto discoveries and provides smart suggestions to the user for enhanced data analysis.

Free Download: Click Here


14. Data Preparator

This tool is built in Java to assist us in data exploration, cleaning and analysis. It includes various inbuilt packages for discretization, numeration, scaling, attribute selection, missing values, outliers, statistics, visualization, balancing, sampling, row selection, and several other tasks. Its GUI is intuitive and simple to understand. Once you start working with it, I’m sure you won’t take a lot of time to figure out how it works.

A unique advantage of this tool is that the data set used for analysis doesn’t get stored in computer memory. This means you can work on large data sets without any speed or memory troubles.

Free Download: Click Here


15. DataCracker

It’s data analysis software which specializes in survey data. Many companies run surveys but struggle to analyze them statistically. Survey data are never clean: they comprise lots of missing and inappropriate values. This tool reduces our agony and enhances our experience of working on messy data. It is designed to load data from all major internet survey programs like SurveyMonkey, Survey Gizmo etc. There are several interactive features which help us understand the data better.

Free Download: Click Here


16. Data Applied

This powerful interactive tool is designed to build, share and design data analysis reports. Creating visualizations on large data sets can sometimes be troublesome, but this tool is robust in visualizing large amounts of data using tree maps. Like all the other tools above, it has features for data transformation, statistical analysis, detecting anomalies etc. All in all, it’s a multi-usage data mining tool capable of automatically extracting valuable knowledge (signal) from raw data. You’d be amazed to see that such non-programming tools are no less than R or Python for data analysis.

Free Download: Click Here


17. Tanagra Project

You might not like it because of its old-fashioned UI, but this free data mining software is designed to build machine learning models. The Tanagra project started as free software for academic and research purposes. Being an open source project, it provides you enough space to devise your own algorithms and contribute.

Along with supervised learning algorithms, it is enabled with paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, feature selection and construction algorithms etc. Some of its limitations include the unavailability of a wide set of data sources, direct access to data warehouses and databases, data cleansing, interactive utilization etc.

Free Download: Click Here


18. H2O

H2O is one of the most popular software packages in the analytics industry today. In a few years, this organization has succeeded in evangelizing the analytics community around the world. With this open source software, they bring a lightning-fast analytics experience, which is further extended through APIs for programming languages. Not just data analysis: you can build advanced machine learning models in no time. The community support is great, hence learning this tool isn’t a worry. If you live in the US, chances are they’ll be organizing a meetup near you. Do drop by!

Free Download: Click Here


Bonus Additions:

In addition to the awesome tools above, I also found some more tools which I thought you might be interested in. However, these tools aren’t free, but you can still avail yourself of their trials:

  1. Data Kleenr
  2. Data Ladder
  3. Data Cleaner
  4. WinPure


End Notes

Once you start working with these tools (your choice), you’ll understand that knowing programming isn’t a big advantage for predictive modeling; you can accomplish the same thing with these open source tools. Therefore, if until now you were disappointed by your lack of coding skills, now is the time to channel your enthusiasm into these tools. You may be interested to check 19 Data Science Tools for Non Coders.

The only limitation I see with some of these tools is the lack of community support. Except for a few, several of them don’t have a community to turn to for help and suggestions. Still, they’re worth a try!

Read more »

19 Data Science Tools for people who aren’t so good at Programming
Posted by Thang Le Toan on 26 November 2017 01:20 PM


Programming is an integral part of data science. Among other things, it is considered that a mind which understands programming logic, loops and functions has a higher chance of becoming a successful data scientist. So, what about people who never studied programming in school or college?

Are they doomed to have an unsuccessful career in data science?

I’m sure there are countless people who want to enter the data science domain but don’t understand coding very well. In fact, I too was a member of the non-programming league until I joined my first job. Therefore, I understand how terrible it feels when something you have never learnt haunts you at every step.

The good news is, I found a way out! Rather, I’ve found 19 ways you can use to ignite your appetite to learn data science without coding. These tools typically obviate the programming aspect and provide a user-friendly GUI (Graphical User Interface), so that anyone with minimal knowledge of algorithms can simply use them to build predictive models.

Many companies (especially startups) have recently launched GUI-driven data science tools. I’ve covered most of the tools available in the industry today. Also, I’ve added some videos to enhance your learning experience.


Note: All the information provided is gathered from open-source information sources. We are just presenting some facts, not opinions. In no manner do we intend to promote/advertise any of the products/services.



List of Tools

1. RapidMiner

RapidMiner (RM) originally started in 2006 as open-source stand-alone software named Rapid-I. Over the years, it has been renamed RapidMiner and has attained ~35Mn USD in funding. The tool is open-source for old versions (below v6), but the latest versions come with a 14-day trial period and are licensed after that.

RM covers the entire life-cycle of predictive modeling, starting from data preparation to model building and finally validation and deployment. The GUI is based on a block-diagram approach, something very similar to Matlab Simulink. There are predefined blocks which act as plug-and-play devices. You just have to connect them in the right manner, and a large variety of algorithms can be run without a single line of code. On top of this, they allow custom R and Python scripts to be integrated into the system.
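
The block-diagram idea (predefined operators wired together rather than coded) can be loosely mimicked in Python by composing small functions into a pipeline; the blocks and data below are invented for illustration:

```python
# Each "block" is a plain function: rows in, rows out
def drop_missing(rows):
    """Remove records with any missing value."""
    return [r for r in rows if None not in r.values()]

def normalise_age(rows):
    """Min-max scale the age column to the [0, 1] range."""
    ages = [r["age"] for r in rows]
    lo, hi = min(ages), max(ages)
    return [{**r, "age": (r["age"] - lo) / (hi - lo)} for r in rows]

def run_pipeline(rows, blocks):
    """Wire the blocks together in order, like connecting operators."""
    for block in blocks:
        rows = block(rows)
    return rows

data = [{"age": 20, "label": 0}, {"age": 40, "label": 1},
        {"age": None, "label": 0}, {"age": 60, "label": 1}]
result = run_pipeline(data, [drop_missing, normalise_age])
print(result)
```

In RM the same wiring is done visually, and each operator carries its own configuration panel instead of function arguments.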

Their current product offerings include the following:

  1. RapidMiner Studio: A stand-alone software which can be used for data preparation, visualization and statistical modeling
  2. RapidMiner Server: It is an enterprise-grade environment with central repositories which allow easy team work, project management and model deployment
  3. RapidMiner Radoop: Implements big-data analytics capabilities centered around Hadoop
  4. RapidMiner Cloud: A cloud-based repository which allows easy sharing of information among various devices

RM is currently being used in various industries including automotive, banking, insurance, life Sciences, manufacturing, oil and gas, retail, telecommunication and utilities.


2. DataRobot

DataRobot (DR) is a highly automated machine learning platform built by some of the all-time best Kagglers, including Jeremy Achin, Thoman DeGodoy and Owen Zhang. Their platform claims to have obviated the need for data scientists. This is evident from a phrase on their website – “Data science requires math and stats aptitude, programming skills, and business knowledge. With DataRobot, you bring the business knowledge and data, and our cutting-edge automation takes care of the rest.”

DR proclaims to have the following benefits:

  • Model Optimization
    • Platform automatically detects the best data pre-processing and feature engineering by employing text mining, variable type detection, encoding, imputation, scaling, transformation, etc.
    • Hyper-parameters are automatically chosen depending on the error-metric and the validation set score
  • Parallel Processing
    • Computation is divided over thousands of multi-core servers
    • Uses distributed algorithms to scale to large data sets
  • Deployment
    • Easy deployment facilities with just a few clicks (no need to write any new code)
  • For Software Engineers
    • Python SDK and APIs available for quick integration of models into tools and softwares.

With funding of ~60Mn USD and more than 100 employees, DR looks in good shape for the future.


3. BigML

BigML is another platform with ~Mn USD in funding. It provides a good GUI which takes the user through the following six steps:

  • Sources: use various sources of information
  • Datasets: use the defined sources to create a dataset
  • Models: make predictive models
  • Predictions: generate predictions based on the model
  • Ensembles: create ensemble of various models
  • Evaluation: verify models against validation sets

These processes will obviously iterate in different orders. The BigML platform provides nice visualization of results and has algorithms for solving classification, regression, clustering, anomaly detection and association discovery problems. You can get a feel of how their interface works using their YouTube channel.


4. Google Cloud Prediction API


The Google Cloud Prediction API offers RESTful APIs for building machine learning models for android applications. This platform is specifically for mobile applications based on Android OS. Some of the use cases include:

  • Recommendation Engine: Given a user’s past viewing habits, predict what other movies or products a user might like.
  • Spam Detection: Categorize emails as spam or non-spam.
  • Sentiment Analysis: Analyze posted comments about your product to determine whether they have a positive or negative tone.
  • Purchase Prediction: Guess how much a user might spend on a given day, given his spending history.

Though the API can be used by any system, there are also specific Google API client libraries built for better performance and security. These exist for various programming languages: Python, Go, Java, JavaScript, .NET, Node.js, Obj-C, PHP and Ruby.


5. Paxata

Paxata is one of the few organizations which focus on data cleaning and preparation, NOT the machine learning or statistical modeling part. It is an MS Excel-like application that is easy to use, with visual guidance making it easy to bring together data, find and fix dirty or missing data, and share and re-use data projects across teams. Like the others mentioned here, Paxata eliminates coding or scripting, overcoming the technical barriers involved in handling data.

Paxata platform follows the following process:

  1. Add Data: use a wide range of sources to acquire data
  2. Explore: perform data exploration using powerful visuals allowing the user to easily identify gaps in data
  3. Clean+Change: perform data cleaning using steps like imputation, normalization of similar values using NLP, detecting duplicates
  4. Shape: make pivots on data, perform grouping and aggregation
  5. Share+Govern: allows sharing and collaborating across teams with strong authentication and authorization in place
  6. Combine: a proprietary technology called SmartFusion allows combining data frames with 1 click as it automatically detects the best combination possible; multiple data sets can be combined into a single AnswerSet
  7. BI Tools: allows easy visualization of the final AnswerSet in commonly used BI tools; also allows easy iterations between data preprocessing and visualization

With funding of ~25Mn USD, Paxata has set its foot in the financial services, consumer goods and networking domains. It might be a good tool to use if your work requires extensive data cleaning.


6. Trifacta

Trifacta is another startup focussed on data preparation. It has 2 product offerings:

  • Wrangler – a free stand-alone software
  • Wrangler Enterprise – licensed professional version

Trifacta offers a very intuitive GUI for performing data cleaning. It takes data as input and provides a summary with various statistics by column. Also, for each column it automatically recommends some transformations which can be selected using a single click. Various transformations can be performed on the data using some pre-defined functions which can be called easily in the interface.
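
A per-column summary of the kind described above can be sketched in a few lines of Python; the records below are toy data invented for illustration:

```python
# Toy dataset: a list of records with mixed completeness
rows = [
    {"city": "Hanoi", "revenue": 120},
    {"city": "Hue", "revenue": None},
    {"city": None, "revenue": 95},
]

def profile(rows):
    """For each column, count non-missing and missing values."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = values.count(None)
        report[col] = {"non_missing": len(values) - missing,
                       "missing": missing}
    return report

print(profile(rows))
```

Trifacta builds richer per-column statistics (histograms, type inference, suggested transforms), but a missing-value count per column is the first thing such a summary surfaces.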

Trifacta platform uses the following steps of data preparation:

  1. Discovering: this involves getting a first look at the data and distributions to get a quick sense of what you have
  2. Structure: this involves assigning proper shape and variable types to the data and resolving anomalies
  3. Cleaning: this step includes processes like imputation, text standardization, etc. which are required to make the data model ready
  4. Enriching: this step helps in improving the quality of analysis that can be done by either adding data from more sources or performing some feature engineering on existing data
  5. Validating: this step performs final sense checks on the data
  6. Publishing: finally the data is exported for further use

With ~75Mn USD in funding, Trifacta is currently being used in financial, life sciences and telecommunication industry.


7. Narrative Science

Narrative Science is based on a unique idea: it generates automated reports from data. It works as a data story-telling tool which uses advanced natural language processing to create reports, something similar to a consulting report.

Some of the features of this platform include:

  • incorporates specific statistics and past data of the organization
  • makes use of the benchmarks, drivers and trends of the specific domain
  • can help generate personalized reports targeted at a specific audience

With ~30Mn USD in funding, Narrative Science is currently being used in financial, insurance, government and e-commerce domains. Some of its customers include American Century Investments, PayScale, MasterCard, Forbes, Deloitte, etc.

Having discussed some startups in this domain, let’s move on to some of the academic initiatives which are trying to automate aspects of data science. These have the potential of turning into successful enterprises in the future.


8. MLBase

MLBase is an open-source project developed by the AMP (Algorithms Machines People) Lab at the University of California, Berkeley. The core idea is to provide an easy solution for applying machine learning to large-scale problems.

It has 3 offerings:

  1. MLlib: it works as the core distributed ML library in Apache Spark. It was originally developed as part of the MLBase project, but now the Spark community supports it
  2. MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
  3. ML Optimizer: this layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI and MLlib.

This undertaking is still under active development and we should hear about the developments in the near future.



9. Weka

Weka is data mining software written in Java, developed at the Machine Learning Group at the University of Waikato, New Zealand. It is a GUI-based tool which is very good for beginners in data science, and the best part is that it is open-source. You can learn about it using the MOOC offered by the University of Waikato here. You can learn more about it in this article.

Though Weka is currently used more in the academic community, it might be the stepping stone of something big coming up in the future.


10. Automatic Statistician

the automatic statistician

Automatic Statistician is not a product per se but a research organization which is creating a data exploration and analysis tool. It can take in various kinds of data and use natural language processing to generate a detailed report. It is being developed by researchers who have worked at Cambridge and MIT and also won Google’s Focussed Research Award with a prize of $750,000. Though it is still under development and very minimal information is available about the project, it looks like it is being backed by Google. You can find some information here.


More Tools

I have discussed a selected set of 10 examples above, but there are many more like these. I’ll briefly name a few of them here, and you can explore further if this isn’t enough to whet your appetite:

  • MarketSwitch – This tool is more focussed on optimization rather than predictive analytics
  • – This tool works in the domain of IoT (Internet of Things) and performs analytics on connected devices
  • – This tool is focussed on customer handling and ticket system analytics
  • Predixion – This is another tool which works on data collected from connected devices
  • Logical Glue – Another GUI based machine learning platform which works from raw data to deployment
  • Pure Predictive – This tool uses a patented Artificial Intelligence system which obviates the need for data preparation and model tuning; it uses AI to combine 1000s of models into what they call “supermodels”
  • DataRPM – Another tool for making predictive models using a GUI and no coding requirements
  • ForecastThis – Another proprietary technology focussed on machine learning using a GUI
  • FeatureLab – It allows easy predictive modeling and deployment using GUI

If you’re hearing these names for the first time, you’ll be surprised (like I was :D) that so many tools exist. But the good thing is that they haven’t had a disruptive impact as of now. The real question is: will these technologies achieve their goals? Only time can tell!


End Notes

In this article, we have discussed various initiatives working towards automating different aspects of solving a data science problem. Some of them are in a nascent research stage, some are open-source, and others are being used in the industry with millions in funding. All of these pose a potential threat to the job of the data scientist, a role which is expected to grow in the near future. These tools are best suited for people who abhor programming and coding.

Do you know any other startups or initiatives working in this domain? Please feel free to drop a comment below and enlighten us!

Read more »

Comprehensive guide for Data Exploration in R
Posted by Thang Le Toan on 26 November 2017 01:16 PM

So far we have covered detailed tutorials on data exploration using SAS and Python. What is the one piece missing to complete this series? I am sure you guessed it right. In this article I will give a detailed tutorial on data exploration using R. For the reader’s ease, I will follow a format very similar to the one used in the Python tutorial, simply because of the sheer resemblance between the two languages.

Here are the operations I’ll cover in this article (refer to this article for similar operations in SAS):

    1. How to load data file(s)?

    2. How to convert a variable to a different data type?

    3. How to transpose a table?

    4. How to sort data?

    5. How to create plots (Histogram, Scatter, Box Plot)?

    6. How to generate frequency tables?

    7. How to do sampling of a data set?

    8. How to remove duplicate values of a variable?

    9. How to group variables to calculate count, average, sum?

    10. How to recognize and treat missing values and outliers?

    11. How to merge / join data sets effectively?


 Part 1: How to load data file(s)?

Input data sets can come in various formats (.XLS, .TXT, .CSV, .JSON). In R, it is easy to load data from any source, thanks to its simple syntax and the availability of predefined libraries. Here, I will take the examples of reading a CSV file and a tab-separated file. read.table is an alternative, but read.csv is my preference given its simplicity.


# Read a CSV file into R
MyData <- read.csv(file="c:/TheDataIWantToReadIn.csv", header=TRUE, sep=",")
# Read a tab-separated file
TabSeparated <- read.table("c:/TheDataIWantToReadIn.txt", sep="\t", header=TRUE)


All other Read commands are similar to the one mentioned above.


Part 2: How to convert a variable to different data type?

Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character.

Use is.xyz() to test whether an object is of data type xyz; it returns TRUE or FALSE.
Use as.xyz() to explicitly convert an object to that type.

is.numeric(), is.character(), is.vector(), is.matrix()
as.numeric(), as.character(), as.vector(), as.matrix()
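As a quick sketch of how these pairs work in practice (the vector x is just an illustration):

```r
x <- c(1, 2, 3)
is.numeric(x)          # TRUE: x is a numeric vector
y <- as.character(x)   # explicit conversion to character
is.character(y)        # TRUE
as.numeric(y)          # back to the numeric values 1 2 3
```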

However, conversion of data structures is more critical than format transformation. Here is a grid that will guide you through structure conversion:



Part 3: How to transpose a Data set?

It is also sometimes required to transpose a dataset from a wide structure to a long (narrow) structure. Here is the code to do the same, using the melt function from the reshape2 package:



# example of the melt function (requires the reshape2 package)
library(reshape2)
mdata <- melt(mydata, id=c("id","time"))
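If reshape2 is not installed, base R's reshape() performs the same wide-to-long conversion; here is a self-contained sketch with made-up data (mydata and the column names are illustrative only):

```r
# wide data: one row per id, measurements in columns x1 and x2
mydata <- data.frame(id = c(1, 2), x1 = c(5, 3), x2 = c(6, 5))

# wide -> long: one row per (id, variable) pair
long <- reshape(mydata, direction = "long",
                varying = c("x1", "x2"), v.names = "value",
                timevar = "variable", times = c("x1", "x2"),
                idvar = "id")
```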


Part 4: How to sort a data frame?

Data can be sorted by using order(variable name) as an index. Sorting can be based on multiple variables, in either ascending or descending order.


# sort by var1
newdata <- old[order(old$var1),]
# sort by var1 (ascending) and var2 (descending)
newdata2 <- old[order(old$var1, -old$var2),]
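A runnable version of the same idea using the built-in iris data set (the column choices are purely illustrative):

```r
# sort iris by Sepal.Length ascending, then Sepal.Width descending
sorted <- iris[order(iris$Sepal.Length, -iris$Sepal.Width), ]
head(sorted)
```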


Part 5: How to create plots (Histogram)?

Data visualization in R is very easy and produces extremely pretty graphs. Here I will create a distribution of scores in a class and then plot histograms with several variations.

score <- rnorm(n=1000, mean=80, sd=20)


Let’s try to find the assumptions R takes to plot this histogram, and then modify a few of those assumptions. Saving the result of hist() lets us inspect its components:

> h <- hist(score)
> h$breaks
 [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150
> h$counts
 [1]   2   5  19  52  84 141 195 201 152  81  39  25   3   1
> h$density
 [1] 0.0002 0.0005 0.0019 0.0052 0.0084 0.0141 0.0195 0.0201 0.0152
[10] 0.0081 0.0039 0.0025 0.0003 0.0001
> h$mids
 [1]  15  25  35  45  55  65  75  85  95 105 115 125 135 145
> h$xname
[1] "score"
> h$equidist
[1] TRUE
> class(h)
[1] "histogram"

As you can see, R has chosen the break points automatically. We can restrict the number of break points or plot densities instead of counts. Over and above this, we can colour the bars and overlay a normal distribution curve. Here is how you can do all this:

hist(score, freq=FALSE, xlab="Score", main="Distribution of score", col="lightgreen", xlim=c(0,150), ylim=c(0, 0.02))
curve(dnorm(x, mean=mean(score), sd=sd(score)), add=TRUE, col="darkblue", lwd=2)




Part 6: How to generate frequency tables with R?

Frequency tables are the most basic and effective way to understand distribution across categories.

Here is a simple example of calculating a one-way frequency:
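A one-way frequency table can be produced with base R's table() function; the iris species column is used here purely for illustration:

```r
# one-way frequency: count of observations per species
table(iris$Species)
#     setosa versicolor  virginica
#         50         50         50
```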




Here is a code which can find cross tab between two categories :

# 2-Way Cross Tabulation (requires the gmodels package)
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
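If you do not want to install gmodels, base R's table() also produces a two-way cross tabulation (the mtcars columns below stand in for myrowvar and mycolvar):

```r
# two-way cross tabulation: cylinder count vs. gear count
table(mtcars$cyl, mtcars$gear)
```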


Part 7: How to sample Data set in R?

For sampling a dataset in R, we first need to generate a few random indices. Here is how you can draw a random sample:

mysample <- mydata[sample(1:nrow(mydata), 100,replace=FALSE),]

This code will simply take out a random sample of 100 observations from the table mydata.
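For reproducible samples, set the random seed first; a sketch on the built-in iris data (the seed value is arbitrary):

```r
set.seed(42)  # fix the RNG so the sample is reproducible
mysample <- iris[sample(1:nrow(iris), 10, replace = FALSE), ]
nrow(mysample)  # 10 rows drawn without replacement
```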


Part 8: How to remove duplicate values of a variable?

Removing duplicates in R is extremely simple. Here is how you do it:

> set.seed(150)
> x <- round(rnorm(20, 10, 5))
> x
 [1] 2 10 6 8 9 11 14 12 11 6 10 0 10 7 7 20 11 17 12 -1
> unique(x)
 [1] 2 10 6 8 9 11 14 12 0 7 20 17 -1


Part 9: How to find class-level count, average and sum in R?

We generally use Apply functions to do these jobs.

> tapply(iris$Sepal.Length,iris$Species,sum)
 setosa versicolor virginica 
 250.3 296.8 329.4 
> tapply(iris$Sepal.Length,iris$Species,mean)
 setosa versicolor virginica 
 5.006 5.936 6.588


Part 10: How to recognize and treat missing values and outliers?

Identifying missing values can be done with is.na():

> y <- c(4,5,6,NA)
> is.na(y)
[1] FALSE FALSE FALSE  TRUE

And here is a quick fix for the same, imputing the mean of the non-missing values:

> y[is.na(y)] <- mean(y, na.rm=TRUE)
> y
[1] 4 5 6 5

As you can see, the missing value has been imputed with the mean of other numbers. Similarly, we can impute missing values with any best value available.


Part 11: How to merge / join data sets?

This is yet another operation which we use in our daily life.

To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))

Appending datasets is another very frequently used operation. To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(dataframeA, dataframeB)
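A self-contained sketch of both operations with toy data frames (all names are illustrative):

```r
dfA <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
dfB <- data.frame(ID = c(2, 3, 4), y = c(10, 20, 30))

# inner join: only IDs present in both frames survive
inner <- merge(dfA, dfB, by = "ID")   # rows for IDs 2 and 3

# vertical append: same variables required, column order may differ
stacked <- rbind(dfA, dfA)
```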


End Notes:

In this comprehensive guide, we looked at the R codes for various steps in data exploration and munging. This tutorial along with the ones available for Python and SAS will give you a comprehensive exposure to the most important languages of the analytics industry.

Did you find the article useful? Do let us know your thoughts about this guide in the comments section below.

Read more »

Supermicro X11DPi-N E-ATX Motherboard Review
Posted by Thang Le Toan on 23 November 2017 01:26 AM
Supermicro X11DPi N Feature

A mainstay of Supermicro’s dual-socket LGA 3647 motherboards will no doubt be the X11DPi-N. We reviewed the Supermicro X11DPi-NT not long ago, which offers 10GbE networking for those who need faster network speeds, while the X11DPi-N provides dual 1GbE ports for a reduced entry cost to this platform while retaining the same feature set.

Supermicro X11DPi-N Specifications

Here is the spec table for the Supermicro X11DPi-N:

Supermicro X11DPi N Specifications

Some readers requested we include the motherboard block diagram, so we wanted to add this in our review.

Supermicro X11DPi N Block Diagram

In the case of the X11DPi-N, we find 1GbE network ports; the X11DPi-NT includes 10GbE ports.

Supermicro X11DPi-N Overview

The X11DPi-N fits into the sizeable E-ATX motherboard class with a size of 12” x 13”. Filling general-purpose or storage server roles, this motherboard supports a large variety of Supermicro’s 2U and 4U platforms.

Supermicro X11DPi N Top

Paired with each socket, we find six blue memory slots, giving us six DIMMs per CPU with one DIMM per channel. Memory speeds of 2666MHz are fully supported, an increase from the 2400MHz we saw in the last generation. The black DIMM slots add an extra pair of DIMMs per CPU to achieve capacity parity with the previous-generation Xeon E5 series products.

Supermicro X11DPi N PCIe Slots

PCIe slots available on the X11DPi-N are the same as we find on the X11DPi-NT: four x16 slots and two x8 slots. The x16 slots are perfect for GPUs, networking cards or other expansion devices. Throw in the two x8 slots for higher-end networking and storage expanders. Whether you are looking at GPU-based machines or powerful storage servers, there are plenty of PCIe slots to house a wide array of design choices.

At the edge of the motherboard, we also see a USB 3.0 Type A port which works well for a boot drive.

Supermicro X11DPi N Storage Ports

For storage servers, three SFF-8087 connectors can accommodate up to 12 SATA III 6.0Gbps drives. Two additional 7-pin SATA ports bring the total to 14 SATA III 6.0Gbps drives from the C622 PCH. There are also two OCuLink ports for NVMe drives. This combination of storage ports allows a storage server to free up expansion slots for additional PCIe cards.

Supermicro X11DPi N Back IO

Network ports are dual GbE LAN from the C622 PCH, with IPMI via an ASPEED AST2500 above two USB 3.0 ports. Two USB 2.0 ports, VGA and COM ports round out the rear I/O.

Supermicro X11DPi-N Management

Supermicro’s new X11 platform's out-of-band management is an updated version of their industry-standard management interface, including a WebGUI.

Supermicro X11DPi N IPMI

Supermicro’s latest BIOS for the X11 platform features support for HTML5 iKVM which we have seen on several motherboards and systems now.

Supermicro X11DPi N BIOS

Enhanced features for the latest HTML5 iKVM include the ability to enlarge the screen for easier reading on high-resolution displays such as 4K monitors.

Supermicro X11DPi N IPMI 2

Ease of use with enlarged working screens carries right over to the desktop.

Test Configuration

Our primary test configuration for this motherboard is as follows:

  • Motherboard: Supermicro X11DPi-N
  • CPU: 2x Intel Xeon Gold 6134, 8 Core processors
  • RAM: 12x 16GB DDR4-2400 RDIMMs low profile (Micron)
  • SSD: OCZ RD400
Supermicro X11DPi N Gold 6134

We have moved our benchmark processors to Intel Xeon Gold 6134 8-core CPUs. With a TDP of 130 watts each, we had no thermal throttling issues using two Supermicro 2U active coolers.

AIDA64 Memory Test

AIDA64 memory bandwidth benchmarks (Memory Read, Memory Write, and Memory Copy) measure the maximum achievable memory data transfer bandwidth.

Supermicro X11DPi N AIDA64 Memory

AIDA64 memory benchmarks show numbers comparable with the past dual-processor motherboards we have reviewed: slightly higher reads, while copy and write are somewhat lower.

Cinebench R15

Supermicro X11DPi N Cinebench R15

Cinebench R15 benchmark numbers fall right where we would expect them to, landing just below results from previous reviews.

Geekbench 4

Supermicro X11DPi N Geekbench 4

The Gold 6134 processors have a Turbo speed of 3.7GHz that bumps up Single-Core results, and we do see a modest improvement in Multi-Core results.

Supermicro X11DPi-N Power Consumption

For our power testing needs, we use a Yokogawa WT310 power meter which can feed its data through a USB cable to another machine where we can capture the test results. We then use AIDA64 Stress test to load the system and measure max power loads.

Power consumption can vary depending on processors used and the number of HDDs/SSDs/Expansion cards used. Here we test just a primary system.

Supermicro X11DPi N Power Test

We find the idle power draw of 132 watts and the peak of 394 watts to be quite typical of the test platforms we run.

OS Idle: 132W
AIDA64 Stress Test: 394W


The Supermicro X11DPi-N exceeds our expectations in flexibility and features. Aided by the new Intel Xeon Scalable platform, it offers more PCIe lanes, better memory bandwidth, and more SATA III 6.0Gbps ports without the need for extra PCIe storage expansion cards. Dual 1GbE network ports are a welcome cost-saving measure for those looking to add their own high-speed networking. Overall this was an extremely easy motherboard to work with and one we have no hesitation recommending.

Supermicro X11DPi N

The Supermicro X11DPi-N is an incredible motherboard with great storage options, coupled with enough PCIe slots to take advantage of a broad mix of expansion cards.



Read more »

NVIDIA GPU Cloud is one important step in democratizing deep learning
Posted by Thang Le Toan on 23 November 2017 12:40 AM

At the 2017 Supercomputing conference (SC17), we were able to attend the NVIDIA press and analyst event. We had a pre-briefing, so we knew what to expect, but there were a number of announcements. By far the most impactful was the announcement of the NVIDIA GPU Cloud expanding to HPC applications. Here is a quick overview of what NVIDIA announced.

NVIDIA Tesla V100 Volta Everywhere

The first announcement should be of little surprise to anyone following STH coverage of NVIDIA. The next generation NVIDIA Tesla V100 is now available just about everywhere.

NVIDIA Volta Taking Off At SC17

On the show floor of SC17 just about every vendor (Dell EMC, Hewlett Packard Enterprise, Huawei, Cisco, IBM, Lenovo, and Supermicro) had solutions with the V100 GPU. Some were 4 to a box, some were 8, some were submerged in liquid for cooling, but they were everywhere.

Supermicro At SC 2017 NVIDIA Tesla V100 Volta

Beyond just the hardware vendors, Alibaba Cloud, Amazon Web Services (in AWS P3 instances), Baidu Cloud, Microsoft Azure, Oracle Cloud and Tencent Cloud have also announced Volta-based cloud services.

HPC and AI Applications in the NVIDIA GPU Cloud

The NVIDIA GPU Cloud is NVIDIA’s offering where frameworks and applications are packaged and distributed by NVIDIA in containers. We have been working with nvidia-docker, the precursor to this service, for quite some time and it is awesome. We have recently been working to package our applications using these NVIDIA optimized containers. Essentially, NVIDIA is taking away much of the “getting the stack running” work for deep learning/ AI and HPC applications.


At SC17, NVIDIA added HPC applications and visualization to the platform. This allows users to get up and running quickly with GPU-accelerated applications without having to manage CUDA versions and their dependencies on application versions. The real impact here is that this can shave days or weeks off the time spent deploying systems.

NVIDIA HPC Applications Coming To NGC

We are already working with NGC at STH and will have more on this offering soon. It is interesting for a few reasons. First, it offers something so easy to use that we wish Intel would offer something similar: distilling the vast expanse of GitHub and Docker Hub down to a few select best-known configurations. Second, it is fascinating in that we can see this being offered for VDI in the future, and also making NVIDIA the hub for on-demand GPU compute offerings.

Final Words

We have been talking about the NVIDIA Tesla V100 “Volta” for some time, and have some numbers in the editing queue for 8x Tesla V100 systems. At the same time, these only started shipping in volume in the last two months from what we have been hearing, so they are a somewhat known quantity. The NVIDIA GPU Cloud is the big story here. NVIDIA has the opportunity to build the “app store” model for both on-prem (DGX-1 and DGX Station along with partner systems) and cloud-based on-demand GPU compute.

Read more »

Help Desk Software by Kayako