Free Must Read Books on Statistics & Mathematics for Data Science
Posted by Thang Le Toan on 26 November 2017 01:29 PM

Introduction

The selection process for data scientists at Google gives higher priority to candidates with a strong background in statistics and mathematics. Not just Google: other top companies in the world (Amazon, Airbnb, Uber, etc.) also prefer candidates with strong fundamentals over mere know-how in data science.

If you too aspire to work for such top companies in the future, it is essential for you to develop a mathematical understanding of data science. Data science is simply the evolved version of statistics and mathematics, combined with programming and business logic. I’ve met many data scientists who struggle to explain their predictive models statistically.

More important than just deriving accuracy is understanding and interpreting every metric and calculation behind that accuracy. Remember, every single ‘variable’ has a story to tell. So, if nothing else, try to become a great story explorer!

In this article, I’ve compiled a list of must-read books on statistics and mathematics. I understand that mathematics has no limits; hence, I’ve listed only those books which will help you connect with data science better.

Note: Books which have been made free to access by their publishers are linked in this article. If not, a link to the Amazon bookstore is provided.


 

Statistics

21. Introduction to Statistical Learning

This is a highly recommended book for practicing data scientists. The focus of this book is on connecting statistics concepts with machine learning. Hence, you’ll learn about all the popular supervised and unsupervised machine learning algorithms. R users will get an advantage, since the practical aspects of the algorithms are demonstrated using R. In addition to theory, this book also lays emphasis on using ML algorithms in real-life settings.

Available: Free Download

 

 

22. Elements of Statistical Learning

This book is the advanced sequel to the previous one. It is written by Trevor Hastie, Rob Tibshirani and Jerome Friedman, professors at Stanford University. The first book, ‘Introduction to Statistical Learning’, uncovers the basics of statistics and machine learning. This book will introduce you to higher-level algorithms such as neural networks, bagging and boosting, kernel methods, etc. The algorithms are implemented in the R programming language.

Available: Free Download

 

 

23. Think Stats

The author of this book is Allen B. Downey. It is based on performing statistical analysis practically in Python. Hence, make sure you’ve got some basic knowledge of Python before buying this book. It focuses entirely on understanding the real-life influence of statistics using popular case studies. Since stats and math are closely connected, it also has dedicated chapters on topics like Bayesian estimation.

Available: Buy from Amazon

 

 

24. From Algorithms to Z-Scores

Did you know about the crucial role of statistics in programming? The author of this book is Norm Matloff, Professor at the University of California, Davis. This book explains the use of probabilistic concepts and statistical measures in R. Again, a good practice source for R users. It teaches the art of dealing with probabilistic models and choosing the best one for final evaluation. It is a highly recommended book (especially for R users).

Available: Free Download

 

 

25. Introduction to Bayesian Statistics

This is a highly recommended book for newcomers to data science. The author of this book is William M. Bolstad. It’s a must-read for people who find mathematics boring. Having been written in a conversational style (rare to find math presented this way), this book is a great introductory resource on statistics. It begins with scientific methods of data gathering and ends with dedicated chapters on Bayesian statistics.

Available: Free Download

 

 

26. Discovering Statistics Using R

This book is written by Andy Field, Jeremy Miles and Zoe Field. I would highly recommend this book to newbies in data science. For starting with statistics, this book has great content which goes into in-depth detail on its topics. Along the way, the statistical concepts are explained in conjunction with R, which makes the book even more useful. It offers a step-by-step understanding, with the parallel support of interesting practice examples.

Available: Buy on Amazon

 

 

Mathematics

27. Introduction to Linear Algebra

This is one of the most recommended books on linear algebra. The author of this book is Gilbert Strang, Professor at MIT. Gilbert’s unique way of delivering knowledge gives you the intuition and excitement to move forward after every chapter. This book will help you build a strong mathematical foundation for machine learning. It covers all the necessary topics, such as vectors, linear equations, determinants, eigenvalues, matrix factorization, etc., in great depth.

Available: Buy on Amazon

 

 

28. Matrix Computations

Matrices and data frames are essential components of machine learning. The authors of this book are Gene H. Golub and Charles F. Van Loan. This book provides a nice head start for students on the concepts of matrix computations. The authors cover most of the important topics, such as Gaussian elimination, matrix factorization, the Lanczos method, error analysis, etc. Every chapter is supported by intuitive practice problems. The pseudocode is given in MATLAB style.

Available: Free Download

 

 

29. A Probabilistic Theory of Pattern Recognition

This is a complete resource for learning the application of mathematics, and a must-read book for intermediate and advanced practitioners in machine learning. This book is written by Luc Devroye, Laszlo Gyorfi and Gabor Lugosi. It covers a wide range of topics, varying from Bayes error and linear discrimination to epsilon entropy and neural networks. It provides convincing explanations of complex theorems, with section-wise practice problems.

Available: Free Download

 

 

30. Introduction to the Math of Neural Networks

If you have an innate interest in learning about neural networks, this should be your place to start. The author of this book is Jeff Heaton. The author has beautifully simplified the difficult concepts of neural networks. This book introduces you to the basics of the underlying math in neural networks. It assumes the reader has prior knowledge of algebra, calculus and programming. It demonstrates various mathematical tools which can be applied to neural networks.

Available: Buy on Amazon

 

31. Advanced Engineering Mathematics

This is probably the most comprehensive book available on mathematics for machine learning users. The author of this book is Erwin Kreyszig. As a matter of fact, this book is highly recommended to college students as well. If you haven’t been good at math until now, follow this book religiously and you should surely see significant improvements in your understanding. Along with derivations and practice examples, this book has dedicated sections on calculus, algebra, probability, etc. Definitely a must-read book for all levels of practitioners in data science.

Available: Free Download

 

Cookbook on Probability and Statistics

This cookbook is a must-have on your digital bookshelf. It isn’t exactly a textbook, but a quick digital guide to mathematical equations. The author of this book is Matthias Vallentin. After you finish with the essentials of mathematics, this book will help you quickly connect various theorems and algorithms with their formulae. Since it’s difficult to derive equations instantly, this book helps you quickly navigate to your desired problem and solve it.

Available: Free Download

 

Additional Resources

Bored of reading too much? Here is a list of highly recommended video tutorials and resources on mathematics and statistics. They are FREE to access.

  1. Complete Course on Linear Algebra by MIT
  2. Complete Course on Multivariable Calculus by MIT
  3. Statistical Learning by Stanford University
  4. Mathematics at Khan Academy
  5. Full Cheatsheet on Probability

 

End Notes

The books listed in this article were selected on the basis of their reviews and the depth of topics covered. This is not an exhaustive list of books. But I’ve found it’s almost too easy to get confused when deciding where to begin. In such situations, it is advisable to start with this list.

In this article, I’ve listed some of the most helpful books on statistics and machine learning. People tend to neglect these topics in pursuit of quick success, but that’s not the right way. Hence, if you aim for long-term success in data science, make sure you learn to create stories out of math and statistics.


18 New Must Read Books for Data Scientists on R and Python
Posted by Thang Le Toan on 26 November 2017 01:28 PM

Introduction

“It’s called reading. It’s how people install new software into their brain”

Personally, I haven’t learnt as much from videos and online tutorials as I’ve learnt from books. Even at this very moment, my tiny wooden shelf has enough books to keep me busy this winter.

Understanding machine learning and data science is easy. There are numerous open courses which you can take up right now and get started. But acquiring in-depth knowledge of a subject requires extra effort. For example: you might quickly understand how a random forest works, but understanding the logic behind its working requires extra effort.

The confidence to question the logic comes from reading books. Some people easily accept the status quo. On the other hand, some curious ones challenge it and say, “Why can’t it be done the other way?” That’s where such people discover new ways of executing a task. Almost every data scientist I’ve come across, in person, in AMAs, or in published interviews, has emphasized the indispensable role of books in their lives.

Here is a list of books on doing machine learning / data science in R and Python which I’ve come across in the last year. Since reading is a good habit, with this post I want to pass this habit on to you. For each book, I’ve written a summary to help you judge its relevance. Happy reading!

Disclosure: The Amazon links in this article are affiliate links. If you buy a book through one of these links, we get paid by Amazon. This is one of the ways for us to cover our costs while we continue to create these awesome articles. Further, the list reflects our recommendation based on the content of each book and is in no way influenced by the commission.


 

R for Data Science

Hands-on Programming with R

This book is written by Garrett Grolemund. It is best suited for people new to R. Learning to write functions and loops empowers you to do much more in R than just juggling packages. People think R packages let them avoid writing functions and loops, but that isn’t a sustainable approach. This book introduces you to the details of the R programming environment using interesting projects like weighted dice, playing cards, slot machines, etc. The book’s language is simple to understand, and the examples can be reproduced easily.

Available: Buy Now

 

R for Everyone: Advanced Analytics and Graphics

This book is written by Jared P. Lander. It’s a decent book covering all aspects of data science, such as data visualization, data manipulation and predictive modeling, but not in great depth. Understandably, since it covers a wide breadth of topics, it misses out on the details of each. Precisely, it emphasizes the usage criteria of each algorithm and one example showing its implementation in R. This book should be bought by people who are more inclined toward understanding the practical side of algorithms.

Available: Buy Now

 

R Cookbook

This book is written by Paul Teetor. It comprises several tips and recipes to help people overcome daily struggles in data pre-processing and manipulation. Many a time, we are stuck in a situation where we know very well what needs to be done, but how it needs to be done becomes a mammoth challenge. This book solves that problem. It doesn’t have theoretical explanations of concepts, but focuses on how to use them in R. It covers a wide range of topics such as probability, statistics, time series analysis, data pre-processing, etc.

Available: Buy Now

 

R Graphics Cookbook

This book is written by Winston Chang. Data visualization enables a person to express and analyze their findings using shapes and colors, not just tables. Having a solid understanding of charts, when to use which chart, and how to customize a chart and make it look good is a key skill of a data scientist. This book doesn’t bore you with theoretical knowledge, but focuses on building charts in R using sample data sets. It relies on the ggplot2 package for all visualization activities.

Available: Buy Now

 

Applied Predictive Modeling

This book is written by Max Kuhn and Kjell Johnson. Max Kuhn is none other than the creator of the caret package. It’s one of the best books, comprising a blend of theoretical and practical knowledge. It discusses several crucial machine learning topics such as over-fitting, feature selection, linear and non-linear models, tree methods, etc. Needless to say, it demonstrates all these algorithms using the caret package, one of the most powerful ML packages in the CRAN library.

Available: Buy Now

 

Introduction to Statistical Learning

This book is written by a team of authors including Trevor Hastie and Robert Tibshirani. It is one of the most detailed books on statistical modeling, and it’s available for free. It comprises in-depth explanations of topics such as linear regression, logistic regression, trees, SVM, unsupervised learning, etc. Since it’s an introduction, the explanations are quite easy and any newbie can follow them. Thus, I recommend this book to everyone who is new to machine learning in R. In addition, the several practice exercises in this book are the cherry on top.

Available: Buy Now

 

Elements of Statistical Learning

This book is written by Trevor Hastie, Robert Tibshirani and Jerome Friedman. It is the next part of ‘Introduction to Statistical Learning’. It covers more advanced topics, so I would suggest not jumping to it directly. This book is best suited for people familiar with the basics of machine learning. It talks about shrinkage methods, different linear methods for regression, classification, kernel smoothing, model selection, etc. It’s a must-read book for people who want to understand ML in depth.

Available: Buy Now

 

Machine Learning with R

This book is written by Brett Lantz. I am impressed by the simplicity of this author’s way of explaining concepts. It’s a book on machine learning which is easy to understand, and it provides a lot of knowledge about the practical aspects of algorithms too. Algorithms such as bagging, boosting, SVM, neural networks, clustering, etc. are discussed by solving respective case studies. These case studies will help you understand the real-world usage of these algorithms. In addition, knowledge of ML parameters is also discussed.

Available: Buy Now

 

Mastering Machine Learning with R

This book is written by Cory Lesmeister. It is best suited for everyone who wants to master R for machine learning purposes. It covers (almost) all algorithms and their execution in R. Alongside, this book will introduce you to several R packages used for ML, including the recently launched h2o package. It’s a book which features the latest advancements in ML, hence I’d suggest it be read by every R user. However, you can’t expect to learn advanced ML techniques like stacking from this book.

Available: Buy Now

 

Machine Learning for Hackers

This book is written by Drew Conway and John Myles White. It’s a relatively shorter book than the others, but it aptly brings out the sheer importance of every topic discussed. After reading this book, I realized that the authors’ mindset is not to go deep into a topic, while still making sure to cover the important details. For enhanced understanding, the authors also demonstrate several use cases, and explain the underlying methods while solving them. It’s a good read for everyone who’d like to learn something new about ML.

Available: Buy Now

 

Practical Data Science with R 

This book is written by Nina Zumel and John Mount. As the name suggests, this book focuses on using data science methods in the real world. It’s different in itself: none of the books listed above talk about the real-world challenges in model building and model deployment, but this one does. The authors never lose focus on establishing a connection between the theoretical world of ML and its impact on real-world activities. It’s a must-read for freshers who are yet to enter the analytics industry.

Available: Buy Now

 

Python for Data Science

Mastering Python for Data Science

This book is written by Samir Madhavan. It starts with an introduction to data structures in NumPy and pandas, and provides a useful description of importing data from various sources into these structures. You will learn to perform linear algebra in Python and carry out analysis using inferential statistics. Later, the book moves on to advanced concepts like building a recommendation engine, high-end visualization using Python, ensemble modeling, etc.

Available: Buy Now

 

Python for Data Analysis

Want to get started with data analysis in Python? Get your hands on this data analysis guide by Wes McKinney, the main author of the pandas library. There isn’t any online course as comprehensive as this book. It covers all aspects of data analysis, from manipulating, processing and cleaning to visualizing and crunching data in Python. If you are new to data science in Python, it’s a must-read for you. It’s power-packed with case studies from various domains.

Available: Buy Now

 

Introduction to Machine Learning with Python

This book is written by Andreas Muller and Sarah Guido. It’s meant to help beginners get started with machine learning. It teaches you to build ML models in Python’s scikit-learn from scratch. It assumes no prior knowledge, hence it’s best suited for people with no prior Python or ML experience. In addition, it also covers advanced methods for model evaluation and parameter tuning, methods for working with text data, text-specific processing techniques, etc.

Available: Buy Now

 

Python Machine Learning

This book is written by Sebastian Raschka. It’s one of the most comprehensive books I’ve found on ML in Python. The author explains every crucial detail we need to know about machine learning. He takes a stepwise approach to explaining the concepts, supported by various examples. This book covers topics such as neural networks, clustering, regression, classification, ensembles, etc. It’s a must-read book for everyone keen to master ML in Python.

Available: Buy Now

 

Building Machine Learning Systems with Python

This book is written by Willi Richert and Luis Pedro Coelho. In this book the authors have chosen a path of starting with the basics, explaining concepts through projects, and ending on a high note. Therefore, I’d suggest this book to newbie Python machine learning enthusiasts. It covers topics like image processing, recommendation engines, sentiment analysis, etc. It’s an easy-to-understand and fast-to-implement textbook.

Available: Buy Now

 

Advanced Machine Learning with Python

This book is written by John Hearty. It’s a definite read for every machine learning enthusiast. It lets you rise above the basics of ML techniques and dive into unsupervised methods, deep belief networks, autoencoders, feature engineering techniques, ensembles, etc. It’s definitely a book you’d want to read to improve your rank in machine learning competitions. The author lays equal emphasis on the theoretical as well as practical aspects of machine learning.

Available: Buy Now

 

Programming Collective Intelligence

This book is written by Toby Segaran. With an interesting title, this book is meant to introduce you to several ML algorithms such as SVM, trees, clustering, optimization, etc. using interesting examples and use cases. This book is best suited for people new to ML in Python. Python, known for its incredible ML libraries and support, should make it easy for you to learn these concepts faster. Also, the chapters include practice exercises to help you develop a better understanding.

Available: Buy Now

 

End Notes

The motive of this article is to introduce you to a huge reservoir of knowledge which you may not have noticed yet. These books will not only provide you with boundless knowledge but also enrich you with various perspectives on using ML algorithms. You might feel puzzled at seeing so many books explaining similar concepts; what differentiates them is the case studies and examples discussed.

Trust me, sometimes theoretical explanations become quite difficult to decipher compared to practical cases. That’s how I feel. Learning from these authors’ knowledge is the fastest way to learn from so many people at once.


What is Business Analytics and which tools are used for analysis?
Posted by Thang Le Toan on 26 November 2017 01:26 PM

Business Analytics has become a catch-all term for anything to do with data.

So if you are new to this field and don’t understand what people refer to as “Business Analytics”, don’t worry!

Even after spending more than 6 years in this industry, there are times when it is difficult for me to understand the work a person has done just by reading their CV.

Here is how an excerpt from a typical JD might look:

  • Analyze, prepare reports and present to Leadership team on a defined frequency

or

  • Lead multiple analytical projects and business planning to assist the Leadership team in delivering business performance.

On one hand, this creates confusion in the mind of the person applying for a particular role. On the other hand, it leaves the selectors with the difficult task of understanding and judging what a person has done in the past.

Now, if I got this as the description for one of the jobs I had applied to, I would be scared! Scared, not because I don’t know the subject, but because these phrases could mean anything. The work could range from preparing basic reports at a junior level to performing multivariate deep dives on various subjects.

So, what do you do when you are in such a situation?

Well, the first thing you should do is understand the Business Analytics spectrum. Once you have understood it, ask which part of the spectrum the role applies to, and then decide whether it suits your skills or not.

Following is a good representation of this spectrum:

[Figure: the Business Analytics spectrum]

Let me explain each of these areas in a bit more detail.

 

Reporting – Answer to What happened?

The domain of analytics starts with answering a simple question: what happened? This activity is typically known as reporting. These are typically the MIS reports which people want to receive first thing in the morning, a snapshot of what has happened. Following is an example of how a typical report might look:

[Figure: a sample MIS report in Excel]

Tools used in reporting:

The majority of elementary reporting across the globe happens in MS Excel. More evolved organizations might pull the data from databases using tools like SQL, MS Access or Oracle. But typically, the dissemination of reports happens through Excel.
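As a hedged illustration of this workflow (not a prescribed method), here is a minimal R sketch that pulls a summary from a database with SQL and exports it for spreadsheet dissemination. The database file, table and column names are hypothetical.

  # Pull a daily sales summary from a database, then export it for Excel.
  # "sales.db", the orders table and its columns are invented examples.
  library(DBI)

  con <- dbConnect(RSQLite::SQLite(), "sales.db")
  daily <- dbGetQuery(con, "
    SELECT order_date, SUM(amount) AS total_sales, COUNT(*) AS orders
    FROM orders
    GROUP BY order_date
  ")
  dbDisconnect(con)

  # Disseminate the report in a spreadsheet-friendly format
  write.csv(daily, "daily_sales_report.csv", row.names = FALSE)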

Skills required for reporting:

  • MS excel
  • Business understanding
  • Ability to perform monotonous tasks with diligence

 

Detective Analysis – Answer to Why did it happen?

Detective analysis starts where reporting ends. You start looking for the reasons behind unexpected changes. Typical problems you work on are “Why did sales drop in the last 2 months?” or “Why did the latest campaign under-perform or over-perform?”. In order to answer these questions, you look at past trends or at distribution changes to find the reasons for the changes. However, all of this is backward looking.

Some of the insights you find through this backward-looking analysis can be used for business planning, but the purpose of the analysis is typically to find out what has worked and what has not.

Tools used in detective analysis:

Typically used tools are MS Excel, MS Access, Minitab and R (basic regression). You tend to use advanced Excel and pivot tables while dealing with these problems, and creating time-series graphs typically helps a lot.
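As a minimal, hedged sketch of this kind of analysis in R: plot the trend, then fit a basic regression to quantify it. The monthly revenue data here is simulated purely for illustration.

  # Simulated monthly revenue with a drop in the last 6 months
  set.seed(1)
  sales <- data.frame(
    month   = 1:24,
    revenue = c(rnorm(18, mean = 100, sd = 5), rnorm(6, mean = 85, sd = 5))
  )

  # Time-series graph to eyeball when the change happened
  plot(sales$month, sales$revenue, type = "l",
       xlab = "Month", ylab = "Revenue", main = "Monthly revenue trend")

  # Basic regression to quantify the overall trend
  fit <- lm(revenue ~ month, data = sales)
  summary(fit)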

Skills required for detective analysis:

  • Structured thinking
  • MS Access, Excel, basic regression
  • Business understanding

 

Dashboards – Answer to What’s happening now?

A dashboard is an organized and well-presented summary of key business metrics. Dashboards are usually interactive, so that users can find the exact information they are looking for. In its ideal state, a dashboard provides real-time information about performance. Following is an example of how a dashboard might look:

[Figure: a sample interactive dashboard]

The whole science of creating data models, dashboards and reports based on this data is also known as “Business Intelligence”.

Tools used for creating dashboards:

For limited data sizes, dashboards can be made using advanced Excel. But organizations typically use more advanced tools for the creation and dissemination of dashboards; Business Objects, QlikView and Hyperion are some such software packages.
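For context, here is a minimal, hedged sketch of an interactive dashboard built with R’s shiny package; the data set, region names and metric are invented for illustration.

  # A tiny interactive dashboard: pick a region, see its revenue trend
  library(shiny)

  sales <- data.frame(
    region  = rep(c("North", "South"), each = 12),
    month   = rep(1:12, 2),
    revenue = runif(24, 80, 120)
  )

  ui <- fluidPage(
    titlePanel("Key business metrics"),
    selectInput("region", "Region:", unique(sales$region)),
    plotOutput("trend")
  )

  server <- function(input, output) {
    output$trend <- renderPlot({
      d <- sales[sales$region == input$region, ]
      plot(d$month, d$revenue, type = "b", xlab = "Month", ylab = "Revenue")
    })
  }

  # shinyApp(ui, server)  # uncomment to launch the dashboard locally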

Skills required for creating dashboards:

  • Strong structured thinking: The person will need to create the entire architecture and data model
  • Business Understanding: If you don’t understand what you want to represent, God help you!

 

Predictive Modeling – Answer to What is likely to happen?

This is where you take all your historical trends and information and apply them to predict the future. You try to predict customer behaviour based on past information. Please note that there is a fine difference between forecasting and predictive modeling: forecasting is typically done at an aggregate level, whereas predictive modeling is typically done at a customer / instance level.

Tools used for Predictive modeling:

SAS has the highest market share among tools used for predictive modeling, followed by SPSS, R and MATLAB.
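To make the customer-level distinction above concrete, here is a hedged R sketch of a predictive model scoring individual customers; the data set and column names are simulated for illustration only.

  # Customer-level churn prediction with logistic regression
  set.seed(42)
  customers <- data.frame(
    tenure    = sample(1:60, 500, replace = TRUE),
    purchases = rpois(500, 3),
    churned   = rbinom(500, 1, 0.2)
  )

  model <- glm(churned ~ tenure + purchases,
               data = customers, family = binomial)

  # A probability per customer, not an aggregate forecast
  customers$churn_prob <- predict(model, type = "response")
  head(customers)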

Skills required for Predictive modeling:

  • Strong structured thinking
  • Business Understanding
  • Problem Solving

 

Big data – Answer to What can happen, given the behaviour of the community?

Imagine applying predictive modeling with a microscope in hand. What if you could store, analyze and make sense of every piece of information about the customer? What kinds of social media communities is he attached to? What kinds of searches is he performing? Big data problems arise when data has grown on all three Vs (volume, velocity and variety). You need data scientists to mine this data.

Tools used in Big data:

This is a very dynamic domain right now. A tool which was the market leader 6 months ago may no longer be the best. Hence, it is difficult to pin down specific tools. These tools typically work on top of Hadoop to store the data.

Skills required for harnessing big data:

  • Strong structured thinking
  • Advanced Data Architecture knowledge
  • Ability to work with unstructured data

So, now that you understand the analytics spectrum, if you come across a role which is not clear to you, please spend the necessary time understanding which domain it refers to and whether it fits with what you want to achieve.

If you have come across this confusion in understanding “Business Analytics”, this article should have helped you. In case there is any further confusion, do let me know.


18 Free Exploratory Data Analysis Tools For People who don’t code so well
Posted by Thang Le Toan on 26 November 2017 01:25 PM

Introduction

Some of these tools are even better than programming (R, Python, SAS) tools.

All of us are born with special talents. It’s just a matter of time until we discover it and start believing in ourselves. We all have limitations, but should we stop there? No.

When I started coding in R, I struggled. Sometimes a lot more than one can ever imagine! That’s because I had never coded even a ‘Hello World’ in my entire life. My situation was like that of a guy who didn’t know how to swim but was manhandled into the deep ocean, and who somehow saved himself from drowning but ended up gulping a lot of salty water.

Now when I look back, I laugh at myself. Do you know why? Because I could have chosen one of the several non-coding tools available for data analysis and avoided the suffering.

Data exploration is an inevitable part of predictive modeling. You can’t make predictions unless you know what happened in the past. The most important skill to master data exploration is ‘curiosity’, which is free of cost yet isn’t owned by everyone.

I have written this article to help you get acquainted with the various free tools available for exploratory data analysis. Nowadays, ample tools are available on the market which are free and quite interesting to work with. These tools don’t require you to code explicitly; simple drag-and-drop clicks do the job.

 

List of Non-Programming Tools

1. Excel / Spreadsheets

If you are transitioning into data science or have already survived in it for years, you would know that, even after countless years, Excel remains an indispensable part of the analytics industry. Even today, most of the problems faced in analytics projects are solved using this software. With larger-than-ever community support, tutorials and free resources, learning this tool has become quite easy.

It supports all the important features like summarizing data, visualizing data, data wrangling, etc., which are powerful enough to inspect data from all possible angles. No matter how many tools you know, Excel must feature in your armory. Though Microsoft Excel is paid, you can still try various other spreadsheet tools like OpenOffice and Google Docs, which are certainly worth a try!

Free Download: Click Here

 

2. Trifacta

Trifacta’s Wrangler tool is challenging the traditional methods of data cleaning and manipulation. While Excel has limitations on data size, this tool has no such boundaries and you can securely work on big data sets. It has incredible features such as chart recommendations, inbuilt algorithms and analysis insights, using which you can generate reports in no time. It’s an intelligent tool focused on solving business problems faster, thereby allowing us to be more productive in data-related exercises.

The availability of such open-source tools makes us feel more confident and supported, knowing that there are good people around the world working extremely hard to make our lives better.

Free Download: Click Here

 

3. RapidMiner

This tool emerged as a leader in the 2016 Gartner Magic Quadrant for Advanced Analytics. Yes, it’s more than a data cleaning tool: it extends its expertise to building machine learning models, and it comprises all the ML algorithms which we use frequently. Not just a GUI, it also extends support to people using Python and R for model building.

It continues to fascinate people around the world with its remarkable capabilities. Above all, it claims to provide an analytics experience at lightning-fast speed. Their product line has several products built for big data, visualization and model deployment, some of which (the enterprise editions) require a subscription fee. In short, we can say it’s a complete tool for any business which needs to perform all tasks from data loading to model deployment.

Free Download: Click Here

 

4. Rattle GUI

If you tried using R but couldn’t get the knack of what was going on, Rattle should be your first choice. This GUI is built on R and gets launched by typing install.packages("rattle") followed by library(rattle) and then rattle() in R. Therefore, to use Rattle you must install R. It’s also more than just a data mining tool: Rattle supports various ML algorithms such as trees, SVM, boosting, neural nets, survival models, linear models, etc.
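The launch sequence described above, run from an R console:

  install.packages("rattle")  # one-time installation from CRAN
  library(rattle)             # load the package
  rattle()                    # open the Rattle GUI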

It’s being widely used these days. According to CRAN, Rattle is installed 10,000 times every month. It provides enough options to explore, transform and model data in just a few clicks. However, it has fewer options than SPSS for statistical analysis; then again, SPSS is a paid tool.

Free Download: Click Here

 

5. QlikView

QlikView is one of the most popular tools in the business intelligence industry around the world. Deriving business insights and presenting them in an awesome manner is what this tool does. With its state-of-the-art visualization capabilities, you’d be amazed by the amount of control you get while working on data. It has an inbuilt recommendation engine to update you from time to time about the best visualization methods while working on data sets.

However, it is not statistical software. QlikView is incredible at exploring data, trends and insights, but it can’t prove anything statistically. In that case, you might want to look at other software.

Free Download: Click Here

 

6. Weka

An advantage of using Weka is that it is easy to learn. Being a machine learning tool, its interface is intuitive enough for you to get the job done quickly. It provides options for data pre-processing, classification, regression, clustering, association rules and visualization. Most of the steps you think of while model building can be achieved using Weka. It’s built on Java.

Primarily, it was designed for research purposes at the University of Waikato, but later it got accepted by more and more people around the world. However, over time I haven’t seen as enthusiastic a Weka community as those of R and Python. The tutorial listed below should help you further.

Free Tutorial: Click Here

 

7. KNIME

Similar to RapidMiner, KNIME offers an open-source analytics platform for analyzing data, which can later be deployed and scaled using other supporting KNIME products. This tool has an abundance of features for data blending, visualization and advanced machine learning algorithms. Yes, you can build models with this tool too. Though there hasn’t been enough talk about this tool, considering its state-of-the-art design, I think it will soon catch the limelight it deserves.

Moreover, quick training lessons are available on their website to get you started with this tool right now.

Free Download: Click Here

 

8. Orange

As cool as it sounds, this tool is designed to produce interactive data visualizations and perform data mining tasks. There are enough YouTube tutorials to learn this tool. It has an extensive library of data mining tasks which includes all classification, regression and clustering methods. Along with that, the versatile visualizations formed during data analysis allow us to understand the data more closely.

To build any model, you’ll be required to create a flowchart. This is interesting, as it helps us further understand the exact procedure of the data mining tasks.

Free Download: Click Here

 

9. Tableau Public

Tableau is data visualization software. We can say Tableau and QlikView are the most powerful sharks in the business intelligence ocean, and the comparison of their superiority is never-ending. It’s fast visualization software which lets you explore data, down to every observation, using various possible charts. Its intelligent algorithms figure out on their own the type of data, the best method available, etc.

If you want to understand data in real time, Tableau can get the job done. In a way, Tableau imparts a colorful life to data and lets us share our work with others.

Free Download: Click Here

 

10. Datawrapper

It’s lightning-fast visualization software. Next time someone on your team gets assigned BI work and has no clue what to do, this software is a considerable option. Its visualization bucket comprises line charts, bar charts, column charts, pie charts, stacked bar charts and maps. So, it’s a basic tool and can’t be compared with giants like Tableau and QlikView. It is browser-based and doesn’t require any software installation.

 

11. Data Science Studio (DSS)

It is a powerful tool designed to connect technology, business and data. It is available in two editions: coding and non-coding. It’s a complete package for any organization which aims to develop, build, deploy and scale models on a network. DSS is also powerful enough to create smart data applications to solve real-world problems. It comprises features which facilitate team collaboration on projects. Among all the features, the most interesting part is that you can reproduce your work in DSS, as every action in the system is versioned through an integrated Git repository.

Free Download: Click Here

 

12. OpenRefine

It started as Google Refine, but it looks like Google dropped the project for unclear reasons. However, the tool is still available, renamed OpenRefine. Among the generous list of open-source tools, OpenRefine specializes in messy data: cleaning, transforming and shaping it for predictive modeling purposes. As an interesting fact, during model building, 80% of an analyst’s time is spent on data cleaning. Not so pleasant, but it’s a fact. Using OpenRefine, analysts can not only save that time but put it to use for productive work.

Free Download: Click Here

 

13. Talend

Decision making these days is largely driven by data. Managers and professionals no longer make gut-based decisions; they require a tool which can help them quickly. Talend can help them explore data and support their decision making. Precisely, it’s a data collaboration tool capable of cleaning, transforming and visualizing data.

Moreover, it also offers an interesting automation feature where you can save and redo a previous task on a new data set. This feature is unique and isn’t found in many tools. Also, it performs auto-discovery and provides smart suggestions to the user for enhanced data analysis.

Free Download: Click Here

 

14. Data Preparator

This tool is built on Java to assist us in data exploration, cleaning and analysis. It includes various inbuilt packages for discretization, numeration, scaling, attribute selection, missing values, outliers, statistics, visualization, balancing, sampling, row selection, and several other tasks. Its GUI is intuitive and simple to understand. Once you start working with it, I’m sure you won’t take long to figure out how it works.

A unique advantage of this tool is that the data set used for analysis doesn’t get stored in the computer’s memory. This means you can work on large data sets without any speed or memory troubles.

Free Download: Click Here

 

15. DataCracker

It’s data analysis software which specializes in survey data. Many companies run surveys but struggle to analyze the results statistically. Survey data is never clean; it comprises lots of missing and inappropriate values. This tool reduces our agony and enhances our experience of working with messy data. It is designed to load data from all major internet survey programs like SurveyMonkey, SurveyGizmo, etc. There are several interactive features which help us understand the data better.

Free Download: Click Here

 

16. Data Applied

This powerful interactive tool is designed to build, share and design data analysis reports. Creating visualizations of large data sets can sometimes be troublesome, but this tool is robust at visualizing large amounts of data using tree maps. Like all the other tools above, it has features for data transformation, statistical analysis, anomaly detection, etc. All in all, it’s a multi-usage data mining tool capable of automatically extracting valuable knowledge (signal) from raw data. You’d be amazed to see that such non-programming tools are no less than R or Python for data analysis.

Free Download: Click Here

 

17. Tanagra Project

You might not like it because of its old-fashioned UI, but this free data mining software is designed to build machine learning models. The Tanagra project started as free software for academic and research purposes. Being an open-source project, it gives you enough space to devise your own algorithms and contribute.

Along with supervised learning algorithms, it is equipped with paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms. Some of its limitations include the unavailability of a wide set of data sources, direct access to data warehouses and databases, data cleansing, and interactive use.

Free Download: Click Here

 

18. H2O

H2O is one of the most popular software packages in the analytics industry today. In a few years, this organization has succeeded in evangelizing the analytics community around the world. With this open-source software, they bring a lightning-fast analytics experience, which is further extended with APIs for programming languages. Not just data analysis: you can build advanced machine learning models in no time. The community support is great, hence learning this tool isn’t a worry. If you live in the US, chances are they are organizing a meetup near you. Do drop by!

Free Download: Click Here

 

Bonus Additions:

In addition to the awesome tools above, I also found some more tools which I thought you might be interested to look at. These tools aren’t free, but you can still try them on a trial basis:

  1. Data Kleenr
  2. Data Ladder
  3. Data Cleaner
  4. WinPure

 

End Notes

Once you start working with these tools (your choice), you’ll understand that knowing programming isn’t a big advantage for predictive modeling; you can accomplish the same thing with these open-source tools. Therefore, if until now you were disappointed by your lack of coding skills, now is the time to channel your enthusiasm into these tools. You may be interested to check out 19 Data Science Tools for Non-Coders.

The only limitation I see with these tools (some of them) is the lack of community support. Except for a few, several of them don’t have a community to turn to for help and suggestions. Still, they’re worth a try!


19 Data Science Tools for people who aren’t so good at Programming
Posted by Thang Le Toan on 26 November 2017 01:20 PM

Introduction

Programming is an integral part of data science. Among other things, it is considered that a mind which understands programming logic, loops and functions has a higher chance of becoming a successful data scientist. So, what about people who never studied programming in school or college?

Are they doomed to have an unsuccessful career in data science?

I’m sure there are countless people who want to enter the data science domain but don’t understand coding very well. In fact, I too was a member of the non-programming league until I joined my first job. Therefore, I understand how terrible it feels when something you have never learnt haunts you at every step.

The good news is, I found a way! Rather, I’ve found 19 ways in which you can ignite your appetite to learn data science without coding. These tools typically obviate the programming aspect and provide a user-friendly GUI (Graphical User Interface), so that anyone with minimal knowledge of algorithms can simply use them to build predictive models.

Many companies (especially startups) have recently launched GUI-driven data science tools. I’ve covered most of the tools available in the industry today. Also, I’ve added some videos to enhance your learning experience.

 

Note: All the information provided has been gathered from open-source information sources. We are just presenting some facts and not opinions. In no manner do we intend to promote/advertise any of the products/services.


 

List of Tools

1. RapidMiner

RapidMiner (RM) was originally started in 2006 as open-source stand-alone software named Rapid-I. Over the years, it has been renamed RapidMiner and has attained ~35Mn USD in funding. The tool is open-source for old versions (below v6), but the latest versions come with a 14-day trial period and are licensed after that.

RM covers the entire life-cycle of prediction modeling, starting from data preparation to model building and finally validation and deployment. The GUI is based on a block-diagram approach, something very similar to Matlab Simulink. There are predefined blocks which act as plug and play devices. You just have to connect them in the right manner and a large variety of algorithms can be run without a single line of code. On top of this, they allow custom R and Python scripts to be integrated into the system.

Their current product offerings include the following:

  1. RapidMiner Studio: A stand-alone software which can be used for data preparation, visualization and statistical modeling
  2. RapidMiner Server: It is an enterprise-grade environment with central repositories which allow easy team work, project management and model deployment
  3. RapidMiner Radoop: Implements big-data analytics capabilities centered around Hadoop
  4. RapidMiner Cloud: A cloud-based repository which allows easy sharing of information among various devices

RM is currently being used in various industries including automotive, banking, insurance, life sciences, manufacturing, oil and gas, retail, telecommunications and utilities.

 

2. DataRobot

DataRobot (DR) is a highly automated machine learning platform built by some of the all-time best Kagglers, including Jeremy Achin, Tom de Godoy and Owen Zhang. Their platform claims to have obviated the need for data scientists. This is evident from a phrase on their website: “Data science requires math and stats aptitude, programming skills, and business knowledge. With DataRobot, you bring the business knowledge and data, and our cutting-edge automation takes care of the rest.”

DR proclaims to have the following benefits:

  • Model Optimization
    • Platform automatically detects the best data pre-processing and feature engineering by employing text mining, variable type detection, encoding, imputation, scaling, transformation, etc.
    • Hyper-parameters are automatically chosen depending on the error-metric and the validation set score
  • Parallel Processing
    • Computation is divided over thousands of multi-core servers
    • Uses distributed algorithms to scale to large data sets
  • Deployment
    • Easy deployment facilities with just a few clicks (no need to write any new code)
  • For Software Engineers
    • Python SDK and APIs available for quick integration of models into tools and software.

With funding of ~60Mn USD and more than 100 employees, DR looks in good shape for the future.

 

3. BigML

BigML is another platform with funding in the millions of USD. It provides a good GUI which takes the user through the following 6 steps:

  • Sources: use various sources of information
  • Datasets: use the defined sources to create a dataset
  • Models: make predictive models
  • Predictions: generate predictions based on the model
  • Ensembles: create ensemble of various models
  • Evaluation: evaluate models against validation sets

These processes will obviously iterate in different orders. The BigML platform provides nice visualization of results and has algorithms for solving classification, regression, clustering, anomaly detection and association discovery problems. You can get a feel of how their interface works using their YouTube channel.

 

4. Google Cloud Prediction API

 

The Google Cloud Prediction API offers RESTful APIs for building machine learning models for Android applications. This platform is specifically for mobile applications based on the Android OS. Some of the use cases include:

  • Recommendation Engine: Given a user’s past viewing habits, predict what other movies or products a user might like.
  • Spam Detection: Categorize emails as spam or non-spam.
  • Sentiment Analysis: Analyze posted comments about your product to determine whether they have a positive or negative tone.
  • Purchase Prediction: Guess how much a user might spend on a given day, given his spending history.

Though the API can be used by any system, there are also specific Google API client libraries built for better performance and security. These exist for various programming languages: Python, Go, Java, JavaScript, .NET, Node.js, Objective-C, PHP and Ruby.

 

5. Paxata

Paxata is one of the few organizations which focus on data cleaning and preparation, NOT the machine learning or statistical modeling part. It is an MS Excel-like application that is easy to use, with visual guidance making it easy to bring together data, find and fix dirty or missing data, and share and re-use data projects across teams. Like the others mentioned here, Paxata eliminates coding and scripting, thereby overcoming the technical barriers involved in handling data.

The Paxata platform follows this process:

  1. Add Data: use a wide range of sources to acquire data
  2. Explore: perform data exploration using powerful visuals allowing the user to easily identify gaps in data
  3. Clean+Change: perform data cleaning using steps like imputation, normalization of similar values using NLP, detecting duplicates
  4. Shape: make pivots on data, perform grouping and aggregation
  5. Share+Govern: allows sharing and collaborating across teams with strong authentication and authorization in place
  6. Combine: a proprietary technology called SmartFusion allows combining data frames with 1 click as it automatically detects the best combination possible; multiple data sets can be combined into a single AnswerSet
  7. BI Tools: allows easy visualization of the final AnswerSet in commonly used BI tools; also allows easy iterations between data preprocessing and visualization

With funding of ~25Mn USD, Paxata has set its foot in the financial services, consumer goods and networking domains. It might be a good tool to use if your work requires extensive data cleaning.

 

6. Trifacta

Trifacta is another startup focused on data preparation. It has 2 product offerings:

  • Wrangler – a free stand-alone software
  • Wrangler Enterprise – licensed professional version

Trifacta offers a very intuitive GUI for performing data cleaning. It takes data as input and provides a summary with various statistics by column. Also, for each column it automatically recommends some transformations which can be selected using a single click. Various transformations can be performed on the data using some pre-defined functions which can be called easily in the interface.

The Trifacta platform uses the following steps of data preparation:

  1. Discovering: this involves getting a first look at the data and distributions to get a quick sense of what you have
  2. Structure: this involves assigning proper shape and variable types to the data and resolving anomalies
  3. Cleaning: this step includes processes like imputation, text standardization, etc. which are required to make the data model ready
  4. Enriching: this step helps in improving the quality of analysis that can be done by either adding data from more sources or performing some feature engineering on existing data
  5. Validating: this step performs final sense checks on the data
  6. Publishing: finally the data is exported for further use

With ~75Mn USD in funding, Trifacta is currently being used in the financial, life sciences and telecommunications industries.

 

7. Narrative Science

Narrative Science is based on a unique idea: it generates automated reports from data. It works as a data storytelling tool which uses advanced natural language processing to create reports, something similar to a consulting report.

Some of the features of this platform include:

  • incorporates specific statistics and past data of the organization
  • makes use of the benchmarks, drivers and trends of the specific domain
  • it can help generate personalized reports targeted at specific audiences

With ~30Mn USD in funding, Narrative Science is currently being used in financial, insurance, government and e-commerce domains. Some of its customers include American Century Investments, PayScale, MasterCard, Forbes, Deloitte, etc.

Having discussed some startups in this domain, let’s move on to some of the academic initiatives which are trying to automate various aspects of data science. These have the potential to turn into successful enterprises in the future.

 

8. MLBase

MLBase is an open-source project developed by AMP (Algorithms Machines People) Lab at University of California, Berkeley. The core idea is to provide an easy solution for applying machine learning to large scale problems.

It has 3 offerings:

  1. MLlib: It works as the core distributed ML library in Apache Spark. It was originally developed as part of the MLBase project, but now the Spark community supports it
  2. MLI: An experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
  3. ML Optimizer: This layer aims to automate the task of ML pipeline construction. The optimizer solves a search problem over the feature extractors and ML algorithms included in MLI and MLlib.

This undertaking is still under active development and we should hear about the developments in the near future.

 

9. WEKA

Weka is data mining software written in Java, developed by the Machine Learning Group at the University of Waikato, New Zealand. It is a GUI-based tool which is very good for beginners in data science, and the best part is that it is open-source. You can learn about it using the MOOC offered by the University of Waikato here, and you can read more about it in this article.

Though Weka is currently used more in the academic community, it might be the stepping stone to something big coming up in the future.

 

10. Automatic Statistician


Automatic Statistician is not a product per se but a research organization which is creating a data exploration and analysis tool. It can take in various kinds of data and use natural language processing to generate a detailed report. It is being developed by researchers who have worked at Cambridge and MIT, and it won Google’s Focused Research Award with a prize of $750,000. Though it is still under development and very little information is available about the project, it looks like it is being backed by Google. You can find some information here.

 

More Tools

I have discussed a selected set of 10 examples above, but there are many more like these. I'll briefly name a few of them here; you can explore further if this isn't enough to whet your appetite:

  • MarketSwitch – this tool is more focused on optimization than on predictive analytics
  • algorithms.io – this tool works in the IoT (Internet of Things) domain and performs analytics on data from connected devices
  • wise.io – this tool is focused on customer handling and ticket-system analytics
  • Predixion – another tool that works on data collected from connected devices
  • Logical Glue – another GUI-based machine learning platform that takes you from raw data to deployment
  • Pure Predictive – this tool uses a patented artificial intelligence system that removes the need for data preparation and model tuning; it uses AI to combine thousands of models into what they call “supermodels”
  • DataRPM – another tool for building predictive models using a GUI, with no coding required
  • ForecastThis – another proprietary technology focused on machine learning through a GUI
  • FeatureLab – allows easy predictive modeling and deployment through a GUI

If you're hearing these names for the first time, you'll be surprised (like I was :D) that so many tools exist. The good thing is that they haven't had a disruptive impact as of now. But the real question is: will these technologies achieve their goals? Only time will tell!

 

End Notes

In this article, we discussed various initiatives working towards automating different aspects of solving a data science problem. Some are at a nascent research stage, some are open-source, and others are already used in industry with millions in funding. All of them pose a potential threat to the job of the data scientist, even though demand for the role is expected to grow in the near future. These tools are best suited for people who abhor programming and coding.

Do you know any other startups or initiatives working in this domain? Please feel free to drop a comment below and enlighten us!





Nov
26
Comprehensive guide for Data Exploration in R
Posted by Thang Le Toan on 26 November 2017 01:16 PM

So far, we have covered detailed tutorials on data exploration using SAS and Python. What is the one piece missing to complete this series? I am sure you guessed it right. In this article, I will give a detailed tutorial on data exploration using R. For the reader's ease, I will follow a format very similar to the one used in the Python tutorial, simply because of the sheer resemblance between the two languages.

Here are the operations I'll cover in this article (refer to this article for similar operations in SAS):

  1. How to load data file(s)?
  2. How to convert a variable to a different data type?
  3. How to transpose a table?
  4. How to sort data?
  5. How to create plots (histogram, scatter, box plot)?
  6. How to generate frequency tables?
  7. How to sample a data set?
  8. How to remove duplicate values of a variable?
  9. How to group variables to calculate count, average, sum?
  10. How to recognize and treat missing values and outliers?
  11. How to merge / join data sets effectively?

 

Part 1: How to load data file(s)?

Input data sets come in various formats (.xls, .txt, .csv, .json). In R, it is easy to load data from any source, thanks to its simple syntax and the availability of predefined libraries. Here, I will take the examples of reading a CSV file and a tab-separated file. read.table is also an alternative; however, read.csv is my preference, given its simplicity.


Code:

# Read a CSV file into R
MyData <- read.csv(file = "c:/TheDataIWantToReadIn.csv", header = TRUE, sep = ",")
# Read a tab-separated file
TabSeparated <- read.table("c:/TheDataIWantToReadIn.txt", sep = "\t", header = TRUE)

 

All other read commands follow a pattern similar to the ones above.

 

Part 2: How to convert a variable to different data type?

Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character.

Use is.xyz() to test whether an object is of type xyz (it returns TRUE or FALSE), and as.xyz() to explicitly convert it.

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
 as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
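
For illustration, here is a minimal, runnable sketch of testing and converting types (the vector x below is a made-up example):

x <- c("1", "2", "3")      # a character vector
is.numeric(x)              # FALSE
x_num <- as.numeric(x)     # explicit conversion to numeric
is.numeric(x_num)          # TRUE
as.character(42)           # back the other way: "42"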

However, the conversion of data structures is more critical than the format transformation. Here is a grid that will guide you through these conversions:

[Figure: data structure conversion grid]

 

Part 3: How to transpose a data set?

It is also sometimes necessary to reshape a data set from a wide structure to a long (narrow) structure. Here is the code to do so:

[Figure: wide-to-long reshaping]

Code

# example of the melt function from the reshape package
library(reshape)
# mydata is assumed to be in wide format, with key columns id and time
mdata <- melt(mydata, id = c("id", "time"))

 

Part 4: How to sort a data frame?

Data can be sorted by using order(variable) as an index. Sorting can be based on multiple variables, in either ascending or descending order.

Code

# sort by var1
newdata <- old[order(old$var1), ]
# sort by var1 (ascending) and var2 (descending)
newdata2 <- old[order(old$var1, -old$var2), ]
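
As a concrete, runnable illustration on the built-in iris data set (chosen here purely for demonstration):

# sort iris by Sepal.Length ascending, then Sepal.Width descending
sorted <- iris[order(iris$Sepal.Length, -iris$Sepal.Width), ]
head(sorted)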

 

Part 5: How to create plots (Histogram)?

Data visualization in R is very easy and produces extremely pretty graphs. Here I will create a distribution of scores in a class and then plot histograms with several variations.

score <- rnorm(n = 1000, mean = 80, sd = 20)
hist(score)

[Figure: default histogram of score]

Let’s try to find the assumptions R takes to plot this histogram, and then modify a few of those assumptions.

histinfo <- hist(score)
histinfo
$breaks
 [1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150
$counts
 [1] 2 5 19 52 84 141 195 201 152 81 39 25 3 1
$density
 [1] 0.0002 0.0005 0.0019 0.0052 0.0084 0.0141 0.0195 0.0201 0.0152
[10] 0.0081 0.0039 0.0025 0.0003 0.0001
$mids
 [1] 15 25 35 45 55 65 75 85 95 105 115 125 135 145
$xname
[1] "score"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"

As you can see, break points are applied at multiple places. We can restrict the number of break points or vary the density. Over and above this, we can colour the bars and overlay a normal distribution curve. Here is how you can do all of this:

hist(score, freq=FALSE, xlab="Score", main="Distribution of score", col="lightgreen", xlim=c(0,150), ylim=c(0, 0.02))
curve(dnorm(x, mean=mean(score), sd=sd(score)), add=TRUE, col="darkblue", lwd=2)


[Figure: histogram on the density scale with a normal curve overlaid]

 

 

Part 6: How to generate frequency tables with R?

Frequency tables are the most basic and effective way to understand distribution across categories.

Here is a simple example of calculating a one-way frequency table:

# no need to attach(iris): the $ accessor already references the data set
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50

 

And here is code that computes a cross-tabulation between two categorical variables:

# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
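
If you would rather avoid the gmodels dependency, base R's table() handles two-way tabulation as well; a minimal sketch on iris, where the 5.8 threshold is an arbitrary example:

# two-way frequency table in base R
table(iris$Species, iris$Sepal.Length > 5.8)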

 

Part 7: How to sample a data set in R?

To sample a data set in R, we first need to generate a few random indices. Here is how you can draw a random sample:

mysample <- mydata[sample(1:nrow(mydata), 100, replace = FALSE), ]

This code draws a random sample of 100 rows, without replacement, from the data frame mydata.
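
If you need the sample to be reproducible (an assumption about your workflow, not a requirement of the code above), set a seed first:

set.seed(42)  # any fixed value makes the draw repeatable
mysample <- mydata[sample(1:nrow(mydata), 100, replace = FALSE), ]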

 

Part 8: How to remove duplicate values of a variable?

Removing duplicates in R is extremely simple. Here is how you do it:

> set.seed(150)
> x <- round(rnorm(20, 10, 5))
> x
 [1] 2 10 6 8 9 11 14 12 11 6 10 0 10 7 7 20 11 17 12 -1
> unique(x)
 [1] 2 10 6 8 9 11 14 12 0 7 20 17 -1
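
unique() above works on a vector; for whole data frames the same idea goes through duplicated(). A minimal sketch, with df as a made-up example:

# remove duplicate rows from a data frame, keeping the first occurrence
df <- data.frame(id = c(1, 1, 2, 3, 3), val = c("a", "a", "b", "c", "c"))
df_unique <- df[!duplicated(df), ]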

 

Part 9: How to find class-level count, average and sum in R?

We generally use the apply family of functions for these jobs.

> tapply(iris$Sepal.Length,iris$Species,sum)
 setosa versicolor virginica 
 250.3 296.8 329.4 
> tapply(iris$Sepal.Length,iris$Species,mean)
 setosa versicolor virginica 
 5.006 5.936 6.588
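
If you prefer the result as a data frame, aggregate() is an alternative worth knowing (added here for illustration):

# count, average and sum of Sepal.Length by Species
aggregate(Sepal.Length ~ Species, data = iris, FUN = length)  # count
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)    # average
aggregate(Sepal.Length ~ Species, data = iris, FUN = sum)     # sum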

 

Part 10: How to recognize and treat missing values and outliers?

Identifying missing values can be done as follows:

> y <- c(4,5,6,NA)
> is.na(y)
[1] FALSE FALSE FALSE TRUE

And here is a quick fix for the same:

y[is.na(y)] <- mean(y, na.rm = TRUE)
y

[1] 4 5 6 5

As you can see, the missing value has been imputed with the mean of the other numbers. Similarly, we can impute missing values with whatever value is most appropriate.
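
The heading also promises outliers. One common quick check, offered as a minimal sketch rather than the only approach, uses the 1.5 * IQR rule via boxplot.stats():

# flag values beyond the boxplot whiskers as outliers
z <- c(rnorm(100), 15)        # a sample with one obvious outlier
out <- boxplot.stats(z)$out   # values outside 1.5 * IQR of the quartiles
z_clean <- z[!z %in% out]     # one simple treatment: drop them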

 

Part 11: How to merge / join data sets?

This is yet another operation we use in our daily work.

To merge two data frames (data sets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).

# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by = "ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by = c("ID", "Country"))
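
merge() performs an inner join by default; if you also need left or full outer joins (an assumption about your use case), the all arguments cover those:

# left join: keep every row of dataframeA
left <- merge(dataframeA, dataframeB, by = "ID", all.x = TRUE)
# full outer join: keep every row of both
full <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)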

Appending data sets is another operation that is used very frequently. To join two data frames (data sets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.

total <- rbind(dataframeA, dataframeB)

 

End Notes:

In this comprehensive guide, we looked at R code for various steps in data exploration and munging. This tutorial, along with the ones available for Python and SAS, will give you comprehensive exposure to the most important languages of the analytics industry.

Did you find the article useful? Do let us know your thoughts about this guide in the comments section below.





