Nov 26 
Free Must Read Books on Statistics & Mathematics for Data Science
Posted by Thang Le Toan on 26 November 2017 01:29 PM

Introduction
The selection process for data scientists at Google gives higher priority to candidates with a strong background in statistics and mathematics. Not just Google: other top companies (Amazon, Airbnb, Uber, etc.) also prefer candidates with strong fundamentals over mere know-how in data science. If you aspire to work for such companies, it is essential to develop a mathematical understanding of data science. Data science is simply the evolved version of statistics and mathematics, combined with programming and business logic. I’ve met many data scientists who struggle to explain predictive models statistically. Beyond reporting accuracy, it is important to understand and interpret every metric and calculation behind that accuracy. Remember, every single ‘variable’ has a story to tell. So, if nothing else, try to become a great story explorer! In this article, I’ve compiled a list of must-read books on statistics and mathematics. Mathematics has no limits, so I’ve listed only those books which will help you connect better with data science. Note: Books which are free to access from their official sources are marked as such in this article. Otherwise, a link to the Amazon bookstore is provided.
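To make the point about looking behind accuracy concrete, here is a minimal sketch in plain Python. The churn labels and the "always predict no churn" model are made up for illustration and do not come from any of the books below.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def summarize(actual, predicted):
    """Return accuracy, precision and recall computed from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing is predicted/actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical data: 10 customers, of whom only 2 actually churned (label 1).
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A lazy "model" that always predicts "no churn".
predicted = [0] * 10

metrics = summarize(actual, predicted)
print(metrics)
```

The accuracy looks respectable, yet recall shows that the model never finds a single churner. That gap is the kind of story a single headline number hides.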
Statistics
Introduction to Statistical Learning
This is a highly recommended book for practicing data scientists. Its focus is on connecting statistical concepts with machine learning, so you’ll learn about all the popular supervised and unsupervised machine learning algorithms. R users will have an advantage, since the practical aspects of the algorithms are demonstrated in R. In addition to theory, the book also emphasizes using ML algorithms in real-life settings. Available: Free Download
Elements of Statistical Learning
This book is a more advanced companion to the previous one. It is written by Trevor Hastie, Robert Tibshirani and Jerome Friedman, professors at Stanford University. While ‘Introduction to Statistical Learning’ covers the basics of statistics and machine learning, this book introduces you to higher-level algorithms such as neural networks, bagging and boosting, kernel methods, etc. The algorithms have been implemented in R. Available: Free Download
Think Stats
The author of this book is Allen B. Downey. It is about performing statistical analysis practically in Python, so make sure you have some basic knowledge of Python before buying it. It focuses entirely on understanding the real-life influence of statistics through popular case studies. Since stats and math are closely connected, it also has dedicated chapters on topics like Bayesian estimation. Available: Buy from Amazon
From Algorithms to Z-Scores
Did you know about the crucial role of statistics in programming? The author of this book is Norm Matloff, Professor, University of California. The book explains how to use probabilistic concepts and statistical measures in R. Again, it is a good practice source for R users. It teaches the art of working with probabilistic models and choosing the best one for final evaluation. It is a highly recommended book (especially for R users). Available: Free Download
Introduction to Bayesian Statistics
This is a highly recommended book for newcomers to data science. The author is William M. Bolstad. It’s a must-read for people who find mathematics boring. Written in a conversational style (rare for a math book), it is a great introductory resource on statistics. It begins with scientific methods of data gathering and ends with dedicated chapters on Bayesian statistics. Available: Free Download
Discovering Statistics Using R
This book is written by Andy Field, Jeremy Miles and Zoe Field. I would highly recommend it to newbies in data science. It is a great place to start with statistics, and it covers its topics in depth. In addition, the statistical concepts are explained in conjunction with R, which makes the book even more useful. It offers a step-by-step understanding, supported by interesting practice examples. Available: Buy on Amazon
Mathematics
Introduction to Linear Algebra
This is one of the most recommended books on linear algebra. The author is Gilbert Strang, Professor, MIT. Strang’s unique way of delivering knowledge gives you the intuition and excitement to move forward after every chapter. This book will help you build a strong mathematical foundation for machine learning. It covers all the necessary topics, such as vectors, linear equations, determinants, eigenvalues and matrix factorization, in great depth. Available: Buy on Amazon
Matrix Computations
Matrices and data frames are essential components of machine learning. The authors of this book are Gene H. Golub and Charles F. Van Loan. It gives students a good head start on the concepts of matrix computations. The authors cover most of the important topics, such as Gaussian elimination, matrix factorization, the Lanczos method and error analysis. Every chapter is supported by intuitive practice problems. The pseudocode is available in Matlab. Available: Free Download
A Probabilistic Theory of Pattern Recognition
This is a complete resource for learning the application of mathematics, and a must-read for intermediate and advanced practitioners in machine learning. The book is written by Luc Devroye, Laszlo Gyorfi and Gabor Lugosi. It covers a wide range of topics, from Bayes error and linear discrimination to epsilon entropy and neural networks. It provides convincing explanations of complex theorems, with section-wise practice problems. Available: Free Download
Introduction to the Math of Neural Networks
If you have an innate interest in learning about neural networks, this should be your place to start. The author is Jeff Heaton, who has beautifully simplified the difficult concepts of neural networks. The book introduces you to the basics of the underlying math. It assumes the reader has prior knowledge of algebra, calculus and programming, and it demonstrates various mathematical tools which can be applied to neural networks. Available: Buy on Amazon
Advanced Engineering Mathematics
This is probably the most comprehensive book available on mathematics for machine learning users. The author is Erwin Kreyszig. The book is highly recommended to college students as well. If you haven’t been good at maths until now, follow this book religiously and you should see significant improvements in your understanding. Along with derivations and practice examples, it has dedicated sections on calculus, algebra, probability, etc. It is definitely a must-read for practitioners at all levels in data science. Available: Free Download
Cookbook on Probability and Statistics
This cookbook is a must-have for your digital bookshelf. It isn’t exactly a textbook, but a quick digital guide to mathematical equations. The author is Matthias Vallentin. After you finish with the essentials of mathematics, this book will help you quickly connect various theorems and algorithms with their formulae. Deriving equations on the spot is difficult, and this book will help you navigate quickly to the problem you want to solve. Available: Free Download
Additional Resources
Bored of reading too much? Here is a list of highly recommended (video) tutorials and resources on mathematics and statistics. They are FREE to access.
End Notes
The books listed in this article were selected on the basis of their reviews and the depth of topics covered. This is not an exhaustive list. But I’ve found it’s almost too easy to get confused while deciding where to begin, and in such situations it is advisable to start with this list. In this article, I’ve listed some of the most helpful books on statistics and machine learning. People tend to neglect these topics in pursuit of quick success, but that’s not the right way. If you aim for long-term success in data science, make sure you learn to create stories out of maths and statistics.
Nov 26 
18 New Must Read Books for Data Scientists on R and Python
Posted by Thang Le Toan on 26 November 2017 01:28 PM

Introduction
“It’s called reading. It’s how people install new software into their brain.” Personally, I haven’t learnt as much from videos and online tutorials as I have from books. Even now, my tiny wooden shelf has enough books to keep me busy this winter. Getting started with machine learning and data science is easy: there are numerous open courses which you can take up right now. But acquiring in-depth knowledge of a subject requires extra effort. For example, you might quickly understand how a random forest works, but understanding the logic behind it requires extra effort. The confidence to question the logic comes from reading books. Some people easily accept the status quo. Curious ones, on the other hand, challenge it and ask, “Why can’t it be done the other way?” That’s how such people discover new ways of executing a task. Almost every data scientist I’ve come across, in person, in AMAs or in published interviews, has emphasized the essential role of books in their lives. Here is a list of books on doing machine learning and data science in R and Python which I’ve come across in the last year. Since reading is a good habit, with this post I want to pass this habit on to you. For each book, I’ve written a summary to help you judge its relevance. Happy reading! Disclosure: The Amazon links in this article are affiliate links. If you buy a book through one of these links, we get paid through Amazon. This is one of the ways we cover our costs while we continue to create these articles. Further, the list reflects our recommendations based on the content of each book and is in no way influenced by the commission.
R for Data Science
Hands-On Programming with R
This book is written by Garrett Grolemund. It is best suited for people new to R. Learning to write functions and loops empowers you to do much more in R than just juggling packages. People think R packages let them avoid writing functions and loops, but that isn’t a sustainable approach. This book introduces you to the details of the R programming environment using interesting projects such as weighted dice, playing cards and a slot machine. The language is simple to understand and the examples can be reproduced easily. Available: Buy Now
R for Everyone: Advanced Analytics and Graphics
This book is written by Jared P. Lander. It’s a decent book covering all aspects of data science, such as data visualization, data manipulation and predictive modeling, but not in much depth. In other words, it covers a wide breadth of topics and leaves out the details of each. Specifically, it emphasizes when to use each algorithm and gives one example of its implementation in R. This book should be bought by people who are more inclined towards understanding the practical side of algorithms. Available: Buy Now
R Cookbook
This book is written by Paul Teetor. It comprises several tips and recipes to help people overcome daily struggles in data preprocessing and manipulation. Many times we are stuck in a situation where we know very well what needs to be done, but how to do it becomes a mammoth challenge. This book solves that problem. It doesn’t give theoretical explanations of concepts, but focuses on how to use them in R. It covers a wide range of topics such as probability, statistics, time series analysis and data preprocessing. Available: Buy Now
R Graphics Cookbook
Available: Buy Now
Applied Predictive Modeling
This book is written by Max Kuhn and Kjell Johnson. Max Kuhn is also the creator of the caret package. It’s one of the best books blending theoretical and practical knowledge. It discusses several crucial machine learning topics such as overfitting, feature selection, linear and nonlinear models and tree methods. Needless to say, it demonstrates all these algorithms using the caret package, one of the most powerful ML packages contributed to CRAN. Available: Buy Now
Introduction to Statistical Learning
This book is written by a team of authors including Trevor Hastie and Robert Tibshirani. It is one of the most detailed books on statistical modeling, and it’s also available for free. It comprises in-depth explanations of topics such as linear regression, logistic regression, trees, SVM and unsupervised learning. Since it’s an introduction, the explanations are quite easy and any newbie can follow them. Thus, I recommend this book to everyone who is new to machine learning in R. In addition, the practice exercises in this book are the cherry on top. Available: Buy Now
Elements of Statistical Learning
This book is written by Trevor Hastie, Robert Tibshirani and Jerome Friedman. It is the next step after ‘Introduction to Statistical Learning’. It comprises more advanced topics, so I would suggest you not jump to it directly. This book is best suited for people familiar with the basics of machine learning. It talks about shrinkage methods, different linear methods for regression and classification, kernel smoothing, model selection, etc. It’s a must-read for people who want to understand ML in depth. Available: Buy Now
Machine Learning with R
This book is written by Brett Lantz. I am impressed by the simplicity with which the author explains concepts. It’s an easy-to-understand book on machine learning that also teaches you a lot about the practical aspects of the algorithms. Algorithms such as bagging, boosting, SVM, neural networks and clustering are discussed by working through case studies, which will help you understand their real-world usage. In addition, ML parameters are also discussed. Available: Buy Now
Mastering Machine Learning with R
This book is written by Cory Lesmeister. It is best suited for anyone who wants to master R for machine learning purposes. It covers (almost) all algorithms and their execution in R. Alongside, this book will introduce you to several R packages used for ML, including the recently launched H2O package. It features the latest advancements in ML, hence I’d suggest that every R user read it. However, you can’t expect to learn advanced ML techniques like stacking from this book. Available: Buy Now
Machine Learning for Hackers
This book is written by Drew Conway and John Myles White. It’s shorter than the others, but aptly brings out the importance of every topic discussed. After reading this book, I realized that the authors’ approach is not to go deep into a topic, while still making sure to cover the important details. For better understanding, the authors also work through several use cases and explain the underlying methods as they solve them. It’s a good read for everyone who’d like to learn something new about ML. Available: Buy Now
Practical Data Science with R
This book is written by Nina Zumel and John Mount. As the name suggests, it focuses on using data science methods in the real world, which sets it apart. None of the books listed above talks about real-world challenges in model building and model deployment, but this one does. The authors never lose focus on establishing a connection between the theoretical world of ML and its impact on real-world activities. It’s a must-read for freshers who are yet to enter the analytics industry. Available: Buy Now
Python for Data Science
Mastering Python for Data Science
This book is written by Samir Madhavan. It starts with an introduction to data structures in NumPy and Pandas and provides a useful description of importing data from various sources into these structures. You will learn to perform linear algebra in Python and carry out analysis using inferential statistics. Later, the book moves on to advanced concepts like building a recommendation engine, high-end visualization in Python and ensemble modeling. Available: Buy Now
Python for Data Analysis
Want to get started with data analysis in Python? Get your hands on this data analysis guide by Wes McKinney, the main author of the Pandas library. There isn’t any online course as comprehensive as this book. It covers all aspects of data analysis in Python: manipulating, processing, cleaning, visualizing and crunching data. If you are new to data science in Python, it’s a must-read. It’s packed with case studies from various domains. Available: Buy Now
Introduction to Machine Learning with Python
This book is written by Andreas Muller and Sarah Guido. It’s meant to help beginners get started with machine learning. It teaches you to build ML models from scratch in Python with scikit-learn. It assumes no prior knowledge, so it’s best suited for people with no prior Python or ML experience. In addition, it covers advanced methods for model evaluation and parameter tuning, methods for working with text data, text-specific processing techniques, etc. Available: Buy Now
Python Machine Learning
This book is written by Sebastian Raschka. It’s one of the most comprehensive books I’ve found on ML in Python. The author explains every crucial detail we need to know about machine learning. He takes a stepwise approach to explaining the concepts, supported by various examples. This book covers topics such as neural networks, clustering, regression, classification and ensembles. It’s a must-read for everyone keen to master ML in Python. Available: Buy Now
Building Machine Learning Systems with Python
Available: Buy Now
Advanced Machine Learning with Python
This book is written by John Hearty. It’s a definite read for every machine learning enthusiast. It lets you rise above the basics of ML techniques and dive into unsupervised methods, deep belief networks, autoencoders, feature engineering techniques, ensembles, etc. It’s definitely a book you would want to read to improve your ranks in machine learning competitions. The author lays equal emphasis on the theoretical and practical aspects of machine learning. Available: Buy Now
Programming Collective Intelligence
This book is written by Toby Segaran. Despite its unusual title, this book is meant to introduce you to several ML algorithms such as SVM, trees, clustering and optimization using interesting examples and use cases. It is best suited for people new to ML in Python. Python, known for its incredible ML libraries and support, should make it easy for you to learn these concepts faster. The chapters also include practice exercises to help you develop a better understanding. Available: Buy Now
End Notes
The motive of this article is to introduce you to a huge reservoir of knowledge which you may not have noticed yet. These books will not only provide you with knowledge but also enrich you with various perspectives on using ML algorithms. You might feel puzzled at seeing so many books explaining similar concepts. What differentiates these books is the case studies and examples discussed. Trust me, theoretical explanations are sometimes much harder to decipher than practical cases. That’s how I feel. Reading these authors is the fastest way to learn from so many people.
Nov 26 
What is Business Analytics and which tools are used for analysis?
Posted by Thang Le Toan on 26 November 2017 01:26 PM

Business Analytics has become a catch-all term for anything to do with data. So if you are new to this field and don’t understand what people refer to as “Business Analytics”, don’t worry! Even after spending more than 6 years in this industry, there are times when it is difficult for me to understand the work a person has done by reading their CV. Here is what an excerpt from a typical JD (job description) might look like:
On one hand, this creates confusion in the mind of the person applying for a particular role. On the other hand, it leaves the selectors with the difficult task of understanding and judging what a person has done in the past. Now, if I got this as the description for one of the jobs I had applied to, I would be scared! Scared, not because I don’t know the subject, but because it could mean anything. The work could range from preparing basic reports at a junior level to performing multivariate deep dives on various subjects. So, what do you do when you are in such a situation? Well, the first thing you should do is understand the Business Analytics spectrum. Once you have understood it, ask which part of the spectrum the role applies to, and then decide whether it suits your skills or not. Following is a good representation of this spectrum: Let me explain each of these areas in a bit more detail.
The domain of Analytics starts from answering a simple question: What happened? This activity is typically known as reporting. These are typically the MIS reports which people want to receive first thing in the morning. A report is a snapshot of what has happened. Following is an example of what a typical report might look like: Tools used in reporting: The majority of elementary reporting across the globe happens in MS Excel. More evolved organizations might pull the data from databases using tools like SQL, MS Access or Oracle. But typically, the dissemination of reports happens through Excel. Skills required for reporting:
Detective Analysis starts where reporting ends. You start looking for reasons for unexpected changes. Typical problems you work on are “Why did Sales drop in the last 2 months?” or “Why did the latest campaign underperform or overperform?”. To answer these questions, you look at past trends or at changes in distributions to find the reasons for the changes. However, all of this is backward-looking. Some of the insights you find this way can be used for business planning, but the purpose of the analysis is typically to find out what has worked and what has not. Tools used in detective analysis: Typically used tools are MS Excel, MS Access, Minitab and R (basic regression). You tend to use advanced Excel and pivot tables while dealing with these problems, and creating time series graphs typically helps a lot. Skills required for detective analysis:
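As a rough illustration of this kind of backward-looking drill-down, the sketch below breaks a sales drop down by region in plain Python. The regions, months and figures are invented for the example. In practice, the same comparison is often done with a pivot table.

```python
from collections import defaultdict

# Hypothetical monthly sales records: (month, region, sales).
records = [
    ("2017-08", "North", 120), ("2017-08", "South", 100), ("2017-08", "West", 80),
    ("2017-09", "North", 118), ("2017-09", "South", 70), ("2017-09", "West", 82),
    ("2017-10", "North", 121), ("2017-10", "South", 45), ("2017-10", "West", 79),
]


def change_by_segment(rows, start, end):
    """Return the change in sales between two months for each region."""
    totals = defaultdict(dict)
    for month, region, sales in rows:
        totals[region][month] = totals[region].get(month, 0) + sales
    return {
        region: by_month.get(end, 0) - by_month.get(start, 0)
        for region, by_month in totals.items()
    }


changes = change_by_segment(records, "2017-08", "2017-10")
# The region with the most negative change is the first place to investigate.
worst = min(changes, key=changes.get)
print(changes, worst)
```

Here almost the whole drop comes from one region, which narrows the question from "why did sales drop?" to "what happened in that region?".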
A dashboard is an organized and well-presented summary of key business metrics. Dashboards are usually interactive, so that the user can find the exact information they are looking for. Ideally, a dashboard should provide real-time information about performance. Following is an example of what a dashboard might look like: The whole science of creating data models, dashboards and reports based on this data is also known as “Business Intelligence”. Tools used for creating dashboards: For limited volumes of data, dashboards can be made using advanced Excel. But typically, organizations use more advanced tools for the creation and dissemination of dashboards. Business Objects, Qlikview and Hyperion are some such tools. Skills required for creating dashboards:
This is where you take all your historical trends and information and apply them to predict the future. You try to predict customer behaviour based on past information. Please note that there is a fine difference between forecasting and predictive modeling. Forecasting is typically done at an aggregate level, whereas predictive modeling is typically done at a customer / instance level. Tools used for predictive modeling: SAS has the highest market share among tools used for predictive modeling, followed by SPSS, R and Matlab. Skills required for predictive modeling:
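The difference between the two can be sketched with toy data. In the illustrative Python below, the forecast is a simple moving average over total monthly sales, while the "predictive model" is just a per-segment response rate applied to individual customers. Both the data and these deliberately simple methods are assumptions for the example. Real projects would use proper time series and classification models.

```python
# Forecasting works on an aggregate series (hypothetical total monthly sales).
monthly_sales = [200, 210, 190, 220, 230, 210]


def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)


forecast = moving_average_forecast(monthly_sales)

# Predictive modeling works on individual records.
# Hypothetical campaign history: (customer segment, responded to offer?).
history = [
    ("frequent", 1), ("frequent", 1), ("frequent", 0), ("frequent", 1),
    ("occasional", 0), ("occasional", 1), ("occasional", 0), ("occasional", 0),
]


def fit_response_rates(rows):
    """Estimate the response probability for each segment from past outcomes."""
    counts = {}
    for segment, responded in rows:
        hits, total = counts.get(segment, (0, 0))
        counts[segment] = (hits + responded, total + 1)
    return {segment: hits / total for segment, (hits, total) in counts.items()}


rates = fit_response_rates(history)

# Score each (hypothetical) customer individually.
customers = {"C001": "frequent", "C002": "occasional"}
scores = {customer_id: rates[segment] for customer_id, segment in customers.items()}

print(forecast, scores)
```

The forecast yields one number for the whole business, whereas the scores give a separate prediction for every customer.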
Imagine applying predictive modeling with a microscope in hand. What if you could store, analyze and make sense of every piece of information about the customer? What kind of social media community are they attached to? What kind of searches are they performing? Big data problems arise when data has grown on all three Vs (Volume, Velocity and Variety). You need data scientists to mine this data. Tools used in big data: This is a very dynamic domain right now. A tool which was the market leader 6 months ago is no longer the best. Hence, it is difficult to pin down specific tools. These tools typically rely on Hadoop to store the data. Skills required for harnessing big data:
So, now that you understand the Analytics spectrum, if you come across a role which is not clear to you, please spend the necessary time understanding which domain it refers to and whether it fits with what you want to achieve. If you have faced this confusion about “Business Analytics”, this article should have helped you. In case there is any further confusion, do let me know.
Nov 26 
18 Free Exploratory Data Analysis Tools For People who don’t code so well
Posted by Thang Le Toan on 26 November 2017 01:25 PM

Introduction
All of us are born with special talents. It’s just a matter of time until we discover them and start believing in ourselves. We all have limitations, but should we stop there? No. When I started coding in R, I struggled, sometimes a lot more than one would think, because I had never coded. Now when I look back, I laugh at myself. Do you know why? Because I could have chosen one of several non-coding tools available for data analysis and avoided the suffering. Data exploration is an inevitable part of predictive modeling. You can’t make predictions unless you know what happened in the past. The most important skill for mastering data exploration is ‘curiosity’, which is free of cost yet isn’t owned by everyone. I have written this article to introduce you to the various free tools available for exploratory data analysis. Nowadays, plenty of tools are available which are free and quite interesting to work with. These tools don’t require you to code explicitly: simple drag-and-drop clicks do the job.
List of Non Programming Tools
1. Excel / Spreadsheet
Whether you are transitioning into data science or have already spent years in it, you will know that Excel remains an indispensable part of the analytics industry. Even today, most of the problems faced in analytics projects are solved using this software. With larger-than-ever community support, tutorials and free resources, learning this tool has become much easier. It supports all the important features like summarizing data, visualizing data and data wrangling, which are powerful enough to inspect data from all possible angles. No matter how many tools you know, Excel must feature in your armory. Although Microsoft Excel is paid, you can still try various other spreadsheet tools like OpenOffice and Google Docs, which are certainly worth a try! Free Download: Click Here
2. Trifacta
Trifacta’s Wrangler tool is challenging the traditional methods of data cleaning and manipulation. While Excel has limitations on data size, this tool has no such boundaries and you can securely work on big data sets. It has incredible features such as chart recommendations, inbuilt algorithms and analysis insights, with which you can generate reports in no time. It’s an intelligent tool focused on solving business problems faster, thereby allowing us to be more productive at data-related exercises. The availability of such free tools is a reassuring reminder that there are good people around the world working extremely hard to make our lives better. Free Download: Click Here
3. RapidMiner
This tool emerged as a leader in the 2016 Gartner Magic Quadrant for Advanced Analytics. Yes, it’s more than a data cleaning tool: it extends to building machine learning models and comprises all the ML algorithms which we use frequently. It is not just a GUI; it also supports people using Python and R for model building. It continues to fascinate people around the world with its remarkable capabilities. Above all, it claims to provide a lightning-fast analytics experience. The product line has several products built for big data, visualization and model deployment, some of which (enterprise) require a subscription fee. In short, it’s a complete tool for any business which needs to perform all tasks from data loading to model deployment. Free Download: Click Here
4. Rattle GUI
If you tried using R but couldn’t get the knack of what’s going on, Rattle should be your first choice. This GUI is built on R and is launched from within R. It’s being widely used these days. According to CRAN, rattle is installed 10000 times every month. It provides enough options to explore, transform and model data in just a few clicks. However, it has fewer options than SPSS for statistical analysis. But SPSS is a paid tool. Free Download: Click Here
5. Qlikview
Qlikview is one of the most popular tools in the business intelligence industry around the world. Deriving business insights and presenting them in an awesome manner is what this tool does. With its state-of-the-art visualization capabilities, you’d be amazed by the amount of control you get while working on data. It has an inbuilt recommendation engine that updates you from time to time about the best visualization methods while you work on data sets. However, it is not statistical software. Qlikview is incredible at exploring data, trends and insights, but it can’t prove anything statistically. For that, you might want to look at other software. Free Download: Click Here
6. Weka
An advantage of using Weka is that it is easy to learn. For a machine learning tool, its interface is intuitive enough for you to get the job done quickly. It provides options for data preprocessing, classification, regression, clustering, association rules and visualization. Most of the steps you think of while model building can be achieved using Weka. It’s built on Java. It was primarily designed for research purposes at the University of Waikato, but later it was adopted by more and more people around the world. However, I haven’t seen a Weka community as enthusiastic as those of R and Python. The tutorial listed below should help you further. Free Tutorial: Click Here
7. KNIME
Similar to RapidMiner, KNIME offers an open source analytics platform for analyzing data, and the results can later be deployed and scaled using other supporting KNIME products. This tool has an abundance of features for data blending, visualization and advanced machine learning algorithms. Yes, using this tool you can build models as well. Though there hasn’t been enough talk about this tool, considering its state-of-the-art design, I think it will soon get the limelight it deserves. Moreover, quick training lessons are available on their website to get you started right now. Free Download: Click Here
8. Orange
As cool as it sounds, this tool is designed for interactive data visualization and data mining tasks. There are enough YouTube tutorials to learn it. It has an extensive library of data mining tasks which includes classification, regression and clustering methods. In addition, the versatile visualizations produced during data analysis allow us to understand the data more closely. To build any model, you’ll be required to create a flowchart. This is interesting, as it helps us further understand the exact procedure of data mining tasks. Free Download: Click Here
9. Tableau Public
Tableau is data visualization software. We can say Tableau and Qlikview are the most powerful sharks in the business intelligence ocean, and the debate over which is superior never ends. It’s fast visualization software which lets you explore data, down to every observation, using various possible charts. Its intelligent algorithms figure out by themselves the type of data, the best method available, etc. If you want to understand data in real time, Tableau can get the job done. In a way, Tableau imparts a colorful life to data and lets us share our work with others. Free Download: Click Here
10. Data Wrapper
It’s lightning-fast visualization software. The next time someone on your team is assigned BI work and has no clue what to do, this software is worth considering. Its visualization bucket comprises line charts, bar charts, column charts, pie charts, stacked bar charts and maps. So it’s basic software and can’t be compared with giants like Tableau and QlikView. The tool is browser-based and doesn’t require any software installation.
11. Data Science Studio (DSS)
It is a powerful tool designed to connect technology, business and data. It is available in two segments: Coding and Non-Coding. It’s a complete package for any organization which aims to develop, build, deploy and scale models on a network. DSS is also powerful enough to create smart data applications to solve real world problems. It includes features which facilitate team integration on projects. The most interesting of these is that you can reproduce your work in DSS, as every action in the system is versioned through an integrated Git repository. Free Download: Click Here
12. OpenRefine
It started as Google Refine, but it looks like Google dropped the project for reasons that are unclear. However, the tool is still available, renamed OpenRefine. Among the generous list of open source tools, OpenRefine specializes in messy data: cleaning, transforming and shaping it for predictive modeling. As an interesting fact, it is commonly said that an analyst spends about 80% of model building time on data cleaning. Not so pleasant, but that is how it is. Using OpenRefine, analysts can not only save time but also put it to use for productive work. Free Download: Click Here
13. Talend
Decision making these days is largely driven by data. Managers and professionals no longer make gut-based decisions. They need a tool which can help them quickly. Talend can help them explore data and support their decision making. More precisely, it’s a data collaboration tool capable of cleaning, transforming and visualizing data. It also offers an interesting automation feature where you can save a previous task and redo it on a new data set. This feature is not found in many tools. It also performs auto discovery and provides smart suggestions to the user for enhanced data analysis. Free Download: Click Here
14. Data Preparator
This tool is built on Java to assist us in data exploration, cleaning and analysis. It includes various built-in packages for discretization, numeration, scaling, attribute selection, missing values, outliers, statistics, visualization, balancing, sampling, row selection, and several other tasks. Its GUI is intuitive and simple to understand. Once you start working with it, I’m sure it won’t take you long to figure out how it works. A unique advantage of this tool is that the data set used for analysis doesn’t get stored in computer memory. This means you can work on large data sets without running into speed or memory troubles. Free Download: Click Here
15. DataCracker
It’s data analysis software which specializes in survey data. Many companies run surveys but struggle to analyze them statistically. Survey data are rarely clean; they contain a lot of missing and inappropriate values. This tool reduces our agony and improves the experience of working on messy data. It is designed to load data from all major internet survey programs such as SurveyMonkey, SurveyGizmo, etc. There are several interactive features which help you understand the data better. Free Download: Click Here
16. Data Applied
This powerful interactive tool is designed to build, design and share data analysis reports. Creating visualizations on large data sets can sometimes be troublesome, but this tool is robust at visualizing large amounts of data using tree maps. Like the other tools above, it has features for data transformation, statistical analysis, anomaly detection, etc. All in all, it’s a multi-purpose data mining tool capable of automatically extracting valuable knowledge (signal) from raw data. You’d be amazed to see that such non-programming tools are no less capable than R or Python for data analysis. Free Download: Click Here
17. Tanagra Project
You might not like it because of its old-fashioned UI, but this free data mining software is designed to build machine learning models. The Tanagra project started as free software for academic and research purposes. Being an open source project, it gives you enough room to devise your own algorithm and contribute. Along with supervised learning algorithms, it supports paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms. Its limitations include the lack of a wide set of data sources, direct access to data warehouses and databases, data cleansing, interactive utilization, etc. Free Download: Click Here
18. H2o
H2o is one of the most popular pieces of software in the analytics industry today. In a few years, this organization has succeeded in evangelizing the analytics community around the world. With this open source software, they bring a lightning-fast analytics experience, which is further extended through APIs for programming languages. Beyond data analysis, you can build advanced machine learning models in no time. The community support is great, so learning this tool isn’t a worry. If you live in the US, chances are they are organizing a meetup near you. Do drop by! Free Download: Click Here
Bonus Additions: In addition to the awesome tools above, I also found some more tools which I thought you might like to look at. These tools aren’t free, but trial versions are available:
End Notes
Once you start working with these tools (whichever you choose), you’ll see that knowing programming isn’t much of an advantage for predictive modeling. You can accomplish the same thing with these open source tools. So if you have been disappointed by your lack of coding skills, now is the time to channel your enthusiasm into these tools. You may also be interested in 19 Data Science Tools for Non Coders. The only limitation I see with some of these tools is the lack of community support. Apart from a few, most of them don’t have a community to turn to for help and suggestions. Still, they’re worth a try!
Nov 26 
19 Data Science Tools for people who aren’t so good at Programming
Posted by Thang Le Toan on 26 November 2017 01:20 PM

Introduction
Programming is an integral part of data science. Among other things, it is widely held that a mind which understands programming logic, loops and functions has a higher chance of becoming a successful data scientist. So what about people who never studied programming in school or college? Are they doomed to an unsuccessful career in data science? I’m sure there are countless people who want to enter the data science domain but don’t understand coding very well. In fact, I too was a member of the non-programming league until I joined my first job. So I understand how terrible it feels when something you have never learnt haunts you at every step. The good news is, I found a way! Rather, I’ve found 19 ways to ignite your appetite for learning data science without coding. These tools typically obviate the programming aspect and provide a user-friendly GUI (Graphical User Interface) so that anyone with minimal knowledge of algorithms can simply use them to build predictive models. Many companies (especially startups) have recently launched GUI-driven data science tools. I’ve covered most of the tools available in the industry today. Also, I’ve added some videos to enhance your learning experience.
Note: All the information provided is gathered from open-source information sources. We are just presenting some facts and not opinions. In no manner do we intend to promote/advertise any of the products/services.
List of Tools
1. RapidMiner
RapidMiner (RM) was originally started in 2006 as open-source standalone software named Rapid-I. Over the years, it was renamed RapidMiner and attained ~35Mn USD in funding. The tool is open-source for old versions (below v6), but the latest versions come with a 14-day trial period and are licensed after that. RM covers the entire lifecycle of predictive modeling, from data preparation to model building and finally validation and deployment. The GUI is based on a block-diagram approach, very similar to Matlab Simulink. There are predefined blocks which act as plug-and-play devices. You just have to connect them in the right manner, and a large variety of algorithms can be run without a single line of code. On top of this, custom R and Python scripts can be integrated into the system. Their current product offerings include the following:
RM is currently being used in various industries including automotive, banking, insurance, life sciences, manufacturing, oil and gas, retail, telecommunication and utilities.
2. DataRobot
DataRobot (DR) is a highly automated machine learning platform built by some of the all-time best Kagglers, including Jeremy Achin, Thomas DeGodoy and Owen Zhang. Their platform claims to have obviated the need for data scientists. This is evident from a phrase on their website: “Data science requires math and stats aptitude, programming skills, and business knowledge. With DataRobot, you bring the business knowledge and data, and our cutting-edge automation takes care of the rest.” DR claims to offer the following benefits:
With funding of ~60Mn USD and more than 100 employees, DR looks in good shape for the future.
3. BigML
BigML is another platform with ~Mn USD in funding. It provides a good GUI which takes the user through 6 steps, as follows:
These processes will obviously iterate in different orders. The BigML platform provides nice visualization of results and has algorithms for solving classification, regression, clustering, anomaly detection and association discovery problems. You can get a feel of how their interface works using their YouTube channel.
4. Google Cloud Prediction API
The Google Cloud Prediction API offers RESTful APIs for building machine learning models for Android applications. This platform is specifically for mobile applications based on the Android OS. Some of the use cases include:
Though the API can be used by any system, there are also specific Google API client libraries built for better performance and security. These exist for various programming languages: Python, Go, Java, JavaScript, .NET, Node.js, Objective-C, PHP and Ruby.
5. Paxata
Paxata is one of the few organizations which focus on data cleaning and preparation, NOT the machine learning or statistical modeling part. It is an MS Excel-like application that is easy to use, with visual guidance making it easy to bring together data, find and fix dirty or missing data, and share and reuse data projects across teams. Like others mentioned here, Paxata eliminates coding or scripting, thereby overcoming the technical barriers involved in handling data. The Paxata platform follows this process:
With funding of ~25Mn USD, Paxata has set foot in the financial services, consumer goods and networking domains. It might be a good tool to use if your work requires extensive data cleaning.
6. Trifacta
Trifacta is another startup focussed on data preparation. It has 2 product offerings:
Trifacta offers a very intuitive GUI for performing data cleaning. It takes data as input and provides a summary with various statistics by column. For each column, it also automatically recommends some transformations which can be selected with a single click. Various transformations can be performed on the data using predefined functions which can be called easily in the interface. The Trifacta platform uses the following steps of data preparation:
With ~75Mn USD in funding, Trifacta is currently being used in the financial, life sciences and telecommunication industries.
7. Narrative Science
Narrative Science is based on a unique idea: it generates automated reports from data. It works like a data storytelling tool which uses advanced natural language processing to create reports, something similar to a consulting report. Some of the features of this platform include:
With ~30Mn USD in funding, Narrative Science is currently being used in the financial, insurance, government and e-commerce domains. Some of its customers include American Century Investments, PayScale, MasterCard, Forbes, Deloitte, etc. Having discussed some startups in this domain, let’s move on to some of the academic initiatives which are trying to automate some aspects of data science. These have the potential to turn into successful enterprises in future.
8. MLBase
MLBase is an open-source project developed by the AMP (Algorithms Machines People) Lab at the University of California, Berkeley. The core idea is to provide an easy solution for applying machine learning to large-scale problems. It has 3 offerings:
This undertaking is still under active development and we should hear about the developments in the near future.
9. WEKA
Weka is data mining software written in Java, developed by the Machine Learning Group at the University of Waikato, New Zealand. It is a GUI-based tool which is very good for beginners in data science, and the best part is that it is open-source. You can learn about it through the MOOC offered by the University of Waikato. Although Weka is currently used more in the academic community, it might be the stepping stone to something big in future.
10. Automatic Statistician
Automatic Statistician is not a product per se but a research organization which is creating a data exploration and analysis tool. It can take in various kinds of data and use natural language processing to generate a detailed report. It is being developed by researchers who have worked at Cambridge and MIT, and it also won Google’s Focussed Research Award with a prize of $750,000. Though it is still under development and very little information is available about the project, it looks like it is being backed by Google.
More Tools
I have discussed a selected set of 10 examples above, but there are many more like these. I’ll briefly name a few of them here, and you can explore further if this isn’t enough to whet your appetite:
If you’re hearing these names for the first time, you’ll be surprised (like I was :D) that so many tools exist. But the good thing is that they haven’t had a disruptive impact as of now. But the real question is will these technologies achieve their goals? Only time can tell!
End Notes
In this article, we have discussed various initiatives working towards automating different aspects of solving a data science problem. Some of them are in a nascent research stage, some are open-source, and others are being used in the industry with millions in funding. All of these pose a potential threat to the job of a data scientist, which is expected to grow in the near future. These tools are best suited for people who abhor programming and coding. Do you know any other startups or initiatives working in this domain? Please feel free to drop a comment below and enlighten us!
Nov 26 
Comprehensive guide for Data Exploration in R
Posted by Thang Le Toan on 26 November 2017 01:16 PM

Till now, we have covered detailed tutorials on data exploration using SAS and Python. What is the one piece missing to complete this series? I am sure you guessed it right. In this article, I will give a detailed tutorial on data exploration using R. For the reader’s ease, I will follow a format very similar to the one we used in the Python tutorial, simply because of the sheer resemblance between the two languages. Here are the operations I’ll cover in this article (refer to this article for similar operations in SAS):
Part 1: How to load data file(s)?
Input data sets can come in various formats (.XLS, .TXT, .CSV, JSON). In R, it is easy to load data from most sources, thanks to its simple syntax and the availability of predefined libraries. Here, I will take the examples of reading a CSV file and a tab-separated file. read.table is an alternative for CSV files too, but read.csv is my preference given its simplicity.
# Read CSV into R
MyData <- read.csv(file="c:/TheDataIWantToReadIn.csv", header=TRUE, sep=",")
# Read a tab-separated file
Tabseperated <- read.table("c:/TheDataIWantToReadIn.txt", sep="\t", header=TRUE)
All other Read commands are similar to the one mentioned above.
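To try the read step without having a file at hand, here is a minimal sketch that first writes a tiny data frame to a temporary CSV and then reads it back. The column names and values are made up for illustration only.

```r
# Write a small, made-up data frame to a temporary CSV file
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(71.5, 88, 64)),
          tmp_csv, row.names = FALSE)

# Read it back; header = TRUE takes column names from the first row
my_data <- read.csv(tmp_csv, header = TRUE, stringsAsFactors = FALSE)

str(my_data)   # check the column types R inferred
head(my_data)  # preview the first rows
```

Checking str() right after loading is a cheap way to catch columns that were read with an unexpected type.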
Part 2: How to convert a variable to a different data type?
Type conversions in R work as you would expect. For example, adding a character string to a numeric vector converts all the elements in the vector to character. Use is.xyz() to test for data type xyz (it returns TRUE or FALSE) and as.xyz() to convert to it:
is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
However, converting between data structures needs more care than converting between basic types. Here is a grid which will guide you with format conversion:
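To see the is.* and as.* functions in action, here is a small sketch on made-up values. It also shows one well-known pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the labels.

```r
# Character to numeric: values that cannot be parsed become NA
x_chr <- c("1", "2", "3.5", "abc")
x_num <- suppressWarnings(as.numeric(x_chr))
is.numeric(x_num)

# Mixing numbers and strings in one vector coerces everything to character
mixed <- c(1, 2, "three")
is.character(mixed)

# Factor pitfall: convert via character to get the labels back as numbers
f <- factor(c("10", "20", "10"))
labels_as_numbers <- as.numeric(as.character(f))
level_codes       <- as.numeric(f)
```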
Part 3: How to transpose a data set?
It is sometimes required to transpose a dataset from a wide structure to a narrow structure. Here is the code to do so:
# example of melt function
library(reshape)
mdata <- melt(mydata, id=c("id","time"))
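If the reshape package is not installed, a similar wide-to-long conversion can be sketched with the reshape() function that ships with base R. This is a swapped-in alternative, not the melt() call used above, and the data frame below is made up for illustration.

```r
# Made-up wide data: one row per id, one column per measurement
wide <- data.frame(id = c(1, 2), x1 = c(5, 3), x2 = c(6, 5))

# Base R reshape(): stack x1 and x2 into a single "value" column,
# recording the source column name in "variable"
long <- reshape(wide,
                direction = "long",
                varying   = c("x1", "x2"),
                v.names   = "value",
                timevar   = "variable",
                times     = c("x1", "x2"),
                idvar     = "id")

long
```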
Part 4: How to sort a DataFrame?
Data can be sorted by using order(variable name) as a row index. The sort can be based on multiple variables, each in ascending or descending order.
# sort by var1
newdata <- old[order(old$var1),]
# sort by var1 (ascending) and var2 (descending; the minus sign reverses a numeric variable)
newdata2 <- old[order(old$var1, -old$var2),]
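As a runnable illustration of the same idea, here is a sketch on the built-in mtcars data set, sorting by number of cylinders (ascending) and then by miles per gallon (descending). The choice of data set and columns is mine, not part of the tutorial above.

```r
# Sort by cyl ascending, then mpg descending within each cyl group.
# The minus sign only works for numeric columns.
sorted <- mtcars[order(mtcars$cyl, -mtcars$mpg), c("mpg", "cyl")]

head(sorted)
```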
Part 5: How to create plots (Histogram)?
Data visualization in R is easy and produces very pretty graphs. Here I will create a distribution of scores in a class and then plot histograms with several variations.
score <- rnorm(n=1000, mean=80, sd=20)
hist(score)
Let’s look at the assumptions R makes to plot this histogram, and then modify a few of them.
histinfo <- hist(score)
histinfo
$breaks
[1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150
$counts
[1] 2 5 19 52 84 141 195 201 152 81 39 25 3 1
$density
[1] 0.0002 0.0005 0.0019 0.0052 0.0084 0.0141 0.0195 0.0201 0.0152
[10] 0.0081 0.0039 0.0025 0.0003 0.0001
$mids
[1] 15 25 35 45 55 65 75 85 95 105 115 125 135 145
$xname
[1] "score"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
As you can see, the breaks are applied at multiple points. We can restrict the number of break points or vary the density. Over and above this, we can colour the bars and overlay a normal distribution curve. Here is how you can do all this:
hist(score, freq=FALSE, xlab="Score", main="Distribution of score", col="lightgreen", xlim=c(0,150), ylim=c(0, 0.02))
curve(dnorm(x, mean=mean(score), sd=sd(score)), add=TRUE, col="darkblue", lwd=2)
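To control the break points explicitly, one option is to pass a vector of bin edges through the breaks argument. The sketch below uses a bin width of 5 and a fixed seed, both chosen only for illustration, and sets plot=FALSE so that only the binned counts are computed.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible
score <- rnorm(n = 1000, mean = 80, sd = 20)

# Bin edges every 5 points, wide enough to cover the whole range of scores
lo <- floor(min(score) / 5) * 5
hi <- ceiling(max(score) / 5) * 5
h  <- hist(score, breaks = seq(lo, hi, by = 5), plot = FALSE)

# Inspect the bins as a small table of midpoints and counts
head(data.frame(mid = h$mids, count = h$counts))
```

Calling plot(h) afterwards would draw the histogram from the precomputed bins.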
Part 6: How to generate frequency tables with R?
Frequency tables are the most basic and effective way to understand the distribution across categories. Here is a simple example of calculating a one-way frequency:
table(iris$Species)
Here is code which finds the cross tab between two categories:
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
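If gmodels is not available, a plain cross tab can also be sketched with base R alone. The example below uses the built-in mtcars data set (my choice, for illustration) and adds proportions and margins.

```r
# One-way frequency of cylinder counts, as counts and as proportions
one_way <- table(mtcars$cyl)
prop.table(one_way)

# Two-way cross tab of cylinders against gears
two_way <- table(cyl = mtcars$cyl, gear = mtcars$gear)
addmargins(two_way)                            # append row and column totals
row_shares <- prop.table(two_way, margin = 1)  # share of each gear within a cyl group
row_shares
```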
Part 7: How to sample a data set in R?
To sample a dataset in R, we first need to pick a few random row indices. Here is how you can draw a random sample:
mysample <- mydata[sample(1:nrow(mydata), 100, replace=FALSE),]
This code simply draws a random sample of 100 observations from the table mydata.
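Two small additions are often useful in practice: fixing the seed so the sample can be reproduced, and using the same index trick to split the data into two parts. The sketch below does both on the built-in iris data; the seed and the 70/30 ratio are arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed so the same rows are drawn on every run

# Random sample of 100 rows without replacement
idx       <- sample(nrow(iris), size = 100, replace = FALSE)
my_sample <- iris[idx, ]

# The same idea gives a 70/30 split: sampled rows vs. everything else
train_idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train     <- iris[train_idx, ]
holdout   <- iris[-train_idx, ]
```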
Part 8: How to remove duplicate values of a variable?
Removing duplicates in R is extremely simple. Here is how you do it:
> set.seed(150)
> x <- round(rnorm(20, 10, 5))
> x
[1] 2 10 6 8 9 11 14 12 11 6 10 0 10 7 7 20 11 17 12 1
> unique(x)
[1] 2 10 6 8 9 11 14 12 0 7 20 17 1
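The same idea extends from a single vector to whole rows of a data frame. Here is a small sketch on a made-up data frame, showing unique() for fully identical rows and duplicated() for keeping the first row per key.

```r
# Made-up data: id 2 appears twice with identical rows,
# id 3 appears twice with different cities
df <- data.frame(id   = c(1, 2, 2, 3, 3),
                 city = c("A", "B", "B", "C", "D"))

distinct_rows <- unique(df)                # drops only fully identical rows
first_per_id  <- df[!duplicated(df$id), ]  # keeps the first row seen for each id
```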
Part 9: How to find class-level count, average and sum in R?
We generally use the apply family of functions for these jobs.
> tapply(iris$Sepal.Length, iris$Species, sum)
setosa versicolor virginica
250.3 296.8 329.4
> tapply(iris$Sepal.Length, iris$Species, mean)
setosa versicolor virginica
5.006 5.936 6.588
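The count can be obtained the same way by passing length as the function. A possible way to collect count, mean and sum into one summary table is sketched below; the column names n, mean and sum are my own.

```r
grp <- iris$Species

# tapply() returns one value per factor level, in level order
by_species <- data.frame(
  species = levels(grp),
  n       = as.vector(tapply(iris$Sepal.Length, grp, length)),
  mean    = as.vector(tapply(iris$Sepal.Length, grp, mean)),
  sum     = as.vector(tapply(iris$Sepal.Length, grp, sum))
)

by_species
```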
Part 10: How to recognize and treat missing values and outliers?
Identifying missing values can be done as follows:
> y <- c(4,5,6,NA)
> is.na(y)
[1] FALSE FALSE FALSE TRUE
And here is a quick fix for the same:
y[is.na(y)] <- mean(y, na.rm=TRUE)
y
[1] 4 5 6 5
As you can see, the missing value has been imputed with the mean of the other numbers. Similarly, we can impute missing values with whatever replacement value suits the data best.
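For the outlier half of the question, one common convention (not something specific to this tutorial) is to flag values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to a made-up vector after imputing the missing value with the median, which is less sensitive to extreme values than the mean.

```r
# Made-up data with one missing value and one extreme value
y <- c(4, 5, 6, NA, 5, 4, 60)

n_missing <- sum(is.na(y))

# Impute the missing entry with the median of the observed values
y_imputed <- y
y_imputed[is.na(y_imputed)] <- median(y, na.rm = TRUE)

# 1.5 * IQR rule: flag points far outside the middle 50% of the data
q     <- unname(quantile(y_imputed, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- y_imputed < q[1] - fence | y_imputed > q[2] + fence

y_imputed[is_outlier]  # the flagged values
```

What to do with flagged values (drop, cap or keep them) depends on the data and is a judgment call.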
Part 11: How to merge / join data sets?
This is yet another operation which we use daily. To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
# merge two data frames by ID
total <- merge(dataframeA, dataframeB, by="ID")
# merge two data frames by ID and Country
total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
Appending datasets is another operation which is used very frequently. To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order.
total <- rbind(dataframeA, dataframeB)
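Besides the default inner join, merge() can keep unmatched rows through its all, all.x and all.y arguments. Here is a small sketch on two made-up data frames; the table and column names are illustrative only.

```r
# Made-up tables sharing the key column "ID"
customers <- data.frame(ID = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(ID = c(2, 3, 4), amount = c(20, 35, 50))

inner <- merge(customers, orders, by = "ID")                # only IDs present in both
left  <- merge(customers, orders, by = "ID", all.x = TRUE)  # every customer, NA if no order
full  <- merge(customers, orders, by = "ID", all = TRUE)    # every ID from either table

# Appending rows: the new rows must have the same columns
stacked <- rbind(orders, data.frame(ID = 5, amount = 10))
```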
End Notes:
In this comprehensive guide, we looked at the R code for various steps in data exploration and munging. This tutorial, along with the ones available for Python and SAS, will give you comprehensive exposure to the most important languages of the analytics industry. Did you find the article useful? Do let us know your thoughts about this guide in the comments section below.