Mastering Python for Data Science

Samir Madhavan

Published by

Packt Publishing

Reviewed by

Stewart Marshall MBCS


3 out of 10

Samir Madhavan’s book sets out to show how the Python programming language may be used to apply statistical methods appropriate to data science. It introduces statistical techniques such as the application of probability distributions, tests of significance, linear regression, logistic regression and segmentation, as well as methods for data visualisation.

Python’s NumPy and pandas libraries form the basis for coding these techniques. The Matplotlib library and statsmodels module are also introduced. The ‘Who this book is for’ section suggests that the book is aimed at Python developers who already have some knowledge of data science.

The book starts with an introduction to the data structures provided by NumPy and pandas and a useful description of methods used to import data from various sources (text files, spreadsheets, etc.) into these structures.

Most subsequent sections of the book start with a brief introduction to the statistics to be employed, followed by code examples to show how Python and its various libraries may be used to apply the methods under discussion to some sample data.

In the main, the descriptions of statistical concepts and techniques are vague, muddled and often tautological. We are told, for example, that ‘Logistic regression uses logistics’, that ‘... find[ing] the probability density of the predicted values ... helps us to understand which areas of the predicted probability are denser’, and that ‘... correlation defines the similarity between two random variables.’

Some of the examples given to illustrate statistical concepts are little better: the example given for the occurrence of the normal distribution in nature is that of the shape formed by sand collecting in an egg timer. Whilst the cross section of such a pile of sand may resemble the shape of the normal distribution as it is commonly plotted, there are many examples of the normal distribution in nature that would be more interesting and far more illustrative of the concept.

The text does illustrate how to apply libraries such as pandas and Matplotlib to problems in statistics and data science, but in this respect it provides little information or example beyond that to be found in the online tutorials that accompany such libraries.

So, who is this book for?

If the reader does not already have a good grasp of the statistics that underpin data science, they are unlikely to acquire it by reading this book: the material on statistics is just too thin.

If the reader knows data science but is not already familiar with Python then they would struggle to get started, as there is no introduction to the use of the core Python language.

For established Python developers, who already have an understanding of statistics and data science – the audience identified in the book itself – the book provides little more than is to be learned from the online materials that accompany pandas, Matplotlib and NumPy.

It really is hard to recommend this book to anyone.

Further information: Packt Publishing

February 2016