Statistics for Data Science

James D. Miller

Published by
Packt Publishing
ISBN 9781788290678
RRP £32.99
Reviewed by Patrick Hill BSc(Hons) MSc PhD CEng MBCS CITP

4 out of 10

Software applications of various kinds have long offered the ability for users to report upon the data that they have acquired and generated. More recently, and particularly with the advent of Big Data, users have begun to explore not only what is in their data sets, but also to ask why the results are as they are, how elements of the data sets are interrelated and whether historical data can be used to predict future values. Answering these types of questions has led to the emergence of the discipline of Data Science. While data science is somewhat vaguely defined, its practitioners typically hold an interdisciplinary skill set which may include elements of database analysis, programming, statistics and data visualisation.

This book aims to introduce software developers to some of the different tools and techniques typically used in data science. In order to help make the content relevant to the target audience, the author attempts to draw parallels between data science and database development.

The book is structured into twelve chapters. Strangely, given the book’s title, the discussion of statistical analysis only really starts at chapter five. The preceding chapters attempt to differentiate between the roles and responsibilities of database developers and those of data scientists. From chapter five onwards, the book introduces and discusses a variety of topics related to data science, including regression analysis, regularisation and boosting, as well as machine learning topics such as artificial neural networks and support vector machines. Where appropriate, examples are given in the R programming language, and the source code and data sets used within the book are available for download from the publisher’s website.

Unfortunately, however, I can find little to recommend this book. The writing style is poor and the structure often meanders. In places the text does not read as being authoritative, with abrupt departures from the author’s loose conversational style into a more scholarly tone suggesting that the content is perhaps being drawn from elsewhere. Sentences often include unnecessary parenthetical comments, in one case six in a single sentence, which do little to aid readability. Parallels drawn with the author’s perceived experience of the ‘data or database developer’ are often tenuous. Indeed, the chapter titles themselves are somewhat misleading. It is hard to imagine, for example, what ‘database progression to database regression’ might actually refer to.

Some examples which really need little explanation, such as the description of parallelism within artificial neural networks, are laboured and confusing. On other occasions the text is simply wrong: for example, the output of a sigmoid function is incorrectly described as switching from zero to one based on some threshold, whereas a sigmoid varies smoothly between the two. Frustratingly, some points of interest receive no explanation at all. For example, the outlier in a learning curve in the chapter on model assessment attracts no comment or explanation, the author preferring rather to explain how to export the graph as a PNG format file.
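To make the sigmoid point concrete, the following is a minimal Python sketch. It is not taken from the book, whose examples are in R, and the function names `sigmoid` and `step` are illustrative choices. It contrasts the logistic sigmoid with a hard threshold (step) function, which is closer to what the book's description implies.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: a smooth curve with values strictly between 0 and 1."""
    # Branch on the sign of x so that math.exp is never called with a
    # large positive argument, which would raise OverflowError.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def step(x: float, threshold: float = 0.0) -> float:
    """Hard threshold: jumps from 0 to 1 at the threshold (illustrative)."""
    return 1.0 if x >= threshold else 0.0


if __name__ == "__main__":
    # Print both functions side by side for a few sample inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.3f}  step={step(x):.0f}")
```

The step function takes only the values zero and one, whereas the sigmoid passes through every intermediate value, for example 0.5 at an input of zero. It is this smoothness that makes the sigmoid differentiable and therefore usable with gradient-based training.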

The book’s references are obscure and poorly cited and some parts of the text, such as the section on variance, do little more than direct the reader to Wikipedia.

The book is not helped by deficiencies in the editorial and production processes. There are spelling errors, repeated and nonsensical sentences and meaningless illustrations. Also, although the accompanying download appears to include it, the first data set that is referred to in the text isn’t actually available: the CSV file is merely a renamed copy of a text file containing the source code of the examples.

There is certainly some useful information in this book; however, for me the text felt like an early draft rather than a finished product. Rather than being helpful, the attempt to align concepts with those deemed to be familiar to the target readership often seemed awkward and, at times, patronising. There is no shortage of introductory books in the data science field and I would suggest that the time and effort required to read this book could more fruitfully be expended exploring other resources.

Further information: Packt Publishing

March 2018