Doing Data Science - Straight Talk from the Frontline

Cathy O’Neil, Rachel Schutt

Published by

Reviewed by


10 out of 10

Computer applications have become increasingly pervasive, collecting vast quantities of data relating to many diverse activities. Advances in computer hardware have provided solutions that enable this data to be collected and stored, and data owners expect that important and possibly commercially advantageous information is contained within these data sets.

Data science is an emerging discipline that, while not yet having a strict and common definition, broadly relates to the preparation and exploration of typically large data sets in order to identify patterns, predict outcomes, classify and derive meaning using a variety of mathematical and software tools.

This book, which is based on a course developed and delivered by the authors at Columbia University, is a practical introduction to data science and explores the subject from a variety of theoretical, application-specific and professional perspectives.

Introductory chapters set the scene by exploring the question ‘what is data science?’, outlining the data science landscape and the role and skillset of a data scientist.

The authors then examine general features of data sources, including the kinds of statistical limitations that can be present in data sets, and outline a data science process that defines a practical, general approach to data science projects. Methods of exploratory data analysis and data modelling are described and supported by practical exercises, which also introduce the reader to the R language.

As might be expected, a significant part of the book is devoted to the discussion of common probabilistic and statistical approaches used in solving classification and prediction problems. This is accompanied by discussion of the suitability of different algorithms for different scenarios, how models can be fitted to data sets and ways in which algorithms and models can be evaluated. The book also outlines the principles and use cases of the map/reduce approach.

In addition to data analysis techniques, the book contains introductions to general data science applications, including recommendation systems, data visualisation and approaches to deriving meaning and causality.

Beyond these general topics, the authors also devote chapters to specific application areas. These chapters examine the differing features and analytical requirements of data sets such as social network data, epidemiological data and time-stamped financial data.

Many chapters include additional content from subject matter experts who describe the ways in which the techniques discussed in the book are being applied to solve real problems.

This book covers a lot of ground and contains many links to other sources. Where applicable, the content is supplemented by practical examples, using tools such as R, bash and Python, and the authors have provided downloadable data sets for readers to explore.

The text is well written, with an authoritative but engaging informal style. However, some chapters require the reader to have a reasonable mathematical background, including linear algebra. There is great emphasis on practical application of the content, supported by advice based on the authors’ own experiences. The book is a good introduction to the emerging field of data science, and it encourages readers to delve deeper into the subject.

Further information: O'Reilly

September 2014