Linear Discriminant Analysis

Sections: Introduction; Principal Component Analysis vs. Linear Discriminant Analysis; What is a “good” feature subspace?; Summarizing the LDA approach in 5 steps; Preparing the sample data set; About the Iris dataset; Reading in the dataset; Histograms and feature selection; Normality assumptions; LDA in 5 steps; Step 1: Computing the d-dimensional mean vectors; Step 2: […]

A Data Scientist’s View into the Cancer Moonshot Project: Part 2, Data Sharing


Over the past few months, we worked on a project that’s a little different from our usual work: researching and writing a report for Vice President Biden’s Cancer Moonshot Initiative. In this report, we analyzed how cancer research might benefit from better use of data and analytics. Our recommendations were organized around three major themes: […]

How To Use Deep Learning And Transfer Learning To Tag Images


Image Tagging Problem For E-Commerce: Imagine you are an e-commerce company with thousands of flash sales available on your website every day. To view a product offer in detail, users have to click on the specific thumbnail, which is composed of an image and a short description. In these thumbnail images is hidden information, yet […]

Beyond One-hot: an Exploration of Categorical Variables


In machine learning, data is king. The algorithms and models used to make predictions from the data are important, and very interesting, but ML is still subject to the principle of garbage in, garbage out. With that in mind, let’s look at one small subset of that input data: categorical variables. Categorical variables (wiki) are those that represent a […]

Interacting with ML Models


The main difference between data analysis today, compared with a decade or two ago, is the way that we interact with it. Previously, the role of statistics was primarily to extend our mental models by discovering new correlations and causal rules. Today, we increasingly delegate parts of our reasoning processes to algorithmic models that live […]

Open Data Reveals $791 Million Error in Newly Adopted NYC Budget


Editor’s note: Opinions expressed in this post do not necessarily reflect the views of #ODSC. The headline in a recent NYC press release caught my eye: “MAYOR AND CITY COUNCIL LAUNCH SEARCHABLE OPEN BUDGET FOR NEW YORK CITY”. I was pretty excited. As mentioned in my talk on Ted, NYC has entombed this data in PDFs for years, making it […]

Diving Deep into Python, the not-so-obvious Language Parts


Sections: The C3 class resolution algorithm for multiple class inheritance; Assignment operators and lists – simple-add vs. add-AND operators; True and False in the datetime module; Python reuses objects for small integers – use “==” for equality, “is” for identity; And to illustrate the test for equality (==) vs. identity (is); Shallow vs. deep […]

A Modern Guide to Getting Started with Data Science and Python


Thomas originally posted this article at http://twiecki.github.io. Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle. What’s wrong with the many lists of PyData […]

Visualizing Top Tweeps with t-SNE, in Javascript


I was looking into various ways of embedding unlabeled, high-dimensional data in two dimensions for visualization. A wide variety of methods have been proposed for this task. This review paper from 2009 contains nice references to many of them (PCA, Kernel PCA, Isomap, LLE, Autoencoders, etc.). If you have Matlab available, the Dimensionality Reduction Toolbox […]