Automatic, for the machines

The machine learning world is huge. New frameworks, algorithms, and papers appear at a pace no human could keep up with. The enthusiasm and progress are amazing, but they create a barrier for beginners who want an introduction to the field and don’t have four years of postdoc research experience. Many ML libraries are focused on research, which means you have to build your own custom deep neural network. This requires you to select the activation function for each layer of neurons and to choose a suitable optimization algorithm. Making sense of all this is where the learning curve of ML hits many dedicated beginners in the face, leaving them bloodied and frustrated.

There’s a great joke that goes: An intelligent robot walks into a bar. “What’ll it be?” says the bartender. “What’s everyone else having?” says the robot.

Not every problem requires an ultra-customized deep neural network. Testing on real-world data will reveal the intrinsic bias and blind spots of any model, so it’s best to start prototyping with generally accepted “good enough” machine learning algorithms. Linear regression, k-nearest neighbors, classic clustering, and ‘bag of words’ algorithms do very well on a variety of problems and are simple enough to adapt if they require tuning.

Turi Create

Apple acquired Turi in 2016 and released Turi Create as open source in 2017. The project is available on GitHub. Turi Create is task-based and does not require extensive model configuration. It accomplishes this by exposing problem-specific toolkits that accept tabular, textual, or image data. Turi pre-inspects the data and automatically chooses a suitable algorithm based on the target you ask it to predict. Combined with easy integration with Core ML, this makes Turi Create an excellent framework for any developer seeking to integrate machine learning into their apps.

Besides task-based toolkits, Turi also contains helper functions and data types that are much like the industry-standard Python data science tools SciPy, Pandas, and NumPy. These libraries are what the pros use, but using all of them correctly in conjunction with Keras, TensorFlow, or another ML library is daunting for a beginner. When starting your first machine learning project, staying within one framework and learning all it has to offer makes you a more effective problem solver. The data structure concepts and conventions within Turi are broadly shared with the more advanced data science tools, so learning Turi provides a skillset you can build on.

Enough of me gushing about Turi. Take some time, right now, to check out what Turi has to offer.

Before starting an ML project

I hope you agree Turi is amazing, and are excited to try it out in a project. Before starting, there are a couple of caveats that may save time and set expectations.

State the problem

Clearly state what the project aims to classify or predict. Critically think through:

Where will we get our data?
Is this feat even possible within the timeframe?
What skills are required to validate that the model is working?
What precisely is and isn’t within a classification domain?

For example, classifying a tweet as ‘Politics’ or not is difficult, even for a human. What exactly constitutes ‘politics’? Broad subject matter classification can depend on the observer and will require an enormous amount of accurate, labeled training data. A model involving topical or highly contextual subject matter will always be behind the curve, because the subject matter shifts faster than labeled training data can be collected.

Try Stats

The problem may not require deep neural networks whatsoever. Many supervised learning problems can be solved through linear regression, k-nearest neighbors, or other statistical methods. Before heading down the neural-network rabbit hole, take a random sample of your data and plot it on a graph. Estimate what may be a good predictor and plot it on the X axis. For example, when attempting to determine home prices, plotting ‘number of bedrooms’ vs ‘sale price’ is a good guess. If the graph shows a roughly linear relationship between the variables, a regression may be appropriate. It is possible to run a regression with N variables, but visualizing the relationships is not so easy with our 3D-centered brains. Turi also provides regression and nearest neighbor toolkits.
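As a rough illustration of the "try stats first" idea, here is a minimal sketch of fitting a straight line to bedrooms vs sale price. It uses only plain Python rather than Turi, and the data points, the `homes` list, and the helper names `fit_line` and `r_squared` are all made up for this example.

```python
# Hypothetical toy data: (number of bedrooms, sale price in thousands).
# These are invented values for illustration, not real listings.
homes = [(1, 155), (2, 190), (3, 260), (4, 295), (5, 350)]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(points, slope, intercept):
    """Fraction of the variance in y that the fitted line explains."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_total = sum((y - mean_y) ** 2 for _, y in points)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return 1 - ss_residual / ss_total

slope, intercept = fit_line(homes)
fit_quality = r_squared(homes, slope, intercept)
print(f"price ~ {slope:.1f} * bedrooms + {intercept:.1f} (R^2 = {fit_quality:.2f})")
```

An R^2 close to 1 on a random sample suggests a simple regression may be all the problem needs. A value near 0 suggests trying a different predictor or a different kind of model.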

Aggregating Data, The Hard Part

Amazing things can be accomplished with machine learning. But all of that power requires a large amount of “good”, “clean” data. Acquiring and processing training data will take up the majority of the time and effort. For the best accuracy, training data should have the same format and variance as the “real” data. For example, a model that is meant to classify Tweets should not be trained on Yelp reviews. The structure, tone and length are completely different. Training data must be labeled to be useful within supervised learning. Twitter allows searching of terms and hashtags, which makes aggregating ‘positive’ data (tweets that exhibit the classification) easy. The data corpus must also include ‘negative’ data (tweets that do not exhibit the classification). This ‘noise’ can be aggregated through the Twitter API “/sample” method. Have a plan regarding where both negative and positive data will come from.

With unlabeled data, labeling the training data with one or more classifications (an ML model can classify into N classes) is an extremely labor-intensive task. Amazon Mechanical Turk is an option, albeit an expensive one.

Data Cleanliness

“Clean” data means data which is free from irregularities, extreme outliers, or mislabeled results. Raw data is usually very dirty: measurements may be in inconsistent units, the scale of each parameter may vary widely, and so on. “Cleaning” data refers to removing the noise so the underlying patterns and structure can be inferred by the ML algorithm.
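To make the idea of cleaning concrete, here is a small sketch in plain Python (not Turi) that normalizes units, drops records with unknown labels, and filters extreme outliers. The records, the `VALID_LABELS` set, the `clean` helper, and the `max_ratio` cutoff are all assumptions invented for this example. A real project would pick its own rules.

```python
from statistics import median

# Hypothetical raw records: home sizes in mixed units, some of them unusable.
raw = [
    {"size": 1200, "unit": "sqft", "label": "house"},
    {"size": 110, "unit": "sqm", "label": "house"},
    {"size": 950, "unit": "sqft", "label": "condo"},
    {"size": 1000000, "unit": "sqft", "label": "house"},  # extreme outlier
    {"size": 1400, "unit": "sqft", "label": "hosue"},     # mislabeled (typo)
]

SQFT_PER_SQM = 10.7639
VALID_LABELS = {"house", "condo"}

def clean(records, max_ratio=10.0):
    """Normalize units, drop unknown labels, and drop extreme outliers."""
    normalized = []
    for rec in records:
        if rec["label"] not in VALID_LABELS:
            continue  # skip rows whose label is not one we recognize
        size = rec["size"] * SQFT_PER_SQM if rec["unit"] == "sqm" else rec["size"]
        normalized.append({"size_sqft": size, "label": rec["label"]})
    # Treat anything more than max_ratio times the median as an outlier.
    typical = median(r["size_sqft"] for r in normalized)
    return [r for r in normalized if r["size_sqft"] <= typical * max_ratio]

cleaned = clean(raw)
print(len(raw), "raw records ->", len(cleaned), "clean records")
```

Dropping the mislabeled row is the simplest choice here. If labeled data is scarce, correcting the label by hand may be worth the effort instead.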

How Much is Enough

Short answer: until the model works well enough for the use case. Not entirely helpful, but data requirements depend on the dimensionality and output space of the problem domain. They also depend on the relative variance of the ‘real world’ data the model will predict. For example, recipes have a rather standard format and vocabulary (more standard than ‘tech blogs’ or the comment section on YouTube), so a model working on recipes can usually get by with fewer examples. If the model predicts a continuous value from numerical data, scaling this data to between 0 and 1 typically results in better outcomes.
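Scaling values into the 0 to 1 range is often done with min-max scaling. The following is a minimal sketch in plain Python. The `min_max_scale` helper and the `prices` list are made up for illustration.

```python
def min_max_scale(values):
    """Linearly rescale values so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sale prices, invented for this example.
prices = [150000, 200000, 350000, 500000]
scaled = min_max_scale(prices)
print(scaled)
```

The same minimum and maximum found on the training data should be reused when scaling new inputs at prediction time. Otherwise the model sees values on a different scale than it was trained on.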

It is even possible to have ‘too much’ data: the size of the dataset can slow down analysis, or introduce too much variance. Too much variance suggests the problem statement is overly vague and should be narrowed. To assess how much data is ‘enough’, observe the distribution of the data already collected. In the case of text data, the top 100, 1,000, or N words can give a feel for the dominant features. There will come a point where collecting more data will not affect the overall distribution.
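One way to check whether more data still changes the distribution is to compare the top N words before and after adding a new batch. This is only a sketch: the helper names `top_words` and `top_word_overlap`, the whitespace tokenization, and the sample sentences are assumptions made for this example.

```python
from collections import Counter

def top_words(documents, n):
    """Return the n most frequent lowercase, whitespace-separated words."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]

def top_word_overlap(old_docs, new_docs, n):
    """Fraction of the top-n words that stay the same after adding new_docs."""
    before = set(top_words(old_docs, n))
    after = set(top_words(old_docs + new_docs, n))
    return len(before & after) / max(len(before), 1)

# Hypothetical recipe snippets, invented for illustration.
collected = [
    "mix the flour and sugar",
    "bake the cake and cool",
    "whisk the eggs and sugar",
]
new_batch = ["fold the flour and eggs"]

overlap = top_word_overlap(collected, new_batch, n=2)
print(f"top-2 overlap after the new batch: {overlap:.0%}")
```

When the overlap stays close to 100% across several new batches, additional data is probably not changing the dominant features much. A real pipeline would likely strip stop words such as "the" and "and" first, since they dominate any word count.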

When building an image classifier, the amount of data can be artificially boosted by rotating the image, randomly cropping it, and adjusting the color scale. This technique is usually called data augmentation. It may appear to be a cheap trick, but augmentation can often highlight meaningful training features in the data.
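The transformations described above (rotation, random cropping, and color adjustment) can be sketched on a tiny grayscale "image" stored as a list of rows. This uses plain Python so it runs anywhere; a real project would use an image library. The function names, the brightness factor, and the 3x3 sample grid are all invented for this example.

```python
import random

def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def adjust_brightness(image, factor):
    """Scale every pixel by factor, clamped to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def random_crop(image, height, width, rng):
    """Cut a height x width window from a random position in the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]

def augment(image, crop_size, rng):
    """Return the original image plus three transformed variants."""
    return [
        image,
        rotate_90(image),
        adjust_brightness(image, 1.2),
        random_crop(image, crop_size, crop_size, rng),
    ]

# Hypothetical 3x3 grayscale image; the seed makes the crop repeatable.
sample = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
augmented = augment(sample, crop_size=2, rng=random.Random(42))
print(len(augmented), "training images from 1 original")
```

Each variant keeps the label of the original image, so one labeled photo yields several labeled training examples.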

Curb Your Expectations

Machine learning is not an exact, provable discipline. A function that determines if an input is divisible by 2 will always be correct, and can be proven to be correct. An ML model detects patterns within data, and then repeatedly tunes its parameters until the error rate is “good enough”. It’s not so much “learning” as fitting a large statistical curve. Just like humans, the model can still make mistakes, and it can’t make inferences on subjects it hasn’t seen before. The strength of humans is still our ability to discern complex context and create new ways of doing things.