Machine learning has achieved stunning breakthroughs across a wide range of applications, from language translation to image recognition and rendering. And who can ignore IBM Watson, the machine learning marvel of the moment? Technology is on its way to the next step in its evolution, and machine learning probably has the biggest role to play.
Considering that machine learning has been around since the 1950s, it is natural to wonder what changed in the past few years to produce so many innovations in the field. The answer is data. We live in an era of data abundance: by some estimates, around 90% of the world’s data has been created in the past three years alone. And this brings us to the topic at hand — statistics.
Statistics is integral to machine learning systems. In fact, you could go so far as to call machine learning a branch of statistics. Statistical analysis can be very helpful both in establishing the assumptions behind a machine learning system and in validating the results the system yields.
In this post, we will look at four key statistics concepts that will help you get started with machine learning.
Statistical inference is the process of analyzing data to understand the underlying probability distribution. In simpler words, statistical inference is what you apply when you want to make sense of random data.
Take the example of buyer behavior on an e-commerce website. Customer-related data is riddled with several variables and uncertainties, which in turn make it almost impossible to understand what led the customer to purchase a product in the first place. The methods you apply to make observations and sample the data again introduce a lot of variables to the already random data.
This is where statistical inference comes into play. The collected data is sampled and organized into models based on shared properties, and on that basis predictions are made about the problem at hand (buyer behavior, in this case).
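To make the idea concrete, here is a minimal sketch in Python using entirely synthetic data: we pretend each site visit ends in a purchase with some hidden probability, observe only a sample of visits, and then infer the underlying rate from that sample. The `true_rate` value and the data are hypothetical, used only to simulate the situation described above.

```python
import random
import statistics

random.seed(42)

# Hidden "population" parameter: the true purchase probability.
# In practice we never see this; it is used here only to simulate data.
true_rate = 0.12

# Observed sample: 1 = visit ended in a purchase, 0 = it did not.
sample = [1 if random.random() < true_rate else 0 for _ in range(5000)]

# Statistical inference: estimate the underlying rate from the sample alone.
estimate = statistics.mean(sample)

# Normal-approximation 95% confidence interval for a proportion,
# quantifying the uncertainty introduced by sampling.
stderr = (estimate * (1 - estimate) / len(sample)) ** 0.5
low, high = estimate - 1.96 * stderr, estimate + 1.96 * stderr

print(f"estimated purchase rate: {estimate:.3f}")
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

The point is not the arithmetic but the workflow: we never observe the underlying distribution directly, only a noisy sample, and inference is how we reason back from the sample to the distribution.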
A statistical population is a set of entities that share a property or a set of properties; that is, it is a set of similar (not identical, mind you) items or events. When we work with data, we are, in fact, working with a sample taken from the data population. While working on a prediction problem, we work with the sample data in a manner that characterizes the entire data population so that there is minimal deviation in the prediction when working with other sample data from the population.
This means that the selection and sampling of data must be done with utmost care, as the size and quality of the sample may affect the overall characterization of the data population and the consequent findings. Also, remember to take into account the randomness introduced in the data collection stage and manage it accordingly, correcting for it where necessary.
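The effect of sample size on how well a sample characterizes its population can be sketched with synthetic data. Everything below is hypothetical: we generate an artificial "population" of order values, then compare how far the means of small and large random samples drift from the population mean.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: 100,000 synthetic order values
# (normally distributed around 50, for illustration only).
population = [random.gauss(50, 15) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# Larger samples tend to characterize the population mean
# with smaller deviation than small ones.
for n in (10, 100, 10_000):
    sample = random.sample(population, n)
    deviation = abs(statistics.mean(sample) - pop_mean)
    print(f"sample size {n:>6}: mean deviates from population by {deviation:.2f}")
```

Any single small sample can happen to land close to the population mean, but on average the deviation shrinks as the sample grows, which is why sample size and sampling quality matter so much for the downstream findings.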
There’s a common misconception that big data does away with the process of data sampling — that you can work with the entire data population. This is dangerous thinking. Imagine you are modeling employee data for a manufacturing business. The data you’ll be working on is, of course, a sample and not the entire population, since new employees will keep joining the business after your analysis, which means your data repository was only ever a sample.
This is why you must always avoid overgeneralizing results and making claims beyond the data you’ve worked on. For example, the trends of all Facebook users cannot represent the trends of all humans.
What big data does help with, however, is a different matter altogether. It aids in modeling individual entities (one customer, one employee, etc.) using all the data collected to date on that entity. This, in turn, opens up exciting new avenues in the world of research and analysis.
A statistical model is a smaller, representative version of the actual data. It is filled with assumptions and is a rather crude simplification of the data population. It is always wrong, just as miniature models of bridges or buildings are, but it does give you a basic idea of what the actual data looks like. It describes the relation between several data attributes, so you at least know how to look at the actual data and make sense out of it.
Statistical models vary in their degree of complexity: the more complex a model is, the closer it is to the actual data, and the harder it is to understand. For this reason, it is always a good idea to start with a simpler model and increase the complexity as your requirements demand.
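The start-simple advice can be illustrated with a small, self-contained sketch. Using synthetic data (all values below are hypothetical), we fit two models of increasing complexity — a constant model that predicts the mean everywhere, and a least-squares line — and compare how well each one describes the data.

```python
import random

random.seed(1)

# Synthetic data: a roughly linear relationship with noise.
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

def fit_constant(ys):
    """Simplest possible model: predict the mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """One step up in complexity: least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Mean squared error: how far the model is from the actual data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

constant_model = fit_constant(ys)
linear_model = fit_linear(xs, ys)
print(f"constant-model MSE: {mse(constant_model, xs, ys):.2f}")
print(f"linear-model MSE:   {mse(linear_model, xs, ys):.2f}")
```

Here the slightly more complex model fits the data much better while still being easy to interpret; in practice you would keep increasing complexity only as long as the gain in fit justifies the loss in simplicity.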
Statistics for Machine Learning
If this article piqued your interest and made you want to leverage statistics to build effective and responsive machine learning models, Statistics for Machine Learning is what you need. It takes you through all you need to know to perform complex statistical computations required for machine learning, including supervised learning, unsupervised learning, reinforcement learning, and more.
Written by Pratap Dangeti, the book follows a practical step-by-step approach to explain statistics and machine learning fundamentals. Pratap Dangeti is a machine learning expert who spends most of his time developing machine learning and deep learning solutions for structured, image, and text data.
So, if you are a developer keen on brushing up your statistics and implementing machine learning in your systems, Statistics for Machine Learning is the go-to book for you!
For other updates, you can follow me on Twitter: @NavRudraSambyal.
Happy (machine) learning!
Thanks for reading, please share it if you found it useful 🙂