Identify & Avoid Biased Data

Data-driven models are built on observed correlations between input parameters and outcome variables. With machine learning methods, we can build an effective model of a domain we know nothing about, provided we have enough clean and unbiased training data. But data science is about things that happen for understandable reasons, and models which ignore this principle are doomed to fail embarrassingly in certain circumstances.

When theory and domain-specific knowledge are embedded in data science, they guide the structure and design of ad hoc models. Such models, however, tend to be brittle in response to changing conditions and difficult to apply to new tasks.

Machine learning models for classification and regression are general because they employ no problem-specific ideas, only problem-specific data. When retrained on fresh data, they adapt more easily to changing conditions, because their behavior is determined by the data set they are trained on.

Weighing the two approaches brings us to the reality that the best models are a mixture of both theory and data, and Skills4Industry is a testament to this.

The theory behind Skills4Industry models flows from Adam Smith's assertions. Despite the billions of dollars invested in all types of social science, education, and workforce research, most of the data collected by different institutions is useless for artificial intelligence predictive modeling. These data divide society into groups such as urban, inner-city, suburban, and rural economies, along with several other tribal groupings that are inimical to work transitions across social groups. Using such data to predict abilities, potentials, learning, and employment matching results in biased, under-fitted or over-fitted artificial intelligence models.


Adam Smith, in The Wealth of Nations, asserts that two different circumstances regulate the wealth of every nation:

(1) "the skill, dexterity, and judgment with which its labor is generally applied"; and

(2) "the proportion between the number of those who are employed in useful labor" and the number of those who are not.

Smith also notes that the first circumstance, skilled labor, is more important than the second, the employment rate.

Smith also asserts that the second driver of wealth creation is the number of people employed (the employment rate), since labor produces wealth. This means that a correlation exists between human abilities and potentials, while the determinant variables offer the key to identifying the exact gaps between abilities and potentials, as well as the capabilities required to close those gaps.

With Adam Smith's theories, we are better able to use our domain knowledge to select the best data to fit and evaluate our models, and to eliminate parochially biased socioeconomic, education, and workforce research.

Stochastic vs. deterministic models: The world is a complicated place where events would generally not unfold in exactly the same way if repeated several times. Good forecasting models incorporate this thinking and produce a probability distribution over all possible events.
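As a rough illustration of what "a probability distribution over all possible events" can look like, the sketch below repeats a random process many times and reports how often each outcome occurs. The dice example, the function name `simulate_outcome_distribution`, and the trial count are assumptions made purely for illustration; they are not part of any Skills4Industry model.

```python
import random
from collections import Counter


def simulate_outcome_distribution(trials=10_000, seed=0):
    """Estimate the distribution of the sum of two dice by repeated simulation.

    Hypothetical example: the event (a two-dice roll) stands in for any
    process whose outcome differs from one repetition to the next.
    """
    rng = random.Random(seed)  # private generator, so the run is repeatable
    counts = Counter(
        rng.randint(1, 6) + rng.randint(1, 6) for _ in range(trials)
    )
    # Convert raw counts to relative frequencies (estimated probabilities).
    return {outcome: count / trials for outcome, count in sorted(counts.items())}


distribution = simulate_outcome_distribution()
for outcome, probability in distribution.items():
    print(f"{outcome:2d}: {probability:.3f}")
```

Rather than a single predicted sum, the output is a full set of outcomes with an estimated probability for each, which is the form a stochastic forecast takes.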

Stochastic means randomly determined. Techniques that explicitly build some notion of probability into the model include logistic regression and Monte Carlo simulation. Models must observe the basic properties of probabilities, including:

Each probability must be a value between 0 and 1.

The probabilities of all possible events must sum to 1.

Values generated independently between 0 and 1 do not necessarily add up to a unit probability over the full event space. One solution is to scale the values so that they do, by dividing each by the partition function, which here is the sum of all the values. Alternatively, rethink the model to understand why they did not add up in the first place.
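The scaling step can be sketched in a few lines. The helper names `normalize` and `is_valid_distribution` and the sample scores are hypothetical, chosen only to show the two properties being checked and the division by the sum.

```python
import math


def normalize(scores):
    """Scale non-negative scores so that together they sum to 1."""
    if any(score < 0 for score in scores):
        raise ValueError("scores must be non-negative")
    partition = sum(scores)  # the normalizing constant (partition function)
    if partition <= 0:
        raise ValueError("scores must have a positive total")
    return [score / partition for score in scores]


def is_valid_distribution(probabilities, tolerance=1e-9):
    """Check both properties: each value in [0, 1], and the total equals 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    sums_to_one = math.isclose(sum(probabilities), 1.0, abs_tol=tolerance)
    return in_range and sums_to_one


raw_scores = [0.9, 0.6, 0.3]  # each lies in [0, 1], but the total is 1.8
probabilities = normalize(raw_scores)

print(is_valid_distribution(raw_scores))     # False: the values sum to 1.8
print(is_valid_distribution(probabilities))  # True after scaling
```

Scaling fixes the arithmetic but not the model: if the raw values were meant to be probabilities already, a total of 1.8 is a signal worth investigating.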

Rare events do not have probability zero: any event that is possible must have a greater-than-zero probability of occurrence. Discounting is a way of assigning a likelihood to events that are possible but have not yet been observed.
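The text does not say which discounting scheme is meant. One simple and widely used option is add-one (Laplace) smoothing, sketched here under that assumption. The function name `add_one_probabilities` and the weather data are invented for illustration.

```python
from collections import Counter


def add_one_probabilities(observations, event_space):
    """Estimate probabilities with add-one (Laplace) smoothing.

    Every event in event_space receives one extra pseudo-count, so events
    that were never observed still get a probability above zero.
    Assumes every observation belongs to event_space.
    """
    counts = Counter(observations)
    total = len(observations) + len(event_space)
    return {event: (counts[event] + 1) / total for event in event_space}


observed = ["sun", "sun", "rain", "sun"]  # "snow" never appears in the data
events = ["sun", "rain", "snow"]
smoothed = add_one_probabilities(observed, events)

for event, probability in smoothed.items():
    print(f"{event}: {probability:.3f}")
```

Without smoothing, "snow" would be estimated at 0 out of 4. With it, the estimate is 1 out of 7, and the three probabilities still sum to 1.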

Models must be honest about what they do and do not know, and probabilities are a measure of humility about the accuracy of our model.

It is important to note, however, that deterministic models have advantages of their own: such a model yields only one possible answer, and the fact that it always returns the same answer helps in debugging its implementation.
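A stochastic model can borrow this debugging benefit by fixing the seed of its random number generator, which makes each run repeatable. The sketch below assumes a toy Monte Carlo estimate of pi. The function name `estimate_pi` and the seed value are hypothetical.

```python
import random


def estimate_pi(samples, seed=None):
    """Monte Carlo estimate of pi from random points in the unit square.

    With seed=None the result varies from run to run; with a fixed seed
    the same sequence of points is drawn, so the answer is repeatable.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point falls inside the quarter circle
            inside += 1
    return 4 * inside / samples


first = estimate_pi(10_000, seed=42)
second = estimate_pi(10_000, seed=42)
print(first == second)  # True: a fixed seed reproduces the same estimate
```

If a change to the code alters the seeded result, the difference comes from the change and not from chance, which is what makes the implementation easier to debug.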