*A collection of features powered by machine learning and intuition behind their evaluation metrics*

#### Feature: Search Ranking

The user inputs a search query, along with optional qualifiers, and the results are displayed as a list sorted by relevance.

Metrics: DCG and NDCG


- Assume that we have capable human judges who have graded the relevance of each retrieved search result for a query
- DCG measures whether more relevant documents are shown above less relevant documents in the final ranked list
- NDCG measures how close this score is to the score of the ideal ranking of the retrieved documents
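The two scores can be sketched in a few lines. This uses the common log2-discount form of DCG; the graded relevances below are hypothetical judge labels, not from the source.

```python
import math

def dcg(relevances):
    # DCG rewards placing highly relevant documents near the top:
    # each grade is discounted by log2 of its (1-based) rank + 1.
    # enumerate starts at 0, hence rank + 2 inside the log.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # NDCG normalizes DCG by the DCG of the ideal (sorted) ordering,
    # so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades in the order the results were shown:
print(round(ndcg([3, 2, 3, 0, 1]), 3))
```

Swapping the two top documents here barely moves the score, while burying a grade-3 document at the bottom would cost much more, which is the intuition behind the log discount.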

Details in the wiki link

#### ML Model: Classifier

A classifier is a model that assigns an input sample to one of several output classes.

A binary classifier is a variant that assigns the input sample to one of two classes, for example the spam classifier that filters spam from our inbox. Usually a classifier calculates a probability score of the input sample belonging to a particular class. The application developer, who is the consumer of the classifier, applies a discrimination threshold to that score to make a decision.

Now, the selection of the threshold will depend on application-specific priorities, so we need a metric that can evaluate the classifier independently of the threshold we choose.
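A minimal sketch of how a consumer applies that threshold; the classifier, its scores, and the threshold values here are all hypothetical.

```python
def classify(score, threshold=0.5):
    # The classifier only returns a probability score; the application
    # turns it into a yes/no decision with a discrimination threshold.
    return "spam" if score >= threshold else "not spam"

scores = [0.95, 0.40, 0.62]           # hypothetical per-email spam scores
print([classify(s, threshold=0.5) for s in scores])
print([classify(s, threshold=0.7) for s in scores])  # stricter threshold
```

Note that the same scores yield different decisions at 0.5 and 0.7: raising the threshold trades recall for precision, which is exactly the choice discussed below.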

Based on the classifications made by an ML model on a manually labelled data set, a matrix called the confusion matrix is defined:

|  | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actually positive | True positive (TP) | False negative (FN) |
| Actually negative | False positive (FP) | True negative (TN) |
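The matrix is just four counts over the labelled data set. A sketch, with hypothetical labels and predictions:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    # Count each (actual, predicted) pair; 1 = positive, 0 = negative.
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)], "FN": counts[(1, 0)],
        "FP": counts[(0, 1)], "TN": counts[(0, 0)],
    }

actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # manual labels
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # model output at some threshold
print(confusion_matrix(actual, predicted))
```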

#### Metric: Precision

Precision is the ratio of true positive samples to all samples labelled positive by the classifier.

#### Metric: Recall

Recall is the ratio of true positives to condition positives (all actually positive samples) in the population.
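In terms of the confusion-matrix counts, the two metrics are one-liners (the counts below are made up for illustration):

```python
def precision(tp, fp):
    # Of everything the classifier labelled positive, how much really was?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything that really is positive, how much did the classifier find?
    return tp / (tp + fn)

# Hypothetical counts: 60 true positives, 20 false positives, 30 false negatives
print(precision(60, 20))  # 0.75
print(recall(60, 30))
```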

#### ROC Curve: Receiver operating characteristic

The ROC curve plots the true positive rate against the false positive rate as the discrimination threshold varies.

The true positive rate (TPR) is also known as recall. In a collection of 100 emails, if 90 were spam and the classifier correctly identified 60 of them as spam, then the recall is 66.66% (60/90).

The false positive rate (FPR) is also known as fall-out or the probability of false alarm. If the classifier wrongly classified 2 of the 10 non-spam emails as spam, then the FPR is 20% (2/10).

A better model is one that gives a higher TPR at a lower FPR. Plotting TPR against FPR for different thresholds traces out the ROC curve.
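Sweeping the threshold and recording one (FPR, TPR) point per step is all the plot requires. A sketch, with hypothetical spam scores and labels:

```python
def roc_points(scores, labels, thresholds):
    # One (FPR, TPR) point per threshold; labels: 1 = spam, 0 = not spam.
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, l in zip(preds, labels) if p and l)
        fp = sum(1 for p, l in zip(preds, labels) if p and not l)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # hypothetical classifier scores
labels = [1,   1,   0,   1,   0,   0]
print(roc_points(scores, labels, [0.0, 0.5, 1.0]))
```

At threshold 0.0 everything is flagged (TPR = FPR = 1); at a very high threshold nothing is (TPR = FPR = 0); the interesting trade-offs sit in between.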

#### Metric: AUC

AUC, or area under the curve, is exactly what it says: the area under the ROC curve. The model with the higher TPR at each value of FPR will have the larger AUC and the better performance.
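Given the (FPR, TPR) points from a threshold sweep, the area can be approximated with the trapezoidal rule; the curves below are illustrative, not from the source.

```python
def auc(points):
    # Trapezoidal area under a piecewise-linear ROC curve;
    # points are (FPR, TPR) pairs.
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A random classifier traces the diagonal and scores 0.5:
print(auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # 0.5
# A better curve bulges toward the top-left corner:
print(auc([(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]))
```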

#### Metric: Prediction Error

Prediction error is the percentage of samples that are misclassified. This is intuitive in itself.

Prediction error can be misleading for classification problems with imbalanced class distributions. For example, say there is a disease with a prevalence of 1 in 1000. A classifier that defaults to predicting "no disease" in all cases will have a very low prediction error.
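The arithmetic behind that trap, with a hypothetical screening population:

```python
# Hypothetical screening data: disease prevalence is 1 in 1000.
population = 100_000
sick = population // 1000            # 100 actually sick patients

# A degenerate classifier that always predicts "no disease"
# misclassifies only the sick patients:
errors = sick
prediction_error = errors / population
print(f"{prediction_error:.2%}")     # looks excellent, yet detects nobody
```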

#### Metric: F1 Score

The F1 score is the harmonic mean of precision and recall. This metric solves the problem described above: the always-negative classifier finds no true positives, so its recall is 0 and its F1 score collapses to 0 despite its low prediction error.
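A sketch of the harmonic mean; the 0/0 precision of the degenerate all-negative classifier is taken as 0 here by convention, which is an assumption, not from the source.

```python
def f1_score(precision, recall):
    # Harmonic mean: dominated by the smaller of the two values,
    # so a classifier cannot hide one terrible metric behind the other.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The always-"no disease" classifier: recall 0 means F1 is 0.
print(f1_score(0.0, 0.0))
# A classifier with decent but unequal precision and recall:
print(f1_score(0.8, 0.6))
```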

#### Predictive Models

These models predict a continuous value for an input sample, for example house prices, stock prices, or click-through rates.

RMSE (root mean square error) measures the square root of the mean of the squared differences between predicted and actual values.

MAE (mean absolute error) is a similar metric, the mean of the absolute differences. Unlike RMSE, it does not penalize big differences more heavily than small ones.
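Both definitions in a few lines, with made-up house prices where one prediction is far off, to show RMSE punishing the outlier harder than MAE:

```python
import math

def rmse(actual, predicted):
    # Squaring the differences makes large errors dominate the score.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def mae(actual, predicted):
    # Absolute differences weight every error linearly.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical house prices (in thousands); the last prediction is 100 off:
actual    = [300, 450, 500, 620]
predicted = [310, 440, 500, 720]
print(mae(actual, predicted))   # 30.0
print(rmse(actual, predicted))  # pulled far above MAE by the one big miss
```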

#### Deciding on an evaluation metric

All of the above are techniques used in industry to measure the intrinsic quality of ML models, and more sophisticated techniques exist for more complicated models. However, as a product manager or an application developer, deciding which ML API or library to use and what thresholds to set should be based on the business KPIs you are targeting. This, thankfully, is still a matter of business sense.

#### Example to work out the intuition

For a very simplified example, let's say that a job site has an application flow where the job seeker is supposed to upload her CV, fill in her name in a field, and submit the application.

You are evaluating an NLP library that will parse the CV uploaded and pre-fill the name field for the user to confirm and submit the application.

This is a simplified example; in real-life applications the document parser will try to fill many fields of the form, making the user's life much easier. Now you have a classifier library with a configurable threshold that trades off precision and recall: increasing recall reduces precision and vice versa.

*Precision in this case is the percentage of times the name filled is correct. Recall in this case is the percentage of times name was filled by the classifier.*

The question is where to set the threshold. One way to decide is by conducting A/B experiments where multiple thresholds are compared on completion rate, time to complete, abandonment rate, error rates, etc. in the final application.

Now if you need to decide on some range for the threshold for experimenting, you can think through it like below:

- A correctly filled field will increase the completion rate by X%
- A partially filled field will increase the completion rate by a lower Y%. It will still increase the completion rate, as the user only has to correct a few characters, perhaps add a second name, and so on. Most of the errors will be of this nature.
- A totally incorrect field will still not decrease the completion rate by much if we give the user the option to correct it easily. Note that the user will have already uploaded the CV at this point and invested time in the process. (escalation of commitment, or the sunk cost fallacy, in UI design)

Hence we conclude that this application is better off with high recall at the cost of some precision. There are other applications where precision should not be traded off for recall; for example, autocomplete and suggestions in search boxes are better off being precise even at the cost of recall.

To summarize, the evaluation of any ML model is greatly influenced by the end application in which it will be used. With some research and practice, a PM or an application developer can develop the intuition to pick the right metric for each application.
