A binary logistic regression model for prediction of feed conversion ratio of Clarias gariepinus from feed composition data

Aquaculture in developing countries faces many challenges that are barely being addressed. With feed accounting for nearly 70% of the total production cost, it is imperative to optimize how research into feed development is conducted. Feed conversion ratio (FCR), as a measure of feed quality, can be used to quantify in retrospect the appropriateness of feed fed to livestock, particularly Clarias gariepinus. This study shows that binary logistic regression can determine whether a prospective feed will perform below or above the acceptable FCR level of 1.5, based on its composition and proximate analysis values. Data from similar experiments were normalized and split into training and testing sets to fit a logistic regression model. Three numerical optimizers (liblinear, Newton-CG, and SAG) were used, and the accuracy of the resulting models was compared using the confusion matrix and the Jaccard similarity score. An accuracy of 0.8 was observed regardless of the numerical optimizer, which indicates the appropriateness of the model for predicting either high or low FCR for feed types. The prediction probabilities, however, differed between the liblinear solver and the SAG and Newton-CG solvers. The liblinear solver gave close probabilities for predicting whether values will be 1 or 0. Although all solvers made similar predictions, this indicates a possible affinity for error when the liblinear solver is used. This is also indicated by its log loss of 0.65, compared with 0.51 for both the SAG and Newton-CG solvers.

Please cite this paper as follows: Adekunle, F. O. (2021). A binary logistic regression model for prediction of feed conversion ratio of Clarias gariepinus from feed composition data. Marine Science and Technology Bulletin, 10(2): 131-141.


Introduction
The variety of factors that could be responsible for an observed outcome in any biological system makes repeated experimentation on compounded feed necessary. This accounts for the use of a large amount of human and capital resources.
Aquaculture plays a very important role in food systems, especially in middle-income regions where the industry employs labor and serves as a major source of animal protein.
As stated by FAO (2016), global fish production reached a peak of 171 million tonnes, 47% of which was produced by aquaculture. However, if aquaculture is to remain an alternative to dwindling capture fishery stocks, there is a need to reduce the cost of production, most of which is accounted for by feeding, which takes between 60% and 80% of the total cost of production (Ng et al., 2013).
Fishmeal is an indispensable component of fish feed due to its amino-acid profile, fatty acids, flavor, and other essential nutrients. The ecological cost of fishmeal and the high demand for it from other livestock species necessitate a comparable substitute (Farahiyah et al., 2016). Hence, optimization of the protein content of feed relies heavily on successfully substituting fishmeal with other, more affordable feedstuff for optimal growth performance (Degani et al., 1989).
The feed conversion ratio (FCR) is simply the amount of feed it takes to grow a kilogram of fish. For example, if it requires two kilograms of feed to grow one kilogram of fish, the FCR is two. A feed with a low FCR therefore takes less feed to produce one kilogram of fish than a feed with a higher FCR, so a low FCR is a good indication of a high-quality feed. FCR is a valuable and powerful tool for the fish farmer. It allows for an estimate of the feed that will be required in the growing cycle, and knowing how much feed will be needed allows a farmer to determine the profitability of an aquaculture enterprise. FCR thus allows the farmer to make wise choices in selecting and using feed to maximize profitability (USAID-HARVEST, 2011). Several factors can influence the way fish respond to feed, including stage of culture, size, water quality, genetics, pond management, and the composition of other feedstuff.
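The FCR definition given above can be illustrated with a short calculation. This is a minimal sketch; the function name `feed_conversion_ratio` and its arguments are hypothetical and are not taken from the study.

```python
def feed_conversion_ratio(feed_given_kg: float, weight_gain_kg: float) -> float:
    """Return the kilograms of feed used per kilogram of fish weight gained."""
    if weight_gain_kg <= 0:
        raise ValueError("weight gain must be positive to compute an FCR")
    return feed_given_kg / weight_gain_kg


# Example from the text: 2 kg of feed to grow 1 kg of fish gives an FCR of 2.
fcr_example = feed_conversion_ratio(2.0, 1.0)
print(fcr_example)  # 2.0
```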
Binary logistic regression studies the association between a categorical dependent variable and a set of independent variables. It is used when the outcome has only two possible values (0 and 1), as opposed to multinomial regression, where the outcome can take three or more possible values. Unlike linear regression, logistic regression is used for the prediction of categorical response variables. It is considered well suited to this kind of modeling because it does not assume a normal distribution for the independent variables (NCSS, 2020).

Data Generation and Preprocessing
FCR values reported for specific feed compositions by Chor et al. (2013), Oyekanmi et al. (2013), Dudusola and Akinlade (2014), Falaye et al. (2015), and Aniebo et al. (2009) were used as historical data. The data are experimental results from feeding trials on Clarias gariepinus, together with the feed components used in each of the trials. Some components, such as fishmeal and lipids, are present in nearly all the trials. Other components include maggot-meal, feather meal, blood-meal, etc. Feed proximate analysis data obtained with a similar experimental design and the analytical procedure outlined by AOAC (1990) were collected as relevant to the feed composition data. Initial data entry was done in Excel spreadsheets: feed component and proximate data were loaded into rows, while columns were based on the feed trial indicator and source.
Feed component data included in the model comprise the most utilized feedstuff for the formulation of feed for African catfish (Table 1). This is expected to make it easier for a third party to use the model in the prediction of feed conversion ratio.
Five code indicators are used in the columns to indicate the source of the data: FTM, MGT, CMGT, MAIZE, and FSHML. These represent the feather-meal, maggot-meal (MGT and CMGT), maize, and fishmeal trials respectively, each referring to the theme of the feeding trial from which the corresponding data was obtained.

Binary Classification
Li et al. (2014) observed a feed conversion ratio of 1.8 to 1 to be typical in experimental set-ups, and this informed the categorization of the FCR values in the historical data. FCR values between 0 and 1.5 were categorized as 1, while FCR values greater than 1.5 were categorized as 0, as shown in Table 2.
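The categorization rule described above can be sketched as follows. The threshold of 1.5 comes from the text; the function name `fcr_class` and the sample FCR values are hypothetical and serve only as an illustration.

```python
FCR_THRESHOLD = 1.5  # cut-off stated in the text


def fcr_class(fcr: float) -> int:
    """Return 1 for an FCR between 0 and 1.5 (inclusive), otherwise 0."""
    if fcr < 0:
        raise ValueError("FCR cannot be negative")
    return 1 if fcr <= FCR_THRESHOLD else 0


# Hypothetical FCR values, not taken from the historical data.
sample_fcr = [0.9, 1.5, 1.8, 2.4]
sample_labels = [fcr_class(value) for value in sample_fcr]
print(sample_labels)  # [1, 1, 0, 0]
```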

Regression
Logistic regression uses the independent variables from the historical data (feed composition and proximate analysis, as described above) to produce a formula that predicts the probability of the class label (the FCR class). It fits an S-shaped curve by transforming the numeric estimate into a probability using the sigmoid function. Hence, the model predicts the class to which a hypothetical feed composition belongs (1 meaning good FCR and 0 meaning bad FCR) and also gives the probability of belonging to that class:
$$P_i = \frac{1}{1 + e^{-\beta^{T} X_i}} \qquad (2)$$
where $P_i$ is the probability of having an acceptable value for FCR (class 1), $X_i$ is a vector of explanatory variables, and $\beta$ is the vector of unknown parameters to be estimated.
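The sigmoid transformation described above can be sketched in a few lines. This is only an illustration: the coefficients, intercept, and feature values are hypothetical, and the 0.5 decision cut-off is the usual default rather than a value stated in the study.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued linear estimate onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(coefficients, intercept, features):
    """Probability of class 1 for one feed, given fitted parameters."""
    linear_estimate = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(linear_estimate)


# Hypothetical parameters and (already normalized) feature values.
probability = predict_probability([0.8, -0.5], 0.1, [1.2, 0.4])
predicted_class = 1 if probability >= 0.5 else 0  # assumed 0.5 cut-off
print(round(probability, 3), predicted_class)
```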

Normalization, Train, and Test Splitting
Normalization was done using StandardScaler from the scikit-learn preprocessing module so that each feature has an equal representation within the groups. Using the train_test_split function, the data were split into training and testing sets. The test size was set at 20%, while 80% was used for training the model.
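The normalization and splitting steps can be sketched as below. The data are randomly generated placeholders (25 hypothetical feeds with 4 features), and `random_state=4` is an arbitrary choice for reproducibility; neither comes from the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 25 hypothetical feeds described by 4 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 50.0, size=(25, 4))
y = (X[:, 0] > np.median(X[:, 0])).astype(int)  # placeholder 0/1 labels

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Hold out 20% of the rows for testing and keep 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4
)
print(X_train.shape, X_test.shape)  # (20, 4) (5, 4)
```

The sketch scales before splitting, following the order described in the text. A common alternative is to fit the scaler on the training rows only, so that no information from the test rows influences the scaling.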

Modelling and Fitting
The inverse of the regularization strength, also known as the 'C' parameter, was set at 0.01. The numerical optimizer was first set to liblinear; the SAG and Newton-CG solvers were also applied to identify the optimal solver. A test set comprising 5 data points was used to test the model.
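A fitting loop over the three solvers might look like the sketch below. Only `C=0.01` and the solver names come from the text; the training data are random placeholders, and `max_iter` and `random_state` are assumptions added so that the example runs reproducibly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder standardized data: 20 training rows and 5 test rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 4))
y_train = (X_train[:, 0] > np.median(X_train[:, 0])).astype(int)
X_test = rng.normal(size=(5, 4))

fitted = {}
for solver in ("liblinear", "newton-cg", "sag"):
    # C is the inverse of the regularization strength, as in the text.
    model = LogisticRegression(C=0.01, solver=solver, max_iter=1000, random_state=0)
    model.fit(X_train, y_train)
    fitted[solver] = model
    # Predicted class and probability of class 1 for each test row.
    print(solver, model.predict(X_test), model.predict_proba(X_test)[:, 1].round(2))
```

Comparing the `predict_proba` output across solvers is what exposes differences that identical class predictions can hide. One possible contributor to such differences is that liblinear also regularizes the intercept term, which the other two solvers do not.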

Measurement of Accuracy
Jaccard index (Jaccard similarity score): estimated using the metrics module of sklearn. The index is the size of the intersection divided by the size of the union of two label sets (0 and 1), i.e. if all predicted labels for a particular set match the true labels, subset accuracy is 1.0; if none match, it is 0.0.
Confusion matrix: shows the number of correctly predicted points versus wrong predictions side by side. The first row is for feeds whose actual class in the test set is 1, and the second row is for feeds whose actual class is 0. The first column holds the number of points predicted as 1 and the second column the number predicted as 0, so correct predictions lie on the diagonal and wrong predictions lie off it. These values can then be interpreted as true positives, false positives, true negatives, and false negatives. Classification report: comprises precision, recall, and F1 score. Precision measures the accuracy of the predictions given that a class label has been predicted, as in Equation (3):
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3)$$
Recall is the rate of true positives, calculated by Equation (4):
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4)$$
where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives.
F1 score is the harmonic mean of precision and recall; its best value is 1 and its worst is 0. Log loss measures the performance of a classifier from its predicted probabilities where the output to be predicted is binary; lower values indicate better performance.
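The metrics described above can be computed with scikit-learn as in the sketch below. The labels and probabilities are hypothetical values chosen only so that the counts are easy to follow; they are not the study's data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    jaccard_score,
    log_loss,
)

# Hypothetical test-set labels and predictions (not the study's data).
y_true = np.array([1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1])
# Hypothetical predicted probabilities, columns ordered as class 0, class 1.
proba_class1 = np.array([0.7, 0.3, 0.2, 0.4, 0.6])
y_proba = np.column_stack([1.0 - proba_class1, proba_class1])

accuracy = accuracy_score(y_true, y_pred)
jaccard = jaccard_score(y_true, y_pred, pos_label=1)
# labels=[1, 0] puts actual class 1 in the first row, as described in the text.
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
loss = log_loss(y_true, y_proba, labels=[0, 1])

print(accuracy, jaccard, loss)
print(matrix)
print(classification_report(y_true, y_pred))
```

In recent scikit-learn releases, `jaccard_score` scores only the positive class in a binary problem, so it need not equal the accuracy. The older `jaccard_similarity_score` function behaved like accuracy for binary labels, which matches the description of the index given above.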

Results
Accuracy metrics for the test set data are given in Tables 3, 4, and 5. The confusion matrices for logistic regression using the different solvers are presented in Figures 1, 2, and 3. Classification reports for the different solvers are tabulated in Tables 6, 7, and 8.

Discussion
The similar Jaccard similarity scores in Tables 3, 4, and 5 show that all solvers predicted the output of the 5 test data points with 80% accuracy, which corresponds to the similar values in the prediction columns. However, the prediction probabilities obtained with the liblinear solver have very narrow margins. For the first prediction made using the liblinear solver, probabilities of 57% for class 0 and 42% for class 1 were recorded. This may make the solver more prone to error. Logarithmic loss is also highest under the liblinear solver when compared with the other solvers.
The Newton-CG and SAG solvers show very high probabilities for the accurate predictions they made. Compared with the liblinear solver, however, the SAG and Newton-CG solvers also place a very high probability on the prediction for the last data point, which was wrong. The true value for point 5 was 0, but it was predicted as 1. While the liblinear solver apportioned a probability of 43% on 0, both Newton-CG and SAG apportioned a probability of 56%.
The confusion matrices indicate the same numbers of true positives (1), false positives (1), and true negatives (3) for every solver; no false negatives were observed. This translates to 4 right predictions and 1 wrong prediction regardless of the solver used.
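As a check, the confusion-matrix counts reported above can be worked through by hand. The accuracy reproduces the 80% figure; the precision and recall for class 1 are derived here from those counts and are not quoted from the classification report tables.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{1 + 3}{1 + 3 + 1 + 0} = 0.8 \\
\text{Precision (class 1)} &= \frac{TP}{TP + FP} = \frac{1}{1 + 1} = 0.5 \\
\text{Recall (class 1)} &= \frac{TP}{TP + FN} = \frac{1}{1 + 0} = 1.0
\end{align*}
```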

Conclusion and Recommendation
Binary logistic regression uses data to answer questions with two possible outcomes, and can be used simply to predict whether a result will be high or low. The method can also be extended to predict more than two outcomes (multinomial regression). Its application to aquaculture, especially where extensive laboratory experiments are unavailable, will help rural farmers, feed manufacturers, and researchers have an idea of the expected feed conversion ratio of the feed being compounded.
From the probability results obtained, different solvers give closely similar predictions but may differ in the probabilities attached to them. While the Newton-CG and SAG solvers performed better than the liblinear solver, the results indicate the need to run the same prediction with multiple solvers and compare the resulting probabilities.
The study indicates that logistic regression can be used to successfully predict the FCR of feed compounded for Clarias gariepinus as either high (1) or low (0), as long as the feed composition contains any of the following feedstuffs: fishmeal, rocky-prawn, feather-meal, brewers-waste, soybean meal, blood-meal, lipid, maggot-meal, wheat-bran, yellow maize, groundnut-cake, carboxymethylcellulose (CMC), vitamin, chromic, minerals, calcium, cellulose, and tapioca. The proximate analysis data needed for the model include feed protein content, fat content, ash content, crude fiber, nitrogen-free extract, moisture, culture period, and fish weight at the onset of the experiment.
The study utilized historical data for predictive analysis; the quantity and quality of the data used in training models determine the accuracy and robustness of such models. Access to such data can be made easier with cloud relational databases that hold experimental data and make them easily accessible. This would enhance aquaculture development, especially in areas where experimental funding is a challenge.

Conflict of Interest
The author declares that there is no conflict of interest. Historical data utilized in the research is appropriately cited.

Ethical Approval
For this type of study, formal consent is not required.