Disease Detection based on Symptoms with treatment recommendation (with Scraped Data set)

Rahul Maheshwari
May 22, 2020

📑Table of Contents:

  1. Disease detection and its importance
  2. Background and Proposed Approach
  3. Scraping of Data set
  4. Pre-processing of Data set and Solution Sketch
  5. Training the models
  6. Detected diseases using ML models
  7. Detected diseases using TF-IDF model
  8. Detected diseases using Cosine Similarity
  9. Recommendation of treatments and details of a disease
  10. Accuracy comparison and Results
  11. Contributions

Disease Detection and its importance

Detection of a disease 💊 is one of the first steps in treating it, whether it is the common cold or cancer. Detection is based primarily on symptoms, which may be common to many diseases. Early detection is often regarded as half the job done when treating a disease. It may seem like a simple and straightforward procedure, but it is actually quite complex. Diseases that share common symptoms cause confusion 😕 for both doctors and patients when identifying the disease and being sure that the prescribed treatment is the right one 😌. Detection also often requires the patient to go through a series of tests, which can be time consuming and costly ✂️ 💰.

Existing Problems with detection of diseases

  • Many diseases with common symptoms
  • Unfamiliarity of patients with medical terms to describe their exact symptoms
  • Difficulty of early detection
  • Human error (faced by medical practitioners)
  • Getting a second opinion is time consuming

Hence, correct identification of a disease is important for guiding the correct treatment procedure 🙌.

Background and Proposed Approach

One viable solution to the problems described above is to use data mining and ML (Machine Learning) algorithms to classify and detect diseases. Machine Learning applications in the healthcare and biomedical domain have led to earlier disease detection and better diagnosis, which has enhanced patient care in recent times. Studies have shown that people turn to the internet for health-related issues 💻. The problem with this approach is that search engines return bulk information in a scattered format, from which it is difficult to draw conclusions and which often confuses the user 😕.

Currently, many disease-specific detection systems are available, such as heart disease prediction ❤️, neurological disorder prediction 😨, and skin disease prediction ✋. But a universal system that predicts diseases from symptoms is rarely seen in practice. Such a system helps doctors and patients learn more about a likely disease before any medical tests are done. Also, for frequently occurring conditions like fever or the common cold, users often do not want to spend money and go through tests. A symptom-based approach helps address these problems.

The proposed approach takes symptoms as input from the user in common, natural language (for example, pyrexia may be given in place of fever, or tire in place of fatigue 😵). The system performs query expansion using the synonyms of each symptom, matches the expanded query against the symptoms present in the dataset, and lets the user select their symptoms from the matches. A real-world doctor would ask the patient about any other symptoms they are having, so the system does the same by asking about the symptoms that most often co-occur with the ones the patient stated initially. The user is prompted until they have marked all their symptoms. This QnA-style approach simulates the real-world interaction between a patient and a medical practitioner. The selected symptoms are then given as input to the trained model to make predictions. The top 10 most probable diseases are shown to the user with their respective independent probabilities or scores. From this list, the user can enter the index of a disease to see more details about it along with the treatment recommendation.

Scraping of Data set

We investigated existing systems similar to ours, and most of them used the publicly available data scraped from Columbia University’s web page.

The problem with this dataset is that it lacks many symptoms and covers only a small number of diseases.

To overcome this problem, we decided to scrape data from 2 different sources to create a larger dataset 😌, as it is a well-known fact that in most cases:

More data ≅ Better Learning & Better Accuracy

Scraping of the dataset consists of 2 parts:

  • Diseases: Diseases are scraped from the National Health Portal of India, developed and maintained by the Center for Health Informatics (CHI). This list is combined with a predefined list of diseases to cover more diseases in the final dataset.
  • Symptoms: Symptoms are scraped using a script that uses the Google Search package to search for each disease and pick its Wikipedia page from the search results. The HTML of the page is then processed to extract the symptoms from the ‘infobox’ on the Wikipedia page. Fig 1 shows an example of Wikipedia’s infobox.

Fig 1. Wikipedia’s Infobox Example

The final dataset contains a total of 261 diseases and 500+ symptoms. To multiply the dataset, the symptoms of each disease are taken, combinations of those symptoms are created, and each combination is added as a new row in the dataset.

For example, a disease A with 5 symptoms now has a total of (2⁵ − 1) = 31 entries in the dataset. After pre-processing and multiplication, the dataset contains around 8835 rows with 489 unique symptoms. This was done to tackle the problem of having only a single row for each disease, which results in poor training. The idea was inspired by the real-world scenario in which a patient showing only some of the symptoms of a disease can still be suffering from it, so it is a logical extension of the dataset.
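
As a rough illustration of this multiplication step, the sketch below enumerates every non-empty subset of a disease's symptoms. The function name `expand_disease_rows` and the sample symptoms are made up for this example and are not taken from the project code.

```python
from itertools import combinations


def expand_disease_rows(disease, symptoms):
    """Return one (disease, symptom_subset) row per non-empty subset of symptoms."""
    rows = []
    for size in range(1, len(symptoms) + 1):
        for subset in combinations(symptoms, size):
            rows.append((disease, set(subset)))
    return rows


# Hypothetical disease with 5 symptoms: 2**5 - 1 = 31 rows are produced.
rows = expand_disease_rows(
    "disease A", ["fever", "cough", "fatigue", "headache", "nausea"]
)
print(len(rows))  # 31
```

The row count grows exponentially with the number of symptoms, so a real pipeline may need to cap the subset size for diseases with long symptom lists.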

Fig 2 shows the systematic flow of steps involved in data scraping.

Fig 2. Dataset Scraping Flowchart

Pre-processing of Data set and Solution Sketch

The scraped symptoms are pre-processed to merge similar symptoms that have different names (for example, headache 😨 and pain in the forehead 😨). To do so, each symptom is expanded by appending synonyms of the terms in the symptom string, and the 💻 Jaccard Similarity Coefficient is computed for each pair of expanded symptoms.

if Jaccard(Symptom1, Symptom2) > threshold:
    Symptom2 -> Symptom1

We used a threshold of 0.75.

The synonyms are taken from Thesaurus.com 📚 and Princeton University’s WordNet 📙. If the score is greater than the threshold, the two symptoms are considered alike and one can be used in place of the other.
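
A minimal sketch of this merging step is shown below, assuming each symptom has already been expanded into a set of tokens (its own terms plus synonyms). The helper names and the synonym sets are hypothetical; only the Jaccard formula and the 0.75 threshold come from the text.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_similar(expanded, threshold=0.75):
    """Map every symptom to a canonical symptom it is similar to.

    expanded: dict of symptom name -> set of tokens (terms + synonyms).
    """
    canonical = {}
    names = list(expanded)
    for i, first in enumerate(names):
        if first in canonical:
            continue  # already merged into an earlier symptom
        canonical[first] = first
        for second in names[i + 1:]:
            if second not in canonical and jaccard(expanded[first], expanded[second]) > threshold:
                canonical[second] = first  # Symptom2 -> Symptom1
    return canonical


# Hypothetical expanded token sets, for illustration only.
expanded = {
    "headache": {"head", "ache", "pain", "forehead", "cephalalgia"},
    "pain in the forehead": {"pain", "forehead", "head", "ache"},
    "fever": {"fever", "pyrexia"},
}
mapping = merge_similar(expanded)
print(mapping["pain in the forehead"])  # headache
```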

In general, the system prompts the user to enter symptoms and select from the suggested ones, based on which the model predicts the diseases with the highest probabilities/scores. Fig 3 describes the process of disease prediction 💊 from user input symptoms ✏️.

Fig 3. Complete Flow of Proposed System

Let’s understand everything better!

User Symptom pre-processing

The system accepts symptom(s) in a single line, separated by commas (,). The following pre-processing steps are then applied:

  • Split symptoms into a list based on comma
  • Convert the symptoms into lowercase
  • Removal of stop words
  • Tokenization of symptoms to remove any punctuation marks
  • Lemmatization of tokens in the symptoms

The processed symptom list is then used for symptom expansion.
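
The steps above can be sketched roughly as follows. This is only an illustration: the stop word list is a tiny subset, and the `LEMMAS` lookup is a hypothetical stand-in for a real lemmatizer (the article does not name the library used for lemmatization).

```python
import string

# Illustrative subset of stop words; a real system would use a full list.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "and", "my", "i", "have"}

# Hypothetical stand-in for a real lemmatizer.
LEMMAS = {"headaches": "headache", "eyes": "eye", "pains": "pain"}


def preprocess(raw):
    """Turn a comma-separated symptom line into a list of cleaned symptoms."""
    symptoms = []
    for part in raw.split(","):                      # split on commas
        words = part.lower().split()                 # lowercase
        words = [w for w in words if w not in STOP_WORDS]         # drop stop words
        tokens = [w.strip(string.punctuation) for w in words]     # strip punctuation
        tokens = [LEMMAS.get(t, t) for t in tokens if t]          # lemmatize
        if tokens:
            symptoms.append(" ".join(tokens))
    return symptoms


print(preprocess("Headaches, Pain in the eyes!, fever"))
# ['headache', 'pain eye', 'fever']
```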

Symptom Expansion, Symptoms Suggestion and Selection

Each user symptom is expanded by appending a list of synonyms of the terms in the symptom string. The expanded symptom query is used to find related symptoms in the dataset. To find such symptoms, each symptom from the dataset is split into tokens and each token is checked for its presence in the expanded query. Based on this, a similarity score is calculated, and if the symptom’s score is more than the threshold value, that symptom qualifies as similar to the user’s symptom and is suggested to the user.

tokenA -> tokens(Symptom A)
tokenSyn -> tokens(synonym string)
matching -> intersect(tokenA, tokenSyn)
score -> count(matching) / count(tokenA)
if score > threshold: select Symptom A
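
The pseudocode above could look like this in Python. The threshold of 0.5 and the sample data are assumptions made for this sketch; the article does not state the threshold used for this step.

```python
def match_score(dataset_symptom, expanded_query_tokens):
    """Fraction of the dataset symptom's tokens found in the expanded query."""
    tokens = set(dataset_symptom.lower().split())
    if not tokens:
        return 0.0
    matching = tokens & set(expanded_query_tokens)
    return len(matching) / len(tokens)


def suggest(dataset_symptoms, expanded_query_tokens, threshold=0.5):
    """Return dataset symptoms whose score exceeds the (assumed) threshold."""
    return [
        s for s in dataset_symptoms
        if match_score(s, expanded_query_tokens) > threshold
    ]


# Hypothetical expanded query for the user symptom "high temperature".
query_tokens = {"fever", "pyrexia", "febrility", "high", "temperature"}
dataset = ["high fever", "fever", "chest pain", "high blood pressure"]
print(suggest(dataset, query_tokens))  # ['high fever', 'fever']
```

Dividing by the number of tokens in the dataset symptom means a long symptom name such as "high blood pressure" is not suggested just because one of its words matches.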

Fig 4 shows the user input symptoms and the symptoms found in the dataset that match them.

Fig 4. Symptom suggested to user

The user selects one or more symptoms from the list. Based on the selected symptoms, further symptoms are shown to the user for selection; these are among the symptoms that most often co-occur with the ones selected initially. The user can select any symptom, skip, or stop the symptom selection process. The final list of symptoms is then compiled and shown to the user. Fig 5 shows an example of the symptom suggestion and selection process.

Fig 5. Symptom Suggestion and Selection Process
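
One simple way to find co-occurring symptoms is to count how often other symptoms appear in the same dataset rows as the selected ones. The sketch below assumes that approach; the article does not spell out the exact co-occurrence measure, and the function name and sample rows are hypothetical.

```python
from collections import Counter


def top_cooccurring(rows, selected, k=5):
    """Return up to k symptoms that most often share a row with the selected ones."""
    selected = set(selected)
    counts = Counter()
    for symptoms in rows:
        symptoms = set(symptoms)
        if symptoms & selected:                  # row contains a selected symptom
            counts.update(symptoms - selected)   # count the other symptoms in it
    return [symptom for symptom, _ in counts.most_common(k)]


# Hypothetical dataset rows (one set of symptoms per row).
rows = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"rash", "itching"},
    {"fever", "headache"},
]
suggestions = top_cooccurring(rows, ["fever"], k=3)
print(suggestions[0])  # cough
```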

Using the final symptom list, vectors are computed specific to the model and disease prediction 💊 is done. The model accepts the symptom vector and outputs a list of top K diseases, sorted in the decreasing order of individual probabilities/scores.

Training the models

After all the scraping and pre-processing, now it is time to rev up your engines 😜 😜 and do some magic (Not literally doing magic, just some extensive math) to train the machine learning models ⚡️ ⚡️.

A binary vector is computed that contains 1 for each symptom present in the user’s selection list and 0 otherwise. A machine learning model trained on the dataset is used for prediction. The model accepts the symptom vector and outputs a list of top K diseases, sorted in decreasing order of individual probabilities. As a common practice, K is taken as 10. 🙌
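
To make the encoding and ranking concrete, here is a small sketch in plain Python. The symptom vocabulary, disease names, and probabilities are invented for illustration; in practice the probabilities would come from the trained classifier.

```python
def symptom_vector(selected, all_symptoms):
    """Binary vector: 1 if the symptom was selected by the user, else 0."""
    selected = set(selected)
    return [1 if s in selected else 0 for s in all_symptoms]


def top_k(probabilities, diseases, k=10):
    """Return the k (disease, probability) pairs with the highest probability."""
    ranked = sorted(zip(diseases, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and model output.
all_symptoms = ["cough", "fatigue", "fever", "headache"]
vector = symptom_vector(["fever", "cough"], all_symptoms)
print(vector)  # [1, 0, 1, 0]

print(top_k([0.1, 0.7, 0.2], ["A", "B", "C"], k=2))  # [('B', 0.7), ('C', 0.2)]
```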

Multinomial Naïve Bayes, Random Forest, K-Nearest Neighbor, Logistic Regression, Support Vector Machine, and Decision Tree classifiers were trained and tested with a train-test split of 90:10.

A Multilayer Perceptron Neural Network was also trained and tested with the same split ratio.

You can find the implementation of all these models in Model_latest.ipynb in the github repo below 😍.

Out of all these, Logistic Regression performed the best when tested against 5 folds of cross validation.

Model training

Detected diseases using ML models

The probability of a disease is calculated as follows.

modelAccuracy -> accuracy(model used)
DisASymp -> Symptoms(DiseaseA)
match -> intersect(DisASymp, userSymp)
matchScore -> count(match) / count(userSymp)
prob(DiseaseA) = matchScore * modelAccuracy
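
The calculation above translates to a few lines of Python. The function name and the example symptoms are hypothetical; 0.895 is used only as a sample accuracy value.

```python
def disease_probability(disease_symptoms, user_symptoms, model_accuracy):
    """Share of the user's symptoms that belong to the disease, scaled by model accuracy."""
    user = set(user_symptoms)
    if not user:
        return 0.0
    match = set(disease_symptoms) & user
    match_score = len(match) / len(user)
    return match_score * model_accuracy


# 2 of the 4 user symptoms belong to the disease, so matchScore = 0.5.
prob = disease_probability(
    {"fever", "cough", "fatigue"},
    ["fever", "cough", "headache", "nausea"],
    model_accuracy=0.895,
)
print(round(prob, 4))
```

Note that this score is computed per disease, so the values for different diseases do not need to sum to 1.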

The following list of symptoms was selected by the user, and predictions were made for it:

Fig 6. Selected list of symptoms

Fig 7. Prediction using Logistic Regression along with disease detail

Note: OMG! 😧 😧 COVID-19 is on top, need to inform the patient to self isolate and get medical help immediately.

More about the disease detail part is described later in the article.

A Sequential Neural Network with AR (Adversarial Regularization) was also applied, with the following architecture:

  1. Feed Forward Neural Network with 2 Dense layers
  2. Activation function: ReLU for the dense layers and tanh for the output layer
  3. Bias terms are enabled
  4. Kernel initializer: He normal

Fig 8. Sequential NN Model details

Fig 9. Adversarial Regularization NN Model details

The Adversarial Regularization model was trained using the Sequential NN model as its base model.

Not much improvement in accuracy was observed, however, as these models are known to perform better with image data 😅. The accuracy reported was 90–91%.

Detected diseases using TF-IDF model

TF-IDF scores and cosine similarity are used to calculate the similarity between the user's query and each disease, based on symptoms. When the user gives an input query of symptoms, these symptoms are matched against the symptoms in our dataset and the similarity between them is calculated. These matching scores are used to rank the retrieved diseases.

What is TF-IDF?? 💭 💭

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying a word in a set of documents: we compute a weight for each word that signifies its importance in the document and in the corpus. In our context, symptoms play the role of words and diseases play the role of documents.

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

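tf-idf(t, d) = tf(t, d) * log(N / (df + 1))

where N is the total number of documents (diseases) and df is the number of documents that contain term t.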

How to make it work??

  1. TF measures the frequency of a symptom in a disease. IDF is the inverse of the document frequency and measures the informativeness of a term t.
  2. First, compute the DF (Document Frequency) of each symptom by counting the number of non-zero elements in its column of the dataset, then invert it to get the IDF. The IDF is the same for both the query vector and the dataset disease vectors.
  3. TF is calculated by traversing each element in the dataset, taking the count of the element as its TF, and normalizing it. It is stored in a dictionary.
  4. The final TF-IDF dictionary is created from the IDF and TF dictionaries using the formula above.
  5. Similarly, the TF of the query (i.e. the user's symptoms) is calculated and used to compute the final TF-IDF score for each disease, and the scores are sorted.
  6. The top K diseases with the highest TF-IDF scores are returned.
  7. TF-IDF works quite well with document-term matrices. The matching score returns relevant documents, but it tends to fail on long queries, which it cannot rank properly.

Finally, top K diseases are fetched based on tf-idf score and displayed along with their scores 😄.
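
A simplified sketch of this scoring is given below, using the formula tf-idf(t, d) = tf(t, d) * log(N / (df + 1)) from this section. It assumes each symptom appears at most once per disease, so the normalized TF is 1 / (number of symptoms of the disease). The function names and the toy dataset are made up for this example and do not mirror the project code.

```python
import math


def idf_table(disease_symptoms):
    """IDF per symptom: log(N / (df + 1)), where df counts diseases listing it."""
    n = len(disease_symptoms)
    df = {}
    for symptoms in disease_symptoms.values():
        for symptom in set(symptoms):
            df[symptom] = df.get(symptom, 0) + 1
    return {symptom: math.log(n / (count + 1)) for symptom, count in df.items()}


def tf_idf_rank(disease_symptoms, query, k=10):
    """Score each disease by summing TF-IDF over the query symptoms it contains."""
    idf = idf_table(disease_symptoms)
    scores = {}
    for disease, symptoms in disease_symptoms.items():
        tf = 1 / len(symptoms) if symptoms else 0.0  # normalized term frequency
        scores[disease] = sum(tf * idf[s] for s in query if s in symptoms)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy dataset, for illustration only.
data = {
    "common cold": ["cough", "sneezing", "runny nose"],
    "influenza": ["fever", "cough", "fatigue"],
    "malaria": ["fever", "chills", "sweating"],
    "migraine": ["headache", "nausea"],
}
ranking = tf_idf_rank(data, ["fever", "fatigue"])
print([disease for disease, _ in ranking[:2]])  # ['influenza', 'malaria']
```

With the "+ 1" in the denominator, a symptom listed by all but one disease gets an IDF of zero and a symptom listed by every disease gets a negative one, so very common symptoms contribute little or nothing to the ranking.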

Detected diseases using Cosine Similarity

What is Cosine Similarity?? 💭 💭

Cosine similarity is calculated using the formula

cos(a, b) = dot(a, b) / (norm(a) * norm(b))

where a and b are the two vectors between which we need to find the similarity.

This gives a score based on the two vectors: the higher the value, the higher the similarity.

  1. Cosine similarity performs better than the plain matching score because it considers the angle between the two vectors, and so does a better job of comparing them and assigning a similarity score.
  2. The TF-IDF values calculated in the steps above are used to create a matrix of shape (total diseases × total symptoms). Every data structure is vectorized to make the computation of the cosine score easy.
  3. The query vector is also calculated based on TF-IDF scores.

Finally, top K diseases are fetched based on cosine score and displayed along with their scores.
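
The cosine formula can be sketched in plain Python as shown below. The disease vectors and the query vector hold invented TF-IDF weights over a four-symptom vocabulary, purely for illustration.

```python
import math


def cosine(a, b):
    """cos(a, b) = dot(a, b) / (norm(a) * norm(b)); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(disease_vectors, query_vector, k=10):
    """Return the k diseases whose vectors are most similar to the query vector."""
    scored = [(d, cosine(v, query_vector)) for d, v in disease_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical TF-IDF vectors (one weight per symptom in the vocabulary).
disease_vectors = {
    "influenza": [0.10, 0.10, 0.23, 0.00],
    "malaria": [0.10, 0.00, 0.00, 0.23],
    "migraine": [0.00, 0.00, 0.00, 0.00],
}
query = [0.14, 0.00, 0.35, 0.00]
result = rank_by_cosine(disease_vectors, query)
print([disease for disease, _ in result])  # ['influenza', 'malaria', 'migraine']
```

Because the score depends only on the angle between vectors, a disease with many symptoms is not favored simply for having a longer vector.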

Recommendation of treatments and details of a disease

Out of the top K diseases shown in the output, if the user wants to know more about any disease, they can do so by simply giving the index of the disease as input. This functionality was added to make the proposed system a complete system.

One stop shop for all the information on Diseases and their detection

Fig 10. Disease details sample 1

Fig 11. Disease details sample 2

These details are scraped from Wikipedia using an approach similar to the one used earlier to prepare the dataset. Here, the scraping and the presentation of details to the user are done at runtime, which makes the proposed system robust and fast. Details commonly include Pronunciation, Complete list of Symptoms, Causes, Complications, Risk Factors, Diagnostic methods, Treatment 💊, etc.

Accuracy comparison and Results

Initially, for the baseline model, we used Multinomial Naïve Bayes as the classifier, which achieved an accuracy of 74% 😅. The baseline did not include any of the other methods such as query expansion, co-occurrence of symptoms, symptom matching using query expansion, or disease details. These were added to the system later.

Compared to the baseline system, the additions/improvements are as follows:

  • Initially, the system could only work with symptoms given exactly as they appear in the dataset. We improved this by incorporating synonyms and query expansion.
  • An independent probability is now calculated for each disease, which shows the confidence with which the model predicts it.
  • Suggestion of co-occurring symptoms (the affinity of 2 symptoms to occur together) was added, which gives the user more flexibility in providing a list of symptoms to the system.
  • Details about the predicted diseases and treatment recommendations were not implemented initially. They were added to make the system a complete medical system.

The accuracy for each model was calculated for the same task and is plotted for a clear comparison.

Fig 12. Model Accuracy Comparison plot

Evaluation of the dataset is done by applying various machine learning algorithms and comparing the accuracy obtained from them. The highest accuracy is reported by K-Nearest Neighbor (91.29%) and Decision Tree (91.29%) while the lowest is of Multinomial Naïve Bayes (83.94%).

Out of all these, Logistic Regression Classifier worked best with 5 fold cross validation with an accuracy of 89.5%.

The system’s performance was evaluated by comparing its predicted diseases with those obtained from WebMD’s Symptom Checker module, and it showed similar results.

Fig 13 shows the diseases predicted by WebMD for the same set of symptoms as shown in Fig 7.

Fig 13. Prediction by WebMD’s Symptom Checker

Contributions

  1. Rahul Maheshwari, MT19027, IIIT Delhi — Applied Machine Learning models and Neural Network on the scraped dataset, query expansion using synonyms, cross-validation on 5 folds, individual probability calculation for each predicted disease, model vs accuracy comparison plot.
  2. Nikunj Agarwal, MT19093, IIIT Delhi — Scraped diseases and symptoms dataset, data cleaning and pre-processing, symptom matching using query expansion, co-occurrence of symptoms, top k disease output with treatment and disease details recommendation.
  3. Anand Sharma, MT19059, IIIT Delhi — Applied TF-IDF scoring based predictions, Cosine Similarity based predictions, Sequential Neural Net and GAN using Tensorflow, comparison of results to analyze the differences and best model to apply.

Project documentation, which included the Project Report, Project PPT Poster, and comments within the code for detailed explanation, was done equally by all project team members 😃. You can find everything in the GitHub repo attached below.

Github Repo:

Mentions 🙌

We sincerely 😃 thank Dr.

, all the TAs , @jasmeet kaur, @hridoy shankar dutta, @abhinav gupta, @vrutti daxeshbhai patel for their support in commencement and completion of this project. We incorporated many features and functionalities in the system to make it a complete system in a short amount of time with their extensive support.

#InformationRetrieval2020 #IIITD #MachineLearning #NeuralNet #Classification #Detection #DiseaseDetection #Treatment #TF-IDF #CosineSimilarity
