Logistic regression coefficient plot showing hospital readmission risk factors

Data Science / Machine Learning / Healthcare Analytics

Predicting Hospital Readmissions

2026 • E178 / ME278DS Final Project

A data science project focused on predicting whether diabetic patients would be readmitted to the hospital within 30 days. The project compared linear regression, logistic regression, random forest, and a neural network to understand whether more complex models could outperform simpler interpretable methods.

Python • Scikit-learn • PyTorch • Random Forest • Logistic Regression


Status: Final Project

Dataset: 25,000 Encounters

Best Signal: Prior Utilization

Role: Model Comparison

Project Media

These visuals summarize the main story of the project: how the models were structured, which features drove risk, and why the more complex neural network did not meaningfully outperform logistic regression.

Model Overview

Comparison table showing the four models, preprocessing steps, hyperparameters, validation strategy, and purpose of each method.

Logistic Risk Factors

Coefficient plot showing which features increased or decreased predicted 30-day readmission risk.

Model Comparison

Comparison between logistic regression and the neural network across ROC-AUC, PR-AUC, and F1 score.

Project Motivation

Hospital readmissions are a major healthcare problem because they can create additional stress for patients, increase hospital costs, and indicate gaps in discharge planning or follow-up care. For diabetic patients, readmission risk is especially important because diabetes is a chronic condition that often requires consistent treatment and monitoring.

The project asked whether structured hospital encounter data could be used to predict 30-day readmission risk. The larger goal was to understand whether hospitals could identify high-risk patients earlier and prioritize follow-up care, patient education, medication management, or discharge planning before complications occur.

Model overview table for hospital readmission prediction

Data Science Workflow

The project followed a complete machine learning workflow, from problem framing and preprocessing through model training, evaluation, and visualization to the final website presentation.

Step 1

Problem Setup

The project framed hospital readmission prediction as a binary classification task: predict whether a diabetic patient would be readmitted within 30 days.

Step 2

Dataset Review

The team used structured hospital encounter data with demographics, prior utilization, encounter intensity, diagnosis, and treatment-related features.

Step 3

Preprocessing

Sparse or secondary diagnosis features were removed, categorical variables were one-hot encoded, numerical variables were standardized, and the target was encoded into binary form.

Step 4

Model Training

Four models were trained and compared: linear regression, logistic regression, random forest, and a multilayer perceptron (MLP) neural network.

Step 5

Result Analysis

The team compared performance metrics, coefficient plots, feature importances, confusion matrices, and feature-group ablation results.
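The preprocessing in Step 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny data frame and its column names (age_group, time_in_hospital, n_inpatient, readmitted) are made up for the example, and the choice of scikit-learn's ColumnTransformer is an assumption about how the steps might be wired together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame; column names are illustrative, not the project's schema.
df = pd.DataFrame({
    "age_group": ["[50-60)", "[70-80)", "[50-60)", "[60-70)"],
    "time_in_hospital": [3, 8, 5, 2],
    "n_inpatient": [0, 2, 1, 0],
    "readmitted": ["no", "yes", "yes", "no"],
})

numerical = ["time_in_hospital", "n_inpatient"]
categorical = ["age_group"]

# Standardize numeric columns and one-hot encode categorical ones.
# sparse_threshold=0.0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numerical),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,
)

X = preprocess.fit_transform(df[numerical + categorical])

# Encode the target in binary form: 1 = readmitted, 0 = not readmitted.
y = (df["readmitted"] == "yes").astype(int).to_numpy()

print(X.shape, y)
```

In a real workflow the transformer would be fit on the training split only and then applied to the validation and test splits, so that scaling statistics do not leak across splits.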

Dataset

The dataset came from Kaggle and was originally based on the UCI diabetes readmission dataset. Each row represented a single hospital stay for a diabetic patient.

Patient Encounters

Approximately 25,000 patient encounters were used for the final project.

Input Features

The dataset included 16 structured features such as age, length of stay, medications, procedures, lab tests, prior inpatient visits, prior outpatient visits, and emergency room visits.

Prediction Target

The output variable was whether the patient was readmitted to the hospital within 30 days.

Model Comparison

We compared four models of increasing complexity to test whether more advanced methods could extract stronger predictive signal from the dataset.

Linear Regression

Used as a simple baseline model to understand which features had positive or negative relationships with readmission risk.

Showed that prior inpatient and outpatient visits were major positive contributors to predicted readmission risk.

Logistic Regression

Used as the main interpretable binary classification model because it predicts readmission probability and provides readable coefficients.

Achieved ROC-AUC 0.642 and showed prior inpatient visits, emergency visits, outpatient visits, and diabetes diagnosis as major risk drivers.

Random Forest

Used to test whether nonlinear relationships and feature interactions could improve predictive performance.

Reached 62.11% test accuracy and again identified prior inpatient visits as the most important feature.

Neural Network

Used to test whether a higher-capacity model could extract additional predictive signal from the dataset.

Reached ROC-AUC 0.641, nearly matching logistic regression (0.642) and suggesting that the dataset imposes a performance ceiling.
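To illustrate why logistic regression is described here as interpretable, the sketch below fits it on synthetic data and reads the coefficients as odds ratios. Everything in it is an assumption made for the example: the two features are randomly generated stand-ins (one informative, one pure noise), and the numbers it produces have no relation to the project's reported results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Synthetic stand-ins for two standardized features (not project data).
prior_inpatient = rng.normal(size=n)   # informative feature
noise_feature = rng.normal(size=n)     # carries no signal by construction

# Generate labels from a known logistic model so the fit can be sanity-checked.
logit = -0.2 + 0.8 * prior_inpatient
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([prior_inpatient, noise_feature])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# ROC-AUC is computed from predicted probabilities, not hard labels.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that (standardized) feature.
odds_ratios = np.exp(model.coef_[0])

print(f"ROC-AUC: {auc:.3f}")
print(dict(zip(["prior_inpatient", "noise_feature"], odds_ratios.round(2))))
```

An odds ratio above 1 means the feature raises predicted risk, below 1 means it lowers it, and a value near 1 means the feature contributes little. That is the kind of reading a coefficient plot supports.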

Result Visualizations

Across the different models, the visualizations pointed toward the same major pattern: prior healthcare utilization was the strongest signal for predicting readmission.

Linear Regression Baseline

Baseline coefficient plot showing top features affecting 30-day readmission risk.

Random Forest Importance

Feature importance chart showing prior inpatient visits as the strongest random forest predictor.

Neural Network Results

MLP training curves, validation ROC-AUC, ROC curve, precision-recall curve, and confusion matrix.

Final Results

The final models showed only moderate predictive ability, with ROC-AUC near 0.64 for both logistic regression and the neural network. The main takeaway was that model complexity mattered less than the signal available in the dataset.

Result

Logistic Regression

ROC-AUC: 0.642

Most interpretable classification model; showed prior inpatient visits as the strongest positive risk driver.

Result

Random Forest

Accuracy: 62.11%

Able to model nonlinear relationships and feature interactions, but improved only slightly over the simpler models.

Result

Neural Network

ROC-AUC: 0.641

Had higher model capacity, but did not meaningfully outperform logistic regression.

The strongest finding was that prior inpatient visits were the dominant predictor across the models. When prior utilization features were removed, performance dropped close to random guessing, while removing other feature groups had a much smaller effect.
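A feature-group ablation of this kind can be sketched as below. This is a hedged illustration on synthetic data, not the project's experiment: the group names, the column indices, the use of logistic regression as the ablated model, and the helper auc_without are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 6000
X = rng.normal(size=(n, 4))

# Columns 0-1 play the role of "prior utilization"; columns 2-3 are "other".
# The labels depend almost entirely on the utilization columns by construction.
logit = 0.7 * X[:, 0] + 0.4 * X[:, 1] + 0.05 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

groups = {"utilization": [0, 1], "other": [2, 3]}  # hypothetical grouping

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

def auc_without(drop):
    """Refit on all columns except `drop` and return the test ROC-AUC."""
    keep = [c for c in range(X.shape[1]) if c not in drop]
    model = LogisticRegression().fit(X_tr[:, keep], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, keep])[:, 1])

baseline = auc_without([])
ablation = {name: auc_without(cols) for name, cols in groups.items()}

print(f"all features: {baseline:.3f}")
for name, auc in ablation.items():
    print(f"without {name}: {auc:.3f}")
```

The model is retrained for each ablation rather than having columns zeroed out at prediction time, so each score reflects what the remaining features can achieve on their own.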

Technical Highlights

Built a machine learning comparison for predicting whether diabetic patients would be readmitted to the hospital within 30 days.

Used a Kaggle hospital readmissions dataset based on the UCI diabetes readmission dataset with approximately 25,000 patient encounters and 16 structured features.

Compared four model types: linear regression, logistic regression, random forest, and a neural network multilayer perceptron.

Applied consistent preprocessing across models, including one-hot encoding, feature scaling, binary target encoding, and feature removal for sparse or secondary diagnosis columns.

Used a stratified 70/15/15 train, validation, and test split so model results could be compared fairly.

Found that prior healthcare utilization, especially prior inpatient visits, carried most of the predictive signal across the dataset.
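One common way to produce a stratified 70/15/15 split is two successive calls to scikit-learn's train_test_split, sketched below. The data here is randomly generated and the random_state value is arbitrary; whether the project built its split this way is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))              # placeholder feature matrix
y = (rng.random(1000) < 0.3).astype(int)    # placeholder binary target

# First split: 70% train, 30% held out, preserving the class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Second split: divide the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positive rate={labels.mean():.3f}")
```

Stratifying both calls keeps the positive rate nearly identical across the three sets, which is what makes metrics on the validation and test sets comparable.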

Tools / Software / Methods

Python • Jupyter Notebook • Pandas • NumPy • Matplotlib • Scikit-learn • PyTorch • Google Sites • Data Cleaning • One-Hot Encoding • StandardScaler • GridSearchCV • Cross-Validation • Logistic Regression • Random Forest • Neural Network • ROC-AUC • PR-AUC • Confusion Matrix • Feature Ablation • Healthcare Analytics

This project combined machine learning, statistical evaluation, visualization, and website-based communication into one final data science deliverable.

Limitations

The final results were useful, but they also showed that the dataset had limits. More complex models could not overcome missing clinical context.

The dataset only used structured hospital encounter data, so it did not include richer clinical information like physician notes, discharge summaries, or full patient history.

Important post-discharge factors such as medication adherence, follow-up access, social support, and home care were not included.

The models converged to a similar ROC-AUC range, suggesting the dataset had limited predictive signal beyond prior healthcare utilization.

The neural network had more capacity than the simpler models, but the available data did not provide enough extra signal for it to clearly outperform them.

Future work would need richer data sources, such as clinical text, vitals trajectories, or multimodal hospital records, to improve prediction quality.

Mission Log

This project was the final data science project for E178 / ME278DS, and our group built a full website around predicting 30-day hospital readmissions. The project used diabetic patient hospital encounter data and asked whether we could predict if a patient would return to the hospital within 30 days.

The project was interesting because it was not only about getting a model to run. We had to clean the data, decide which features mattered, compare different models, interpret the results, and turn everything into a website that someone else could follow. That made it feel more like a complete data story than just a coding assignment.

My main role covered the model overview, the logistic regression interpretation, and the comparison of model limitations. Logistic regression was especially important because it gave readable coefficients, making it easier to explain which features increased or decreased readmission risk. The clearest result was that prior inpatient visits had the strongest positive effect on predicted readmission risk.

One thing I learned from this project was that the most complex model is not always the best answer. The neural network had more capacity, but it did not meaningfully outperform logistic regression. That showed us that the issue was not just the algorithm. The dataset itself had a performance ceiling because it was missing richer clinical and post-discharge information.

Overall, the project helped me understand that data science is about more than code. It is about asking the right question, building a fair comparison, interpreting the output honestly, and communicating the result clearly. The website helped bring all of that together into a final project that explained the data, methods, results, and limitations in one place.

Final Result

The final result was a completed machine learning project with a website, presentation, visualizations, and model comparison. The models predicted readmission with moderate performance, and they all converged to a similar performance range.

The most important conclusion was that prior healthcare utilization, especially prior inpatient visits, dominated the prediction task. More advanced models like random forest and the neural network did not dramatically outperform simpler models, which suggested that future improvement would require richer data rather than only more complex algorithms.

What I Learned

This project helped me understand how a machine learning workflow connects from preprocessing to model training to final communication. I learned that results need to be evaluated with multiple metrics because accuracy alone does not explain how useful a healthcare prediction model is.
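As a small illustration of why several metrics are reported together, the sketch below scores one set of hand-written predictions with accuracy, F1, ROC-AUC, and PR-AUC. The labels and probabilities are invented for the example and are not project outputs. PR-AUC is computed here with scikit-learn's average_precision_score, which is one common way to summarize the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    roc_auc_score,
)

# Invented labels and predicted probabilities (3 positives out of 10).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.35, 0.45, 0.7])

# Accuracy and F1 need hard labels, so a 0.5 threshold is applied first.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC-AUC and PR-AUC are threshold-free and use the probabilities directly.
roc_auc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)

print(f"accuracy={accuracy:.3f}  F1={f1:.3f}  ROC-AUC={roc_auc:.3f}  PR-AUC={pr_auc:.3f}")
```

On this toy data the model looks acceptable by accuracy (0.7) even though it catches only one of the three positives at the 0.5 threshold, which the F1 score (0.4) makes visible.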

I also learned that interpretability matters. In a healthcare problem, it is important to explain why a model flags a patient as high risk. Logistic regression was useful because the coefficients gave a clearer explanation than the neural network, even though both models performed similarly.

MLP versus logistic regression test set comparison